
Function approximation

Function approximation is the process of estimating an unknown target function by constructing a simpler function that matches it closely under specified criteria, such as minimizing the error in a chosen norm or satisfying certain constraints derived from the original function. The approximation is typically built from a predefined class of functions by adjusting parameters based on data points, observations, or functional evaluations. The goal is to enable efficient evaluation, analysis, or prediction when the exact form of the target function is unavailable or intractable.

Classical foundations of function approximation rest on theorems establishing the density of certain function classes in spaces of continuous functions. The Weierstrass approximation theorem, proved in 1885, states that any continuous real-valued function on a closed and bounded interval can be uniformly approximated to arbitrary accuracy by a polynomial function. This result underpins methods like Taylor series expansions, which approximate smooth functions locally by matching derivatives at a point, and Fourier series, which represent periodic functions as sums of sines and cosines. More generally, the Stone–Weierstrass theorem extends this idea to compact Hausdorff spaces, showing that subalgebras of continuous functions that separate points can approximate any continuous function uniformly.

In modern contexts, particularly machine learning and artificial intelligence, neural networks have emerged as powerful tools for function approximation due to the universal approximation theorem. This theorem, established by Cybenko in 1989, proves that a feedforward neural network with a single hidden layer containing a finite number of neurons can uniformly approximate any continuous function on a compact subset of \mathbb{R}^n to any desired degree of accuracy, provided the activation function is sigmoidal (e.g., logistic). Independent work by Hornik et al. in the same year generalized this to multilayer networks with arbitrary squashing functions. Other contemporary methods include spline approximations for piecewise smooth functions, kernel-based methods like Gaussian processes for probabilistic estimates, and radial basis functions for scattered data interpolation.

Function approximation plays a central role across disciplines, from solving partial differential equations in scientific computing via finite element methods to reinforcement learning, where value functions are approximated to handle large state spaces. In signal processing, it enables compression and denoising through transforms such as wavelets, while in optimization, surrogate models accelerate expensive simulations. Challenges include ensuring convergence, managing the curse of dimensionality in high dimensions, and balancing accuracy with computational cost, often addressed through regularization techniques or adaptive bases.

Fundamentals

Definition and Scope

Function approximation is a fundamental concept in approximation theory, which involves constructing a simpler function g from a specified class of functions to closely match a target function f on a given domain D, typically by minimizing the error \|f - g\| with respect to a chosen norm, such as the supremum norm \|h\|_\infty = \sup_{x \in D} |h(x)| or the L^2 norm \|h\|_2 = \left( \int_D |h(x)|^2 \, dx \right)^{1/2}. This process aims to represent complex or intractable functions using more computationally tractable alternatives, like polynomials or rational functions, while quantifying the deviation through these measures.

The scope of function approximation distinguishes between exact representation, which is generally impossible for most target functions due to their infinite-dimensional nature or lack of closed-form expressions, and practical approximations that achieve controlled error on finite domains or discrete point sets. It encompasses both continuous settings, where f and g are defined over intervals like [a, b], and discrete cases, such as approximating values at finite point sets X_m = \{x_1, \dots, x_m\}. Central to this framework are the hypothesis class of approximators—such as the space of polynomials of degree at most n, denoted P_n—and properties of the target function, including smoothness (e.g., continuity or higher-order differentiability) and periodicity, which determine the achievable approximation quality and convergence rates.

For instance, approximating a discontinuous function, like the Heaviside function H(x) = 0 for x < 0 and 1 for x \geq 0, using continuous functions from a hypothesis class such as polynomials highlights the challenges of capturing sharp transitions, often resulting in Gibbs-like oscillations near discontinuities despite minimizing error in the L^2 norm.
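A minimal numerical sketch of these error measures (the degree-5 least-squares fit, the grid resolution, and the use of NumPy are illustrative assumptions, not details from the text) compares discrete analogues of the supremum and L^2 norms for a polynomial approximation of the Heaviside function:

```python
# Sketch: discrete analogues of the sup-norm and L^2-norm errors for a degree-5
# least-squares polynomial approximation of the Heaviside function on [-1, 1].
import numpy as np

x = np.linspace(-1.0, 1.0, 2001)
f = np.where(x < 0.0, 0.0, 1.0)            # Heaviside target H(x)
coeffs = np.polyfit(x, f, deg=5)           # discrete least-squares fit in P_5
g = np.polyval(coeffs, x)

sup_err = np.max(np.abs(f - g))                   # ~ ||f - g||_inf
l2_err = np.sqrt(np.trapz((f - g) ** 2, x))       # ~ ||f - g||_2
print(f"sup-norm error ~ {sup_err:.3f}, L2-norm error ~ {l2_err:.3f}")
```

The sup-norm error stays bounded away from zero because a continuous polynomial cannot track the jump, while the L^2 error is much smaller, illustrating how the choice of norm shapes what counts as a good approximation.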

Motivations and Applications

Function approximation has roots in the early 19th century, driven by the needs of astronomers and engineers for tractable computational methods to model complex celestial mechanics and mechanical systems, where exact analytical solutions were often infeasible. Pioneering work in interpolation and series expansions, such as those developed for predicting planetary positions and designing machinery, laid the groundwork for approximating irregular functions with simpler, evaluable forms.

Key motivations for function approximation include achieving computational efficiency by replacing intricate functions with simpler ones that are faster to evaluate, particularly in iterative numerical algorithms. It also enables noise reduction in empirical data, where approximations like least-squares fits average out measurement errors to reveal underlying trends. Additionally, it addresses non-analytic functions—those lacking closed-form expressions—by constructing surrogate models that capture essential behaviors without requiring exact derivations.

Applications span diverse fields, including signal processing for filter design, where rational function approximations synthesize frequency responses to attenuate unwanted noise while preserving signals. In numerical analysis, approximations facilitate solving partial differential equations (PDEs) by discretizing continuous problems into manageable linear combinations of basis functions. Machine learning employs function approximation for model fitting and generalization from data, for example to estimate value functions in reinforcement learning tasks. In physics, it approximates potential energy surfaces, such as in quantum mechanics, to simplify Schrödinger equation solutions. Microbiology utilizes it for growth curve modeling, fitting sigmoidal functions such as the logistic or Gompertz models to bacterial population data to predict lag, exponential, and stationary phases.

A representative example is fitting a Gaussian function to noisy spectral data for curve smoothing, which minimizes squared errors to estimate peak locations and widths, effectively denoising while preserving shape—common in spectroscopy for identifying molecular transitions.
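As a hedged sketch of this Gaussian-fitting example (the peak parameters, noise level, and use of SciPy's curve_fit are assumptions made for the illustration), nonlinear least squares recovers the peak location and width from noisy synthetic data:

```python
# Sketch: fit a Gaussian to noisy synthetic "spectral" data by nonlinear least squares.
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, amp, mu, sigma):
    return amp * np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
y = gaussian(x, 2.0, 5.0, 0.3) + 0.05 * rng.standard_normal(x.size)  # noisy peak

popt, pcov = curve_fit(gaussian, x, y, p0=[1.0, 4.0, 1.0])  # p0: rough initial guess
amp, mu, sigma = popt
print(f"estimated amplitude {amp:.2f}, center {mu:.3f}, width {sigma:.3f}")
```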

Classical Approximation Techniques

Taylor Series Expansion

The Taylor series expansion provides a method for approximating a smooth function f(x) locally around a point a using a power series derived from the function's derivatives at that point. This technique, first introduced by the English mathematician Brook Taylor in his 1715 treatise Methodus incrementorum directa et inversa, expresses f(x) as an infinite sum of terms involving powers of (x - a). The finite partial sum up to degree n, known as the Taylor polynomial P_n(x), is given by P_n(x) = \sum_{k=0}^n \frac{f^{(k)}(a)}{k!} (x - a)^k, where f^{(k)}(a) denotes the k-th derivative of f evaluated at a, with f^{(0)}(a) = f(a). This polynomial serves as an approximation to f(x) near a, capturing the function's behavior through its local derivatives.

Taylor's theorem formalizes the accuracy of this approximation by stating that for a function f that is n+1 times differentiable on an interval containing a and x, f(x) = P_n(x) + R_n(x), where R_n(x) is the remainder term. One common form of the remainder, the Lagrange form, is R_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!} (x - a)^{n+1} for some \xi between a and x. This form can be derived by applying the mean value theorem (Rolle's theorem) repeatedly to an auxiliary function built from the error f(x) - P_n(x), whose value and first n derivatives vanish at a, leading to an expression involving the (n+1)-th derivative at an intermediate point \xi. An alternative integral form of the remainder, R_n(x) = \frac{1}{n!} \int_a^x (x - t)^n f^{(n+1)}(t) \, dt, can be derived via integration by parts applied n times to the fundamental theorem of calculus, starting from f(x) = f(a) + \int_a^x f'(t) \, dt and iteratively incorporating higher derivatives. These remainder estimates quantify the approximation error, which decreases as n increases for sufficiently smooth functions.

For analytic functions—those that are infinitely differentiable and equal to their Taylor series within a neighborhood—the infinite Taylor series converges to f(x) pointwise in the disk of convergence in the complex plane, or the corresponding interval in the real case. Even for non-analytic but smooth functions, the Taylor polynomials provide increasingly accurate local approximations near a, with the error bounded by the remainder term.

A classic example is the approximation of \sin x around a = 0 (a Maclaurin series) using the fifth-degree Taylor polynomial: P_5(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} = x - \frac{x^3}{6} + \frac{x^5}{120}. The derivatives cycle as \sin^{(k)}(0) = 0 for even k and \sin^{(k)}(0) = (-1)^{(k-1)/2} for odd k, yielding only odd-powered terms; this polynomial approximates \sin x well for small |x|, such as P_5(0.1) \approx 0.0998334166, which matches the exact value \sin(0.1) \approx 0.0998334166 to high precision.

Despite its strengths, the Taylor series has limitations for global approximation. For non-entire functions, such as f(x) = \frac{1}{1+x^2}, the series centered at a=0 has a finite radius of convergence R=1, beyond which it diverges, failing to represent the function globally due to singularities in the complex plane at x = \pm i. In contrast to global methods like polynomial interpolation, which fit data points without requiring derivative information, the Taylor approach excels in local fidelity but may exhibit poor behavior far from the expansion point.
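The \sin x example can be checked directly; the short sketch below evaluates P_5(0.1) and its error, using the bound |R(x)| \le |x|^7/7!, which follows because the x^6 term of the sine series vanishes:

```python
# Sketch: degree-5 Maclaurin polynomial of sin(x) and its error at x = 0.1.
import math

def taylor_sin_5(x):
    return x - x**3 / 6.0 + x**5 / 120.0

x = 0.1
approx = taylor_sin_5(x)
exact = math.sin(x)
bound = abs(x) ** 7 / math.factorial(7)   # next nonzero series term bounds the error
print(f"P_5(0.1) = {approx:.12f}")
print(f"sin(0.1) = {exact:.12f}")
print(f"|error|  = {abs(exact - approx):.2e}  (bound {bound:.2e})")
```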

Fourier Series and Transforms

Fourier series provide a method for approximating periodic functions by decomposing them into sums of sine and cosine terms, representing the function in the frequency domain. For a periodic function f(x) with period 2\pi, the Fourier series is given by f(x) = \frac{a_0}{2} + \sum_{n=1}^\infty \left( a_n \cos(nx) + b_n \sin(nx) \right), where the coefficients are computed as a_n = \frac{1}{\pi} \int_{-\pi}^\pi f(x) \cos(nx) \, dx, \quad b_n = \frac{1}{\pi} \int_{-\pi}^\pi f(x) \sin(nx) \, dx for n \geq 0, with a_0 = \frac{1}{\pi} \int_{-\pi}^\pi f(x) \, dx. This representation arises from the orthogonality of the trigonometric basis functions over the interval [-\pi, \pi], where \int_{-\pi}^\pi \cos(mx) \cos(nx) \, dx = \begin{cases} 2\pi & m = n = 0 \\ \pi & m = n \geq 1 \\ 0 & m \neq n \end{cases}, \quad \int_{-\pi}^\pi \sin(mx) \sin(nx) \, dx = \pi \delta_{mn} \ (m,n \geq 1), \quad \int_{-\pi}^\pi \cos(mx) \sin(nx) \, dx = 0 for integers m, n \geq 0, with \delta_{mn} the Kronecker delta (equal to 1 if m=n, 0 otherwise). The orthogonality ensures that the coefficients minimize the least-squares error between the function and its approximation, projecting f(x) onto the subspace spanned by these basis functions.

Under the Dirichlet conditions—namely, that f(x) is periodic with period 2\pi, piecewise continuous on [-\pi, \pi], and has a finite number of maxima, minima, and discontinuities in one period—the Fourier series converges to f(x) at points of continuity and to the average of the left and right limits at jump discontinuities. These conditions, established by Dirichlet in 1829, guarantee the series' utility for approximating sufficiently well-behaved periodic functions, such as those encountered in physical systems with repetitive behavior.

For non-periodic functions defined on the entire real line, the Fourier series framework extends to the Fourier transform, which replaces the discrete sum with an integral over continuous frequencies. The Fourier transform of a function f(x) is F(\omega) = \int_{-\infty}^\infty f(x) e^{-i \omega x} \, dx, with the inverse transform f(x) = \frac{1}{2\pi} \int_{-\infty}^\infty F(\omega) e^{i \omega x} \, d\omega, allowing reconstruction of the original function from its frequency components. This formulation arises as the limit of Fourier series for functions with increasingly large periods, providing a global approximation in the frequency domain suitable for aperiodic signals.

A classic example of Fourier series approximation is the square wave function, defined as f(x) = -1 for -\pi < x < 0 and f(x) = 1 for 0 < x < \pi, extended periodically. Its Fourier series is the odd sine series f(x) \sim \frac{4}{\pi} \sum_{k=1,3,5,\dots}^\infty \frac{\sin(kx)}{k}, where partial sums approximate the discontinuities but exhibit overshoot near the jumps, known as the Gibbs phenomenon. The overshoot reaches approximately 9% of the jump height (about 0.18 for a jump of 2) and persists regardless of the number of terms, highlighting a limitation in uniform convergence at discontinuities.

In applications, Fourier series and transforms enable signal decomposition in audio processing, where periodic waveforms like musical tones are analyzed into harmonic components for synthesis, filtering, and compression. For instance, the vocal tract can be modeled as a filter acting on the frequency spectrum generated by vocal cords, allowing reconstruction of speech sounds from their Fourier representations. This frequency-domain approach facilitates tasks such as noise removal and harmonic analysis in sound engineering.
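A small numerical sketch of the square-wave example (the evaluation grid and term counts are arbitrary choices) shows the persistent Gibbs overshoot of roughly 0.18 near the jump, no matter how many terms are summed:

```python
# Sketch: partial sums of the square-wave sine series and the Gibbs overshoot.
import numpy as np

def square_wave_partial_sum(x, n_terms):
    # Sum (4/pi) * sin(kx)/k over the first n_terms odd harmonics k = 1, 3, 5, ...
    k = np.arange(1, 2 * n_terms, 2)
    return (4.0 / np.pi) * np.sum(np.sin(np.outer(x, k)) / k, axis=1)

x = np.linspace(1e-4, 0.5, 5000)          # just to the right of the jump, where f(x) = 1
for n in (10, 100, 1000):
    s = square_wave_partial_sum(x, n)
    print(f"{n:5d} odd terms: peak of partial sum = {s.max():.4f} (overshoot {s.max() - 1:.4f})")
```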

Interpolation Methods

Polynomial Interpolation

Polynomial interpolation is a method for constructing a polynomial of degree at most n that exactly matches the values of a function f at n+1 distinct points x_0, x_1, \dots, x_n, denoted as (x_i, y_i) where y_i = f(x_i). This approach provides an exact fit at the interpolation nodes but approximates f elsewhere, serving as a foundational technique in numerical analysis for data reconstruction and function evaluation. Unlike Taylor expansions, which rely on derivatives at a single point, polynomial interpolation uses only function values and is particularly useful when derivative information is unavailable.

The Lagrange form expresses the interpolating polynomial P(x) as P(x) = \sum_{i=0}^n y_i \ell_i(x), where the Lagrange basis polynomials are \ell_i(x) = \prod_{\substack{j=0 \\ j \neq i}}^n \frac{x - x_j}{x_i - x_j}. Each \ell_i(x) is a polynomial of degree n that satisfies \ell_i(x_k) = \delta_{ik} (the Kronecker delta), ensuring P(x_i) = y_i. This explicit formula, attributed to Joseph-Louis Lagrange, allows direct computation but can be numerically unstable for large n due to potential ill-conditioning.

For a function f that is (n+1)-times continuously differentiable, the interpolation error is given by f(x) - P(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!} \prod_{k=0}^n (x - x_k), where \xi lies between the minimum and maximum of x, x_0, \dots, x_n. This error term, analogous to the remainder in Taylor's theorem, highlights that the approximation improves with smaller higher derivatives and closer node spacing, but diverges if the product grows rapidly.

An alternative representation, Newton's divided difference interpolation, expresses P(x) in a nested form: P(x) = f[x_0] + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1) + \cdots + f[x_0, \dots, x_n] \prod_{k=0}^{n-1} (x - x_k), where the divided differences are computed recursively: f[x_i] = y_i and f[x_i, \dots, x_{i+k}] = \frac{f[x_{i+1}, \dots, x_{i+k}] - f[x_i, \dots, x_{i+k-1}]}{x_{i+k} - x_i}. This form is computationally efficient, especially for unequally spaced points, as it avoids full matrix inversion and permits incremental updates when adding new nodes without recomputing prior terms.

A significant limitation arises with high-degree polynomials and equispaced nodes, known as Runge's phenomenon, where the interpolant exhibits large oscillations near the interval endpoints, leading to poor approximation even for smooth functions. This occurs because the Lebesgue constant for equispaced points grows exponentially with degree n, amplifying errors for functions such as the Runge function f(x) = 1/(1 + 25x^2) on [-1, 1], as first demonstrated by Carl Runge in 1901. To mitigate this, Chebyshev nodes (clustered near the endpoints) are often preferred over equispaced ones.

For illustration, consider interpolating f(x) = e^x at five equispaced points x_i = -1 + 0.5i for i = 0, 1, 2, 3, 4 (i.e., -1, -0.5, 0, 0.5, 1), yielding y_0 \approx 0.3679, y_1 \approx 0.6065, y_2 = 1, y_3 \approx 1.6487, y_4 \approx 2.7183. The degree-4 interpolating polynomial P(x) matches these values exactly and provides a close approximation to e^x across the interval due to the function's analyticity and moderate derivative growth, with maximum error on [-1, 1] on the order of 10^{-3}. This example demonstrates the method's effectiveness for well-behaved functions without pronounced endpoint oscillations.
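A sketch of the e^x example using Newton's divided differences (the dense grid used to measure the error is an arbitrary choice) constructs the degree-4 interpolant and reports its maximum deviation on [-1, 1]:

```python
# Sketch: Newton divided-difference interpolation of e^x at five equispaced nodes.
import numpy as np

def divided_differences(x_nodes, y_nodes):
    # Returns f[x_0], f[x_0,x_1], ..., f[x_0,...,x_n], computed in place.
    coef = np.array(y_nodes, dtype=float)
    for k in range(1, len(coef)):
        coef[k:] = (coef[k:] - coef[k - 1:-1]) / (x_nodes[k:] - x_nodes[:-k])
    return coef

def newton_eval(coef, x_nodes, x):
    # Nested (Horner-like) evaluation of the Newton form.
    result = np.full_like(x, coef[-1], dtype=float)
    for k in range(len(coef) - 2, -1, -1):
        result = result * (x - x_nodes[k]) + coef[k]
    return result

nodes = np.linspace(-1.0, 1.0, 5)                 # -1, -0.5, 0, 0.5, 1
coef = divided_differences(nodes, np.exp(nodes))

xs = np.linspace(-1.0, 1.0, 1001)
max_err = np.max(np.abs(np.exp(xs) - newton_eval(coef, nodes, xs)))
print(f"max |e^x - P_4(x)| on [-1, 1] ~ {max_err:.2e}")
```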

Spline Interpolation

Spline interpolation constructs an approximating function as a piecewise polynomial, where the pieces are joined at specified points called knots to ensure a desired degree of smoothness. A spline of degree k is a function composed of polynomials of degree at most k on each subinterval defined by the knots, with continuity up to the (k-1)-th derivative across the knots. Cubic splines, with k=3, are particularly common and provide C^2 continuity, meaning the function, its first derivative, and second derivative are continuous at the knots.

The construction of a cubic spline interpolant for data points (x_i, y_i) for i = 0, \dots, n with ordered knots x_0 < x_1 < \dots < x_n involves defining a piecewise cubic polynomial S(x) on each [x_i, x_{i+1}] such that S(x_i) = y_i and the smoothness conditions hold. This leads to solving for the second derivatives M_i = S''(x_i) at the knots, which satisfy a tridiagonal linear system derived from the continuity of the first derivative: for i=1,\dots,n-1, h_{i-1} M_{i-1} + 2(h_{i-1} + h_i) M_i + h_i M_{i+1} = 6 \left( \frac{y_{i+1} - y_i}{h_i} - \frac{y_i - y_{i-1}}{h_{i-1}} \right), where h_i = x_{i+1} - x_i. Boundary conditions close the system; natural splines set M_0 = M_n = 0, while clamped splines specify the first derivatives at the endpoints. The coefficients are solved efficiently using the tridiagonal matrix algorithm, after which the piecewise cubics are formed.

The B-spline basis offers an alternative representation, where the spline is a linear combination of B-spline basis functions with coefficients determined by solving a similar system. B-splines, or basis splines, form a partition of unity and provide a stable basis for representing splines. The i-th B-spline basis function of degree p on a knot sequence \mathbf{t} = (t_i) is defined recursively by the Cox-de Boor formula: N_{i,0}(u) = \begin{cases} 1 & t_i \leq u < t_{i+1}, \\ 0 & \text{otherwise}, \end{cases} and for p \geq 1, N_{i,p}(u) = \frac{u - t_i}{t_{i+p} - t_i} N_{i,p-1}(u) + \frac{t_{i+p+1} - u}{t_{i+p+1} - t_{i+1}} N_{i+1,p-1}(u). These functions have local support, nonzero only on [t_i, t_{i+p+1}), which ensures that modifying a coefficient affects only a local portion of the spline.

Spline interpolation offers key advantages over global high-degree polynomial interpolation, particularly in avoiding Runge's phenomenon, where high-degree polynomials exhibit large oscillations near the endpoints of equispaced interpolation intervals. By restricting each piece to low degree (typically cubic), splines maintain smoothness globally while improving approximation quality and numerical stability, as the condition number of the associated systems grows more slowly with the number of knots. This makes splines preferable for large datasets or functions with varying behavior.

For example, consider interpolating the noisy sine function y_i = \sin(x_i) + 0.05 \epsilon_i at 10 equispaced points x_i = i \pi / 9 for i=0,\dots,9 over [0, \pi]. A natural cubic spline yields a smooth curve that closely follows the underlying sine wave, avoiding the endpoint oscillations that would appear in a degree-9 polynomial interpolant. Because the spline passes through the noisy samples exactly, its maximum deviation from the true \sin(x) is typically comparable to the noise amplitude (a few hundredths), demonstrating the method's effectiveness for oscillatory data.
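The noisy-sine example can be reproduced with a standard routine; the sketch below (the random seed and the use of SciPy's CubicSpline are assumptions of the illustration) builds the natural cubic spline and measures its deviation from the true sine:

```python
# Sketch: natural cubic spline through 10 noisy samples of sin(x) on [0, pi].
import numpy as np
from scipy.interpolate import CubicSpline

rng = np.random.default_rng(42)
x_nodes = np.arange(10) * np.pi / 9.0
y_nodes = np.sin(x_nodes) + 0.05 * rng.standard_normal(10)   # noisy samples

spline = CubicSpline(x_nodes, y_nodes, bc_type='natural')    # natural spline: M_0 = M_n = 0

xs = np.linspace(0.0, np.pi, 1001)
max_dev = np.max(np.abs(spline(xs) - np.sin(xs)))
print(f"max deviation of the spline from sin(x): {max_dev:.3f}")
```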

Least Squares Approximation

Orthogonal Polynomials

Orthogonal polynomials form a class of polynomials that are mutually orthogonal with respect to a specified inner product on a given interval, making them particularly useful for least squares approximations in function approximation theory. These polynomials satisfy the condition that the inner product \langle p_m, p_n \rangle = \int_a^b p_m(x) p_n(x) w(x) \, dx = 0 for m \neq n, where w(x) is a positive weight function, and the polynomials have distinct degrees. Prominent families include the Legendre polynomials, defined on [-1, 1] with w(x) = 1; the Chebyshev polynomials, also on [-1, 1] with w(x) = (1 - x^2)^{-1/2}; and the Hermite polynomials, defined on (-\infty, \infty) with w(x) = e^{-x^2}.

In least squares approximation, the goal is to find the polynomial P_n(x) of degree at most n that minimizes the L^2 error \int_a^b (f(x) - P_n(x))^2 w(x) \, dx for a given function f. Using an orthogonal polynomial basis \{p_k(x)\}_{k=0}^n, the optimal approximation is the orthogonal projection P_n(x) = \sum_{k=0}^n c_k p_k(x), where the coefficients are computed as c_k = \frac{\langle f, p_k \rangle}{\|p_k\|^2} = \frac{\int_a^b f(x) p_k(x) w(x) \, dx}{\int_a^b p_k(x)^2 w(x) \, dx}. This projection ensures efficiency because the orthogonality decouples the coefficient calculations, avoiding the need to solve a full linear system as required for non-orthogonal bases.

To derive this, consider the error functional E(P_n) = \int_a^b (f(x) - \sum_{k=0}^n c_k p_k(x))^2 w(x) \, dx. Expanding and differentiating with respect to each c_j yields \frac{\partial E}{\partial c_j} = -2 \int_a^b (f(x) - P_n(x)) p_j(x) w(x) \, dx = 0, which simplifies to \langle f - P_n, p_j \rangle = 0 for j = 0, \dots, n due to orthogonality. Thus, P_n is the unique minimizer in the subspace spanned by the basis polynomials p_0, \dots, p_n, as the error is orthogonal to that subspace. This approach provides the best approximation in the weighted L^2 norm and converges to f as n \to \infty for square-integrable functions.

Beyond L^2 approximation, Chebyshev polynomials possess a minimax property that makes them optimal for uniform (L^\infty) approximation. The scaled Chebyshev polynomial T_n^*(x) = 2^{1-n} T_n(x) of degree n has the smallest maximum norm among all monic polynomials on [-1, 1], equal to 2^{1-n}. For best uniform approximation by polynomials of degree at most n, the error f(x) - P_n(x) equioscillates—attains its maximum absolute value at n+2 points with alternating signs—by the equioscillation theorem, and expansions in Chebyshev polynomials often yield near-minimax approximations due to this property.

A representative example is the Legendre polynomial approximation of f(x) = \frac{1}{1 + x^2} on [-1, 1], where w(x) = 1. The coefficients are c_k = \frac{2k + 1}{2} \int_{-1}^1 \frac{P_k(x)}{1 + x^2} \, dx, which can be evaluated analytically using the orthogonality and known integrals involving Legendre polynomials. For instance, the zeroth-order approximation is c_0 P_0(x) \approx 0.7854, and higher-degree terms improve the L^2 fit, with the partial sum up to n=4 reducing the error significantly near the endpoints while converging uniformly on compact subsets away from singularities outside [-1, 1]. This discrete analog appears in regression analysis for data fitting.
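A sketch of the Legendre example (evaluating the coefficient integrals by Gauss-Legendre quadrature rather than analytically, which is an implementation choice) computes the degree-4 projection of 1/(1+x^2):

```python
# Sketch: degree-4 Legendre (L^2) approximation of f(x) = 1/(1 + x^2) on [-1, 1].
import numpy as np
from numpy.polynomial import legendre

f = lambda x: 1.0 / (1.0 + x ** 2)
nodes, weights = legendre.leggauss(50)            # 50-point Gauss-Legendre quadrature rule

degree = 4
coeffs = np.zeros(degree + 1)
for k in range(degree + 1):
    unit = np.zeros(k + 1)
    unit[k] = 1.0                                  # coefficient vector selecting P_k
    integral = np.sum(weights * f(nodes) * legendre.legval(nodes, unit))
    coeffs[k] = (2 * k + 1) / 2.0 * integral       # c_k = (2k+1)/2 <f, P_k>

xs = np.linspace(-1.0, 1.0, 1001)
approx = legendre.legval(xs, coeffs)
print(f"c_0 = {coeffs[0]:.4f} (pi/4 = {np.pi / 4:.4f})")
print(f"max |f - P_4| on [-1, 1] = {np.max(np.abs(f(xs) - approx)):.4f}")
```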

Regression Analysis

Regression analysis serves as a cornerstone in function approximation by providing statistical methods to estimate unknown functions from discrete, often noisy data points. Unlike deterministic interpolation, regression incorporates error terms to model observational inaccuracies, focusing on minimizing the sum of squared residuals between observed and predicted values. This approach is particularly suited for scenarios where data arises from experiments or measurements subject to variability, enabling both prediction and inference about the underlying function. Seminal developments trace back to the early 19th century, with Adrien-Marie Legendre formalizing the least squares method in 1805 for orbit determination, followed by Carl Friedrich Gauss's theoretical justification in 1809 linking it to maximum likelihood under Gaussian errors.

In linear regression, the model assumes a linear relationship between predictors and the response: y = X\beta + \epsilon, where y \in \mathbb{R}^n is the vector of observations, X \in \mathbb{R}^{n \times p} is the design matrix, \beta \in \mathbb{R}^p are the unknown parameters, and \epsilon represents independent errors with zero mean and constant variance. The ordinary least squares (OLS) estimator \hat{\beta} = (X^T X)^{-1} X^T y minimizes \| y - X\beta \|^2_2, yielding the best linear unbiased estimator under the Gauss–Markov assumptions. For polynomial regression, which approximates smooth functions with low-degree polynomials, the design matrix X incorporates columns of successive powers of the input variable, such as 1, x_i, x_i^2, \dots, x_i^d for degree d; however, high-degree polynomials can lead to an ill-conditioned X^T X because the columns become nearly linearly dependent. To mitigate this, ridge regression adds a penalty term, solving \hat{\beta} = (X^T X + \lambda I)^{-1} X^T y for regularization parameter \lambda > 0, which shrinks coefficients and improves stability at the cost of slight bias.

For nonlinear function approximation, where the model takes the form y = f(x, \beta) + \epsilon with f nonlinear in \beta, direct closed-form solutions are unavailable, necessitating iterative optimization. The Gauss-Newton algorithm linearizes the residual function around current parameter estimates and solves successive linear least-squares problems, updating \beta via \Delta \beta = (J^T W J)^{-1} J^T W r, where J is the Jacobian matrix of partial derivatives of the residuals with respect to \beta, W a weight matrix, and r the residual vector; convergence is rapid (essentially quadratic for small-residual problems) near the solution. This method excels in applications like exponential decay fits, f(x, \beta) = \beta_1 e^{-\beta_2 x}, common in modeling decay processes such as chemical kinetics.

Regression analysis also emphasizes statistical inference to quantify uncertainty and prevent overfitting. Assuming normally distributed errors, confidence intervals for \beta are derived from the covariance matrix \hat{\sigma}^2 (X^T X)^{-1}, where \hat{\sigma}^2 estimates error variance, enabling hypothesis tests on parameters. Model selection criteria address overfitting by balancing goodness-of-fit and complexity: the Akaike Information Criterion (AIC) penalizes model complexity with 2k (where k is the number of parameters), while the Bayesian Information Criterion (BIC) uses \log n \cdot k for n observations, favoring parsimonious models asymptotically. For instance, fitting a quadratic y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i to noisy experimental data—such as temperature-dependent reaction rates—yields an approximating function that captures curvature while AIC or BIC guides against unnecessary higher-order terms.
This discrete, probabilistic framework complements continuous orthogonal projections in deterministic settings by incorporating noise and enabling predictive uncertainty.
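As a hedged sketch of the model-selection example (the true quadratic coefficients, the noise level, and the Gaussian-error form AIC = n ln(RSS/n) + 2k are assumptions of the illustration), ordinary least squares is fit at several polynomial degrees and compared by AIC; the quadratic usually attains the lowest value:

```python
# Sketch: polynomial OLS fits of increasing degree to noisy quadratic data, compared by AIC.
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = np.linspace(0.0, 1.0, n)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + 0.1 * rng.standard_normal(n)   # noisy quadratic

for degree in (1, 2, 5):
    X = np.vander(x, degree + 1, increasing=True)     # design matrix [1, x, ..., x^d]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # OLS estimate of the coefficients
    rss = np.sum((y - X @ beta) ** 2)
    k = degree + 1
    aic = n * np.log(rss / n) + 2 * k                 # Gaussian-error AIC (up to a constant)
    print(f"degree {degree}: RSS = {rss:.4f}, AIC = {aic:.1f}")
```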

Modern and Advanced Methods

Wavelet Approximation

Wavelet approximation leverages wavelet bases to achieve multi-resolution representations of functions, enabling efficient analysis and compression by capturing both global and local features. Central to this approach is the concept of a mother wavelet \psi(x), which generates a family of basis functions through dilations and translations: \psi_{j,k}(x) = 2^{j/2} \psi(2^j x - k), where j \in \mathbb{Z} controls the scale and k \in \mathbb{Z} the position. This construction forms an orthonormal basis for L^2(\mathbb{R}), allowing a square-integrable function f to be decomposed as f(x) = \sum_{j,k} c_{j,k} \psi_{j,k}(x), with coefficients c_{j,k} = \langle f, \psi_{j,k} \rangle.

The framework is underpinned by multiresolution analysis (MRA), which decomposes the function space into nested subspaces V_j spanned by scaling functions \phi_{j,k}(x) = 2^{j/2} \phi(2^j x - k), where finer details are captured by wavelet subspaces W_j such that L^2(\mathbb{R}) = \bigoplus_j W_j \oplus V_0. Approximation proceeds by projecting f onto a finite-dimensional subspace, often at a specific resolution level, or by nonlinear methods like thresholding the coefficients: small |c_{j,k}| below a threshold \lambda are set to zero, minimizing the L^2 error \|f - \hat{f}\|_2 while promoting sparsity. This thresholding yields near-optimal approximations in terms of mean squared error for functions in Besov spaces.

Key properties enhance wavelet approximation's effectiveness: compact support ensures locality in space, allowing precise capture of transient features, while vanishing moments—satisfying \int x^m \psi(x) \, dx = 0 for m = 0, \dots, p-1—enable exact representation of polynomials up to degree p-1, leading to sparse expansions for smooth functions. Daubechies wavelets exemplify orthogonal bases with these traits, constructed via finite filter coefficients to achieve arbitrary regularity and support width.

Compared to Fourier methods, wavelets excel in approximating non-periodic functions with discontinuities, avoiding global oscillations like the Gibbs phenomenon and providing sparser representations localized near singularities.

A simple example is the Haar wavelet, the prototypical case with \psi(x) = 1 for 0 \leq x < 1/2 and -1 for 1/2 \leq x < 1, zero elsewhere, possessing one vanishing moment and compact support on [0,1). For a piecewise constant function like the step function f(x) = 1 for 0 \leq x < 1/2 and 0 otherwise on [0,1), the Haar basis yields an exact representation with few non-zero coefficients at coarse scales, demonstrating efficient approximation of discontinuities.
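A sketch of the Haar example (sampling the step function on a dyadic grid of 64 points is an arbitrary discretization) runs the orthonormal Haar decomposition and counts the surviving coefficients; because the jump sits at a dyadic point, only a couple of coefficients are nonzero:

```python
# Sketch: full orthonormal Haar decomposition of a sampled step function.
import numpy as np

def haar_transform(signal):
    """Return all Haar detail coefficients plus the final approximation coefficient."""
    coeffs = []
    approx = np.asarray(signal, dtype=float)
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        coeffs.append((even - odd) / np.sqrt(2.0))   # detail (wavelet) coefficients
        approx = (even + odd) / np.sqrt(2.0)         # approximation (scaling) coefficients
    coeffs.append(approx)
    return coeffs

x = (np.arange(64) + 0.5) / 64.0                     # sample points in [0, 1)
f = np.where(x < 0.5, 1.0, 0.0)                      # step function from the text

coeffs = haar_transform(f)
nonzero = sum(int(np.count_nonzero(np.abs(c) > 1e-12)) for c in coeffs)
print(f"nonzero Haar coefficients: {nonzero} of {f.size}")
```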

Neural Network Approximation

Neural networks serve as powerful tools for function approximation, particularly for complex, high-dimensional, and nonlinear functions that traditional methods struggle to model efficiently. A foundational result in this domain is the universal approximation theorem, which establishes that a feedforward neural network with a single hidden layer containing a finite number of neurons, using continuous sigmoidal activation functions, can approximate any continuous function on compact subsets of \mathbb{R}^n to any desired degree of accuracy, provided the network is sufficiently wide. This theorem, proved by Cybenko in 1989, underscores the expressive capacity of neural networks as universal approximators, shifting focus from whether approximation is possible to how architecture and training influence practical performance.

The core architecture for function approximation is the multilayer perceptron (MLP), consisting of an input layer, one or more hidden layers, and an output layer, where each neuron computes a weighted sum of inputs passed through a nonlinear activation function. Common activation functions include the sigmoid, defined as \sigma(z) = \frac{1}{1 + e^{-z}}, which maps inputs to (0,1) and was central to early universal approximation results, and the rectified linear unit (ReLU), f(z) = \max(0, z), which introduces sparsity and mitigates vanishing gradients in deep networks. Trade-offs between network depth (number of layers) and width (number of neurons per layer) are critical: deeper networks can approximate certain natural functions, such as those with hierarchical compositions, more efficiently with fewer parameters than shallow wide networks, though excessive depth risks optimization challenges like vanishing gradients.

Training neural networks for function approximation involves optimizing the network parameters to minimize an empirical loss, typically the mean squared error, via backpropagation, an efficient algorithm that computes gradients by propagating errors backward through the network using the chain rule. Introduced by Rumelhart, Hinton, and Williams in 1986, backpropagation enables scalable training of multilayer networks by iteratively updating weights through gradient descent. In modern deep networks, overparameterization—employing far more parameters than training samples—paradoxically aids optimization by smoothing the loss landscape and facilitating convergence to global minima, despite theoretical risks of overfitting, often mitigated implicitly through techniques like implicit regularization in stochastic gradient descent.

Advancements in neural architectures have extended function approximation to specialized domains: convolutional neural networks (CNNs), pioneered by LeCun et al. in 1989 for image recognition, incorporate convolutional layers and pooling to exploit spatial hierarchies in grid-like data, enabling efficient approximation of image-to-function mappings such as classification or regression tasks. More recently, transformers, introduced by Vaswani et al. in 2017, rely on self-attention mechanisms to model sequential dependencies without recurrence, providing superior approximation for sequence-to-sequence functions in areas like natural language processing and time series forecasting.
A representative example of neural network approximation is fitting a multilayer perceptron to the MATLAB peaks function, a 2D nonlinear surface defined as f(x,y) = 3(1-x)^2 e^{-x^2 - (y+1)^2} - 10\left(\frac{x}{5} - x^3 - y^5\right) e^{-x^2 - y^2} - \frac{1}{3} e^{-(x+1)^2 - y^2}, which exhibits multiple local maxima and minima and has been used to demonstrate approximation in the presence of noise.
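A hedged sketch of this example (the two-hidden-layer ReLU architecture, sample count, noise level, and use of scikit-learn's MLPRegressor are illustrative assumptions, not a prescription from the source) fits a small MLP to noisy samples of the peaks surface and checks it against the true surface on a grid:

```python
# Sketch: fit a small MLP to noisy samples of the MATLAB-style peaks surface.
import numpy as np
from sklearn.neural_network import MLPRegressor

def peaks(x, y):
    return (3 * (1 - x) ** 2 * np.exp(-x ** 2 - (y + 1) ** 2)
            - 10 * (x / 5 - x ** 3 - y ** 5) * np.exp(-x ** 2 - y ** 2)
            - (1.0 / 3.0) * np.exp(-(x + 1) ** 2 - y ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(2000, 2))                        # random (x, y) samples
z = peaks(X[:, 0], X[:, 1]) + 0.05 * rng.standard_normal(2000)    # noisy targets

model = MLPRegressor(hidden_layer_sizes=(50, 50), activation='relu',
                     max_iter=3000, random_state=0)
model.fit(X, z)

# Evaluate the learned surrogate against the true surface on a held-out grid.
gx, gy = np.meshgrid(np.linspace(-3, 3, 50), np.linspace(-3, 3, 50))
grid = np.column_stack([gx.ravel(), gy.ravel()])
rmse = np.sqrt(np.mean((model.predict(grid) - peaks(gx.ravel(), gy.ravel())) ** 2))
print(f"RMSE of the MLP surrogate on the grid: {rmse:.3f}")
```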

Error Analysis and Convergence

Types of Approximation Errors

In function approximation, the discrepancy between a target function f and its approximation g is quantified using various error measures that emphasize different characteristics, such as maximum deviation, average behavior, or sensitivity to scale and computation. These measures enable rigorous assessment of approximation quality across pointwise, integral, and computational perspectives, guiding the selection of methods for specific applications.

The uniform error, or supremum norm, captures the worst-case pointwise deviation and is defined as \|f - g\|_\infty = \sup_{x \in D} |f(x) - g(x)|, where D is the domain of approximation. This norm is fundamental in classical approximation theory for ensuring bounded errors over the entire domain, particularly for continuous functions on compact sets. In contrast, L_p norms provide integral-based measures of error magnitude, given by \|f - g\|_p = \left( \int_D |f(x) - g(x)|^p \, d\mu \right)^{1/p} for 1 \leq p < \infty and measure \mu, with the L_2 norm \|f - g\|_2 being prominent due to its interpretation as the root-mean-square error, akin to signal energy in engineering contexts. The L_\infty norm arises as the limit of the L_p norms as p \to \infty, bridging pointwise and average errors.

Pointwise errors assess the approximation at specific locations, offering local insights, whereas integral errors aggregate discrepancies over the domain to reflect global performance. For discrete settings, such as sampled data, the mean squared error (MSE) serves as a discrete analog to the L_2 norm: \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^n (f(x_i) - g(x_i))^2, widely used to evaluate regression-based approximations by penalizing larger deviations quadratically. Relative error normalizes the absolute difference by the function's magnitude, defined as \frac{|f(x) - g(x)|}{|f(x)|} (or its supremum/integral variants), making it scale-invariant and suitable for comparing approximations of functions with varying amplitudes.

In frequency-domain methods like Fourier or wavelet approximations, aliasing introduces errors where high-frequency components fold into lower frequencies due to undersampling, distorting the reconstructed signal and violating the Nyquist criterion. Conditioning errors stem from numerical instability in computations, where small input perturbations or rounding amplify output discrepancies; these are quantified by the condition number \kappa, which bounds the relative error magnification as \frac{\| \Delta y \| / \| y \|}{\| \Delta x \| / \| x \|} \approx \kappa.

Visualization aids understanding of these errors; for instance, error plots for Taylor series (centered expansions) versus Chebyshev polynomial approximations to functions like e^x on [-1, 1] reveal that Taylor errors concentrate near the interval endpoints, while Chebyshev errors oscillate in nearly equal ripples, achieving smaller maximum deviations for the same degree. Such plots underscore the minimax property of Chebyshev approximations in the uniform norm. In polynomial interpolation contexts, phenomena like Runge's phenomenon exhibit boundary error spikes for equispaced points, motivating Chebyshev-clustered nodes instead.
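To make the comparison of error measures concrete, the sketch below (degree 5 and the dense evaluation grid are arbitrary choices; a Chebyshev-point interpolant stands in for a near-minimax approximation) reports uniform, L_2, MSE, and relative errors for two approximations of e^x on [-1, 1]:

```python
# Sketch: error measures for degree-5 Taylor vs. Chebyshev-point approximations of e^x.
import numpy as np
from math import factorial
from numpy.polynomial import chebyshev

xs = np.linspace(-1.0, 1.0, 2001)
f = np.exp(xs)

taylor = sum(xs ** k / factorial(k) for k in range(6))            # Taylor polynomial about 0
cheb_vals = chebyshev.chebval(xs, chebyshev.chebinterpolate(np.exp, 5))

for name, g in (("Taylor", taylor), ("Chebyshev", cheb_vals)):
    sup = np.max(np.abs(f - g))                                   # uniform (worst-case) error
    l2 = np.sqrt(np.trapz((f - g) ** 2, xs))                      # integral L_2 error
    mse = np.mean((f - g) ** 2)                                   # discrete mean squared error
    rel = np.max(np.abs(f - g) / np.abs(f))                       # relative error
    print(f"{name:9s} sup={sup:.2e}  L2={l2:.2e}  MSE={mse:.2e}  rel={rel:.2e}")
```

The Taylor errors peak near the endpoints while the Chebyshev errors are spread in near-equal ripples, matching the error-plot behaviour described above.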

Theorems on Convergence

The Stone–Weierstrass theorem provides a foundational result in function approximation, stating that if A is a subalgebra of the space of continuous real-valued functions C(K) on a compact Hausdorff space K that contains the constant functions and separates points (meaning for any distinct x, y in K there exists f in A with f(x) ≠ f(y)), then A is dense in C(K) with respect to the uniform norm. This generalizes the Weierstrass approximation theorem, which asserts that polynomials are dense in C[a, b] under the uniform norm for any closed bounded interval [a, b], ensuring that any continuous function on [a, b] can be uniformly approximated by polynomials to arbitrary accuracy: for any \epsilon > 0, there exists a polynomial g such that \|f - g\|_\infty < \epsilon.

A constructive proof of the Weierstrass theorem is given by Bernstein polynomials, defined for a continuous function f on [0, 1] as B_n(f; x) = \sum_{k=0}^n \binom{n}{k} f\left(\frac{k}{n}\right) x^k (1 - x)^{n - k}, which converge uniformly to f as n → ∞, with the rate depending on the modulus of continuity of f.

Jackson's theorem refines these results by quantifying the approximation rate: for a k-times continuously differentiable function f on [−1, 1], the best uniform approximation error by polynomials of degree at most n satisfies E_n(f) ≤ C_k ω(f^{(k)}, 1/n) / n^k, where ω denotes the modulus of smoothness and C_k is a constant depending on k; in particular, for smooth functions, this yields a rate of O(1/n^k).

For trigonometric polynomial approximation via Fourier series, convergence rates follow from the smoothness of the periodic function: if f has r continuous derivatives on the circle, the partial sum approximation error in the uniform norm is O(\log n / n^r) as the degree n → ∞. Similarly, for cubic spline interpolation on a uniform mesh of size h, the error in the uniform norm is O(h^4) provided f is four times continuously differentiable.

In modern contexts, the Kolmogorov–Arnold representation theorem establishes that any continuous multivariate function f: [0,1]^n → ℝ can be exactly represented as a finite superposition of continuous univariate functions, f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \Phi_q \left( \sum_{p=1}^n \phi_{q,p}(x_p) \right), which supports the universal approximation capabilities of neural networks by demonstrating the sufficiency of univariate compositions for multivariate representation.
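The Bernstein construction can be checked numerically; the sketch below (the test function |x - 1/2| and the evaluation grid are illustrative choices) shows the slow but steady uniform convergence consistent with the modulus-of-continuity rate for a merely Lipschitz function:

```python
# Sketch: uniform convergence of Bernstein polynomials B_n(f) to f(x) = |x - 1/2| on [0, 1].
import numpy as np
from scipy.stats import binom

def bernstein(f, n, x):
    k = np.arange(n + 1)
    # B_n(f; x) = sum_k f(k/n) * C(n, k) x^k (1 - x)^(n - k); the binomial pmf gives the weights.
    weights = binom.pmf(k[None, :], n, x[:, None])
    return weights @ f(k / n)

f = lambda t: np.abs(t - 0.5)
xs = np.linspace(0.0, 1.0, 501)
for n in (10, 40, 160, 640):
    err = np.max(np.abs(bernstein(f, n, xs) - f(xs)))
    print(f"n = {n:4d}: max |B_n(f) - f| = {err:.4f}")
```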
