
Universal approximation theorem

The universal approximation theorem is a foundational result in the theory of artificial neural networks, stating that a network with a single hidden layer and a sufficiently large number of neurons can approximate any continuous function defined on a compact subset of \mathbb{R}^n to arbitrary accuracy in the uniform norm, provided the activation function is continuous, non-constant, and satisfies certain mild conditions such as being sigmoidal (bounded, monotonically increasing, and approaching finite limits at \pm \infty). This theorem, first rigorously established for sigmoidal activation functions by George Cybenko in 1989, demonstrates the expressive power of shallow neural networks as universal function approximators, resolving key questions about their ability to model complex decision boundaries and mappings in finite-dimensional spaces. Subsequent work by Kurt Hornik, Maxwell Stinchcombe, and Halbert White in 1989 extended the result to show that standard multilayer feedforward networks with at least one hidden layer and arbitrary "squashing" (nonlinear) activation functions can approximate any Borel measurable function from one finite-dimensional space to another with arbitrary precision, given enough hidden units.

These theorems build on earlier approximation results, such as the Weierstrass approximation theorem (1885), which guarantees that polynomials can densely approximate continuous functions on compact intervals, and the Kolmogorov-Arnold representation theorem (1957), which decomposes multivariate continuous functions into sums of univariate compositions. The universal approximation property underpins the practical success of neural networks in machine learning tasks, justifying their use for regression, classification, and other function-fitting problems despite the curse of dimensionality and training challenges in high dimensions.

Variants of the theorem have since addressed deeper architectures, non-smooth activations like ReLU, and quantitative bounds on approximation error. For instance, results for rectified linear units confirm universal approximation for networks with two hidden layers, while depth-specific theorems, such as those by Zhou (2020), show that deeper networks can achieve better approximation rates for certain function classes. However, the theorems do not prescribe how to find the approximating network or guarantee efficient training, highlighting a gap between theoretical expressivity and practical optimization. Overall, these results affirm neural networks' role as flexible tools for approximating nonlinear mappings, influencing advancements in deep learning while motivating ongoing research into approximation guarantees for modern architectures like transformers and recent extensions to complex-valued networks and rough paths as of 2025.

Fundamentals

Setup

The universal approximation theorem addresses the representational capacity of feedforward neural networks to mimic continuous functions on compact domains. Central to this framework is the definition of a feedforward neural network with a single hidden layer and a finite number of neurons. Such a network computes a function \Phi: \mathbb{R}^n \to \mathbb{R} of the form \Phi(\mathbf{x}) = \sum_{j=1}^N \alpha_j \sigma \left( \mathbf{w}_j^\top \mathbf{x} + b_j \right), where \mathbf{x} \in \mathbb{R}^n is the input vector, N < \infty is the number of hidden neurons, \mathbf{w}_j \in \mathbb{R}^n and b_j, \alpha_j \in \mathbb{R} are real-valued weights and biases, and \sigma: \mathbb{R} \to \mathbb{R} is a fixed sigmoidal activation function applied to each pre-activation \mathbf{w}_j^\top \mathbf{x} + b_j. A sigmoidal activation function \sigma is defined as a continuous, non-constant function satisfying \lim_{t \to -\infty} \sigma(t) = \alpha and \lim_{t \to +\infty} \sigma(t) = \beta, where \alpha < \beta are finite real constants; a canonical example is the logistic function \sigma(t) = \frac{1}{1 + e^{-t}}, which maps \mathbb{R} to (0, 1). These networks operate under the assumption of real-valued parameters and a single hidden layer, with no restrictions on the magnitude of weights beyond finiteness. The theorem applies to approximating continuous target functions f: K \to \mathbb{R}, where K \subset \mathbb{R}^n is a compact (closed and bounded) subset, ensuring the function space C(K) of continuous functions on K is well-behaved for uniform convergence analysis. Uniform approximation in this context means that, for any \varepsilon > 0, there exists a network \Phi such that \sup_{\mathbf{x} \in K} \left| \Phi(\mathbf{x}) - f(\mathbf{x}) \right| < \varepsilon. This supremum norm criterion guarantees pointwise closeness across the entire domain K, leveraging the density properties of sigmoidal superpositions in C(K).
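As a concrete illustration of this setup, the following sketch (Python with NumPy; the names W, b, and alpha are chosen here to mirror the notation \mathbf{w}_j, b_j, \alpha_j and are not taken from any cited work) evaluates \Phi(\mathbf{x}) with the logistic sigmoid.

```python
import numpy as np

def sigmoid(t):
    """Logistic sigmoid, a canonical sigmoidal activation."""
    return 1.0 / (1.0 + np.exp(-t))

def shallow_net(x, W, b, alpha):
    """Evaluate Phi(x) = sum_j alpha_j * sigma(w_j^T x + b_j).

    x     : input vector, shape (n,)
    W     : hidden weights, shape (N, n)   (row j is w_j)
    b     : hidden biases,  shape (N,)
    alpha : output weights, shape (N,)
    """
    pre_activations = W @ x + b            # shape (N,)
    return float(alpha @ sigmoid(pre_activations))

# Example with N = 3 hidden neurons and n = 2 inputs (arbitrary parameters).
rng = np.random.default_rng(0)
W, b, alpha = rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=3)
print(shallow_net(np.array([0.5, -1.0]), W, b, alpha))
```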

Statement

The universal approximation theorem asserts that a feedforward neural network with a single hidden layer containing a sufficiently large number of neurons can approximate any continuous function defined on a compact subset of Euclidean space to arbitrary accuracy, provided the activation function is appropriately chosen, such as a continuous sigmoidal function. This result highlights the expressive power of shallow neural networks in function approximation tasks. Formally, let f: K \to \mathbb{R} be a continuous function where K \subset \mathbb{R}^n is compact, let \sigma: \mathbb{R} \to \mathbb{R} be a continuous sigmoidal function, and let \epsilon > 0. Then there exist an integer m > 0 and real numbers w_{ij}, v_j, b_j (for i=1,\dots,n and j=1,\dots,m) such that \left| \sum_{j=1}^m v_j \sigma\left( \sum_{i=1}^n w_{ij} x_i + b_j \right) - f(x) \right| < \epsilon for all x \in K. This statement applies to multivariable input functions from \mathbb{R}^n to \mathbb{R}. The theorem extends to vector-valued outputs f: K \to \mathbb{R}^q by applying the scalar approximation component-wise to each of the q output dimensions.
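Because the statement is purely existential, it does not say how to find the parameters. One simple way to exhibit an approximating network numerically is to draw the hidden weights and biases at random and fit only the output weights v_j by least squares; the sketch below (a hypothetical Python/NumPy example with an arbitrary target f and K = [-2, 2], not a construction from the theorem itself) typically shows the sup-norm error on a grid shrinking as the width m grows.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

f = lambda x: np.sin(3 * x) + 0.5 * x**2      # continuous target on K = [-2, 2]
x = np.linspace(-2.0, 2.0, 400)               # dense grid on the compact set K

for m in (5, 20, 100):                        # hidden-layer widths
    w = rng.normal(scale=3.0, size=m)         # random hidden weights
    b = rng.uniform(-6.0, 6.0, size=m)        # random hidden biases
    H = sigmoid(np.outer(x, w) + b)           # hidden activations, shape (400, m)
    v, *_ = np.linalg.lstsq(H, f(x), rcond=None)  # fit output weights only
    sup_err = np.max(np.abs(H @ v - f(x)))
    print(f"width m={m:4d}   sup-norm error on grid = {sup_err:.4f}")
```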

Historical Development

Arbitrary-Width Results

In 1989, foundational results established the universal approximation properties of single-hidden-layer neural networks with unbounded width, demonstrating their ability to densely approximate continuous functions on compact sets. These developments focused on the expressive power of wide shallow networks using nonlinear activation functions, laying the groundwork for understanding neural networks as universal function approximators. George Cybenko's theorem showed that finite superpositions of a continuous sigmoidal activation function applied to affine functionals form a dense subset of the space of continuous functions C(K) on any compact subset K of \mathbb{R}^n, enabling uniform approximation of any continuous function on K. This result highlighted the role of sigmoidal nonlinearities in achieving density, with the network structure consisting of a single hidden layer whose width can be increased arbitrarily to meet approximation requirements. Independently, Kurt Hornik, Maxwell Stinchcombe, and Halbert White extended this to multilayer feedforward networks with a single hidden layer, proving they are universal approximators for any continuous function from \mathbb{R}^n to \mathbb{R}, regardless of the input dimension n, provided the hidden-layer activation is a bounded nonconstant continuous function. Their proof emphasized that the approximation capability holds uniformly on compact sets and is independent of the specific squashing function, as long as it satisfies mild conditions like monotonicity or bounded variation. An independent result by Ken-ichi Funahashi in 1989 confirmed that a single hidden layer with sigmoidal activations can approximate continuous functions on compact subsets arbitrarily closely. Both theorems are existential, relying on nonconstructive density arguments from functional analysis, such as the Hahn-Banach and Stone-Weierstrass theorems, rather than on explicit constructions, and imply that the hidden layer width m must grow as the target approximation error ε decreases, though no quantitative bound on the growth rate of m with respect to ε or the function's complexity is derived.

Arbitrary-Depth Results

Building on this, Kurt Hornik's 1991 work established that multilayer perceptrons (MLPs) with non-constant, bounded, and nonlinear activation functions, such as sigmoids, possess universal approximation properties, allowing them to approximate any Borel measurable function from a compact subset of \mathbb{R}^n to \mathbb{R}^m to any desired degree of accuracy, provided the hidden layers use sufficiently many units while maintaining fixed width across layers. This result emphasized the role of depth in enabling robust approximation without relying solely on expansive width, highlighting the compositional power of multiple layers for capturing complex mappings. During the 1990s, further extensions explored how arbitrary depth in fixed-width networks improved approximation rates for specific function classes, such as those with bounded smoothness or variation. For instance, analyses of networks with sigmoidal activations of order k showed approximation rates of O(N^{-k/d}) for functions of smoothness order k on [0,1]^d, where N is the total number of parameters and d is the input dimension, outperforming single-layer networks for high-dimensional or smooth targets by leveraging hierarchical compositions. These developments underscored that increasing depth, even with constrained width, facilitates more efficient representations for functions requiring localized or hierarchical features, such as those arising in dynamical systems. Post-1990, theoretical research shifted emphasis from purely width-based single-layer models to the benefits of depth in MLPs, driven by empirical successes in training deeper architectures via backpropagation and the need to address the curse of dimensionality in approximation theory. This transition marked a pivotal evolution, positioning arbitrary-depth networks as foundational for advancing neural network expressivity in practical applications.

Bounded Depth and Width

In the 1990s, researchers began investigating the universal approximation capabilities of neural networks under constraints on both depth and width, revealing dependencies on approximation error and function properties. Vladimir Maiorov and Allan Pinkus demonstrated that for multilayer perceptrons (MLPs) with fixed depth L \geq 2 and bounded width m, effective approximation of continuous functions requires m to scale with the desired error \epsilon and the smoothness of the target function. Their analysis established lower bounds on the approximation error, showing that fixed architectures cannot universally approximate without adjusting width based on these factors. Subsequent work in the 2000s extended these insights to specific function classes, highlighting necessary trade-offs in network size. For instance, results for Lipschitz continuous functions indicated that bounded depth networks require the width to grow polynomially with 1/\epsilon to achieve approximation error \epsilon, with the degree of the polynomial depending on the input dimension d. This polynomial scaling underscores the practical limitations of shallow architectures in high dimensions, where width expansion becomes computationally demanding. A fundamental limitation of these bounded configurations is that networks with fixed depth and width cannot attain arbitrary accuracy for all continuous functions; instead, improving precision necessitates increasing either depth or width, thereby introducing explicit trade-offs between architectural constraints and representational power. These constraints contrast with results in which width or depth is left unbounded, where growing one dimension alone can suffice for density in function spaces. A notable historical milestone in understanding these trade-offs came in 2017 with Zhou Lu and colleagues, who proved that ReLU-activated networks with bounded width (specifically width n+4, where n is the input dimension) and arbitrary depth are universal approximators for continuous functions on compact subsets of \mathbb{R}^n. This result emphasized how modest width, combined with sufficient depth, mitigates the need for excessive parameter growth while preserving approximation guarantees.
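The following sketch (Python/NumPy, with random weights purely for illustration) shows only the architecture shape referenced by the Lu et al. (2017) result, a ReLU network of width n + 4 and arbitrary depth; their proof uses an explicit construction rather than random parameters.

```python
import numpy as np

def narrow_deep_relu(x, depth, rng):
    """Forward pass of a ReLU network of width n + 4 and the given depth.

    Random weights are used only to illustrate the narrow-but-deep shape
    discussed by Lu et al. (2017); it is not their explicit construction.
    """
    n = x.shape[0]
    width = n + 4
    h, in_dim = x, n
    for _ in range(depth):
        W = rng.normal(size=(width, in_dim)) / np.sqrt(in_dim)
        b = rng.normal(size=width)
        h = np.maximum(W @ h + b, 0.0)        # one ReLU layer
        in_dim = width
    w_out = rng.normal(size=width)
    return float(w_out @ h)

rng = np.random.default_rng(1)
print(narrow_deep_relu(np.array([0.3, -0.7, 1.2]), depth=8, rng=rng))
```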

Quantitative Bounds

One of the earliest quantitative results building on qualitative universal approximation theorems, such as Cybenko's 1989 result for sigmoidal networks, provides explicit error bounds in terms of network width. In 1993, Andrew Barron established that for functions whose Fourier transform has a finite first moment, a single-hidden-layer neural network with sigmoidal activation and width m can approximate the target function with an L^2 error bounded by O(1/\sqrt{m}), independent of the input dimension. This bound highlights the efficiency of such networks for functions in Barron's class, where the approximation rate avoids exponential dependence on dimension. In the 1990s, Vitaly Maiorov and collaborators extended these efforts by deriving tighter bounds that incorporate the function's Fourier moments more precisely, particularly for smooth functions in C(\mathbb{R}^d). For instance, Maiorov, Ron, and Meir (1996) showed that the approximation error by neural networks with ridge functions depends on higher-order Fourier characteristics, yielding rates that improve for functions with rapidly decaying Fourier spectra but still scale with smoothness parameters. These results sharpened the earlier picture by providing lower bounds and constructive estimates, emphasizing the role of the function's spectral properties in determining achievable rates. A comprehensive survey by Allan Pinkus in 1999 synthesized these developments, focusing on approximation rates for sigmoid-based networks. Pinkus detailed how error bounds for sigmoids often achieve polynomial rates in the number of hidden units for Lipschitz or Hölder continuous functions, with specific examples showing O(1/m^{1/2}) to O(1/m) convergence depending on the activation's smoothness. The survey underscored that while sigmoids enable strong rates for univariate projections, multivariate extensions require careful handling of basis selection. Despite these advances, many quantitative bounds suffer from the curse of dimensionality, where the error scales exponentially with the input dimension d unless the target function belongs to a favorable class like Barron's. For general continuous functions, achieving sub-exponential rates necessitates widths or depths that grow rapidly with d, limiting practical scalability in high dimensions.
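In compact form, Barron's bound (stated here from the 1993 paper up to the precise constants, so the exact form should be checked against the original) says the following: if f is defined on the ball B_r = \{x : \|x\| \le r\} and its Fourier transform \hat{f} has finite first moment C_f = \int_{\mathbb{R}^d} \|\omega\| \, |\hat{f}(\omega)| \, d\omega, then for any probability measure \mu on B_r and any width m there is a single-hidden-layer sigmoidal network f_m with m hidden units satisfying

\left( \int_{B_r} \bigl( f(x) - f_m(x) \bigr)^2 \, \mu(dx) \right)^{1/2} \;\le\; \frac{2 r C_f}{\sqrt{m}}.

The width dependence is the dimension-free rate 1/\sqrt{m}; any dependence on the input dimension d enters only through the spectral constant C_f, which can itself grow with d.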

Kolmogorov-Arnold Representation

The Kolmogorov-Arnold representation theorem, established in 1957, asserts that any continuous multivariate function can be expressed as a finite superposition of continuous univariate functions and addition, without relying on neural network architectures. Specifically, Andrey Kolmogorov proved that for any continuous function f: [0,1]^n \to \mathbb{R} and any integer n \geq 2, there exist continuous univariate functions \psi_{q,p}: [0,1] \to \mathbb{R} for q = 1, \dots, 2n+1 and p = 1, \dots, n, and continuous univariate functions \Phi_q: \mathbb{R} \to \mathbb{R} for q = 1, \dots, 2n+1, such that f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \Phi_q \left( \sum_{p=1}^n \psi_{q,p}(x_p) \right). This representation decomposes the multivariate function into linear projections followed by univariate nonlinearities and a summation, providing an exact decomposition rather than an approximation. In the same year, Vladimir Arnold refined Kolmogorov's result by demonstrating that functions of three variables can be represented as superpositions of continuous functions of two variables, bridging the gap toward the full multivariate case and confirming the superposition principle for lower dimensions. Arnold's contribution, titled "On the representation of functions of three variables by superpositions of functions of two variables," provided a constructive insight into the theorem's applicability for n=3, emphasizing the role of reduced-dimensional compositions in multivariate representations. This theorem serves as a foundational existence proof for the universal approximation theorem in neural networks, demonstrating that continuous functions on compact sets can be represented through compositions of univariate functions and linear combinations, thereby motivating the sufficiency of a single hidden layer with nonlinear activations for approximation capabilities. Unlike neural network-based approximations, the Kolmogorov-Arnold representation does not involve trainable parameters or optimization but highlights the theoretical possibility of such decompositions. However, the theorem's practical utility is limited by its computational inefficiency: the univariate functions \psi_{q,p} grow exponentially complex with n, involving highly oscillatory behaviors that render explicit construction infeasible for high dimensions.
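The structural form of the representation can be written down directly. The sketch below (Python, with arbitrary placeholder inner and outer functions, since the functions guaranteed by the theorem are highly irregular and not available in closed form) simply evaluates the double superposition for n = 2.

```python
import numpy as np

def kolmogorov_arnold_form(x, psi, Phi):
    """Evaluate sum_{q=1}^{2n+1} Phi_q( sum_{p=1}^{n} psi_{q,p}(x_p) ).

    x   : point in [0,1]^n
    psi : psi[q][p] is a univariate callable (inner function)
    Phi : Phi[q] is a univariate callable (outer function)
    """
    n = len(x)
    total = 0.0
    for q in range(2 * n + 1):
        inner = sum(psi[q][p](x[p]) for p in range(n))
        total += Phi[q](inner)
    return total

# Placeholder inner/outer functions, used only to exercise the structure (n = 2);
# they are NOT the functions whose existence the theorem guarantees.
n = 2
psi = [[(lambda xp, a=q + p: np.tanh(a * xp)) for p in range(n)] for q in range(2 * n + 1)]
Phi = [(lambda s, a=q: np.sin(a * s)) for q in range(2 * n + 1)]
print(kolmogorov_arnold_form([0.2, 0.7], psi, Phi))
```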

Reservoir Computing Extensions

Reservoir computing represents a significant extension of the universal approximation theorem to recurrent neural network architectures, where a fixed, randomly initialized recurrent layer—known as the reservoir—processes temporal inputs, and only the linear readout layer is trained. This approach leverages the echo state property, ensuring that the reservoir's state depends only on the current and past inputs, enabling efficient training while maintaining expressive power. In 2001, Herbert Jaeger introduced echo state networks (ESNs), a foundational reservoir computing model, demonstrating that such networks can universally approximate any fading memory input/output system to arbitrary accuracy. Fading memory systems are those where the influence of past inputs diminishes over time, common in time-series processing tasks. The proof relies on the reservoir's ability to embed input histories into a high-dimensional state space, allowing a simple linear readout to capture the desired mapping, provided the reservoir weights are chosen to satisfy the echo state condition (spectral radius less than or equal to 1). This result builds on earlier universal approximation guarantees for static functions by adapting them to dynamical, history-dependent computations without requiring gradient-based training of the recurrent weights. Building on this framework, Pathak et al. (2018) extended reservoir computing's universal approximation capabilities to large-scale spatiotemporal dynamical systems, such as chaotic fluid flows, by using an ensemble of reservoirs to predict unobserved states from partial observations. Their approach demonstrates that reservoir models can approximate the evolution of complex, high-dimensional attractors with low computational cost, achieving accurate short-term forecasts comparable to full-system simulations. This extension highlights the theorem's applicability to model-free prediction in physics and engineering, where traditional numerical methods are prohibitive. More recently, Monzani and Prati (2024) generalized these ideas to quantum reservoir computing, proving universal approximation for fading memory functionals on quantum states using unitary dynamics in the reservoir. In this quantum extension, the reservoir consists of a fixed random quantum circuit that maps input-driven quantum states into an expressive Hilbert space, with a trainable measurement readout providing the output. The universality holds under minimal conditions on the reservoir's capacity and contraction properties, analogous to classical ESNs, enabling applications in quantum machine learning for time-dependent quantum processes. This work unifies classical and quantum reservoirs under a common theoretical framework, confirming the theorem's robustness across computational paradigms.
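A minimal classical echo state network can be sketched as follows (Python/NumPy; the reservoir matrix is rescaled to spectral radius below 1 as a common heuristic for the echo state property, the target is an arbitrary delayed copy of the input standing in for a fading-memory functional, and only the linear readout is trained, here by ridge regression). This is an illustrative toy, not the construction used in the universality proofs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, washout, T = 200, 50, 1000

# One-dimensional input signal and a fading-memory target (a delayed copy).
u = rng.uniform(-0.5, 0.5, size=T)
y_target = np.roll(u, 3)                       # depends on recent input history

# Fixed random reservoir, rescaled so its spectral radius is below 1.
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
W_in = rng.uniform(-1.0, 1.0, size=n_res)

# Drive the reservoir and collect states; recurrent weights stay fixed.
states = np.zeros((T, n_res))
x = np.zeros(n_res)
for t in range(T):
    x = np.tanh(W @ x + W_in * u[t])
    states[t] = x

# Train only the linear readout (ridge regression), discarding a washout period.
S, y = states[washout:], y_target[washout:]
W_out = np.linalg.solve(S.T @ S + 1e-6 * np.eye(n_res), S.T @ y)
print("readout RMSE:", np.sqrt(np.mean((S @ W_out - y) ** 2)))
```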

Core Theorems

Arbitrary-Width Theorem

The arbitrary-width theorem, also known as the shallow network universal approximation theorem, establishes that a single-hidden-layer feedforward neural network with sufficiently many neurons can approximate any continuous function on a compact subset of Euclidean space to arbitrary accuracy. Formally, as proved by Cybenko in 1989, let K be a compact subset of \mathbb{R}^n and \sigma: \mathbb{R} \to \mathbb{R} a continuous sigmoidal activation function, meaning \sigma satisfies \lim_{t \to -\infty} \sigma(t) = 0 and \lim_{t \to \infty} \sigma(t) = 1 (typical examples being non-constant, bounded, monotonically increasing functions). Then, the set of functions of the form \Phi(x) = \sum_{j=1}^m v_j \sigma\left( \sum_{i=1}^n w_{ij} x_i + \theta_j \right), where m \in \mathbb{N}, v_j, w_{ij}, \theta_j \in \mathbb{R}, and x = (x_1, \dots, x_n) \in K, is dense in the space C(K) of continuous functions on K equipped with the supremum norm. This means that for any f \in C(K) and \epsilon > 0, there exist parameters m, v_j, w_{ij}, \theta_j such that \sup_{x \in K} |\Phi(x) - f(x)| < \epsilon. Cybenko's proof is a functional-analytic density argument: by the Hahn-Banach and Riesz representation theorems, if the span of such functions were not dense in C(K), some nonzero finite signed measure on K would annihilate every function of this form. The key insight is that continuous sigmoidal perceptrons \sigma(w \cdot x + b) form a discriminatory family, meaning no nonzero measure integrates all of them to zero, so their finite linear combinations generate a dense subspace of C(K) under the given conditions. This construction highlights the role of arbitrary width m in achieving universality, as increasing the number of hidden units allows finer partitioning of the input space to match the target function. Extensions of this theorem to non-sigmoidal activations have been explored, particularly in surveys from the late 1990s. For instance, Scarselli and Tsoi (1998) review approximation results for networks using linear or other non-smooth activations, such as the rectified linear unit (ReLU), \sigma(t) = \max(0, t), under conditions like sufficient width and appropriate input scaling to ensure the network spans a sufficiently rich family of piecewise linear functions capable of approximating continuous targets. These generalizations maintain the core density property but often require modifications to handle unbounded or non-monotonic behaviors, preserving the arbitrary-width requirement for single-layer architectures.

Arbitrary-Depth Theorem

The arbitrary-depth theorem establishes that multi-layer neural networks, with an arbitrary number of layers, possess universal approximation capabilities for continuous functions on compact subsets of Euclidean space. Specifically, Hornik, Stinchcombe, and White (1989) proved that standard multi-layer networks using a single hidden layer with continuous, bounded, and non-constant activation functions, such as sigmoidal functions, can uniformly approximate any continuous real-valued function defined on a compact subset of \mathbb{R}^n to any desired degree of accuracy, provided the hidden layer has sufficiently many neurons; this result extends naturally to networks with multiple hidden layers of fixed depth by composition. The theorem applies to networks where activations are applied layer-wise, enabling the representation of functions through nested nonlinear transformations. A key aspect of the theorem is the compositional structure of multi-layer networks, which can be expressed as nested summations of activation functions. For a network with L hidden layers, the output function takes the form f(\mathbf{x}) = \sum_{j_L=1}^{N_L} c_{j_L} \, \sigma\left( \sum_{j_{L-1}=1}^{N_{L-1}} w^{(L)}_{j_L j_{L-1}} \, \sigma\left( \cdots \sum_{j_1=1}^{N_1} w^{(2)}_{j_2 j_1} \, \sigma\left( \mathbf{w}^{(1)\top}_{j_1} \mathbf{x} + b^{(1)}_{j_1} \right) + b^{(2)}_{j_2} \cdots \right) + b^{(L)}_{j_L} \right), where \sigma is the activation function, N_\ell denotes the number of neurons in layer \ell, and the w and b terms are weights and biases. This nested form highlights how depth facilitates hierarchical feature extraction, where lower layers capture simpler patterns that higher layers combine into more abstract representations, though the existence of the approximation guarantee mirrors that of single-layer cases in relying on density arguments from functional analysis. While the core result holds for compact domains, extensions to non-compact domains like \mathbb{R}^n employ localization methods, decomposing the target function over a countable cover of compact sets using a partition of unity subordinate to the cover, with each local piece approximated by a multi-layer network. This approach leverages the sigma-compactness of \mathbb{R}^n to ensure uniform approximation on the entire space for suitable continuous functions. The theorem builds on foundational single-layer results, such as those by Funahashi (1989), which demonstrated analogous approximation properties for shallower architectures.
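The nested form above corresponds directly to an iterated forward pass. The sketch below (Python/NumPy, with tanh as a stand-in bounded non-constant activation and randomly chosen weights, purely for illustration) evaluates such a composition layer by layer.

```python
import numpy as np

def deep_net(x, hidden_layers, c, sigma=np.tanh):
    """Evaluate the nested form: layers of sigma(W h + b), then a linear readout c.

    hidden_layers : list of (W, b) pairs, one per hidden layer
    c             : output weights applied to the last hidden activation
    """
    h = np.asarray(x, dtype=float)
    for W, b in hidden_layers:
        h = sigma(W @ h + b)          # one layer of the nested composition
    return float(c @ h)

rng = np.random.default_rng(0)
dims = [3, 8, 8, 8]                   # input dimension 3, three hidden layers of width 8
layers = [(rng.normal(size=(dims[i + 1], dims[i])), rng.normal(size=dims[i + 1]))
          for i in range(len(dims) - 1)]
c = rng.normal(size=dims[-1])
print(deep_net([0.1, -0.4, 0.9], layers, c))
```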

Bounded Approximation Theorem

In the context of neural networks with constraints on both depth and width, the bounded approximation theorem examines the limitations imposed by fixed architectural sizes, revealing scenarios where universal approximation does not hold and trade-offs between depth L and width m become necessary for achieving desired accuracy levels. Unlike unbounded cases, networks with fixed L and m cannot densely span the space of all continuous functions C(K) on a compact set K \subseteq \mathbb{R}^d, as their representable function class forms a proper subset with finite capacity, leading to inherent approximation gaps for general targets. This limitation underscores the practical need for scaling network dimensions to balance expressivity and computational feasibility. A foundational result in this area is due to Maiorov and Pinkus (1999), who derived lower bounds on the approximation error for multilayer perceptron (MLP) networks with a bounded number of neurons per layer. For a network of fixed depth L and fixed width m using an analytic, strictly increasing sigmoidal activation function, the approximable functions are confined to specific subspaces analogous to superpositions of ridge functions, with the worst-case uniform approximation error over C(K) scaling inversely with the network size, the achievable precision for smooth targets depending in particular on the factor m^L. These bounds demonstrate that fixed-size networks attain near-optimal approximation only for restricted function classes, such as sufficiently smooth functions, but fail to achieve arbitrary accuracy for arbitrary continuous functions without increasing L or m. A key implication of these results is that, for fixed network size, the closure of the approximation class is not dense in C(K), implying an irreducible error floor for certain functions regardless of weight optimization. This non-denseness arises from the parametric form of the network, where the total number of free parameters scales as O(L m^2) (for fully connected layers), limiting the VC dimension and thus the covering of the function space. More recent analyses since 2017 have refined these insights through explicit depth-width product bounds for \varepsilon-approximation. For instance, results establish that a product L \times m = \Omega\left( \frac{d \log(1/\varepsilon)}{\varepsilon} \right) (where d is the input dimension) is necessary and sufficient for \varepsilon-uniform approximation of Hölder-smooth functions using ReLU activations, highlighting minimal scaling requirements to overcome bounded limitations. Complementing earlier works, Lu et al. (2017) provide trade-off characterizations showing that increasing depth can reduce required width exponentially for certain compositions, though fixed bounds still enforce error dependencies on the product.
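As a rough illustration of the O(L m^2) parameter scaling and the depth-width trade-off discussed above, the helper below (a hypothetical Python snippet with arbitrary example sizes, not tied to any particular bound) counts the parameters of fully connected networks that spend comparable budgets on different depth/width combinations.

```python
def mlp_param_count(d, m, L):
    """Parameters of a fully connected net: input d -> L hidden layers of width m -> 1 output."""
    first = m * d + m                 # input layer weights + biases
    hidden = (L - 1) * (m * m + m)    # remaining hidden layers, the O(L m^2) term
    readout = m + 1                   # linear output layer
    return first + hidden + readout

# Roughly comparable parameter budgets allocated to different depth/width combinations.
for L, m in [(2, 100), (8, 50), (32, 25)]:
    print(f"depth L={L:3d}, width m={m:4d}  ->  {mlp_param_count(10, m, L):,} parameters")
```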

Modern Variants and Extensions

Deep Learning Applications

The universal approximation theorem (UAT) provides a theoretical foundation for the empirical success of overparameterized deep neural networks, demonstrating that sufficiently wide or deep architectures possess the expressive capacity to approximate complex target functions arbitrarily well on compact domains. This justifies the use of models with millions or billions of parameters, as their approximation power ensures they can capture intricate data patterns without inherent representational limitations, aligning with observations that larger networks often yield better performance in practice. In convolutional neural networks (CNNs), pioneered by LeCun et al. in 1989 for tasks like handwritten digit recognition, the UAT extends to show that deep CNNs can universally approximate continuous functions on images by leveraging convolutional layers to efficiently capture spatial hierarchies. Similarly, graph neural networks (GNNs) apply UAT principles to approximate functions defined on graph-structured data, such as social networks or molecular structures, where message-passing mechanisms enable the representation of permutation-invariant mappings over irregular topologies. Despite these strengths, the UAT has practical limitations in deep learning, as it focuses solely on static approximation capabilities and overlooks dynamic aspects like optimization and generalization from finite data. For instance, the double descent phenomenon reveals that while overparameterization enhances expressivity, test error can first worsen near the interpolation threshold before improving again with even larger models, a behavior not predicted by the UAT alone. Furthermore, the UAT helps close theoretical gaps in architectures like transformers by confirming their capacity to approximate permutation-equivariant sequence-to-sequence functions, supporting their widespread adoption in natural language processing.

Recent Theoretical Advances

In recent years, theoretical work has refined the understanding of depth-width trade-offs in neural networks through kernel methods and related analyses. Bietti and Mairal (2021) analyzed deep convolutional models using a kernel perspective, deriving bounds that quantify how increasing depth allows for better approximation of functions with hierarchical structure, while wider networks capture features more efficiently; this highlights a trade-off where deeper architectures reduce the required width for achieving low approximation error in high-dimensional spaces like images. Advances in approximation guarantees for fixed architectures have also emerged, particularly for specialized classes of functions. For instance, Mikulincer and Reichman (2022) demonstrated that neural networks with fixed depth (e.g., four layers) and size can universally approximate continuous functions on compact domains, providing explicit bounds on the network size needed for \epsilon-approximation without relying on infinite-width limits. Further progress addresses expressivity hierarchies in deep networks, comparing their representational power across architectures. Gühring et al. (2020, with extensions in subsequent works up to 2023) explored how depth induces hierarchies of function classes, showing that deeper ReLU networks can express exponentially more complex functions than shallower ones for certain compositional tasks, filling gaps in prior universal approximation results by quantifying separation in terms of approximation rates. Key developments in non-sigmoidal activations, such as ReLU, have tightened minimal width requirements for universal approximation. Recent analyses confirm that ReLU networks with width greater than 2n + 1 (where n is the input dimension) suffice for dense approximation of continuous functions on compact sets, with improvements showing that width \max\{2, d_x, d_y\} is often minimal and sufficient for L^p approximation, enabling more efficient architectures without sacrificing expressivity. As of 2025, extensions include universal approximation theorems for spiking neural networks, which approximate continuous functions on compact domains using temporal coding, bridging neuromorphic computing with theoretical guarantees.

Quantum and Compositional Variants

Work on quantum convolutional neural networks by Cong et al. (2019) has inspired theoretical results establishing that quantum neural networks, composed of parameterized quantum circuits, can universally approximate continuous maps from classical inputs to quantum outputs, such as preparing arbitrary pure states or implementing target unitaries to arbitrary accuracy. This analogy to the classical UAT highlights the expressive power of shallow quantum circuits, where the depth required scales favorably for certain target functions, enabling efficient tasks like state preparation and operator learning. Building on quantum reservoir computing paradigms, which draw from classical reservoir extensions but leverage quantum dynamics for enhanced computational capacity, recent advances have formalized universality for approximating quantum states. Monzani and Prati provided a unified framework for classical and quantum reservoir computing, deriving minimal sufficient conditions, such as the presence of fading memory and a sufficiently rich state space, for universal approximation of time-series functionals, including those mapping to quantum density operators. Their 2024 results ensure that quantum reservoirs, driven by non-dissipative or dissipative quantum evolutions, can approximate any continuous fading-memory map to arbitrary precision, with error bounds scaling inversely with reservoir size, thus supporting applications in quantum time-series prediction and chaotic system emulation. Compositional variants refine the UAT by emphasizing approximation through compositions of a finite set of primitive mappings, akin to linguistic structures built from a limited vocabulary. Cai proved the existence of such a finite "vocabulary", a compact set of univariate functions whose sequential compositions can approximate any continuous multivariate function on compact domains to any desired accuracy. This 2023 construction leverages the Stone-Weierstrass theorem to guarantee density in the space of continuous functions, offering a theoretically grounded alternative to infinite-parameter models and facilitating more interpretable compositional architectures in neural networks. A practical embodiment of Kolmogorov-Arnold-inspired compositional approximation is the Kolmogorov-Arnold network (KAN), which replaces traditional activations with learnable univariate functions along network edges. Liu et al. demonstrated in 2024 that KANs achieve universal approximation for continuous functions, mirroring the Kolmogorov-Arnold representation theorem, while providing superior interpretability through symbolic regression capabilities that uncover analytical expressions from data. Empirical evaluations show KANs outperforming MLPs in accuracy and scaling efficiency for tasks like function fitting and PDE solving, with parameter counts reduced by up to 50% for comparable performance, underscoring their potential in interpretable machine learning.

Proof Techniques

Outline for Arbitrary Width

One prominent proof strategy for the arbitrary-width universal approximation theorem employs the Stone-Weierstrass theorem, which asserts that a subalgebra of continuous functions on a compact set that separates points and contains the constants is dense in the space of all continuous functions under the uniform norm. In this framework, the family generated by a single hidden layer, consisting of all finite linear combinations of the activation function \sigma composed with affine transformations of the input, separates distinct points in the input space because, for any x \neq y, there exists an affine function mapping x and y to values where \sigma takes different outputs, given \sigma's continuity and non-constancy. This family also vanishes nowhere if \sigma is chosen appropriately (e.g., sigmoids bounded away from zero in certain regimes), ensuring density and thus universal approximation by networks of arbitrary width. Cybenko's approach provides an alternative existential proof tailored to sigmoidal activations \sigma, which satisfy \lim_{t \to -\infty} \sigma(t) = 0 and \lim_{t \to +\infty} \sigma(t) = 1. The key insight is that scaled sigmoids \sigma(w \cdot (a^T x + b)) with large w approximate indicator functions of half-spaces \mathbf{1}_{\{a^T x + b \geq 0\}} on compact sets, as the sigmoid transitions sharply from 0 to 1. Any continuous function on a compact set can then be approximated by first partitioning the domain into measurable sets via half-spaces and weighting these step-like approximations to form a partition-of-unity subordinate to the function's level sets, yielding arbitrary uniform approximation by finite sums. This relies on the discriminatory property of sigmoids, ensuring no non-trivial functional annihilates the span. A constructive variant reduces the multivariate approximation problem to the univariate case through variable substitution, such as parameterizing the input along rays or using one-dimensional projections to build multivariate approximants incrementally. The foundational result underpinning these proofs is the density of the span of \{\sigma(a x + b) \mid a, b \in \mathbb{R}\} in C[0,1], established by showing that such combinations can approximate step functions (via scaling) and, through integration against partitions, any continuous univariate function. This one-dimensional density extends to higher dimensions by composing with affine maps that separate coordinates.
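The sharpening of scaled sigmoids toward half-space indicators can be checked numerically. The sketch below (Python/NumPy, with an arbitrary half-space and the logistic sigmoid) shows the average deviation from the indicator shrinking as the scale w grows; note that the convergence is pointwise away from the boundary hyperplane a^T x + b = 0 rather than uniform across it.

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Half-space {x : a^T x + b >= 0} in R^2 and its sharp indicator on random test points.
a, b = np.array([1.0, -2.0]), 0.3
pts = np.random.default_rng(0).uniform(-1, 1, size=(2000, 2))
affine = pts @ a + b
indicator = (affine >= 0).astype(float)

# As the scale w grows, sigma(w * (a^T x + b)) sharpens toward the indicator.
for w in (1, 10, 100, 1000):
    approx = sigmoid(w * affine)
    print(f"scale w={w:5d}   mean |sigma - indicator| = {np.mean(np.abs(approx - indicator)):.4f}")
```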

Outline for Depth and Bounded Cases

The proofs for the arbitrary-depth and bounded variants of the universal approximation theorem build upon the foundational single-layer results but incorporate additional structure to handle layering and resource constraints. For arbitrary depth, the argument proceeds inductively over the number of layers, leveraging the density of single-layer networks as the base case and demonstrating that composing with additional layers preserves this density property. Specifically, assuming that networks with k layers can approximate any continuous function on a compact set to arbitrary precision in the uniform norm, one extends this to k+1 layers by showing that the composition of a dense class with itself remains dense in the space of continuous functions, often relying on algebraic properties and the Stone-Weierstrass theorem in the Banach space C(K) of continuous functions on compact K. This inductive step ensures density at every depth and applies to sigmoidal activations, as established in the seminal work on multilayer networks. In the bounded case, where network width W and depth L are fixed or grow slowly, approximation error bounds are derived using covering numbers or VC dimension to quantify the function class complexity relative to the parameter count. Covering numbers N(\epsilon, \mathcal{F}, \|\cdot\|_p) measure the smallest number of \epsilon-balls needed to cover the hypothesis class \mathcal{F} of networks with at most W neurons per layer and depth L, yielding logarithmic bounds like \log N(\epsilon) \leq C W^2 L \log((W+1)^4 B^4 / \epsilon) for ReLU networks with bounded weights in [-B, B], which directly upper-bound the minimax approximation error over Lipschitz or Sobolev function classes. Similarly, VC dimension bounds, such as O(W L \log W) for deep ReLU networks, limit the shattering capacity and thus the ability to approximate high-complexity functions, providing error guarantees via connections to empirical risk minimization. These tools highlight that approximation fails when the effective dimension of the target function space exceeds the network's capacity, as captured by dimension counting: shallow networks with parameters scaling as O(W L) cannot approximate certain high-dimensional functions without parameter explosion, since the number of effective compositions L must grow linearly with d to overcome the curse of dimensionality. An important extension for ReLU activations, which produce piecewise linear functions, relies on constructing networks that realize fine-grained piecewise linear partitions of the domain to approximate continuous targets. The proof involves inductively building subnetworks for univariate line segments via composite ReLUs (e.g., differences of ReLUs to form ramps), then extending to multivariate cases through triangulation of compact domains into simplices, where each simplex is approximated linearly and combined affinely; this yields uniform approximation on compact sets, with network size scaling with the partition refinement. Functional-analytic techniques in Banach spaces underpin the convergence guarantees across these variants.
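The univariate ramp construction can be made concrete: differences of ReLUs reproduce the piecewise linear interpolant of a continuous function on a grid of knots, which is itself a single-hidden-layer ReLU network. The sketch below (Python/NumPy, with an arbitrary target on [0, 1]) illustrates this step; the multivariate triangulation step is not shown.

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)
f = lambda x: np.sin(4 * x)                         # continuous target on [0, 1]

def relu_interpolant(x, knots, fvals):
    """Piecewise-linear interpolant of (knots, fvals) written as a shallow ReLU net:
    g(x) = f(t_0) + sum_i (slope_i - slope_{i-1}) * relu(x - t_i), valid for x >= t_0."""
    slopes = np.diff(fvals) / np.diff(knots)        # slope on each interval
    coeffs = np.diff(slopes, prepend=0.0)           # slope increments = hidden-unit weights
    return fvals[0] + relu(np.subtract.outer(x, knots[:-1])) @ coeffs

x = np.linspace(0.0, 1.0, 2000)
for n_knots in (5, 20, 80):
    knots = np.linspace(0.0, 1.0, n_knots)
    g = relu_interpolant(x, knots, f(knots))
    print(f"{n_knots:3d} knots ({n_knots - 1} ReLU units): sup error = {np.max(np.abs(g - f(x))):.4f}")
```

Refining the knot grid shrinks the sup-norm error at the cost of more hidden units, mirroring how network size scales with the partition refinement in the ReLU constructions described above.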

    For multivariate function f(x), ReLU networks are constructed to approximate a piecewise linear function derived from triangulation methods approximating f(x).