Universal approximation theorem

In the mathematical theory of artificial neural networks, universal approximation theorems are results[1] that establish the density of an algorithmically generated class of functions within a given function space of interest. Typically, these results concern the approximation capabilities of the feedforward architecture on the space of continuous functions between two Euclidean spaces, and the approximation is with respect to the compact convergence topology. However, there are also a variety of results between non-Euclidean spaces and other commonly used architectures and, more generally, algorithmically generated sets of functions, such as the convolutional neural network (CNN) architecture,[2][3] radial basis-functions,[4] or neural networks with specific properties.[5] Most universal approximation theorems can be parsed into two classes. The first, line of research quantifies the approximation capabilities of neural networks with an arbitrary number of artificial neurons ("arbitrary width" case) and the later line of research focuses on the case where there is an arbitrary number of hidden layers, each containing a limited number of artificial neurons ("arbitrary depth" case).

Such theorems thus imply that neural networks can represent a wide variety of interesting functions when given appropriate weights. On the other hand, they typically do not provide a construction for the weights, but merely state that such a construction is possible.

History

One of the first versions of the arbitrary width case was proved by George Cybenko in 1989 for sigmoid activation functions.[6] Kurt Hornik showed in 1991[7] that it is not the specific choice of the activation function, but rather the multilayer feed-forward architecture itself which gives neural networks the potential of being universal approximators. Moshe Leshno et al in 1993[8] and later Allan Pinkus in 1999[9] showed that the universal approximation property, as defined in [10], is equivalent to having a nonpolynomial activation function.

The arbitrary depth case was also studied by number of authors, such as Zhou Lu et al in 2017,[11] Boris Hanin and Mark Sellke in 2018,[12] and Patrick Kidger and Terry Lyons in 2020.[13]

Several extensions of the theorem exist, such as to discontinuous activation functions, alternative network architectures, other topologies, and noncompact domains.[8][13][14]

Arbitrary Width Case

The classical form of the universal approximation theorem for arbitrary width is as follows:[6][7][15][16] The following formulation, due to Allan Pinkus,[9] extends the classical results of George Cybenko and Kurt Hornik.

Fix a continuous function ${\displaystyle \sigma$ and positive integers $d,D$ . The function $\sigma$ is not a polynomial if and only if, for every continuous function $f:\mathbb {R} ^{d}\to \mathbb {R} ^{D}$ , every compact subset $K$ of $\mathbb {R} ^{d}$ , and every $\epsilon >0$ there exists a continous function $f_{\epsilon }:\mathbb {R} ^{d}\to \mathbb {R} ^{D}$ with representation

$f_{\epsilon }=W_{2}\circ \sigma \bullet W_{1},$

where $W_{2},W_{1}$ are composable affine maps and $\bullet$ denotes component-wise composition, such that the approximation bound

$\sup _{x\in K}\,\|f(x)-f_{\epsilon }(x)\|<\varepsilon$

holds.

This theorem extends straightforwardly to networks with any fixed number of hidden layers: the theorem implies that the first layer can approximate any desired function and that later layers can approximate the identity function. Thus any fixed-depth network may approximate any continuous function, and this version of the theorem applies to networks with bounded depth and arbitrary width.

Arbitrary Depth Case

The 'dual' versions of the theorem consider networks of bounded width and arbitrary depth. A variant of the universal approximation theorem was proved for the arbitrary depth case by Zhou Lu et all in 2017.[11] They showed that networks of width n+4 with ReLU activation functions can approximate any Lebesgue integrable function on n-dimensional input space with respect to $L^{1}$ distance if network depth is allowed to grow. It was also shown that there was the limited expressive power if the width was less than or equal to n. All Lebesgue integrable functions except for a zero measure set cannot be approximated by ReLU networks of width n. In the same paper[11] it was shown that ReLU networks with width n+1 were sufficient to approximate any continuous function of n-dimensional input variables:[17]

Universal approximation theorem (L1 distance, ReLU activation, arbitrary depth). For any Lebesgue-integrable function $f:\mathbb {R} ^{n}\rightarrow \mathbb {R}$ and any $\epsilon >0$ , there exists a fully-connected ReLU network ${\mathcal {A}}$ with width $d_{m}\leq {n+4}$ , such that the function $F_{\mathcal {A}}$ represented by this network satisfies

$\int _{\mathbb {R} ^{n}}\left|f(x)-F_{\mathcal {A}}(x)\right|\mathrm {d} x<\epsilon$

Another variant was given by Patrick Kidger and Terry Lyons in 2020:[13]

Universal approximation theorem (nonaffine activation, arbitrary depth). Let ${\displaystyle \varphi$ be any nonaffine continuous function which is continuously differentiable at at least one point, with nonzero derivative at that point. Let $K\subseteq \mathbb {R} ^{n}$ be compact. The space of real vector-valued continuous functions on $K$ is denoted by $C(K;\mathbb {R} ^{m})$ . Let ${\mathcal {N}}$ denote the space of feedforward neural networks with $n$ input neurons, $m$ output neurons, and an arbitrary number of hidden layers each with $n+m+2$ neurons, such that every hidden neuron has activation function $\varphi$ and every output neuron has the identity as its activation function. Then given any $\varepsilon >0$ and any $f\in C(K;\mathbb {R} ^{m})$ , there exists $F\in {\mathcal {N}}$ such that

$|F(x)-f(x)|<\varepsilon$

for all $x\in K$ .

In other words, ${\mathcal {N}}$ is dense in $C(K;\mathbb {R} ^{m})$ with respect to the uniform norm.

Certain necessary conditions for the bounded width, arbitrary depth case have been established, but there is still a gap between the known sufficient and necessary conditions.[11][12][18]

gollark: They have very many good people working on various aspects of their security.

gollark: It goes up and down constantly.

gollark: Isn't that only safeish because you would normally ~~blink~~ close your eyes faster than that?

gollark: No, it's for cool laser things™.

gollark: Possibly not Chinese and whatever, since those have more information per character.

References

Balázs Csanád Csáji (2001) Approximation with Artificial Neural Networks; Faculty of Sciences; Eötvös Loránd University, Hungary
Zhou, Ding-Xuan (2020) Universality of deep convolutional neural networks; Applied and computational harmonic analysis 48.2 (2020): 787-794.
A. Heinecke, J. Ho and W. Hwang (2020); Refinement and Universal Approximation via Sparsely Connected ReLU Convolution Nets; IEEE Signal Processing Letters, vol. 27, pp. 1175-1179.
Park, Jooyoung, and Irwin W. Sandberg (1991); Universal approximation using radial-basis-function networks; Neural computation 3.2, 246-257.
Yarotsky, Dmitry (2018); Universal approximations of invariant maps by neural networks.
Cybenko, G. (1989) "Approximations by superpositions of sigmoidal functions", Mathematics of Control, Signals, and Systems, 2(4), 303–314. doi:10.1007/BF02551274
Kurt Hornik (1991) "", Neural Networks, 4(2), 251–257. doi:10.1016/0893-6080(91)90009-T
Leshno, Moshe; Lin, Vladimir Ya.; Pinkus, Allan; Schocken, Shimon (January 1993). "Multilayer feedforward networks with a nonpolynomial activation function can approximate any function". Neural Networks. 6 (6): 861–867. doi:10.1016/S0893-6080(05)80131-5.
Pinkus, Allan (January 1999). "Approximation theory of the MLP model in neural networks". Acta Numerica. 8: 143–195. doi:10.1017/S0962492900002919.
Kratsios, Anastasis (August 7, 2020). "Characterizing the Universal Approximation Property". arXiv:1910.03344.
Lu, Zhou; Pu, Homgming; Wang, Feicheng; Hu, Zhiqiang; Wang, Liwei. "The Expressive Power of Neural Networks: A View from the Width". Advances in Neural Information Processing Systems 30. Curran Associates, Inc.: 6231–6239.
Hanin, Boris; Sellke, Mark (March 2019). "Approximating Continuous Functions by ReLU Nets of Minimal Width". Mathematics, Special Issue on Computational Mathematics, Algorithms, and Data Processing. MDPI AG.
Kidger, Patrick; Lyons, Terry (July 2020). Universal Approximation with Deep Narrow Networks. Conference on Learning Theory.
Lin, Hongzhou; Jegelka, Stefanie (2018). ResNet with one-neuron hidden layers is a Universal Approximator. Advances in Neural Information Processing Systems 30. Curran Associates, Inc. pp. 6169–6178.
Haykin, Simon (1998). Neural Networks: A Comprehensive Foundation, Volume 2, Prentice Hall. ISBN 0-13-273350-1.
Hassoun, M. (1995) Fundamentals of Artificial Neural Networks MIT Press, p. 48
Hanin, B. (2018). Approximating Continuous Functions by ReLU Nets of Minimal Width. arXiv preprint arXiv:1710.11278.
Johnson, Jesse (2019). Deep, Skinny Neural Networks are not Universal Approximators. International Conference on Learning Representations.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.

[1] Balázs Csanád Csáji (2001) Approximation with Artificial Neural Networks; Faculty of Sciences; Eötvös Loránd University, Hungary

[2] Zhou, Ding-Xuan (2020) Universality of deep convolutional neural networks; Applied and computational harmonic analysis 48.2 (2020): 787-794.

[3] A. Heinecke, J. Ho and W. Hwang (2020); Refinement and Universal Approximation via Sparsely Connected ReLU Convolution Nets; IEEE Signal Processing Letters, vol. 27, pp. 1175-1179.

[4] Park, Jooyoung, and Irwin W. Sandberg (1991); Universal approximation using radial-basis-function networks; Neural computation 3.2, 246-257.

[5] Yarotsky, Dmitry (2018); Universal approximations of invariant maps by neural networks.

[cyb-6] Cybenko, G. (1989) "Approximations by superpositions of sigmoidal functions", Mathematics of Control, Signals, and Systems, 2(4), 303–314. doi:10.1007/BF02551274

[horn-7] Kurt Hornik (1991) "", Neural Networks, 4(2), 251–257. doi:10.1016/0893-6080(91)90009-T

[leshno-8] Leshno, Moshe; Lin, Vladimir Ya.; Pinkus, Allan; Schocken, Shimon (January 1993). "Multilayer feedforward networks with a nonpolynomial activation function can approximate any function". Neural Networks. 6 (6): 861–867. doi:10.1016/S0893-6080(05)80131-5.

[pinkus-9] Pinkus, Allan (January 1999). "Approximation theory of the MLP model in neural networks". Acta Numerica. 8: 143–195. doi:10.1017/S0962492900002919.

[Kratsios-10] Kratsios, Anastasis (August 7, 2020). "Characterizing the Universal Approximation Property". arXiv:1910.03344.

[ZhouLu-11] Lu, Zhou; Pu, Homgming; Wang, Feicheng; Hu, Zhiqiang; Wang, Liwei. "The Expressive Power of Neural Networks: A View from the Width". Advances in Neural Information Processing Systems 30. Curran Associates, Inc.: 6231–6239.

[hanin-12] Hanin, Boris; Sellke, Mark (March 2019). "Approximating Continuous Functions by ReLU Nets of Minimal Width". Mathematics, Special Issue on Computational Mathematics, Algorithms, and Data Processing. MDPI AG.

[kidger-13] Kidger, Patrick; Lyons, Terry (July 2020). Universal Approximation with Deep Narrow Networks. Conference on Learning Theory.

[14] Lin, Hongzhou; Jegelka, Stefanie (2018). ResNet with one-neuron hidden layers is a Universal Approximator. Advances in Neural Information Processing Systems 30. Curran Associates, Inc. pp. 6169–6178.

[15] Haykin, Simon (1998). Neural Networks: A Comprehensive Foundation, Volume 2, Prentice Hall. ISBN 0-13-273350-1.

[16] Hassoun, M. (1995) Fundamentals of Artificial Neural Networks MIT Press, p. 48

[17] Hanin, B. (2018). Approximating Continuous Functions by ReLU Nets of Minimal Width. arXiv preprint arXiv:1710.11278.

[johnson-18] Johnson, Jesse (2019). Deep, Skinny Neural Networks are not Universal Approximators. International Conference on Learning Representations.

Universal approximation theorem

History

Arbitrary Width Case

Arbitrary Depth Case

See also

References