Neural Network Gaussian Process (NNGP)

Summary

Here, I will prove the NNGP result. The proof is by induction: as we go through the layers of the NN, the Gaussian structure is preserved from layer to layer while the kernel is updated.

\(f^{old}\) is a GP \(\Rightarrow\) \(f^{new}\) is a GP.

Definition

A “Gaussian process” is a random function \(f(x)\) such that any finite sample of points \(x\) produces a Gaussian vector. In other words, a Gaussian process generalizes a Gaussian vector to an infinite number of points.

\[f(x) \sim \mathcal{GP} \bigl( \underbrace{m(x)}_{\text{mean}}, \underbrace{\Sigma (x, x^\prime)}_{\text{covariance}} \bigr), \qquad f(x) \in \mathbb{R}\]

For any finite set of test points \(x^{(1)}, \dots, x^{(n_{\text{test}})}\), the vector \(\bigl( f(x^{(1)}), \dots, f(x^{(n_{\text{test}})}) \bigr)\) is a Gaussian vector with mean entries \(m(x^{(i)})\) and covariance matrix \(\bigl[ \Sigma \bigl( x^{(i)}, x^{(j)} \bigr) \bigr]_{i,j=1}^{n_{\text{test}}}\):

\[\begin{split}\begin{bmatrix} f(x^{(1)})\\ \vdots\\ f(x^{(n_{\text{test}})}) \end{bmatrix} \sim \mathcal{GaussianVector} \Biggl( \begin{bmatrix} m(x^{(1)})\\ \vdots\\ m(x^{(n_{\text{test}})})\\ \end{bmatrix}, \Bigl[ \Sigma \bigl( x^{(i)}, x^{(j)} \bigr) \Bigr]_{i,j=1}^{n_{\text{test}}} \Biggr)\end{split}\]

So a Gaussian process is just an object whose every finite-dimensional restriction is a Gaussian vector.
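To make this concrete, here is a minimal sketch (assuming NumPy; the RBF covariance is an illustrative choice, not from these notes) of evaluating a Gaussian process at a finite set of test points: build the mean vector and covariance matrix, then sample the resulting Gaussian vector.

```python
import numpy as np

def rbf_kernel(x, x_prime, length_scale=1.0):
    # An illustrative choice of covariance Sigma(x, x'); any positive
    # semi-definite kernel works here.
    return np.exp(-0.5 * np.sum((x - x_prime) ** 2) / length_scale**2)

# Finite set of test points x^(1), ..., x^(n_test) in R^{n_in}
n_test, n_in = 5, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(n_test, n_in))

# Mean vector m(x^(i)) (zero here) and covariance matrix Sigma(x^(i), x^(j))
mean = np.zeros(n_test)
cov = np.array([[rbf_kernel(X[i], X[j]) for j in range(n_test)]
                for i in range(n_test)])

# Any finite-dimensional restriction of the GP is just a Gaussian vector:
samples = rng.multivariate_normal(mean, cov, size=10)  # 10 draws of f at the test points
print(samples.shape)  # (10, n_test)
```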

The main result is that in a neural network, as you go from one layer to the next, if one layer is a Gaussian process then the next layer is also a Gaussian process; propagating this through the network yields the NNGP. Note: this requires the layer widths to approach infinity.

Proposition

Note: the old function is assumed to be a Gaussian process and the new function is the next layer's update.

Suppose \(f^{old} : \mathbb{R}^{n_{in}} \xrightarrow{} \mathbb{R}^{n_{old}}\) is such that every component \(f_i^{old} \sim \mathcal{GP}(0, \Sigma^{old} (x, x^\prime))\) and \(f_i^{old}, f_j^{old}\) are independent for \(i \neq j\),

and define \(f^{new} : \mathbb{R}^{n_{in}} \xrightarrow{} \mathbb{R}^{n_{new}}\) by

\[f^{new}(x) = \frac{\sigma_W}{\sqrt{n_{old}}} \underbrace{W}_{n_{new} \times n_{old}} \underbrace{\phi( f^{old}(x) )}_{n_{old} \times 1} + \sigma_b \underbrace{b}_{n_{new} \times 1}\]

Then \(f^{new}\) is also a Gaussian process, and in the limit \(n_{old} \xrightarrow{} \infty\) we can write its kernel:

\[f_i^{new} \sim \mathcal{GP} (0, \Sigma^{new} (x, x^\prime))\]

with covariance structure \(\Sigma^{new} (x, x^\prime) = \sigma_W^2 \mathbb{E} \bigl[ \phi \bigl( f^{old}_1 (x) \bigr) \, \phi \bigl( f_1^{old}(x^\prime ) \bigr) \bigr] + \sigma_b^2\), or more simply \(\Sigma^{new} (x, x^\prime) = \sigma_W^2 \mathbb{E} \bigl[ \phi ( Z ) \, \phi ( Z^\prime ) \bigr] + \sigma_b^2\) where

\[\begin{split}(Z, Z^\prime) \sim \mathcal{Gaussian} \biggl( 0, \begin{bmatrix} \Sigma^{old} (x, x) & \Sigma^{old} ( x, x^\prime)\\ \Sigma^{old} (x^\prime, x) & \Sigma^{old} ( x^\prime, x^\prime)\\ \end{bmatrix} \biggr)\end{split}\]

And \(f_i^{new}, f_j^{new}\) are independent for \(i \neq j\).
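The kernel update in the proposition can be estimated numerically. Below is a minimal sketch (the ReLU activation, the \(\sigma_W, \sigma_b\) values, and the sample count are illustrative assumptions, not taken from these notes) that draws \((Z, Z^\prime)\) from the bivariate Gaussian built from \(\Sigma^{old}\) and averages \(\phi(Z)\,\phi(Z^\prime)\) to approximate \(\Sigma^{new}(x, x^\prime)\).

```python
import numpy as np

def kernel_update(sigma_old_xx, sigma_old_xxp, sigma_old_xpxp,
                  phi=lambda z: np.maximum(z, 0.0),   # ReLU, an illustrative choice
                  sigma_w=1.0, sigma_b=0.1, n_samples=200_000, seed=0):
    """Monte Carlo estimate of Sigma^new(x, x') from Sigma^old at the pair (x, x')."""
    rng = np.random.default_rng(seed)
    # (Z, Z') ~ Gaussian(0, [[Sigma^old(x,x),  Sigma^old(x,x')],
    #                        [Sigma^old(x',x), Sigma^old(x',x')]])
    cov = np.array([[sigma_old_xx,  sigma_old_xxp],
                    [sigma_old_xxp, sigma_old_xpxp]])
    z = rng.multivariate_normal(np.zeros(2), cov, size=n_samples)
    # Sigma^new(x, x') = sigma_W^2 E[phi(Z) phi(Z')] + sigma_b^2
    return sigma_w**2 * np.mean(phi(z[:, 0]) * phi(z[:, 1])) + sigma_b**2

# Example: update the kernel value at one pair of inputs
print(kernel_update(sigma_old_xx=1.0, sigma_old_xxp=0.5, sigma_old_xpxp=1.0))
```

Iterating this update layer by layer, starting from the kernel of the first layer, is exactly the propagation through the network described above.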

Short proposition

\(f^{old}\) is a GP \(\Rightarrow\) \(f^{new}\) is a GP.

Proof

\(f_i^{new}\) is exactly a random feature regression model with features given by

\[\varphi_j(x) = \phi \bigl( f^{old}(x) \bigr)_j, \qquad j = 1, \dots, n_{old}\]

\(\Rightarrow f_i^{new}\) is a GP with kernel:

\[" K(x, x^\prime) " = \frac{\sigma w^2}{n_{old}} \sum_{i=1}^{n_{\text{old}}} \phi \bigl( f^{old}(x) \bigr)_i \phi \bigl( f^{old}(x^\prime) \bigr)_i + \sigma_b^2\]
\[\xrightarrow{} \sigma_W^2 \mathbb{E} \bigl[ \phi \bigl(f^{old}(x) \bigr)_1 , \phi \bigl(f^{old}(x) \bigr)_1 \bigr] + \sigma_b^2\]

and \(f_i^{new}\) and \(f_j^{new}\) are independent for \(i \neq j\), since they depend on independent rows of \(W\) and independent entries of \(b\).
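To see the limit in the proof numerically, here is a minimal sketch (the ReLU activation and the specific \(\Sigma^{old}\) values are illustrative assumptions) comparing the finite-width empirical kernel \("K(x, x^\prime)"\) with its infinite-width limit as \(n_{old}\) grows; the two agree increasingly well, as the law of large numbers predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
phi = lambda z: np.maximum(z, 0.0)      # ReLU, an illustrative choice
sigma_w, sigma_b = 1.0, 0.1

# Sigma^old evaluated at one fixed pair (x, x'); values chosen for illustration
cov_old = np.array([[1.0, 0.5],
                    [0.5, 1.0]])

# Infinite-width kernel: sigma_W^2 E[phi(Z) phi(Z')] + sigma_b^2 (large Monte Carlo sample)
z = rng.multivariate_normal(np.zeros(2), cov_old, size=1_000_000)
k_limit = sigma_w**2 * np.mean(phi(z[:, 0]) * phi(z[:, 1])) + sigma_b**2

# Finite-width empirical kernel: average over the n_old i.i.d. components of f^old
for n_old in [10, 100, 1_000, 10_000]:
    f_old = rng.multivariate_normal(np.zeros(2), cov_old, size=n_old)  # rows: (f_i^old(x), f_i^old(x'))
    k_emp = sigma_w**2 * np.mean(phi(f_old[:, 0]) * phi(f_old[:, 1])) + sigma_b**2
    print(n_old, k_emp, "->", k_limit)
```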

Acknowledgements

Summarized from https://bpb-ca-c1.wpmucdn.com/sites.uoguelph.ca/dist/8/175/files/2021/02/Notes_on_feature_regression_and_wide_NNs.pdf