Understanding Gaussian Processes, from scratch.

Sparse, Variational, Gaussian, and how they can be combined in one process.

Brownian Motion

As with most of Bayesian theory, we start with Bayes’ Rule,

\[P(\theta|y) = \frac{P(y|\theta) \cdot P(\theta)}{P(y)}\]

Until now, we have taken \(P(x)\) to mean the probability that \(x\) takes on a particular value, but now the game has changed – we shall now take \(P(x)\) to mean the probability that we encounter a particular function.

Indeed, this simple idea is at the heart of what we call the Gaussian Process.

Take a Normal Distribution, and model it as \(\mathcal{N}(0, \sigma^2)\). What if I keep sampling an element from this distribution, every second, for an hour?

As you can see, we get random noise – not very interesting. Note, of course, that if we take these samples and put them in a histogram, we shall regain the normal distribution, given a good number of samples drawn. That is the consequence of the Monte Carlo sampling process. Again, not very interesting.
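A rough sketch of that experiment (using NumPy, with an arbitrary seed and \(\sigma = 1\), both my own choices) is just repeated independent sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0

# One draw from N(0, sigma^2) per second, for an hour: 3600 i.i.d. samples.
samples = rng.normal(loc=0.0, scale=sigma, size=3600)

# A histogram of `samples` recovers the bell curve, but the sequence itself
# is just white noise -- each draw ignores every other draw.
```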

What is interesting, however, is if we tweak the equation a little, and add a dependency of the normal distribution on the past sample – we go from \(\mathcal{N}(0, \sigma^2)\) to \(\mathcal{N}(\theta, \sigma^2)\), where \(\theta\) is the last sample drawn, in each case.
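The tweak itself is one line: the mean of each draw is the previous draw. A minimal sketch, under the same assumptions as before:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n_steps = 1.0, 3600

# First-order Markov chain: each sample is centred on the previous one.
path = np.empty(n_steps)
theta = 0.0                        # start the walk at zero
for t in range(n_steps):
    theta = rng.normal(loc=theta, scale=sigma)
    path[t] = theta
# `path` now traces one realization of a (discretized) Brownian motion.
```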

We shall now see a drastically different result –

Note here that these samples will not form a histogram-representation of the normal distribution.

What is happening here? How did we go from the undecipherable, decidedly noisy output to a rather pretty line, with just a small dependence? Well, that is the beauty of the first-order Markov process – where each element is dependent on (and only dependent on) the last sample from the distribution.

Think two-gram language models, and think Brownian motion. With a simple change of variables, we go from modelling almost nothing interesting, to how pollen particles behave on the surface of non-viscous fluids, how stock markets function, and how (eventually), we can model any function with a Gaussian process.

Distributing functions instead of variables.

With a different seed, I could have generated a very different path for my proposed Brownian process, without really changing any of my parameters. Given that, I propose the following: each path traced by a particle exhibiting Brownian motion is the realization of a probability distribution over a set of functions.

It is somewhat intuitive that this can be considered true. You collapse the probability at every timestep \(t\) into a value, \(y\), until you have a realization at every infinitesimal point of the axis, and then you plot those realizations.

This idea brings us comfortably into the domain of Gaussian Processes.

Imagine this: each 2D function (\(y\) is a function of \(x\)) is, ultimately, formed when we find the value of \(y\) corresponding to each \(x\). What if, instead of knowing \(y\) exactly at some point, I produced some confidence metric to quantify how sure I am that \(y\) takes a particular value?

If that confidence was modelled by (as things are in Bayesian statistics) a Gaussian distribution, we could reasonably have a good visualization by simply plotting a candlestick with the width of the standard deviation, centered at the mean.

The mean, of course, would just be the point that avoids discontinuity in the proposed \(f(x)\) distribution.

No big deal, right? Let’s expand this idea.

What if, instead of just one point, this expanded to every point in the space, each with a mean and a standard deviation?

Well, then, assuming the probability of collapsing each of these “wavefunctions” into their most likely values is independent, we could trace a line through the means (which is to say, expectations) of all of these infinite Gaussian distributions and call it a day (and a function!).

For good measure, we’ll even plot an area that traps every other function within one standard deviation of that most likely function.

Note that instead of a bunch of different Gaussian distributions, we can consider this function as the realization of one infinite-dimensional Gaussian distribution (in mathematics, when something is parameterized by infinitely many parameters, we just shift to calling it non-parametric. Why? Because mathematicians are mercurial).

These “infinite-dimensional (and hence, non-parametric) Gaussian distributions” aren’t truly Gaussian distributions per se, since with infinitely many means, variances, and covariances, one can’t really verify the properties that a Gaussian distribution is deemed to have.

We can, however, say this:

A Gaussian process is a collection of infinitely many random variables, every finite subset of which follows a joint Gaussian distribution with a defined mean vector and covariance matrix.

Wonderful! Now, we have the theoretical understanding for what a Gaussian process is. What we still don’t know is how to fit it to some data.

Putting the Gaussian in the Gaussian Process.

Alright, so you have a way to quantify functions as probability distributions. Big deal – it’s not really useful unless you understand how to make predictions off of it, right?

Turns out there’s a way to do that: and it’s a closed form solution! But to get to that, we must first understand a few things.

The Covariance matrix is the prior.

As with all priors, to make a prediction, we first need to pack all of our assumptions about the underlying ground-truth into some quantity. In the case of Gaussian processes, that quantity is encoded in how we think each of the infinitely many points in the process is correlated with the others.

The covariance function, \(k(x_a, x_b)\), models the joint variability of the points in the Gaussian process: it tells us how much the function values at \(x_a\) and \(x_b\) vary with respect to each other.

Now, the covariance matrix needs to have certain properties, but let us try to approach them intuitively. The variance, in one dimension, is the square of the standard deviation – in other words, you must be able to take the square-root of the variance to get the standard deviation.

For \(N \times N\) matrices, what passes as the square root is found through the Cholesky decomposition, where we break a (symmetric) matrix \(\Sigma\) into a product, \(M^T \cdot M\), in some ways the equivalent of its square root.

We can prove that a Cholesky decomposition is only possible if the matrix is positive semi-definite, which is to say that it is a symmetric matrix with all eigenvalues non-negative (the semi-definite becomes definite when the eigenvalues are strictly positive). Note that some texts here mention that the matrix needs to actually be positive-definite, instead of allowing semi-definite matrices to suffice.
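A quick illustration (a sketch with a random \(4 \times 4\) matrix, made positive-definite by construction plus a small jitter term of my choosing): NumPy's Cholesky routine recovers exactly this matrix square root.

```python
import numpy as np

# A symmetric positive-definite "covariance" built as A A^T plus jitter.
A = np.random.default_rng(1).normal(size=(4, 4))
Sigma = A @ A.T + 1e-6 * np.eye(4)

# NumPy returns the lower-triangular factor L with L L^T = Sigma.
L = np.linalg.cholesky(Sigma)
assert np.allclose(L @ L.T, Sigma)
```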

From hereon, we will use the terms covariance function and kernel function interchangeably.

The kernel function is a prior applied on the process that establishes a certain relationship between each pair of variables involved. Note that this is not a matrix, since the number of pairs is infinite, so the function is a valuable abstraction.

We use the Radial Basis Function, or the exponentiated quadratic covariance function, which is our go-to, vanilla kernel when we don’t know much about our function. We could also have a periodic kernel, which predicts that current trends will recur in our function, and so on. A linear kernel imposes that we have a straight line instead of a curve, and therefore makes our prediction boil down to simple Bayesian Linear Regression.

The RBF causes our functions to come out locally smooth with high probability. Nearby function values are highly correlated, and the correlation drops off as we move farther apart in the input space.
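For concreteness, here is one way the exponentiated quadratic might be written for 1-D inputs (the name `rbf_kernel`, the unit variance, and the unit lengthscale are illustrative choices, not fixed by anything above):

```python
import numpy as np

def rbf_kernel(xa, xb, variance=1.0, lengthscale=1.0):
    """Exponentiated quadratic: k(a, b) = s^2 * exp(-(a - b)^2 / (2 l^2))."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)
```

Inputs close together get a covariance near the variance; distant inputs decay towards zero, which is exactly the local-smoothness prior described above.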

The prior, visualized.

Let’s take a step back and see what we have till now – we are imagining a vector-space of random functions, and each function is described by an infinite number of points, each of which has its own mean, together with an overall covariance matrix.

The function at some \(x_0\) is most likely to be at the mean, \(\mu_0\), at that \(x_0\), but actually could be anywhere on the vertical line \(x = x_0\). We can, however, restrict it to some standard deviation, so we can have a reasonable estimate of the value of \(f(x)\) at \(x_0\).

The covariance tells us how collapsing the probability at that point constrains all the other points.

To plot this “perfect” abstraction of a function with infinitely many parameters, we “demote” it to a subset of \(n\) Gaussian distributions: a marginal distribution expressed as \(y \sim \mathcal{N}(\mu, \Sigma)\), with mean vector \(\mu = \mu(X)\) and covariance \(\Sigma = k(X,X)\), where \(k(x_a, x_b)\) is our kernel function.

Let us say that our \(\mu\) is \(0\). This implies that the \(\mu\) vector is simply a vector of zeroes (of length \(n\) for this marginal, and conceptually infinite for the full process).

Now, we can draw correlated samples from this \(n\) dimensional Gaussian, \(\mathcal{N}(0, k(X,X))\). For now, we can approximate \(4\) random functions with a \(100\) dimensional Gaussian distribution.

The RBF kernel correlates nearby points much more than points further away. As a result, in our \(100\) dimensional example, points corresponding to \(x_{50}\) and \(x_{51}\) could have a covariance of, say, \(0.95\), while the covariance between \(x_{50}\) and \(x_{99}\) might be just \(0.05\).
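A sketch of those four prior draws, reusing the `rbf_kernel` sketched above (the input range, seed, and jitter term are my own choices):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.linspace(-3, 3, 100)           # the 100-dimensional marginal of the GP

mu = np.zeros(len(X))
Sigma = rbf_kernel(X, X) + 1e-8 * np.eye(len(X))   # jitter for stability

# Four correlated draws from N(0, k(X, X)): four plausible prior functions.
prior_samples = rng.multivariate_normal(mu, Sigma, size=4)
```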

Building a Regression Model.

We start by noting that all functions that can exist in some \(n\) dimensional space have already been encoded into our simple, zero mean, RBF kernel Gaussian process. Based on the data, they are just more or less likely to be realized.

What we ought to do is to take the data, and use the information to push the prior towards a posterior that maximizes the likelihood of realizing the ground-truth function from our Gaussian Process.

Let’s keep this simple: the professor has given you some data, you have your very pretty kernel function as your prior (and from there, a prior distribution over the function space in the form of a zero-mean Gaussian Process, \(h(\cdot) \sim \mathcal{GP}(0, k(x_a, x_b))\)), and now he’s told you to predict what the function’s values are going to be in a bunch of places.

The data you get from the professor is of size \(n_1\), expressed as (\(X_1, y_1\)); you have your prior kernel function, the RBF; and you have a set of \(n_2\) points, expressed as \(X_2\) – the places where he’s asked you to predict the function’s values.

Ergo, all you have to do is guess \(y_2 = f(X_2)\) as well as possible.

Let us look at what operations we can do here. We have a function, and some data. The function tells us how we think the data is correlated, so let’s evaluate the function on all of the data. Considering \(\Sigma_{ab} = k(X_a, X_b)\), there are four things we can evaluate – \(\Sigma_{11}\), \(\Sigma_{22}\), \(\Sigma_{12}\), and \(\Sigma_{21}\).

Again, we are to predict \(y_2\), and we have at hand, \(y_1\), \(X_1\), and \(X_2\).

Among them, \(X_2\) doesn’t really encode any new information; it just tells us where to evaluate the function.

As always, we now use Bayes’ Rule to find \(p(y_2 \vert y_1, X_1, X_2)\), that is, we try to find a posterior for \(y_2\) given all of the data we have.

Gaussian distributions have certain properties that make calculating this much simpler. As a recap (or as a hint), make sure that you know the following identities:

Given a joint Gaussian over \(x_1\) and \(x_2\), with marginals \(p(x_1) = \mathcal{N}(\mu_1, \Sigma_{11})\) and \(p(x_2) = \mathcal{N}(\mu_2, \Sigma_{22})\) and cross-covariance \(\Sigma_{12}\), it follows that,

\[p(x_1 | x_2) = \mathcal{N}(\mu_{1|2}, \Sigma_{1 | 2})\] \[\mu_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2)\] \[\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{12}^T\]

Let us consider \(\mu_1\) and \(\mu_2\) to be \(0\) in this case. Then, our posterior becomes:

\(\mu_{2 \vert 1} = \Sigma_{21} \Sigma_{11}^{-1} y_1\) and our posterior covariance, \(\Sigma_{2 \vert 1} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{21}^T\)

Note: in our identities, we considered dataset \(1\) conditioned on \(2\), and here we’re doing \(2\) conditioned on \(1\), so the subscripts have changed. It is a good exercise to go through both and see what depends on what. I highly recommend a readthrough of the Conditional Probability section of Peter Roelants’s Multivariate Normal Primer.

Also note that \(\Sigma_{11}\) is symmetrical, and therefore equal to its transpose.

And voila! We have our Gaussian Process regression model ready to go!
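A compact sketch of that closed form, assuming the `rbf_kernel` above and noise-free observations (the function name and the jitter value are illustrative):

```python
import numpy as np

def gp_posterior(X1, y1, X2, kernel):
    """Zero-mean GP posterior over f(X2) given observations (X1, y1)."""
    S11 = kernel(X1, X1) + 1e-8 * np.eye(len(X1))   # jitter for stability
    S12 = kernel(X1, X2)
    S22 = kernel(X2, X2)

    # Solve linear systems instead of forming Sigma_11^{-1} explicitly.
    alpha = np.linalg.solve(S11, y1)
    mu_post = S12.T @ alpha                          # Sigma_21 Sigma_11^{-1} y_1
    cov_post = S22 - S12.T @ np.linalg.solve(S11, S12)
    return mu_post, cov_post
```

Note that `np.linalg.solve` is used rather than an explicit inverse: it computes the same \(\Sigma_{11}^{-1}\) products as the formulas above, but is more numerically stable.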

Towards faster regression.

Now that we know our regression works, let us see how much data we need to get to a good approximation of the ground-truth function – after all, Bayesian statistics is about confidences first, and we should know how confident we are of the true function before we proceed.

We first try to approximate a simple trigonometric function, \(f(x) = \sin(x) + \cos(x)\), over a support of \(-3\) to \(3\), with \(10,000\) randomly-chosen data-points. We expect a very confident answer, and we do get one – albeit after quite a while, due to the time complexity, which is something we do not want.

What happens when we reduce this number to, say, \(5\) randomly-chosen data-points? Also, what happens at \(10\)?

In fact, let us use our own Gaussian Process to predict how the complexity of our algorithm grows with time, inspired by Zhenwen Dai’s session on Sparse Gaussian Process in GPSS 2019 – with the runtime for \(N = {5, 10, 50, 100, 500, 1000}\).
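One way such a timing experiment might be set up, reusing the `gp_posterior` and `rbf_kernel` sketches from earlier (the grid of \(N\) values follows the text; the test grid size and seed are my own choices):

```python
import time
import numpy as np

rng = np.random.default_rng(3)
runtimes = []
for N in (5, 10, 50, 100, 500, 1000):
    X1 = rng.uniform(-3, 3, N)
    y1 = np.sin(X1) + np.cos(X1)
    X2 = np.linspace(-3, 3, 200)

    start = time.perf_counter()
    gp_posterior(X1, y1, X2, rbf_kernel)
    runtimes.append((N, time.perf_counter() - start))
# The (N, runtime) pairs are the data one could feed back into a GP
# to extrapolate how the cost grows with N.
```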

It is quite clear that our confidence has decreased rather drastically: something we also do not want. This warrants two questions. One: what tradeoff between the number of data-points and the time required for inference are we willing to accept? Two (and more pertinent to the problem at hand): if, instead of choosing randomly, we used a heuristic to manipulate our matrix or our data-points so that we keep only the most important ones, how far could we go?

A short discussion on the Time Complexity of the enterprise.

Notice that the covariance matrix \(\Sigma_{11} \in \mathbb{R}^{N \times N}\) needs to be inverted (and its determinant computed) for analysis and inference – which incurs a cost of \(\mathcal{O}(N^3)\) in time and \(\mathcal{O}(N^2)\) in memory.

That is the primary problem with Gaussian Processes – \(\mathcal{O}(N^3)\) in time is more than we can afford.

To preserve the accuracy of our Gaussian Process, what we must do is shed away all redundant data, while retaining the information that collapses the probabilities of our Gaussian Process at novel locations. Remember that in a matrix, the rank is the number of linearly independent rows (or columns) that we can glean out of it – a full-rank matrix would be one where nothing is wasted, and a low-rank matrix would be one which is wasteful and can be compressed (and so we shall do just that).

Our aim, now, is to determine whether (and it almost always is) our data matrix is low-rank, and then transform it to a matrix that encodes more novelty into our prediction, and then use that matrix for inversion.

Simple.

The Nyström approximation: efficient use of random samples.

For a Gaussian Process, the log-likelihood is \(\log p(y \vert X) = \log \mathcal{N} (y \vert 0, K(X_1,X_1))\), where \(X_1\) is, as discussed, the points where the data is known. \(K\) is the covariance matrix over the data, according to our kernel function.

Our assumption will be that \(K\) is (approximately) low-rank, and hence wasteful. Our goal is to replace \(K\) with a cheaper low-rank approximation for eventual inference.

What follows is a series of matrix computations, so we shall keep the following identity in mind: if you have a matrix \(K\) of size \(N \times M\) (the kernel evaluated between all \(N\) points and a subset of \(M\) of them), and a matrix \(S\) of size \(M \times M\) (the kernel evaluated on that subset alone), then \(K S^{-1} K^T\) retains the size \(N \times N\).

This, in effect, “sparsifies” our matrix, thereby reducing the amount of raw computations required to calculate the inverse of \(\Sigma\).

The Nyström approximation, therefore, requires us to choose a subset of \(M\) data-points, evaluate \(K \in \mathbb{R}^{N \times M}\) between the full data and the subset and \(S \in \mathbb{R}^{M \times M}\) on the subset alone, and replace the full covariance with the low-rank approximation \(\mathcal{K} = K S^{-1} K^T\).

Again, the log likelihood, as discussed, is given with our new covariance matrix and some added noise, as \(\mathcal{L} = \log p(y \vert X, \theta) = \log \mathcal{N} (y \vert 0, \mathcal{K} + \sigma^2 I)\).

Simply expanding \(\mathcal{L}\) yields \(- \frac{1}{2} [ \log \vert 2 \pi (\mathcal{K} + \sigma^2 I) \vert + y^T (\mathcal{K} + \sigma^2I)^{-1} y ]\), which does not lead to any speedup in computation – we still have to perform \(N^3\) calculations to invert the \(N \times N\) matrix \((\mathcal{K} + \sigma^2 I)\).

However, we note that \((A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + V A^{-1}U)^{-1}VA^{-1}\), where A, U, C and V have the dimensions necessary for said operations. This is called the Woodbury matrix identity, and essentially says that the inverse of a rank-\(k\) correction to a matrix can be computed by doing a rank-\(k\) correction to the inverse of the original matrix.
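A quick numerical sanity check of the identity (a sketch with small random matrices of my own choosing; the diagonal \(A\) keeps its inverse trivial):

```python
import numpy as np

rng = np.random.default_rng(4)
N, k = 6, 2
A = np.diag(rng.uniform(1.0, 2.0, N))          # easy-to-invert N x N matrix
U = rng.normal(size=(N, k))
C = np.eye(k)
V = U.T

lhs = np.linalg.inv(A + U @ C @ V)
rhs = np.linalg.inv(A) - np.linalg.inv(A) @ U @ np.linalg.inv(
    np.linalg.inv(C) + V @ np.linalg.inv(A) @ U) @ V @ np.linalg.inv(A)
assert np.allclose(lhs, rhs)                   # Woodbury holds numerically
```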

Aside: on rank corrections.

While working on the Woodbury formula, we talked about rank corrections – what is that supposed to mean?

Consider a linear regression problem, where you’re trying to compute the coefficients of a linear model. The closed form solution to a plain linear regression model is \((X^T X)^{-1} X^T Y\). Assume that \(X\) has full column rank, which means that the variables involved are linearly independent.

What are we to do when we observe new samples, \(x_{n+1}, x_{n+2}, ... \in \mathbb{R}^k\)? Rather than explicitly re-computing the matrix inverse of \(M = (X^T X)\) wastefully, we use the matrix inversion lemma to update the stored inverse of \(M\), which we should keep around when saving our model.
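A sketch of that update for a single new sample, using the rank-1 special case of the lemma (the Sherman-Morrison formula); the helper name and calling convention are my own:

```python
import numpy as np

def add_observation(M_inv, x_new):
    """Rank-1 update of (X^T X)^{-1} when a single new row x_new arrives.

    Sherman-Morrison:
    (M + x x^T)^{-1} = M^{-1} - (M^{-1} x x^T M^{-1}) / (1 + x^T M^{-1} x)
    """
    x = x_new.reshape(-1, 1)
    Mx = M_inv @ x
    return M_inv - (Mx @ Mx.T) / (1.0 + x.T @ Mx)
```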

Back to the Nyström Approximation

We note here that the Woodbury formula reduces the term \((\mathcal{K} + \sigma^2 I)^{-1}\), which appears in the expansion for our Normal distribution, to \([\sigma^{-2}I - \sigma^{-4}K(S + \sigma^{-2}K^T K )^{-1}K^T]\).

Note that here, \(K \in \mathbb{R}^{N \times M}\) and \(S \in \mathbb{R}^{M \times M}\).

Consequently, the matrix inverted, \((S + \sigma^{-2}K^T K )\) is of size \(M \times M\).

Immediately, our complexity falls to \(\mathcal{O}(NM^2)\). That is the Nyström approximation, and it gets better and better as \(M\) approaches the size of the whole dataset. The subset selection is generally done randomly.
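Putting the pieces together, here is a sketch of how the Nyström-approximated inverse might be computed, reusing the `rbf_kernel` sketch from earlier (the function name, jitter, and seed are illustrative; the random subset choice follows the text):

```python
import numpy as np

def nystrom_inverse(X, sigma2, M, kernel, seed=5):
    """Approximate (K_hat + sigma^2 I)^{-1}, K_hat = K S^{-1} K^T, in O(N M^2)."""
    N = len(X)
    rng = np.random.default_rng(seed)
    subset = rng.choice(N, size=M, replace=False)     # random subset selection

    K = kernel(X, X[subset])                          # N x M
    S = kernel(X[subset], X[subset]) + 1e-8 * np.eye(M)

    # Woodbury: the only linear system solved is M x M.
    inner = S + (K.T @ K) / sigma2                    # M x M
    return np.eye(N) / sigma2 - (K @ np.linalg.solve(inner, K.T)) / sigma2**2
```

Forming the dense \(N \times N\) result here is purely for illustration; in practice one keeps the factors around and applies them to vectors as needed.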