Gaussian Processes

GP Basics

Sometimes an unknown parameter or variable in a model is not a scalar value or a fixed-length vector, but a function. A Gaussian process (GP) can be used as a prior probability distribution whose support is over the space of continuous functions. A GP prior on the function \(f(x)\) is usually written,

\[f(x) \sim \mathcal{GP}(m(x), \, k(x, x')) \,.\]

The function values are modeled as a draw from a multivariate normal distribution that is parameterized by the mean function, \(m(x)\), and the covariance function, \(k(x, x')\). Gaussian processes are a convenient choice as priors over functions due to the marginalization and conditioning properties of the multivariate normal distribution. Usually, the marginal distribution over \(f(x)\) is evaluated during the inference step. The conditional distribution is then used for predicting the function values \(f(x_*)\) at new points, \(x_*\).

The joint distribution of \(f(x)\) and \(f(x_*)\) is multivariate normal,

\[\begin{split}\begin{bmatrix} f(x) \\ f(x_*) \\ \end{bmatrix} \sim \text{N}\left( \begin{bmatrix} m(x) \\ m(x_*) \\ \end{bmatrix} \,, \begin{bmatrix} k(x,x') & k(x_*, x) \\ k(x_*, x) & k(x_*, x_*') \\ \end{bmatrix} \right) \,.\end{split}\]

Starting from the joint distribution, one obtains the marginal distribution of \(f(x)\), as \(\text{N}(m(x),\, k(x, x'))\). The conditional distribution is

\[f(x_*) \mid f(x) \sim \text{N}\left( k(x_*, x) k(x, x)^{-1} [f(x) - m(x)] + m(x_*) ,\, k(x_*, x_*) - k(x, x_*) k(x, x)^{-1} k(x, x_*) \right) \,.\]


For more information on GPs, check out the book Gaussian Processes for Machine Learning by Rasmussen & Williams, or this introduction by D. Mackay.

PyMC3 is a great environment for working with fully Bayesian Gaussian Process models. GPs in PyMC3 have a clear syntax and are highly composable, and many predefined covariance functions (or kernels), mean functions, and several GP implementations are included. GPs are treated as distributions that can be used within larger or hierarchical models, not just as standalone regression models.

Mean and covariance functions

Those who have used the GPy or GPflow Python packages will find the syntax for construction mean and covariance functions somewhat familiar. When first instantiated, the mean and covariance functions are parameterized, but not given their inputs yet. The covariance functions must additionally be provided with the number of dimensions of the input matrix, and a list that indexes which of those dimensions they are to operate on. The reason for this design is so that covariance functions can be constructed that are combinations of other covariance functions.

For example, to construct an exponentiated quadratic covariance function that operates on the second and third column of a three column matrix representing three predictor variables:

ls = [2, 5] # the lengthscales
cov_func =, ls=ls, active_dims=[1, 2])

Here the ls, or lengthscale, parameter is two dimensional, allowing the second and third dimension to have a different lengthscale. The reason we have to specify input_dim, the total number of columns of X, and active_dims, which of those columns or dimensions the covariance function will act on, is because cov_func hasn’t actually seen the input data yet. The active_dims argument is optional, and defaults to all columns of the matrix of inputs.

Covariance functions in PyMC3 closely follow the algebraic rules for kernels, which allows users to combine covariance functions into new ones, for example:

  • The sum two covariance functions is also a covariance function:

    cov_func = +
  • The product of two covariance functions is also a covariance function:

    cov_func = *
  • The product (or sum) of a covariance function with a scalar is a covariance


cov_func = eta**2 *

After the covariance function is defined, it is now a function that is evaluated by calling cov_func(x, x) (or mean_func(x)). Since PyMC3 is built on top of Theano, it is relatively easy to define and experiment with non-standard covariance and mean functons. For more information check out the tutorial on covariance functions.

GP Implementations

PyMC3 includes several GP implementations, including marginal and latent variable models and also some fast approximations. Their usage all follows a similar pattern: First, a GP is instantiated with a mean function and a covariance function. Then, GP objects can be added together, allowing for function characteristics to be carefully modeled and separated. Finally, one of prior, marginal_likelihood or conditional methods is called on the GP object to actually construct the PyMC3 random variable that represents the function prior.

Using gp.Latent for the example, the syntax to first specify the GP is:

gp =, cov_func)

The first argument is the mean function and the second is the covariance function. We’ve made the GP object, but we haven’t made clear which function it is to be a prior for, what the inputs are, or what parameters it will be conditioned on.


The gp.Marginal class and similar don’t have a prior method. Instead they have a marginal_likelihood method that is used similarly, but has additional required arguments, such as the observed data, noise, or other, depending on the implementation. See the notebooks for examples. The conditional method works similarly.

Calling the prior method will create a PyMC3 random variable that represents the latent function \(f(x) = \mathbf{f}\):

f = gp.prior("f", X)

f is a random variable that can be used within a PyMC3 model like any other type of random variable. The first argument is the name of the random variable representing the function we are placing the prior over. The second argument is the inputs to the function that the prior is over, X. The inputs are usually known and present in the data, but they can also be PyMC3 random variables. If the inputs are a Theano tensor or a PyMC3 random variable, the shape needs to be given.

Usually at this point, inference is performed on the model. The conditional method creates the conditional, or predictive, distribution over the latent function at arbitrary \(x_*\) input points, \(f(x_*)\). To construct the conditional distribution we write:

f_star = gp.conditional("f_star", X_star)

Additive GPs

The GP implementation in PyMC3 is constructed so that it is easy to define additive GPs and sample from individual GP components. We can write:

gp1 =, cov_func1)
gp2 =, cov_func2)
gp3 = gp1 + gp2

The GP objects have to have the same type, gp.Marginal cannot be added to gp.Latent.

Consider two independent GP distributed functions, \(f_1(x) \sim \mathcal{GP}\left(m_1(x),\, k_1(x, x')\right)\) and \(f_2(x) \sim \mathcal{GP}\left( m_2(x),\, k_2(x, x')\right)\). The joint distribution of \(f_1,\, f_1^*,\, f_2,\, f_2^*,\, f_1 + f_2 and f_1^* + f_2^*\) is

\[\begin{split}\begin{bmatrix} f_1 \\ f_1^* \\ f_2 \\ f_2^* \\ f_1 + f_2 \\ f_1^* + f_2^* \end{bmatrix} \sim \text{N}\left( \begin{bmatrix} m_1 \\ m_1^* \\ m_2 \\ m_2^* \\ m_1 + m_2 \\ m_1^* + m_2^* \\ \end{bmatrix} \,,\, \begin{bmatrix} K_1 & K_1^* & 0 & 0 & K_1 & K_1^* \\ K_1^{*^T} & K_1^{**} & 0 & 0 & K_1^* & K_1^{**} \\ 0 & 0 & K_2 & K_2^* & K_2 & K_2^{*} \\ 0 & 0 & K_2^{*^T} & K_2^{**} & K_2^{*} & K_2^{**} \\ K_1 & K_1^{*} & K_2 & K_2^{*} & K_1 + K_2 & K_1^{*} + K_2^{*} \\ K_1^{*^T} & K_1^{**} & K_2^{*^T} & K_2^{**} & K_1^{*^T}+K_2^{*^T} & K_1^{**}+K_2^{**} \end{bmatrix} \right) \,.\end{split}\]

Using the joint distribution to obtain the conditional distribution of \(f_1^*\) with the contribution due to \(f_1 + f_2\) factored out, we get

\[f_1^* \mid f_1 + f_2 \sim \text{N}\left( m_1^* + K_1^{*^T}(K_1 + K_2)^{-1}\left[f_1 + f_2 - m_1 - m_2\right] \,,\, K_1^{**} - K_1^{*^T}(K_1 + K_2)^{-1}K_1^* \right) \,.\]

These equations show how to break down GP models into individual components to see how each contributes to the data. For more information, check out David Duvenaud’s PhD thesis.

The GP objects in PyMC3 keeps track of these marginals automatically. The following code sketch shows how to define the conditional distribution of \(f_2^*\). We use gp.Marginal in the example, but the same works for other implementations. The first block fits the GP prior. We denote \(f_1 + f_2\) as just \(f\) for brevity:

with pm.Model() as model:
    gp1 =, cov_func1)
    gp2 =, cov_func2)

    # gp represents f1 + f2.
    gp = gp1 + gp2

    f = gp.marginal_likelihood("f", X, y, noise)

    trace = pm.sample(1000)

To construct the conditional distribution of gp1 or gp2, we also need to include the additional arguments, X, y, and noise:

with model:
    # conditional distributions of f1 and f2
    f1_star = gp1.conditional("f1_star", X_star,
                              given={"X": X, "y": y, "noise": noise, "gp": gp})
    f2_star = gp2.conditional("f2_star", X_star,
                              given={"X": X, "y": y, "noise": noise, "gp": gp})

    # conditional of f1 + f2, `given` not required
    f_star = gp.conditional("f_star", X_star)

This second block produces the conditional distributions. Notice that extra arguments are required for conditionals of \(f1\) and \(f2\), but not \(f\). This is because those arguments are cached when calling .marginal_likelihood was called on gp.


When constructing conditionals, the additional arguments X, y, noise and gp must be provided as a dict called given!

Since the marginal likelihoood method of gp1 or gp2 weren’t called, their conditionals need to be provided with the required inputs. In the same fashion as the prior, f_star, f1_star and f2_star are random variables that can now be used like any other random variable in PyMC3.

Check the notebooks for detailed demonstrations of the usage of GP functionality in PyMC3.