pymc.ADVI#

class pymc.ADVI(*args, **kwargs)[source]#

Automatic Differentiation Variational Inference (ADVI)

This class implements the meanfield ADVI, where the variational posterior distribution is assumed to be spherical Gaussian without correlation of parameters and fit to the true posterior distribution. The means and standard deviations of the variational posterior are referred to as variational parameters.

For explanation, we classify random variables in probabilistic models into three types. Observed random variables \({\cal Y}=\{\mathbf{y}_{i}\}_{i=1}^{N}\) are \(N\) observations. Each \(\mathbf{y}_{i}\) can be a set of observed random variables, i.e., \(\mathbf{y}_{i}=\{\mathbf{y}_{i}^{k}\}_{k=1}^{V_{o}}\), where \(V_{k}\) is the number of the types of observed random variables in the model.

The next ones are global random variables \(\Theta=\{\theta^{k}\}_{k=1}^{V_{g}}\), which are used to calculate the probabilities for all observed samples.

The last ones are local random variables \({\cal Z}=\{\mathbf{z}_{i}\}_{i=1}^{N}\), where \(\mathbf{z}_{i}=\{\mathbf{z}_{i}^{k}\}_{k=1}^{V_{l}}\). These RVs are used only in AEVB (which is not implemented in PyMC).

The goal of ADVI is to approximate the posterior distribution \(p(\Theta,{\cal Z}|{\cal Y})\) by variational posterior \(q(\Theta)\prod_{i=1}^{N}q(\mathbf{z}_{i})\). All of these terms are normal distributions (mean-field approximation).

\(q(\Theta)\) is parametrized with its means and standard deviations. These parameters are denoted as \(\gamma\). While \(\gamma\) is a constant, the parameters of \(q(\mathbf{z}_{i})\) are dependent on each observation. Therefore these parameters are denoted as \(\xi(\mathbf{y}_{i}; \nu)\), where \(\nu\) is the parameters of \(\xi(\cdot)\). For example, \(\xi(\cdot)\) can be a multilayer perceptron or convolutional neural network.

In addition to \(\xi(\cdot)\), we can also include deterministic mappings for the likelihood of observations. We denote the parameters of the deterministic mappings as \(\eta\). An example of such mappings is the deconvolutional neural network used in the convolutional VAE example in the PyMC notebook directory.

This function maximizes the evidence lower bound (ELBO) \({\cal L}(\gamma, \nu, \eta)\) defined as follows:

\[\begin{split}{\cal L}(\gamma,\nu,\eta) & = \mathbf{c}_{o}\mathbb{E}_{q(\Theta)}\left[ \sum_{i=1}^{N}\mathbb{E}_{q(\mathbf{z}_{i})}\left[ \log p(\mathbf{y}_{i}|\mathbf{z}_{i},\Theta,\eta) \right]\right] \\ & - \mathbf{c}_{g}KL\left[q(\Theta)||p(\Theta)\right] - \mathbf{c}_{l}\sum_{i=1}^{N} KL\left[q(\mathbf{z}_{i})||p(\mathbf{z}_{i})\right],\end{split}\]

where \(KL[q(v)||p(v)]\) is the Kullback-Leibler divergence

\[KL[q(v)||p(v)] = \int q(v)\log\frac{q(v)}{p(v)}dv,\]

\(\mathbf{c}_{o/g/l}\) are vectors for weighting each term of ELBO. More precisely, we can write each of the terms in ELBO as follows:

\[\begin{split}\mathbf{c}_{o}\log p(\mathbf{y}_{i}|\mathbf{z}_{i},\Theta,\eta) & = & \sum_{k=1}^{V_{o}}c_{o}^{k} \log p(\mathbf{y}_{i}^{k}| {\rm pa}(\mathbf{y}_{i}^{k},\Theta,\eta)) \\ \mathbf{c}_{g}KL\left[q(\Theta)||p(\Theta)\right] & = & \sum_{k=1}^{V_{g}}c_{g}^{k}KL\left[ q(\theta^{k})||p(\theta^{k}|{\rm pa(\theta^{k})})\right] \\ \mathbf{c}_{l}KL\left[q(\mathbf{z}_{i}||p(\mathbf{z}_{i})\right] & = & \sum_{k=1}^{V_{l}}c_{l}^{k}KL\left[ q(\mathbf{z}_{i}^{k})|| p(\mathbf{z}_{i}^{k}|{\rm pa}(\mathbf{z}_{i}^{k}))\right],\end{split}\]

where \({\rm pa}(v)\) denotes the set of parent variables of \(v\) in the directed acyclic graph of the model.

When using mini-batches, \(c_{o}^{k}\) and \(c_{l}^{k}\) should be set to \(N/M\), where \(M\) is the number of observations in each mini-batch. This is done with supplying total_size parameter to observed nodes (e.g. Normal('x', 0, 1, observed=data, total_size=10000)). In this case it is possible to automatically determine appropriate scaling for \(logp\) of observed nodes. Interesting to note that it is possible to have two independent observed variables with different total_size and iterate them independently during inference.

For working with ADVI, we need to give

  • The probabilistic model

    model with two types of RVs (observed_RVs, global_RVs).

  • (optional) Minibatches

    The tensors to which mini-bathced samples are supplied are handled separately by using callbacks in Inference.fit() method that change storage of shared Aesara variable or by pymc.generator() that automatically iterates over minibatches and defined beforehand.

  • (optional) Parameters of deterministic mappings

    They have to be passed along with other params to Inference.fit() method as more_obj_params argument.

For more information concerning training stage please reference pymc.variational.opvi.ObjectiveFunction.step_function()

Parameters
model:class:pymc.Model

PyMC model for inference

random_seed: None or int

leave None to use package global RandomStream or other valid value to create instance specific one

start: `dict[str, np.ndarray]` or `StartDict`

starting point for inference

start_sigma: `dict[str, np.ndarray]`

starting standard deviation for inference, only available for method ‘advi’

References

  • Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2016). Automatic Differentiation Variational Inference. arXiv preprint arXiv:1603.00788.

  • Geoffrey Roeder, Yuhuai Wu, David Duvenaud, 2016 Sticking the Landing: A Simple Reduced-Variance Gradient for ADVI approximateinference.org/accepted/RoederEtAl2016.pdf

  • Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. stat, 1050, 1.

Methods

ADVI.__init__(*args, **kwargs)

ADVI.fit([n, score, callbacks, progressbar])

Perform Operator Variational Inference

ADVI.refine(n[, progressbar])

Refine the solution using the last compiled step function

ADVI.run_profiling([n, score])

Attributes

approx