Inference

Sampling

pymc3.sampling.sample(draws=500, step=None, init='auto', n_init=200000, start=None, trace=None, chain_idx=0, chains=None, cores=None, tune=500, nuts_kwargs=None, step_kwargs=None, progressbar=True, model=None, random_seed=None, live_plot=False, discard_tuned_samples=True, live_plot_kwargs=None, compute_convergence_checks=True, use_mmap=False, **kwargs)

Draw samples from the posterior using the given step methods.

Multiple step methods are supported via compound step methods.

Parameters:

draws : int

The number of samples to draw. Defaults to 500. Tuned samples are discarded by default; see discard_tuned_samples.

step : function or iterable of functions

A step function or collection of functions. If there are variables without step methods, step methods for those variables will be assigned automatically.

init : str

Initialization method to use for auto-assigned NUTS samplers.

  • auto : Choose a default initialization method automatically. Currently, this is ‘jitter+adapt_diag’, but this can change in the future. If you depend on the exact behaviour, choose an initialization method explicitly.
  • adapt_diag : Start with an identity mass matrix and then adapt a diagonal based on the variance of the tuning samples. All chains use the test value (usually the prior mean) as starting point.
  • jitter+adapt_diag : Same as adapt_diag, but add uniform jitter in [-1, 1] to the starting point in each chain.
  • advi+adapt_diag : Run ADVI and then adapt the resulting diagonal mass matrix based on the sample variance of the tuning samples.
  • advi+adapt_diag_grad : Run ADVI and then adapt the resulting diagonal mass matrix based on the variance of the gradients during tuning. This is experimental and might be removed in a future release.
  • advi : Run ADVI to estimate posterior mean and diagonal mass matrix.
  • advi_map: Initialize ADVI with MAP and use MAP as starting point.
  • map : Use the MAP as starting point. This is discouraged.
  • nuts : Run NUTS and estimate posterior mean and mass matrix from the trace.

n_init : int

Number of iterations of the initializer. Only works for ‘nuts’ and ‘ADVI’: for ‘ADVI’, the number of iterations; for ‘nuts’, the number of draws.

start : dict, or array of dict

Starting point in parameter space (or partial point). Defaults to trace.point(-1) if a trace is provided and model.test_point if not (defaults to empty dict). Initialization methods for NUTS (see the init keyword) can overwrite the default. For ‘SMC’ it should be a list of dicts with length chains.

trace : backend, list, or MultiTrace

This should be a backend instance, a list of variables to track, or a MultiTrace object with past values. If a MultiTrace object is given, it must contain samples for the chain number chain. If None or a list of variables, the NDArray backend is used. Passing either “text” or “sqlite” is taken as a shortcut to set up the corresponding backend (with “mcmc” used as the base name). Ignored when using ‘SMC’.

chain_idx : int

Chain number used to store sample in backend. If chains is greater than one, chain numbers will start here. Ignored when using ‘SMC’.

chains : int

The number of chains to sample. Running independent chains is important for some convergence statistics and can also reveal multiple modes in the posterior. If None, then set to either cores or 2, whichever is larger. For SMC the number of chains is the number of draws.

cores : int

The number of chains to run in parallel. If None, set to the number of CPUs in the system, but at most 4 (for ‘SMC’ defaults to 1). Keep in mind that some chains might themselves be multithreaded via openmp or BLAS. In those cases it might be faster to set this to 1.

tune : int

Number of iterations to tune, defaults to 500. Ignored when using ‘SMC’. Samplers adjust the step sizes, scalings or similar during tuning. Tuning samples will be drawn in addition to the number specified in the draws argument, and will be discarded unless discard_tuned_samples is set to False.

nuts_kwargs : dict

Options for the NUTS sampler. See the docstring of NUTS for a complete list of options. Common options are:

  • target_accept: float in [0, 1]. The step size is tuned such that we approximate this acceptance rate. Higher values like 0.9 or 0.95 often work better for problematic posteriors.
  • max_treedepth: The maximum depth of the trajectory tree.
  • step_scale: float, default 0.25. The initial guess for the step size, scaled down by 1/n**(1/4).

If you want to pass options to other step methods, please use step_kwargs.

step_kwargs : dict

Options for step methods. Keys are the lower case names of the step method, values are dicts of keyword arguments. You can find a full list of arguments in the docstring of the step methods. If you want to pass arguments only to nuts, you can use nuts_kwargs.
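
For example, a minimal sketch of tuning NUTS through these keywords (assuming a model built in a with context, as in the Examples section below):

>>> with model:
...     trace = pm.sample(1000, tune=1000,
...                       nuts_kwargs={'target_accept': 0.9, 'max_treedepth': 12})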

progressbar : bool

Whether or not to display a progress bar in the command line. The bar shows the percentage of completion, the sampling speed in samples per second (SPS), and the estimated remaining time until completion (“expected time of arrival”; ETA).

model : Model (optional if in with context)

random_seed : int or list of ints

A list is accepted if cores is greater than one.

live_plot : bool

Flag for live plotting the trace while sampling. Ignored when using ‘SMC’.

live_plot_kwargs : dict

Options for traceplot. Example: live_plot_kwargs={'varnames': ['x']}. Ignored when using ‘SMC’.

discard_tuned_samples : bool

Whether to discard posterior samples of the tune interval. Ignored when using ‘SMC’

compute_convergence_checks : bool, default=True

Whether to compute sampler statistics like Gelman-Rubin and effective_n. Ignored when using ‘SMC’.

use_mmap : bool, default=False

Whether to use joblib’s memory mapping to share numpy arrays when sampling across multiple cores. Ignored when using ‘SMC’

Returns:

trace : pymc3.backends.base.MultiTrace

A MultiTrace object that contains the samples.

Examples

>>> import pymc3 as pm
... n = 100
... h = 61
... alpha = 2
... beta = 2
>>> with pm.Model() as model: # context management
...     p = pm.Beta('p', alpha=alpha, beta=beta)
...     y = pm.Binomial('y', n=n, p=p, observed=h)
...     trace = pm.sample(2000, tune=1000, cores=4)
>>> pm.summary(trace)
       mean        sd  mc_error   hpd_2.5  hpd_97.5
p  0.604625  0.047086   0.00078  0.510498  0.694774
pymc3.sampling.iter_sample(draws, step, start=None, trace=None, chain=0, tune=None, model=None, random_seed=None)

Generator that returns a trace on each iteration using the given step method. Multiple step methods are supported via compound step methods.

Parameters:

draws : int

The number of samples to draw

step : function

Step function

start : dict

Starting point in parameter space (or partial point). Defaults to trace.point(-1) if a trace is provided and model.test_point if not (defaults to empty dict).

trace : backend, list, or MultiTrace

This should be a backend instance, a list of variables to track, or a MultiTrace object with past values. If a MultiTrace object is given, it must contain samples for the chain number chain. If None or a list of variables, the NDArray backend is used.

chain : int

Chain number used to store sample in backend. If cores is greater than one, chain numbers will start here.

tune : int

Number of iterations to tune, if applicable (defaults to None)

model : Model (optional if in with context)

random_seed : int or list of ints

A list is accepted if cores is greater than one.

Examples

for trace in iter_sample(500, step):
    ...
pymc3.sampling.sample_posterior_predictive(trace, samples=None, model=None, vars=None, size=None, random_seed=None, progressbar=True)

Generate posterior predictive samples from a model given a trace.

Parameters:

trace : backend, list, or MultiTrace

Trace generated from MCMC sampling, or a list of dicts (e.g. from find_MAP() or points).

samples : int

Number of posterior predictive samples to generate. Defaults to the length of trace

model : Model (optional if in with context)

Model used to generate trace

vars : iterable

Variables for which to compute the posterior predictive samples. Defaults to model.observed_RVs.

size : int

The number of random draws from the distribution specified by the parameters in each sample of the trace.

random_seed : int

Seed for the random number generator.

progressbar : bool

Whether or not to display a progress bar in the command line. The bar shows the percentage of completion, the sampling speed in samples per second (SPS), and the estimated remaining time until completion (“expected time of arrival”; ETA).

Returns:

samples : dict

Dictionary with the variables as keys. The values correspond to the posterior predictive samples.
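
A minimal usage sketch, assuming the model and trace from the pm.sample example above:

>>> with model:
...     ppc = pm.sample_posterior_predictive(trace, samples=500)
>>> y_rep = ppc['y']  # posterior predictive draws for the observed variable 'y'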

pymc3.sampling.sample_posterior_predictive_w(traces, samples=None, models=None, weights=None, random_seed=None, progressbar=True)

Generate weighted posterior predictive samples from a list of models and a list of traces according to a set of weights.

Parameters:

traces : list or list of lists

List of traces generated from MCMC sampling, or a list of list containing dicts from find_MAP() or points. The number of traces should be equal to the number of weights.

samples : int

Number of posterior predictive samples to generate. Defaults to the length of the shortest trace in traces.

models : list

List of models used to generate the list of traces. The number of models should be equal to the number of weights, and the number of observed RVs should be the same for all models. By default a single model will be inferred from the with context; in this case results will only be meaningful if all models share the same distributions for the observed RVs.

weights: array-like

Individual weights for each trace. Defaults to equal weights for each model.

random_seed : int

Seed for the random number generator.

progressbar : bool

Whether or not to display a progress bar in the command line. The bar shows the percentage of completion, the sampling speed in samples per second (SPS), and the estimated remaining time until completion (“expected time of arrival”; ETA).

Returns:

samples : dict

Dictionary with the variables as keys. The values correspond to the posterior predictive samples from the weighted models.
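
A sketch of posterior predictive model averaging; model_1, model_2, trace_1 and trace_2 are hypothetical models and traces assumed to exist:

>>> ppc_w = pm.sample_posterior_predictive_w(traces=[trace_1, trace_2],
...                                          models=[model_1, model_2],
...                                          weights=[0.3, 0.7])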

pymc3.sampling.init_nuts(init='auto', chains=1, n_init=500000, model=None, random_seed=None, progressbar=True, **kwargs)

Set up the mass matrix initialization for NUTS.

NUTS convergence and sampling speed is extremely dependent on the choice of mass/scaling matrix. This function implements different methods for choosing or adapting the mass matrix.

Parameters:

init : str

Initialization method to use.

  • auto : Choose a default initialization method automatically. Currently, this is ‘jitter+adapt_diag’, but this can change in the future. If you depend on the exact behaviour, choose an initialization method explicitly.
  • adapt_diag : Start with an identity mass matrix and then adapt a diagonal based on the variance of the tuning samples. All chains use the test value (usually the prior mean) as starting point.
  • jitter+adapt_diag : Same as adapt_diag, but use uniform jitter in [-1, 1] as starting point in each chain.
  • advi+adapt_diag : Run ADVI and then adapt the resulting diagonal mass matrix based on the sample variance of the tuning samples.
  • advi+adapt_diag_grad : Run ADVI and then adapt the resulting diagonal mass matrix based on the variance of the gradients during tuning. This is experimental and might be removed in a future release.
  • advi : Run ADVI to estimate posterior mean and diagonal mass matrix.
  • advi_map: Initialize ADVI with MAP and use MAP as starting point.
  • map : Use the MAP as starting point. This is discouraged.
  • nuts : Run NUTS and estimate posterior mean and mass matrix from the trace.

chains : int

Number of jobs to start.

n_init : int

Number of iterations of the initializer: for ‘ADVI’, the number of iterations; for ‘nuts’, the number of draws.

model : Model (optional if in with context)

progressbar : bool

Whether or not to display a progressbar for advi sampling.

**kwargs : keyword arguments

Extra keyword arguments are forwarded to pymc3.NUTS.

Returns:

start : pymc3.model.Point

Starting point for sampler

nuts_sampler : pymc3.step_methods.NUTS

Instantiated and initialized NUTS sampler object
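
This is normally called internally by pm.sample, but a rough sketch of calling it directly (with an assumed model context) could look like this:

>>> with model:
...     start, step = pm.sampling.init_nuts(init='adapt_diag', chains=2, n_init=50000)
...     trace = pm.sample(1000, step=step, start=start, chains=2)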

pymc3.sampling.sample_prior_predictive(samples=500, model=None, vars=None, random_seed=None)

Generate samples from the prior predictive distribution.

Parameters:

samples : int

Number of samples from the prior predictive to generate. Defaults to 500.

model : Model (optional if in with context)

vars : iterable

A list of names of variables for which to compute the prior predictive samples. Defaults to model.named_vars.

random_seed : int

Seed for the random number generator.

Returns:

dict

Dictionary with variable names as keys. The values are numpy arrays of prior samples.
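
A minimal sketch, assuming a model defined in a with context that contains a variable named 'p':

>>> with model:
...     prior = pm.sample_prior_predictive(samples=500)
>>> p_prior = prior['p']  # numpy array of prior draws for 'p'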

pymc3.sampling.sample_ppc(*args, **kwargs)

This method is deprecated. Please use sample_posterior_predictive()

pymc3.sampling.sample_ppc_w(*args, **kwargs)

This method is deprecated. Please use sample_posterior_predictive_w()

Step-methods

NUTS

class pymc3.step_methods.hmc.nuts.NUTS(vars=None, max_treedepth=10, early_max_treedepth=8, **kwargs)

A sampler for continuous variables based on Hamiltonian mechanics.

NUTS automatically tunes the step size and the number of steps per sample. A detailed description can be found at [1], “Algorithm 6: Efficient No-U-Turn Sampler with Dual Averaging”.

NUTS provides a number of statistics that can be accessed with trace.get_sampler_stats:

  • mean_tree_accept: The mean acceptance probability for the tree that generated this sample. The mean of these values across all samples but the burn-in should be approximately target_accept (the default for this is 0.8).
  • diverging: Whether the trajectory for this sample diverged. If there are any divergences after burnin, this indicates that the results might not be reliable. Reparametrization can often help, but you can also try to increase target_accept to something like 0.9 or 0.95.
  • energy: The energy at the point in phase-space where the sample was accepted. This can be used to identify posteriors with problematically long tails. See below for an example.
  • energy_change: The difference in energy between the start and the end of the trajectory. For a perfect integrator this would always be zero.
  • max_energy_change: The maximum difference in energy along the whole trajectory.
  • depth: The depth of the tree that was used to generate this sample
  • tree_size: The number of leaves of the sampling tree when the sample was accepted. This is usually a bit less than 2 ** depth. If the tree size is large, the sampler is using a lot of leapfrog steps to find the next sample. This can for example happen if there are strong correlations in the posterior, if the posterior has long tails, if there are regions of high curvature (“funnels”), or if the variance estimates in the mass matrix are inaccurate. Reparametrization of the model or estimating the posterior variances from past samples might help.
  • tune: This is True, if step size adaptation was turned on when this sample was generated.
  • step_size: The step size used for this sample.
  • step_size_bar: The current best known step-size. After the tuning samples, the step size is set to this value. This should converge during tuning.
  • model_logp: The model log-likelihood for this sample.
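
For instance, the statistics listed above can be read off a trace like this (an illustrative sketch using a trace produced by pm.sample with NUTS):

>>> accept = trace.get_sampler_stats('mean_tree_accept')  # mean acceptance per sample
>>> divergent = trace.get_sampler_stats('diverging')      # divergence flag per sample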

References

[1] Hoffman, Matthew D., & Gelman, Andrew. (2011). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.

Set up the No-U-Turn sampler.

Parameters:

vars : list of Theano variables, default all continuous vars

Emax : float, default 1000

Maximum energy change allowed during leapfrog steps. Larger deviations will abort the integration.

target_accept : float, default .8

Adapt the step size such that the average acceptance probability across the trajectories is close to target_accept. Higher values for target_accept lead to smaller step sizes. Setting this to higher values like 0.9 or 0.99 can help with sampling from difficult posteriors. Valid values are between 0 and 1 (exclusive).

step_scale : float, default 0.25

Size of steps to take, automatically scaled down by 1/n**(1/4). If step size adaptation is switched off, the resulting step size is used. If adaptation is enabled, it is used as initial guess.

gamma : float, default .05

k : float, default .75

Parameter for dual averaging for step size adaptation. Values between 0.5 and 1 (exclusive) are admissible. Higher values correspond to slower adaptation.

t0 : int, default 10

Parameter for dual averaging. Higher values slow initial adaptation.

adapt_step_size : bool, default=True

Whether step size adaptation should be enabled. If this is disabled, k, t0, gamma and target_accept are ignored.

max_treedepth : int, default=10

The maximum tree depth. Trajectories are stopped when this depth is reached.

early_max_treedepth : int, default=8

The maximum tree depth during the first 200 tuning samples.

integrator : str, default “leapfrog”

The integrator to use for the trajectories. One of “leapfrog”, “two-stage” or “three-stage”. The latter two can increase sampling speed for some high dimensional problems.

scaling : array_like, ndim = {1,2}

The inverse mass, or precision matrix. One dimensional arrays are interpreted as diagonal matrices. If is_cov is set to True, this will be interpreted as the mass or covariance matrix.

is_cov : bool, default=False

Treat the scaling as mass or covariance matrix.

potential : Potential, optional

An object that represents the Hamiltonian with methods velocity, energy, and random methods. It can be specified instead of the scaling matrix.

model : pymc3.Model

The model

kwargs: passed to BaseHMC

Notes

The step size adaptation stops when self.tune is set to False. This is usually achieved by setting the tune parameter of pm.sample to the desired number of tuning steps.

static competence(var, has_grad)

Check how appropriate this class is for sampling a random variable.
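
A minimal sketch of assigning NUTS explicitly instead of relying on auto-assignment (the model is assumed to be defined in a with context):

>>> with model:
...     step = pm.NUTS(target_accept=0.9)
...     trace = pm.sample(1000, tune=1000, step=step)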

Metropolis

class pymc3.step_methods.metropolis.Metropolis(vars=None, S=None, proposal_dist=None, scaling=1.0, tune=True, tune_interval=100, model=None, mode=None, **kwargs)

Metropolis-Hastings sampling step

Parameters:

vars : list

List of variables for sampler

S : standard deviation or covariance matrix

Some measure of variance to parameterize proposal distribution

proposal_dist : function

Function that returns zero-mean deviates when parameterized with S (and n). Defaults to normal.

scaling : scalar or array

Initial scale factor for proposal. Defaults to 1.

tune : bool

Flag for tuning. Defaults to True.

tune_interval : int

The frequency of tuning. Defaults to 100 iterations.

model : PyMC Model

Optional model for sampling step. Defaults to None (taken from context).

mode : string or Mode instance.

compilation mode passed to Theano functions
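
A minimal sketch of using Metropolis for all variables of an assumed model:

>>> with model:
...     step = pm.Metropolis()
...     trace = pm.sample(5000, step=step)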

class pymc3.step_methods.metropolis.DEMetropolis(vars=None, S=None, proposal_dist=None, lamb=None, scaling=0.001, tune=True, tune_interval=100, model=None, mode=None, **kwargs)

Differential Evolution Metropolis sampling step.

Parameters:

lamb : float

Lambda parameter of the DE proposal mechanism. Defaults to 2.38 / sqrt(2 * ndim)

vars : list

List of variables for sampler

S : standard deviation or covariance matrix

Some measure of variance to parameterize proposal distribution

proposal_dist : function

Function that returns zero-mean deviates when parameterized with S (and n). Defaults to Uniform(-S,+S).

scaling : scalar or array

Initial scale factor for epsilon. Defaults to 0.001

tune : bool

Flag for tuning the scaling. Defaults to True.

tune_interval : int

The frequency of tuning. Defaults to 100 iterations.

model : PyMC Model

Optional model for sampling step. Defaults to None (taken from context).

mode : string or Mode instance.

compilation mode passed to Theano functions

References

[Braak2006] Cajo C.F. ter Braak (2006). A Markov Chain Monte Carlo version of the genetic algorithm Differential Evolution: easy Bayesian computing for real parameter spaces. Statistics and Computing.

class pymc3.step_methods.metropolis.BinaryMetropolis(vars, scaling=1.0, tune=True, tune_interval=100, model=None)

Metropolis-Hastings optimized for binary variables

Parameters:

vars : list

List of variables for sampler

scaling : scalar or array

Initial scale factor for proposal. Defaults to 1.

tune : bool

Flag for tuning. Defaults to True.

tune_interval : int

The frequency of tuning. Defaults to 100 iterations.

model : PyMC Model

Optional model for sampling step. Defaults to None (taken from context).

static competence(var)

BinaryMetropolis is only suitable for binary (bool) and Categorical variables with k=1.

class pymc3.step_methods.metropolis.BinaryGibbsMetropolis(vars, order='random', transit_p=0.8, model=None)

A Metropolis-within-Gibbs step method optimized for binary variables

Parameters:

vars : list

List of variables for sampler

order : list or ‘random’

List of integers indicating the Gibbs update order e.g., [0, 2, 1, …]. Default is random

transit_p : float

The diagonal of the transition kernel. A value > .5 gives anticorrelated proposals, which results in more efficient antithetical sampling.

model : PyMC Model

Optional model for sampling step. Defaults to None (taken from context).

static competence(var)

BinaryGibbsMetropolis is only suitable for Bernoulli and Categorical variables with k=2.

class pymc3.step_methods.metropolis.CategoricalGibbsMetropolis(vars, proposal='uniform', order='random', model=None)

A Metropolis-within-Gibbs step method optimized for categorical variables. This step method works for Bernoulli variables as well, but it is not optimized for them the way BinaryGibbsMetropolis is. The step method supports two types of proposals: a uniform proposal and a proportional proposal, introduced by Liu in his 1996 technical report “Metropolized Gibbs Sampler: An Improvement”.

static competence(var)

CategoricalGibbsMetropolis is only suitable for Bernoulli and Categorical variables.

Slice

class pymc3.step_methods.slicer.Slice(vars=None, w=1.0, tune=True, model=None, iter_limit=inf, **kwargs)

Univariate slice sampler step method

Parameters:

vars : list

List of variables for sampler.

w : float

Initial width of slice (Defaults to 1).

tune : bool

Flag for tuning (Defaults to True).

model : PyMC Model

Optional model for sampling step. Defaults to None (taken from context).
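
A minimal sketch of slice sampling an assumed model:

>>> with model:
...     step = pm.Slice()
...     trace = pm.sample(2000, step=step)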

Hamiltonian Monte Carlo

class pymc3.step_methods.hmc.hmc.HamiltonianMC(vars=None, path_length=2.0, adapt_step_size=True, gamma=0.05, k=0.75, t0=10, target_accept=0.8, **kwargs)

A sampler for continuous variables based on Hamiltonian mechanics.

See NUTS sampler for automatically tuned stopping time and step size scaling.

Set up the Hamiltonian Monte Carlo sampler.

Parameters:

vars : list of theano variables

path_length : float, default=2

total length to travel

step_rand : function float -> float, default=unif

A function which takes the step size and returns a new one used to randomize the step size at each iteration.

step_scale : float, default=0.25

Initial size of steps to take, automatically scaled down by 1/n**(1/4).

scaling : array_like, ndim = {1,2}

The inverse mass, or precision matrix. One dimensional arrays are interpreted as diagonal matrices. If is_cov is set to True, this will be interpreted as the mass or covariance matrix.

is_cov : bool, default=False

Treat the scaling as mass or covariance matrix.

potential : Potential, optional

An object that represents the Hamiltonian with methods velocity, energy, and random methods. It can be specified instead of the scaling matrix.

target_accept : float, default .8

Adapt the step size such that the average acceptance probability across the trajectories is close to target_accept. Higher values for target_accept lead to smaller step sizes. Setting this to higher values like 0.9 or 0.99 can help with sampling from difficult posteriors. Valid values are between 0 and 1 (exclusive).

gamma : float, default .05

k : float, default .75

Parameter for dual averaging for step size adaptation. Values between 0.5 and 1 (exclusive) are admissible. Higher values correspond to slower adaptation.

t0 : int, default 10

Parameter for dual averaging. Higher values slow initial adaptation.

adapt_step_size : bool, default=True

Whether step size adaptation should be enabled. If this is disabled, k, t0, gamma and target_accept are ignored.

model : pymc3.Model

The model

**kwargs : passed to BaseHMC

static competence(var, has_grad)

Check how appropriate this class is for sampling a random variable.

Variational

OPVI

Variational inference is a great approach for doing really complex, often intractable Bayesian inference in approximate form. Common methods (e.g. ADVI) lack the needed complexity, so the approximate posterior does not reveal the true nature of the underlying problem, and in some applications this can yield unreliable decisions.

The OPVI framework was recently presented at NIPS 2017. It generalizes variational inference so that the problem is built from blocks. The first and essential block is the Model itself. The second is the Approximation; in some cases \(\log Q(D)\) is not really needed. Whether it is depends on the third and fourth parts of that black box, the Operator and the Test Function respectively.

The Operator is like the approach we use: it constructs the loss from the given Model, Approximation and Test Function. The Test Function is not needed if we minimize the KL divergence from Q to the posterior, but as a drawback we then need to compute \(\log Q(D)\). Sometimes the approximation family is intractable and \(\log Q(D)\) is not available; here comes the LS (Langevin-Stein) Operator with a set of test functions.

The Test Function has a less intuitive meaning. It is usually used with the LS operator and represents all we want from our approximate distribution. For any given vector-valued function of \(z\) the LS operator yields a zero-mean function under the posterior, so \(\log Q(D)\) is no longer needed. That opens the door to rich approximation families such as neural networks.

References

class pymc3.variational.opvi.ObjectiveFunction(op, tf)

Helper class for construction loss and updates for variational inference

Parameters:

op : Operator

OPVI Functional operator

tf : TestFunction

OPVI TestFunction

score_function(**kwargs)

Compile a scoring function that takes no inputs and returns the loss.

Parameters:

sc_n_mc : int

number of scoring MC samples

more_replacements:

Apply custom replacements before compiling a function

fn_kwargs : dict

arbitrary kwargs passed to theano.function

Returns:

theano.function

step_function(**kwargs)

Step function that should be called on each optimization step.

Generally it solves the following problem:

\[\mathbf{\lambda^{*}} = \inf_{\lambda} \sup_{\theta} t(\mathbb{E}_{\lambda}[(O^{p,q}f_{\theta})(z)])\]
Parameters:

obj_n_mc : int

Number of monte carlo samples used for approximation of objective gradients

tf_n_mc : int

Number of monte carlo samples used for approximation of test function gradients

obj_optimizer : function (grads, params) -> updates

Optimizer that is used for objective params

test_optimizer : function (grads, params) -> updates

Optimizer that is used for test function params

more_obj_params : list

Add custom params for objective optimizer

more_tf_params : list

Add custom params for test function optimizer

more_updates : dict

Add custom updates to resulting updates

total_grad_norm_constraint : float

Bounds gradient norm, prevents exploding gradient problem

score : bool

Whether to calculate the loss on each step. Defaults to False for speed.

fn_kwargs : dict

Add kwargs to theano.function (e.g. {‘profile’: True})

more_replacements : dict

Apply custom replacements before calculating gradients

Returns:

theano.function

updates(obj_n_mc=None, tf_n_mc=None, obj_optimizer=<function adagrad_window>, test_optimizer=<function adagrad_window>, more_obj_params=None, more_tf_params=None, more_updates=None, more_replacements=None, total_grad_norm_constraint=None)

Calculate gradients for the objective function and the test function, then construct updates for the optimization step.

Parameters:

obj_n_mc : int

Number of monte carlo samples used for approximation of objective gradients

tf_n_mc : int

Number of monte carlo samples used for approximation of test function gradients

obj_optimizer : function (loss, params) -> updates

Optimizer that is used for objective params

test_optimizer : function (loss, params) -> updates

Optimizer that is used for test function params

more_obj_params : list

Add custom params for objective optimizer

more_tf_params : list

Add custom params for test function optimizer

more_updates : dict

Add custom updates to resulting updates

more_replacements : dict

Apply custom replacements before calculating gradients

total_grad_norm_constraint : float

Bounds gradient norm, prevents exploding gradient problem

Returns:

ObjectiveUpdates

class pymc3.variational.opvi.Operator(approx)

Base class for Operator

Parameters:

approx : Approximation

an approximation instance

Notes

To implement a custom operator you need to define the Operator.apply() method.

apply(f)

Operator itself

\[(O^{p,q}f_{\theta})(z)\]
Parameters:

f : TestFunction or None

function that takes z = self.input and returns same dimensional output

Returns:

TensorVariable

symbolically applied operator

objective_class

alias of ObjectiveFunction

class pymc3.variational.opvi.Group(group, vfam=None, params=None, random_seed=None, model=None, local=False, rowwise=False, options=None, **kwargs)

Base class for grouping variables in VI

Grouped Approximation is used for modelling mutual dependencies for a specified group of variables. Base for local and global group.

Parameters:

group : list

List of PyMC3 variables, or None to indicate that the group takes all remaining variables

vfam : str

String that marks the corresponding variational family for the group. Cannot be passed together with params

params : dict

Dict with variational family parameters; a full description can be found below. Cannot be passed together with vfam

random_seed : int

Random seed for underlying random generator

model :

PyMC3 Model

local : bool

Indicates whether this group is local. Cannot be passed without params. Such a group should have only one variable

rowwise : bool

Indicates whether this group is independently parametrized over the first dimension. Such a group should have only one variable

options : dict

Special options for the group

kwargs : Other kwargs for the group

See also

Approximation

Notes

Group instance/class has some important constants:

  • supports_batched Determines whether such a variational family can be used for AEVB or rowwise approximation.

    An AEVB approximation is one that depends on the input data in some way; it can be treated as a conditional distribution. You can read more about it in the corresponding paper mentioned in the references.

    Rowwise mode is a special-case approximation that treats every ‘row’ of a tensor as independent of the others. Some distributions can’t do that by definition, e.g. Empirical, which consists of particles only.

  • has_logq Tells whether the distribution’s density is defined explicitly

These constants help provide the correct inference method for a given parametrization

References

Examples

Basic Initialization

Group is a factory class: you do not need to call every approximation group explicitly. By passing the correct vfam (Variational FAMily) argument you tell which parametrization is desired for the group. This helps keep the code from being overloaded with lots of classes.

>>> group = Group([latent1, latent2], vfam='mean_field')

The other way to select an approximation is to provide a params dictionary with predefined, well-shaped parameters. The keys of the dict serve as an identifier for the variational family and help autoselect the correct group class. To identify which approximation to use, the params dict should have the full set of needed parameters. As there are two ways to instantiate a Group, passing both vfam and params is prohibited. Partial parametrization is prohibited by design to avoid corner cases and possible problems.

>>> group = Group([latent3], params=dict(mu=my_mu, rho=my_rho))

It is important to note that if you pass custom params they will not be autocollected by the optimizer; you will have to provide them with the more_obj_params keyword.

Supported dict keys:

  • {‘mu’, ‘rho’}: MeanFieldGroup

  • {‘mu’, ‘L_tril’}: FullRankGroup

  • {‘histogram’}: EmpiricalGroup

  • {0, 1, 2, 3, …, k-1}: NormalizingFlowGroup of depth k

    NormalizingFlows have other parameters than ordinary groups and should be passed as nested dicts with the following keys:

    • {‘u’, ‘w’, ‘b’}: PlanarFlow
    • {‘a’, ‘b’, ‘z_ref’}: RadialFlow
    • {‘loc’}: LocFlow
    • {‘rho’}: ScaleFlow
    • {‘v’}: HouseholderFlow

    Note that all integer keys should be present in the dictionary. An example of NormalizingFlow initialization can be found below.

Using AEVB

Autoencoding variational Bayes is a powerful tool to get a conditional \(q(\lambda|X)\) distribution over latent variables. It is well supported by PyMC3: all you need is to provide a dictionary with well-shaped variational parameters, and the correct approximation will be autoselected as mentioned in the section above. However, there are some implementation restrictions in AEVB: the autoencoded variable must have its first dimension as the batch dimension, and the other dimensions must stay fixed. With these assumptions it is possible to generalize all variational approximation families as batched approximations that have flexible parameters and a leading axis.

Only single variable local group is supported. Params are required.

>>> # for mean field
>>> group = Group([latent3], params=dict(mu=my_mu, rho=my_rho), local=True)
>>> # or for full rank
>>> group = Group([latent3], params=dict(mu=my_mu, L_tril=my_L_tril), local=True)
  • An Approximation class is selected automatically based on the keys in dict.
  • my_mu and my_rho are usually estimated with neural network or function approximator.

Using Row-Wise Group

Batched groups have independent row-wise approximations, so using a batched mean field has no effect. It is more interesting if you want each row of a matrix to be parametrized independently with a normalizing flow or a full-rank Gaussian.

To tell Group that the group is row-wise, set the rowwise kwarg to True. Only a single-variable group is allowed due to implementation details.

>>> group = Group([latent3], vfam='fr', rowwise=True) # 'fr' is alias for 'full_rank'

The resulting approximation for this variable will have the following structure

\[latent3_{i, \dots} \sim \mathcal{N}(\mu_i, \Sigma_i) \forall i\]

Note: Using rowwise with a user-parametrized approximation is OK, but the shape should be checked beforehand; it is impossible for PyMC3 to infer it.

Normalizing Flow Group

If you use the simple initialization pattern via vfam you will not notice any changes. Passing a flow formula to vfam gives the correct flow parametrization for the group.

>>> group = Group([latent3], vfam='scale-hh*5-radial*4-loc')

Note: Consider passing location flow as the last one and scale as the first one for stable inference.

Rowwise normalizing flow is supported as well

>>> group = Group([latent3], vfam='scale-hh*2-radial-loc', rowwise=True)

Custom parameters for a normalizing flow can be real trouble the first time, as they have quite a different format from the rest of the variational families.

>>> # int is used as key, it also tells the flow position
... flow_params = {
...     # `rho` parametrizes scale flow, softplus is used to map (-inf; inf) -> (0, inf)
...     0: dict(rho=my_scale),
...     1: dict(v=my_v1),  # Householder Flow, `v` is parameter name from the original paper
...     2: dict(v=my_v2),  # do not miss any number in dict, or else error is raised
...     3: dict(a=my_a, b=my_b, z_ref=my_z_ref),  # Radial flow
...     4: dict(loc=my_loc)  # Location Flow
... }
... group = Group([latent3], params=flow_params)
... # local=True can be added in case you do AEVB inference
... group = Group([latent3], params=flow_params, local=True)

Delayed Initialization

When you have a lot of latent variables it is impractical to specify them all manually. To make life much simpler, you can pass None instead of a list of variables. In that case, shared parameters are not created until you pass all collected groups to an Approximation object, which collects all the groups together and checks that every group is correctly initialized. For groups whose group is None, it will collect all the remaining variables not covered by other groups and perform delayed initialization.

>>> group_1 = Group([latent1], vfam='fr')  # latent1 has full rank approximation
>>> group_other = Group(None, vfam='mf')  # other variables have mean field Q
>>> approx = Approximation([group_1, group_other])

Summing Up

When you have created all the groups, pass them to Approximation. It does not accept any parameter other than groups.

>>> approx = Approximation(my_groups)

logq

Dev - Monte Carlo estimate for group logQ

logq_norm

Dev - Monte Carlo estimate for group logQ normalized

make_size_and_deterministic_replacements(s, d, more_replacements=None)

Dev - creates correct replacements for initial depending on sample size and deterministic flag

Parameters:

s : scalar

sample size

d : bool or scalar

whether sampling is done deterministically

more_replacements : dict

replacements for shape and initial

Returns:

dict with replacements for initial

set_size_and_deterministic(**kwargs)

Dev - after node is sampled via symbolic_sample_over_posterior() or symbolic_single_sample() new random generator can be allocated and applied to node

Parameters:

node : Variable

Theano node with symbolically applied VI replacements

s : scalar

desired number of samples

d : bool or int

whether sampling is done deterministically

more_replacements : dict

more replacements to apply

Returns:

Variable with applied replacements, ready to use

symbolic_logq

Dev - correctly scaled self.symbolic_logq_not_scaled

symbolic_logq_not_scaled

Dev - symbolically computed logq for self.symbolic_random; computations can be more efficient since everything, including self.symbolic_random, is known beforehand

symbolic_normalizing_constant

Dev - normalizing constant for self.logq, scales it to minibatch_size instead of total_size

symbolic_random

Dev - abstract node that takes self.symbolic_initial and creates approximate posterior that is parametrized with self.params_dict.

Implementation should take in account self.batched. If self.batched is True, then self.symbolic_initial is 3d tensor, else 2d

Returns:

tensor

symbolic_random2d

Dev - self.symbolic_random flattened to matrix

symbolic_sample_over_posterior(node)

Dev - performs sampling of node applying independent samples from posterior each time. Note that it is done symbolically and this node needs set_size_and_deterministic() call

symbolic_single_sample(node)

Dev - performs sampling of node applying single sample from posterior. Note that it is done symbolically and this node needs set_size_and_deterministic() call with size=1

to_flat_input(node)

Dev - replace vars with flattened view stored in self.inputs

class pymc3.variational.opvi.Approximation(groups, model=None)

Wrapper for grouped approximations

Wraps list of groups, creates an Approximation instance that collects sampled variables from all the groups, also collects logQ needed for explicit Variational Inference.

Parameters:

groups : list[Group]

List of Group instances. They should have all model variables

model : Model

See also

Group

Notes

Some shortcuts for single group approximations are available:

  • MeanField
  • FullRank
  • NormalizingFlow
  • Empirical

Single group accepts local_rv keyword with dict mapping PyMC3 variables to their local Group parameters dict

datalogp

Dev - computes \(E_{q}(data term)\) from model via theano.scan that can be optimized later

datalogp_norm

Dev - normalized \(E_{q}(data term)\)

get_optimization_replacements(s, d)

Dev - optimizations for logP. If the sample size is static and equal to 1, the theano.scan MC estimate is replaced with a single sample without a call to theano.scan.

logp

Dev - computes \(E_{q}(logP)\) from model via theano.scan that can be optimized later

logp_norm

Dev - normalized \(E_{q}(logP)\)

logq

Dev - collects logQ for all groups

logq_norm

Dev - collects logQ for all groups and normalizes it

make_size_and_deterministic_replacements(s, d, more_replacements=None)

Dev - creates correct replacements for initial depending on sample size and deterministic flag

Parameters:

s : scalar

sample size

d : bool

whether sampling is done deterministically

more_replacements : dict

replacements for shape and initial

Returns:

dict with replacements for initial

replacements

Dev - all replacements from groups to replace PyMC random variables with approximation

rslice(name)

Dev - vectorized sampling for named random variable without call to theano.scan. This node still needs set_size_and_deterministic() to be evaluated

sample(draws=500, include_transformed=True)

Draw samples from variational posterior.

Parameters:

draws : int

Number of random samples.

include_transformed : bool

If True, transformed variables are also sampled. Default is True.

Returns:

trace : pymc3.backends.base.MultiTrace

Samples drawn from variational posterior.
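
For example, a sketch of fitting an approximation (here via the pm.fit shortcut on an assumed model) and turning it into a trace:

>>> with model:
...     approx = pm.fit(n=10000, method='advi')
>>> trace = approx.sample(draws=1000)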

sample_node(**kwargs)

Samples given node or nodes over shared posterior

Parameters:

node : Theano Variables (or Theano expressions)

size : None or scalar

number of samples

more_replacements : dict

add custom replacements to graph, e.g. change input source

deterministic : bool

whether to use zeros as the initial distribution; if True, a zero initial point will produce constant latent variables

Returns:

sampled node(s) with replacements

scale_cost_to_minibatch

Dev - Property to control scaling cost to minibatch

set_size_and_deterministic(**kwargs)

Dev - after node is sampled via symbolic_sample_over_posterior() or symbolic_single_sample() new random generator can be allocated and applied to node

Parameters:

node : Variable

Theano node with symbolically applied VI replacements

s : scalar

desired number of samples

d : bool or int

whether sampling is done deterministically

more_replacements : dict

more replacements to apply

Returns:

Variable with applied replacements, ready to use

single_symbolic_datalogp

Dev - for single MC sample estimate of \(E_{q}(data term)\) theano.scan is not needed and code can be optimized

single_symbolic_logp

Dev - for single MC sample estimate of \(E_{q}(logP)\) theano.scan is not needed and code can be optimized

single_symbolic_varlogp

Dev - for single MC sample estimate of \(E_{q}(prior term)\) theano.scan is not needed and code can be optimized

sized_symbolic_datalogp

Dev - computes sampled data term from model via theano.scan

sized_symbolic_logp

Dev - computes sampled logP from model via theano.scan

sized_symbolic_varlogp

Dev - computes sampled prior term from model via theano.scan

symbolic_logq

Dev - collects symbolic_logq for all groups

symbolic_normalizing_constant

Dev - normalizing constant for self.logq, scales it to minibatch_size instead of total_size. Here the effect is controlled by self.scale_cost_to_minibatch

symbolic_sample_over_posterior(node)

Dev - performs sampling of node applying independent samples from posterior each time. Note that it is done symbolically and this node needs set_size_and_deterministic() call

symbolic_single_sample(node)

Dev - performs sampling of node applying single sample from posterior. Note that it is done symbolically and this node needs set_size_and_deterministic() call with size=1

to_flat_input(node)

Dev - replace vars with flattened view stored in self.inputs

varlogp

Dev - computes \(E_{q}(prior term)\) from model via theano.scan that can be optimized later

varlogp_norm

Dev - normalized \(E_{q}(prior term)\)

Inference

class pymc3.variational.inference.ADVI(*args, **kwargs)

Automatic Differentiation Variational Inference (ADVI)

This class implements the meanfield ADVI, where the variational posterior distribution is assumed to be spherical Gaussian without correlation of parameters and fit to the true posterior distribution. The means and standard deviations of the variational posterior are referred to as variational parameters.

For explanation, we classify random variables in probabilistic models into three types. Observed random variables \({\cal Y}=\{\mathbf{y}_{i}\}_{i=1}^{N}\) are \(N\) observations. Each \(\mathbf{y}_{i}\) can be a set of observed random variables, i.e., \(\mathbf{y}_{i}=\{\mathbf{y}_{i}^{k}\}_{k=1}^{V_{o}}\), where \(V_{o}\) is the number of types of observed random variables in the model.

The next ones are global random variables \(\Theta=\{\theta^{k}\}_{k=1}^{V_{g}}\), which are used to calculate the probabilities for all observed samples.

The last ones are local random variables \({\cal Z}=\{\mathbf{z}_{i}\}_{i=1}^{N}\), where \(\mathbf{z}_{i}=\{\mathbf{z}_{i}^{k}\}_{k=1}^{V_{l}}\). These RVs are used only in AEVB.

The goal of ADVI is to approximate the posterior distribution \(p(\Theta,{\cal Z}|{\cal Y})\) by variational posterior \(q(\Theta)\prod_{i=1}^{N}q(\mathbf{z}_{i})\). All of these terms are normal distributions (mean-field approximation).

\(q(\Theta)\) is parametrized with its means and standard deviations. These parameters are denoted as \(\gamma\). While \(\gamma\) is a constant, the parameters of \(q(\mathbf{z}_{i})\) are dependent on each observation. Therefore these parameters are denoted as \(\xi(\mathbf{y}_{i}; \nu)\), where \(\nu\) is the parameters of \(\xi(\cdot)\). For example, \(\xi(\cdot)\) can be a multilayer perceptron or convolutional neural network.

In addition to \(\xi(\cdot)\), we can also include deterministic mappings for the likelihood of observations. We denote the parameters of the deterministic mappings as \(\eta\). An example of such mappings is the deconvolutional neural network used in the convolutional VAE example in the PyMC3 notebook directory.

This function maximizes the evidence lower bound (ELBO) \({\cal L}(\gamma, \nu, \eta)\) defined as follows:

\[\begin{split}{\cal L}(\gamma,\nu,\eta) & = \mathbf{c}_{o}\mathbb{E}_{q(\Theta)}\left[ \sum_{i=1}^{N}\mathbb{E}_{q(\mathbf{z}_{i})}\left[ \log p(\mathbf{y}_{i}|\mathbf{z}_{i},\Theta,\eta) \right]\right] \\ & - \mathbf{c}_{g}KL\left[q(\Theta)||p(\Theta)\right] - \mathbf{c}_{l}\sum_{i=1}^{N} KL\left[q(\mathbf{z}_{i})||p(\mathbf{z}_{i})\right],\end{split}\]

where \(KL[q(v)||p(v)]\) is the Kullback-Leibler divergence

\[KL[q(v)||p(v)] = \int q(v)\log\frac{q(v)}{p(v)}dv,\]

\(\mathbf{c}_{o/g/l}\) are vectors for weighting each term of ELBO. More precisely, we can write each of the terms in ELBO as follows:

\[\begin{split}\mathbf{c}_{o}\log p(\mathbf{y}_{i}|\mathbf{z}_{i},\Theta,\eta) & = & \sum_{k=1}^{V_{o}}c_{o}^{k} \log p(\mathbf{y}_{i}^{k}| {\rm pa}(\mathbf{y}_{i}^{k},\Theta,\eta)) \\ \mathbf{c}_{g}KL\left[q(\Theta)||p(\Theta)\right] & = & \sum_{k=1}^{V_{g}}c_{g}^{k}KL\left[ q(\theta^{k})||p(\theta^{k}|{\rm pa(\theta^{k})})\right] \\ \mathbf{c}_{l}KL\left[q(\mathbf{z}_{i}||p(\mathbf{z}_{i})\right] & = & \sum_{k=1}^{V_{l}}c_{l}^{k}KL\left[ q(\mathbf{z}_{i}^{k})|| p(\mathbf{z}_{i}^{k}|{\rm pa}(\mathbf{z}_{i}^{k}))\right],\end{split}\]

where \({\rm pa}(v)\) denotes the set of parent variables of \(v\) in the directed acyclic graph of the model.

When using mini-batches, \(c_{o}^{k}\) and \(c_{l}^{k}\) should be set to \(N/M\), where \(M\) is the number of observations in each mini-batch. This is done with supplying total_size parameter to observed nodes (e.g. Normal('x', 0, 1, observed=data, total_size=10000)). In this case it is possible to automatically determine appropriate scaling for \(logp\) of observed nodes. Interesting to note that it is possible to have two independent observed variables with different total_size and iterate them independently during inference.

For working with ADVI, we need to give

  • The probabilistic model

    model with three types of RVs (observed_RVs, global_RVs and local_RVs).

  • (optional) Minibatches

    The tensors to which mini-batched samples are supplied are handled separately, either by using callbacks in the Inference.fit() method that change the storage of a shared theano variable, or by pymc3.generator(), which automatically iterates over minibatches and is defined beforehand.

  • (optional) Parameters of deterministic mappings

    They have to be passed along with other params to Inference.fit() method as more_obj_params argument.

For more information concerning the training stage, please refer to pymc3.variational.opvi.ObjectiveFunction.step_function().
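
A minimal mean-field ADVI sketch on mini-batched data, illustrating the total_size scaling described above (the data and variable names are made up):

>>> import numpy as np
>>> import pymc3 as pm
>>> data = np.random.randn(10000)
>>> X = pm.Minibatch(data, batch_size=100)
>>> with pm.Model():
...     mu = pm.Normal('mu', mu=0, sd=10)
...     obs = pm.Normal('obs', mu=mu, sd=1, observed=X, total_size=10000)
...     approx = pm.fit(n=10000, method='advi')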

Parameters:

local_rv : dict[var->tuple]

mapping {model_variable -> approx params} Local Vars are used for Autoencoding Variational Bayes See (AEVB; Kingma and Welling, 2014) for details

model : pymc3.Model

PyMC3 model for inference

random_seed : None or int

leave None to use package global RandomStream or other valid value to create instance specific one

start : Point

starting point for inference

References

  • Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2016). Automatic Differentiation Variational Inference. arXiv preprint arXiv:1603.00788.
  • Geoffrey Roeder, Yuhuai Wu, David Duvenaud, 2016 Sticking the Landing: A Simple Reduced-Variance Gradient for ADVI approximateinference.org/accepted/RoederEtAl2016.pdf
  • Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. stat, 1050, 1.
class pymc3.variational.inference.FullRankADVI(*args, **kwargs)

Full Rank Automatic Differentiation Variational Inference (ADVI)

Parameters:

local_rv : dict[var->tuple]

mapping {model_variable -> approx params} Local Vars are used for Autoencoding Variational Bayes See (AEVB; Kingma and Welling, 2014) for details

model : pymc3.Model

PyMC3 model for inference

random_seed : None or int

leave None to use package global RandomStream or other valid value to create instance specific one

start : Point

starting point for inference

References

  • Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2016). Automatic Differentiation Variational Inference. arXiv preprint arXiv:1603.00788.
  • Geoffrey Roeder, Yuhuai Wu, David Duvenaud, 2016 Sticking the Landing: A Simple Reduced-Variance Gradient for ADVI approximateinference.org/accepted/RoederEtAl2016.pdf
  • Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. stat, 1050, 1.
class pymc3.variational.inference.SVGD(n_particles=100, jitter=1, model=None, start=None, random_seed=None, estimator=<class 'pymc3.variational.operators.KSD'>, kernel=<pymc3.variational.test_functions.RBF object>, **kwargs)

Stein Variational Gradient Descent

This inference method is based on the Kernelized Stein Discrepancy; its main idea is to move initial noisy particles so that they fit the target distribution as well as possible.

Algorithm is outlined below

Input: A target distribution with density function \(p(x)\)
and a set of initial particles \(\{x^0_i\}^n_{i=1}\)

Output: A set of particles \(\{x^{*}_i\}^n_{i=1}\) that approximates the target distribution.

\[\begin{split}x_i^{l+1} &\leftarrow x_i^{l} + \epsilon_l \hat{\phi}^{*}(x_i^l) \\ \hat{\phi}^{*}(x) &= \frac{1}{n}\sum^{n}_{j=1}[k(x^l_j,x) \nabla_{x^l_j} logp(x^l_j)+ \nabla_{x^l_j} k(x^l_j,x)]\end{split}\]
Parameters:

n_particles : int

number of particles to use for approximation

jitter : float

noise sd for initial point

model : pymc3.Model

PyMC3 model for inference

kernel : callable

kernel function for KSD \(f(histogram) -> (k(x,.), \nabla_x k(x,.))\)

temperature : float

parameter responsible for exploration, higher temperature gives more broad posterior estimate

start : dict

initial point for inference

random_seed : None or int

leave None to use package global RandomStream or other valid value to create instance specific one

start : Point

starting point for inference

kwargs : other keyword arguments passed to estimator

References

  • Qiang Liu, Dilin Wang (2016) Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm arXiv:1608.04471
  • Yang Liu, Prajit Ramachandran, Qiang Liu, Jian Peng (2017) Stein Variational Policy Gradient arXiv:1704.02399
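
A sketch of running SVGD on an assumed model and sampling from the resulting particle approximation:

>>> with model:
...     approx = pm.SVGD(n_particles=200, jitter=1).fit(10000)
>>> trace = approx.sample(500)
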
class pymc3.variational.inference.ASVGD(approx=None, estimator=<class 'pymc3.variational.operators.KSD'>, kernel=<pymc3.variational.test_functions.RBF object>, **kwargs)

Amortized Stein Variational Gradient Descent

not suggested for use

This inference method is based on the Kernelized Stein Discrepancy; its main idea is to move initial noisy particles so that they fit the target distribution as well as possible.

Algorithm is outlined below

Input: Parametrized random generator \(R_{\theta}\)

Output: \(R_{\theta^{*}}\) that approximates the target distribution.

\[\begin{split}\Delta x_i &= \hat{\phi}^{*}(x_i) \\ \hat{\phi}^{*}(x) &= \frac{1}{n}\sum^{n}_{j=1}[k(x_j,x) \nabla_{x_j} logp(x_j)+ \nabla_{x_j} k(x_j,x)] \\ \Delta_{\theta} &= \frac{1}{n}\sum^{n}_{i=1}\Delta x_i\frac{\partial x_i}{\partial \theta}\end{split}\]
Parameters:

approx : Approximation

default is FullRank but can be any

kernel : callable

kernel function for KSD \(f(histogram) -> (k(x,.), \nabla_x k(x,.))\)

model : Model

kwargs : kwargs for gradient estimator

References

  • Dilin Wang, Yihao Feng, Qiang Liu (2016) Learning to Sample Using Stein Discrepancy http://bayesiandeeplearning.org/papers/BDL_21.pdf
  • Dilin Wang, Qiang Liu (2016) Learning to Draw Samples: With Application to Amortized MLE for Generative Adversarial Learning arXiv:1611.01722
  • Yang Liu, Prajit Ramachandran, Qiang Liu, Jian Peng (2017) Stein Variational Policy Gradient arXiv:1704.02399
fit(n=10000, score=None, callbacks=None, progressbar=True, obj_n_mc=500, **kwargs)

Perform Operator Variational Inference

Parameters:

n : int

number of iterations

score : bool

evaluate loss on each iteration or not

callbacks : list[function]

calls provided functions after each iteration step

progressbar : bool

whether to show progressbar or not

Returns:

Approximation

Other Parameters:
 

obj_n_mc : int

Number of monte carlo samples used for approximation of objective gradients

tf_n_mc : int

Number of monte carlo samples used for approximation of test function gradients

obj_optimizer : function (grads, params) -> updates

Optimizer that is used for objective params

test_optimizer : function (grads, params) -> updates

Optimizer that is used for test function params

more_obj_params : list

Add custom params for objective optimizer

more_tf_params : list

Add custom params for test function optimizer

more_updates : dict

Add custom updates to resulting updates

total_grad_norm_constraint : float

Bounds gradient norm, prevents exploding gradient problem

fn_kwargs : dict

Add kwargs to theano.function (e.g. {‘profile’: True})

more_replacements : dict

Apply custom replacements before calculating gradients

class pymc3.variational.inference.NFVI(*args, **kwargs)

Normalizing Flow based :class:`KLqp` inference

Normalizing flow is a series of invertible transformations on initial distribution.

\[z_K = f_K \circ \dots \circ f_2 \circ f_1(z_0)\]

In that case we can compute tractable density for the flow.

\[\ln q_K(z_K) = \ln q_0(z_0) - \sum_{k=1}^{K}\ln \left|\frac{\partial f_k}{\partial z_{k-1}}\right|\]

Every \(f_k\) here is a parametric function with a defined determinant. We can choose every step here. For example, here is a simple flow, an affine transform:

\[z = loc(scale(z_0)) = \mu + \sigma * z_0\]

Here we get mean field approximation if \(z_0 \sim \mathcal{N}(0, 1)\)

Flow Formulas

In PyMC3 there is a flexible way to define flows with formulas. We have 5 of them at the moment:

  • Loc (loc): \(z' = z + \mu\)
  • Scale (scale): \(z' = \sigma * z\)
  • Planar (planar): \(z' = z + u * \tanh(w^T z + b)\)
  • Radial (radial): \(z' = z + \beta (\alpha + (z-z_r))^{-1}(z-z_r)\)
  • Householder (hh): \(z' = H z\)

A formula can be written as a string, e.g. ‘scale-loc’, ‘scale-hh*4-loc’, ‘planar*10’. Every step is separated with ‘-’, and a repeated flow is marked with ‘*’, producing ‘flow*repeats’.
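
As a sketch, a normalizing-flow approximation could be fit with such a formula on an assumed model:

>>> with model:
...     inference = pm.NFVI(flow='scale-planar*4-loc')
...     approx = inference.fit(20000)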

Parameters:

flow : str|AbstractFlow

formula or initialized Flow, default is ‘scale-loc’ that is identical to MeanField

model : pymc3.Model

PyMC3 model for inference

random_seed : None or int

leave None to use package global RandomStream or other valid value to create instance specific one

class pymc3.variational.inference.Inference(op, approx, tf, **kwargs)

Base class for Variational Inference

Communicates Operator, Approximation and Test Function to build Objective Function

Parameters:

op : Operator class

approx : Approximation class or instance

tf : TestFunction instance

model : Model

PyMC3 Model

kwargs : kwargs passed to Operator

fit(n=10000, score=None, callbacks=None, progressbar=True, **kwargs)

Perform Operator Variational Inference

Parameters:

n : int

number of iterations

score : bool

evaluate loss on each iteration or not

callbacks : list[function]

the provided functions are called after each iteration step

progressbar : bool

whether to show progressbar or not

Returns:

Approximation

Other Parameters:
 

obj_n_mc : int

Number of Monte Carlo samples used for approximation of objective gradients

tf_n_mc : int

Number of Monte Carlo samples used for approximation of test function gradients

obj_optimizer : function (grads, params) -> updates

Optimizer that is used for objective params

test_optimizer : function (grads, params) -> updates

Optimizer that is used for test function params

more_obj_params : list

Add custom params for objective optimizer

more_tf_params : list

Add custom params for test function optimizer

more_updates : dict

Add custom updates to resulting updates

total_grad_norm_constraint : float

Bounds gradient norm, prevents exploding gradient problem

fn_kwargs : dict

Add kwargs to theano.function (e.g. {‘profile’: True})

more_replacements : dict

Apply custom replacements before calculating gradients

refine(n, progressbar=True)

Refine the solution using the last compiled step function
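A minimal sketch of the fit/refine workflow on a concrete subclass (ADVI, one of the inferences listed under the fit shortcut below); the model and the adam optimizer choice are illustrative assumptions:

    import pymc3 as pm
    from pymc3.variational.updates import adam

    with pm.Model():
        w = pm.Normal('w', mu=0.0, sd=1.0)
        inference = pm.ADVI()
        # obj_optimizer is one of the Other Parameters accepted by fit()
        approx = inference.fit(n=10000, obj_optimizer=adam(learning_rate=0.01))
        # run 5000 more iterations with the last compiled step function
        inference.refine(5000)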

class pymc3.variational.inference.ImplicitGradient(approx, estimator=<class 'pymc3.variational.operators.KSD'>, kernel=<pymc3.variational.test_functions.RBF object>, **kwargs)

Implicit Gradient for Variational Inference

Not suggested for use.

An approach to fit an arbitrary approximation by computing a kernel-based gradient. By default, the RBF kernel is used for gradient estimation. The default estimator is Kernelized Stein Discrepancy with temperature equal to 1. This temperature only works for a large number of samples; a larger temperature is needed for a small number of samples, but there is no theoretical approach to choosing the best one in such a case.

class pymc3.variational.inference.KLqp(approx, beta=1.0)

Kullback Leibler Divergence Inference

General approach to fit Approximations that define \(\log q\) by maximizing ELBO (Evidence Lower Bound). In some cases rescaling the KL regularization term may be beneficial:

\[ELBO_\beta = \log p(D|\theta) - \beta KL(q||p)\]
Parameters:

approx : Approximation

Approximation to fit; it is required to have logQ

beta : float

Scales the regularization term in ELBO (see Christopher P. Burgess et al., 2017)

References

  • Christopher P. Burgess et al. (NIPS, 2017) Understanding disentangling in \(\beta\)-VAE arXiv preprint 1804.03599
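A minimal sketch of beta-scaled KLqp, assuming MeanField can be constructed directly inside a model context as its signature under Approximations below suggests (model and variable names are illustrative):

    import pymc3 as pm
    from pymc3.variational.inference import KLqp

    with pm.Model():
        z = pm.Normal('z', mu=0.0, sd=1.0)
        approx = pm.MeanField()              # an Approximation that defines logQ
        inference = KLqp(approx, beta=0.5)   # down-weight the KL regularization term
        fitted = inference.fit(n=5000)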
pymc3.variational.inference.fit(n=10000, local_rv=None, method='advi', model=None, random_seed=None, start=None, inf_kwargs=None, **kwargs)

Handy shortcut for using inference methods in functional way

Parameters:

n : int

number of iterations

local_rv : dict[var->tuple]

mapping {model_variable -> approx params}. Local Vars are used for Autoencoding Variational Bayes; see (AEVB; Kingma and Welling, 2014) for details.

method : str or Inference

the string name is case insensitive and must be one of:

  • ‘advi’ for ADVI
  • ‘fullrank_advi’ for FullRankADVI
  • ‘svgd’ for Stein Variational Gradient Descent
  • ‘asvgd’ for Amortized Stein Variational Gradient Descent
  • ‘nfvi’ for Normalizing Flow with default scale-loc flow
  • ‘nfvi=<formula>’ for Normalizing Flow using formula

model : Model

PyMC3 model for inference

random_seed : None or int

leave None to use the package-global RandomStream, or pass any other valid value to create an instance-specific one

inf_kwargs : dict

additional kwargs passed to Inference

start : Point

starting point for inference

Returns:

Approximation

Other Parameters:
 

score : bool

evaluate loss on each iteration or not

callbacks : list[function]

the provided functions are called after each iteration step

progressbar : bool

whether to show progressbar or not

obj_n_mc : int

Number of Monte Carlo samples used for approximation of objective gradients

tf_n_mc : int

Number of Monte Carlo samples used for approximation of test function gradients

obj_optimizer : function (grads, params) -> updates

Optimizer that is used for objective params

test_optimizer : function (grads, params) -> updates

Optimizer that is used for test function params

more_obj_params : list

Add custom params for objective optimizer

more_tf_params : list

Add custom params for test function optimizer

more_updates : dict

Add custom updates to resulting updates

total_grad_norm_constraint : float

Bounds gradient norm, prevents exploding gradient problem

fn_kwargs : dict

Add kwargs to theano.function (e.g. {‘profile’: True})

more_replacements : dict

Apply custom replacements before calculating gradients
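A minimal sketch of the functional shortcut (hypothetical model and data); any of the Other Parameters above, e.g. obj_n_mc, is forwarded to the underlying Inference.fit call:

    import pymc3 as pm

    data = [0.1, -0.3, 0.2, 0.5]

    with pm.Model():
        mu = pm.Normal('mu', mu=0.0, sd=10.0)
        pm.Normal('obs', mu=mu, sd=1.0, observed=data)
        # 'advi' fits a MeanField approximation (see Approximations below)
        approx = pm.fit(n=20000, method='advi', obj_n_mc=10)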

Approximations

class pymc3.variational.approximations.MeanField(*args, **kwargs)

Single Group Mean Field Approximation

Mean Field approximation to the posterior, where a factorized Gaussian family is fitted to minimize the KL divergence from the true posterior. Latent space variables are assumed to be uncorrelated, which is the main drawback of the method.

class pymc3.variational.approximations.FullRank(*args, **kwargs)

Single Group Full Rank Approximation

Full Rank approximation to the posterior, where a Multivariate Gaussian family is fitted to minimize the KL divergence from the true posterior. In contrast to the MeanField approach, correlations between variables are taken into account. The main drawback of the method is its computational cost.
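A short sketch of the trade-off (hypothetical correlated model): the ‘advi’ shortcut fits a MeanField approximation, while ‘fullrank_advi’ fits a FullRank one:

    import pymc3 as pm

    with pm.Model():
        a = pm.Normal('a', mu=0.0, sd=1.0)
        b = pm.Normal('b', mu=a, sd=1.0)
        mean_field = pm.fit(n=10000, method='advi')           # ignores the a-b correlation
        full_rank = pm.fit(n=10000, method='fullrank_advi')   # models the correlation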

class pymc3.variational.approximations.Empirical(trace=None, size=None, **kwargs)

Single Group Empirical Approximation

Builds an Approximation instance from a given trace; it has the same interface as a variational approximation.

evaluate_over_trace(node)

This allows one to statically evaluate any symbolic expression over the trace.

Parameters:

node : Theano Variables (or Theano expressions)

Returns:

evaluated node(s) over the posterior trace contained in the empirical approximation
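A minimal sketch (hypothetical model) of building an Empirical approximation from an MCMC trace and evaluating an expression over it:

    import pymc3 as pm

    with pm.Model():
        x = pm.Normal('x', mu=0.0, sd=1.0)
        trace = pm.sample(500, tune=500)
        approx = pm.Empirical(trace)               # wrap the MCMC trace
        node = approx.evaluate_over_trace(x ** 2)  # x**2 over the stored posterior samples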
class pymc3.variational.approximations.NormalizingFlow(flow='scale-loc', *args, **kwargs)

Single Group Normalizing Flow Approximation

Normalizing flow is a series of invertible transformations applied to an initial distribution.

\[\begin{split}z_K &= f_K \circ \dots \circ f_2 \circ f_1(z_0) \\ & z_0 \sim \mathcal{N}(0, 1)\end{split}\]

In that case we can compute a tractable density for the flow.

\[\ln q_K(z_K) = \ln q_0(z_0) - \sum_{k=1}^{K}\ln \left|\frac{\partial f_k}{\partial z_{k-1}}\right|\]

Every \(f_k\) here is a parametric function with a tractable Jacobian determinant, and each step can be chosen freely. For example, a simple flow is an affine transform:

\[z = loc(scale(z_0)) = \mu + \sigma * z_0\]

Here we get a mean field approximation if \(z_0 \sim \mathcal{N}(0, 1)\).

Flow Formulas

In PyMC3 there is a flexible way to define flows with formulas. There are 5 of them at the moment:

  • Loc (loc): \(z' = z + \mu\)
  • Scale (scale): \(z' = \sigma * z\)
  • Planar (planar): \(z' = z + u * \tanh(w^T z + b)\)
  • Radial (radial): \(z' = z + \beta (\alpha + (z-z_r))^{-1}(z-z_r)\)
  • Householder (hh): \(z' = H z\)

A formula can be written as a string, e.g. ‘scale-loc’, ‘scale-hh*4-loc’, ‘planar*10’. Steps are separated with ‘-’; a repeated flow is marked with ‘*’, producing ‘flow*repeats’.

References

  • Danilo Jimenez Rezende, Shakir Mohamed, 2015 Variational Inference with Normalizing Flows arXiv:1505.05770
  • Jakub M. Tomczak, Max Welling, 2016 Improving Variational Auto-Encoders using Householder Flow arXiv:1611.09630
pymc3.variational.approximations.sample_approx(approx, draws=100, include_transformed=True)

Draw samples from variational posterior.

Parameters:

approx : Approximation

Approximation to sample from

draws : int

Number of random samples.

include_transformed : bool

If True, transformed variables are also sampled. Default is True.

Returns:

trace : pymc3.backends.base.MultiTrace

Samples drawn from variational posterior.
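A minimal sketch (hypothetical model), assuming an Approximation fitted as shown earlier:

    import pymc3 as pm
    from pymc3.variational.approximations import sample_approx

    with pm.Model():
        mu = pm.Normal('mu', mu=0.0, sd=1.0)
        approx = pm.fit(n=10000, method='advi')

    trace = sample_approx(approx, draws=500, include_transformed=True)
    print(len(trace))  # 500 draws from the variational posterior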

Operators

class pymc3.variational.operators.KL(approx, beta=1.0)

Operator based on Kullback Leibler Divergence

This operator constructs the Evidence Lower Bound (ELBO) objective

\[ELBO_\beta = \log p(D|\theta) - \beta KL(q||p)\]

where

\[KL[q(v)||p(v)] = \int q(v)\log\frac{q(v)}{p(v)}dv\]
Parameters:

approx : Approximation

Approximation used for inference

beta : float

Beta parameter for KL divergence, scales the regularization term.

apply(f)

Operator itself

\[(O^{p,q}f_{\theta})(z)\]
Parameters:

f : TestFunction or None

function that takes z = self.input and returns an output of the same dimension

Returns:

TensorVariable

symbolically applied operator

class pymc3.variational.operators.KSD(approx, temperature=1)

Operator based on Kernelized Stein Discrepancy

Input: A target distribution with density function \(p(x)\)
and a set of initial particles \(\{x^0_i\}^n_{i=1}\)

Output: A set of particles \(\{x_i\}^n_{i=1}\) that approximates the target distribution.

\[\begin{split}x_i^{l+1} \leftarrow x_i^l + \epsilon_l \hat{\phi}^{*}(x_i^l) \\ \hat{\phi}^{*}(x) = \frac{1}{n}\sum^{n}_{j=1}[k(x^l_j,x) \nabla_{x^l_j} \log p(x^l_j)/temp + \nabla_{x^l_j} k(x^l_j,x)]\end{split}\]
Parameters:

approx : Approximation

Approximation used for inference

temperature : float

Temperature for Stein gradient

References

  • Qiang Liu, Dilin Wang (2016) Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm arXiv:1608.04471
apply(f)

Operator itself

\[(O^{p,q}f_{\theta})(z)\]
Parameters:

f : TestFunction or None

function that takes z = self.input and returns an output of the same dimension

Returns:

TensorVariable

symbolically applied operator

objective_class

alias of KSDObjective