API quickstart¶
In [1]:
%matplotlib inline
import numpy as np
import theano.tensor as tt
import pymc3 as pm
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_context('notebook')
1. Model creation¶
Models in PyMC3 are centered around the Model
class. It has
references to all random variables (RVs) and computes the model logp and
its gradients. Usually, you would instantiate it as part of a with
context:
In [2]:
with pm.Model() as model:
# Model definition
pass
We discuss RVs further below but let’s create a simple model to explore
the Model
class.
In [3]:
with pm.Model() as model:
mu = pm.Normal('mu', mu=0, sd=1)
obs = pm.Normal('obs', mu=mu, sd=1, observed=np.random.randn(100))
In [4]:
model.basic_RVs
Out[4]:
[mu, obs]
In [5]:
model.free_RVs
Out[5]:
[mu]
In [6]:
model.observed_RVs
Out[6]:
[obs]
In [7]:
model.logp({'mu': 0})
Out[7]:
array(139.7234487505281)
Warning It’s worth highlighting one of the counterintuitive design
choices with logp. The API makes the logp
look like an attribute,
when it actually puts together a function based on the current state of
the model.
The current design is super maintainable, does terrible if the state
stays constant, and great if the state keeps changing, for reasons of
design we assume that Model
isn’t static, in fact it’s best in our
experience and avoids bad results.
If you need to use logp
in an inner loop and it needs to be static,
simply use something like logp = model.logp
below. You can see the
caching effect with the speed up below.
In [8]:
%timeit model.logp({mu: 0.1})
logp = model.logp
%timeit logp({mu: 0.1})
91.8 ms ± 1.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
35.6 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
2. Probability Distributions¶
Every probabilistic program consists of observed and unobserved Random Variables (RVs). Observed RVs are defined via likelihood distributions, while unobserved RVs are defined via prior distributions. In PyMC3, probability distributions are available from the main module space:
In [9]:
help(pm.Normal)
Help on class Normal in module pymc3.distributions.continuous:
class Normal(pymc3.distributions.distribution.Continuous)
 Univariate normal loglikelihood.

 .. math::

 f(x \mid \mu, \tau) =
 \sqrt{\frac{\tau}{2\pi}}
 \exp\left\{ \frac{\tau}{2} (x\mu)^2 \right\}

 ======== ==========================================
 Support :math:`x \in \mathbb{R}`
 Mean :math:`\mu`
 Variance :math:`\dfrac{1}{\tau}` or :math:`\sigma^2`
 ======== ==========================================

 Normal distribution can be parameterized either in terms of precision
 or standard deviation. The link between the two parametrizations is
 given by

 .. math::

 \tau = \dfrac{1}{\sigma^2}

 .. plot::

 import matplotlib.pyplot as plt
 import numpy as np
 import scipy.stats as st
 x = np.linspace(5.0, 5.0, 1000)
 fig, ax = plt.subplots()
 f = lambda mu, sd : st.norm.pdf(x, loc=mu, scale=sd)
 plot_pdf = lambda a, b : ax.plot(x, f(a,b), label=r'$\mu$={0}, $\sigma$={1}'.format(a,b))
 plot_pdf(0.0, 0.4)
 plot_pdf(0.0, 1.0)
 plot_pdf(0.0, 2.0)
 plot_pdf(2.0, 0.4)
 plt.legend(loc='upper right', frameon=False)
 ax.set(xlim=[5,5], ylim=[0,1.2], xlabel='x', ylabel='f(x)')
 plt.show()

 Parameters
 
 mu : float
 Mean.
 sd : float
 Standard deviation (sd > 0).
 tau : float
 Precision (tau > 0).

 Method resolution order:
 Normal
 pymc3.distributions.distribution.Continuous
 pymc3.distributions.distribution.Distribution
 builtins.object

 Methods defined here:

 __init__(self, mu=0, sd=None, tau=None, **kwargs)
 Initialize self. See help(type(self)) for accurate signature.

 logp(self, value)

 random(self, point=None, size=None, repeat=None)

 
 Methods inherited from pymc3.distributions.distribution.Distribution:

 __getnewargs__(self)

 __latex__ = _repr_latex_(self, name=None, dist=None)
 Magic method name for IPython to use for LaTeX formatting.

 default(self)

 get_test_val(self, val, defaults)

 getattr_value(self, val)

 logp_nojac(self, *args, **kwargs)
 Return the logp, but do not include a jacobian term for transforms.

 If we use different parametrizations for the same distribution, we
 need to add the determinant of the jacobian of the transformation
 to make sure the densities still describe the same distribution.
 However, MAP estimates are not invariant with respect to the
 parametrization, we need to exclude the jacobian terms in this case.

 This function should be overwritten in base classes for transformed
 distributions.

 logp_sum(self, *args, **kwargs)
 Return the sum of the logp values for the given observations.

 Subclasses can use this to improve the speed of logp evaluations
 if only the sum of the logp values is needed.

 
 Class methods inherited from pymc3.distributions.distribution.Distribution:

 dist(*args, **kwargs) from builtins.type

 
 Static methods inherited from pymc3.distributions.distribution.Distribution:

 __new__(cls, name, *args, **kwargs)
 Create and return a new object. See help(type) for accurate signature.

 
 Data descriptors inherited from pymc3.distributions.distribution.Distribution:

 __dict__
 dictionary for instance variables (if defined)

 __weakref__
 list of weak references to the object (if defined)
In the PyMC3 module, the structure for probability distributions looks like this:
pymc3.distributions  continuous  discrete  timeseries  mixture
In [10]:
dir(pm.distributions.mixture)
Out[10]:
['Discrete',
'Distribution',
'Mixture',
'Normal',
'NormalMixture',
'__builtins__',
'__cached__',
'__doc__',
'__file__',
'__loader__',
'__name__',
'__package__',
'__spec__',
'all_discrete',
'bound',
'draw_values',
'generate_samples',
'get_tau_sd',
'get_variable_name',
'logsumexp',
'np',
'tt']
Unobserved Random Variables¶
Every unobserved RV has the following calling signature: name (str), parameter keyword arguments. Thus, a normal prior can be defined in a model context like this:
In [11]:
with pm.Model():
x = pm.Normal('x', mu=0, sd=1)
As with the model, we can evaluate its logp:
In [12]:
x.logp({'x': 0})
Out[12]:
array(0.9189385332046727)
Observed Random Variables¶
Observed RVs are defined just like unobserved RVs but require data to be
passed into the observed
keyword argument:
In [13]:
with pm.Model():
obs = pm.Normal('x', mu=0, sd=1, observed=np.random.randn(100))
observed
supports lists, numpy.ndarray
, theano
and
pandas
data structures.
Deterministic transforms¶
PyMC3 allows you to freely do algebra with RVs in all kinds of ways:
In [14]:
with pm.Model():
x = pm.Normal('x', mu=0, sd=1)
y = pm.Gamma('y', alpha=1, beta=1)
plus_2 = x + 2
summed = x + y
squared = x**2
sined = pm.math.sin(x)
While these transformations work seamlessly, its results are not stored
automatically. Thus, if you want to keep track of a transformed
variable, you have to use pm.Determinstic
:
In [15]:
with pm.Model():
x = pm.Normal('x', mu=0, sd=1)
plus_2 = pm.Deterministic('x plus 2', x + 2)
Note that plus_2
can be used in the identical way to above, we only
tell PyMC3 to keep track of this RV for us.
Automatic transforms of bounded RVs¶
In order to sample models more efficiently, PyMC3 automatically transforms bounded RVs to be unbounded.
In [16]:
with pm.Model() as model:
x = pm.Uniform('x', lower=0, upper=1)
When we look at the RVs of the model, we would expect to find x
there, however:
In [17]:
model.free_RVs
Out[17]:
[x_interval__]
x_interval__
represents x
transformed to accept parameter values
between inf and +inf. In the case of an upper and a lower bound, a
LogOdd
s transform is applied. Sampling in this transformed space
makes it easier for the sampler. PyMC3 also keeps track of the
nontransformed, bounded parameters. These are common determinstics (see
above):
In [18]:
model.deterministics
Out[18]:
[x]
When displaying results, PyMC3 will usually hide transformed parameters.
You can pass the include_transformed=True
parameter to many
functions to see the transformed parameters that are used for sampling.
You can also turn transforms off:
In [19]:
with pm.Model() as model:
x = pm.Uniform('x', lower=0, upper=1, transform=None)
print(model.free_RVs)
[x]
Lists of RVs / higherdimensional RVs¶
Above we have seen how to create scalar RVs. In many models, you want multiple RVs. There is a tendency (mainly inherited from PyMC 2.x) to create list of RVs, like this:
In [20]:
with pm.Model():
x = [pm.Normal('x_{}'.format(i), mu=0, sd=1) for i in range(10)] # bad
However, even though this works it is quite slow and not recommended.
Instead, use the shape
kwarg:
In [21]:
with pm.Model() as model:
x = pm.Normal('x', mu=0, sd=1, shape=10) # good
x
is now a random vector of length 10. We can index into it or do
linear algebra operations on it:
In [22]:
with model:
y = x[0] * x[1] # full indexing is supported
x.dot(x.T) # Linear algebra is supported
Initialization with test_values¶
While PyMC3 tries to automatically initialize models it is sometimes
helpful to define initial values for RVs. This can be done via the
testval
kwarg:
In [23]:
with pm.Model():
x = pm.Normal('x', mu=0, sd=1, shape=5)
x.tag.test_value
Out[23]:
array([ 0., 0., 0., 0., 0.])
In [24]:
with pm.Model():
x = pm.Normal('x', mu=0, sd=1, shape=5, testval=np.random.randn(5))
x.tag.test_value
Out[24]:
array([ 0.421665 , 1.63966614, 0.92402752, 0.14413675, 1.39721381])
This technique is quite useful to identify problems with model specification or initialization.
3. Inference¶
Once we have defined our model, we have to perform inference to approximate the posterior distribution. PyMC3 supports two broad classes of inference: sampling and variational inference.
3.1 Sampling¶
The main entry point to MCMC sampling algorithms is via the
pm.sample()
function. By default, this function tries to autoassign
the right sampler(s) and autoinitialize if you don’t pass anything.
In [25]:
with pm.Model() as model:
mu = pm.Normal('mu', mu=0, sd=1)
obs = pm.Normal('obs', mu=mu, sd=1, observed=np.random.randn(100))
trace = pm.sample(1000, tune=500)
Autoassigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
100%██████████ 1500/1500 [00:01<00:00, 1026.78it/s]
As you can see, on a continuous model, PyMC3 assigns the NUTS sampler, which is very efficient even for complex models. PyMC3 also runs variational inference (i.e. ADVI) to find good starting parameters for the sampler. Here we draw 1000 samples from the posterior and allow the sampler to adjust its parameters in an additional 500 iterations. These 500 samples are discarded by default:
In [26]:
len(trace)
Out[26]:
1000
You can also run multiple chains in parallel using the cores
kwarg:
In [27]:
with pm.Model() as model:
mu = pm.Normal('mu', mu=0, sd=1)
obs = pm.Normal('obs', mu=mu, sd=1, observed=np.random.randn(100))
trace = pm.sample(cores=4)
Autoassigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
100%██████████ 1000/1000 [00:01<00:00, 851.68it/s]
Note, that we are now drawing 2000 samples, 500 samples for 4 chains each. The 500 tuning samples are discarded by default.
In [28]:
trace['mu'].shape
Out[28]:
(2000,)
In [29]:
trace.nchains
Out[29]:
4
In [30]:
trace.get_values('mu', chains=1).shape # get values of a single chain
Out[30]:
(500,)
PyMC3, offers a variety of other samplers, found in pm.step_methods
.
In [31]:
list(filter(lambda x: x[0].isupper(), dir(pm.step_methods)))
Out[31]:
['BinaryGibbsMetropolis',
'BinaryMetropolis',
'CategoricalGibbsMetropolis',
'CauchyProposal',
'CompoundStep',
'ElemwiseCategorical',
'EllipticalSlice',
'HamiltonianMC',
'LaplaceProposal',
'Metropolis',
'MultivariateNormalProposal',
'NUTS',
'NormalProposal',
'PoissonProposal',
'SGFS',
'SMC',
'Slice']
Commonly used stepmethods besides NUTS are Metropolis
and
Slice
. For almost all continuous models, ``NUTS`` should be
preferred. There are hardtosample models for which NUTS
will be
very slow causing many users to use Metropolis
instead. This
practice, however, is rarely successful. NUTS is fast on simple models
but can be slow if the model is very complex or it is badly initialized.
In the case of a complex model that is hard for NUTS, Metropolis, while
faster, will have a very low effective sample size or not converge
properly at all. A better approach is to instead try to improve
initialization of NUTS, or reparameterize the model.
For completeness, other sampling methods can be passed to sample:
In [32]:
with pm.Model() as model:
mu = pm.Normal('mu', mu=0, sd=1)
obs = pm.Normal('obs', mu=mu, sd=1, observed=np.random.randn(100))
step = pm.Metropolis()
trace = pm.sample(1000, step=step)
100%██████████ 1500/1500 [00:00<00:00, 5511.71it/s]
You can also assign variables to different step methods.
In [33]:
with pm.Model() as model:
mu = pm.Normal('mu', mu=0, sd=1)
sd = pm.HalfNormal('sd', sd=1)
obs = pm.Normal('obs', mu=mu, sd=sd, observed=np.random.randn(100))
step1 = pm.Metropolis(vars=[mu])
step2 = pm.Slice(vars=[sd])
trace = pm.sample(10000, step=[step1, step2], cores=4)
100%██████████ 10500/10500 [00:13<00:00, 774.35it/s]
3.2 Analyze sampling results¶
The most common used plot to analyze sampling results is the socalled traceplot:
In [34]:
pm.traceplot(trace);
Another common metric to look at is Rhat, also known as the GelmanRubin statistic:
In [35]:
pm.gelman_rubin(trace)
Out[35]:
{'mu': 1.0001326715972627,
'sd': 0.99995267558629874,
'sd_log__': 0.99995398290485815}
These are also part of the forestplot
:
In [36]:
pm.forestplot(trace);
Finally, for a plot of the posterior that is inspired by the book Doing Bayesian Data Analysis, you can use the:
In [37]:
pm.plot_posterior(trace);
For highdimensional models it becomes cumbersome to look at all
parameter’s traces. When using NUTS
we can look at the energy plot
to assess problems of convergence:
In [38]:
with pm.Model() as model:
x = pm.Normal('x', mu=0, sd=1, shape=100)
trace = pm.sample(cores=4)
pm.energyplot(trace);
Autoassigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
100%██████████ 1000/1000 [00:05<00:00, 198.29it/s]
For more information on sampler stats and the energy plot, see here. For more information on identifying sampling problems and what to do about them, see here.
3.3 Variational inference¶
PyMC3 supports various Variational Inference techniques. While these
methods are much faster, they are often also less accurate and can lead
to biased inference. The main entry point is pymc3.fit()
.
In [39]:
with pm.Model() as model:
mu = pm.Normal('mu', mu=0, sd=1)
sd = pm.HalfNormal('sd', sd=1)
obs = pm.Normal('obs', mu=mu, sd=sd, observed=np.random.randn(100))
approx = pm.fit()
Average Loss = 150.94: 100%██████████ 10000/10000 [00:03<00:00, 2988.43it/s]
Finished [100%]: Average Loss = 150.93
The returned Approximation
object has various capabilities, like
drawing samples from the approximated posterior, which we can analyse
like a regular sampling run:
In [40]:
approx.sample(500)
Out[40]:
<MultiTrace: 1 chains, 500 iterations, 3 variables>
The variational
submodule offers a lot of flexibility in which VI to
use and follows an object oriented design. For example, fullrank ADVI
estimates a full covariance matrix:
In [41]:
mu = pm.floatX([0., 0.])
cov = pm.floatX([[1, .5], [.5, 1.]])
with pm.Model() as model:
pm.MvNormal('x', mu=mu, cov=cov, shape=2)
approx = pm.fit(method='fullrank_advi')
Average Loss = 0.0068883: 100%██████████ 10000/10000 [00:08<00:00, 1233.38it/s]
Finished [100%]: Average Loss = 0.0065707
An equivalent expression using the objectoriented interface is:
In [42]:
with pm.Model() as model:
pm.MvNormal('x', mu=mu, cov=cov, shape=2)
approx = pm.FullRankADVI().fit()
Average Loss = 0.01127: 100%██████████ 10000/10000 [00:08<00:00, 1225.82it/s]
Finished [100%]: Average Loss = 0.011343
In [43]:
plt.figure()
trace = approx.sample(10000)
sns.kdeplot(trace['x'])
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x11817c908>
Stein Variational Gradient Descent (SVGD) uses particles to estimate the posterior:
In [44]:
w = pm.floatX([.2, .8])
mu = pm.floatX([.3, .5])
sd = pm.floatX([.1, .1])
with pm.Model() as model:
pm.NormalMixture('x', w=w, mu=mu, sd=sd)
approx = pm.fit(method=pm.SVGD(n_particles=200, jitter=1.))
100%██████████ 10000/10000 [01:21<00:00, 123.44it/s]
In [45]:
plt.figure()
trace = approx.sample(10000)
sns.distplot(trace['x']);
For more information on variational inference, see these examples.
4. Posterior Predictive Sampling¶
The sample_ppc()
function performs prediction on holdout data and
posterior predictive checks.
In [46]:
data = np.random.randn(100)
with pm.Model() as model:
mu = pm.Normal('mu', mu=0, sd=1)
sd = pm.HalfNormal('sd', sd=1)
obs = pm.Normal('obs', mu=mu, sd=sd, observed=data)
trace = pm.sample()
Autoassigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
97%█████████▋ 968/1000 [00:01<00:00, 1050.37it/s]/Users/ferres/dev/pymc3/pymc3/step_methods/hmc/nuts.py:451: UserWarning: The acceptance probability in chain 0 does not match the target. It is 0.886449584522, but should be close to 0.8. Try to increase the number of tuning steps.
% (self._chain_id, mean_accept, target_accept))
100%██████████ 1000/1000 [00:01<00:00, 949.39it/s]
In [47]:
with model:
post_pred = pm.sample_ppc(trace, samples=500, size=len(data))
100%██████████ 500/500 [00:01<00:00, 484.68it/s]
sample_ppc()
returns a dict with a key for every observed node:
In [48]:
post_pred['obs'].shape
Out[48]:
(500, 100)
In [49]:
plt.figure()
ax = sns.distplot(post_pred['obs'].mean(axis=1), label='Posterior predictive means')
ax.axvline(data.mean(), color='r', ls='', label='True mean')
ax.legend()
Out[49]:
<matplotlib.legend.Legend at 0x11a8af940>
4.1 Predicting on holdout data¶
In many cases you want to predict on unseen / holdout data. This is
especially relevant in Probabilistic Machine Learning and Bayesian Deep
Learning. While we plan to improve the API in this regard, this can
currently be achieved with a theano.shared
variable. These are
theano tensors whose values can be changed later. Otherwise they can be
passed into PyMC3 just like any other numpy array or tensor.
This distinction is significant since internally all models in PyMC3 are
giant symbolic expressions. When you pass data directly into a model,
you are giving Theano permission to treat this data as a constant and
optimize it away as it sees fit. If you need to change this data later
you might not have a way to point at it in the symbolic expression.
Using theano.shared
offers a way to point to a place in that
symbolic expression, and change what is there.
In [50]:
import theano
x = np.random.randn(100)
y = x > 0
x_shared = theano.shared(x)
y_shared = theano.shared(y)
with pm.Model() as model:
coeff = pm.Normal('x', mu=0, sd=1)
logistic = pm.math.sigmoid(coeff * x_shared)
pm.Bernoulli('obs', p=logistic, observed=y_shared)
trace = pm.sample()
Autoassigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
100%██████████ 1000/1000 [00:00<00:00, 1369.19it/s]
Now assume we want to predict on unseen data. For this we have to change
the values of x_shared
and y_shared
. Theoretically we don’t need
to set y_shared
as we want to predict it but it has to match the
shape of x_shared
.
In [51]:
x_shared.set_value([1, 0, 1.])
y_shared.set_value([0, 0, 0]) # dummy values
with model:
post_pred = pm.sample_ppc(trace, samples=500)
100%██████████ 500/500 [00:01<00:00, 463.31it/s]
In [52]:
post_pred['obs'].mean(axis=0)
Out[52]:
array([ 0.026, 0.516, 0.966])