[1]:

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pymc3 as pm

[2]:

%load_ext watermark
az.style.use("arviz-darkgrid")


# Sequential Monte Carlo - Approximate Bayesian Computation¶

Approximate Bayesian Computation methods (also called likelihood-free inference methods) are a group of techniques developed to infer posterior distributions in cases where the likelihood function is intractable or too costly to evaluate. This does not mean that the likelihood function plays no part in the analysis; rather, we are approximating the likelihood, hence the name of the ABC methods.

ABC comes in handy when modeling complex phenomena in certain fields of study, such as systems biology. Such models often contain unobservable random quantities that make the likelihood function hard to specify, but data can still be simulated from the model.

These methods follow a general form:

1- Sample a parameter $$\theta^*$$ from a prior/proposal distribution $$\pi(\theta)$$.

2- Simulate a data set $$y^*$$ using a function that takes $$\theta^*$$ and returns a data set of the same dimensions as the observed data set $$y_0$$ (the simulator).

3- Compare the simulated data set $$y^*$$ with the observed data set $$y_0$$ using a distance function $$d$$ and a tolerance threshold $$\epsilon$$.

In some cases the distance is computed between two summary statistics, $$d(S(y_0), S(y^*))$$, avoiding the need to compute distances between entire datasets.

As a result we obtain a sample of parameters from the distribution $$\pi(\theta \mid d(y_0, y^*) \leqslant \epsilon)$$.

If $$\epsilon$$ is sufficiently small this distribution will be a good approximation of the posterior distribution $$\pi(\theta | y_0)$$.
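The three steps above amount to rejection ABC, which can be sketched in a few lines of NumPy. This is a toy setup with a hypothetical one-parameter Gaussian simulator and the sample mean as summary statistic; the names and values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
y0 = rng.normal(0, 1, 500)  # "observed" data

def simulator(theta):
    # draws a synthetic dataset of the same size as y0
    return rng.normal(theta, 1, len(y0))

def distance(y_sim, y_obs):
    # distance between summary statistics (here: the sample mean)
    return abs(y_sim.mean() - y_obs.mean())

epsilon = 0.05
accepted = []
while len(accepted) < 200:
    theta = rng.normal(0, 5)                # 1. sample from the prior
    y_star = simulator(theta)               # 2. simulate a dataset
    if distance(y_star, y0) <= epsilon:     # 3. compare and accept/reject
        accepted.append(theta)

posterior_sample = np.array(accepted)
```

The accepted values of `theta` form an approximate sample from the posterior; the smaller `epsilon`, the better the approximation, at the price of a lower acceptance rate.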

Sequential Monte Carlo ABC (SMC-ABC) is a method that iteratively morphs the prior into the posterior by propagating the sampled parameters through a series of proposal distributions $$\phi(\theta^{(i)})$$, weighting each accepted parameter $$\theta^{(i)}$$ as:

$$w^{(i)} \propto \frac{\pi(\theta^{(i)})}{\phi(\theta^{(i)})}$$

It combines the advantages of traditional SMC, such as the ability to sample from distributions with multiple peaks, without the need to evaluate the likelihood function.
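The weighting step can be illustrated with a small NumPy/SciPy sketch, assuming (as in Toni et al., 2008) that the proposal $$\phi$$ is a mixture of Gaussian perturbation kernels centred at the previous population of particles. The particle values, prior, and kernel width below are made up for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# previous population of accepted particles, perturbed by a Gaussian kernel
prev_particles = rng.normal(0, 1, 500)
sigma_kernel = 0.5
particles = prev_particles + rng.normal(0, sigma_kernel, 500)

prior = stats.norm(0, 5)  # pi(theta)

# proposal density phi(theta): mixture of kernels centred at the old particles
phi = np.array(
    [stats.norm(prev_particles, sigma_kernel).pdf(p).mean() for p in particles]
)

weights = prior.pdf(particles) / phi   # w ∝ pi(theta) / phi(theta)
weights /= weights.sum()               # normalise
```

The normalised weights are then used to resample (or reweight) the particles for the next stage of the sequence.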

(Lintusaari, 2016), (Toni, T., 2008), (Nuñez, Prangle, 2015)

# Good old Gaussian fit¶

To illustrate how to use ABC within PyMC3 we are going to start with a very simple example estimating the mean and standard deviation of Gaussian data.

[3]:

data = np.random.normal(loc=0, scale=1, size=1000)


Clearly, under normal circumstances, a Gaussian likelihood would do the job very well. But that would defeat the purpose of this example: the notebook would end here and everything would be very boring. So, instead, we are going to define a simulator. A very straightforward simulator for normal data is a pseudo-random number generator; in real life our simulator will most likely be something fancier.

[4]:

def normal_sim(a, b):
    return np.random.normal(a, b, 1000)


Defining an ABC model in PyMC3 is, in general, very similar to defining other PyMC3 models. The two important differences are that we need to define a Simulator distribution and we need to use sample_smc with kernel="ABC". The Simulator works as a generic interface for passing the synthetic-data-generating function (normal_sim in this example), its parameters, the observed data, and optionally a distance function and a summary statistic. In the following code we are using the default distance, gaussian_kernel, and the sort summary statistic. As the name suggests, sort sorts the data before computing the distance.
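Conceptually, the sort summary statistic and the Gaussian kernel distance can be sketched as follows. This is an illustrative stand-in, not necessarily PyMC3's exact implementation:

```python
import numpy as np

def sort_sum_stat(y):
    # the "sort" summary statistic: the order statistics of the data
    return np.sort(y)

def gaussian_kernel(eps, obs_s, sim_s):
    # Gaussian-kernel pseudo-log-likelihood of the discrepancies,
    # with epsilon acting as the kernel bandwidth
    return np.sum(-0.5 * ((obs_s - sim_s) / eps) ** 2)

rng = np.random.default_rng(1)
data = rng.normal(0, 1, 1000)
sim = rng.normal(0, 1, 1000)
d = gaussian_kernel(1.0, sort_sum_stat(data), sort_sum_stat(sim))
```

Sorting both datasets compares their empirical distributions point by point, which is why it works well for i.i.d. data like this example.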

Finally, SMC-ABC offers the option to store the simulated data. This can be handy, as simulators can be expensive to evaluate and we may want to reuse the simulated data, for example for posterior predictive checks.

[5]:

with pm.Model() as example:
    a = pm.Normal("a", mu=0, sigma=5)
    b = pm.HalfNormal("b", sigma=1)
    s = pm.Simulator("s", normal_sim, params=(a, b), sum_stat="sort", epsilon=1, observed=data)

    trace, sim_data = pm.sample_smc(kernel="ABC", parallel=True, save_sim_data=True)
    idata = az.from_pymc3(trace, posterior_predictive=sim_data)

Initializing SMC sampler...
Multiprocess sampling (2 chains in 2 jobs)
/home/osvaldo/proyectos/00_BM/pymc3/pymc3/smc/sample_smc.py:167: UserWarning: Warning: SMC-ABC is an experimental step method and not yet recommended for use in PyMC3!
warnings.warn(EXPERIMENTAL_WARNING)
Stage:   0 Beta: 0.000
Stage:   1 Beta: 0.002
Stage:   2 Beta: 0.008
Stage:   3 Beta: 0.027
Stage:   4 Beta: 0.087
Stage:   5 Beta: 0.290
Stage:   6 Beta: 0.842
Stage:   7 Beta: 1.000


Judging by plot_trace, the sampler did its job very well, which is not surprising given that this is a very simple model. Anyway, it is always reassuring to look at a flat rank plot :-)

[6]:

az.plot_trace(idata, kind="rank_vlines");

[7]:

az.summary(idata, kind="stats")

[7]:

     mean     sd  hdi_3%  hdi_97%
a  -0.042  0.046  -0.130    0.043
b   1.008  0.039   0.934    1.079

The posterior predictive check shows that we have an overall good fit, but the synthetic data has heavier tails than the observed one. You may want to decrease the value of epsilon, and see if you can get a tighter fit.

[8]:

az.plot_ppc(idata);


## Lotka–Volterra¶

The Lotka-Volterra model is a well-known biological model describing how the numbers of individuals of two species change when there is a predator/prey interaction (A Biologist’s Guide to Mathematical Modeling in Ecology and Evolution, Otto and Day, 2007), for example rabbits and foxes. Given an initial population for each species, the integration of these ordinary differential equations (ODEs) describes curves for the progression of both populations. This ODE system takes four parameters:

• a is the natural growth rate of rabbits, when there are no foxes.

• b is the natural dying rate of rabbits, due to predation.

• c is the natural dying rate of foxes, when there are no rabbits.

• d is the factor describing how many caught rabbits are needed to create a new fox.
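Written out, with $$x$$ the rabbit (prey) population and $$y$$ the fox (predator) population, and matching the dX_dt function defined below, the system is:

$$\frac{dx}{dt} = a x - b x y, \qquad \frac{dy}{dt} = -c y + d\, b\, x y$$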

Notice that there is nothing intrinsically special about the combination of SMC-ABC and ODEs. In principle a simulator can be any piece of code able to generate synthetic data given a set of parameters.

[9]:

from scipy.integrate import odeint

# Definition of parameters
a = 1.0
b = 0.1
c = 1.5
d = 0.75

# initial population of rabbits and foxes
X0 = [10.0, 5.0]
# size of data
size = 100
# time lapse
time = 15
t = np.linspace(0, time, size)

# Lotka-Volterra equations
def dX_dt(X, t, a, b, c, d):
    """Return the growth rate of fox and rabbit populations."""
    return np.array([a * X[0] - b * X[0] * X[1], -c * X[1] + d * b * X[0] * X[1]])


# simulator function
def competition_model(a, b):
    return odeint(dX_dt, y0=X0, t=t, rtol=0.01, args=(a, b, c, d))


Using the simulator function, we will generate a dataset with some added noise, to be used as the observed data.

[10]:

# function for generating noisy data to be used as observed data.
def add_noise(a, b, c, d):
    noise = np.random.normal(size=(size, 2))
    simulated = competition_model(a, b) + noise
    return simulated

[11]:

# plotting observed data.
observed = add_noise(a, b, c, d)
_, ax = plt.subplots(figsize=(12, 4))
ax.plot(observed[:, 0], "x", label="prey")
ax.plot(observed[:, 1], "x", label="predator")
ax.set_xlabel("time")
ax.set_ylabel("population")
ax.set_title("Observed data")
ax.legend();


As with the first example, instead of specifying a likelihood function, we use pm.Simulator().

[12]:

with pm.Model() as model_lv:
    a = pm.HalfNormal("a", 1.0)
    b = pm.HalfNormal("b", 1.0)

    sim = pm.Simulator("sim", competition_model, params=(a, b), epsilon=10, observed=observed)

    trace_lv = pm.sample_smc(kernel="ABC", parallel=True)
    idata_lv = az.from_pymc3(trace_lv)

Initializing SMC sampler...
Multiprocess sampling (2 chains in 2 jobs)
/home/osvaldo/proyectos/00_BM/pymc3/pymc3/smc/sample_smc.py:167: UserWarning: Warning: SMC-ABC is an experimental step method and not yet recommended for use in PyMC3!
warnings.warn(EXPERIMENTAL_WARNING)
Stage:   0 Beta: 0.009
/home/osvaldo/anaconda3/lib/python3.7/site-packages/scipy/integrate/odepack.py:248: ODEintWarning: Repeated error test failures (internal error). Run with full_output = 1 to get quantitative information.
warnings.warn(warning_msg, ODEintWarning)
/home/osvaldo/anaconda3/lib/python3.7/site-packages/scipy/integrate/odepack.py:248: ODEintWarning: Repeated convergence failures (perhaps bad Jacobian or tolerances). Run with full_output = 1 to get quantitative information.
warnings.warn(warning_msg, ODEintWarning)
/home/osvaldo/anaconda3/lib/python3.7/site-packages/scipy/integrate/odepack.py:248: ODEintWarning: Repeated error test failures (internal error). Run with full_output = 1 to get quantitative information.
warnings.warn(warning_msg, ODEintWarning)
Stage:   1 Beta: 0.019
Stage:   2 Beta: 0.031
Stage:   3 Beta: 0.065
Stage:   4 Beta: 0.217
Stage:   5 Beta: 0.694
Stage:   6 Beta: 1.000

[13]:

az.plot_trace(idata_lv, kind="rank_vlines");

[14]:

az.plot_posterior(idata_lv);

[15]:

# plot results
_, ax = plt.subplots(figsize=(14, 6))
ax.plot(observed[:, 0], "o", label="prey", c="C0", mec="k")
ax.plot(observed[:, 1], "o", label="predator", c="C1", mec="k")
ax.plot(competition_model(trace_lv["a"].mean(), trace_lv["b"].mean()), linewidth=3)
for i in np.random.randint(0, size, 75):
    sim = competition_model(trace_lv["a"][i], trace_lv["b"][i])
    ax.plot(sim[:, 0], alpha=0.1, c="C0")
    ax.plot(sim[:, 1], alpha=0.1, c="C1")
ax.set_xlabel("time")
ax.set_ylabel("population")
ax.legend();

[16]:

%watermark -n -u -v -iv -w

autopep8 1.5
numpy    1.18.1
pymc3    3.9.2
json     2.0.9
arviz    0.9.0
last updated: Fri Jul 03 2020

CPython 3.7.6
IPython 7.12.0
watermark 2.0.2