# Diagnosing Biased Inference with Divergences¶

In [1]:

%matplotlib inline
import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import pandas as pd
from collections import defaultdict

plt.style.use('seaborn-darkgrid')
print('Running on PyMC3 v{}'.format(pm.__version__))

SEED = [20100425, 20100234]

%config InlineBackend.figure_format = 'retina'

Running on PyMC3 v3.3


**PyMC3 port of Michael Betancourt's post on mc-stan. For a detailed explanation of the underlying mechanism, please check the original post and Betancourt's excellent paper.**

Bayesian statistics is all about building a model and estimating the parameters in that model. However, a naive or direct parameterization of our probability model can sometimes be ineffective (you can check out Thomas Wiecki's blog post on the same issue in PyMC3). Suboptimal parameterization often leads to slow sampling and, more problematically, biased MCMC estimators.

**More formally, as explained in the original [post](http://mc-stan.org/documentation/case-studies/divergences_and_bias.html) (quoted in blockquotes, same below):**

> Markov chain Monte Carlo (MCMC) approximates expectations with respect to a given target distribution,
>
> $\mathbb{E}_{\pi} [ f ] = \int \mathrm{d}q \, \pi (q) \, f(q),$
>
> using the states of a Markov chain, $\{q_{0}, \ldots, q_{N}\}$,
>
> $\mathbb{E}_{\pi} [ f ] \approx \hat{f}_{N} = \frac{1}{N + 1} \sum_{n = 0}^{N} f(q_{n}).$
>
> These estimators, however, are guaranteed to be accurate only asymptotically as the chain grows to be infinitely long,
>
> $\lim_{N \rightarrow \infty} \hat{f}_{N} = \mathbb{E}_{\pi} [ f ].$
>
> To be useful in applied analyses, we need MCMC estimators to converge to the true expectation values sufficiently quickly that they are reasonably accurate before we exhaust our finite computational resources. This fast convergence requires strong ergodicity conditions to hold, in particular geometric ergodicity between a Markov transition and a target distribution. Geometric ergodicity is usually the necessary condition for MCMC estimators to follow a central limit theorem, which ensures not only that they are unbiased even after only a finite number of iterations but also that we can empirically quantify their precision using the MCMC standard error.
>
> Unfortunately, proving geometric ergodicity theoretically is infeasible for any nontrivial problem. Instead we must rely on empirical diagnostics that identify obstructions to geometric ergodicity, and hence well-behaved MCMC estimators. For a general Markov transition and target distribution, the best known diagnostic is the split $\hat{R}$ statistic over an ensemble of Markov chains initialized from diffuse points in parameter space; to do any better we need to exploit the particular structure of a given transition or target distribution.
>
> Hamiltonian Monte Carlo, for example, is especially powerful in this regard as its failures to be geometrically ergodic with respect to any target distribution manifest in distinct behaviors that have been developed into sensitive diagnostics. One of these behaviors is the appearance of divergences that indicate the Hamiltonian Markov chain has encountered regions of high curvature in the target distribution which it cannot adequately explore.
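
As a toy illustration of the estimator above (an assumed example, not from the original post), $\hat{f}_{N}$ is just a running average over the chain states; the same cumulative-mean construction is used later in this notebook to monitor the MCMC estimate of $\log(\tau)$.

# Toy sketch: approximate E[f] for f(q) = q**2 under a standard normal target,
# using i.i.d. draws as a stand-in for the chain states q_0, ..., q_N.
draws = np.random.randn(1000)                              # stand-in for q_0, ..., q_N
f_vals = draws ** 2                                        # f(q_n)
f_hat = np.cumsum(f_vals) / np.arange(1, len(f_vals) + 1)  # running estimate of E[f]
print(f_hat[-1])                                           # should be close to the true value, 1.0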

In this notebook we aim to identify divergent samples and the underlying pathologies in PyMC3, similar to the original post.

## The Eight Schools Model¶

The hierarchical model of the Eight Schools dataset (Rubin 1981) as seen in Stan:

$\mu \sim \mathcal{N}(0, 5)$
$\tau \sim \text{Half-Cauchy}(0, 5)$
$\theta_{n} \sim \mathcal{N}(\mu, \tau)$
$y_{n} \sim \mathcal{N}(\theta_{n}, \sigma_{n}),$

where $n \in \{1, \ldots, 8 \}$ and the $\{ y_{n}, \sigma_{n} \}$ are given as data.

Inferring the hierarchical hyperparameters, $\mu$ and $\tau$, together with the group-level parameters, $\theta_{1}, \ldots, \theta_{8}$, allows the model to pool data across the groups and reduce their posterior variance. Unfortunately, the direct centered parameterization also squeezes the posterior distribution into a particularly challenging geometry that obstructs geometric ergodicity and hence biases MCMC estimation.

In [2]:

# Data of the Eight Schools Model
J = 8
y = np.asarray([28,  8, -3,  7, -1,  1, 18, 12], dtype=float)
sigma = np.asarray([15, 10, 16, 11,  9, 11, 10, 18], dtype=float)
# tau = 25.


## A Centered Eight Schools Implementation¶

Stan model:

data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
}

parameters {
  real mu;
  real<lower=0> tau;
  real theta[J];
}

model {
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 5);
  theta ~ normal(mu, tau);
  y ~ normal(theta, sigma);
}


Similarly, we can easily implement it in PyMC3:

In [3]:

with pm.Model() as Centered_eight:
    mu = pm.Normal('mu', mu=0, sd=5)
    tau = pm.HalfCauchy('tau', beta=5)
    theta = pm.Normal('theta', mu=mu, sd=tau, shape=J)
    obs = pm.Normal('obs', mu=theta, sd=sigma, observed=y)

Unfortunately, this direct implementation of the model exhibits a
pathological geometry that frustrates geometric ergodicity. Even
more worrisome, the resulting bias is subtle and may not be obvious
upon inspection of the Markov chain alone. To understand this bias,
let's consider first a short Markov chain, commonly used when
computational expediency is a motivating factor, and only afterwards
a longer Markov chain.


### A Dangerously-Short Markov Chain¶

In [4]:

with Centered_eight:
    short_trace = pm.sample(600, chains=2, random_seed=SEED)

100%|██████████| 1100/1100 [00:06<00:00, 182.56it/s]


In the original post a single chain of 1200 samples is used. However, since split $\hat{R}$ is not implemented in PyMC3, we fit 2 chains with 600 samples each instead.

The Gelman-Rubin diagnostic $\hat{R}$ doesn't indicate any problems (values are all close to 1) and the effective sample size per iteration is reasonable.

In [5]:

pm.summary(short_trace)

Out[5]:

| | mean | sd | mc_error | hpd_2.5 | hpd_97.5 | n_eff | Rhat |
|---|---|---|---|---|---|---|---|
| mu | 3.383965 | 3.439588 | 0.246051 | -2.215044 | 9.665151 | 12.0 | 1.050860 |
| theta__0 | 5.329064 | 6.487373 | 0.384821 | -4.061500 | 18.687019 | 20.0 | 1.027115 |
| theta__1 | 4.052428 | 5.046472 | 0.275174 | -4.596014 | 14.348434 | 19.0 | 1.033783 |
| theta__2 | 2.939800 | 5.345882 | 0.282373 | -8.580365 | 12.195760 | 52.0 | 1.022895 |
| theta__3 | 3.640527 | 4.898505 | 0.277777 | -4.994877 | 13.426649 | 19.0 | 1.037943 |
| theta__4 | 2.541175 | 4.616346 | 0.262909 | -5.926081 | 11.796799 | 24.0 | 1.021987 |
| theta__5 | 2.997123 | 4.987137 | 0.257417 | -5.683100 | 13.715136 | 26.0 | 1.020094 |
| theta__6 | 5.557910 | 5.618046 | 0.335931 | -3.972271 | 16.805225 | 18.0 | 1.033176 |
| theta__7 | 3.821625 | 5.606172 | 0.313449 | -6.788711 | 14.301429 | 18.0 | 1.043135 |
| tau | 3.788597 | 3.441395 | 0.246438 | 0.515715 | 9.969811 | 30.0 | 1.014096 |
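
If you prefer to read these diagnostics off directly instead of from the summary table, PyMC3 also exposes them as standalone helpers; a quick sketch (assuming `pm.gelman_rubin` and `pm.effective_n`, the PyMC3 3.x helpers, which return per-variable dictionaries):

# Hedged sketch: pull the convergence diagnostics out as dictionaries
rhat = pm.gelman_rubin(short_trace)   # potential scale reduction factor per variable
ess = pm.effective_n(short_trace)     # effective sample size per variable
print(rhat['tau'], ess['tau'])
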
In [6]:

# plot the trace of log(tau)
pm.traceplot(short_trace, varnames=['tau']);


Unfortunately, the resulting estimate for the mean of $\log(\tau)$ is strongly biased away from the true value, here shown in grey.

In [7]:

# plot the running mean of the MCMC estimate of log(tau)
logtau = short_trace['tau_log__']
mlogtau = [np.mean(logtau[:i]) for i in np.arange(1, len(logtau))]
plt.figure(figsize=(15, 4))
plt.axhline(0.7657852, lw=2.5, color='gray')
plt.plot(mlogtau, lw=2.5)
plt.ylim(0, 2)
plt.xlabel('Iteration')
plt.ylabel('MCMC mean of log(tau)')
plt.title('MCMC estimation of log(tau)');


Hamiltonian Monte Carlo, however, is not so oblivious to these issues: a sizeable fraction of the iterations in our short Markov chains ended with a divergence.

In [8]:

# display the total number and percentage of divergent
divergent = short_trace['diverging']
print('Number of Divergent %d' % divergent.nonzero()[0].size)
divperc = divergent.nonzero()[0].size/len(short_trace)
print('Percentage of Divergent %.5f' % divperc)

Number of Divergent 88
Percentage of Divergent 0.14667


Even with a single short chain these divergences are able to identify the bias and advise skepticism of any resulting MCMC estimators.

Additionally, because the divergent transitions, here shown in green, tend to be located near the pathologies, we can use them to identify the location of the problematic neighborhoods in parameter space.

In [9]:

pm.pairplot(short_trace,
            sub_varnames=['theta_0', 'tau_log__'],
            divergences=True,
            color='red', figsize=(10, 5), kwargs_divergence={'color': 'green'})
plt.title('scatter plot between log(tau) and theta[0]');


It is important to point out that the pathological samples from the trace are not necessarily concentrated at the funnel: when a divergence is encountered, the subtree being constructed is rejected and the transition samples uniformly from the existing discrete trajectory. Consequently, divergent samples will not be located exactly in the region of high curvature.

In PyMC3 we recently implemented a warning system that also saves the information of where the divergence occurred, so you can visualise it directly. To be more precise, the divergence point included in the warning is the point where the problematic leapfrog step started. In some cases the divergence happens somewhere along a leapfrog step (which, strictly speaking, is not a single point), but nonetheless visualizing these points should give a closer approximation of where the funnel is.

Notice that only the first 100 divergences are stored, so that we don't eat up all the memory.

In [10]:

divergent_point = defaultdict(list)

chain_warn = short_trace.report._chain_warnings
for i in range(len(chain_warn)):
    for warning_ in chain_warn[i]:
        if warning_.step is not None and warning_.extra is not None:
            for RV in Centered_eight.free_RVs:
                para_name = RV.name
                divergent_point[para_name].append(warning_.extra[para_name])

for RV in Centered_eight.free_RVs:
    para_name = RV.name
    divergent_point[para_name] = np.asarray(divergent_point[para_name])
ii = 0  # column of theta to pair with log(tau); theta[0] matches the plots below

tau_log_d = divergent_point['tau_log__']
theta0_d = divergent_point['theta'][:, ii]
Ndiv_recorded = len(tau_log_d)

In [11]:

_, ax = plt.subplots(1, 2, figsize=(15, 6), sharex=True, sharey=True)

pm.pairplot(short_trace,
            sub_varnames=['theta_0', 'tau_log__'],
            divergences=True,
            ax=ax[0],
            color='C7', figsize=(12, 10), alpha=0.2,
            kwargs_divergence={'color': 'green'})

plt.title('scatter plot between log(tau) and theta[0]')

pm.pairplot(short_trace,
            sub_varnames=['theta_0', 'tau_log__'],
            divergences=True,
            ax=ax[1],
            color='C7', alpha=0.2, figsize=(12, 10),
            kwargs_divergence={'color': 'green', 'label': 'Divergent samples'})

theta_trace = short_trace['theta']
theta0 = theta_trace[:, 0]

# connect each divergent sample in the trace to its recorded divergence start point
plt.plot([theta0[divergent == 1][:Ndiv_recorded], theta0_d],
         [logtau[divergent == 1][:Ndiv_recorded], tau_log_d],
         'k-', alpha=.25)

plt.scatter(theta0_d, tau_log_d,
            color='r', alpha=.5,
            label='Location of energy error (start location of leapfrog)')

plt.title('scatter plot between log(tau) and theta[0]')
plt.legend();


There are many other ways to explore and visualize the pathological region in the parameter space. For example, we can reproduce Figure 5b in *Visualization in Bayesian workflow*.

In [18]:

tracedf = pm.trace_to_dataframe(short_trace)
plotorder = ['mu', 'tau', 'theta__0', 'theta__1', 'theta__2', 'theta__3', 'theta__4',
             'theta__5', 'theta__6', 'theta__7']
tracedf = tracedf[plotorder]

_, ax = plt.subplots(1, 2, figsize=(15, 4), sharex=True, sharey=True)
ax[0].plot(tracedf.values[divergent == 0].T, color='k', alpha=.025)
ax[0].plot(tracedf.values[divergent == 1].T, color='g', lw=.5)

ax[1].plot(tracedf.values[divergent == 0].T, color='k', alpha=.025)
ax[1].plot(tracedf.values[divergent == 1].T, color='g', lw=.5)
divsp = np.hstack([divergent_point['mu'][:, None],
                   np.exp(divergent_point['tau_log__'])[:, None],
                   divergent_point['theta']])
ax[1].plot(divsp.T, 'r', lw=.5)
plt.ylim([-20, 40])
plt.xticks(range(10), plotorder)
plt.tight_layout();

In [19]:

# A small wrapper function for displaying the MCMC sampler diagnostics as above
def report_trace(trace):
    # plot the trace of tau
    pm.traceplot(trace, varnames=['tau'])

    # plot the running mean of the MCMC estimate of log(tau)
    logtau = np.log(trace['tau'])
    mlogtau = [np.mean(logtau[:i]) for i in np.arange(1, len(logtau))]
    plt.figure(figsize=(15, 4))
    plt.axhline(0.7657852, lw=2.5, color='gray')
    plt.plot(mlogtau, lw=2.5)
    plt.ylim(0, 2)
    plt.xlabel('Iteration')
    plt.ylabel('MCMC mean of log(tau)')
    plt.title('MCMC estimation of log(tau)')
    plt.show()

    # display the total number and percentage of divergent transitions
    divergent = trace['diverging']
    print('Number of Divergent %d' % divergent.nonzero()[0].size)
    divperc = divergent.nonzero()[0].size / len(trace)
    print('Percentage of Divergent %.5f' % divperc)

    # scatter plot between log(tau) and theta[0]
    # for the identification of the problematic neighborhoods in parameter space
    pm.pairplot(trace,
                sub_varnames=['theta_0', 'tau_log__'],
                divergences=True,
                color='red', figsize=(10, 5), kwargs_divergence={'color': 'green'})
    plt.title('scatter plot between log(tau) and theta[0]')
    plt.show()


### A Safer, Longer Markov Chain¶

Given the potential insensitivity of split $\hat{R}$ on single short chains, Stan recommends always running multiple chains as long as possible to have the best chance to observe any obstructions to geometric ergodicity. Because it is not always possible to run long chains for complex models, however, divergences are an incredibly powerful diagnostic for biased MCMC estimation.

In [20]:

with Centered_eight:
    longer_trace = pm.sample(4000, chains=2, tune=1000, random_seed=SEED)

100%|██████████| 5000/5000 [00:25<00:00, 195.76it/s]

In [21]:

report_trace(longer_trace)

Number of Divergent 235
Percentage of Divergent 0.05875

In [22]:

pm.summary(longer_trace)

Out[22]:

| | mean | sd | mc_error | hpd_2.5 | hpd_97.5 | n_eff | Rhat |
|---|---|---|---|---|---|---|---|
| mu | 4.606617 | 3.298700 | 0.093574 | -1.835694 | 10.949581 | 1041.0 | 1.004609 |
| theta__0 | 6.650487 | 5.684370 | 0.130767 | -4.350847 | 18.279303 | 1996.0 | 0.999959 |
| theta__1 | 5.124756 | 4.785258 | 0.097833 | -4.239980 | 14.946820 | 2104.0 | 1.001742 |
| theta__2 | 4.086419 | 5.397922 | 0.116412 | -7.351793 | 14.006299 | 1760.0 | 1.002782 |
| theta__3 | 5.018538 | 4.913245 | 0.100900 | -4.401920 | 15.361819 | 2075.0 | 1.000902 |
| theta__4 | 3.784149 | 4.748403 | 0.111546 | -5.991161 | 12.989582 | 1475.0 | 1.002856 |
| theta__5 | 4.227186 | 4.865038 | 0.110039 | -5.914311 | 13.814640 | 1771.0 | 1.003856 |
| theta__6 | 6.706690 | 5.166720 | 0.114938 | -3.302674 | 17.364571 | 1848.0 | 1.000102 |
| theta__7 | 5.065655 | 5.382678 | 0.102710 | -6.738087 | 15.289560 | 2285.0 | 1.001107 |
| tau | 3.946011 | 3.150152 | 0.123383 | 0.594162 | 10.245412 | 540.0 | 1.001870 |

More details can be found in Betancourt’s recent paper.

Divergences in Hamiltonian Monte Carlo arise when the Hamiltonian transition encounters regions of extremely large curvature, such as the opening of the hierarchical funnel. Unable to accurately resolve these regions, the transition malfunctions and flies off towards infinity. With the transitions unable to completely explore these regions of extreme curvature, we lose geometric ergodicity and our MCMC estimators become biased.

The algorithm implemented in Stan uses a heuristic to quickly identify these misbehaving trajectories, and hence label divergences, without having to wait for them to run all the way to infinity. This heuristic can be a bit aggressive, however, and sometimes labels transitions as divergent even when we have not lost geometric ergodicity.

To resolve this potential ambiguity we can adjust the step size, $\epsilon$, of the Hamiltonian transition. The smaller the step size the more accurate the trajectory and the less likely it will be mislabeled as a divergence. In other words, if we have geometric ergodicity between the Hamiltonian transition and the target distribution then decreasing the step size will reduce and then ultimately remove the divergences entirely. If we do not have geometric ergodicity, however, then decreasing the step size will not completely remove the divergences.

Like Stan, the step size in PyMC3 is tuned automatically during the warm-up (tuning) phase, but we can coerce smaller step sizes by tweaking the configuration of PyMC3's adaptation routine. In particular, we can increase the `target_accept` parameter from its default value of 0.8 closer to its maximum value of 1.
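
The four sampling cells below perform this scan one value at a time (their recorded outputs are kept); a compact, hypothetical version of the same scan would look roughly like this:

# Hedged sketch of the target_accept scan run cell-by-cell below
# (the dictionary and loop variable names here are illustrative only).
divergence_counts = {}
with Centered_eight:
    for ta in [0.85, 0.90, 0.95, 0.99]:
        trace_ta = pm.sample(5000, chains=2, tune=2000,
                             nuts_kwargs=dict(target_accept=ta))
        divergence_counts[ta] = trace_ta['diverging'].sum()
print(divergence_counts)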

In [23]:

with Centered_eight:
    fit_cp85 = pm.sample(5000, chains=2, tune=2000,
                         nuts_kwargs=dict(target_accept=.85))

100%|██████████| 7000/7000 [00:38<00:00, 181.13it/s]

In [24]:

with Centered_eight:
    fit_cp90 = pm.sample(5000, chains=2, tune=2000,
                         nuts_kwargs=dict(target_accept=.90))

100%|██████████| 7000/7000 [00:58<00:00, 119.01it/s]

In [25]:

with Centered_eight:
    fit_cp95 = pm.sample(5000, chains=2, tune=2000,
                         nuts_kwargs=dict(target_accept=.95))

100%|██████████| 7000/7000 [00:59<00:00, 117.28it/s]

In [26]:

with Centered_eight:
    fit_cp99 = pm.sample(5000, chains=2, tune=2000,
                         nuts_kwargs=dict(target_accept=.99))

100%|██████████| 7000/7000 [03:02<00:00, 38.40it/s]

In [27]:

df = pd.DataFrame([longer_trace['step_size'].mean(),
                   fit_cp85['step_size'].mean(),
                   fit_cp90['step_size'].mean(),
                   fit_cp95['step_size'].mean(),
                   fit_cp99['step_size'].mean()], columns=['Step_size'])
df['Divergent'] = pd.Series([longer_trace['diverging'].sum(),
                             fit_cp85['diverging'].sum(),
                             fit_cp90['diverging'].sum(),
                             fit_cp95['diverging'].sum(),
                             fit_cp99['diverging'].sum()])
df['delta_target'] = pd.Series(['.80', '.85', '.90', '.95', '.99'])
df

Out[27]:

| | Step_size | Divergent | delta_target |
|---|---|---|---|
| 0 | 0.168369 | 235 | .80 |
| 1 | 0.164436 | 360 | .85 |
| 2 | 0.184810 | 115 | .90 |
| 3 | 0.152659 | 95 | .95 |
| 4 | 0.043369 | 9 | .99 |

Here, the number of divergent transitions dropped dramatically when `target_accept` (delta) was increased to 0.99.

> This behavior also has a nice geometric intuition. The more we decrease the step size the more the Hamiltonian Markov chain can explore the neck of the funnel. Consequently, the marginal posterior distribution for $\log(\tau)$ stretches further and further towards negative values with the decreasing step size.
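
One rough way to see this stretching numerically (assuming the traces fitted above are still in memory) is to compare how deep into the funnel each run reached, e.g. via the smallest sampled value of $\log(\tau)$:

# Rough check: deeper exploration of the funnel shows up as smaller minima of log(tau)
for label, trace in [('delta=0.80', longer_trace), ('delta=0.85', fit_cp85),
                     ('delta=0.90', fit_cp90), ('delta=0.95', fit_cp95),
                     ('delta=0.99', fit_cp99)]:
    print(label, trace['tau_log__'].min())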

Since after tuning we have a smaller step size in PyMC3 than in Stan, the geometry is better explored.

> However, the Hamiltonian transition is still not geometrically ergodic with respect to the centered implementation of the Eight Schools model. Indeed, this is expected given the observed bias.

In [28]:

_, ax = plt.subplots(1, 1, figsize=(10, 6))

pm.pairplot(fit_cp99, sub_varnames=['theta_1', 'tau_log__'], ax=ax,
            color='red', alpha=0.5, label='Centered, delta=0.99')

pm.pairplot(longer_trace, sub_varnames=['theta_1', 'tau_log__'], ax=ax,
            color='orange', alpha=0.5, label='Centered, delta=0.80')

plt.title('scatter plot between log(tau) and theta[1]')
plt.legend();

In [29]:

logtau0 = longer_trace['tau_log__']
logtau2 = np.log(fit_cp90['tau'])
logtau1 = fit_cp99['tau_log__']

plt.figure(figsize=(15, 4))
plt.axhline(0.7657852, lw=2.5, color='gray')
mlogtau0 = [np.mean(logtau0[:i]) for i in np.arange(1, len(logtau0))]
plt.plot(mlogtau0, label='Centered, delta=0.80', lw=2.5)
mlogtau2 = [np.mean(logtau2[:i]) for i in np.arange(1, len(logtau2))]
plt.plot(mlogtau2, label='Centered, delta=0.90', lw=2.5)
mlogtau1 = [np.mean(logtau1[:i]) for i in np.arange(1, len(logtau1))]
plt.plot(mlogtau1, label='Centered, delta=0.99', lw=2.5)
plt.ylim(0, 2)
plt.xlabel('Iteration')
plt.ylabel('MCMC mean of log(tau)')
plt.title('MCMC estimation of log(tau)')
plt.legend();


## A Non-Centered Eight Schools Implementation¶

Although reducing the step size improves exploration, ultimately it only reveals the true extent of the pathology in the centered implementation. Fortunately, there is another way to implement hierarchical models that does not suffer from the same pathologies.

In a non-centered parameterization we do not try to fit the group-level parameters directly; rather, we fit a latent Gaussian variable from which we can recover the group-level parameters with a scaling and a translation.

$\mu \sim \mathcal{N}(0, 5)$
$\tau \sim \text{Half-Cauchy}(0, 5)$
$\tilde{\theta}_{n} \sim \mathcal{N}(0, 1)$
$\theta_{n} = \mu + \tau \cdot \tilde{\theta}_{n},$

with $y_{n} \sim \mathcal{N}(\theta_{n}, \sigma_{n})$ given as data, exactly as before.
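
The scaling-and-translation identity is easy to sanity-check numerically; a toy sketch (illustrative values, independent of the model code below) showing that $\mu + \tau \tilde{\theta}$ with $\tilde{\theta} \sim \mathcal{N}(0, 1)$ is distributed as $\mathcal{N}(\mu, \tau)$:

# Toy check with illustrative values: mu + tau * theta_tilde ~ Normal(mu, tau)
mu_val, tau_val = 4.0, 3.0
theta_tilde_draws = np.random.randn(100000)
theta_draws = mu_val + tau_val * theta_tilde_draws
print(theta_draws.mean(), theta_draws.std())   # approximately 4.0 and 3.0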

Stan model:

data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
}

parameters {
  real mu;
  real<lower=0> tau;
  real theta_tilde[J];
}

transformed parameters {
  real theta[J];
  for (j in 1:J)
    theta[j] = mu + tau * theta_tilde[j];
}

model {
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 5);
  theta_tilde ~ normal(0, 1);
  y ~ normal(theta, sigma);
}

In [30]:

with pm.Model() as NonCentered_eight:
    mu = pm.Normal('mu', mu=0, sd=5)
    tau = pm.HalfCauchy('tau', beta=5)
    theta_tilde = pm.Normal('theta_t', mu=0, sd=1, shape=J)
    theta = pm.Deterministic('theta', mu + tau * theta_tilde)
    obs = pm.Normal('obs', mu=theta, sd=sigma, observed=y)

In [31]:

with NonCentered_eight:
    fit_ncp80 = pm.sample(5000, chains=2, tune=1000, random_seed=SEED,
                          nuts_kwargs=dict(target_accept=.80))

100%|██████████| 6000/6000 [00:14<00:00, 434.07it/s]

In [32]:

pm.summary(fit_ncp80)

Out[32]:

| | mean | sd | mc_error | hpd_2.5 | hpd_97.5 | n_eff | Rhat |
|---|---|---|---|---|---|---|---|
| mu | 4.427166 | 3.354160 | 0.031689 | -1.881304 | 11.230099 | 10000.0 | 0.999907 |
| theta_t__0 | 0.309566 | 0.979113 | 0.009734 | -1.594355 | 2.241825 | 10000.0 | 0.999905 |
| theta_t__1 | 0.079461 | 0.935187 | 0.009046 | -1.837403 | 1.870839 | 10000.0 | 1.000031 |
| theta_t__2 | -0.088198 | 0.955461 | 0.009036 | -2.037183 | 1.749658 | 10000.0 | 0.999921 |
| theta_t__3 | 0.056894 | 0.929890 | 0.008755 | -1.734245 | 1.913276 | 10000.0 | 0.999901 |
| theta_t__4 | -0.173858 | 0.941583 | 0.009589 | -2.063461 | 1.691885 | 10000.0 | 1.000185 |
| theta_t__5 | -0.080808 | 0.931005 | 0.007813 | -1.873244 | 1.772136 | 10000.0 | 0.999933 |
| theta_t__6 | 0.359448 | 0.967399 | 0.010466 | -1.631959 | 2.193805 | 10000.0 | 0.999941 |
| theta_t__7 | 0.073983 | 0.969540 | 0.009227 | -1.795899 | 2.043971 | 10000.0 | 0.999911 |
| tau | 3.592946 | 3.195408 | 0.039796 | 0.000209 | 9.859548 | 6498.0 | 1.000008 |
| theta__0 | 6.147026 | 5.436518 | 0.059122 | -3.977321 | 17.513389 | 9481.0 | 0.999939 |
| theta__1 | 4.937034 | 4.719273 | 0.045782 | -4.269079 | 14.730716 | 10000.0 | 0.999941 |
| theta__2 | 3.927348 | 5.198048 | 0.053749 | -6.261090 | 14.095719 | 9240.0 | 0.999900 |
| theta__3 | 4.785853 | 4.717095 | 0.045999 | -4.365430 | 14.709010 | 10000.0 | 0.999900 |
| theta__4 | 3.587214 | 4.755796 | 0.048571 | -5.668259 | 13.039226 | 10000.0 | 0.999996 |
| theta__5 | 4.057664 | 4.819400 | 0.046848 | -5.736878 | 13.614134 | 10000.0 | 0.999913 |
| theta__6 | 6.334858 | 5.142805 | 0.052402 | -2.915145 | 17.254603 | 10000.0 | 0.999902 |
| theta__7 | 4.865761 | 5.379749 | 0.057599 | -5.168168 | 16.178869 | 9525.0 | 0.999900 |

In [33]:

report_trace(fit_ncp80)

Number of Divergent 12
Percentage of Divergent 0.00240


As expected of false positives, we can remove the divergences entirely by decreasing the step size:

In [34]:

with NonCentered_eight:
    fit_ncp90 = pm.sample(5000, chains=2, tune=1000, random_seed=SEED,
                          nuts_kwargs=dict(target_accept=.90))

# display the total number and percentage of divergent transitions
divergent = fit_ncp90['diverging']
print('Number of Divergent %d' % divergent.nonzero()[0].size)

100%|██████████| 6000/6000 [00:21<00:00, 280.50it/s]

Number of Divergent 0


The more agreeable geometry of the non-centered implementation allows the Markov chain to explore deep into the neck of the funnel, capturing even the smallest values of $\tau$ that are consistent with the measurements. Consequently, MCMC estimators from the non-centered chain rapidly converge towards their true expectation values.

In [35]:

_, ax = plt.subplots(1, 1, figsize=(10, 6))

pm.pairplot(fit_ncp80, sub_varnames=['theta_0', 'tau_log__'], ax=ax,
            color='blue', alpha=0.5, label='Non-Centered, delta=0.80')

pm.pairplot(fit_cp99, sub_varnames=['theta_0', 'tau_log__'], ax=ax,
            color='red', alpha=0.5, label='Centered, delta=0.99')

pm.pairplot(fit_cp90, sub_varnames=['theta_0', 'tau_log__'], ax=ax,
            color='orange', alpha=0.5, label='Centered, delta=0.90')
plt.title('scatter plot between log(tau) and theta[0]')
plt.legend();


In [36]:

logtaun = fit_ncp80['tau_log__']

plt.figure(figsize=(15, 4))
plt.axhline(0.7657852, lw=2.5, color='gray')
mlogtaun = [np.mean(logtaun[:i]) for i in np.arange(1, len(logtaun))]
plt.plot(mlogtaun, color='b', lw=2.5, label='Non-Centered, delta=0.80')

mlogtau1 = [np.mean(logtau1[:i]) for i in np.arange(1, len(logtau1))]
plt.plot(mlogtau1, color='r', lw=2.5, label='Centered, delta=0.99')

mlogtau0 = [np.mean(logtau0[:i]) for i in np.arange(1, len(logtau0))]
plt.plot(mlogtau0, color=[1, 0.5, 0], lw=2.5, label='Centered, delta=0.80')
plt.ylim(0, 2)
plt.xlabel('Iteration')
plt.ylabel('MCMC mean of log(tau)')
plt.title('MCMC estimation of log(tau)')
plt.legend();