# A Primer on Bayesian Methods for Multilevel Modeling¶

Hierarchical or multilevel modeling is a generalization of regression modeling.

*Multilevel models* are regression models in which the constituent model
parameters are given **probability models**. This implies that model
parameters are allowed to **vary by group**.

Observational units are often naturally **clustered**. Clustering
induces dependence between observations, despite random sampling of
clusters and random sampling within clusters.

A *hierarchical model* is a particular multilevel model where parameters
are nested within one another.

Some multilevel structures are not hierarchical.

- e.g. “country” and “year” are not nested, but may represent separate, but overlapping, clusters of parameters

We will motivate this topic using an environmental epidemiology example.

## Example: Radon contamination (Gelman and Hill 2006)¶

Radon is a radioactive gas that enters homes through contact points with the ground. It is a carcinogen that is the primary cause of lung cancer in non-smokers. Radon levels vary greatly from household to household.

The EPA did a study of radon levels in 80,000 houses. Two important predictors:

- measurement in basement or first floor (radon higher in basements)
- county uranium level (positive correlation with radon levels)

We will focus on modeling radon levels in Minnesota.

The hierarchy in this example is households within county.

## Data organization¶

First, we import the data from a local file, and extract Minnesota’s data.

```
In [1]:
```

```
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pymc3 import __version__, get_data

plt.style.use('seaborn-darkgrid')
print('Running on PyMC3 v{}'.format(__version__))

# Import radon data
srrs2 = pd.read_csv(get_data('srrs2.dat'))
srrs2.columns = srrs2.columns.map(str.strip)
srrs_mn = srrs2[srrs2.state=='MN'].copy()
```

```
Running on PyMC3 v3.4.1
```

Next, obtain the county-level predictor, uranium, by combining two variables.

```
In [2]:
```

```
srrs_mn['fips'] = srrs_mn.stfips*1000 + srrs_mn.cntyfips
cty = pd.read_csv(get_data('cty.dat'))
cty_mn = cty[cty.st=='MN'].copy()
cty_mn['fips'] = 1000*cty_mn.stfips + cty_mn.ctfips
```

Use the `merge` method to combine home- and county-level information in a single DataFrame.

```
In [3]:
```

```
srrs_mn = srrs_mn.merge(cty_mn[['fips', 'Uppm']], on='fips')
srrs_mn = srrs_mn.drop_duplicates(subset='idnum')
u = np.log(srrs_mn.Uppm)
n = len(srrs_mn)
```

```
In [4]:
```

```
srrs_mn.head()
```

```
Out[4]:
```

| | idnum | state | state2 | stfips | zip | region | typebldg | floor | room | basement | ... | stopdt | activity | pcterr | adjwt | dupflag | zipflag | cntyfips | county | fips | Uppm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5081 | MN | MN | 27 | 55735 | 5 | 1 | 1 | 3 | N | ... | 12288 | 2.2 | 9.7 | 1146.499190 | 1 | 0 | 1 | AITKIN | 27001 | 0.502054 |
| 1 | 5082 | MN | MN | 27 | 55748 | 5 | 1 | 0 | 4 | Y | ... | 12088 | 2.2 | 14.5 | 471.366223 | 0 | 0 | 1 | AITKIN | 27001 | 0.502054 |
| 2 | 5083 | MN | MN | 27 | 55748 | 5 | 1 | 0 | 4 | Y | ... | 21188 | 2.9 | 9.6 | 433.316718 | 0 | 0 | 1 | AITKIN | 27001 | 0.502054 |
| 3 | 5084 | MN | MN | 27 | 56469 | 5 | 1 | 0 | 4 | Y | ... | 123187 | 1.0 | 24.3 | 461.623670 | 0 | 0 | 1 | AITKIN | 27001 | 0.502054 |
| 4 | 5085 | MN | MN | 27 | 55011 | 3 | 1 | 0 | 4 | Y | ... | 13088 | 3.1 | 13.8 | 433.316718 | 0 | 0 | 3 | ANOKA | 27003 | 0.428565 |

5 rows × 27 columns

We also need a lookup table (`dict`) for each unique county, for indexing.

```
In [5]:
```

```
srrs_mn.county = srrs_mn.county.map(str.strip)
mn_counties = srrs_mn.county.unique()
counties = len(mn_counties)
county_lookup = dict(zip(mn_counties, range(len(mn_counties))))
```

Finally, create local copies of variables.

```
In [6]:
```

```
county = srrs_mn['county_code'] = srrs_mn.county.replace(county_lookup).values
radon = srrs_mn.activity
srrs_mn['log_radon'] = log_radon = np.log(radon + 0.1).values
floor_measure = srrs_mn.floor.values
```

Distribution of radon levels in MN (log scale):

```
In [7]:
```

```
srrs_mn.activity.apply(lambda x: np.log(x+0.1)).hist(bins=25);
```

### Conventional approaches¶

The two conventional alternatives to modeling radon exposure represent the two extremes of the bias-variance tradeoff:

***Complete pooling***:

Treat all counties the same, and estimate a single radon level.

\[y_i = \alpha + \beta x_i + \epsilon_i\]

***No pooling***:

Model radon in each county independently.

\[y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i\]

where \(j = 1,\ldots,85\) indexes the counties.

The errors \(\epsilon_i\) may represent measurement error, temporal within-house variation, or variation among houses.

Here are the point estimates of the slope and intercept for the complete pooling model:

```
In [8]:
```

```
from pymc3 import Model, sample, Normal, HalfCauchy, Uniform, model_to_graphviz
floor = srrs_mn.floor.values
log_radon = srrs_mn.log_radon.values
with Model() as pooled_model:
    beta = Normal('beta', 0, sd=1e5, shape=2)
    sigma = HalfCauchy('sigma', 5)
    theta = beta[0] + beta[1]*floor
    y = Normal('y', theta, sd=sigma, observed=log_radon)
model_to_graphviz(pooled_model)
```

```
Out[8]:
```

```
In [9]:
```

```
with pooled_model:
    pooled_trace = sample(1000, tune=1000)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, beta]
Sampling 4 chains: 100%|██████████| 8000/8000 [00:02<00:00, 3192.66draws/s]
```

```
In [10]:
```

```
b0, m0 = pooled_trace['beta'].mean(axis=0)
```

```
In [11]:
```

```
plt.scatter(srrs_mn.floor, np.log(srrs_mn.activity+0.1))
xvals = np.linspace(-0.2, 1.2)
plt.plot(xvals, m0*xvals+b0, 'r--');
```

Estimates of county radon levels for the unpooled model:

```
In [12]:
```

```
with Model() as unpooled_model:
    beta0 = Normal('beta0', 0, sd=1e5, shape=counties)
    beta1 = Normal('beta1', 0, sd=1e5)
    sigma = HalfCauchy('sigma', 5)
    theta = beta0[county] + beta1*floor
    y = Normal('y', theta, sd=sigma, observed=log_radon)
model_to_graphviz(unpooled_model)
```

```
Out[12]:
```

```
In [13]:
```

```
with unpooled_model:
    unpooled_trace = sample(1000, tune=1000)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, beta1, beta0]
Sampling 4 chains: 100%|██████████| 8000/8000 [00:06<00:00, 1164.07draws/s]
```

Here are the unpooled county expected values:

```
In [14]:
```

```
from pymc3 import forestplot
plt.figure(figsize=(6,14))
forestplot(unpooled_trace, varnames=['beta0']);
```

```
In [15]:
```

```
unpooled_estimates = pd.Series(unpooled_trace['beta0'].mean(axis=0), index=mn_counties)
unpooled_se = pd.Series(unpooled_trace['beta0'].std(axis=0), index=mn_counties)
```

We can plot the ordered estimates to identify counties with high radon levels:

```
In [16]:
```

```
order = unpooled_estimates.sort_values().index
plt.scatter(range(len(unpooled_estimates)), unpooled_estimates[order])
for i, m, se in zip(range(len(unpooled_estimates)), unpooled_estimates[order], unpooled_se[order]):
    plt.plot([i,i], [m-se, m+se], 'b-')
plt.xlim(-1,86); plt.ylim(-1,4)
plt.ylabel('Radon estimate');plt.xlabel('Ordered county');
```

Here are visual comparisons between the pooled and unpooled estimates for a subset of counties representing a range of sample sizes.

```
In [17]:
```

```
sample_counties = ('LAC QUI PARLE', 'AITKIN', 'KOOCHICHING',
'DOUGLAS', 'CLAY', 'STEARNS', 'RAMSEY', 'ST LOUIS')
fig, axes = plt.subplots(2, 4, figsize=(12, 6), sharey=True, sharex=True)
axes = axes.ravel()
m = unpooled_trace['beta1'].mean()
for i,c in enumerate(sample_counties):
    y = srrs_mn.log_radon[srrs_mn.county==c]
    x = srrs_mn.floor[srrs_mn.county==c]
    axes[i].scatter(x + np.random.randn(len(x))*0.01, y, alpha=0.4)
    # No pooling model
    b = unpooled_estimates[c]
    # Plot both models and data
    xvals = np.linspace(-0.2, 1.2)
    axes[i].plot(xvals, m*xvals+b)
    axes[i].plot(xvals, m0*xvals+b0, 'r--')
    axes[i].set_xticks([0,1])
    axes[i].set_xticklabels(['basement', 'floor'])
    axes[i].set_ylim(-1, 3)
    axes[i].set_title(c)
    if not i%2:
        axes[i].set_ylabel('log radon level')
```

Neither of these models is satisfactory:

- if we are trying to identify high-radon counties, pooling is useless
- we do not trust extreme unpooled estimates produced by models using few observations

### Multilevel and hierarchical models¶

When we pool our data, we imply that they are sampled from the same model. This ignores any variation among sampling units (other than sampling variance).

When we analyze data unpooled, we imply that they are sampled independently from separate models. At the opposite extreme from the pooled case, this approach claims that differences between sampling units are too large to combine them.

In a hierarchical model, parameters are viewed as a sample from a
population distribution of parameters. Thus, we view them as being
neither entirely different nor exactly the same. This is ***partial
pooling***.

We can use PyMC to easily specify multilevel models, and fit them using Markov chain Monte Carlo.

### Partial pooling model¶

The simplest partial pooling model for the household radon dataset is one which simply estimates radon levels, without any predictors at any level. A partial pooling model represents a compromise between the pooled and unpooled extremes, approximately a weighted average (based on sample size) of the unpooled county estimates and the pooled estimates.

Estimates for counties with smaller sample sizes will shrink towards the state-wide average.

Estimates for counties with larger sample sizes will be closer to the unpooled county estimates.
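The weighted-average intuition can be made concrete with a back-of-the-envelope sketch. Under a normal model, the partially pooled estimate for a county is approximately a precision-weighted average of that county's sample mean and the state mean; the numbers below are illustrative values, not fitted ones.

```python
def partial_pool(ybar_j, n_j, mu, sigma_y, sigma_a):
    """Approximate partially pooled estimate for one county:
    a precision-weighted average of the county sample mean ybar_j
    (n_j observations, residual sd sigma_y) and the state mean mu
    (between-county sd sigma_a)."""
    w = (n_j / sigma_y**2) / (n_j / sigma_y**2 + 1 / sigma_a**2)
    return w * ybar_j + (1 - w) * mu

# Illustrative values: state mean 1.3, a county whose sample mean is 2.5
small_county = partial_pool(ybar_j=2.5, n_j=2, mu=1.3, sigma_y=0.8, sigma_a=0.3)
large_county = partial_pool(ybar_j=2.5, n_j=100, mu=1.3, sigma_y=0.8, sigma_a=0.3)

# With 2 houses the estimate is shrunk strongly toward the state mean;
# with 100 houses it stays close to the county's own mean.
print(round(small_county, 2), round(large_county, 2))
```

The weight \(w\) grows with the county sample size, which is exactly the shrinkage behavior described above.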

```
In [18]:
```

```
with Model() as partial_pooling:
    # Priors
    mu_a = Normal('mu_a', mu=0., sd=1e5)
    sigma_a = HalfCauchy('sigma_a', 5)
    # Random intercepts
    a = Normal('a', mu=mu_a, sd=sigma_a, shape=counties)
    # Model error
    sigma_y = HalfCauchy('sigma_y', 5)
    # Expected value
    y_hat = a[county]
    # Data likelihood
    y_like = Normal('y_like', mu=y_hat, sd=sigma_y, observed=log_radon)
model_to_graphviz(partial_pooling)
```

```
Out[18]:
```

```
In [19]:
```

```
with partial_pooling:
    partial_pooling_trace = sample(1000, tune=1000)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma_y, a, sigma_a, mu_a]
Sampling 4 chains: 100%|██████████| 8000/8000 [00:10<00:00, 732.78draws/s]
The number of effective samples is smaller than 25% for some parameters.
```

```
In [20]:
```

```
sample_trace = partial_pooling_trace['a']
fig, axes = plt.subplots(1, 2, figsize=(14,6), sharex=True, sharey=True)
samples, counties = sample_trace.shape
jitter = np.random.normal(scale=0.1, size=counties)
n_county = srrs_mn.groupby('county')['idnum'].count()
unpooled_means = srrs_mn.groupby('county')['log_radon'].mean()
unpooled_sd = srrs_mn.groupby('county')['log_radon'].std()
unpooled = pd.DataFrame({'n':n_county, 'm':unpooled_means, 'sd':unpooled_sd})
unpooled['se'] = unpooled.sd/np.sqrt(unpooled.n)
axes[0].plot(unpooled.n + jitter, unpooled.m, 'b.')
for j, row in zip(jitter, unpooled.iterrows()):
    name, dat = row
    axes[0].plot([dat.n+j,dat.n+j], [dat.m-dat.se, dat.m+dat.se], 'b-')
axes[0].set_xscale('log')
axes[0].hlines(sample_trace.mean(), 0.9, 100, linestyles='--')
means = sample_trace.mean(axis=0)
sd = sample_trace.std(axis=0)
axes[1].scatter(n_county.values + jitter, means)
axes[1].set_xscale('log')
axes[1].set_xlim(1,100)
axes[1].set_ylim(0, 3)
axes[1].hlines(sample_trace.mean(), 0.9, 100, linestyles='--')
for j,n,m,s in zip(jitter, n_county.values, means, sd):
    axes[1].plot([n+j]*2, [m-s, m+s], 'b-')
```

Notice the difference between the unpooled and partially-pooled estimates, particularly at smaller sample sizes. The former are both more extreme and more imprecise.

### Varying intercept model¶

This model allows intercepts to vary across county, according to a random effect:

\[y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i\]

where

\[\epsilon_i \sim N(0, \sigma_y^2)\]

and the intercept random effect:

\[\alpha_{j[i]} \sim N(\mu_{\alpha}, \sigma_{\alpha}^2)\]

As with the “no-pooling” model, we set a separate intercept for each
county, but rather than fitting separate least squares regression models
for each county, multilevel modeling **shares strength** among counties,
allowing for more reasonable inference in counties with little data.

```
In [21]:
```

```
with Model() as varying_intercept:
    # Priors
    mu_a = Normal('mu_a', mu=0., tau=0.0001)
    sigma_a = HalfCauchy('sigma_a', 5)
    # Random intercepts
    a = Normal('a', mu=mu_a, sd=sigma_a, shape=counties)
    # Common slope
    b = Normal('b', mu=0., sd=1e5)
    # Model error
    sd_y = HalfCauchy('sd_y', 5)
    # Expected value
    y_hat = a[county] + b * floor_measure
    # Data likelihood
    y_like = Normal('y_like', mu=y_hat, sd=sd_y, observed=log_radon)
model_to_graphviz(varying_intercept)
```

```
Out[21]:
```

We can fit the above model using MCMC.

```
In [22]:
```

```
with varying_intercept:
    varying_intercept_trace = sample(1000, tune=1000)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sd_y, b, a, sigma_a, mu_a]
Sampling 4 chains: 100%|██████████| 8000/8000 [00:10<00:00, 771.10draws/s]
```

```
In [23]:
```

```
from pymc3 import forestplot, traceplot, plot_posterior
plt.figure(figsize=(6,14))
forestplot(varying_intercept_trace, varnames=['a']);
```

```
In [24]:
```

```
plot_posterior(varying_intercept_trace, varnames=['sigma_a', 'b']);
```

The estimate for the `floor` coefficient is approximately -0.66, which
can be interpreted as houses without basements having about half
(\(\exp(-0.66) = 0.52\)) the radon levels of those with basements,
after accounting for county.
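The multiplicative reading follows from the model being linear on the log scale; here is a quick arithmetic check (the -0.66 value is the posterior mean quoted above, and varies slightly between runs).

```python
import math

b = -0.66  # posterior mean of the floor coefficient
# An additive effect on log(radon) is a multiplicative effect on radon
ratio = math.exp(b)
print(round(ratio, 2))  # about 0.52
```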

```
In [25]:
```

```
from pymc3 import summary
summary(varying_intercept_trace, varnames=['b'])
```

```
Out[25]:
```

| | mean | sd | mc_error | hpd_2.5 | hpd_97.5 | n_eff | Rhat |
|---|---|---|---|---|---|---|---|
| b | -0.663514 | 0.068519 | 0.001185 | -0.799726 | -0.533691 | 3035.212365 | 0.999866 |

```
In [26]:
```

```
xvals = np.arange(2)
bp = varying_intercept_trace['a'].mean(axis=0)
mp = varying_intercept_trace['b'].mean()
for bi in bp:
    plt.plot(xvals, mp*xvals + bi, 'bo-', alpha=0.4)
plt.xlim(-0.1,1.1);
```

It is easy to show that the partial pooling model provides objectively more reasonable estimates than either the pooled or unpooled models, at least for counties with small sample sizes.

```
In [27]:
```

```
fig, axes = plt.subplots(2, 4, figsize=(12, 6), sharey=True, sharex=True)
axes = axes.ravel()
for i,c in enumerate(sample_counties):
    # Plot county data
    y = srrs_mn.log_radon[srrs_mn.county==c]
    x = srrs_mn.floor[srrs_mn.county==c]
    axes[i].scatter(x + np.random.randn(len(x))*0.01, y, alpha=0.4)
    # No pooling model
    m = unpooled_trace['beta1'].mean()
    b = unpooled_estimates[c]
    xvals = np.linspace(-0.2, 1.2)
    # Unpooled estimate
    axes[i].plot(xvals, m*xvals+b)
    # Pooled estimate
    axes[i].plot(xvals, m0*xvals+b0, 'r--')
    # Partial pooling estimate
    axes[i].plot(xvals, mp*xvals+bp[county_lookup[c]], 'k:')
    axes[i].set_xticks([0,1])
    axes[i].set_xticklabels(['basement', 'floor'])
    axes[i].set_ylim(-1, 3)
    axes[i].set_title(c)
    if not i%2:
        axes[i].set_ylabel('log radon level')
```


### Varying slope model¶

Alternatively, we can posit a model that allows the counties to vary according to how the location of measurement (basement or floor) influences the radon reading.

```
In [28]:
```

```
with Model() as varying_slope:
    # Priors
    mu_b = Normal('mu_b', mu=0., sd=1e5)
    sigma_b = HalfCauchy('sigma_b', 5)
    # Common intercept
    a = Normal('a', mu=0., sd=1e5)
    # Random slopes
    b = Normal('b', mu=mu_b, sd=sigma_b, shape=counties)
    # Model error
    sigma_y = HalfCauchy('sigma_y', 5)
    # Expected value
    y_hat = a + b[county] * floor_measure
    # Data likelihood
    y_like = Normal('y_like', mu=y_hat, sd=sigma_y, observed=log_radon)
model_to_graphviz(varying_slope)
```

```
Out[28]:
```

```
In [29]:
```

```
with varying_slope:
    varying_slope_trace = sample(1000, tune=1000)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma_y, b, a, sigma_b, mu_b]
Sampling 4 chains: 100%|██████████| 8000/8000 [00:22<00:00, 363.62draws/s]
There were 1 divergences after tuning. Increase `target_accept` or reparameterize.
There were 71 divergences after tuning. Increase `target_accept` or reparameterize.
There were 1 divergences after tuning. Increase `target_accept` or reparameterize.
There were 32 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.5938732117578109, but should be close to 0.8. Try to increase the number of tuning steps.
The estimated number of effective samples is smaller than 200 for some parameters.
```

```
In [30]:
```

```
plt.figure(figsize=(6,14))
forestplot(varying_slope_trace, varnames=['b']);
```

```
In [31]:
```

```
xvals = np.arange(2)
b = varying_slope_trace['a'].mean()
m = varying_slope_trace['b'].mean(axis=0)
for mi in m:
    plt.plot(xvals, mi*xvals + b, 'bo-', alpha=0.4)
plt.xlim(-0.2, 1.2);
```

### Varying intercept and slope model¶

The most general model allows both the intercept and slope to vary by county:

```
In [32]:
```

```
with Model() as varying_intercept_slope:
    # Priors
    mu_a = Normal('mu_a', mu=0., sd=1e5)
    sigma_a = HalfCauchy('sigma_a', 5)
    mu_b = Normal('mu_b', mu=0., sd=1e5)
    sigma_b = HalfCauchy('sigma_b', 5)
    # Random intercepts
    a = Normal('a', mu=mu_a, sd=sigma_a, shape=counties)
    # Random slopes
    b = Normal('b', mu=mu_b, sd=sigma_b, shape=counties)
    # Model error
    sigma_y = Uniform('sigma_y', lower=0, upper=100)
    # Expected value
    y_hat = a[county] + b[county] * floor_measure
    # Data likelihood
    y_like = Normal('y_like', mu=y_hat, sd=sigma_y, observed=log_radon)
model_to_graphviz(varying_intercept_slope)
```

```
Out[32]:
```

```
In [33]:
```

```
with varying_intercept_slope:
    varying_intercept_slope_trace = sample(1000, tune=1000)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma_y, b, a, sigma_b, mu_b, sigma_a, mu_a]
Sampling 4 chains: 100%|██████████| 8000/8000 [00:26<00:00, 305.07draws/s]
There were 35 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.7018759674375128, but should be close to 0.8. Try to increase the number of tuning steps.
There were 4 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.6832205781644568, but should be close to 0.8. Try to increase the number of tuning steps.
The estimated number of effective samples is smaller than 200 for some parameters.
```

```
In [34]:
```

```
plt.figure(figsize=(6,16))
forestplot(varying_intercept_slope_trace, varnames=['a','b']);
```

```
In [35]:
```

```
xvals = np.arange(2)
b = varying_intercept_slope_trace['a'].mean(axis=0)
m = varying_intercept_slope_trace['b'].mean(axis=0)
for bi,mi in zip(b,m):
    plt.plot(xvals, mi*xvals + bi, 'bo-', alpha=0.4)
plt.xlim(-0.1, 1.1);
```

### Adding group-level predictors¶

A primary strength of multilevel models is the ability to handle predictors on multiple levels simultaneously. If we consider the varying-intercepts model above:

\[y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i\]

we may, instead of a simple random effect to describe variation in the expected radon value, specify another regression model with a county-level covariate. Here, we use the county uranium reading \(u_j\), which is thought to be related to radon levels:

\[\alpha_j = \gamma_0 + \gamma_1 u_j + \zeta_j\]

\[\zeta_j \sim N(0, \sigma_{\alpha}^2)\]

Thus, we are now incorporating a house-level predictor (floor or basement) as well as a county-level predictor (uranium).

Note that the model has both indicator variables for each county, plus a county-level covariate. In classical regression, this would result in collinearity. In a multilevel model, the partial pooling of the intercepts towards the expected value of the group-level linear model avoids this.

Group-level predictors also serve to reduce group-level variation \(\sigma_{\alpha}\). An important implication of this is that the group-level estimate induces stronger pooling.

```
In [36]:
```

```
from pymc3 import Deterministic
with Model() as hierarchical_intercept:
    # Priors
    sigma_a = HalfCauchy('sigma_a', 5)
    # Coefficients of the county-level uranium model
    gamma_0 = Normal('gamma_0', mu=0., sd=1e5)
    gamma_1 = Normal('gamma_1', mu=0., sd=1e5)
    # Uranium model for intercept
    mu_a = gamma_0 + gamma_1*u
    # County variation not explained by uranium
    eps_a = Normal('eps_a', mu=0, sd=sigma_a, shape=counties)
    a = Deterministic('a', mu_a + eps_a[county])
    # Common slope
    b = Normal('b', mu=0., sd=1e5)
    # Model error
    sigma_y = Uniform('sigma_y', lower=0, upper=100)
    # Expected value
    y_hat = a + b * floor_measure
    # Data likelihood
    y_like = Normal('y_like', mu=y_hat, sd=sigma_y, observed=log_radon)
model_to_graphviz(hierarchical_intercept)
```

```
Out[36]:
```

```
In [37]:
```

```
with hierarchical_intercept:
    hierarchical_intercept_trace = sample(1000, tune=1000)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma_y, b, eps_a, gamma_1, gamma_0, sigma_a]
Sampling 4 chains: 100%|██████████| 8000/8000 [00:18<00:00, 436.44draws/s]
There were 2 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.6586914850689609, but should be close to 0.8. Try to increase the number of tuning steps.
The estimated number of effective samples is smaller than 200 for some parameters.
```

```
In [38]:
```

```
a_means = hierarchical_intercept_trace['a'].mean(axis=0)
plt.scatter(u, a_means)
g0 = hierarchical_intercept_trace['gamma_0'].mean()
g1 = hierarchical_intercept_trace['gamma_1'].mean()
xvals = np.linspace(-1, 0.8)
plt.plot(xvals, g0+g1*xvals, 'k--')
plt.xlim(-1, 0.8)
a_se = hierarchical_intercept_trace['a'].std(axis=0)
for ui, m, se in zip(u, a_means, a_se):
    plt.plot([ui,ui], [m-se, m+se], 'b-')
plt.xlabel('County-level uranium'); plt.ylabel('Intercept estimate');
```

The standard errors on the intercepts are narrower than for the partial-pooling model without a county-level covariate.

## Correlations among levels¶

In some instances, having predictors at multiple levels can reveal correlation between individual-level variables and group residuals. We can account for this by including the average of the individual predictors as a covariate in the model for the group intercept.

These are broadly referred to as ***contextual effects***.

```
In [39]:
```

```
# Create new variable for mean of floor across counties
xbar = srrs_mn.groupby('county')['floor'].mean().rename(county_lookup).values
```

```
In [40]:
```

```
with Model() as contextual_effect:
    # Priors
    sigma_a = HalfCauchy('sigma_a', 5)
    # Coefficients of the county-level model
    gamma = Normal('gamma', mu=0., sd=1e5, shape=3)
    # Intercept model with uranium and mean-floor covariates
    mu_a = Deterministic('mu_a', gamma[0] + gamma[1]*u.values + gamma[2]*xbar[county])
    # County variation not explained by the covariates
    eps_a = Normal('eps_a', mu=0, sd=sigma_a, shape=counties)
    a = Deterministic('a', mu_a + eps_a[county])
    # Common slope
    b = Normal('b', mu=0., sd=1e15)
    # Model error
    sigma_y = Uniform('sigma_y', lower=0, upper=100)
    # Expected value
    y_hat = a + b * floor_measure
    # Data likelihood
    y_like = Normal('y_like', mu=y_hat, sd=sigma_y, observed=log_radon)
model_to_graphviz(contextual_effect)
```

```
Out[40]:
```

```
In [41]:
```

```
with contextual_effect:
    contextual_effect_trace = sample(1000, tune=1000)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma_y, b, eps_a, gamma, sigma_a]
Sampling 4 chains: 100%|██████████| 8000/8000 [00:19<00:00, 407.55draws/s]
There were 2 divergences after tuning. Increase `target_accept` or reparameterize.
There were 1 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.6922214272506528, but should be close to 0.8. Try to increase the number of tuning steps.
The estimated number of effective samples is smaller than 200 for some parameters.
```

```
In [42]:
```

```
summary(contextual_effect_trace, varnames=['gamma'])
```

```
Out[42]:
```

| | mean | sd | mc_error | hpd_2.5 | hpd_97.5 | n_eff | Rhat |
|---|---|---|---|---|---|---|---|
| gamma__0 | 1.427291 | 0.048235 | 0.001048 | 1.330803 | 1.516513 | 2024.872123 | 1.001484 |
| gamma__1 | 0.698287 | 0.087482 | 0.001630 | 0.527938 | 0.869216 | 3172.220320 | 1.000012 |
| gamma__2 | 0.390631 | 0.196336 | 0.004177 | -0.004603 | 0.771079 | 2233.710720 | 0.999814 |

So, we might infer from this that counties with higher proportions of houses without basements tend to have higher baseline levels of radon. Perhaps this is related to the soil type, which in turn might influence what type of structures are built.

## Prediction¶

Gelman (2006) used cross-validation tests to check the prediction error of the unpooled, pooled, and partially-pooled models.

**root mean squared cross-validation prediction errors**:

- unpooled = 0.86
- pooled = 0.84
- multilevel = 0.79
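These errors are root mean squared differences between held-out observations and model predictions. A minimal sketch of the metric itself (the numbers are toy values, not Gelman's actual cross-validation folds):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared prediction error over held-out observations."""
    return math.sqrt(sum((t - p)**2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Toy held-out log-radon values and model predictions (illustrative only)
held_out = [1.1, 0.4, 2.0, 1.5]
predicted = [0.9, 0.7, 1.6, 1.4]
print(round(rmse(held_out, predicted), 3))
```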

There are two types of prediction that can be made in a multilevel model:

- a new individual within an existing group
- a new individual within a new group
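For a new group, the county intercept must itself first be drawn from the population distribution before simulating the observation. A sketch of that two-stage draw, using made-up arrays as stand-ins for posterior samples of the hyperparameters (they are not output from any trace above):

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws = 2000

# Stand-ins for posterior draws of the hyperparameters (illustrative values)
mu_a = rng.normal(1.3, 0.05, size=n_draws)              # mean county intercept
sigma_a = np.abs(rng.normal(0.35, 0.03, size=n_draws))  # between-county sd
b = rng.normal(-0.66, 0.07, size=n_draws)               # floor coefficient
sigma_y = np.abs(rng.normal(0.75, 0.02, size=n_draws))  # residual sd

# Stage 1: draw an intercept for the unseen county from the population
a_new = rng.normal(mu_a, sigma_a)
# Stage 2: draw a log-radon measurement for a no-basement (floor=1) house
y_new = rng.normal(a_new + b * 1, sigma_y)

print(y_new.mean())  # posterior predictive mean of log radon
```

For a new house in an *existing* county, stage 1 is skipped: that county's own intercept draws are used directly, as in the St. Louis example below.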

For example, if we wanted to make a prediction for a new house with no basement in St. Louis county, we just need to sample from the radon model with the appropriate intercept.

```
In [43]:
```

```
county_lookup['ST LOUIS']
```

```
Out[43]:
```

```
69
```

That is,

\[\tilde{y} \sim N(\alpha_{69} + \beta, \sigma_y^2)\]

This is simply a matter of adding a single additional line in PyMC:

```
In [44]:
```

```
with Model() as contextual_pred:
    # Priors
    sigma_a = HalfCauchy('sigma_a', 5)
    # Coefficients of the county-level model
    gamma = Normal('gamma', mu=0., sd=1e5, shape=3)
    # Intercept model with uranium and mean-floor covariates
    mu_a = Deterministic('mu_a', gamma[0] + gamma[1]*u.values + gamma[2]*xbar[county])
    # County variation not explained by the covariates
    eps_a = Normal('eps_a', mu=0, sd=sigma_a, shape=counties)
    a = Deterministic('a', mu_a + eps_a[county])
    # Common slope
    b = Normal('b', mu=0., sd=1e15)
    # Model error
    sigma_y = Uniform('sigma_y', lower=0, upper=100)
    # Expected value
    y_hat = a + b * floor_measure
    # Data likelihood
    y_like = Normal('y_like', mu=y_hat, sd=sigma_y, observed=log_radon)
    # St Louis county prediction
    stl_pred = Normal('stl_pred', mu=a[69] + b, sd=sigma_y)
model_to_graphviz(contextual_pred)
```

```
Out[44]:
```

```
In [45]:
```

```
with contextual_pred:
    contextual_pred_trace = sample(1000, tune=1000)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [stl_pred, sigma_y, b, eps_a, gamma, sigma_a]
Sampling 4 chains: 100%|██████████| 8000/8000 [00:20<00:00, 395.41draws/s]
There were 176 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.48972929484728844, but should be close to 0.8. Try to increase the number of tuning steps.
The estimated number of effective samples is smaller than 200 for some parameters.
```

```
In [46]:
```

```
plot_posterior(contextual_pred_trace, varnames=['stl_pred']);
```

### Benefits of Multilevel Models¶

- Accounting for natural hierarchical structure of observational data
- Estimation of coefficients for (under-represented) groups
- Incorporating individual- and group-level information when estimating group-level coefficients
- Allowing for variation among individual-level coefficients across groups

### References¶

Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models (1st ed.). Cambridge University Press.

Gelman, A. (2006). Multilevel (Hierarchical) modeling: what it can and cannot do. Technometrics, 48(3), 432–435.