Censored Data Models

In [1]:
%matplotlib inline
import pymc3 as pm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import theano.tensor as tt

plt.style.use('seaborn-darkgrid')
print('Running on PyMC3 v{}'.format(pm.__version__))
Running on PyMC3 v3.4.1

This example notebook on Bayesian survival analysis touches on the point of censored data. Censoring is a form of missing-data problem, in which observations greater than a certain threshold are clipped down to that threshold, or observations less than a certain threshold are clipped up to that threshold, or both. These are called right, left and interval censoring, respectively. In this example notebook we consider interval censoring.

Censored data arises in many modelling problems. Two common examples are:

  1. Survival analysis: when studying the effect of a certain medical treatment on survival times, it is impossible to prolong the study until all subjects have died. At the end of the study, the only data collected for many patients is that they were still alive for a time period \(T\) after the treatment was administered: in reality, their true survival times are greater than \(T\).
  2. Sensor saturation: a sensor might have a limited range and the upper and lower limits would simply be the highest and lowest values a sensor can report. For instance, many mercury thermometers only report a very narrow range of temperatures.

This example notebook presents two different ways of dealing with censored data in PyMC3:

  1. An imputed censored model, which represents censored data as parameters and makes up plausible values for all censored values. As a result of this imputation, this model is capable of generating plausible sets of made-up values that would have been censored. Each censored element introduces a random variable.
  2. An unimputed censored model, where the censored data are integrated out and accounted for only through the log-likelihood. This method deals more adequately with large amounts of censored data and converges more quickly.

To establish a baseline we compare to an uncensored model of the uncensored data.

In [2]:
# Produce normally distributed samples
np.random.seed(1618)
size = 500
mu = 13.
sigma = 5.
samples = np.random.normal(mu, sigma, size)

# Set censoring limits
high = 16.
low = -1.

# Censor samples
censored = samples[(samples > low) & (samples < high)]
In [3]:
# Visualize uncensored and censored data
_, axarr = plt.subplots(ncols=2, figsize=[16, 4], sharex=True, sharey=True)
for i, data in enumerate([samples, censored]):
    sns.distplot(data, ax=axarr[i])
axarr[0].set_title('Uncensored')
axarr[1].set_title('Censored')
plt.show()
../_images/notebooks_censored_data_4_0.png

Baseline - Uncensored Model of Uncensored Data

In [4]:
# Uncensored model
with pm.Model() as uncensored_model:
    mu = pm.Normal('mu', mu=0., sd=(high - low) / 2.)
    sigma = pm.HalfNormal('sigma', sd=(high - low) / 2.)
    observed = pm.Normal('observed', mu=mu, sd=sigma, observed=samples)

Model 1 - Imputed Censored Model of Censored Data

In this model, we impute the censored values from the same distribution as the uncensored data. Sampling from the posterior generates possible uncensored data sets.

This model makes use of PyMC3’s bounded variables.

In [5]:
# Imputed censored model
n_right_censored = len(samples[samples >= high])
n_left_censored = len(samples[samples <= low])
n_observed = len(samples) - n_right_censored - n_left_censored

with pm.Model() as imputed_censored_model:
    mu = pm.Normal('mu', mu=0., sd=(high - low) / 2.)
    sigma = pm.HalfNormal('sigma', sd=(high - low) / 2.)

    right_censored = pm.Bound(pm.Normal, lower=high)(
        'right_censored', mu=mu, sd=sigma, shape=n_right_censored
    )
    left_censored = pm.Bound(pm.Normal, upper=low)(
        'left_censored', mu=mu, sd=sigma, shape=n_left_censored
    )

    observed = pm.Normal(
        'observed',
        mu=mu,
        sd=sigma,
        observed=censored,
        shape=n_observed
    )

Model 2 - Unimputed Censored Model of Censored Data

In this model, we do not impute censored data, but instead integrate them out through the likelihood.

The implementations of the likelihoods are non-trivial. See the Stan manual (section 11.3 on censored data) and the original PyMC3 issue on GitHub for more information.

This model makes use of PyMC3’s ``Potential` <https://docs.pymc.io/api/model.html#pymc3.model.Potential>`__.

In [6]:
# Import the log cdf and log complementary cdf of the normal Distribution from PyMC3
from pymc3.distributions.dist_math import normal_lcdf, normal_lccdf

# Helper functions for unimputed censored model
def left_censored_likelihood(mu, sigma, n_left_censored, lower_bound):
    ''' Likelihood of left-censored data. '''
    return n_left_censored * normal_lcdf(mu, sigma, lower_bound)


def right_censored_likelihood(mu, sigma, n_right_censored, upper_bound):
    ''' Likelihood of right-censored data. '''
    return n_right_censored * normal_lccdf(mu, sigma, upper_bound)
In [7]:
# Unimputed censored model
with pm.Model() as unimputed_censored_model:
    mu = pm.Normal('mu', mu=0., sd=(high - low) / 2.)
    sigma = pm.HalfNormal('sigma', sd=(high - low) / 2.)

    observed = pm.Normal(
        'observed',
        mu=mu,
        sd=sigma,
        observed=censored,
    )

    left_censored = pm.Potential(
        'left_censored',
        left_censored_likelihood(mu, sigma, n_left_censored, low)
    )
    right_censored = pm.Potential(
        'right_censored',
        right_censored_likelihood(mu, sigma, n_right_censored, high)
    )

Sampling

In [8]:
# Uncensored model
with uncensored_model:
    trace = pm.sample(tune=1000)  # Increase `tune` to avoid divergences
    pm.plot_posterior(trace);
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sigma, mu]
Sampling 2 chains: 100%|██████████| 3000/3000 [00:01<00:00, 2633.37draws/s]
../_images/notebooks_censored_data_13_1.png
In [9]:
# Imputed censored model
with imputed_censored_model:
    trace = pm.sample()
    pm.plot_posterior(trace, varnames=['mu', 'sigma']);
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [left_censored, right_censored, sigma, mu]
Sampling 2 chains: 100%|██████████| 2000/2000 [00:04<00:00, 478.99draws/s]
../_images/notebooks_censored_data_14_1.png
In [10]:
# Unimputed censored model
with unimputed_censored_model:
    trace = pm.sample(tune=1000)  # Increase `tune` to avoid divergences
    pm.plot_posterior(trace, varnames=['mu', 'sigma']);
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sigma, mu]
Sampling 2 chains: 100%|██████████| 3000/3000 [00:01<00:00, 1960.47draws/s]
../_images/notebooks_censored_data_15_1.png

Discussion

As we can see, both censored models appear to capture the mean and variance of the underlying distribution as well as the uncensored model! In addition, the imputed censored model is capable of generating data sets of censored values (sample from the posteriors of left_censored and right_censored to generate them), while the unimputed censored model scales much better with more censored data, and converges faster.

Authors