Bayesian Inference with Sampling#

Approximating a Known Distribution with Samples#

We’ve seen before that we can compute an empirical distribution from a sample of data points. In this section, we’ll use samples to approximate distributions.

Let’s start by using samples to approximate a known, easy-to-compute distribution: Beta$(3, 4)$. While this is a somewhat silly example because the distribution is easy to reason about, it will help us see how samples can provide a useful approximation to a distribution.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

distribution = stats.beta(3, 4)
num_samples = 1000

# Compute the exact PDF:
t = np.linspace(0, 1, 500)
pdf = distribution.pdf(t)

# Draw 1000 samples, and look at the empirical distribution of those samples:
samples = distribution.rvs(num_samples)
f, ax = plt.subplots(1, 1)

sns.histplot(
    x=samples, stat='density', bins=20, 
    label=f'Empirical dist. of {num_samples} samples'
)
ax.plot(t, pdf, label='Beta(3, 4) pdf')
ax.legend()

../../../_images/2222f50bc48d6800b4685af6bedab97a6541eb9394675d50d9f3344c5847d150.png

We can see that the samples represent the distribution well, as long as we have enough of them. We can use the mean of the samples to approximate the mean of the distribution:

# The mean of a Beta(a, b) distribution is a/(a+b):
true_mean = 3 / (3 + 4)

approx_mean = np.mean(samples)
print(true_mean, approx_mean)

0.42857142857142855 0.43200499074903015

Similarly, we could use the variance of the samples to estimate the variance of the distribution, and so on.
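
As a minimal sketch of the variance estimate, the cell below repeats the experiment for the variance instead of the mean. The fixed seed and the names `a`, `b`, and `rng` are assumptions made here so that the example is self-contained and repeatable.

```python
import numpy as np
from scipy import stats

a, b = 3, 4
rng = np.random.default_rng(0)  # fixed seed (an assumption) so the run is repeatable
samples = stats.beta(a, b).rvs(1000, random_state=rng)

# The variance of a Beta(a, b) distribution is ab / ((a+b)^2 (a+b+1)):
true_var = a * b / ((a + b) ** 2 * (a + b + 1))

# Estimate the same quantity from the samples
approx_var = np.var(samples)
print(true_var, approx_var)
```

With 1000 samples, the two numbers should agree to roughly two decimal places; the match improves as the number of samples grows.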

While this method was extremely easy to implement, it’s also somewhat silly: for this distribution, anything that we can do with samples can also be done analytically. Usually, we want to approximate difficult, intractable distributions that involve impossible-to-compute normalization constants. So, that’s what we’ll do next.

Rejection Sampling#

Rejection sampling works by generating a large number of proposals, and then rejecting those that are unlikely or impossible.

As a warm-up, let’s suppose that we want to sample a pair of random variables ($x_1$, $x_2$) drawn uniformly from the inside of the unit circle. In other words, we want the uniform distribution over the blue region below:

How can we go about doing this?

(Hint: first sample uniformly over the square $[-1, 1] \times [-1, 1]$ that encloses the circle.)

x_ = np.linspace(-1, 1, 1000)
semicircle = np.sqrt(1-x_**2)
plt.fill_between(x_, -semicircle, semicircle)
plt.axis('equal');

../../../_images/b2c043b1ca53927a5368d6c6481f5fb2b6b82ac830e5efb3a088f573132f447c.png

# Number of samples
N = 400

# Samples in the square [-1, 1] x [-1, 1] that encloses the unit circle
samples = np.random.uniform(-1, 1, [N, 2])

# Which ones are inside the unit circle?
is_in_circle = (samples[:,0]**2 + samples[:, 1]**2) < 1

plt.figure()
plt.scatter(samples[:, 0], samples[:, 1], c=is_in_circle, cmap='viridis')
plt.axis('equal')

good_samples = samples[is_in_circle]
x1 = good_samples[:, 0]
x2 = good_samples[:, 1]
print('Variance of x1 (estimated from samples): %.3f' % np.var(x1))

Variance of x1 (estimated from samples): 0.228

../../../_images/69f9056dbbf9e045ef5bb39280362840bbd59d5c3623a5db6889f3c1d2767c5e.png

Next, let’s think about sampling from a distribution with a complicated density. Suppose we want to sample from the distribution with density $p(\theta|x) \propto \theta \cdot (1.5-\theta) \cdot \sin(\theta)$ for $\theta \in [0,1.5]$:

t = np.linspace(-1, 2.5, 500)
def target(t):
    """The unnormalized distribution we want to sample from"""
    return t * (1.5-t) * np.sin(t) * ((t > 0) & (t < 1.5))
plt.plot(t, target(t))
plt.title('Target distribution')
plt.xlabel(r'$\theta$')
plt.axis([-1,2.5,0,1])
plt.show()

../../../_images/4f6aa434c274c849eb437451ba33b7c73ec4923eef71d804426bb1e385005407.png

How can we make this look like the geometry-based example from before? The solution is an algorithm called rejection sampling. There are two ways of understanding how it works:

The first is to start by generating samples from a uniform distribution. We’ll then throw some away at random (instead of deterministically like in the previous example). Intuitively, looking at the diagram below that compares our target to the Uniform$(0, 1.5)$ distribution, we should try to throw away more samples for very small values of $\theta$, and we should try to keep more samples between $0.5$ and 1. In other words, our probability of rejecting each sample should depend on how likely it is according to the target.

The second is to change our one-dimensional sampling problem (i.e., generate samples for $\theta$) to a two-dimensional sampling problem, where we sample from the orange box in the graph below, and then throw away samples that don’t fall under the blue curve. In order to do this, we must first sample $\theta$ values, and then sample heights for each one. If the height is too large (that is, it falls above the corresponding target height for that value of $\theta$), then we will reject the sample.

t = np.linspace(-1, 2.5, 500)
def uniform_sampling_dist(t):
    """PDF of distribution we're sampling from: Uniform[0, 1.5]"""
    return stats.uniform.pdf(t, 0, 1.5)

plt.plot(t, target(t), label='target (unnormalized)')
plt.plot(t, uniform_sampling_dist(t), label='sampling')
plt.axis([-1,2.5,0,1])
plt.legend()
plt.title('Distributions')
plt.xlabel(r'$\theta$')
plt.show()

../../../_images/9aea2eecc19950d15debfa37fee2ed6e266da43f481f7d9aedc70a8c6573b371.png

To be more precise, rejection sampling works as follows: given an unnormalized target distribution and a proposal distribution, we generate samples from the normalized version of the target distribution by:

1. Generate samples from the proposal distribution.
2. For each generated sample from the previous step, compute the ratio of the target divided by the proposal. This represents the probability that we’ll accept the sample: larger values of the target will lead to higher acceptance probability, and smaller values of the target will lead to lower acceptance probability. In order for this ratio to be treated as a probability, we require that the target always be less than (or equal to) the proposal: this can always be achieved through a good choice of proposal and appropriate scaling of the target distribution.
3. Randomly accept or reject each sample from step 1, with probability determined by its ratio from step 2. This can be done by generating a Uniform$(0, 1)$ random variable for each sample, and accepting the sample if that random variable is less than the acceptance probability. Stop and convince yourself that this is true before continuing! Discard all rejected samples.
4. The accepted samples will be true samples from the normalized density corresponding to the unnormalized target.
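
One way to convince yourself that the accept-or-reject decision works is to check it empirically. The sketch below assumes a single hypothetical acceptance probability, `accept_prob = 0.3`, and a fixed seed; neither value comes from the example in this section.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) for repeatability
accept_prob = 0.3               # hypothetical acceptance probability for one proposal
num_trials = 100_000

# If U ~ Uniform(0, 1), then P(U < p) = p for any p in [0, 1],
# so the comparison below comes out True with probability accept_prob.
u = rng.uniform(size=num_trials)
fraction_accepted = np.mean(u < accept_prob)
print(fraction_accepted)
```

The printed fraction should be close to 0.3. The same comparison, applied elementwise to an array of acceptance probabilities, makes an independent decision for every proposal at once.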

The following code implements rejection sampling. Note that there are only four (substantive) lines of code in this function, corresponding to the four steps above:

def rejection_sample_uniform(num_proposals=100):
    # Generate proposals for samples: these are θ-values.
    # We'll keep some and reject the rest.
    proposals = np.random.uniform(low=0, high=1.5, size=num_proposals)
    
    # Acceptance probability is the ratio of the two curves
    # These had better all be between 0 and 1!
    accept_probs = target(proposals) / uniform_sampling_dist(proposals)
    
    print('Max accept prob: %.3f' % np.max(accept_probs))
    
    # For each sample, we make a decision whether or not to accept.
    # Convince yourself that this line makes that decision for each
    # sample with prob equal to the value in "accept_probs"!
    accept = np.random.uniform(size=num_proposals) < accept_probs
    
    num_accept = np.sum(accept)
    print('Accepted %d out of %d proposals' % (num_accept, num_proposals))
    return proposals[accept]

Let’s apply it to our target distribution from above:

f, axs = plt.subplots(1, 2, figsize=(8, 3), dpi=100)

# Run rejection sampling twice, once with many proposals and once with few proposals
samples_sparse = rejection_sample_uniform(num_proposals=500)
samples_dense = rejection_sample_uniform(num_proposals=10000)

# Plot a true histogram (comparable with density functions) using density=True
axs[0].hist(samples_sparse, bins=np.linspace(-0.25, 1.75, 40), density=True)
axs[1].hist(samples_dense, bins=np.linspace(-0.25, 1.75, 40), density=True)

# Where did this magic number 0.36 come from? What happens if you change it?
axs[0].plot(t, target(t) / 0.36)
axs[1].plot(t, target(t) / 0.36)

axs[0].set_title('Rejection sampling: 500 proposals')
axs[1].set_title('Rejection sampling: 10,000 proposals')
axs[0].axis([-0.25, 1.75, 0, 1.8])
axs[1].axis([-0.25, 1.75, 0, 1.8])

Max accept prob: 0.638
Accepted 188 out of 500 proposals
Max accept prob: 0.638
Accepted 3621 out of 10000 proposals

../../../_images/27bf61bfb859bcef08e5d60a84d41714b5682e66437f4c7ca566731550717d90.png

From the results, we can see that rejection sampling, when run with enough proposals, correctly and accurately gives us samples from the target distribution. But it can be inefficient: we accepted fewer than half of the proposals each time! Indeed, even with 10,000 proposals (right), the histogram of accepted samples doesn’t perfectly match the target density.
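
The acceptance rates printed above (188/500 and 3621/10000) are tied to the area under the unnormalized target. The sketch below computes that area numerically with `scipy.integrate.quad`; the function `target` is copied from earlier in this section, and using numerical integration here is a choice made for this illustration.

```python
import numpy as np
from scipy import integrate

def target(t):
    """Unnormalized target from the text: t * (1.5 - t) * sin(t) on (0, 1.5)."""
    return t * (1.5 - t) * np.sin(t) * ((t > 0) & (t < 1.5))

# Area under the unnormalized target (its normalizing constant)
norm_const, abs_err = integrate.quad(target, 0, 1.5)

# With a Uniform(0, 1.5) proposal of density 1/1.5, the average acceptance
# probability is the integral of (target / (1/1.5)) * (1/1.5), which is
# exactly the area under the target.
print('Normalizing constant: %.4f' % norm_const)
```

The result is about 0.36, which matches both the fraction of proposals accepted and the constant used to rescale the curve in the plots.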

This problem of inefficiency is caused by rejecting samples. Note that the maximum acceptance probability was only around 0.64. If we were to scale the target up by a factor of $1/0.64 \approx 1.56$ (equivalently, scale the proposal down by a factor of 0.64), then we’d reject fewer samples overall, while still ensuring that the acceptance probabilities never went above 1.
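
To see the effect of rescaling, the sketch below runs the same accept-or-reject procedure twice: once with the target as given, and once with the target multiplied by the reciprocal of the largest target-to-proposal ratio. Estimating that maximum on a grid, the fixed seed, and the names `scale` and `acceptance_rates` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) for repeatability

def target(t):
    """Unnormalized target from the text: t * (1.5 - t) * sin(t) on (0, 1.5)."""
    return t * (1.5 - t) * np.sin(t) * ((t > 0) & (t < 1.5))

proposal_density = 1 / 1.5  # density of Uniform(0, 1.5)
num_proposals = 10_000
proposals = rng.uniform(0, 1.5, size=num_proposals)

# Largest ratio of target to proposal, estimated on a fine grid.
# A grid can miss the exact peak by a tiny amount, which is fine for a sketch.
grid = np.linspace(0, 1.5, 10_001)
max_ratio = np.max(target(grid)) / proposal_density
scale = 1 / max_ratio

acceptance_rates = {}
for name, factor in [('unscaled', 1.0), ('scaled', scale)]:
    accept_probs = factor * target(proposals) / proposal_density
    accept = rng.uniform(size=num_proposals) < accept_probs
    acceptance_rates[name] = np.mean(accept)
    print('%s target: accepted %.1f%% of proposals'
          % (name, 100 * acceptance_rates[name]))
```

The accepted samples follow the same normalized distribution in both runs; only the fraction of proposals that survive changes.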

Unfortunately, the inefficiency problem starts to get much worse when working in high dimensions! When dealing with high-dimensional distributions, there is often much more of the space with relatively low density, which means many more samples being rejected.
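
A minimal sketch of this effect extends the circle example to more dimensions: we propose points uniformly in the cube $[-1, 1]^d$ and keep those that land inside the unit ball. The choice of dimensions, the number of proposals, and the fixed seed are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) for repeatability
num_proposals = 100_000

acceptance_rates = {}
for dim in [2, 5, 10]:
    # Proposals: uniform over the cube [-1, 1]^dim
    proposals = rng.uniform(-1, 1, size=(num_proposals, dim))
    # Accept the proposals that fall inside the unit ball
    inside = np.sum(proposals ** 2, axis=1) < 1
    acceptance_rates[dim] = np.mean(inside)
    print('dim = %2d: accepted %.2f%% of proposals'
          % (dim, 100 * acceptance_rates[dim]))
```

In two dimensions about 79% of proposals are kept (the ratio $\pi/4$ of the circle’s area to the square’s), but by ten dimensions well under 1% survive.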

As a second example, what happens if we want to sample from a target distribution that takes on values in $[0, \infty)$? For instance, suppose our density is $p(\theta|x) \propto \exp(-\theta) |\sin(2\theta)|$ for $\theta \in [0, \infty)$. We can’t use a Uniform$(a, b)$ proposal distribution. Why not? For any value of $b$, we’ll never generate a sample greater than $b$, so our samples won’t match the true distribution.

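Instead, we need a proposal distribution that covers all of $[0, \infty)$ and whose tail decays no faster than the target’s. The exponential distribution with rate 1 works here, because the ratio of target to proposal is $|\sin(2\theta)| \leq 1$. (A normal proposal would not work for this target: its tail decays faster than $\exp(-\theta)$, so the ratio would be unbounded.)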

def decaying_target_distribution(t):
    """Unnormalized target distribution as described above"""
    return np.exp(-t) * np.abs(np.sin(2*t))

def sampling_distribution_exponential(t):
    """Sampling distribution: exponential distribution"""
    # stats.expon has a loc parameter which says how far to shift
    # the distribution from its usual starting point of θ=0
    return stats.expon.pdf(t, loc=0, scale=1.0)

def rejection_sample_exponential(num_proposals=500):
    """Rejection sampling with an exponential distribution with λ=1"""
    proposals = np.random.exponential(scale=1.0, size=num_proposals)
    accept_probs = decaying_target_distribution(proposals) / sampling_distribution_exponential(proposals)
    accept = np.random.uniform(0, 1, num_proposals) < accept_probs
    num_accept = np.sum(accept)
    print('Accepted %d out of %d proposals' % (num_accept, num_proposals))
    return proposals[accept]

samples = rejection_sample_exponential(num_proposals=1000)
plt.hist(samples, bins=np.linspace(0, 6, 100), density=True)
# Find how far the axis goes and draw the unnormalized distribution over it

tmin, tmax, _, _ = plt.axis()
t_inf = np.linspace(tmin, tmax, 100)

# Where did this magic number 0.6 come from? What happens if you change it?
plt.plot(t_inf, decaying_target_distribution(t_inf) / 0.6)
plt.show()

Accepted 608 out of 1000 proposals

../../../_images/5d7f6aa052119cb3ca2860b0ed6d0cf05265427ff86d064a43578485e398cec4.png

To summarize, we saw that rejection sampling can be used to draw samples given only an unnormalized target distribution. Even though it can be used to draw samples from any unnormalized target, we most often will use it (and other sampling methods) with the unnormalized target distribution being the numerator of a posterior distribution, $p(x|\theta)p(\theta)$.
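
As a sketch of the posterior use case, the example below applies rejection sampling to a likelihood-times-prior numerator. The setup is hypothetical and not taken from this section: `k = 7` successes in `n = 10` Bernoulli trials with a Uniform$(0, 1)$ prior on $\theta$. It was chosen because the exact posterior, Beta$(k+1, n-k+1)$, is known and gives us something to check against.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) for repeatability
n, k = 10, 7                    # hypothetical data: 7 successes in 10 trials

def unnormalized_posterior(theta):
    """Likelihood times prior, up to a constant: theta^k (1-theta)^(n-k) * 1."""
    return theta ** k * (1 - theta) ** (n - k)

# The numerator peaks at theta = k/n. Dividing by that peak scales the
# target so that it never exceeds the Uniform(0, 1) proposal density of 1.
max_value = unnormalized_posterior(k / n)

num_proposals = 20_000
proposals = rng.uniform(0, 1, size=num_proposals)
accept_probs = unnormalized_posterior(proposals) / max_value
accept = rng.uniform(size=num_proposals) < accept_probs
posterior_samples = proposals[accept]

# Exact posterior mean of Beta(k+1, n-k+1), for comparison
exact_mean = (k + 1) / (n + 2)
print(posterior_samples.mean(), exact_mean)
```

Notice that the normalizing constant $p(x)$ never appears: the sampler only evaluates the numerator.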

Rejection sampling works by using a proposal distribution that’s easy to sample from: while the normal and uniform distributions are most common, we can use any distribution as long as (a) we can draw samples from it and (b) we can guarantee that it will always be greater than (or equal to) a scaled version of our target distribution. It works by (1) generating proposals from the proposal distribution, (2) computing the ratio of target divided by proposal as the acceptance probability, and (3) accepting each sample with its acceptance probability. It’s typically inefficient, especially in high dimensions.

Markov Chain Monte Carlo#

We noticed that the inefficiency of rejection sampling was due to the large number of proposals that we rejected and discarded. Indeed, rejection sampling generates each sample independently, without using any information from previously generated samples about which areas have lower and higher probability.

Markov chain Monte Carlo (MCMC) methods take a different approach by generating a sequence of samples. Each sample depends on the previous one, which lets us generate better samples. We’ll construct the sequence of samples in such a way that they form a Markov chain whose steady state distribution is the true normalized version of the target distribution.
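
To make the idea of a sequence of dependent samples concrete, here is a minimal sketch of one MCMC method, a random-walk Metropolis sampler, applied to the rejection sampling target from earlier. The step size, chain length, starting point, and seed are all assumptions chosen for this illustration, and this is not necessarily how the sections below will present the algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) for repeatability

def target(t):
    """Unnormalized target from the text: t * (1.5 - t) * sin(t) on (0, 1.5)."""
    return t * (1.5 - t) * np.sin(t) * ((t > 0) & (t < 1.5))

num_steps = 20_000
step_size = 0.5   # hypothetical standard deviation of each random-walk step
current = 0.75    # arbitrary starting point inside the support
chain = np.empty(num_steps)

for i in range(num_steps):
    # Propose a move near the current sample
    proposal = current + rng.normal(scale=step_size)
    # Accept with probability min(1, target(proposal) / target(current));
    # otherwise the chain stays where it is.
    if rng.uniform() < target(proposal) / target(current):
        current = proposal
    chain[i] = current

# Compare the chain's mean with a grid-based estimate of the target's mean
grid = np.linspace(0, 1.5, 2001)
grid_mean = np.sum(grid * target(grid)) / np.sum(target(grid))
print(chain.mean(), grid_mean)
```

Unlike rejection sampling, a rejected proposal is not discarded: the chain records the current value again, so every step contributes a sample.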

Markov Chains#

You may find it helpful to see the same material presented in an expanded format in Chapter __ of the Data 140 textbook.

Coming soon

Metropolis-Hastings (Optional)#

Coming soon

Gibbs sampling#

Gibbs sampling is an algorithm designed to work with high-dimensional posterior distributions. It works by iteratively resampling each individual variable conditioned on the data and all other random variables.
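
A minimal sketch of this resampling loop is shown below. It uses a standard bivariate normal with correlation `rho` as a stand-in for a posterior, because both of its conditional distributions are known normals: $\theta_1 | \theta_2 \sim N(\rho\,\theta_2, 1 - \rho^2)$, and symmetrically for $\theta_2$. The target, the value of `rho`, the chain length, and the seed are all assumptions chosen for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) for repeatability
rho = 0.8                       # hypothetical correlation of the target
num_steps = 20_000
cond_sd = np.sqrt(1 - rho ** 2)  # standard deviation of each conditional

samples = np.empty((num_steps, 2))
theta1, theta2 = 0.0, 0.0  # arbitrary starting point
for i in range(num_steps):
    # Resample theta1 conditioned on the current value of theta2
    theta1 = rng.normal(rho * theta2, cond_sd)
    # Resample theta2 conditioned on the new value of theta1
    theta2 = rng.normal(rho * theta1, cond_sd)
    samples[i] = theta1, theta2

sample_corr = np.corrcoef(samples.T)[0, 1]
print(samples.mean(axis=0), sample_corr)
```

Each update only requires sampling from a one-dimensional conditional, and no proposal is ever rejected. The sample means should be near 0 and the sample correlation near `rho`.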

Note that even though we made reference to posterior distributions and conditioning on data throughout this section, Gibbs sampling is a general method that can be used to sample from any high-dimensional target where the conditional distributions $p(\theta_i