
import numpy as np
import pandas as pd
from scipy import stats
from IPython.display import YouTubeVideo

%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Bayesian Inference

In this section, we’ll focus on computing and using posterior distributions in more sophisticated Bayesian models. We’ll start by discussing why posterior distributions are useful in Bayesian inference, and then explain why they’re hard. Then, in the next section, we’ll learn about approximating distributions using sampling.

Why we need posterior distributions

In general, we need the posterior distribution so that we can make statements and decisions about our unknown quantity of interest, $\theta$. We saw that for simple models like the product review model or the model for heights, it was easy to compute the posterior exactly, because we chose a conjugate prior.

In the product review example:

  • Our parameter of interest $\theta$ represents the probability of a positive review.

  • If we chose a Beta prior, i.e., $\theta \sim \mathrm{Beta}(\alpha, \beta)$, then the posterior distribution also belonged to the Beta family: $\theta \mid x \sim \mathrm{Beta}\left(\alpha + \sum_i x_i, \; \beta + n - \sum_i x_i\right)$.

  • This made it easy to determine quantities like the MAP estimate or the LMSE estimate, simply by using known properties of the Beta distribution (see the short sketch below).
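
To make this concrete, here is a minimal sketch of the conjugate update in code. The prior parameters and review counts are made-up numbers, chosen only to illustrate reading the LMSE and MAP estimates off the known Beta posterior.

# Hypothetical prior and review counts, for illustration only
alpha, beta_param = 2, 2              # Beta(2, 2) prior
num_positive, num_negative = 18, 4    # 18 positive reviews out of 22

# Conjugate update: the posterior is Beta(alpha + sum(x), beta + n - sum(x))
posterior = stats.beta(alpha + num_positive, beta_param + num_negative)

# LMSE estimate = posterior mean; MAP estimate = posterior mode, (a - 1) / (a + b - 2)
lmse_estimate = posterior.mean()
map_estimate = (alpha + num_positive - 1) / (alpha + num_positive + beta_param + num_negative - 2)
print(lmse_estimate, map_estimate)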

But what if our posterior distribution didn’t have such a convenient form? In that case, we would have to compute the posterior (and any estimates from it) ourselves:

$$
\begin{align}
p(\theta|x) &= \frac{p(x|\theta)p(\theta)}{p(x)} \\
&= \frac{p(x|\theta)p(\theta)}{\int p(x|\theta)p(\theta)\,d\theta}
\end{align}
$$

In general, the integral in the denominator could be impossible to compute. We call the denominator the normalizing constant: it’s a constant because it doesn’t depend on $\theta$, and it’s normalizing because we need it for the distribution or density to sum or integrate to 1.

In the next section, we’ll see a few examples that illustrate why computing the normalizing constant is hard, but first, let’s look at three examples of why we might need to know it in the first place.

Computing probabilities

Suppose we want to know the probability that $\theta$ is greater than 0.7, given the observed data. In this case, we can set up an integral to compute it:

$$
\begin{align}
P(\theta > 0.7 \mid x) &= \int_{0.7}^1 p(\theta|x) \, d\theta \\
&= \int_{0.7}^1 \frac{p(x|\theta)p(\theta)}{p(x)} \, d\theta \\
&= \frac{1}{p(x)} \int_{0.7}^1 p(x|\theta)p(\theta) \, d\theta
\end{align}
$$

In the last step, we used the fact that $p(x)$ doesn’t depend on $\theta$.

If we don’t know $p(x)$, then our probability will be off by an unknown factor. For example, suppose the true probability is 0.9: the integral $\int_{0.7}^1 p(x|\theta)p(\theta)\,d\theta$ might be 0.0009, with the (normally unknown) denominator $p(x)$ equal to 0.001. Without the normalizing constant, all we can compute is the 0.0009, and there’s no way to turn it into the probability we want: we’ll always be off by an unknown factor, which makes the answer useless.
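
We can check this numerically in the conjugate setting, where we know the right answer. The sketch below uses the same hypothetical review counts as before: it integrates the unnormalized posterior $p(x|\theta)p(\theta)$ over $[0.7, 1]$, divides by a numerically computed $p(x)$, and compares the result with the exact probability from the Beta posterior.

from scipy import integrate

# Hypothetical data: 18 positive reviews out of 22, Beta(2, 2) prior
alpha, beta_param, k, n = 2, 2, 18, 22

def unnormalized_posterior(theta):
    # p(x|theta) p(theta): easy to evaluate, but not a normalized density
    return theta**k * (1 - theta)**(n - k) * stats.beta.pdf(theta, alpha, beta_param)

# Numerator and normalizing constant, both by one-dimensional numerical integration
numerator, _ = integrate.quad(unnormalized_posterior, 0.7, 1)
normalizing_constant, _ = integrate.quad(unnormalized_posterior, 0, 1)

print(numerator / normalizing_constant)                    # P(theta > 0.7 | x), numerically
print(stats.beta(alpha + k, beta_param + n - k).sf(0.7))   # exact answer from the conjugate posterior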


MAP Estimation

Suppose we want to compute the MAP estimate:

$$
\begin{align}
\hat{\theta}_{MAP} &= \underset{\theta}{\operatorname{argmax}} \; p(\theta|x) \\
&= \underset{\theta}{\operatorname{argmax}} \; \frac{p(x|\theta)p(\theta)}{p(x)} \\
&= \underset{\theta}{\operatorname{argmax}} \; p(x|\theta)p(\theta)
\end{align}
$$

In the last step, we used the fact that $p(x)$ doesn’t depend on $\theta$, so dividing by it doesn’t change which $\theta$ achieves the maximum.

If $\theta$ is low-dimensional and continuous, we can often carry out this optimization analytically, or numerically if needed. If $\theta$ is discrete and takes on only a small number of values, we can simply search over all of them. However, if $\theta$ is discrete with an intractably large number of possible values, searching over all of them is impossible; similarly, if $\theta$ is high-dimensional, the search can also be intractable.

To summarize: for low-dimensional continuous variables, or discrete random variables with a small number of possible values, we can compute the MAP estimate without needing to know the exact posterior. For higher-dimensional random variables and/or discrete random variables with many possible values, this won’t work.
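
For the low-dimensional continuous case, here is a minimal sketch of numerical MAP estimation using only the unnormalized posterior. It reuses the hypothetical review counts from above and checks the answer against the closed-form mode of the Beta posterior.

from scipy import optimize

# Hypothetical data again: 18 positive reviews out of 22, Beta(2, 2) prior
alpha, beta_param, k, n = 2, 2, 18, 22

def negative_log_unnormalized_posterior(theta):
    # -log[p(x|theta) p(theta)]: the normalizing constant is not needed for the MAP
    return -(k * np.log(theta) + (n - k) * np.log(1 - theta)
             + stats.beta.logpdf(theta, alpha, beta_param))

result = optimize.minimize_scalar(negative_log_unnormalized_posterior,
                                  bounds=(1e-6, 1 - 1e-6), method='bounded')
print(result.x)                                        # numerical MAP estimate
print((alpha + k - 1) / (alpha + beta_param + n - 2))  # closed-form Beta mode, for comparison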

LMSE Estimation

Suppose we want to compute the LMSE estimate. Recall the definition of conditional expectation (see Data 140 textbook, Chapter 9 and Chapter 15):

$$
\begin{align}
\hat{\theta}_{LMSE} &= E_{\theta|x}[\theta] \\
&= \int \theta \cdot p(\theta|x) \, d\theta \\
&= \int \theta \cdot \frac{p(x|\theta)p(\theta)}{p(x)} \, d\theta \\
&= \frac{1}{p(x)} \int \theta \cdot p(x|\theta)p(\theta)\, d\theta
\end{align}
$$

In order to compute the LMSE estimate, we need to compute the denominator, $p(x)$. If we don’t know it, then our estimate will be off by a multiplicative factor that we don’t know, making it effectively useless.

The same is true for computing the expected value of any other function of $\theta$, or any other probability involving the posterior distribution. For example, answering the question “according to the posterior distribution, what is the variance of $\theta$?” leads to the same problem.

To summarize: any computation involving the posterior (probabilities, expectations, etc.) requires the full normalized distribution: the numerator in Bayes’ rule isn’t enough.
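
As with the probability example, we can see this numerically in the conjugate setting. The sketch below computes the LMSE estimate by integrating $\theta \cdot p(x|\theta)p(\theta)$ and dividing by a numerically computed $p(x)$, then compares it with the exact posterior mean; the data are the same hypothetical counts as above.

from scipy import integrate

# Same hypothetical setup: 18 positive reviews out of 22, Beta(2, 2) prior
alpha, beta_param, k, n = 2, 2, 18, 22

def unnormalized_posterior(theta):
    return theta**k * (1 - theta)**(n - k) * stats.beta.pdf(theta, alpha, beta_param)

# E[theta | x] = (1 / p(x)) * integral of theta * p(x|theta) p(theta)
numerator, _ = integrate.quad(lambda t: t * unnormalized_posterior(t), 0, 1)
normalizing_constant, _ = integrate.quad(unnormalized_posterior, 0, 1)

print(numerator / normalizing_constant)                  # numerical LMSE estimate
print(stats.beta(alpha + k, beta_param + n - k).mean())  # exact posterior mean, for comparison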

Why computing posterior distributions is hard

In simple models like our product review model or our model for heights, it was easy to compute the exact posterior for the unknown variable that we were interested in. This happened because we chose a conjugate prior. In most other cases, computing the exact posterior is hard! Here are two examples:

One-dimensional non-conjugate prior

Let’s return to the product review example, but this time, instead of a Beta prior, we choose $p(\theta) = \frac{2}{\pi}\cos\left(\frac{\pi}{2} \theta\right)$ for $\theta \in [0, 1]$. Applying Bayes’ rule, the posterior is proportional to:

$$
\begin{align}
p(\theta|x) &\propto p(x|\theta)p(\theta) \\
&\propto \Big[\theta^{\sum_i x_i}(1-\theta)^{\sum_i (1-x_i)}\Big]\cos\left(\frac{\pi}{2}\theta\right)
\end{align}
$$

This distribution looks much more complicated: we can’t reduce it to a known distribution at all. So, in order to properly compute $p(\theta|x)$, we’d need to figure out the normalizing constant. This requires solving the integral:

$$
\begin{align}
p(x) &= \int_0^1 \Big[\theta^{\sum_i x_i}(1-\theta)^{\sum_i (1-x_i)}\Big]\frac{2}{\pi}\cos\left(\frac{\pi}{2}\theta\right)\,d\theta
\end{align}
$$

This integral is difficult to solve in closed form. In this specific example, since the problem is one-dimensional, we could take advantage of numerical integration: for a particular sequence of observed values $x_1, \ldots, x_n$, we can plug them in, compute a numerical approximation to the integral, and obtain the normalizing constant that way. As we saw above, we don’t need the normalizing constant if we’re only interested in the MAP estimate, but we can’t compute the LMSE estimate without it.
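
Here is a minimal sketch of that numerical-integration approach for the cosine prior, again using made-up review counts: we approximate $p(x)$ with scipy’s quad routine and then use it to compute the LMSE estimate.

from scipy import integrate

# Hypothetical data: 18 positive reviews out of 22, with the cosine prior
k, n = 18, 22

def unnormalized_posterior(theta):
    # p(x|theta) p(theta), with p(theta) = (2/pi) cos(pi * theta / 2) on [0, 1]
    return theta**k * (1 - theta)**(n - k) * (2 / np.pi) * np.cos(np.pi * theta / 2)

# Normalizing constant p(x) via one-dimensional numerical integration
p_x, _ = integrate.quad(unnormalized_posterior, 0, 1)

# With p(x) in hand, we can compute posterior quantities, e.g., the LMSE estimate
lmse_numerator, _ = integrate.quad(lambda t: t * unnormalized_posterior(t), 0, 1)
print(lmse_numerator / p_x)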

Multi-dimensional example

Consider the exoplanet model from last time: $x_i$ is the (observed) radius of planet $i$, $z_i$ indicates whether the planet belongs to group 0 (small, possibly habitable planets) or group 1 (large, likely uninhabitable planets), and $\mu_0$ and $\mu_1$ are the mean radii of those two groups, respectively.

$$
\begin{align}
z_i &\sim \mathrm{Bernoulli}(\pi) & i = 1, \ldots, n \\
\mu_k &\sim \mathcal{N}(\mu_p, \sigma_p) & k = 0, 1 \\
x_i \mid z_i, \mu_0, \mu_1 &\sim \mathcal{N}(\mu_{z_i}, \sigma) & i = 1, \ldots, n
\end{align}
$$
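
As a sanity check on the model, here is a minimal sketch of simulating synthetic data from this generative story. The hyperparameter values ($\pi = 0.5$, $\mu_p = 5$, $\sigma_p = 3$, $\sigma = 1$) are made-up choices for illustration, not the values used for the real exoplanet data.

# Made-up hyperparameters, just to draw synthetic data from the generative model
rng = np.random.default_rng(42)
n = 500
pi, mu_p, sigma_p, sigma = 0.5, 5.0, 3.0, 1.0

z = rng.binomial(1, pi, size=n)          # group label for each planet
mu = rng.normal(mu_p, sigma_p, size=2)   # the two group means mu_0 and mu_1
x = rng.normal(mu[z], sigma)             # observed radius of each planet

In the inference problem, we observe only the radii $x_{1:n}$ and must reason about $z_{1:n}$, $\mu_0$, and $\mu_1$.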

We can write the likelihood and prior. To simplify notation, we’ll write $\mathcal{N}(y; m, s) = \frac{1}{s \sqrt{2\pi}} \exp\left\{-\frac{1}{2s^2}(y - m)^2\right\}$ for the density of a normal distribution with mean $m$ and standard deviation $s$.

$$
\begin{align}
p(z_i) &= \pi^{z_i}(1-\pi)^{1-z_i} \\
p(\mu_k) &= \mathcal{N}(\mu_k; \mu_p, \sigma_p) \\
p(x_i \mid z_i, \mu_0, \mu_1) &= \mathcal{N}(x_i; \mu_{z_i}, \sigma)
\end{align}
$$

We can try computing the posterior over the hidden variables $z_i$, $\mu_0$, and $\mu_1$. We’ll use the notation $z_{1:n}$ to represent $z_1, \ldots, z_n$ (and similarly for $x_{1:n}$).

$$
\begin{align}
p(z_{1:n}, \mu_0, \mu_1 \mid x_{1:n}) &\propto p(\mu_0)\,p(\mu_1)\prod_i \left[p(z_i)\, p(x_i \mid z_i, \mu_0, \mu_1)\right]
\end{align}
$$

This distribution is more complicated than anything we’ve seen up until now. It’s a joint distribution over $n+2$ random variables (the group labels $z_1, \ldots, z_n$ and the two group means $\mu_0$ and $\mu_1$).

Computing the normalization constant $p(x_{1:n})$ requires a complicated combination of sums and integrals:

$$
\begin{align}
p(x_{1:n}) &= \sum_{z_1=0}^1 \sum_{z_2=0}^1 \cdots \sum_{z_n=0}^1 \int\!\!\int p(\mu_0)\,p(\mu_1)\prod_i \left[p(z_i)\, p(x_i \mid z_i, \mu_0, \mu_1)\right] d\mu_0\, d\mu_1
\end{align}
$$

For our dataset of over 500 planets, the sums alone would require a completely intractable amount of computation:

2**517
429049853758163107186368799942587076079339706258956588087153966199096448962353503257659977541340909686081019461967553627320124249982290238285876768194691072

Worse still, we can’t even compute the MAP estimate for the labels $z_i$: in order to find the one that maximizes the numerator, we’d have to search over all $2^{517}$ combinations, which is also completely intractable.

Even in this fairly simple model, with two groups, we’ve found that exact inference is completely hopeless: there’s no way we can compute the exact posterior for all our unknowns. In the rest of this section, we’ll talk about ways to get around this problem using approximations to the posterior distribution.

Specifically, our approximation methods will take advantage of what we’ve learned. We know that the hardest part of computing posterior distributions is computing the normalization constant $p(x)$. So, we’ll build methods that start with the unnormalized posterior $p(x|\theta)p(\theta)$ and use that to give us an approximation of the actual posterior $p(\theta|x) = \frac{p(x|\theta)p(\theta)}{p(x)}$. While there are multiple families of methods to provide such approximations, in this textbook we’ll focus on ones that use samples to approximate the distribution.
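
To make that starting point concrete, here is a minimal sketch (under the same made-up hyperparameter values as above) of evaluating the unnormalized log posterior for the exoplanet model. The point is that this quantity is cheap to evaluate for any candidate $(z_{1:n}, \mu_0, \mu_1)$, even though normalizing it exactly is hopeless.

# Unnormalized log posterior: log[ p(mu_0) p(mu_1) prod_i p(z_i) p(x_i | z_i, mu_0, mu_1) ]
# Hyperparameters are the same made-up choices used in the simulation sketch above.
pi, mu_p, sigma_p, sigma = 0.5, 5.0, 3.0, 1.0

def log_unnormalized_posterior(z, mu_0, mu_1, x):
    log_prior_means = (stats.norm.logpdf(mu_0, mu_p, sigma_p)
                       + stats.norm.logpdf(mu_1, mu_p, sigma_p))
    log_prior_z = np.sum(z * np.log(pi) + (1 - z) * np.log(1 - pi))
    means = np.where(z == 0, mu_0, mu_1)            # mean radius for each planet's group
    log_likelihood = np.sum(stats.norm.logpdf(x, means, sigma))
    return log_prior_means + log_prior_z + log_likelihood

The approximation methods described next only ever need to evaluate quantities like this one; they never need the normalizing constant $p(x_{1:n})$ itself.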
