Parameter estimation and Bayesian Inference Fundamentals#
Show code cell content
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import YouTubeVideo
sns.set()
%matplotlib inline
from scipy import stats
You may find it helpful to read Chapter 20 and Chapter 21 of the Data 140 textbook, which present a different exposition of similar material, but are not required prerequisites for this chapter.
We’ll begin our exploration of Bayesian and frequentist modeling with a simple example that highlights some key differences between the two approaches. Recall our setup: we observe some data
Estimating product quality:
are the reviews (👍/1 or 👎/0) for a product being sold online, and is a number between 0 and 1 that represents the probability of someone leaving a good review. Intuitively, represents our estimate of the quality of the product.Estimating average height:
are the heights of individuals in a random sample, and is the average height of individuals in the population.
We’ll start with the first example.
Intuition and computation#
Suppose you’re shopping for a new microwave. You find two choices, both of which cost about the same amount:
Microwave A has three positive reviews and no negative reviews
Microwave B has 19 positive reviews and 1 negative review
Which one should you buy?
Intuitively, there isn’t necessarily a right or wrong answer. But, in a scenario like this, most shoppers would probably prefer microwave B. Seeing more data points, even if one is negative, reassures us that the results we’re seeing in the reviews are consistent.
We’ll approach this question computationally from a frequentist and then a Bayesian perspective, and see how well each one can incorporate the intuition above.
When formulating a statistical model (regardless of choosing a Bayesian or frequentist approach), it’s important to choose a model so that the parameter we’re estimating satisfies the following criteria:
The parameter should reflect the underlying quantity that we’re interested in estimating. In this case, we’ll quantify how “good” each microwave is using
, and then we can just pick the microwave that’s “better” (i.e., has a higher value of ). Our choice above, the probability of leaving a positive review, satisfies this, since better products should be more likely to get positive reviews than worse products.The parameter should fit clearly within our probability model: in other words, our probability model should relate the observed data and the parameter. We’ll see more about how this works shortly.
Our goal will be to estimate and draw conclusions about the parameter
Frequentist parameter estimation: maximum likelihood#
Before we can actually do any parameter estimation, we need to first specify the relevant probability distributions. Recall that in the frequentist setting, we’re assuming that the data (our binary reviews,
To set up a probability model in the frequentist setting, all we need is a likelihood
This notation means “conditioned on
While the underlying probability should be familiar to you, there are some important conceptual leaps that we’ve taken:
This notation represents a parametric family of distributions: different values of
lead to different distributions for , but they all fall within the Bernoulli family of distributions.Since our goal in this process is to find a “good” value of
using the observed data , we can view this as a function of : for any value of that we consider, we can plug in our data and get the probability of observing it, if that value of were true.
The simplest way to estimate
Here’s how we’ll go about it:
We’ll write the likelihood for all the data points,
(we already did it for a single point above)We’ll use the log-likelihood instead of the likelihood. This will make things a little easier computationally. Why is this okay? Because
is a monotonically increasing function, so the same value of that maximizes also maximizes .To find the best value of
, we’ll take the derivative of the log-likelihood with respect to , set it equal to 0, and solve.We’ll confirm that step 3 gave us a maximum (rather than a minimum) by computing the second derivative and confirming that it’s negative, and checking the boundaries.
1. Writing the likelihood#
Our likelihood for all the data points is:
We’ll assume that our samples are conditionally i.i.d. (independent and identically distributed) given the parameter
Next, we need to figure out a more mathematically convenient way to write the likelihood for an individual point,
Convince yourself that this is the same by plugging in
Now that we have this notationally convenient way of writing it, when we multiply all of them together, the exponents add:
The first exponent is simply the number of positive reviews, and the second exponent is simply the number of negative reviews. To simplify our expression, we’ll let
2. Taking the log#
3. Maximizing#
Finally, we’ve arrived at our maximum likelihood estimate for the Bernoulli likelihood: it’s simply the fraction of total reviews that are positive. This is a very intuitive result!
At this point, we can see why we chose to use the log likelihood: it was easier to differentiate the expression from step 2 than it would have been to differentiate the expression from step 1 (if you need to, convince yourself that this is true by computing the derivative).
4. Verifying a maximum#
where the last statement is true because
In many cases, it may also be necessary to check the boundary values, especially if the likelihood function is monotonically increasing or decreasing over the range of possible
Applying the maximum likelihood estimate#
Let’s return to our example comparing microwaves A and B. If we apply maximum likelihood for each one, we get that
In other words, the value of
A different way: Bayesian inference#
In the Bayesian setting, we’ll assume that the unknown parameter
When we set up a probability model in the previous section, we chose a likelihood that described the connection between our parameter
Note that we can treat the denominator here as a constant, because
Even though the denominator is a constant, it still matters: more on this later. For now, we’ll just try to find ways to get away without computing it.
We have two important questions left to answer before we can compute a Bayesian estimate:
How do we choose the prior
?Once we get the posterior distribution
, how do we use that to choose a single value of as our estimate?
We’ll answer these two questions in the next section.
Choosing a prior: Beta distributions#
Let’s return to our product review example. For our prior
The normalization constant is quite complicated, but we don’t need to know what it is to use the beta distribution. Since this is a well-studied distribution, we can find out any facts about it that we need just by looking them up. Try it yourself: without doing any computation, what is the expectation of a beta-distributed random variable? What is the mode?
Normalizing the Beta distribution (optional)#
How would we go about computing the normalization constant for this distribution? We know that the formula above is missing a constant of proportionality, which we’ll call
We’ll need to solve for
This integral is difficult to compute, but we can look up that the result is something called the Beta function. Details about this function are beyond the scope of this class, but it’s related to the Gamma function (an extension of the factorial function to non-integer values).
In other words, no matter what the values of
Also note that
Intuition for the Beta distribution#
Before we compute the posterior, let’s build some intuition for how the beta distribution works. The Wikipedia page has some helpful examples of what it looks like for different values of scipy.stats
to compute and visualize probability density functions (PDFs):
thetas = np.linspace(0, 1, 500)
beta_11 = stats.beta.pdf(thetas, 1, 1)
beta_72 = stats.beta.pdf(thetas, 7, 2)
beta_34 = stats.beta.pdf(thetas, 3, 4)
f, ax = plt.subplots(1, 1, figsize=(5, 3))
ax.plot(thetas, beta_11, label=r'$\alpha = 1, \beta = 1$')
ax.plot(thetas, beta_34, label=r'$\alpha = 3, \beta = 4$')
ax.plot(thetas, beta_72, label=r'$\alpha = 7, \beta = 2$')
ax.legend();

We can see that different values of the parameters lead to different shapes for the beta distribution. In particular, when
Computing the posterior#
Now that we’ve set up the prior, we can compute the posterior distribution:
We can see that this is also a beta distribution, because it looks like
By making this observation, we avoided having to compute the denominator
This is because the beta distribution is the conjugate prior for the Bernoulli distribution. This means that whenever we have a Bernoulli likelihood, if we choose a prior from the beta family, then the posterior distribution will also be in the beta family.
Demo: prior and posterior Betas#
Show code cell content
FIGURE_SIZE = (4.5, 3.5)
#def plot_beta_prior_and_posterior(r, s, m, y, show_map=False, show_lmse=False):
def plot_beta_prior_and_posterior(alpha, beta, pos_obs, neg_obs, show_map=False, show_lmse=False, ax=None):
x = np.linspace(0, 1, 100)
prior = stats.beta.pdf(x, alpha, beta)
alpha_new = alpha + pos_obs
beta_new = beta + neg_obs
posterior = stats.beta.pdf(x, alpha_new, beta_new)
# You never have to memorize these: when making this notebook,
# I found them on the wikipedia page for the Beta distribution:
# https://en.wikipedia.org/wiki/Beta_distribution
if show_lmse:
x_lmse = (alpha_new)/(alpha_new + beta_new)
else:
x_lmse = None
if show_map:
x_map = (alpha_new - 1) / (alpha_new + beta_new - 2)
else:
x_map = None
plot_prior_posterior(x, prior, posterior, (-0.02, 1.02),
prior_label=f'Prior: Beta({alpha}, {beta})',
posterior_label=f'Posterior: Beta({alpha_new}, {beta_new})',
x_map=x_map, x_lmse=x_lmse, ax=ax)
def plot_prior_posterior(x, prior, posterior, xlim,
prior_label, posterior_label,
x_map=None, x_lmse=None, ax=None):
if ax is None:
f, ax = plt.subplots(1, 1, figsize=FIGURE_SIZE, dpi=80)
ax.plot(x, prior, ls=':', lw=2.5, label = prior_label)
ax.plot(x, posterior, lw=2.5, label = posterior_label)
if x_map is not None:
map_index = np.argmin(np.abs(x - x_map))
posterior_map = posterior[map_index]
label = f'MAP estimate: {x_map:0.2f}'
ax.plot([x_map, x_map], [0, posterior_map], '--', lw=2.5, color='black', label=label)
if x_lmse is not None:
lmse_index = np.argmin(np.abs(x - x_lmse))
posterior_lmse = posterior[lmse_index]
label = f'LMSE estimate: {x_lmse:0.2f}'
ax.plot([x_lmse, x_lmse], [0, posterior_lmse], '--', lw=1.5, color='red', label=label)
#plt.legend(bbox_to_anchor=(1.32, 1.02))
ax.legend()
ymax = max(max(prior[np.isfinite(prior)]), max(posterior[np.isfinite(posterior)]))
ax.set_ylim(-0.3, ymax+0.3)
ax.set_xlim(*xlim)
ax.set_xlabel(r'$\theta$')
ax.set_title(r'Prior $p(\theta)$ and posterior given observed data $x$: $p(\theta|x)$');
Let’s visualize some examples of prior and posterior beta distributions. We’ll use the plot_beta_prior_and_posterior
function, which takes in the parameters of the prior (
We’ll start with Microwave A, which had 3 positive reviews and 0 negative reviews. For now, we’ll just explore a few different choices and look at the outcomes. Later, we’ll discuss which choice of prior could be a better fit for this particular problem.
First, what happens if we use a
# Plots a beta posterior resulting from a Beta(1, 1) prior (the first two arguments)
# and 3 positive observations and 0 negative observations (the second two arguments)
plot_beta_prior_and_posterior(1, 1, 3, 0)

Looking at the orange density, it assigns higher probability to larger values of
plot_beta_prior_and_posterior(1, 5, 3, 0)

Changing the prior distribution has dramatically changed the posterior distribution. What if we had used an even more “skeptical” prior?
plot_beta_prior_and_posterior(1, 10, 3, 0)

We can see that for microwave A, our choice of prior has a significant effect on the posterior distribution. What about for microwave B?
plot_beta_prior_and_posterior(1, 1, 19, 1)
plot_beta_prior_and_posterior(1, 5, 19, 1)
plot_beta_prior_and_posterior(1, 10, 19, 1)



Here, the prior still has an effect on the posterior distribution: the more “skeptical” the prior is, the more the posterior shifts to the left. But, the effect of the prior is much smaller than it was for Microwave A. This illustrates a general trend in Bayesian inference: with less data, the prior usually has a stronger effect on the posterior.
Point estimates#
Once we have a posterior distribution, we will often want to answer the question: how do we use the distribution to get a single estimate for
We’ll start with two answers for question 1: the Maximum a Posteriori (MAP) estimate and the Minimum/Least Mean Squared Error (MMSE/LMSE) Estimate.
The MAP estimate is the estimate that maximizes the posterior distribution. Remember that the posterior distribution represents our belief about the unknown parameter
Optional exercise: derive the mode of the beta distribution. Hint: instead of maximizing the Beta density with respect to
The LMSE estimate is simply the mean of the posterior distribution. It has this name because we can show that the mean of the posterior distribution minimizes the mean squared error:
The mean of a Beta
Choosing the prior parameters#
Returning to our microwave example, we can see that our choice of prior encodes our beliefs about products and reviews.
For example, if we think that most products are good until proven otherwise by the review data, then we could choose a more “optimistic” prior, such as Beta
The posterior for the quality of microwave A is Beta
, and the MAP estimate is 1.The posterior for the quality of microwave B is Beta
, and the MAP estimate is 0.96. Under this prior, microwave A looks better.
But, if we believe that most products are bad until proven otherwise by the review data, we may choose a Beta
The posterior for the quality of microwave A is Beta
: if we compute the MAP estimate, we get 0.43.The posterior for the quality of microwave B is Beta
, and the MAP estimate is 0.79. So, under this prior, microwave B looks better.
From this, we can take away that our choice of prior can often have a significant effect on our results. So, it’s important to choose priors that capture your assumptions and belief about the world, and to be explicit about communicating those assumptions with others when sharing your results.
Another attempt: a non-conjugate prior#
To see why the conjugate prior is so helpful, let’s see what would have happened if we chose a different prior.
Suppose we instead choose the prior
thetas = np.linspace(0, 1, 500)
f, ax = plt.subplots(1, 1, figsize=(5, 3), dpi=80)
ax.plot(thetas, np.pi/2 * np.cos(thetas * np.pi/2), label=r'$\pi/2\cos(\pi \theta/2)$')
ax.set_xlabel(r'$\theta$')
ax.set_ylabel(r'$p(\theta)$')
ax.legend();

np.trapz(np.pi/2 * np.cos(thetas * np.pi/2), thetas)
/tmp/ipykernel_2176/2433367669.py:1: DeprecationWarning: `trapz` is deprecated. Use `trapezoid` instead, or one of the numerical integration functions in `scipy.integrate`.
np.trapz(np.pi/2 * np.cos(thetas * np.pi/2), thetas)
np.float64(0.9999991742330661)
We can try to compute the posterior just as we did before:
This distribution looks much more complicated: we can’t reduce it to a known distribution at all. Furthermore, there is no way to compute the normalizing constant in closed form: we’ll need to use numerical integration.
This makes it hard to reason about the distribution or use intuition about known distributions. It also means that if we want to compute quantities of interest about it (e.g., the mean/mode of the posterior for the LMSE/MAP estimates respectively), then we have to apply numerical integration.
A continuous example: estimating heights#
Suppose instead that now our parameter of interest
MLE for the Gaussian likelihood#
We can go through a similar process as before to find the MLE for the Gaussian likelihood. Because the normalizing constant
The log-likelihood is:
Differentiating with respect to
Once again, MLE has given us a very intuitive answer: the value for our parameter (population height average) that makes the data most likely is simply the mean of the sample heights. Note that our answer didn’t depend on
Exercise: Suppose instead of the population average, we had been interested in the population standard deviation. What is the MLE for the population standard deviation?
Bayesian inference for heights#
Now suppose we instead take a Bayesian approach, and treat
Since our parameter of interest is also a continuous variable, we could try using a normal distribution for the prior as well:
Note that the parameters of this prior distribution,
Although the algebra is a bit more involved, we can show that the normal distribution is the conjugate prior for the mean of a normal distribution. In other words, the posterior distribution for
It’s not quite as easy to read a quick interpretation from this as it was from the Beta posterior update. Let’s try to visualize some priors and posteriors and see if we can build intuition:
Show code cell content
# You don't need to understand how this function is implemented.
def plot_gaussian_prior_and_posterior(μ_0, σ_0, xs, σ, range_in_σs=3.5, show_map=False, show_lmse=False):
"""
Plots prior and posterior Normal distribution
Args:
μ_0, σ_0: parameters (mean, SD) of the prior distribution
xs: list or array of observations
σ: SD of the likelihood
range_in_σs: how many SDs away from the mean to show on the plot
show_map: whether or not to show the MAP estimate as a vertical line
show_lmse: whether or not to show the LMSE/MMSE estimate as a vertical line
"""
n = len(xs)
posterior_σ = 1/np.sqrt(1/(σ_0**2) + n/(σ**2))
posterior_mean = (posterior_σ**2) * (μ_0/(σ_0**2) + np.sum(xs)/(σ**2))
# Compute range for plot
posterior_min = posterior_mean - (range_in_σs * posterior_σ)
posterior_max = posterior_mean + (range_in_σs * posterior_σ)
prior_min = μ_0 - (range_in_σs * σ)
prior_max = μ_0 + (range_in_σs * σ)
xmin = min(posterior_min, prior_min)
xmax = max(posterior_max, prior_max)
x = np.linspace(xmin, xmax, 100)
if show_lmse:
x_lmse = posterior_mean
else:
x_lmse = None
if show_map:
x_map = posterior_mean
else:
x_map = None
prior = stats.norm.pdf(x, μ_0, σ_0)
posterior = stats.norm.pdf(x, posterior_mean, posterior_σ)
plot_prior_posterior(x, prior, posterior, (xmin, xmax), 'Prior', 'Posterior',
x_map=x_map, x_lmse=x_lmse)
small_sample = [6*12, 6*12+1, 5*12+9]
larger_sample = [6*12, 6*12+1, 5*12+9] * 10
plot_gaussian_prior_and_posterior(5*12+6, 1, small_sample, 1)

plot_gaussian_prior_and_posterior(5*12+6, 1, larger_sample, 1)

plot_gaussian_prior_and_posterior(6*12, 1, small_sample, 1)

plot_gaussian_prior_and_posterior(6*12, 1, larger_sample, 1)

Just as before, we can observe that the prior has a stronger effect when we have less data. We can also see that the posterior distribution becomes narrower as the amount of data increases. This is another general fact: the more data we observe, the narrower our posterior distribution becomes.
Optional Exercise: now that you’ve seen the effect of the prior parameters in a normal prior + normal likelihood model, give an intuitive explanation of how the parameters interact in the posterior distribution above for