
import numpy as np
import pandas as pd
from scipy import stats
from IPython.display import YouTubeVideo

%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Graphical Models, Probability Distributions, and Independence

Graphical Models

A graphical model provides a visual representation of a Bayesian hierarchical model. These models are sometimes known as Bayesian networks, or Bayes nets.

We represent each random variable with a node (circle), and a directed edge (arrow) between two random variables indicates that the distribution for the child variable is conditioned on the parent variable. When drawing graphical models, we usually start with the variables that don’t depend on any others. These are usually, but not always, unobserved parameters of interest like $\theta$ in the product review example below. Then, we proceed by drawing a node for each variable that depends on those, and so on. Variables that are observed are shaded in.

We’ll draw graphical models for the three examples we’ve seen in previous sections: the product review model, the kidney cancer model, and the exoplanet model.

Graphical model for product reviews

In our product review model, we have the following random variables:

$$
\begin{align}
x_i \mid \theta &\sim \mathrm{Bernoulli}(\theta) \\
\theta &\sim \mathrm{Beta}(a, b)
\end{align}
$$

In this case, this means we start with a node for the product quality $\theta$, and then add one node for each review $x_i$, all of which depend on $\theta$. The nodes for the observed reviews $x_i$ are shaded in, while the node for the hidden (unobserved) product quality $\theta$ is not:

This visual representation shows us the structure of the model, by making it clear that each review $x_i$ depends on the quality $\theta$. But just as before, this model is simple enough that we already knew that. Next, we’ll look at the graphical model for a more interesting example.
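First, though, to make this generative story concrete, here is a minimal simulation sketch of the product review model, sampling the parent node $\theta$ and then each child node $x_i$. The prior hyperparameters and the number of reviews below are illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices (not from the text): prior hyperparameters and number of reviews
a, b = 1, 1
n_reviews = 20

# Follow the graph from parent to children:
# theta ~ Beta(a, b), then each review x_i | theta ~ Bernoulli(theta)
theta = rng.beta(a, b)
x = rng.binomial(n=1, p=theta, size=n_reviews)

print(f"sampled quality theta = {theta:.3f}")
print(f"simulated reviews: {x}")
```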

Graphical model for kidney cancer death risk

Recall the full hierarchical model for the kidney cancer death risk example:

$$
\begin{align*}
a &\sim \mathrm{Uniform}(0, 50) \\
b &\sim \mathrm{Uniform}(0, 300000) \\
\theta_i &\sim \mathrm{Beta}(a, b), & i \in \{1, 2, \ldots, C\} \\
y_i &\sim \mathrm{Binomial}(\theta_i, n_i), & i \in \{1, 2, \ldots, C\}
\end{align*}
$$

  • $y_i$ represents the number of kidney cancer deaths in county $i$ (out of a population of $n_i$).

  • $\theta_i$ represents the kidney cancer death rate for county $i$.

  • $a$ and $b$ represent the parameters of the shared prior for the county-level rates.
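This hierarchy can be read as a top-down generative process, sampling each node after its parents. Here is a minimal sketch of that process; the number of counties and the county populations $n_i$ are made-up illustrative values, not real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (not real data): a handful of counties with made-up populations n_i
C = 5
n = np.array([1_000, 5_000, 20_000, 100_000, 500_000])

# Sample from the top of the hierarchy down to the observations
a = rng.uniform(0, 50)            # a ~ Uniform(0, 50)
b = rng.uniform(0, 300_000)       # b ~ Uniform(0, 300000)
theta = rng.beta(a, b, size=C)    # theta_i ~ Beta(a, b) for each county
y = rng.binomial(n=n, p=theta)    # y_i ~ Binomial(n_i, theta_i)

print("death rates theta_i:", np.round(theta, 6))
print("death counts y_i:   ", y)
```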

In order to draw the graphical model, we need to draw one node per random variable, and draw arrows to indicate dependency. We know that:

  • We need a node for $a$ and a node for $b$.

  • We need one node for each $\theta_i$ and one node for each $y_i$.

  • Each $\theta_i$ depends on $a$ and $b$.

  • Each $y_i$ depends on $\theta_i$ and $n_i$.

  • Because $n_i$ is a fixed number, we’ll draw it as a dot.

So, our full graphical model looks like:

(Optional) Example: Graphical Model for Exoplanet Model

Text coming soon: see video

Relating graphical models to probability distributions

When we were drawing graphical models above, we drew one node per variable, and started from the “top,” working with the variables that didn’t depend on any others. We then worked our way through the model, ending with observed variables. When looking at a graphical model to derive the corresponding joint distribution of all the variables in the model, we follow a similar process. For example, in the kidney cancer death rate model, we can write the joint distribution of all the variables in our model by starting at the root (i.e., the nodes that have no parents), and then proceeding through their children, writing the joint distribution as a product.

So, we start with $p(a)$ and $p(b)$, then $p(\theta_i \mid a, b)$ (for $i \in \{1, \ldots, C\}$), then $p(y_i \mid \theta_i)$:

$$
p(a, b, \theta_1, \ldots, \theta_C, y_1, \ldots, y_C) = p(a)\,p(b) \prod_{i=1}^C p(\theta_i \mid a, b)\, p(y_i \mid \theta_i)
$$

Factoring the distribution this way helps us understand and mathematically demonstrate the independence and dependence relationships in our graphical models, as we’ll see shortly.
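For example, here is a hedged sketch of what this factorization looks like in code for the kidney cancer model: the joint log-density is a sum of one log-factor per node in the graph. The specific values plugged in at the end are arbitrary, just to show the function running.

```python
import numpy as np
from scipy import stats

def log_joint(a, b, theta, y, n):
    """log p(a, b, theta_1..C, y_1..C), with one term per node in the graph."""
    return (
        stats.uniform(0, 50).logpdf(a)             # p(a)
        + stats.uniform(0, 300_000).logpdf(b)      # p(b)
        + stats.beta(a, b).logpdf(theta).sum()     # prod_i p(theta_i | a, b)
        + stats.binom(n, theta).logpmf(y).sum()    # prod_i p(y_i | theta_i)
    )

# Arbitrary illustrative values (not real data)
n = np.array([1_000, 20_000, 100_000])
y = np.array([0, 1, 4])
theta = np.array([1e-4, 5e-5, 4e-5])
print(log_joint(a=20.0, b=200_000.0, theta=theta, y=y, n=n))
```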

Independence and Conditional Independence

Review: independence and conditional independence

We say that two random variables $w$ and $v$ are independent if knowing the value of one tells us nothing about the distribution of the other. Notationally, we write $w \perp\!\!\!\perp v$. The following statements are all true for independent random variables $w$ and $v$:

  • If $w$ and $v$ are independent ($w \perp\!\!\!\perp v$), then the joint distribution $p(w, v)$ can be written as the product of the marginal distributions: $p(w, v) = p(w)p(v)$.

  • If $w$ and $v$ are independent ($w \perp\!\!\!\perp v$), then the conditional distributions are equal to the marginal distributions: $p(w \mid v) = p(w)$ and $p(v \mid w) = p(v)$.

Exercise: using the definition of conditional distributions, show that the two conditions above are mathematically equivalent.
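Solution sketch (assuming the relevant probabilities are positive, so the conditionals are well defined): by the definition of conditional distributions, $p(w \mid v) = p(w, v) / p(v)$, so

$$
p(w, v) = p(w)p(v)
\quad\Longleftrightarrow\quad
\frac{p(w, v)}{p(v)} = p(w)
\quad\Longleftrightarrow\quad
p(w \mid v) = p(w),
$$

and swapping the roles of $w$ and $v$ gives $p(v \mid w) = p(v)$.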

We say that two random variables $w$ and $v$ are conditionally independent given a third random variable $u$ if, when we condition on $u$, knowing the value of one of $v$ or $w$ tells us nothing about the distribution of the other. Notationally, we write $w \perp\!\!\!\perp v \mid u$, and mathematically this means that $p(w, v \mid u) = p(w \mid u)\, p(v \mid u)$.

For example, suppose $x_1$ and $x_2$ are the heights of two people randomly sampled from a very specific population with some average height $\mu$: this population could be college students, or second-graders, or Olympic swimmers, or some other group entirely.

If we know the value of $\mu$, then $x_1$ and $x_2$ are conditionally independent, because they’re random samples from the same distribution with known mean $\mu$. For example, if we are given that $\mu = 4'1''$, then knowing $x_1$ does not tell us anything about $x_2$.

Suppose instead that we don’t know the value of $\mu$. Then, we find out that $x_1 = 7'1''$. In this case, we might guess that the ‘specific population’ is likely a very tall group, such as NBA players. This will affect our belief about the distribution of $x_2$ (i.e., we should expect the second person to be tall too). So, in this case:

  • $x_1$ and $x_2$ are conditionally independent given $\mu$: $x_1 \perp\!\!\!\perp x_2 \mid \mu$.

  • $x_1$ and $x_2$ are not unconditionally independent: it is not true that $x_1 \perp\!\!\!\perp x_2$.
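We can also check this numerically. The simulation below is a sketch under illustrative assumptions not stated in the example (a few possible population averages, and normally distributed heights around the group mean): marginally, $x_1$ and $x_2$ are strongly correlated, but once we restrict to a single value of $\mu$, the correlation essentially disappears.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 100_000

# Illustrative assumption: mu is one of a few very different population
# averages (in inches), and heights are normal around the group mean.
group_means = np.array([50.0, 67.0, 78.0])
mu = rng.choice(group_means, size=n_draws)
x1 = rng.normal(mu, 2.0)
x2 = rng.normal(mu, 2.0)

# Marginally (mu unknown), knowing x1 tells us which group we're likely in:
print("corr(x1, x2), mu unknown:  ", np.corrcoef(x1, x2)[0, 1].round(3))

# Conditioning on mu (restricting to one group), x1 carries no information about x2:
mask = mu == 67.0
print("corr(x1, x2) given mu = 67:", np.corrcoef(x1[mask], x2[mask])[0, 1].round(3))
```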

Independence and conditional independence in graphical models

The structure of a graphical model can tell us a lot about the independence relationships between the variables in our model. Specifically, we can determine whether two random variables are unconditionally independent or conditionally independent given a third variable, just by looking at the structure of the model. Let’s look at a few examples to illustrate this. We’ll start with the height example we just saw:

From our reasoning above, we know that $x_1 \perp\!\!\!\perp x_2 \mid \mu$, but that $x_1$ and $x_2$ are not unconditionally independent. This is true in general for any three variables in a graphical model in this configuration.

Exercise: mathematically prove the results stated above.

Solution: To show that $x_1$ and $x_2$ are not unconditionally independent, we must show that $p(x_1, x_2) \neq p(x_1)p(x_2)$. We can compute $p(x_1, x_2)$ by looking at the joint distribution over all three variables and then marginalizing over $\mu$:

$$
\begin{align*}
p(x_1, x_2) &= \int p(x_1, x_2, \mu)\, d\mu \\
&= \int p(\mu)\, p(x_1 \mid \mu)\, p(x_2 \mid \mu)\, d\mu
\end{align*}
$$

Unfortunately, there is no way to factor the integrand into a term involving only $x_1$ and a term involving only $x_2$, so the integral does not factor. In other words, in general, the integral above will not equal $p(x_1)p(x_2)$, so the variables are not unconditionally independent.

What about conditional independence given $\mu$? We need to show that $p(x_1, x_2 \mid \mu) = p(x_1 \mid \mu)\, p(x_2 \mid \mu)$:

$$
\begin{align*}
p(x_1, x_2 \mid \mu) &= \frac{p(x_1, x_2, \mu)}{p(\mu)} \\
&= \frac{p(\mu)\, p(x_1 \mid \mu)\, p(x_2 \mid \mu)}{p(\mu)} \\
&= p(x_1 \mid \mu)\, p(x_2 \mid \mu)
\end{align*}
$$

This mathematical result aligns with the intuition we built in the previous section.

Let’s look at another example:

In this example, $x$ and $z$ are not unconditionally independent. Intuitively, we can see that $y$ depends on $x$, and $z$ depends on $y$, so $x$ and $z$ are dependent.

But, $x$ and $z$ are conditionally independent given $y$: the lack of an arrow directly from $x$ to $z$ tells us that $z$ only depends on $x$ through $y$.

Exercise: mathematically prove the results stated above.
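Solution sketch: from the graph, the joint distribution factors as $p(x, y, z) = p(x)\,p(y \mid x)\,p(z \mid y)$. Conditioning on $y$,

$$
p(x, z \mid y) = \frac{p(x, y, z)}{p(y)} = \frac{p(x)\,p(y \mid x)}{p(y)}\,p(z \mid y) = p(x \mid y)\,p(z \mid y),
$$

so $x \perp\!\!\!\perp z \mid y$. Marginalizing over $y$ instead gives $p(x, z) = p(x) \int p(y \mid x)\, p(z \mid y)\, dy$, and the integral still depends on $x$ in general, so $p(x, z)$ does not factor as $p(x)p(z)$.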

Let’s look at a third example:

In this example, $x$ and $z$ are unconditionally independent, but given $y$, they are conditionally dependent. Why? Let’s look at an example that will help us build intuition for this result. Suppose that:

  • $y$ is whether or not I have a stuffy nose.

  • $x$ is whether or not I am sick (with a cold, flu, COVID, etc.).

  • $z$ is whether or not I have seasonal allergies.

First, we can see that the description matches the graphical model: whether or not I have a stuffy nose depends on whether or not I’m sick, and whether or not I have allergies. But, sickness and allergies don’t affect each other. In other words, if I don’t know anything about whether I have a stuffy nose, then my sickness and allergies are independent of each other.

Now, suppose I wake up one morning with a stuffy nose (i.e., $y = 1$), and I’m trying to determine whether I’m sick or have allergies. I look at the weather forecast, and see that the pollen counts are very high. As soon as I hear this information, I’m a lot more certain that $z = 1$. But, even though the weather forecast didn’t directly tell me anything about whether or not I’m sick, my belief that I’m sick drops significantly: my symptoms have been explained away by the more likely explanation that I have allergies.

In other words, conditioned on a value of $y$ (stuffy nose), knowing something about $z$ (allergies) gives me information about the distribution of $x$ (sickness). This is precisely the definition of conditional dependence.
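We can see this “explaining away” effect numerically with a small simulation of the story above. The probabilities in the sketch below are made-up illustrative numbers, not from the text; the point is only the comparison between $P(\text{sick} \mid \text{stuffy})$ and $P(\text{sick} \mid \text{stuffy}, \text{allergies})$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Made-up illustrative probabilities
p_sick, p_allergies = 0.1, 0.2

sick = rng.random(n) < p_sick            # x: am I sick?
allergies = rng.random(n) < p_allergies  # z: do I have allergies?

# y: a stuffy nose is likely if either cause is present, rare otherwise
stuffy = rng.random(n) < np.where(sick | allergies, 0.9, 0.05)

print("P(sick)                     =", sick.mean().round(3))
print("P(sick | stuffy)            =", sick[stuffy].mean().round(3))
print("P(sick | stuffy, allergies) =", sick[stuffy & allergies].mean().round(3))
```

In this sketch, conditioning on a stuffy nose raises the probability of being sick, but additionally learning that allergies are present pushes it back down toward its prior value.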

Exercise: mathematically prove the results above.
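Solution sketch: here the joint distribution factors as $p(x, y, z) = p(x)\,p(z)\,p(y \mid x, z)$. Marginalizing over $y$,

$$
p(x, z) = \int p(x)\,p(z)\,p(y \mid x, z)\, dy = p(x)\,p(z) \int p(y \mid x, z)\, dy = p(x)\,p(z),
$$

so $x \perp\!\!\!\perp z$ unconditionally. Conditioning on $y$ instead gives $p(x, z \mid y) = p(x)\,p(z)\,p(y \mid x, z)/p(y)$, and because $p(y \mid x, z)$ couples $x$ and $z$, this does not factor into $p(x \mid y)\,p(z \mid y)$ in general.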

These results can be formalized and generalized in the d-separation or Bayes’ ball algorithm. While this algorithm is beyond the scope of this textbook, we’ll look at a variant of it in a few chapters when we talk about causality.