
We’ll start our journey into causal inference by looking at several ways of quantifying association.

Throughout, we’ll assume we have two random variables, $Z$ and $Y$, and sometimes a third variable $X$. Although most of the methods we’ll describe can be used regardless of how we interpret them, it will be helpful when we move to causality to think of $Z$ as either a treatment or covariate, to think of $Y$ as an outcome, and to think of $X$ as a potential confounding variable.

Continuous Data: Correlation and Regression

Correlation coefficient

There are a few different ways to measure correlation. The most common is the Pearson correlation (also called the correlation coefficient), usually denoted with $r$ or $\rho$ (the Greek letter rho):

$$\rho_{ZY} = \frac{\text{cov}(Z, Y)}{\sqrt{\text{var}(Z)\,\text{var}(Y)}}$$

This is a good measure of the linear association between $Z$ and $Y$. For a refresher on Pearson correlation, see the Data 8 textbook and the Data 140 textbook.
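To make this concrete, here’s a minimal sketch in Python (using simulated, made-up data) that computes $\rho_{ZY}$ directly from the definition and checks it against NumPy’s built-in `corrcoef`:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated (made-up) data: z and y are linearly related plus noise.
z = rng.normal(0, 1, size=500)
y = 2.0 * z + rng.normal(0, 1, size=500)

# Pearson correlation from the definition: cov(Z, Y) / sqrt(var(Z) * var(Y)).
rho = np.cov(z, y)[0, 1] / np.sqrt(np.var(z, ddof=1) * np.var(y, ddof=1))

# Should match NumPy's built-in correlation coefficient.
print(rho, np.corrcoef(z, y)[0, 1])
```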

Linear regression

If we were to fit a linear model to predict $Y$ from $Z$, it would look something like:

$$Y = \alpha + \beta Z + \varepsilon.$$

As usual, we assume that $\varepsilon$ is zero-mean noise, with the additional property that $\text{cov}(Z, \varepsilon) = 0$. We’ve talked a lot about how to interpret this equation as a predictive model, but now we’ll look at it slightly differently.

We’ll think of this equation as simply a descriptive explanation of a relationship between $Z$ and $Y$, where the most important part of the relationship is $\beta$. We can use all the same computational machinery we’ve already developed to fit the model and compute $\beta$, and the interpretation is subject to the limitations we’ve already learned about (e.g., it doesn’t capture nonlinear association, it can be impacted by outliers, etc.). While it’s common to describe $\beta$ as quantifying the “effect” of $Z$ on $Y$, it’s important to understand the limitations of the word “effect” here: linear regression can only tell us the predictive effect, rather than the causal effect.

So, we’ll use the coefficient $\beta$ as a means to quantify the relationship between $Z$ and $Y$. Starting from our assumption that $\text{cov}(Z, \varepsilon) = 0$ and using properties of covariance, we can show that $\beta = \frac{\text{cov}(Z, Y)}{\text{var}(Z)}$. From here, we can also show that $\beta = \rho_{ZY} \sqrt{\frac{\text{var}(Y)}{\text{var}(Z)}}$ (as you may have seen empirically in Data 8).
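As a quick numerical sanity check, here’s a small sketch (again on simulated, made-up data) that fits the line with `np.polyfit` and verifies that the fitted slope matches both expressions for $\beta$:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(0, 1, size=1000)
y = 2.0 + 1.5 * z + rng.normal(0, 1, size=1000)

# Least-squares fit of degree 1: np.polyfit returns [slope, intercept].
beta_fit, alpha_fit = np.polyfit(z, y, deg=1)

# Slope from the covariance identity: beta = cov(Z, Y) / var(Z).
beta_cov = np.cov(z, y)[0, 1] / np.var(z, ddof=1)

# Slope from the correlation identity: beta = rho * sqrt(var(Y) / var(Z)).
rho = np.corrcoef(z, y)[0, 1]
beta_rho = rho * np.sqrt(np.var(y, ddof=1) / np.var(z, ddof=1))

print(beta_fit, beta_cov, beta_rho)  # all three agree (up to floating-point error)
```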

For example, suppose we’re interested in quantifying the relationship between the number of years of schooling an individual has received ($Z$) and their income ($Y$). If we were to compute the coefficient $\beta$, it would provide a way of quantifying the association between these two variables.

Multiple linear regression

Suppose we are now interested in quantifying the relationship between two variables $Z$ and $Y$, but we also want to account for or “control for” the effect of a third variable, $X$. Assuming a linear relationship between them, we can extend our earlier relationship:

$$Y = \alpha + \beta Z + \gamma X + \varepsilon.$$

In this case, we can interpret $\beta$ as a measure of association between $Z$ and $Y$ while controlling for or adjusting for the effect of a third variable $X$.
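Here’s a sketch of what adjusting for $X$ looks like in practice, using simulated data in which a made-up confounder $X$ influences both $Z$ and $Y$; the numbers and the choice of `np.linalg.lstsq` are illustrative, not the only way to fit this model:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Made-up confounder X that influences both Z and Y.
x = rng.normal(0, 1, size=n)
z = 0.8 * x + rng.normal(0, 1, size=n)
y = 1.0 + 2.0 * z + 3.0 * x + rng.normal(0, 1, size=n)

# Design matrix with an intercept column, Z, and X.
design = np.column_stack([np.ones(n), z, x])
alpha_hat, beta_hat, gamma_hat = np.linalg.lstsq(design, y, rcond=None)[0]

# Compare with the simple regression of Y on Z alone, which ignores X.
beta_unadjusted = np.cov(z, y)[0, 1] / np.var(z, ddof=1)

print(beta_hat, beta_unadjusted)  # adjusted slope is near 2; unadjusted is noticeably larger
```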

For a refresher on linear regression, see the Data 8 textbook and the Data 140 textbook.

Binary Data: (A Different Kind of) Risk

Correlation and regression coefficients are a fine way to measure association between continuous numerical variables, but what if our data are categorical? We’ll restrict ourselves to binary data in this section for simplicity. We’ll look at three commonly used metrics for these cases: risk difference (RD), risk ratio (RR), and odds ratio (OR).

When dealing with categorical data, we’ll often start by visualizing the data in a contingency table. We’ve already seen an example of this when we looked at Simpson’s Paradox.

|         | $y=0$    | $y=1$    |
|---------|----------|----------|
| $z=0$   | $n_{00}$ | $n_{01}$ |
| $z=1$   | $n_{10}$ | $n_{11}$ |

Note that these are different from the 2x2 tables that we used at the beginning of the course: there, the rows and columns represented reality and our decisions, respectively. Here, they represent two different observed variables in our data. Just as with those tables, there’s no standard convention about what to put in the rows vs the columns.

For example, suppose we are interested in examining the relationship between receiving a vaccine ($z$) for a particular virus and being infected with that virus ($y$). We’ll look at a study conducted on that vaccine. We’ll use $z=1$ to indicate getting the vaccine, and $y=1$ to indicate being infected with the virus. In this case, for example, $n_{10}$ would represent the number of people in the study who received the vaccine and did not get infected.

Most of the metrics we’ll discuss are based on the risk, which represents the probability of $y=1$ given a particular value of $z$: the risk for $z=1$ is $P(y=1 \mid z=1)$ and the risk for $z=0$ is $P(y=1 \mid z=0)$. Note that this definition is completely unrelated to the risk that we learned about in Chapter 1.

In our vaccination example, the term risk has an intuitive interpretation: it represents your risk of being infected given whether or not you were vaccinated.

The risk difference (RD) is defined as follows:

$$\begin{align} RD &= \underbrace{P(Y=1 \mid Z=1)}_{\text{risk for }Z=1} - \underbrace{P(Y=1 \mid Z=0)}_{\text{risk for }Z=0} \\ &= \frac{n_{11}}{n_{10} + n_{11}} - \frac{n_{01}}{n_{00} + n_{01}} \end{align}$$

Returning to the vaccine example: if the vaccine works as intended (i.e., there’s a strong negative association between being vaccinated and being infected), the risk of being infected should be lower for vaccinated people than for unvaccinated people, and the risk difference should be a negative number far from 0. On the other hand, if there’s little to no relationship between vaccination and infection, then the two terms should be very similar, and the risk difference should be close to 0.

We can see the same fact mathematically. If $Z$ and $Y$ are independent, then $P(Y=1 \mid Z=1) = P(Y=1 \mid Z=0) = P(Y=1)$, so the two terms are equal. This means that they cancel and that the risk difference is 0.
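Here’s a minimal sketch of computing the risk difference from a contingency table; the counts below are made up purely to illustrate the vaccine example:

```python
# Hypothetical counts (rows: z, columns: y); these numbers are made up.
n00, n01 = 400, 100   # unvaccinated (z=0): not infected, infected
n10, n11 = 480, 20    # vaccinated   (z=1): not infected, infected

risk_vaccinated = n11 / (n10 + n11)      # P(Y=1 | Z=1)
risk_unvaccinated = n01 / (n00 + n01)    # P(Y=1 | Z=0)

rd = risk_vaccinated - risk_unvaccinated
print(rd)  # -0.16: negative and far from 0, consistent with a protective vaccine
```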

The risk ratio (RR), also sometimes called the relative risk, is defined similarly as the ratio (instead of the difference) between the two quantities above:

$$RR = \frac{P(Y=1 \mid Z=1)}{P(Y=1 \mid Z=0)}$$

We can use similar reasoning as above to conclude that this ratio should be 1 when $Z$ and $Y$ are independent.
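Using the same made-up counts as above, the risk ratio is just the ratio of the two risks:

```python
# Same hypothetical counts as above (made up for illustration).
n00, n01 = 400, 100   # unvaccinated (z=0): not infected, infected
n10, n11 = 480, 20    # vaccinated   (z=1): not infected, infected

rr = (n11 / (n10 + n11)) / (n01 / (n00 + n01))
print(rr)  # 0.2: the vaccinated group's risk is one-fifth of the unvaccinated group's
```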

The third commonly used measure is the odds ratio (OR). It’s the ratio of two odds, where each odds is itself a ratio:

$$OR = \frac{\overbrace{P(Y=1 \mid Z=1)/P(Y=0 \mid Z=1)}^{\text{odds of }y\text{ in the presence of }Z}}{\underbrace{P(Y=1 \mid Z=0)/P(Y=0 \mid Z=0)}_{\text{odds of }y\text{ in the absence of }Z}}$$

While this looks more complicated, we can show that it simplifies nicely. In each conditional odds, the denominators of the two probabilities cancel: the odds for $Z=1$ are $\frac{n_{11}/(n_{10}+n_{11})}{n_{10}/(n_{10}+n_{11})} = \frac{n_{11}}{n_{10}}$, and similarly the odds for $Z=0$ are $\frac{n_{01}}{n_{00}}$. Taking the ratio gives:

$$OR = \frac{n_{00}}{n_{10}} \cdot \frac{n_{11}}{n_{01}}$$
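Using the same made-up counts once more, we can check that the definition of the odds ratio and the simplified formula agree:

```python
# Same hypothetical counts as above (made up for illustration).
n00, n01 = 400, 100   # unvaccinated (z=0): not infected, infected
n10, n11 = 480, 20    # vaccinated   (z=1): not infected, infected

# Odds ratio from the definition (ratio of conditional odds)...
odds_vaccinated = (n11 / (n10 + n11)) / (n10 / (n10 + n11))
odds_unvaccinated = (n01 / (n00 + n01)) / (n00 / (n00 + n01))
or_definition = odds_vaccinated / odds_unvaccinated

# ...and from the simplified formula.
or_simplified = (n00 / n10) * (n11 / n01)

print(or_definition, or_simplified)  # both are 1/6: the two expressions agree
```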