Quantifying Association#

We’ll start our journey into causal inference by looking at several ways of quantifying association.

Throughout, we’ll assume we have two random variables, \(Z\) and \(Y\), and sometimes a third variable \(X\). Although most of the methods we’ll describe can be used regardless of how we interpret them, it will be helpful when we move to causality to think of \(Z\) as either a treatment or covariate, to think of \(Y\) as an outcome, and to think of \(X\) as a potential confounding variable.

Continuous Data: Correlation and Regression#

Correlation coefficient#

There are a few different ways to measure correlation. The most common is the Pearson correlation (also called the correlation coefficient), usually denoted with \(r\) or \(\rho\) (the Greek letter rho):

\[ \rho_{ZY} = \frac{\text{cov}(Z, Y)}{\sqrt{\text{var}(Z)\text{var}(Y)}} \]

This is a good measure of the linear association between \(Z\) and \(Y\). For a refresher on Pearson correlation, see the Data 8 textbook and the Data 140 textbook.
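To make this concrete, here’s a minimal sketch (using NumPy and simulated data, with made-up variable names `z` and `y`) that computes the Pearson correlation both directly from the definition and with NumPy’s built-in helper:

```python
import numpy as np

# Simulated paired observations (illustrative only, not a real dataset)
rng = np.random.default_rng(42)
z = rng.normal(size=500)
y = 2 * z + rng.normal(size=500)  # y is linearly related to z, plus noise

# Pearson correlation from the definition: cov(Z, Y) / sqrt(var(Z) * var(Y))
rho = np.cov(z, y)[0, 1] / np.sqrt(np.var(z, ddof=1) * np.var(y, ddof=1))

# The same quantity via NumPy's built-in helper
rho_builtin = np.corrcoef(z, y)[0, 1]

print(rho, rho_builtin)  # the two values should agree
```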

Linear regression#

If we were to fit a linear model to predict \(Y\) from \(Z\), it would look something like:

\[Y = \alpha + \beta Z + \varepsilon.\]

As usual, we assume that \(\varepsilon\) is zero-mean noise, with the additional property that cov\((Z, \varepsilon) = 0\). We’ve talked a lot about how to interpret this equation as a predictive model, but now we’ll look at it slightly differently.

We’ll think of this equation as simply a descriptive explanation of a relationship between \(Z\) and \(Y\), where the most important part of the relationship is \(\beta\). We can use all the same computational machinery we’ve already developed to fit the model and compute \(\beta\), and the interpretation is subject to the limitations we’ve already learned about (e.g., it doesn’t capture nonlinear association, it can be impacted by outliers, etc.). While it’s common to describe \(\beta\) as quantifying the “effect” of \(Z\) on \(Y\), it’s important to understand the limitations of the word “effect” here: linear regression can only tell us the predictive effect, rather than the causal effect.

So, we’ll use the coefficient \(\beta\) as a means to quantify the relationship between \(Z\) and \(Y\). Starting from our assumption that cov\((Z, \varepsilon) = 0\) and using properties of covariance, we can show that \(\beta = \frac{\text{cov}(Z, Y)}{\text{var}(Z)}\). From here, we can also show that \(\beta = \rho_{ZY} \sqrt{\frac{\text{var}(Y)}{\text{var}(Z)}}\) (as you may have seen empirically in Data 8).
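As a rough numerical check (again with simulated data; the true intercept and slope below are chosen arbitrarily for illustration), the sketch below compares the least-squares slope with the two covariance-based expressions for \(\beta\):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=1000)
y = 1.5 + 3.0 * z + rng.normal(size=1000)  # made-up true values: alpha = 1.5, beta = 3.0

# Slope from the covariance formula: beta = cov(Z, Y) / var(Z)
beta_cov = np.cov(z, y)[0, 1] / np.var(z, ddof=1)

# Equivalent expression: beta = rho * sqrt(var(Y) / var(Z))
rho = np.corrcoef(z, y)[0, 1]
beta_rho = rho * np.sqrt(np.var(y, ddof=1) / np.var(z, ddof=1))

# Least-squares fit for comparison
beta_ls, alpha_ls = np.polyfit(z, y, deg=1)

print(beta_cov, beta_rho, beta_ls)  # all three should be (essentially) equal, and close to 3.0
```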

For example, suppose we’re interested in quantifying the relationship between the number of years of schooling an individual has received (\(Z\)) and their income (\(Y\)). If we were to compute the coefficient \(\beta\), it would provide a way of quantifying the association between these two variables.

Multiple linear regression#

Suppose we are now interested in quantifying the relationship between two variables \(Z\) and \(Y\), but we also want to account for or “control for” the effect of a third variable, \(X\). Assuming a linear relationship between them, we can extend our earlier relationship:

\[Y = \alpha + \beta Z + \gamma X + \varepsilon.\]

In this case, we can interpret \(\beta\) as a measure of association between \(Z\) and \(Y\) while controlling for or adjusting for the effect of a third variable \(X\).
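Here’s one possible sketch of this idea, again using simulated data (the coefficients and the amount of correlation between \(Z\) and \(X\) are made up). The adjusted coefficient on \(Z\) comes from an ordinary least-squares fit of \(Y\) on both \(Z\) and \(X\):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)                       # third variable X
z = 0.8 * x + rng.normal(size=n)             # Z is correlated with X
y = 2.0 * z + 1.0 * x + rng.normal(size=n)   # Y depends on both Z and X

# Design matrix with an intercept column, then Z and X
A = np.column_stack([np.ones(n), z, x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
alpha, beta, gamma = coef
print(beta)  # association between Z and Y, adjusting for X (close to 2.0 here)

# For comparison: regressing Y on Z alone mixes in some of X's effect
beta_unadjusted = np.cov(z, y)[0, 1] / np.var(z, ddof=1)
print(beta_unadjusted)  # noticeably larger than 2.0
```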


Binary Data: (A Different Kind of) Risk#

Correlation and regression coefficients are a fine way to measure association between continuous numerical variables, but what if our data are categorical? We’ll restrict ourselves to binary data in this section for simplicity. We’ll look at three commonly used metrics for these cases: risk difference (RD), risk ratio (RR), and odds ratio (OR).

When dealing with categorical data, we’ll often start by visualizing the data in a contingency table. We’ve already seen an example of this when we looked at Simpson’s Paradox.

|          | \(y=0\)    | \(y=1\)    |
|----------|------------|------------|
| \(z=0\)  | \(n_{00}\) | \(n_{01}\) |
| \(z=1\)  | \(n_{10}\) | \(n_{11}\) |

Note that these are different from the 2x2 tables that we used at the beginning of the course: there, the rows and columns represented reality and our decisions, respectively. Here, they represent two different observed variables in our data. Just as with those tables, there’s no standard convention about what to put in the rows vs the columns.

For example, suppose we are interested in examining the relationship between receiving a vaccine (\(z\)) for a particular virus and being infected with that virus (\(y\)). We’ll look at a study conducted on that vaccine. We’ll use \(z=1\) to indicate getting the vaccine, and \(y=1\) to indicate being infected with the virus. In this case, for example, \(n_{10}\) would represent the number of people in the study who received the vaccine and did not get infected.
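To make this concrete, here’s a minimal sketch that builds a contingency table like the one above from simulated data (the sample size and infection probabilities are made up for illustration, not taken from any real study), using pandas:

```python
import numpy as np
import pandas as pd

# Simulated study data: z = 1 means vaccinated, y = 1 means infected
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=2000)            # whether each person was vaccinated
p_infected = np.where(z == 1, 0.02, 0.10)    # vaccinated people are infected less often
y = rng.binomial(1, p_infected)              # whether each person was infected

# Contingency table: rows are values of z, columns are values of y, cells are counts
table = pd.crosstab(z, y, rownames=["z"], colnames=["y"])
print(table)
# table.loc[1, 0] is n_10: the number of people who were vaccinated and not infected
```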

Most of the metrics we’ll discuss are based on the risk, which represents the probability of \(y=1\) given a particular value of \(z\): The risk for \(z=1\) is \(P(y=1 | z=1)\) and the risk for \(z=0\) is \(P(y=1|z=0)\). Note that this definition is completely unrelated to the risk that we learned about in Chapter 1.

In our vaccination example, the term risk has an intuitive interpretation: it represents your risk of being infected given whether or not you were vaccinated.

The risk difference (RD) is defined as follows:

\[\begin{split} \begin{align} RD &= \underbrace{P(Y=1 \mid Z=1)}_{\text{risk for }Z=1} - \underbrace{P(Y=1 \mid Z=0)}_{\text{risk for }Z=0} \\ &= \frac{n_{11}}{n_{10} + n_{11}} - \frac{n_{01}}{n_{00} + n_{01}} \end{align} \end{split}\]

Returning to the vaccine example: if the vaccine works as intended (i.e., there’s a strong negative association between being vaccinated and being infected), your risk of being infected should be lower if you were vaccinated, and the risk difference should be a negative number far from 0. On the other hand, if there’s little to no relationship between vaccination and infection, then the two terms should be very similar, and the risk difference should be close to 0.

We can see the same fact mathematically. If \(Z\) and \(Y\) are independent, then \(P(Y=1 \mid Z=1) = P(Y=1 \mid Z=0) = P(Y=1)\), so the two terms are equal. This means that they cancel and that the risk difference is 0.
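Here’s a small sketch of this computation, using made-up counts (chosen only for illustration, not from any real study) in place of the counts from a real contingency table:

```python
# Made-up counts for illustration only:
# unvaccinated (z=0): 900 not infected, 100 infected
# vaccinated   (z=1): 980 not infected,  20 infected
n00, n01 = 900, 100
n10, n11 = 980, 20

risk_z1 = n11 / (n10 + n11)   # estimate of P(Y=1 | Z=1)
risk_z0 = n01 / (n00 + n01)   # estimate of P(Y=1 | Z=0)

rd = risk_z1 - risk_z0
print(rd)  # 0.02 - 0.10 = -0.08: negative, as we'd expect for a vaccine that works
```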

The risk ratio (RR), also sometimes called the relative risk, is defined similarly as the ratio (instead of the difference) between the two quantities above:

\[ RR = \frac{P(Y=1 \mid Z=1)}{P(Y=1 \mid Z=0)} \]

We can use similar reasoning as above to conclude that this ratio should be 1 when \(Z\) and \(Y\) are independent.
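Continuing with the made-up counts from the risk difference sketch above:

```python
# Risk ratio, using the risks estimated in the previous sketch
rr = risk_z1 / risk_z0
print(rr)  # 0.02 / 0.10 = 0.2, i.e., well below 1
```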

The third commonly used measure is the odds ratio (OR). It’s the ratio of two odds, where each odds is itself a ratio:

\[ OR = \frac{\overbrace{P(Y=1 \mid Z=1)/P(Y=0 \mid Z=1)}^{\text{odds of }Y\text{ in the presence of }Z}}{\underbrace{P(Y=1 \mid Z=0)/P(Y=0 \mid Z=0)}_{\text{odds of }Y\text{ in the absence of }Z}} \]

While this looks more complicated, we can show that it simplifies: the odds for \(Z=1\) work out to \(n_{11}/n_{10}\) and the odds for \(Z=0\) work out to \(n_{01}/n_{00}\), so the odds ratio reduces to:

\[ OR = \frac{n_{00}}{n_{10}} \cdot \frac{n_{11}}{n_{01}} \]
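Continuing once more with the same made-up counts, we can check that the definition and the simplified count-based formula agree:

```python
# Odds ratio two ways, using the counts and risks from the earlier sketch
odds_z1 = risk_z1 / (1 - risk_z1)   # P(Y=1|Z=1) / P(Y=0|Z=1), which equals n11 / n10
odds_z0 = risk_z0 / (1 - risk_z0)   # P(Y=1|Z=0) / P(Y=0|Z=0), which equals n01 / n00

or_from_definition = odds_z1 / odds_z0
or_from_counts = (n00 / n10) * (n11 / n01)

print(or_from_definition, or_from_counts)  # both are approximately 0.184
```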