Causality#

Now that we understand association and correlation (and their limitations), we are ready to begin our journey into causal inference. We’ll start with a discussion of the concept of causality: what it means, and why it matters. Then, we’ll define a specific way of thinking about causality, using counterfactual reasoning, that will guide the remainder of our study of causal inference.

What is causality? (Optional)#

To understand causality, let’s explore the meaning of the word “cause”. Consider the following sentences. All of them use words like “causes” or “because”, and all are valid sentences in modern English, but the meaning of “cause” is slightly different in each one:

  1. The soccer ball moved because I kicked it.

  2. The couple broke up because one of them moved to another country.

  3. I am who I am today because of my parents.

  4. Gravity causes objects to fall to the Earth.

  5. Down Syndrome is caused by having 3 copies of chromosome 21.

  6. Humans burning fossil fuels is causing climate change.

  7. Barbarian invasions caused the Roman Empire to fall.

  8. Smoking causes lung cancer.

Most modern understandings of the word “cause” align with what the Greek philosopher Aristotle called “agent” or “efficient” cause: one agent or object bringing about a change in another. Taking a closer look at the examples above, if we limit ourselves to this understanding, we should eliminate sentences 3 and 4. In sentence 3, the thing being “caused” is “who I am today”, which doesn’t fall under the idea of a “change”. Similarly, in sentence 4, the thing doing the causing is “gravity”, which isn’t a specific agent or object, but rather a conceptual physical theory. Note that this is a subjective determination!

One way to think about cause is to divide the statements above into deterministic and probabilistic causes. What do these mean? If A is a deterministic cause of B, then whenever A happens, B must happen. In contrast, if A is a probabilistic cause of B, then whenever A happens, the probability of B increases. Going back to our examples above, sentences 1, 5, and 6 are all statements of deterministic causality. This doesn’t mean that the causality in the remaining sentences (2, 7, and 8) isn’t real! For example, consider statement 7: most historians believe that the Roman Empire was in a prolonged state of decline before the barbarian invasions. However, those invasions still had a real, tangible impact on the empire’s fall.

There are many other ways of thinking about causality: for example, we can divide the statements above into instance causality and class causality; single cause and multiple cause; simple action and compound action; and much more.

Why does causality matter?#

Coming soon

Why is causality hard?#

Coming soon

Potential outcomes and counterfactuals#

Some of this material is covered in Section 12.2 of the Data 8 textbook through a more conceptual lens.

Suppose you have a fever, and you decide to take an aspirin. If your fever went away an hour later, how can we know whether the aspirin caused your fever to go down? To help us answer this question, consider the following thought experiment: imagine yourself in two parallel universes. In Universe 1, you took the aspirin. In Universe 0, you didn’t take the aspirin. If your fever improved by the same amount in both universes, then that seems like strong evidence that the aspirin did not cause your fever to go down. On the other hand, if your fever went down in Universe 1 and not in Universe 0, then that seems like strong evidence that the aspirin did cause your fever to go down.

Notationally, we’ll use \(Y(1)\) to denote the outcome (in this case, your temperature) in Universe 1, and \(Y(0)\) for the outcome in Universe 0 (Note: You may also see people use superscripts, e.g., \(Y^0\) and \(Y^1\), but in this class we’ll consistently use \(Y(0)\) and \(Y(1)\)). In any real-world data, we only get to observe one of these: in the example above, if you decided to take the aspirin, then you would only ever observe \(Y(1)\). For this reason, the other universe is often called the counterfactual universe. This brings us to the fundamental problem of causal inference: that we can never observe both \(Y(1)\) and \(Y(0)\) for the same individual.

Notation#

Suppose we’re interested in whether a particular treatment causes a particular outcome. We’ll mostly limit ourselves to the case where the treatment is binary. We’re interested in whether applying the treatment to a particular unit causes a change in the outcome. Here, the word unit refers to the individuals in our study. They could be individual people, groups of people, or even inanimate objects (e.g., in a study on whether fertilizer causes increased crop yields, the units might be fields, or they might be individual plants). We’ll use the following notation:

  • \(Y_i(0)\) and \(Y_i(1)\) are the potential outcomes for unit \(i\): they’re the outcomes for the universe where unit \(i\) wasn’t/was treated, respectively.

  • \(Z_i\) tells us whether unit \(i\) was treated or not.

    • For example, if \(Z_7 = 1\), that means unit 7 was treated, and we only got to observe \(Y_7(1)\) (and did not get to observe \(Y_7(0)\)).

  • \(Y_{i, obs}\) is shorthand for “the outcome in the universe that we actually observed”.

    • In other words, \(Y_{i, obs} = Y_i(Z_i)\).

    • In cases where \(Z_i\) is binary, we can write \(Y_{i,obs} = Z_i \cdot Y_i(1) + (1-Z_i) \cdot Y_i(0)\) (see the short code sketch after this list).

    • We’ll sometimes use the notation \(Y_i\) as shorthand for \(Y_{i,obs}\).

    • We’ll also use the notation \(Y_i(\cdot)\) as shorthand for the pair \(\big(Y_i(0), Y_i(1)\big)\).
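As a quick sanity check on this notation, here’s a minimal NumPy sketch (the numbers are made up for illustration) showing how the observed outcome falls out of the treatment decision and the two potential outcomes:

```python
import numpy as np

# Made-up potential outcomes for 5 units:
Y0 = np.array([1.0, 3.0, 2.0, 5.0, 4.0])   # Y_i(0): outcome if not treated
Y1 = np.array([2.0, 3.5, 2.0, 6.0, 4.5])   # Y_i(1): outcome if treated
Z = np.array([1, 0, 0, 1, 0])              # Z_i: treatment decision

# Y_{i,obs} = Z_i * Y_i(1) + (1 - Z_i) * Y_i(0)
Y_obs = Z * Y1 + (1 - Z) * Y0
print(Y_obs)   # [2. 3. 2. 6. 4.]
```

Note that unit 3 has \(Y_3(0) = Y_3(1) = 2\) in these made-up numbers: for that unit, the treatment has no effect, and the observed outcome would be the same in either universe.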

We can visualize the potential outcomes in a table:

| Unit (\(i\)) | Outcome if not treated | Outcome if treated | Treated or not? |
|---|---|---|---|
| 1 | \(Y_1(0)\) | \(Y_1(1)\) | \(Z_1\) |
| 2 | \(Y_2(0)\) | \(Y_2(1)\) | \(Z_2\) |
| 3 | \(Y_3(0)\) | \(Y_3(1)\) | \(Z_3\) |
| 4 | \(Y_4(0)\) | \(Y_4(1)\) | \(Z_4\) |
| 5 | \(Y_5(0)\) | \(Y_5(1)\) | \(Z_5\) |

In reality, we never get to observe the entire table. Instead, the data we observe look more like:

| Unit | Outcome if not treated | Outcome if treated | Treated or not? |
|---|---|---|---|
| 1 | ? | \(Y_1(1)\) | \(Z_1=1\) |
| 2 | \(Y_2(0)\) | ? | \(Z_2=0\) |
| 3 | \(Y_3(0)\) | ? | \(Z_3=0\) |
| 4 | ? | \(Y_4(1)\) | \(Z_4=1\) |
| 5 | \(Y_5(0)\) | ? | \(Z_5=0\) |
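The same masking is easy to reproduce in code. Here’s a short pandas sketch (reusing the made-up numbers from the earlier snippet) in which each counterfactual entry becomes a NaN, playing the role of the “?” marks above:

```python
import numpy as np
import pandas as pd

Y0 = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
Y1 = np.array([2.0, 3.5, 2.0, 6.0, 4.5])
Z = np.array([1, 0, 0, 1, 0])

# Keep only the outcome from the universe we actually observed;
# the entry from the other universe is missing (NaN).
observed = pd.DataFrame({
    "Outcome if not treated": np.where(Z == 0, Y0, np.nan),
    "Outcome if treated": np.where(Z == 1, Y1, np.nan),
    "Treated or not?": Z,
}, index=pd.RangeIndex(1, 6, name="Unit"))
print(observed)
```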

It’s important to keep the right intuition in mind for \(Z_i\): this variable does not represent the treatment itself; instead, it represents the decision on whether or not to treat unit \(i\). The effect of the treatment itself is captured in the difference between \(Y_i(0)\) and \(Y_i(1)\). Let’s look at two examples that make this distinction clearer. These examples also preview some of the insights we’ll develop further in the next two sections.

  • In a randomized controlled trial, we randomly assign units to the treatment and control groups. Because the trial is randomized, the treatment/control assignment \(Z_i\) is independent of the potential outcomes \((Y_i(0), Y_i(1))\). This doesn’t mean that the treatment is independent of the outcome we observe! (Remember, the observed outcome is \(Y_{i,obs} = Y_i(Z_i)\).) It only means that in a randomized controlled trial, the decision on whether or not to treat is independent of the potential outcomes. We’ll see in the next section how this fact can be very useful.

  • Suppose we look at an observational study measuring the effect of yacht ownership on attitudes toward taxation. Here, \(Z_i = 1\) if respondent \(i\) owns a yacht; \(Y_i(0)\) represents respondent \(i\)’s attitude toward taxation if they did not own a yacht; and \(Y_i(1)\) represents respondent \(i\)’s attitude toward taxation if they did own a yacht. In this case, we can intuitively see that someone wealthy enough to own a yacht will probably prefer lower taxes regardless of whether they actually own one (i.e., whether they’re in Universe 1 or Universe 0), so the treatment decision is not independent of the potential outcomes, as the simulation below illustrates.
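To make this contrast concrete, here’s a small simulation (all of the distributions are invented for illustration). A latent “wealth” variable drives the potential outcomes; in the randomized scenario, the treatment decision is a coin flip, while in the observational scenario, it depends on wealth. Comparing the average of \(Y(0)\) in the treated and untreated groups shows independence in the first case and a clear dependence in the second:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# A latent wealth variable affects the potential outcomes
# (say, attitudes toward taxation).
wealth = rng.normal(size=n)
Y0 = wealth + rng.normal(size=n)   # outcome without treatment
Y1 = Y0 + 1.0                      # treatment shifts the outcome (made-up effect)

# Randomized trial: the treatment decision is a coin flip, independent of (Y0, Y1).
Z_rct = rng.integers(0, 2, size=n)
# Observational study: wealthier units are more likely to be "treated"
# (e.g., own a yacht).
Z_obs = (wealth + rng.normal(size=n) > 0).astype(int)

for name, Z in [("randomized", Z_rct), ("observational", Z_obs)]:
    print(f"{name:>13}: E[Y(0)|Z=1] ≈ {Y0[Z == 1].mean():+.2f}, "
          f"E[Y(0)|Z=0] ≈ {Y0[Z == 0].mean():+.2f}")
# In the randomized case the two group means agree; in the observational case
# they differ, because the treatment decision carries information about the
# potential outcomes.
```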

Average treatment effect#

We’ve already seen that we can never know the difference \(Y_i(1) - Y_i(0)\) for an individual unit \(i\), since we only get to observe one of the two. So, what we’ll do instead is look at the Average Treatment Effect (ATE), which we’ll denote with the Greek letter tau, \(\tau\):

\[\text{ATE} = \tau = E[Y_i(1) - Y_i(0)]\]

Here, the \(Y_i(\cdot)\)s are modeled as i.i.d. random variables (we’ll say more about this modeling choice in “Outcomes: fixed or random?” below), so the expectation is over the randomness in that distribution. We’ll often drop the \(i\) from the notation for convenience.
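Since we never observe both potential outcomes in real data, the only place we can compute \(\tau\) directly from its definition is in a simulation, where both columns of the potential-outcomes table are known. Here’s a minimal sketch (with invented numbers, and a per-unit effect that varies, to emphasize that the ATE is an average):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

Y0 = rng.normal(size=n)                    # potential outcomes without treatment
Y1 = Y0 + 1.0 + 0.5 * rng.normal(size=n)   # per-unit effects vary around 1

# The ATE averages the unit-level differences Y_i(1) - Y_i(0).
tau = np.mean(Y1 - Y0)
print(tau)   # close to 1.0
```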

If we knew all the data in the table above (including the counterfactual outcomes, marked with “?”), then we could take the difference in means of the two outcome columns to find the ATE. But if we take the average of only the values we actually see, then we’re implicitly conditioning on the treatment decision. In other words, if we computed the average of the observed values in the “Outcome if not treated” column, it would be an estimate of \(E[Y(0) | Z = 0]\). How do we relate this conditional expectation to the unconditional one in the ATE? We’ve already seen how to do this: the tower property (also known as iterated expectation):

\[\begin{split} \begin{align} \tau &= E[Y(1) - Y(0) | Z = 1]P(Z=1) + E[Y(1) - Y(0)|Z=0]P(Z=0) \\ &= \Big(\overbrace{E[Y(1) | Z = 1]}^{\text{observed}} - \overbrace{E[Y(0) | Z = 1]}^{\text{counterfactual}}\Big)P(Z=1) \, + \\ & \quad\,\Big(\underbrace{E[Y(1)|Z=0]}_{\text{counterfactual}} - \underbrace{E[Y(0)|Z=0]}_{\text{observed}}\Big)P(Z=0) \end{align} \end{split}\]

Taking a closer look at these four terms, we see that two of them (the first and the last) correspond to data that we got to observe. Unfortunately, the middle two terms are counterfactual terms: if \(Z_i=1\), then we don’t get to observe \(Y_i(0)\)!
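As a sanity check, here’s a sketch that verifies the decomposition numerically on simulated data (again, with invented distributions and a confounded treatment decision), where, unlike in real data, we can peek at the counterfactual terms:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

wealth = rng.normal(size=n)
Y0 = wealth + rng.normal(size=n)
Y1 = Y0 + 1.0 + 0.5 * rng.normal(size=n)
Z = (wealth + rng.normal(size=n) > 0).astype(int)   # confounded assignment

p1 = Z.mean()   # P(Z = 1)
# E[Y(1)-Y(0) | Z=1] P(Z=1)  +  E[Y(1)-Y(0) | Z=0] P(Z=0):
decomposed = ((Y1[Z == 1].mean() - Y0[Z == 1].mean()) * p1
              + (Y1[Z == 0].mean() - Y0[Z == 0].mean()) * (1 - p1))
print(decomposed, np.mean(Y1 - Y0))   # the two agree (up to floating point)

# In real data, the counterfactual terms E[Y(0)|Z=1] and E[Y(1)|Z=0] are
# unobservable, so we couldn't actually evaluate `decomposed` this way.
```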

In the next few sections, we’ll see ways of working around this problem.

Outcomes: fixed or random?#

There are two different schools of thought around how to model the potential outcomes \(Y_i(0)\) and \(Y_i(1)\):

  1. In the fixed-sample approach, we assume that the potential outcomes \(Y_i(0)\) and \(Y_i(1)\) are fixed, and only \(Z_i\) is random. We observe one outcome \(Y_{i,obs} = Y_i(Z_i)\), which is fixed and known, and the other is fixed and unknown.

  2. In the superpopulation model, we assume that the tuples \((Z_i, Y_i(0), Y_i(1))\) are random and i.i.d.: in other words, there’s some joint distribution over treatment and potential outcomes, and each unit’s data (treatment decision and potential outcomes) are independent of every other unit’s data.

Historically, the fixed-sample approach was developed to analyze randomized experiments, and it remains an active area of research. The superpopulation model is newer, and has been developed extensively over the last 40 years. It can be applied to both randomized experiments and observational studies.

In this book, we’ll focus on the superpopulation model.

Stable Unit Treatment Value Assumption (SUTVA)#

Coming soon