Causal Inference in Observational Studies: Unconfoundedness#
So far, we’ve looked at ways of measuring association in any dataset. We’ve also looked at how we can draw conclusions about causality in randomized controlled trials. But in many cases, randomized controlled trials are not viable. Some reasons for that might be:
Ethical considerations: suppose we believe that a certain treatment might harm people. In order to test this belief in a randomized controlled trial, we would have to subject people to a treatment that we believe would harm them, which is profoundly unethical. In such cases, we need alternative ways of drawing causal conclusions: we have an obligation as data scientists and researchers to avoid making the mistakes of the past.
Cost: Randomized controlled trials can be expensive to run: we must find randomly chosen individuals that are representative of the population; apply a treatment to half of them; and measure the outcomes, potentially over a long range of time. In some cases, this may be infeasible.
Sample size: due to cost considerations, randomized controlled trials may not have enough subjects to draw strong statistical conclusions.
There is a wealth of data available from observational studies. In this section, we’ll look at ways we can use this data to draw causal conclusions. As we saw when looking at association, the most common problem when trying to draw conclusions about causality is the presence of confounding variables. We’ll focus our attention on methods that try to account for these confounding variables.
Why it’s hard#
When dealing with data from observational studies, we can’t just take the simple difference in means anymore. To see why not, let’s return to our restaurants example. We’ll do the same computations we did before, but this time from the potential outcomes perspective.
Consider the science table version of the data (where we’ve converted 👍 to 1 and 👎 to 0, so that averages correspond to percentages liked):
import pandas as pd
food = pd.read_csv('restaurants_counterfactuals.csv')
food.head()
|  | Dish rating for A | Dish rating for B | Year |
|---|---|---|---|
| 0 | 0.0 | NaN | 2020 |
| 1 | NaN | 1.0 | 2019 |
| 2 | 0.0 | NaN | 2020 |
| 3 | 1.0 | NaN | 2020 |
| 4 | 1.0 | NaN | 2020 |
The `NaN` values indicate missing counterfactual outcomes that we don’t get to observe: for example, dish 0 was ordered from Restaurant A, so we don’t know what rating the critics would have given a dish from Restaurant B on that day.
Displaying the data this way requires us to think about it slightly differently, and highlights the implications of the potential outcomes framework! In particular:
\(Z_i\) corresponds to whether the critic ordered dish \(i\) from restaurant A or B.
\(Y_i(B)\) and \(Y_i(A)\) represent what-ifs: if the critics had ordered dish \(i\) from a particular restaurant, would they have liked it?
The year is our confounder (or “confounding variable”): we’ll use \(X_i\) to denote the year that dish \(i\) was ordered in.
Representing the data this way makes it clear that we can’t compute the ATE:
food_ATE = (food["Dish rating for B"] - food["Dish rating for A"]).mean()
food_ATE
nan
In trying to understand a causal relationship between choice of restaurant and liking the food, naively using the simple difference in observed means (SDO) can mislead us, as we already saw:
food_sdo = food["Dish rating for B"].mean() - food["Dish rating for A"].mean()
food_sdo
np.float64(0.20000000000000007)
Confounding variables and the superpopulation model#
In the case of a randomized controlled trial, we saw that we could use the difference in means estimator because of the independence between the treatment assignment \(Z_i\) and the potential outcomes \(\big(Y_i(0), Y_i(1)\big)\). But, in the case of an observational study, this independence assumption no longer holds. We’ve seen a few examples of this in the last few sections, where confounding variables affect both the treatment and the outcome.
In this section, we’ll focus on a few methods that answer the question: if we also observe the confounding variable, can we determine whether the treatment has an effect on the outcome?
Our notation will be similar to what we’ve used before, and we’ll add a new variable \(X_i\) for each unit that represents the confounding information. We have tuples \((Z_i, Y_i(0), Y_i(1), X_i)\) for each unit that are i.i.d. (that is, the tuple for one unit is independent from the tuple for the next unit). This is called the superpopulation model.
Conditional average treatment effect and unconfoundedness#
We’ll define a new quantity called the conditional average treatment effect (CATE). This is like the ATE we saw before, but conditioned on a particular value of \(X\):

$$\tau(x) = E[Y(1) - Y(0) \mid X = x]$$
Introducing the conditioning alone doesn’t help us: if we were to expand this using the tower property to condition on \(Z\), we’d still have counterfactual terms. So why did we do it? It turns out that we only need to make one assumption for this to help us. This assumption is called unconfoundedness:

$$\big(Y(0), Y(1)\big) \perp\!\!\!\perp Z \mid X$$
This is a statement of conditional independence. The potential outcomes \(Y(0)\) and \(Y(1)\) might not be independent of the treatment decision \(Z\), but this assumption states that they are conditionally independent if we know the value of a confounding variable \(X\).
In the restaurant example above, unconfoundedness would mean that within each year, the decision whether to order from A or B (treatment) is independent of the ratings that dishes from A and B would get (potential outcomes). It does not state that the restaurant choice is conditionally independent of the observed rating: this is a subtle but important distinction.
If we assume unconfoundedness, then we can safely estimate the CATE using the difference in conditional means:

$$\tau(x) = E[Y \mid X = x, Z = 1] - E[Y \mid X = x, Z = 0]$$
From CATE to ATE#
In most cases, what we’re really interested in is the ATE. How do we get from the CATE to the ATE? As we’ve done so many times before, we can use the law of iterated expectation (tower property):

$$\tau = E[Y(1) - Y(0)] = E\big[E[Y(1) - Y(0) \mid X]\big] = E[\tau(X)]$$
In other words,

$$\tau = \sum_x \tau(x) \, P(X = x)$$

(where we would use an integral instead of a sum if \(X\) is continuous).
What does this look like in practice? Let’s go back to the restaurant example. Instead of computing the difference in means for the entire dataset, we’ll do it separately, once for 2020 and once for 2019:
food_2020 = food[food['Year'] == 2020]
food_2019 = food[food['Year'] == 2019]
food_sdo_2020 = food_2020["Dish rating for B"].mean() - food_2020["Dish rating for A"].mean()
food_sdo_2019 = food_2019["Dish rating for B"].mean() - food_2019["Dish rating for A"].mean()
print(food_sdo_2020, food_sdo_2019)
-0.05555555555555558 -0.125
These correspond to the CATEs \(\tau(2020)\) and \(\tau(2019)\) respectively. Just like earlier, we see that these values have the opposite sign from the SDO computed on the entire data: Restaurant A was better-liked in both 2020 and 2019. But, because the critics ordered from Restaurant A more in 2020 (a year with lower ratings overall), Restaurant A looks worse in the aggregate data.
How do we combine these to estimate the ATE? We need the marginal probabilities \(P(2020)\) and \(P(2019)\):
p_2020 = (food['Year'] == 2020).mean()
p_2019 = 1 - p_2020
est_food_ATE = food_sdo_2020 * p_2020 + food_sdo_2019 * p_2019
est_food_ATE
np.float64(-0.07870370370370372)
This is our first causal inference from observational data! To state it formally:
If we assume that the year is the only confounder for the effect of restaurant choice on reviews, then choosing Restaurant A causes the critics to like dishes about 8 percentage points more often.
When we assume that the year is the only confounder, we’re assuming *unconfoundedness*: that the pair of potential outcomes (i.e., the what-ifs for their reviews) is independent of which restaurant they actually chose, given the year.
Note that we can’t just make a blanket statement about causality: this statement depends heavily on our use of the unconfoundedness assumption. This is true in general: in order to draw causal inferences outside of randomized controlled experiments, we need to make assumptions. When reporting your causal conclusions, it’s always important to communicate the assumptions you made!
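The year-by-year calculation above generalizes to any discrete confounder. Here is a minimal sketch of a stratified estimator; the function name and the toy DataFrame are hypothetical, but the column layout mirrors the science table above:

```python
import pandas as pd

def stratified_ate(df, col_treated, col_control, confounder):
    """Estimate the ATE as a weighted average of per-stratum CATEs,
    where each stratum's weight is its share of the data."""
    estimate = 0.0
    for _, stratum in df.groupby(confounder):
        # Difference in conditional means within this stratum (the CATE).
        cate = stratum[col_treated].mean() - stratum[col_control].mean()
        estimate += cate * len(stratum) / len(df)
    return estimate

# Hypothetical science table: each row observes only one potential outcome.
toy = pd.DataFrame({
    'Dish rating for A': [1.0, 1.0, None, 0.0, None],
    'Dish rating for B': [None, None, 0.0, None, 1.0],
    'Year': [2019, 2019, 2019, 2020, 2020],
})
stratified_ate(toy, 'Dish rating for B', 'Dish rating for A', 'Year')
```

The `.mean()` calls skip `NaN` values, so within each stratum we are comparing only the observed outcomes for each group, exactly as in the year-by-year computation above.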
Outcome Regression#
Coming soon
Inverse propensity weighting#
In inverse propensity weighting, we’re intuitively trying to undo or compensate for the effect of the confounder by reweighting the outcome variables based on the treatment and confounders. To illustrate the effect, we’ll use a synthetic dataset from a treatment for high blood pressure for twenty people. The data contain three columns:
`treated`, which says whether or not each person received the treatment
`is_old`, whether the patient is older than 45 (1) or not (0)
`bp_change`, how much the person’s (systolic) blood pressure dropped (in mm Hg)
bp = pd.read_csv('bp_treatment.csv')
bp
|  | treated | is_old | bp_change |
|---|---|---|---|
| 0 | 0 | 0 | -0.2 |
| 1 | 0 | 0 | -3.4 |
| 2 | 0 | 0 | -4.0 |
| 3 | 0 | 0 | -5.6 |
| 4 | 0 | 0 | -2.9 |
| 5 | 0 | 0 | -3.8 |
| 6 | 0 | 0 | -4.7 |
| 7 | 0 | 1 | -0.7 |
| 8 | 0 | 1 | -0.9 |
| 9 | 1 | 0 | -5.2 |
| 10 | 1 | 0 | -4.8 |
| 11 | 1 | 0 | -5.9 |
| 12 | 1 | 0 | -6.1 |
| 13 | 1 | 1 | -1.1 |
| 14 | 1 | 1 | -1.5 |
| 15 | 1 | 1 | -2.1 |
| 16 | 1 | 1 | -0.8 |
| 17 | 1 | 1 | -1.7 |
| 18 | 1 | 1 | -4.9 |
| 19 | 1 | 1 | -5.4 |
bp.groupby(['is_old', 'treated']).mean()
| is_old | treated | bp_change |
|---|---|---|
| 0 | 0 | -3.514286 |
| 0 | 1 | -5.500000 |
| 1 | 0 | -0.800000 |
| 1 | 1 | -2.500000 |
From a quick look at the data, we can see a few facts:
Most of the people who received the treatment were older
Almost all the younger people’s BP improved substantially
Most of the older people’s BP improved only slightly
We have very few data points: only twenty people overall, and, e.g., only two older people who took the treatment.
We’ll ignore the last point for now, for the sake of illustration. The first three points make it very clear that age is a confounding variable for the effect of the treatment. One way to quantify the degree of confounding is using the conditional probability of receiving the treatment given the confounder: this is called the propensity score.
bp.groupby(['is_old', 'treated']).count()
| is_old | treated | bp_change |
|---|---|---|
| 0 | 0 | 7 |
| 0 | 1 | 4 |
| 1 | 0 | 2 |
| 1 | 1 | 7 |
For the older group, the probability of receiving the treatment is \(7/9 \approx 0.78\). For the younger group, the probability is \(4/11 \approx 0.36\). We can add these into the table:
bp['propensity_score'] = bp['is_old'] * 7/9 + (1-bp['is_old']) * 4/11
bp
|  | treated | is_old | bp_change | propensity_score |
|---|---|---|---|---|
| 0 | 0 | 0 | -0.2 | 0.363636 |
| 1 | 0 | 0 | -3.4 | 0.363636 |
| 2 | 0 | 0 | -4.0 | 0.363636 |
| 3 | 0 | 0 | -5.6 | 0.363636 |
| 4 | 0 | 0 | -2.9 | 0.363636 |
| 5 | 0 | 0 | -3.8 | 0.363636 |
| 6 | 0 | 0 | -4.7 | 0.363636 |
| 7 | 0 | 1 | -0.7 | 0.777778 |
| 8 | 0 | 1 | -0.9 | 0.777778 |
| 9 | 1 | 0 | -5.2 | 0.363636 |
| 10 | 1 | 0 | -4.8 | 0.363636 |
| 11 | 1 | 0 | -5.9 | 0.363636 |
| 12 | 1 | 0 | -6.1 | 0.363636 |
| 13 | 1 | 1 | -1.1 | 0.777778 |
| 14 | 1 | 1 | -1.5 | 0.777778 |
| 15 | 1 | 1 | -2.1 | 0.777778 |
| 16 | 1 | 1 | -0.8 | 0.777778 |
| 17 | 1 | 1 | -1.7 | 0.777778 |
| 18 | 1 | 1 | -4.9 | 0.777778 |
| 19 | 1 | 1 | -5.4 | 0.777778 |
This tells us that person 0 had a \(0.36\) probability of receiving the treatment, given that they were younger, and so on.
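The hand-coded fractions above can also be computed directly from the data: within each age group, the propensity score is just the fraction of people who were treated. A sketch, rebuilding the `treated` and `is_old` columns from the table above:

```python
import pandas as pd

# The treatment and confounder columns from the blood-pressure table above.
bp = pd.DataFrame({
    'treated': [0]*9 + [1]*11,
    'is_old':  [0]*7 + [1]*2 + [0]*4 + [1]*7,
})

# P(treated = 1 | is_old) is the mean of `treated` within each age group;
# transform broadcasts that group mean back to one value per row.
bp['propensity_score'] = bp.groupby('is_old')['treated'].transform('mean')
```

This avoids transcribing the fractions by hand, and generalizes to confounders with more than two levels.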
Reweighting#
If we try to naively compute a causal effect using the data above, we’ll be led astray:
treatment_mean_confounded = bp.loc[bp['treated'] == 1, 'bp_change'].mean()
control_mean_confounded = bp.loc[bp['treated'] == 0, 'bp_change'].mean()
treatment_mean_confounded - control_mean_confounded
np.float64(-0.6797979797979803)
The effect appears to be very small: a drop of only 0.68 mm Hg. This is because of the confounding effect: younger people are overrepresented in the control group, and underrepresented in the treatment group (and vice versa for older people).
What if we try to adjust for the overrepresentation/underrepresentation by re-weighting each data point? Let’s start by looking at the treatment group. For this group, we should increase the weight on the younger people (because there are too few of them). There are many ways we could do this: one is to divide by the propensity score. In particular, the approach we’re going to take will give us an unbiased estimate of the ATE, even in the presence of confounding!
bp['outcome_reweighted'] = bp['bp_change'] / bp['propensity_score']
bp.loc[bp['treated'] == 0, 'outcome_reweighted'] = pd.NA # only look at treatment group for now
bp
|  | treated | is_old | bp_change | propensity_score | outcome_reweighted |
|---|---|---|---|---|---|
| 0 | 0 | 0 | -0.2 | 0.363636 | NaN |
| 1 | 0 | 0 | -3.4 | 0.363636 | NaN |
| 2 | 0 | 0 | -4.0 | 0.363636 | NaN |
| 3 | 0 | 0 | -5.6 | 0.363636 | NaN |
| 4 | 0 | 0 | -2.9 | 0.363636 | NaN |
| 5 | 0 | 0 | -3.8 | 0.363636 | NaN |
| 6 | 0 | 0 | -4.7 | 0.363636 | NaN |
| 7 | 0 | 1 | -0.7 | 0.777778 | NaN |
| 8 | 0 | 1 | -0.9 | 0.777778 | NaN |
| 9 | 1 | 0 | -5.2 | 0.363636 | -14.300000 |
| 10 | 1 | 0 | -4.8 | 0.363636 | -13.200000 |
| 11 | 1 | 0 | -5.9 | 0.363636 | -16.225000 |
| 12 | 1 | 0 | -6.1 | 0.363636 | -16.775000 |
| 13 | 1 | 1 | -1.1 | 0.777778 | -1.414286 |
| 14 | 1 | 1 | -1.5 | 0.777778 | -1.928571 |
| 15 | 1 | 1 | -2.1 | 0.777778 | -2.700000 |
| 16 | 1 | 1 | -0.8 | 0.777778 | -1.028571 |
| 17 | 1 | 1 | -1.7 | 0.777778 | -2.185714 |
| 18 | 1 | 1 | -4.9 | 0.777778 | -6.300000 |
| 19 | 1 | 1 | -5.4 | 0.777778 | -6.942857 |
This gives a significantly increased weight to the improvements from the younger group (and only a very slightly increased weight to the improvements from the older group).
What about the control group? There, we want to increase the weight on the older people. Dividing by the propensity score would have the opposite effect from what we want: instead, we’ll divide by 1 minus the propensity score.
control = bp['treated'] == 0
bp.loc[control, 'outcome_reweighted'] = (
bp.loc[control, 'bp_change'] / (1 - bp.loc[control, 'propensity_score'])
)
bp
|  | treated | is_old | bp_change | propensity_score | outcome_reweighted |
|---|---|---|---|---|---|
| 0 | 0 | 0 | -0.2 | 0.363636 | -0.314286 |
| 1 | 0 | 0 | -3.4 | 0.363636 | -5.342857 |
| 2 | 0 | 0 | -4.0 | 0.363636 | -6.285714 |
| 3 | 0 | 0 | -5.6 | 0.363636 | -8.800000 |
| 4 | 0 | 0 | -2.9 | 0.363636 | -4.557143 |
| 5 | 0 | 0 | -3.8 | 0.363636 | -5.971429 |
| 6 | 0 | 0 | -4.7 | 0.363636 | -7.385714 |
| 7 | 0 | 1 | -0.7 | 0.777778 | -3.150000 |
| 8 | 0 | 1 | -0.9 | 0.777778 | -4.050000 |
| 9 | 1 | 0 | -5.2 | 0.363636 | -14.300000 |
| 10 | 1 | 0 | -4.8 | 0.363636 | -13.200000 |
| 11 | 1 | 0 | -5.9 | 0.363636 | -16.225000 |
| 12 | 1 | 0 | -6.1 | 0.363636 | -16.775000 |
| 13 | 1 | 1 | -1.1 | 0.777778 | -1.414286 |
| 14 | 1 | 1 | -1.5 | 0.777778 | -1.928571 |
| 15 | 1 | 1 | -2.1 | 0.777778 | -2.700000 |
| 16 | 1 | 1 | -0.8 | 0.777778 | -1.028571 |
| 17 | 1 | 1 | -1.7 | 0.777778 | -2.185714 |
| 18 | 1 | 1 | -4.9 | 0.777778 | -6.300000 |
| 19 | 1 | 1 | -5.4 | 0.777778 | -6.942857 |
Now, we can compare the difference in reweighted means:
treatment_total = bp.loc[bp['treated'] == 1, 'outcome_reweighted'].sum()
control_total = bp.loc[bp['treated'] == 0, 'outcome_reweighted'].sum()
(treatment_total - control_total)/20
np.float64(-1.857142857142857)
We can see that this does a much better job of capturing the real effect!
Inverse Propensity Score Weighting (IPW)#
Using the logic above, and our ad hoc processing of the table, we’re now ready to define the IPW (inverse propensity weighted) estimator for the ATE:

$$\hat{\tau}_{IPW} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{Z_i Y_i}{e(X_i)} - \frac{(1 - Z_i) Y_i}{1 - e(X_i)} \right]$$

where \(e(X_i) = P(Z_i = 1 \mid X_i)\) is the propensity score of unit \(i\).
This is a particularly useful estimator because if the unconfoundedness assumption (also known as the conditional independence assumption or CIA) is true, then this will be an unbiased estimate of the true ATE: \(E[\hat{\tau}_{IPW}] = \tau\).
Intuitively, why does it make sense to divide by the propensity score (for the treatment group) and by one minus the propensity score (for the control group)? We’re trying to find units that were unlikely to end up in the group that they did, and give those units higher weight. Recall that the propensity score and its complement are conditional probabilities:

$$e(x) = P(Z = 1 \mid X = x), \qquad 1 - e(x) = P(Z = 0 \mid X = x)$$
For the treatment group, we want to find units that, because of confounding, were unlikely to end up in the treatment group (in the example above, this corresponds to finding younger people). Those units will have a very small propensity score, so we can divide by it to increase the weight on those points.
For the control group, we want to find units that, because of confounding, were unlikely to end up in the control group (in the example above, this corresponds to finding older people). Those units will have a very large propensity score (close to 1), so we can divide by 1 minus the propensity score.
The same process will give less weight to units that are overrepresented in their group.
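Putting the pieces together, here is a minimal sketch of the estimator as a reusable function (the name is illustrative; it assumes a DataFrame with a 0/1 treatment column, an outcome column, and a precomputed propensity-score column):

```python
import pandas as pd

def ipw_ate(df, treat_col, outcome_col, propensity_col):
    """IPW estimate of the ATE: the average over all units of
    Z*Y/e(X) - (1-Z)*Y/(1-e(X))."""
    z = df[treat_col]
    y = df[outcome_col]
    e = df[propensity_col]
    # Treated units contribute y/e; control units contribute -y/(1-e).
    return (z * y / e - (1 - z) * y / (1 - e)).mean()
```

Applied to the blood-pressure table above (with its `propensity_score` column), this reproduces the value of about -1.86 mm Hg that we computed by hand.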
We can prove that the IPW estimator above gives an unbiased estimate of the true ATE.
Proof coming soon: see Causal Inference: The Mixtape for a proof using slightly different notation
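Until then, here is a sketch of the key step for the treatment term (the control term is symmetric). Since \(Z Y = Z \, Y(1)\) (the term is nonzero only when \(Z = 1\)), the tower property gives:

$$E\left[\frac{Z Y}{e(X)}\right] = E\left[\frac{E[Z \mid X, Y(1)] \, Y(1)}{e(X)}\right] = E\left[\frac{e(X) \, Y(1)}{e(X)}\right] = E[Y(1)]$$

where the middle equality uses unconfoundedness: given \(X\), the treatment \(Z\) is independent of \(Y(1)\), so \(E[Z \mid X, Y(1)] = E[Z \mid X] = e(X)\).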
Computing propensity scores using logistic regression#
In this example, our confounder was binary, which made it very easy to compute the propensity scores using simple arithmetic. But in many applications, our confounders could be continuous, or we could have multiple confounders. In such cases, computing propensity scores is not as trivial.
In these cases, we typically use logistic regression to predict them. Note that although this is the same logistic regression you’re familiar with, the purpose of using it is very different: here, we aren’t trying to make predictions or achieve high prediction accuracy. Instead, we’re just using it to answer the question: for each row, what is the probability (based on the confounders) that the row ends up in the treatment group?
This means that we don’t have to worry about splitting our data into testing and training sets: instead, we fit the model on our entire dataset, and use the predicted probabilities as propensity scores.
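As a sketch of that workflow, here is a minimal gradient-descent logistic-regression fit using only numpy (the function name is illustrative, and in practice you would reach for a library implementation such as scikit-learn’s LogisticRegression):

```python
import numpy as np

def fit_propensity_scores(X, z, lr=0.5, n_iter=5000):
    """Fit P(Z = 1 | X) by logistic regression on the *entire* dataset
    (no train/test split) and return one propensity score per row."""
    X1 = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    w = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X1 @ w))            # current predicted probabilities
        w -= lr * X1.T @ (p - z) / len(z)        # gradient step on the log-loss
    return 1 / (1 + np.exp(-X1 @ w))
```

With a single binary confounder, this recovers (approximately) the same group-frequency propensity scores we computed by hand in the blood-pressure example; its real value is that it also handles continuous or multiple confounders with no change to the code.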