Causal Inference in Observational Studies: Unconfoundedness#
So far, we’ve looked at ways of measuring association in any dataset. We’ve also looked at how we can draw conclusions about causality in randomized controlled trials. But in many cases, randomized controlled trials are not viable. Some reasons for that might be:
Ethical considerations: suppose we believe that a certain treatment might harm people. In order to test this belief in a randomized controlled trial, we would have to subject people to a treatment that we believe would harm them, which is profoundly unethical. In such cases, we need alternative ways of drawing causal conclusions: we have an obligation as data scientists and researchers to avoid making the mistakes of the past.
Cost: Randomized controlled trials can be expensive to run: we must find randomly chosen individuals that are representative of the population; apply a treatment to half of them; and measure the outcomes, potentially over a long range of time. In some cases, this may be infeasible.
Sample size: due to cost considerations, randomized controlled trials may not have enough subjects to draw strong statistical conclusions.
There is a wealth of data available from observational studies. In this section, we’ll look at ways we can use this data to draw causal conclusions. As we saw when looking at association, the most common problem when trying to draw conclusions about causality is the presence of confounding variables. We’ll focus our attention on methods that try to account for these confounding variables.
Why it’s hard#
When dealing with data from observational studies, we can’t just take the simple difference in means anymore. To see why not, let’s return to our restaurants example. We’ll do the same computations we did before, but this time from the potential outcomes perspective.
Consider the science table version of the data (where we’ve converted 👍 to 1 and 👎 to 0, so that averages correspond to percentages liked):
import pandas as pd
food = pd.read_csv('restaurants_counterfactuals.csv')
food.head()
|  | Dish rating for A | Dish rating for B | Year |
|---|---|---|---|
| 0 | 0.0 | NaN | 2020 |
| 1 | NaN | 1.0 | 2019 |
| 2 | 0.0 | NaN | 2020 |
| 3 | 1.0 | NaN | 2020 |
| 4 | 1.0 | NaN | 2020 |
The `NaN` values indicate missing counterfactual outcomes that we don’t get to observe: for example, dish 0 was ordered from Restaurant A, so we don’t know what rating the critics would have given a dish from Restaurant B on that day.
Displaying the data this way requires us to think about it slightly differently, and highlights the implications of the potential outcomes framework! In particular:
\(Z_i\) corresponds to whether the critic ordered dish \(i\) from restaurant A or B.
\(Y_i(B)\) and \(Y_i(A)\) represent what-ifs: if the critics had ordered dish \(i\) from a particular restaurant, would they have liked it?
The year is our confounder (or “confounding variable”): we’ll use \(X_i\) to denote the year that dish \(i\) was ordered in.
Representing the data this way makes it clear that we can’t compute the ATE:
food_ATE = (food["Dish rating for B"] - food["Dish rating for A"]).mean()
food_ATE
nan
In trying to understand a causal relationship between choice of restaurant and liking the food, naively using the simple difference in observed means (SDO) can mislead us, as we already saw:
food_sdo = food["Dish rating for B"].mean() - food["Dish rating for A"].mean()
food_sdo
np.float64(0.20000000000000007)
Confounding variables and the superpopulation model#
In the case of a randomized controlled trial, we saw that we could use the difference in means estimator because of the independence between the treatment assignment \(Z_i\) and the potential outcomes \(\big(Y_i(0), Y_i(1)\big)\). But, in the case of an observational study, this independence assumption no longer holds. We’ve seen a few examples of this in the last few sections, where confounding variables affect both the treatment and the outcome.
In this section, we’ll focus on a few methods that answer the question: if we also observe the confounding variable, can we determine whether the treatment has an effect on the outcome?
Our notation will be similar to what we’ve used before, and we’ll add a new variable \(X_i\) for each unit that represents the confounding information. We have tuples \((Z_i, Y_i(0), Y_i(1), X_i)\) for each unit that are i.i.d. (that is, the tuple for one unit is independent from the tuple for the next unit). This is called the superpopulation model.
Conditional average treatment effect and unconfoundedness#
We’ll define a new quantity called the conditional average treatment effect (CATE). This is like the ATE we saw before, but conditioned on a particular value of \(X\):

$$\tau(x) = E[Y(1) - Y(0) \mid X = x]$$
Introducing the conditioning alone doesn’t help us: if we were to expand this using the tower property to condition on \(Z\), we’d still have counterfactual terms. So why did we do it? It turns out that we only need to make one assumption for this to help us. This assumption is called unconfoundedness:

$$\big(Y(0), Y(1)\big) \perp\!\!\!\perp Z \mid X$$
This is a statement of conditional independence. The potential outcomes \(Y(0)\) and \(Y(1)\) might not be independent of the treatment decision \(Z\), but this assumption states that they are conditionally independent if we know the value of a confounding variable \(X\).
In the restaurant example above, unconfoundedness would mean that within each year, the decision whether to order from A or B (treatment) is independent of the ratings that dishes from A and B would get (potential outcomes). It does not state that the restaurant choice is conditionally independent of the observed rating: this is a subtle but important distinction.
If we assume unconfoundedness, then we can safely estimate the CATE using the difference in conditional means:

$$\tau(x) = E[Y \mid X = x, Z = 1] - E[Y \mid X = x, Z = 0]$$
From CATE to ATE#
In most cases, what we’re really interested in is the ATE. How do we get from the CATE to the ATE? As we’ve done so many times before, we can use the law of iterated expectation (tower property):

$$\tau = E[Y(1) - Y(0)] = E\big[E[Y(1) - Y(0) \mid X]\big] = E[\tau(X)]$$
In other words,

$$\tau = \sum_x \tau(x) \, P(X = x)$$

(where we would use an integral instead of a sum if \(X\) is continuous).
What does this look like in practice? Let’s go back to the restaurant example. Instead of computing the difference in means for the entire dataset, we’ll do it separately, once for 2020 and once for 2019:
food_2020 = food[food['Year'] == 2020]
food_2019 = food[food['Year'] == 2019]
food_sdo_2020 = food_2020["Dish rating for B"].mean() - food_2020["Dish rating for A"].mean()
food_sdo_2019 = food_2019["Dish rating for B"].mean() - food_2019["Dish rating for A"].mean()
print(food_sdo_2020, food_sdo_2019)
-0.05555555555555558 -0.125
These correspond to the CATEs \(\tau(2020)\) and \(\tau(2019)\) respectively. Just like earlier, we see that these values have the opposite sign from the SDO computed on the entire data: Restaurant A was better-liked in both 2020 and 2019. But, because the critics ordered from Restaurant A more in 2020 (a year with lower ratings overall), Restaurant A looks worse in the aggregate data.
How do we combine these to estimate the ATE? We need the marginal probabilities \(P(2020)\) and \(P(2019)\):
p_2020 = (food['Year'] == 2020).mean()
p_2019 = 1 - p_2020
est_food_ATE = food_sdo_2020 * p_2020 + food_sdo_2019 * p_2019
est_food_ATE
np.float64(-0.07870370370370372)
This is our first causal inference from observational data! To state it formally:
If we assume that the year is the only confounder for the effect of restaurant choice on reviews, then choosing Restaurant A causes the critics to like dishes about 8 percentage points more often.
When we assume that the year is the only confounder, we’re assuming *unconfoundedness*: that the pair of potential outcomes (i.e., the what-ifs for their reviews) is independent of which restaurant they actually chose, given the year.
Note that we can’t just make a blanket statement about causality: this statement depends heavily on our use of the unconfoundedness assumption. This is true in general: in order to draw causal inferences outside of randomized controlled experiments, we need to make assumptions. When reporting your causal conclusions, it’s always important to communicate the assumptions you made!
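The year-by-year calculation above generalizes to any discrete confounder. Here is a minimal sketch of a stratified estimator; the function name and the toy DataFrame are hypothetical, but the column layout mirrors the science table above:

```python
import pandas as pd

def stratified_ate(df, col_treated, col_control, confounder):
    """Estimate the ATE as a weighted average of per-stratum CATEs,
    where each stratum's weight is its share of the data."""
    estimate = 0.0
    for _, stratum in df.groupby(confounder):
        # Difference in conditional means within this stratum (the CATE).
        cate = stratum[col_treated].mean() - stratum[col_control].mean()
        estimate += cate * len(stratum) / len(df)
    return estimate

# Hypothetical science table: each row observes only one potential outcome.
toy = pd.DataFrame({
    'Dish rating for A': [1.0, 1.0, None, 0.0, None],
    'Dish rating for B': [None, None, 0.0, None, 1.0],
    'Year': [2019, 2019, 2019, 2020, 2020],
})
stratified_ate(toy, 'Dish rating for B', 'Dish rating for A', 'Year')
```

The `.mean()` calls skip `NaN` values, so within each stratum we are comparing only the observed outcomes for each group, exactly as in the year-by-year computation above.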
Outcome Regression#
Coming soon
Inverse propensity weighting#
In inverse propensity weighting, we’re intuitively trying to undo or compensate for the effect of the confounder by reweighting the outcome variables based on the treatment and confounders. To illustrate the effect, we’ll use a synthetic dataset from a treatment for high blood pressure for twenty people. The data contain three columns:
`treated`, which says whether or not each person received the treatment
`is_old`, whether the patient is older than 45 (1) or not (0)
`bp_change`, how much the person’s (systolic) blood pressure dropped (in mm Hg)
bp = pd.read_csv('bp_treatment.csv')
bp
|  | treated | is_old | bp_change |
|---|---|---|---|
| 0 | 0 | 0 | -0.2 |
| 1 | 0 | 0 | -3.4 |
| 2 | 0 | 0 | -4.0 |
| 3 | 0 | 0 | -5.6 |
| 4 | 0 | 0 | -2.9 |
| 5 | 0 | 0 | -3.8 |
| 6 | 0 | 0 | -4.7 |
| 7 | 0 | 1 | -0.7 |
| 8 | 0 | 1 | -0.9 |
| 9 | 1 | 0 | -5.2 |
| 10 | 1 | 0 | -4.8 |
| 11 | 1 | 0 | -5.9 |
| 12 | 1 | 0 | -6.1 |
| 13 | 1 | 1 | -1.1 |
| 14 | 1 | 1 | -1.5 |
| 15 | 1 | 1 | -2.1 |
| 16 | 1 | 1 | -0.8 |
| 17 | 1 | 1 | -1.7 |
| 18 | 1 | 1 | -4.9 |
| 19 | 1 | 1 | -5.4 |
bp.groupby(['is_old', 'treated']).mean()
| is_old | treated | bp_change |
|---|---|---|
| 0 | 0 | -3.514286 |
| 0 | 1 | -5.500000 |
| 1 | 0 | -0.800000 |
| 1 | 1 | -2.500000 |
From a quick look at the data, we can see a few facts:
Most of the people who received the treatment were older
Almost all the younger people’s BP improved substantially
Most of the older people’s BP improved only slightly
We have very few data points: only twenty people overall, and, e.g., only two older people who took the treatment.
We’ll ignore the last point for now, for the sake of illustration. The first three points make it very clear that age is a confounding variable for the effect of the treatment. One way to quantify the degree of confounding is using the conditional probability of receiving the treatment given the confounder: this is called the propensity score.
bp.groupby(['is_old', 'treated']).count()
| is_old | treated | bp_change |
|---|---|---|
| 0 | 0 | 7 |
| 0 | 1 | 4 |
| 1 | 0 | 2 |
| 1 | 1 | 7 |
For the older group, the probability of receiving the treatment is \(7/9 \approx 0.78\). For the younger group, the probability is \(4/11 \approx 0.36\). We can add these into the table:
bp['propensity_score'] = bp['is_old'] * 7/9 + (1-bp['is_old']) * 4/11
bp
|  | treated | is_old | bp_change | propensity_score |
|---|---|---|---|---|
| 0 | 0 | 0 | -0.2 | 0.363636 |
| 1 | 0 | 0 | -3.4 | 0.363636 |
| 2 | 0 | 0 | -4.0 | 0.363636 |
| 3 | 0 | 0 | -5.6 | 0.363636 |
| 4 | 0 | 0 | -2.9 | 0.363636 |
| 5 | 0 | 0 | -3.8 | 0.363636 |
| 6 | 0 | 0 | -4.7 | 0.363636 |
| 7 | 0 | 1 | -0.7 | 0.777778 |
| 8 | 0 | 1 | -0.9 | 0.777778 |
| 9 | 1 | 0 | -5.2 | 0.363636 |
| 10 | 1 | 0 | -4.8 | 0.363636 |
| 11 | 1 | 0 | -5.9 | 0.363636 |
| 12 | 1 | 0 | -6.1 | 0.363636 |
| 13 | 1 | 1 | -1.1 | 0.777778 |
| 14 | 1 | 1 | -1.5 | 0.777778 |
| 15 | 1 | 1 | -2.1 | 0.777778 |
| 16 | 1 | 1 | -0.8 | 0.777778 |
| 17 | 1 | 1 | -1.7 | 0.777778 |
| 18 | 1 | 1 | -4.9 | 0.777778 |
| 19 | 1 | 1 | -5.4 | 0.777778 |
This tells us that person 0 had a \(0.36\) probability of receiving the treatment, given that they were younger, and so on.
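The hand-coded fractions above can also be computed directly from the data: within each age group, the propensity score is just the fraction of people who were treated. A sketch, rebuilding the `treated` and `is_old` columns from the table above:

```python
import pandas as pd

# The treatment and confounder columns from the blood-pressure table above.
bp = pd.DataFrame({
    'treated': [0]*9 + [1]*11,
    'is_old':  [0]*7 + [1]*2 + [0]*4 + [1]*7,
})

# P(treated = 1 | is_old) is the mean of `treated` within each age group;
# transform broadcasts that group mean back to one value per row.
bp['propensity_score'] = bp.groupby('is_old')['treated'].transform('mean')
```

This avoids transcribing the fractions by hand, and generalizes to confounders with more than two levels.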
Reweighting#
If we try to naively compute a causal effect using the data above, we’ll be led astray:
treatment_mean_confounded = bp.loc[bp['treated'] == 1, 'bp_change'].mean()
control_mean_confounded = bp.loc[bp['treated'] == 0, 'bp_change'].mean()
treatment_mean_confounded - control_mean_confounded
np.float64(-0.6797979797979803)
The effect appears to be very small: a drop of only 0.68 mm Hg. This is because of the confounding effect: younger people are overrepresented in the control group, and underrepresented in the treatment group (and vice versa for older people).
What if we try to adjust for the overrepresentation/underrepresentation by re-weighting each data point? Let’s start by looking at the treatment group. For this group, we should increase the weight on the younger people (because there are too few of them). There are many ways we could do this: one is to divide by the propensity score. In particular, the approach we’re going to take will give us an unbiased estimate of the ATE, even in the presence of confounding!
bp['outcome_reweighted'] = bp['bp_change'] / bp['propensity_score']
bp.loc[bp['treated'] == 0, 'outcome_reweighted'] = pd.NA # only look at treatment group for now
bp
|  | treated | is_old | bp_change | propensity_score | outcome_reweighted |
|---|---|---|---|---|---|
| 0 | 0 | 0 | -0.2 | 0.363636 | NaN |
| 1 | 0 | 0 | -3.4 | 0.363636 | NaN |
| 2 | 0 | 0 | -4.0 | 0.363636 | NaN |
| 3 | 0 | 0 | -5.6 | 0.363636 | NaN |
| 4 | 0 | 0 | -2.9 | 0.363636 | NaN |
| 5 | 0 | 0 | -3.8 | 0.363636 | NaN |
| 6 | 0 | 0 | -4.7 | 0.363636 | NaN |
| 7 | 0 | 1 | -0.7 | 0.777778 | NaN |
| 8 | 0 | 1 | -0.9 | 0.777778 | NaN |
| 9 | 1 | 0 | -5.2 | 0.363636 | -14.300000 |
| 10 | 1 | 0 | -4.8 | 0.363636 | -13.200000 |
| 11 | 1 | 0 | -5.9 | 0.363636 | -16.225000 |
| 12 | 1 | 0 | -6.1 | 0.363636 | -16.775000 |
| 13 | 1 | 1 | -1.1 | 0.777778 | -1.414286 |
| 14 | 1 | 1 | -1.5 | 0.777778 | -1.928571 |
| 15 | 1 | 1 | -2.1 | 0.777778 | -2.700000 |
| 16 | 1 | 1 | -0.8 | 0.777778 | -1.028571 |
| 17 | 1 | 1 | -1.7 | 0.777778 | -2.185714 |
| 18 | 1 | 1 | -4.9 | 0.777778 | -6.300000 |
| 19 | 1 | 1 | -5.4 | 0.777778 | -6.942857 |
This gives a significantly increased weight to the improvements from the younger group (and only a very slightly increased weight to the improvements from the older group).
What about the control group? There, we want to increase the weight on the older people. Dividing by the propensity score would have the opposite effect from what we want: instead, we’ll divide by 1 minus the propensity score.
control = bp['treated'] == 0
bp.loc[control, 'outcome_reweighted'] = (
bp.loc[control, 'bp_change'] / (1 - bp.loc[control, 'propensity_score'])
)
bp
|  | treated | is_old | bp_change | propensity_score | outcome_reweighted |
|---|---|---|---|---|---|
| 0 | 0 | 0 | -0.2 | 0.363636 | -0.314286 |
| 1 | 0 | 0 | -3.4 | 0.363636 | -5.342857 |
| 2 | 0 | 0 | -4.0 | 0.363636 | -6.285714 |
| 3 | 0 | 0 | -5.6 | 0.363636 | -8.800000 |
| 4 | 0 | 0 | -2.9 | 0.363636 | -4.557143 |
| 5 | 0 | 0 | -3.8 | 0.363636 | -5.971429 |
| 6 | 0 | 0 | -4.7 | 0.363636 | -7.385714 |
| 7 | 0 | 1 | -0.7 | 0.777778 | -3.150000 |
| 8 | 0 | 1 | -0.9 | 0.777778 | -4.050000 |
| 9 | 1 | 0 | -5.2 | 0.363636 | -14.300000 |
| 10 | 1 | 0 | -4.8 | 0.363636 | -13.200000 |
| 11 | 1 | 0 | -5.9 | 0.363636 | -16.225000 |
| 12 | 1 | 0 | -6.1 | 0.363636 | -16.775000 |
| 13 | 1 | 1 | -1.1 | 0.777778 | -1.414286 |
| 14 | 1 | 1 | -1.5 | 0.777778 | -1.928571 |
| 15 | 1 | 1 | -2.1 | 0.777778 | -2.700000 |
| 16 | 1 | 1 | -0.8 | 0.777778 | -1.028571 |
| 17 | 1 | 1 | -1.7 | 0.777778 | -2.185714 |
| 18 | 1 | 1 | -4.9 | 0.777778 | -6.300000 |
| 19 | 1 | 1 | -5.4 | 0.777778 | -6.942857 |
Now, we can compare the difference in reweighted means:
treatment_total = bp.loc[bp['treated'] == 1, 'outcome_reweighted'].sum()
control_total = bp.loc[bp['treated'] == 0, 'outcome_reweighted'].sum()
(treatment_total - control_total)/20
np.float64(-1.857142857142857)
We can see that this does a much better job of capturing the real effect!
Inverse Propensity Score Weighting (IPW)#
Using the logic above, and our ad hoc processing of the table, we’re now ready to define the IPW (inverse propensity weighted) estimator for the ATE:

$$\hat{\tau}_{IPW} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{Z_i Y_i}{e(X_i)} - \frac{(1 - Z_i) Y_i}{1 - e(X_i)} \right]$$

where \(e(X_i) = P(Z_i = 1 \mid X_i)\) is the propensity score of unit \(i\).
This is a particularly useful estimator because if the unconfoundedness assumption (also known as the conditional independence assumption or CIA) is true, then this will be an unbiased estimate of the true ATE: \(E[\hat{\tau}_{IPW}] = \tau\).
Intuitively, why does it make sense to divide by the propensity score (for the treatment group) and by one minus the propensity score (for the control group)? We’re trying to find units that were unlikely to end up in the group that they did, and give those units higher weight. Recall that the propensity score and its complement are conditional probabilities:

$$e(x) = P(Z = 1 \mid X = x), \qquad 1 - e(x) = P(Z = 0 \mid X = x)$$
For the treatment group, we want to find units that, because of confounding, were unlikely to end up in the treatment group (in the example above, this corresponds to finding younger people). Those units will have a very small propensity score, so we can divide by it to increase the weight on those points.
For the control group, we want to find units that, because of confounding, were unlikely to end up in the control group (in the example above, this corresponds to finding older people). Those units will have a very large propensity score (close to 1), so we can divide by 1 minus the propensity score.
The same process will give less weight to units that are overrepresented in their group.
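Putting the pieces together, here is a minimal sketch of the estimator as a reusable function (the name is illustrative; it assumes a DataFrame with a 0/1 treatment column, an outcome column, and a precomputed propensity-score column):

```python
import pandas as pd

def ipw_ate(df, treat_col, outcome_col, propensity_col):
    """IPW estimate of the ATE: the average over all units of
    Z*Y/e(X) - (1-Z)*Y/(1-e(X))."""
    z = df[treat_col]
    y = df[outcome_col]
    e = df[propensity_col]
    # Treated units contribute y/e; control units contribute -y/(1-e).
    return (z * y / e - (1 - z) * y / (1 - e)).mean()
```

Applied to the blood-pressure table above (with its `propensity_score` column), this reproduces the value of about -1.86 mm Hg that we computed by hand.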
We can prove that the IPW estimator above gives an unbiased estimate of the true ATE.
Proof coming soon: see Causal Inference: The Mixtape for a proof using slightly different notation
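Until then, here is a sketch of the key step for the treatment term (the control term is symmetric). Since \(Z Y = Z \, Y(1)\) (the term is nonzero only when \(Z = 1\)), the tower property gives:

$$E\left[\frac{Z Y}{e(X)}\right] = E\left[\frac{E[Z \mid X, Y(1)] \, Y(1)}{e(X)}\right] = E\left[\frac{e(X) \, Y(1)}{e(X)}\right] = E[Y(1)]$$

where the middle equality uses unconfoundedness: given \(X\), the treatment \(Z\) is independent of \(Y(1)\), so \(E[Z \mid X, Y(1)] = E[Z \mid X] = e(X)\).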
Computing propensity scores using logistic regression#
In this example, our confounder was binary, which made it very easy to compute the propensity scores using simple arithmetic. But in many applications, our confounders could be continuous, or we could have multiple confounders. In such cases, computing propensity scores is not as trivial.
In these cases, we typically use logistic regression to predict them. Note that although this is the same logistic regression you’re familiar with, the purpose of using it is very different: here, we aren’t trying to make predictions or achieve high prediction accuracy. Instead, we’re just using it to answer the question: for each row, what is the probability (based on the confounders) that the row ends up in the treatment group?
This means that we don’t have to worry about splitting our data into testing and training sets: instead, we fit the model on our entire dataset, and use the predicted probabilities as propensity scores.
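As a sketch of that workflow, here is a minimal gradient-descent logistic-regression fit using only numpy (the function name is illustrative, and in practice you would reach for a library implementation such as scikit-learn’s LogisticRegression):

```python
import numpy as np

def fit_propensity_scores(X, z, lr=0.5, n_iter=5000):
    """Fit P(Z = 1 | X) by logistic regression on the *entire* dataset
    (no train/test split) and return one propensity score per row."""
    X1 = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    w = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X1 @ w))            # current predicted probabilities
        w -= lr * X1.T @ (p - z) / len(z)        # gradient step on the log-loss
    return 1 / (1 + np.exp(-X1 @ w))
```

With a single binary confounder, this recovers (approximately) the same group-frequency propensity scores we computed by hand in the blood-pressure example; its real value is that it also handles continuous or multiple confounders with no change to the code.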