Causal Inference in Randomized Experiments#
In this section, we’ll build a deeper understanding of randomized experiments. Recall from the last section the table of potential outcomes:
Unit | Outcome if not treated | Outcome if treated | Treated or not? |
---|---|---|---|
1 | ? | \(Y_1(1)\) | \(Z_1=1\) |
2 | \(Y_2(0)\) | ? | \(Z_2=0\) |
3 | \(Y_3(0)\) | ? | \(Z_3=0\) |
4 | ? | \(Y_4(1)\) | \(Z_4=1\) |
5 | \(Y_5(0)\) | ? | \(Z_5=0\) |
Randomized experiments and potential outcomes#
In a randomized experiment, we protect ourselves from confounding variables by randomizing units into treatment and control. In other words, we choose each \(Z_i\) randomly, and we choose it independently of whatever \(Y_i(0)\) and \(Y_i(1)\) might be. Mathematically, we can write:
\[ \big(Y_i(0), Y_i(1)\big) \perp\!\!\!\perp Z_i \]
Remember, this doesn’t mean that the treatment is independent of the observed outcome! It only means that the treatment is independent of the pair of potential outcomes. The observed outcome \(Y_{i,obs} = Y_i(0)(1-Z_i) + Y_i(1) Z_i\) always depends on the treatment decision. In other words, knowing the treatment decision \(Z_i\) always tells us which of the two potential outcomes (\(Y_i(0)\) or \(Y_i(1)\)) we observed, and so it gives us information about the observed outcome (except in the uninteresting scenario where the treatment is completely unrelated to the outcome), but it gives us no information about the pair of potential outcomes itself.
For example, consider a double-blind vaccine trial. We can consider the potential outcomes for a particular subject: they represent what happens to that subject if they get the vaccine (\(Y_i(1)\)) or if they don’t get the vaccine (\(Y_i(0)\)). This is the pair of potential outcomes, \(\big(Y_i(0), Y_i(1)\big)\). Next, consider the treatment decision \(Z_i\): this represents whether the subject got the vaccine (\(Z_i = 1\)) or a placebo (\(Z_i = 0\)). These are independent: knowing whether or not a subject got the vaccine/placebo gives us no information about the pair of potential outcomes: it only gives us information about which one of the two we observe in the real world.
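To make the distinction concrete, here is a minimal simulation sketch. Everything in it is a made-up assumption for illustration: the potential outcomes are drawn from a normal distribution, every unit has a treatment effect of exactly 2, and treatment is assigned by a fair coin flip. Because we simulate both potential outcomes, we can check something that is impossible with real data: the never-fully-observed \(Y_i(0)\) looks the same in both groups, while the observed outcome does not.

```python
import random
from statistics import mean

random.seed(0)
n = 20_000

# Hypothetical potential outcomes (we only know both because this is a simulation).
y0 = [random.gauss(10, 3) for _ in range(n)]  # outcome if not treated
y1 = [a + 2 for a in y0]                      # outcome if treated (assumed effect of 2)

# Randomized treatment: a fair coin flip that ignores y0 and y1 entirely.
z = [random.randint(0, 1) for _ in range(n)]

# Observed outcome: Y(1) if treated, Y(0) otherwise.
y_obs = [b if t == 1 else a for a, b, t in zip(y0, y1, z)]

# Z carries no information about the potential outcome Y(0)...
mean_y0_treated = mean(a for a, t in zip(y0, z) if t == 1)
mean_y0_control = mean(a for a, t in zip(y0, z) if t == 0)

# ...but it does carry information about the observed outcome.
mean_obs_treated = mean(y for y, t in zip(y_obs, z) if t == 1)
mean_obs_control = mean(y for y, t in zip(y_obs, z) if t == 0)

print("Y(0) by group: ", mean_y0_treated, mean_y0_control)
print("Y_obs by group:", mean_obs_treated, mean_obs_control)
```

In this sketch the two \(Y(0)\) means should be close to each other, while the two observed means should differ by roughly the assumed effect.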
Computing the average treatment effect#
You may remember learning that in a randomized controlled trial, we can estimate a causal effect by using the difference in means between the treatment and control groups. Let’s now show mathematically that this is true. Recall from the previous section our definition of the ATE \(\tau\):
\[ \tau = E[Y_i(1) - Y_i(0)] = E[Y_i(1)] - E[Y_i(0)] \]
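If \(Z_i\) and the potential outcomes \(Y_i(\cdot)\) are independent, then \(E[Y_i(\cdot)] = E[Y_i(\cdot)|Z_i]\). In other words, conditioning on \(Z_i\) doesn’t change the expectation. Applying this to each term of the ATE, and using the fact that we observe \(Y_i(1)\) when \(Z_i = 1\) and \(Y_i(0)\) when \(Z_i = 0\):
\[ \tau = E[Y_i(1)] - E[Y_i(0)] = E[Y_i(1)|Z_i=1] - E[Y_i(0)|Z_i=0] = E[Y_{i,obs}|Z_i=1] - E[Y_{i,obs}|Z_i=0] \]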
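These two terms correspond to the mean outcomes in the treatment and control groups, respectively. If we have \(n\) observations \((Z_1, Y_{1,obs}), \ldots, (Z_n, Y_{n,obs})\), then our empirical estimate for the ATE is just:
\[ \hat{\tau} = \frac{1}{n_1}\sum_{i: Z_i=1} Y_{i,obs} - \frac{1}{n_0}\sum_{i: Z_i=0} Y_{i,obs} = \bar{Y}_{obs,1} - \bar{Y}_{obs,0} \]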
where \(n_1\) is the number of treated units and \(n_0\) is the number of untreated units, and \(\bar{Y}_{obs,1}\) and \(\bar{Y}_{obs,0}\) are the means of the treatment and control groups respectively. This quantity \(\hat{\tau}\), which you’ve most likely seen and used before (e.g., in Data 8), has many names. Here are a few of them:
The difference in means
The simple difference in mean outcomes / simple difference in observed means (SDO)
The Neyman estimator
The prima facie causal effect, \(\tau_{PF}\) (prima facie is Latin for “at first sight”).
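Whatever name we use, the computation is the same. The following is a small sketch of it in Python; the function name `difference_in_means` and the five data values are hypothetical, chosen only so that units 1 and 4 are treated, as in the table at the top of this section.

```python
from statistics import mean

def difference_in_means(z, y_obs):
    """Estimate the ATE as mean(treated outcomes) - mean(control outcomes)."""
    treated = [y for t, y in zip(z, y_obs) if t == 1]
    control = [y for t, y in zip(z, y_obs) if t == 0]
    if not treated or not control:
        raise ValueError("need at least one treated and one control unit")
    return mean(treated) - mean(control)

# Made-up data: units 1 and 4 are treated, units 2, 3, and 5 are not.
z = [1, 0, 0, 1, 0]
y_obs = [7.0, 4.0, 5.0, 9.0, 3.0]

tau_hat = difference_in_means(z, y_obs)
print(tau_hat)  # treated mean 8.0 minus control mean 4.0
```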
(Optional) Fixed-sample assumption: Fisher and Neyman#
In this section, we’ll analyze randomized experiments and the Neyman estimator under the fixed-sample assumption. Recall that in this setting, we assume that \(Z_i\) (which we observe) is random, but that \(Y_i(0)\) and \(Y_i(1)\) (which are unknown) are fixed. In this case, the statement of independence above doesn’t really make sense, since \(Y_i(0)\) and \(Y_i(1)\) are not random. However, we can still compute the Neyman estimator. We’ll develop some properties of the estimator, then use those to construct a confidence interval for the estimated ATE. Finally, we’ll look at two different hypothesis tests used in randomized experiments.
Properties of the Neyman estimator#
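The Neyman estimator has two useful properties: the first is that it’s unbiased, and the second is that its variance is bounded above by a quantity that depends on the variances of the two sets of potential outcomes:
\[ E[\hat{\tau}] = \tau, \qquad \text{var}(\hat{\tau}) \leq \frac{\hat{\sigma}_1^2}{n_1} + \frac{\hat{\sigma}_0^2}{n_0} \]
where \(\hat{\sigma}_k\) is the sample standard deviation of the potential treatment outcomes \(Y_1(k), \ldots, Y_n(k)\):
\[ \hat{\sigma}_k^2 = \frac{1}{n-1}\sum_{i=1}^n \big(Y_i(k) - \bar{Y}(k)\big)^2, \qquad \bar{Y}(k) = \frac{1}{n}\sum_{i=1}^n Y_i(k) \]
Since this depends on the counterfactual outcomes that we don’t get to observe, we’ll typically approximate it by replacing the true sample variances with the sample variances within each observed group in our data, and call this \(\hat{V}\):
\[ \hat{V} = \frac{s_1^2}{n_1} + \frac{s_0^2}{n_0} \]
where \(s_1^2\) and \(s_0^2\) are the sample variances of the observed outcomes in the treatment and control groups, respectively.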
It can be shown that under certain regularity conditions, as the number of samples grows larger, the distribution of the quantity \(\frac{\hat{\tau} - \tau}{\sqrt{\hat{V}}}\) converges to a normal distribution \(N(0, \sigma^2)\), where \(\sigma^2 \leq 1\). Because the limiting variance is at most 1, treating this quantity as standard normal is conservative.
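For a tiny fixed sample, we can check the unbiasedness and variance properties exactly rather than asymptotically. The sketch below assumes complete randomization (exactly \(n_1\) of the \(n\) units are treated, with every such assignment equally likely) and uses made-up potential outcomes for six units. It enumerates every possible assignment, so the mean and variance of \(\hat{\tau}\) are computed over the full randomization distribution.

```python
from itertools import combinations
from statistics import mean, pvariance, variance

# Hypothetical fixed potential outcomes for n = 6 units (never all observed in practice).
y0 = [3.0, 5.0, 4.0, 6.0, 2.0, 7.0]
y1 = [5.0, 5.0, 7.0, 6.0, 6.0, 8.0]
n, n1 = len(y0), 3
n0 = n - n1

tau = mean(y1) - mean(y0)  # true ATE for this fixed sample

# Compute the difference in means for every way of treating n1 of the n units.
estimates = []
for treated in combinations(range(n), n1):
    treated = set(treated)
    treated_mean = mean(y1[i] for i in treated)
    control_mean = mean(y0[i] for i in range(n) if i not in treated)
    estimates.append(treated_mean - control_mean)

# All assignments are equally likely, so these are the exact mean and variance.
expected_tau_hat = mean(estimates)
var_tau_hat = pvariance(estimates)

# Upper bound built from the (n - 1)-denominator variances of the potential outcomes.
bound = variance(y1) / n1 + variance(y0) / n0

print("true ATE:", tau, "E[tau_hat]:", expected_tau_hat)
print("var(tau_hat):", var_tau_hat, "bound:", bound)
```

Since the treatment effects in this made-up sample differ across units, the variance should come out strictly below the bound.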
Confidence intervals for the ATE#
Given the fact above, we can construct an asymptotically valid \(95\%\) confidence interval for \(\tau\) as follows:
\[ \left(\hat{\tau} - 1.96\sqrt{\hat{V}},\ \hat{\tau} + 1.96\sqrt{\hat{V}}\right) \]
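As a rough sketch of how this interval could be computed, the code below uses only the Python standard library. The function name `neyman_ci` and the eight data points are hypothetical. With a sample this small the normal approximation is not trustworthy, so the numbers only illustrate the arithmetic.

```python
from math import sqrt
from statistics import mean, variance

def neyman_ci(z, y_obs, z_crit=1.96):
    """Return (tau_hat, v_hat, (lower, upper)) for the difference in means."""
    treated = [y for t, y in zip(z, y_obs) if t == 1]
    control = [y for t, y in zip(z, y_obs) if t == 0]
    tau_hat = mean(treated) - mean(control)
    # V-hat: within-group sample variances divided by the group sizes.
    v_hat = variance(treated) / len(treated) + variance(control) / len(control)
    half_width = z_crit * sqrt(v_hat)
    return tau_hat, v_hat, (tau_hat - half_width, tau_hat + half_width)

# Made-up observed data: four treated units, four control units.
z = [1, 1, 1, 1, 0, 0, 0, 0]
y_obs = [8.0, 9.0, 7.0, 10.0, 5.0, 6.0, 4.0, 7.0]

tau_hat, v_hat, (lower, upper) = neyman_ci(z, y_obs)
print(tau_hat, v_hat, (lower, upper))
```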
Hypothesis testing for causal effects#
In the classic null hypothesis significance testing framework, there are two different null hypotheses commonly used when measuring causal effects in randomized trials:
Fisher’s strong null (also known as Fisher’s sharp null) states that for every unit, the treatment effect \(Y_i(1) - Y_i(0) = 0\).
Neyman’s weak null states that the average treatment effect is 0. In other words, even if some individual treatment effects are positive and some are negative, they average out to 0.
The first is a much stricter null hypothesis (hence the name “strong null”), since it states that all the treatment effects are 0. The second is looser, requiring only that the treatment effects average out to 0. Because of this, the strong null is often easier to reject.
To test Neyman’s weak null, we construct the test statistic \(\frac{\hat{\tau}}{\sqrt{\hat{V}}}\). Under the weak null hypothesis and for large samples, this approximately follows a standard normal distribution, so we can compare it against standard normal quantiles to obtain a \(p\)-value.
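A minimal sketch of this test follows, again using only the standard library. The function name `neyman_z_test` and the data are hypothetical, and the two-sided \(p\)-value relies on the large-sample normal approximation, which a sample of eight units does not really justify.

```python
from math import erfc, sqrt
from statistics import mean, variance

def neyman_z_test(z, y_obs):
    """Return (test statistic, two-sided p-value) for the weak null tau = 0."""
    treated = [y for t, y in zip(z, y_obs) if t == 1]
    control = [y for t, y in zip(z, y_obs) if t == 0]
    tau_hat = mean(treated) - mean(control)
    v_hat = variance(treated) / len(treated) + variance(control) / len(control)
    stat = tau_hat / sqrt(v_hat)
    # For a standard normal, P(|N| >= |stat|) = erfc(|stat| / sqrt(2)).
    p_value = erfc(abs(stat) / sqrt(2))
    return stat, p_value

# Made-up observed data: four treated units, four control units.
z = [1, 1, 1, 1, 0, 0, 0, 0]
y_obs = [8.0, 9.0, 7.0, 10.0, 5.0, 6.0, 4.0, 7.0]

stat, p_value = neyman_z_test(z, y_obs)
print(stat, p_value)
```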
To test Fisher’s strong null, we need some stronger mathematical machinery. We’ll limit ourselves to cases where the treatment and outcome are both binary, and use a technique called permutation testing. In this technique, we randomly shuffle the treatment/control labels, and use that to build a null distribution for the difference in outcomes (for a refresher on this technique, see the Data 8 textbook). Rather than randomly shuffling, however, Fisher’s exact test looks at every single possible permutation, and computes a \(p\)-value in closed form.
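The shuffling version of this idea can be sketched as follows. This is the approximate permutation test, not Fisher’s exact test itself. The function name `permutation_p_value`, the number of shuffles, and the binary data (8 of 10 treated units and 3 of 10 control units with outcome 1) are all assumptions made for illustration.

```python
import random
from statistics import mean

def permutation_p_value(z, y_obs, n_shuffles=10_000, seed=0):
    """Approximate two-sided p-value under the sharp null via label shuffling."""
    rng = random.Random(seed)

    def diff(labels):
        treated = [y for t, y in zip(labels, y_obs) if t == 1]
        control = [y for t, y in zip(labels, y_obs) if t == 0]
        return mean(treated) - mean(control)

    observed = diff(z)
    labels = list(z)  # copy so the caller's assignment is not modified
    extreme = 0
    for _ in range(n_shuffles):
        # Under the sharp null, outcomes would be the same under any assignment,
        # so shuffling the labels generates draws from the null distribution.
        rng.shuffle(labels)
        # Small tolerance guards against floating-point ties.
        if abs(diff(labels)) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / n_shuffles

# Made-up binary data: 10 treated units (8 successes), 10 control units (3 successes).
z = [1] * 10 + [0] * 10
y_obs = [1] * 8 + [0] * 2 + [1] * 3 + [0] * 7

observed, p_value = permutation_p_value(z, y_obs)
print(observed, p_value)
```

Replacing the random shuffles with an enumeration of every distinct assignment would give the exact randomization \(p\)-value instead of a Monte Carlo approximation.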
(Optional) Complications with randomized experiments#
Randomized experiments present several challenges that make them infeasible in some circumstances. Here are some of the issues that come up with them. Note that this list is far from exhaustive!
Compliance#
Carrying out a randomized experiment requires that the units follow their treatment/no-treatment assignment. This is easier to ensure in some experiments than others. Consider the following examples:
An experiment to determine whether a new fertilizer increases crop yield
A double-blind vaccine trial
An experiment on whether using a new app for at least 20 minutes a day causes better sleep
An experiment on whether eating a certain amount of chocolate causes improved heart function
For the first two, we can guarantee that the treatment will be properly followed. As experimenters, we know that if a certain unit (plant, person, etc.) is assigned to the treatment group, it will receive the treatment.
For the last two, however, this is more difficult to ensure. While we can ask subjects in a research study to use an app for a certain amount of time per day, we can’t guarantee that every subject will follow the instructions perfectly.
In particular, in some randomized experiments, we can’t guarantee that units will be compliant with the treatment assignment. We can’t solve this by simply removing the units that were non-compliant, since this could introduce bias and/or confounding.
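The simulation below is one hypothetical way to see why dropping non-compliers can mislead. It assumes an unobserved trait `u` that raises the outcome and also determines whether a unit assigned to treatment actually takes it. All numbers, including the true effect of 1, are invented for illustration.

```python
import random
from statistics import mean

random.seed(1)
n = 20_000
true_effect = 1.0

assigned, took, outcome = [], [], []
for _ in range(n):
    u = random.gauss(0, 1)        # hypothetical unobserved trait (e.g., motivation)
    z = random.randint(0, 1)      # randomized assignment
    complies = u > 0              # assumption: only high-u units follow the treatment
    d = 1 if (z == 1 and complies) else 0   # treatment actually received
    y = 5 + 2 * u + true_effect * d + random.gauss(0, 1)
    assigned.append(z)
    took.append(d)
    outcome.append(y)

# "Per-protocol" comparison: drop assigned-to-treatment units that did not comply.
treated_compliers = [y for z, d, y in zip(assigned, took, outcome) if z == 1 and d == 1]
control = [y for z, y in zip(assigned, outcome) if z == 0]
per_protocol = mean(treated_compliers) - mean(control)

print("true effect:", true_effect, "per-protocol estimate:", per_protocol)
```

In this setup, the remaining treated units have systematically higher `u` than the control group, so the comparison is no longer between comparable groups and the estimate should land well above the true effect.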
External validity#
External validity refers to whether or not a finding from a randomized experiment will apply to a broader population of interest. A lack of external validity often arises due to sampling bias, contrived situations in an experiment that don’t reflect real-world conditions, or other similar effects.
Consider the following example:
Suppose we want to determine whether watching a 15-minute video about common pitfalls and misunderstandings of probability helps people make better decisions about whether news stories they read are valid. We recruit a sample of Data 102 students, randomly assign half to watch the video, and assign the other half not to watch the video. We then evaluate how well they can critically evaluate several news stories. We find that the videos have no effect: everyone in our sample does an excellent job of evaluation, regardless of whether or not they watched the video.
In this case, our randomized experiment shows no causal effect, but our sample is not representative of the population at large: we can expect that most Data 102 students, who are probability experts, already know the common pitfalls in the 15-minute video. Among a larger population, however, the same might not be true!
Ethical considerations#
Randomized experiments might not always be ethical. Suppose we want to determine whether or not the death penalty acts as a deterrent: in other words, does instituting the death penalty cause a reduction in crime?
Random assignment of the death penalty is profoundly unethical, so no randomized experiment can ethically be run to answer this question.