
In this section, we’ll build a deeper understanding of randomized experiments. Recall from the last section the table of potential outcomes:

| Unit | Outcome if not treated | Outcome if treated | Treated or not? |
|------|------------------------|--------------------|-----------------|
| 1    | ?                      | $Y_1(1)$           | $Z_1 = 1$       |
| 2    | $Y_2(0)$               | ?                  | $Z_2 = 0$       |
| 3    | $Y_3(0)$               | ?                  | $Z_3 = 0$       |
| 4    | ?                      | $Y_4(1)$           | $Z_4 = 1$       |
| 5    | $Y_5(0)$               | ?                  | $Z_5 = 0$       |

Randomized experiments and potential outcomes

In a randomized experiment, we protect ourselves from dealing with confounding variables by randomizing units into treatment and control. In other words, we choose each $Z_i$ randomly, and we choose it independent of whatever $Y_i(0)$ and $Y_i(1)$ might be. Mathematically, we can write:

$$
\big(Y_i(0), Y_i(1)\big) \perp \!\!\! \perp Z_i
$$

Remember, this doesn’t mean that the treatment is independent of the observed outcome! It only means that the treatment is independent of the pair of potential outcomes. The observed outcome $Y_{i,obs} = Y_i(0)(1-Z_i) + Y_i(1) Z_i$ always depends on the treatment decision. In other words, knowing the treatment decision $Z_i$ always gives us information about which of the two potential outcomes ($Y_i(0)$ or $Y_i(1)$) we observed (except in the uninteresting scenario where the treatment is completely unrelated to the outcome), but it gives us no information about the values of those potential outcomes themselves.

For example, consider a double-blind vaccine trial. We can consider the potential outcomes for a particular subject: they represent what happens to that subject if they get the vaccine ($Y_i(1)$) or if they don’t get the vaccine ($Y_i(0)$). This is the pair of potential outcomes, $\big(Y_i(0), Y_i(1)\big)$. Next, consider the treatment decision $Z_i$: this represents whether the subject got the vaccine ($Z_i = 1$) or a placebo ($Z_i = 0$). These are independent: knowing whether or not a subject got the vaccine/placebo gives us no information about the pair of potential outcomes: it only gives us information about which one of the two we observe in the real world.
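To make this distinction concrete, here is a small simulation (with made-up outcome values and a constant treatment effect, both assumptions for illustration): the random assignment is uncorrelated with each potential outcome, but clearly correlated with the observed outcome.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical potential outcomes: Y(1) is Y(0) plus a constant effect of 3.
y0 = rng.normal(loc=10, scale=2, size=n)
y1 = y0 + 3

# Randomized assignment: Z is drawn independently of (Y(0), Y(1)).
z = rng.integers(0, 2, size=n)

# Observed outcome: Y_obs = Y(0)(1 - Z) + Y(1) Z
y_obs = y0 * (1 - z) + y1 * z

# Z is (nearly) uncorrelated with the potential outcomes...
print(np.corrcoef(z, y0)[0, 1])     # approximately 0
# ...but Z is clearly correlated with the observed outcome.
print(np.corrcoef(z, y_obs)[0, 1])  # noticeably positive
```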

Computing the average treatment effect

You may remember learning that in a randomized controlled trial, we can determine causality by using the difference in means between the treatment and control groups. Let’s now show mathematically that this is true. Recall from the previous section our definition of the ATE $\tau$:

$$
\tau = E[Y_i(1) - Y_i(0)] = E[Y_i(1)] - E[Y_i(0)]
$$

If $Z_i$ and $Y_i(\cdot)$ are independent, then $E[Y_i(\cdot)] = E[Y_i(\cdot) \mid Z_i]$. In other words, conditioning on $Z_i$ shouldn’t change the expectation, as long as $Z_i$ and $Y_i(\cdot)$ are independent. So:

$$
\begin{align}
\tau &= E[Y_i(1)] - E[Y_i(0)] \\
&= E[Y_i(1) \mid Z_i = 1] - E[Y_i(0) \mid Z_i = 0] \quad {\scriptsize (\text{if } (Y_i(0), Y_i(1)) \perp \!\!\! \perp Z_i)}
\end{align}
$$

These two terms correspond to the mean outcomes in the treatment and control groups, respectively. If we have $n$ observations $(Z_1, Y_{1,obs}), \ldots, (Z_n, Y_{n,obs})$, then our empirical estimate for the ATE is just:

$$
\hat{\tau} = \underbrace{\left[\frac{1}{n_1} \sum_{i: Z_i = 1} Y_i\right]}_{=\bar{Y}_{obs,1}} - \underbrace{\left[\frac{1}{n_0} \sum_{i: Z_i = 0} Y_i\right]}_{=\bar{Y}_{obs,0}},
$$

where $n_1$ is the number of treated units, $n_0$ is the number of untreated units, and $\bar{Y}_{obs,1}$ and $\bar{Y}_{obs,0}$ are the means of the treatment and control groups, respectively. This quantity $\hat{\tau}$, which you’ve most likely seen and used before (e.g., in Data 8), has many names. Here are a few of them:

  • The difference in means

  • The simple difference in mean outcomes / simple difference in observed means (SDO)

  • The Neyman estimator

  • The prima facie causal effect, $\tau_{PF}$ (*prima facie* is Latin for “at first sight”).
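As a quick sketch (with hypothetical toy data), the Neyman estimator is just a difference of group means:

```python
import numpy as np

def neyman_estimator(z, y_obs):
    """Difference in observed means between treatment and control groups."""
    z = np.asarray(z)
    y_obs = np.asarray(y_obs)
    return y_obs[z == 1].mean() - y_obs[z == 0].mean()

# Toy data (hypothetical values): three treated units, three control units.
z = np.array([1, 0, 0, 1, 0, 1])
y = np.array([5.0, 2.0, 3.0, 6.0, 1.0, 7.0])
print(neyman_estimator(z, y))  # 6.0 - 2.0 = 4.0
```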

```python
from IPython.display import YouTubeVideo
YouTubeVideo("0o_m_GIfe6I")
```

(Optional) Fixed-sample assumption: Fisher and Neyman

In this section, we’ll analyze randomized experiments and the Neyman estimator under the fixed-sample assumption. Recall that in this setting, we assume that $Z_i$ (which we observe) is random, but that $Y_i(0)$ and $Y_i(1)$ (which are unknown) are fixed. In this case, the statement of independence above doesn’t really make sense, since $Y_i(0)$ and $Y_i(1)$ are not random. However, we can still compute the Neyman estimator. We’ll develop some properties of the estimator, then use those to construct a confidence interval for the estimated ATE. Finally, we’ll look at two different hypothesis tests used in randomized experiments.

Properties of the Neyman estimator

The Neyman estimator has two useful properties: the first is that it’s unbiased, and the second is that its variance can be bounded in terms of the variances of the potential outcomes within the two groups:

$$
\begin{align*}
E[\hat{\tau}] &= \tau \\
\text{var}(\hat{\tau}) &\leq \frac{\hat{\sigma}_1^2}{n_1} + \frac{\hat{\sigma}_0^2}{n_0},
\end{align*}
$$

where $\hat{\sigma}_k^2$ is the sample variance of the potential outcomes $Y_1(k), \ldots, Y_n(k)$:

$$
\hat{\sigma}_k^2 = \frac{1}{n - 1} \sum_{i} \big(Y_i(k) - \bar{Y}(k)\big)^2
$$

Since this bound depends on counterfactual outcomes that we don’t get to observe, we’ll typically approximate it by replacing the true sample variances with the sample variances within each observed group in our data, and call the result $\hat{V}$:

$$
\hat{V} = \Bigg[\frac{1}{n_1} \underbrace{\frac{1}{n_1 - 1} \sum_{i: Z_i = 1} \big(Y_i - \bar{Y}_{obs,1}\big)^2}_{\text{sample variance of treatment group}}\Bigg] + \Bigg[\frac{1}{n_0} \underbrace{\frac{1}{n_0 - 1} \sum_{i: Z_i = 0} \big(Y_i - \bar{Y}_{obs,0}\big)^2}_{\text{sample variance of control group}}\Bigg]
$$
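Computing $\hat{V}$ from observed data is a one-liner per group; here is a minimal sketch using hypothetical toy numbers:

```python
import numpy as np

def v_hat(z, y_obs):
    """Conservative variance estimate for the Neyman estimator."""
    z = np.asarray(z)
    y_obs = np.asarray(y_obs)
    y1, y0 = y_obs[z == 1], y_obs[z == 0]
    # ddof=1 gives the (n - 1)-denominator sample variance within each group.
    return y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0)

# Toy data (hypothetical values).
z = np.array([1, 0, 0, 1, 0, 1])
y = np.array([5.0, 2.0, 3.0, 6.0, 1.0, 7.0])
print(v_hat(z, y))  # 1/3 + 1/3 ≈ 0.667
```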

It can be shown that under certain regularity conditions, as the number of samples grows larger, the distribution of the quantity $\frac{\hat{\tau} - \tau}{\sqrt{\hat{V}}}$ converges to a normal distribution $N(0, \sigma^2)$, where $\sigma^2 \leq 1$.

Confidence intervals for the ATE

Given the fact above, we can construct an asymptotically valid $95\%$ confidence interval for $\tau$ as follows:

$$
\left(\hat{\tau} - 1.96\sqrt{\hat{V}},\; \hat{\tau} + 1.96\sqrt{\hat{V}}\right)
$$
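Putting the pieces together, a minimal sketch of this interval (with hypothetical toy data; a sample this small is only for illustration, since the interval is asymptotic):

```python
import numpy as np

def neyman_ci(z, y_obs, z_crit=1.96):
    """Asymptotic 95% confidence interval for the ATE."""
    z = np.asarray(z)
    y_obs = np.asarray(y_obs)
    y1, y0 = y_obs[z == 1], y_obs[z == 0]
    tau_hat = y1.mean() - y0.mean()
    v_hat = y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0)
    half_width = z_crit * np.sqrt(v_hat)
    return tau_hat - half_width, tau_hat + half_width

# Toy data (hypothetical values).
z = np.array([1, 0, 0, 1, 0, 1])
y = np.array([5.0, 2.0, 3.0, 6.0, 1.0, 7.0])
lo, hi = neyman_ci(z, y)
print(lo, hi)  # roughly (2.40, 5.60)
```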

Hypothesis testing for causal effects

In the classic null hypothesis significance testing framework, there are two different null hypotheses commonly used when measuring causal effects in randomized trials:

  1. Fisher’s strong null (also known as Fisher’s sharp null) states that for every unit, the treatment effect $Y_i(1) - Y_i(0) = 0$.

  2. Neyman’s weak null states that the average treatment effect is 0. In other words, even if some individual treatment effects are positive and some are negative, they average out to 0.

The first is a much stricter null hypothesis (hence the name “strong null”), since it states that all the treatment effects are 0. The second is looser, requiring only that the treatment effects average out to 0. Because of this, the strong null is often easier to reject.

To test a hypothesis against Neyman’s weak null, we construct the test statistic $\frac{\hat{\tau}}{\sqrt{\hat{V}}}$. Under the Neyman weak null hypothesis, this should follow a standard normal distribution.
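A sketch of this test, using only the standard library’s `math.erfc` for the normal tail probability (the toy data are hypothetical, and far too small for the normal approximation to be trustworthy in practice; this just shows the mechanics):

```python
import math
import numpy as np

def weak_null_test(z, y_obs):
    """Two-sided test of Neyman's weak null (ATE = 0)."""
    z = np.asarray(z)
    y_obs = np.asarray(y_obs)
    y1, y0 = y_obs[z == 1], y_obs[z == 0]
    tau_hat = y1.mean() - y0.mean()
    v_hat = y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0)
    t_stat = tau_hat / math.sqrt(v_hat)
    # Two-sided p-value against a standard normal reference:
    # 2 * P(N(0,1) > |t|) = erfc(|t| / sqrt(2)).
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return t_stat, p_value

# Toy data (hypothetical values).
z = np.array([1, 0, 0, 1, 0, 1])
y = np.array([5.0, 2.0, 3.0, 6.0, 1.0, 7.0])
t_stat, p_value = weak_null_test(z, y)
```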

To test a hypothesis against Fisher’s strong null, we need some stronger mathematical machinery. We’ll limit ourselves to cases where the treatment and outcome are both binary, and use a technique called permutation testing. In this technique, we randomly shuffle the treatment/control labels, and use that to build a null distribution for the difference in outcomes (for a refresher on this technique, see the Data 8 textbook). Instead of randomly shuffling, however, Fisher’s exact test looks at every single possible permutation, and computes a $p$-value in closed form.
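A sketch of the shuffling version (the treatment and outcome here are binary, and the data are made up for illustration):

```python
import numpy as np

def permutation_test(z, y_obs, n_perms=10_000, seed=0):
    """Permutation test of Fisher's sharp null.

    Under the sharp null, each unit's observed outcome would be unchanged
    under a different assignment, so shuffling the labels yields valid
    draws from the null distribution of the difference in means.
    """
    rng = np.random.default_rng(seed)
    z = np.asarray(z)
    y_obs = np.asarray(y_obs)
    observed = y_obs[z == 1].mean() - y_obs[z == 0].mean()
    null_stats = np.empty(n_perms)
    for i in range(n_perms):
        z_perm = rng.permutation(z)
        null_stats[i] = y_obs[z_perm == 1].mean() - y_obs[z_perm == 0].mean()
    # Two-sided p-value: fraction of shuffles at least as extreme as observed.
    return np.mean(np.abs(null_stats) >= abs(observed))

# Hypothetical binary data: 8/10 successes in treatment, 2/10 in control.
z = np.array([1] * 10 + [0] * 10)
y = np.array([1] * 8 + [0] * 2 + [1] * 2 + [0] * 8)
p = permutation_test(z, y)
```

For the closed-form version, `scipy.stats.fisher_exact` computes Fisher’s exact test directly from the 2×2 contingency table of treatment versus outcome.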

(Optional) Complications with randomized experiments

Randomized experiments present several challenges that make them infeasible in some circumstances. Here are some of the issues that come up with them. Note that this list is far from exhaustive!

Compliance

Carrying out a randomized experiment requires that the units follow their treatment/no-treatment assignment. This is easier to ensure in some experiments than others. Consider the following examples:

  1. An experiment to determine whether a new fertilizer increases crop yield

  2. A double-blind vaccine trial

  3. An experiment on whether using a mindfulness app for at least 20 minutes a day causes better sleep

  4. An experiment on whether eating a certain amount of chocolate causes improved heart function

For the first two, we can guarantee that the treatment will be properly followed. As experimenters, we know that if a certain unit (plant, person, etc.) is assigned to the treatment group, then it will receive the treatment.

For the last two, however, this is more difficult to ensure. While we can ask subjects in a research study to use an app for a certain amount of time per day, we can’t guarantee that every subject will follow the instructions perfectly.

In particular, in some randomized experiments, we can’t guarantee that units will be compliant with the treatment assignment. We can’t solve this by simply removing the units that were non-compliant, since this could introduce bias and/or confounding.

External validity

External validity refers to whether or not a finding from a randomized experiment will apply to a broader population of interest. Problems with external validity often arise from sampling bias, contrived experimental situations that don’t reflect real-world conditions, or other similar effects.

Consider the following example:

Suppose we want to determine whether watching a 15-minute video about common pitfalls and misunderstandings of probability helps people make better decisions about whether news stories they read are valid. We recruit a random sample of Data 102 students, randomly assign half to watch the video, and assign the other half not to watch the video. We then evaluate how well the students can critically evaluate several news stories. We find that the videos have no effect: everyone in our sample does an excellent job of evaluation, regardless of whether or not they watched the video.

In this case, our randomized experiment shows no causal effect, but our sample is not representative of the population at large: we can expect that most Data 102 students, who are probability experts, already know the common pitfalls in the 15-minute video. Among a larger population, however, the same might not be true!

Ethical considerations

Randomized experiments might not always be ethical. Suppose we want to determine whether or not the death penalty acts as a deterrent: in other words, does instituting the death penalty cause a reduction in crime?

Random assignment of the death penalty is profoundly unethical, so no randomized experiment can answer this question.