
Special thanks to Yan Shuo Tan, who wrote most of this section’s content.

Review and introduction

To briefly recap what we have learnt so far:

  1. We defined a superpopulation model, i.e. a distribution for $(X_i, Z_i, Y_i(0), Y_i(1))$:

    • $Z$ is the (binary) treatment decision,

    • $Y(0)$ and $Y(1)$ are the potential outcomes in the universes where the unit wasn’t/was treated,

    • $X$ is a confounding variable (in other words, it has a causal effect on $Z$ and on $Y$).

    So far, we haven’t needed to make any assumptions about the distribution of these variables in general (only that it exists).

  2. We defined our quantity of interest, the average treatment effect (ATE): $\tau = E[Y(1) - Y(0)]$, which tells us the average effect of the treatment. We saw that this is impossible to estimate unless we make further assumptions.

  3. We saw that in a randomized experiment, we have the following:

    • The treatment decisions are random, and therefore are independent of the potential outcomes.

    • In other words, $\big(Y(0), Y(1)\big) \perp\!\!\perp Z$.

In this section, we’ll investigate how we can estimate the ATE in situations where we have unknown confounding variables. We’ll rely on natural experiments to help us. Note that you’ve probably seen natural experiments before in Data 8, when learning about John Snow’s study of cholera.

Linear structural model (LSM)

In some fields (such as economics), it is typical to work with structural models, which place some restrictions on the joint distribution of all the variables, and in doing so, make it easier to estimate the parameters of the model.

We will work with the linear structural model relating our outcome $Y$ to our treatment $Z$ and confounding variable(s) $X$:

$$Y = \alpha + \tau Z + \beta^T X + \epsilon,$$

where $\epsilon$ has mean zero, and is independent of $Z$ and $X$ (in economics, we say that $\epsilon$ is exogenous). We sometimes further assume that $\epsilon \sim \mathcal{N}(0, \sigma^2)$, but this is not necessary for any of the analysis we’re going to do.

Note: in general, we often add the further structural equation $Z = f(X, \delta)$, where $\delta$ is an exogenous noise variable, and $f$ encodes the structural relationship between $X$ and $Z$. We won’t go into this level of detail, but when reading this equation, you should assume that $\textrm{Cov}(Z, X)$ is not necessarily 0.

This is not quite the same as the linear model that we saw when we learned about GLMs, and that you’ve seen in previous classes! While it looks very similar, the linear model we worked with before is a statement about associations and predictions; the linear structural model is a statement about interventions and actions.

Specifically, this model assumes that for unit $i$, if we could set $Z_i = 1$, we would observe $Y_i(1) = \alpha + \tau + \beta^T X_i + \epsilon_i$, and if we could set $Z_i = 0$, we would observe $Y_i(0) = \alpha + \beta^T X_i + \epsilon_i$. (If $Z$ is not binary, then there will be a potential outcome for each possible value of $Z$.) This is a subtle but important point, which also situates the linear structural model as a special case of the potential outcomes framework!

From this, we see that the average treatment effect in this model is $\tau$ (can you show this is true?), and furthermore, that the individual treatment effect for every unit is

$$Y_i(1) - Y_i(0) = \tau.$$

In other words, the linear structural model is making an implicit assumption that the treatment effect is constant across all units.
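To make this concrete, here is a minimal simulation sketch of the linear structural model (with a scalar confounder and made-up parameter values, which are our choices rather than anything from the text), confirming that every unit’s treatment effect is exactly $\tau$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Made-up parameter values for illustration; the true ATE is tau = 2
alpha, tau, beta = 1.0, 2.0, 0.5

X = rng.normal(size=n)    # scalar confounder
eps = rng.normal(size=n)  # exogenous noise

# Potential outcomes under the linear structural model
Y0 = alpha + tau * 0 + beta * X + eps
Y1 = alpha + tau * 1 + beta * X + eps

print(np.mean(Y1 - Y0))   # exactly 2.0: the effect is constant across units
```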

Causal graphs and LSMs

Apart from the causal effect of $Z$ on $Y$, the linear structural model also does something new: it asserts causal relationships between the other variables, i.e. it tells us how $Z$ and $Y$ change if we manipulate $X$.

The above linear structural model can be represented graphically as follows:

As a reminder, the arrows from $X$ into $Z$ and $Y$ assert that $X$ causes both $Z$ and $Y$ (i.e. intervening on $X$ changes the values of $Z$ and $Y$), and the arrow from $Z$ into $Y$ asserts that $Z$ causes $Y$.

Confounding and omitted variable bias

In many scenarios, confounding is complicated and involves many different variables, and it may be impossible to collect, observe, or describe all of them. In that case, we must treat $X$ as unobserved. If this happens, then just as before, we’re in trouble because of confounding. Here are some examples. In each one, we’ve only listed one possible confounder $X$, but there are likely many more: can you think of at least one for each example?

| Treatment $Z$ | Outcome $Y$ | Possible confounder(s) $X$ |
| --- | --- | --- |
| Health insurance | Health outcomes | Socioeconomic background |
| Military service | Salary | Socioeconomic background |
| Family size | Whether the mother is in the labor force | Socioeconomic background |
| Years of schooling | Salary | Socioeconomic background |
| Smoking | Lung cancer | Socioeconomic background |

Note that in most of these examples, socioeconomic background is a confounder. This is particularly common in economics and econometrics, where most of the methods in this section originated.

Let’s be a bit more precise about quantifying the effect of confounding. Specifically, we’ll assume the linear structural model above, and then see what happens when we naively try to fit a linear regression of $Y$ on $Z$, without accounting for $X$.

Let $\hat{\tau}_{OLS}$ be the solution of the least squares problem $\min_{\tau, \alpha} \mathbb{E}[(\alpha + \tau Z - Y)^2]$. We then get

$$
\begin{align}
\hat{\tau}_{OLS} &= \frac{\text{Cov}(Y,Z)}{\text{Var}(Z)} \\
&= \frac{\text{Cov}(\alpha + \tau Z + \beta^T X + \epsilon, Z)}{\text{Var}(Z)} \\
&= \frac{\text{Cov}(\tau Z, Z)}{\text{Var}(Z)} + \frac{\text{Cov}(\beta^T X, Z)}{\text{Var}(Z)} \\
&= \underbrace{\tau}_{\text{true ATE}} + \underbrace{\beta^T \frac{\text{Cov}(X,Z)}{\text{Var}(Z)}}_{\text{bias involving } X}.
\end{align}
$$

The second term is a bias in the $\hat{\tau}_{OLS}$ estimator: in other words, it’s the difference between the true value and the estimator, and it depends on the omitted (i.e., unobserved) variable $X$. So, we’ll call this term $\beta^T \frac{\text{Cov}(X,Z)}{\text{Var}(Z)}$ the omitted variable bias.

Remark: $\frac{\text{Cov}(Y,Z)}{\text{Var}(Z)}$ is the infinite-population version of the typical formula $\hat{\tau}_{OLS} = (Z^T Z)^{-1} Z^T Y$, where we now use $Z$ and $Y$ to denote matrices/vectors.
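We can check this formula in a quick simulation sketch (the data-generating process below, in which $X$ drives both $Z$ and $Y$, is made up for illustration): the naive OLS slope matches $\tau$ plus the predicted bias term, not $\tau$ itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

alpha, tau, beta = 1.0, 2.0, 0.5   # made-up values; the true ATE is tau = 2
X = rng.normal(size=n)
Z = (X + rng.normal(size=n) > 0).astype(float)   # treatment confounded by X
Y = alpha + tau * Z + beta * X + rng.normal(size=n)

# Naive regression of Y on Z alone (with an intercept)
tau_ols = np.cov(Y, Z, bias=True)[0, 1] / np.var(Z)

# The omitted variable bias predicted by the derivation above
bias = beta * np.cov(X, Z, bias=True)[0, 1] / np.var(Z)

print(tau_ols, tau + bias)   # the two agree, and both exceed the true tau = 2
```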

Why can’t we just adjust for confounding? Having such confounders is problematic because, in order to avoid omitted variable bias, we need to have observed them and added them to our regression (collecting such data may not always be feasible, for a number of reasons). Furthermore, there could always be other confounders that we are unaware of, which leaves our causal conclusions under an inescapable cloud of doubt.

Instrumental Variables

Is there a middle way between a randomized experiment and assuming unconfoundedness, which is sometimes unrealistic?

One way forward is when nature provides us with a “partial” natural experiment, i.e. we have a truly randomized “instrument” that injects an element of partial randomization into the treatment variable of interest. This is the idea of instrumental variables. We will first define the concept mathematically, and then illustrate what it means for a few examples.

Definition: Assume the linear structural model defined above. We further assume a variable $W$ such that $Z = \alpha' + \gamma W + (\beta')^T X + \delta$, with $\gamma \neq 0$ (relevance), and $W$ independent of $X$, $\delta$, and $\epsilon$ (exogeneity). Such a $W$ is called an instrumental variable.

Remark: This replaces the earlier structural equation $Z = f(X, \delta)$.

Let us now see how to use $W$ to identify the ATE $\tau$.

$$
\begin{align}
\textrm{Cov}(Y,W) &= \textrm{Cov}(\alpha + \tau Z + \beta^T X + \epsilon, W) \\
&= \tau \, \textrm{Cov}(Z,W) \\
&= \tau \, \textrm{Cov}(\alpha' + \gamma W + (\beta')^T X + \delta, W) \\
&= \tau \gamma \, \textrm{Var}(W).
\end{align}
$$

Here, the second equality follows from the exogeneity of $W$. Meanwhile, a similar computation with $Z$ and $W$ gives us

$$\textrm{Cov}(Z,W) = \gamma \, \textrm{Var}(W).$$

Putting everything together gives

$$\tau = \frac{\frac{\textrm{Cov}(Y,W)}{\textrm{Var}(W)}}{\frac{\textrm{Cov}(Z,W)}{\textrm{Var}(W)}}.$$

In other words, $\tau$ is the ratio between the (infinite-population) coefficient from regressing $Y$ on $W$ and the coefficient from regressing $Z$ on $W$.

This motivates the instrumental variable estimator of the ATE in finite samples:

$$\hat{\tau}_{IV} = \frac{\overbrace{(W^T W)^{-1} W^T Y}^{\text{OLS coeff. of } W \text{ for } Y}}{\underbrace{(W^T W)^{-1} W^T Z}_{\text{OLS coeff. of } W \text{ for } Z}},$$

where again, abusing notation, $W$, $Z$, and $Y$ refer to the vectors of observations. If $\alpha' = 0$, then this is a plug-in estimator of $\tau$, and is consistent.
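Here is a minimal sketch of this estimator in code, on simulated data from a made-up data-generating process (centering the variables first, which handles nonzero intercepts rather than assuming $\alpha' = 0$):

```python
import numpy as np

def iv_estimator(W, Z, Y):
    """IV estimate of tau: the ratio of the OLS coefficient of W for Y
    to the OLS coefficient of W for Z (scalar instrument and treatment).
    Centering lets us ignore the intercepts."""
    W, Z, Y = W - W.mean(), Z - Z.mean(), Y - Y.mean()
    coef_Y = (W @ Y) / (W @ W)   # OLS coefficient of W for Y
    coef_Z = (W @ Z) / (W @ W)   # OLS coefficient of W for Z
    return coef_Y / coef_Z

# Simulated example: W randomizes part of Z, while X confounds Z and Y
rng = np.random.default_rng(2)
n = 100_000
X = rng.normal(size=n)
W = rng.normal(size=n)                               # independent of X
Z = 0.8 * W + X + rng.normal(size=n)                 # relevance: gamma = 0.8
Y = 1.0 + 2.0 * Z + 0.5 * X + rng.normal(size=n)     # true tau = 2

print(iv_estimator(W, Z, Y))   # ~2.0, despite the unobserved confounder X
```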

Further interpretation for binary $W$: When $W$ is binary, we can show that

$$\tau = \frac{\mathbb{E}[Y \mid W=1] - \mathbb{E}[Y \mid W=0]}{\mathbb{E}[Z \mid W=1] - \mathbb{E}[Z \mid W=0]}.$$

Hence, we can think of IV as measuring the ratio of the prima facie treatment effect of $W$ on $Y$ and that of $W$ on $Z$.
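A minimal sketch of this ratio-of-differences estimator (assuming $W$, $Z$, $Y$ are NumPy arrays with $W$ binary; the compliance model below is made up for illustration):

```python
import numpy as np

def binary_iv_estimator(W, Z, Y):
    """Plug-in estimate for a binary instrument W: the difference in mean
    outcomes divided by the difference in mean treatments."""
    num = Y[W == 1].mean() - Y[W == 0].mean()
    den = Z[W == 1].mean() - Z[W == 0].mean()
    return num / den

# Example: a lottery W nudges (but does not force) a binary treatment Z
rng = np.random.default_rng(3)
n = 100_000
X = rng.normal(size=n)
W = rng.integers(0, 2, size=n)                       # randomized instrument
Z = (0.9 * W + X + rng.normal(size=n) > 0.5).astype(float)
Y = 1.0 + 2.0 * Z + 0.5 * X + rng.normal(size=n)     # true tau = 2

print(binary_iv_estimator(W, Z, Y))   # ~2.0
```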

Causal graph for instrumental variables

The relationships between $W$, $Z$, $X$, and $Y$ can be represented as the following causal graph:

How to read this graph:

  • The arrow from $W$ into $Z$ shows that $W$ has a causal effect on $Z$.

  • The absence of any arrow into $W$ means that $W$ is exogenous, i.e. no variable in the diagram causes $W$, and in particular $W$ is independent of $X$.

  • The absence of an arrow from $W$ into $Y$ means that the only effect of $W$ on $Y$ is through $Z$.

  • We shaded in $W$, $Z$, and $Y$ because these nodes are observed, but $X$ is unshaded because it is latent (unobserved).

Note that we do not need to know or even be aware of what $X$ is in order for instrumental variables to work! It doesn’t matter how many confounders there are, or whether we’re even able to list all of them: as long as we can guarantee that they have no causal relationship to the instrument (exogeneity), and that the instrument affects the outcome only through the treatment (exclusion restriction), instrumental variables will work.

Examples of instrumental variables

Let’s examine what we might use as instrumental variables for the five examples from the table in the previous section. The first four are taken from the econometrics literature:

Example 1: $Z$ is health insurance, $Y$ is health outcomes, $X$ is socioeconomic background. Baicker et al. (2013) used the 2008 expansion of Medicaid in Oregon via lottery. The instrument $W$ here was the lottery assignment. We previously talked about why this was an imperfect experiment for compliance reasons (only a fraction of individuals who won the lottery enrolled in Medicaid), so IV provides a way of overcoming this limitation.

Example 2: $Z$ is military service, $Y$ is salary, $X$ is socioeconomic background. Angrist (1990) used the Vietnam-era draft lottery as the instrument $W$, and found that among white veterans, there was a 15% drop in earnings compared to non-veterans.

Example 3: $Z$ is family size, $Y$ is the mother’s employment, $X$ is socioeconomic background. Angrist and Evans (1998) used sibling sex composition (in other words, the assigned sexes at birth of the first two children) as the IV. This is a plausible instrument because sibling sex composition is essentially random, and parents in the US with two children of the same sex are more likely to have a third child than parents with two children of different sexes.

Example 4: $Z$ is years of schooling, $Y$ is salary, $X$ is socioeconomic background. Card (1993) used geographical variation in college proximity as the instrumental variable.

Example 5: $Z$ is smoking, $Y$ is lung cancer, $X$ is socioeconomic background. Unfortunately, this example does not lend itself well to an instrumental variable: despite decades of searching, nobody has yet found one that is convincing. This leads to an important lesson: not every problem is amenable to the use of instrumental variables, or even natural experiments!

As we see in these examples, sometimes you need to be quite ingenious to come up with an appropriate instrumental variable. Joshua Angrist, David Card, and Guido Imbens, who are named in several of these examples, are phenomenally good at this: in fact, they won the Nobel Prize in economics for their collected body of work!

Extensions

Multiple treatments / instruments, and two-stage least squares.

So far, we have considered a scalar treatment and a scalar instrument, $Z$ and $W$. It is also possible to consider vector-valued instruments and treatments. To generalize IV to this setting, we need to recast the IV estimator from the previous sections as follows.

First, define the conditional expectation $\tilde{Z} = \mathbb{E}[Z \mid W]$, and observe that $\tilde{Z} = \alpha' + \gamma W$.

If we regress $Y$ on $\tilde{Z}$, the regression coefficient we obtain is

$$
\begin{align}
\frac{\textrm{Cov}(\tilde{Z}, Y)}{\textrm{Var}(\tilde{Z})} &= \frac{\textrm{Cov}(\tilde{Z}, \alpha + \tau Z + \beta^T X + \epsilon)}{\textrm{Var}(\tilde{Z})} \\
&= \frac{\textrm{Cov}(\tilde{Z}, \tau Z)}{\textrm{Var}(\tilde{Z})} \\
&= \tau \frac{\textrm{Cov}(\tilde{Z}, Z)}{\textrm{Var}(\tilde{Z})} \\
&= \tau.
\end{align}
$$

Here, the second equality holds because $W$ (and hence $\tilde{Z}$) is independent of $X$ and $\epsilon$, while the fourth equality holds because of a property of conditional expectations: $\textrm{Cov}(\tilde{Z}, Z) = \textrm{Cov}(\mathbb{E}[Z \mid W], Z) = \textrm{Var}(\mathbb{E}[Z \mid W]) = \textrm{Var}(\tilde{Z})$. (One can also check this by hand by expanding out $Z = \alpha' + \gamma W + (\beta')^T X + \delta$.)

In finite samples, we thus arrive at the following algorithm:

Two-stage least squares algorithm (2SLS):

  • Step 1: Regress $Z$ on $W$ to get $\tilde{Z} = W\hat{\gamma} = W(W^T W)^{-1} W^T Z$.

  • Step 2: Regress $Y$ on $\tilde{Z}$ to get $\hat{\tau}_{2SLS} = (\tilde{Z}^T \tilde{Z})^{-1} \tilde{Z}^T Y$.

For the scalar setting, it is easy to see that $\hat{\tau}_{2SLS} = \hat{\tau}_{IV}$, but the benefit of this formulation is that it directly applies for vector-valued $Z$ and $W$.
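A minimal 2SLS sketch for vector-valued $W$ and $Z$ (this sketch centers the variables so intercepts drop out, and the two-instrument, two-treatment data-generating process below is made up for illustration):

```python
import numpy as np

def two_stage_least_squares(W, Z, Y):
    """2SLS sketch: W is an (n, q) instrument matrix, Z is an (n, p)
    treatment matrix, Y is an (n,) outcome vector. Variables are
    centered so that intercepts can be ignored."""
    W = W - W.mean(axis=0)
    Z = Z - Z.mean(axis=0)
    Y = Y - Y.mean()
    # Step 1: regress (each column of) Z on W to get the fitted treatments
    gamma_hat, *_ = np.linalg.lstsq(W, Z, rcond=None)
    Z_tilde = W @ gamma_hat
    # Step 2: regress Y on the fitted treatments Z_tilde
    tau_hat, *_ = np.linalg.lstsq(Z_tilde, Y, rcond=None)
    return tau_hat

# Example: two treatments, two instruments, one unobserved confounder X
rng = np.random.default_rng(4)
n = 200_000
X = rng.normal(size=n)
W = rng.normal(size=(n, 2))
Z = W @ np.array([[0.8, 0.1], [0.2, 0.9]]) \
    + np.outer(X, [1.0, -1.0]) + rng.normal(size=(n, 2))
Y = 1.0 + Z @ np.array([2.0, -1.0]) + 0.5 * X + rng.normal(size=n)

print(two_stage_least_squares(W, Z, Y))   # ~[2.0, -1.0]
```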

(Optional) A non-parametric perspective on instrumental variables

In this notebook, we have introduced instrumental variables in the context of linear structural models. What if our model is nonlinear?

In an amazing coincidence, for binary treatment $Z$, the expression

$$\tau = \frac{\mathbb{E}[Y \mid W=1] - \mathbb{E}[Y \mid W=0]}{\mathbb{E}[Z \mid W=1] - \mathbb{E}[Z \mid W=0]}$$

has a meaning beyond the linear model setting. This is the subject of a groundbreaking 1996 paper by Angrist and Imbens.