Instrumental variables#
Review and introduction#
To briefly recap what we have learnt so far:
We defined a non-parametric superpopulation model, i.e. a distribution for \((X,Z,Y(0),Y(1))\):
\(Z\) is the (binary) treatment decision,
\(Y(0)\) and \(Y(1)\) are the potential outcomes in the universes where the unit wasn’t/was treated,
\(X\) is a confounding variable (in other words, it has a causal effect on \(Z\) and on \(Y\))
So far, we haven’t needed to make any assumptions about the distribution in general (only that it exists)
We defined a quantity of interest, the average treatment effect (ATE): \(\tau = \mathbb{E}[Y(1) - Y(0)]\), which is impossible to estimate unless we make further assumptions
We saw that in a randomized experiment, we have the following:
The treatment decisions are random, and independent of the potential outcomes
In other words, \(\big(Y(0),Y(1)\big)\perp\!\!\perp Z\)
In this section, we’ll investigate how we can estimate the ATE in situations where we have unknown confounding variables. We’ll rely on natural experiments to help us. Note that you’ve probably seen natural experiments before in Data 8 when learning about John Snow’s study of cholera.
Linear structural model (LSM)#
In some fields (such as economics), it is typical to work with structural models, which place some restrictions on the joint distribution of all the variables, but also make it easier to estimate the parameters of the model.
Definition: We will work with the linear structural model
$$
Y = \alpha + \tau Z + \beta^T X + \epsilon,
$$
where \(\epsilon\) has mean zero, and is independent of \(Z\) and \(X\) (in economics, we say that \(\epsilon\) is exogenous). We sometimes further assume that \(\epsilon \sim \mathcal{N}(0,\sigma^2)\), but this is not necessary for any of the analysis we’re going to do. \(Y\) is still our outcome variable, \(Z\) is our treatment variable of interest (typically binary), and \(X\) is a confounding variable (or variables).
Note: We usually add the further structural equation \(Z = f(X, \delta)\) where \(\delta\) is an exogenous noise variable, and \(f\) encodes the structural relationship between \(X\) and \(Z\), but all you need to know for now is that we allow for \(\textrm{Cov}(Z,X) \neq 0\).
This is not the same as the linear model that we have seen when we learned about GLMs, and that you’ve seen in previous classes. In those settings, we were assuming a model about the joint distribution of the variables, which is a statement about associations and predictions.
On the other hand, a linear structural model encodes an assumption about intervention. It assumes that for unit \(i\), if we could set \(Z_i = 1\), we would observe \(Y_i(1) = \alpha + \tau + \beta^TX_i + \epsilon_i\), and if we could set \(Z_i = 0\), we would observe \(Y_i(0) = \alpha + \beta^TX_i + \epsilon_i\). If \(Z\) is not binary, then there will be a potential outcome for each possible value of \(Z\). This is a subtle but important point, which also situates the linear structural model as a special case of the potential outcomes framework.
From this, we see that the average treatment effect in this model is \(\tau\) (can you show this is true?), and furthermore, that the individual treatment effect for every unit is
$$
Y_i(1) - Y_i(0) = \tau.
$$
In other words, the linear structural model is making an implicit assumption that the treatment effect is constant.
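To make this concrete, here is a minimal simulation sketch of the linear structural model (the parameter values and the particular choice of \(f\) below are illustrative assumptions, not anything prescribed by the model):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
alpha, tau, beta = 1.0, 2.0, 3.0          # hypothetical structural parameters

X = rng.normal(size=n)                    # confounder
delta = rng.normal(size=n)                # exogenous noise in the treatment equation
Z = (X + delta > 0).astype(float)         # one possible choice of Z = f(X, delta),
                                          # so that Cov(Z, X) != 0
eps = rng.normal(size=n)                  # mean-zero noise, independent of X and Z
Y = alpha + tau * Z + beta * X + eps      # observed outcome

# The structural equation lets us compute *both* potential outcomes for every
# unit by setting Z ourselves (an intervention), instead of letting f generate it:
Y0 = alpha + tau * 0 + beta * X + eps
Y1 = alpha + tau * 1 + beta * X + eps
print(np.allclose(Y1 - Y0, tau))          # True: every individual effect is tau
print((Y1 - Y0).mean())                   # so the ATE is tau = 2.0
```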
Causal graphs and LSMs#
Apart from the causal effect of \(Z\) on \(Y\), the linear structural model also does something new: it asserts the causal relationships between the other variables, i.e. it tells us how \(Z\) and \(Y\) change if we manipulate \(X\). The above linear structural model can be represented graphically as follows:
As a reminder, the arrows from \(X\) into \(Z\) and \(Y\) assert that \(X\) causes both \(Z\) and \(Y\) (i.e. intervening on \(X\) changes the values of \(Z\) and \(Y\)), and the arrow from \(Z\) into \(Y\) asserts that \(Z\) causes \(Y\).
Confounding and omitted variable bias#
In many scenarios, confounding is complicated and involves many different variables, and it may be impossible to describe all of them. In that case (i.e., if we don't observe \(X\)), we are in trouble because of confounding. Since we're assuming a linear model, let's see what happens when we try to fit a linear regression of \(Y\) on \(Z\) alone. Let \(\tau_{OLS}\) be the solution of the least squares problem \(\min_{\tau,\alpha} \mathbb{E}[(\alpha + \tau Z - Y)^2]\). We then get
$$
\begin{align}
\tau_{OLS} & = \frac{\text{Cov}(Y,Z)}{\text{Var}(Z)} \\
& = \frac{\text{Cov}(\alpha + \tau Z + \beta^TX + \epsilon,Z)}{\text{Var}(Z)} \\
& = \frac{\text{Cov}(\tau Z,Z)}{\text{Var}(Z)} + \frac{\text{Cov}(\beta^TX,Z)}{\text{Var}(Z)} \\
& = \tau + \beta^T\frac{\text{Cov}(X,Z)}{\text{Var}(Z)},
\end{align}
$$
where the third equality uses the fact that \(\epsilon\) is independent of \(Z\).
The first term is the true ATE. The second term, \(\beta^T\frac{\text{Cov}(X,Z)}{\text{Var}(Z)}\), is the bias in the \(\tau_{OLS}\) estimator, which we call the omitted variable bias.
Remark: \(\frac{\text{Cov}(Y,Z)}{\text{Var}(Z)}\) is the infinite population version of the typical formula \(\hat{\tau}_{OLS} = (Z^TZ)^{-1}Z^TY\), where we now use \(Z\) and \(Y\) to denote matrices/vectors.
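To see the omitted variable bias numerically, here is a sketch in the same simulated setup as above (parameter values again illustrative): the OLS slope of \(Y\) on \(Z\) alone lands on \(\tau + \beta^T\frac{\text{Cov}(X,Z)}{\text{Var}(Z)}\), not on \(\tau\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
alpha, tau, beta = 1.0, 2.0, 3.0                   # hypothetical structural parameters

X = rng.normal(size=n)                             # confounder (imagine it unobserved)
Z = (X + rng.normal(size=n) > 0).astype(float)     # treatment depends on X
Y = alpha + tau * Z + beta * X + rng.normal(size=n)

# OLS slope of Y on Z (with intercept): Cov(Y, Z) / Var(Z)
tau_ols = np.cov(Y, Z)[0, 1] / np.var(Z, ddof=1)

# The omitted variable bias formula predicts the same (wrong) answer
predicted = tau + beta * np.cov(X, Z)[0, 1] / np.var(Z, ddof=1)

print(tau_ols, predicted)   # both around 5.4 here, far from the true tau = 2.0
```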
Here are some examples:
| Treatment \(Z\) | Outcome \(Y\) | Possible confounder(s) \(X\) |
|---|---|---|
| Health insurance | Health outcomes | Socioeconomic background |
| Military service | Salary | Socioeconomic background |
| Family size | Whether the mother is in the labor force | Socioeconomic background |
| Years of schooling | Salary | Socioeconomic background |
| Smoking | Lung cancer | Socioeconomic background |
Note that in most of these examples, socioeconomic background is a confounder. This is particularly common in economics and econometrics, where most of the methods in this section originated.
Why can't we just adjust for confounding? Such confounders are problematic because, in order to avoid omitted variable bias, we need to have observed them and added them to our regression (collecting such data may not always be feasible, for a number of reasons). Furthermore, there could always be other confounders that we are unaware of, which leaves our causal conclusions under an inescapable cloud of doubt.
Instrumental Variables#
Is there a middle way between a randomized experiment and assuming unconfoundedness, which is sometimes unrealistic?
One way forward is when nature provides us with a “partial” natural experiment, i.e. we have a truly randomized “instrument” that injects an element of partial randomization into the treatment variable of interest. This is the idea of instrumental variables. We will first define the concept mathematically, and then illustrate what it means for a few examples.
Definition: Assume the linear structural model defined above. We further assume that there is a variable \(W\) satisfying \(Z = \alpha' + \gamma W + (\beta')^TX + \delta\), with \(\gamma \neq 0\) (relevance), and with \(W\) independent of \(X\), \(\delta\), and \(\epsilon\) (exogeneity). Such a \(W\) is called an instrumental variable.
Remark: This replaces the earlier equation from before that \(Z = f(X,\delta)\).
Let us now see how to use \(W\) to identify the ATE \(\tau\). We compute
$$
\textrm{Cov}(Y,W) = \textrm{Cov}(\alpha + \tau Z + \beta^TX + \epsilon, W) = \tau\textrm{Cov}(Z,W),
$$
where the second equality follows from the exogeneity of \(W\) (which implies \(\textrm{Cov}(X,W) = 0\) and \(\textrm{Cov}(\epsilon,W) = 0\)). Meanwhile, a similar computation with \(Z\) and \(W\) gives us
$$
\textrm{Cov}(Z,W) = \gamma\textrm{Var}(W).
$$
Putting everything together gives
$$
\tau = \frac{\frac{\textrm{Cov}(Y,W)}{\textrm{Var}(W)}}{\frac{\textrm{Cov}(Z,W)}{\textrm{Var}(W)}}.
$$
In other words, \(\tau\) is the ratio between the (infinite population) regression coefficient of \(W\) on \(Y\), and that of \(W\) on \(Z\).
This motivates the instrumental variable estimator of the ATE in finite samples:
$$
\hat{\tau}_{IV} = \frac{(W^TW)^{-1}W^TY}{(W^TW)^{-1}W^TZ} = \frac{W^TY}{W^TZ},
$$
where again, abusing notation, \(W\), \(Z\) and \(Y\) refer to the vectors of observations. If \(\alpha' = 0\), then this is a plug-in estimator of \(\tau\), and is consistent.
Further interpretation for binary \(W\): When \(W\) is binary, we can show that
$$
\tau = \frac{\mathbb{E}[Y|W=1] - \mathbb{E}[Y|W=0]}{\mathbb{E}[Z|W=1] - \mathbb{E}[Z|W=0]}.
$$
Hence, we can think of IV as measuring the ratio of the prima facie treatment effect of \(W\) on \(Y\) and that of \(W\) on \(Z\).
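Here is a sketch of both forms of the IV estimator on simulated data (the data-generating process below, with a binary instrument and a continuous treatment for simplicity, is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
tau, beta = 2.0, 3.0                            # hypothetical structural parameters

X = rng.normal(size=n)                          # unobserved confounder
W = rng.binomial(1, 0.5, size=n).astype(float)  # binary instrument, independent of X
Z = 0.5 + 0.8 * W + 0.5 * X + rng.normal(size=n)    # treatment: relevant and confounded
Y = 1.0 + tau * Z + beta * X + rng.normal(size=n)

# Ratio of regression coefficients: Cov(Y, W) / Cov(Z, W)
tau_iv = np.cov(Y, W)[0, 1] / np.cov(Z, W)[0, 1]

# Wald form for a binary instrument
wald = (Y[W == 1].mean() - Y[W == 0].mean()) / (Z[W == 1].mean() - Z[W == 0].mean())

print(tau_iv, wald)     # both close to tau = 2.0, even though X was never used
```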
Causal graph for instrumental variables#
The relationships between \(W, Z, X\), and \(Y\) can be represented as the following causal graph:
How to read this graph:
The arrow from \(W\) into \(Z\) shows that \(W\) has a causal effect on \(Z\)
The absence of any arrow into \(W\) means that \(W\) is exogenous, i.e. no variable in the diagram causes \(W\), and in particular \(W\) is independent of \(X\).
The absence of an arrow from \(W\) into \(Y\) means that the only effect of \(W\) on \(Y\) is through \(Z\).
We shaded in \(W\), \(Z\) and \(Y\) because these nodes are observed, but \(X\) is unshaded because it is latent (unobserved).
Note that we do not need to know or even be aware of what \(X\) is in order for instrumental variables to work! It doesn't matter how many confounders there are, or whether we're even able to list all of them: as long as we can guarantee that they have no causal relationship to the instrument (exogeneity), and that the instrument affects the outcome only through the treatment (the exclusion restriction), instrumental variables will work.
Examples of instrumental variables#
Let’s examine what we might use as instrumental variables for the five examples from the table in the previous section. The first four are taken from the econometrics literature:
Example 1: \(Z\) is health insurance, \(Y\) is health outcomes, \(X\) is socioeconomic background. Baicker et al. (2013) used the 2008 expansion of Medicaid in Oregon via lottery. The instrument \(W\) here was the lottery assignment. We previously talked about why this was an imperfect experiment because of compliance reasons (only a fraction of individuals who won the lottery enrolled into Medicaid), so IV provides a way of overcoming this limitation.
Example 2: \(Z\) is military service, \(Y\) is salary, \(X\) is socioeconomic background. Angrist (1990) used the Vietnam era draft lottery as the instrument \(W\), and found that among white veterans, there was a 15% drop in earnings compared to non-veterans.
Example 3: \(Z\) is family size, \(Y\) is mother's employment, \(X\) is socioeconomic background. Angrist and Evans (1998) used sibling sex composition (in other words, the assigned sexes at birth of the first two children) as the IV. This is plausible because sibling sex composition is pseudo-randomized, and it is relevant because parents in the US with two children of the same sex are more likely to have a third child than parents with two children of different sexes.
Example 4: \(Z\) is years of schooling, \(Y\) is salary, \(X\) is socioeconomic background. Card (1993) used geographical variation in college proximity as the instrumental variable.
Example 5: \(Z\) is smoking, \(Y\) is lung cancer, \(X\) is socioeconomic background. Unfortunately, this example does not lend itself well to an instrumental variable: despite decades of searching, nobody has yet found one that is convincing.
As we see in these examples, sometimes you need to be quite ingenious to come up with an appropriate instrumental variable. Joshua Angrist, David Card, and Guido Imbens, who are named in several of these examples, are phenomenally good at this: in fact, they won the Nobel Prize in economics for their collected body of work!
Extensions#
Multiple treatments / instruments, and two-stage least squares#
So far, we have considered scalar treatment and instrumental variables \(Z\) and \(W\). It is also possible to consider vector-valued instruments and treatments. To generalize IV to this setting, we need to recast the IV estimator in the previous sections as follows.
First define the conditional expectation \(\tilde{Z} = \mathbb{E}[Z|W]\), and observe that \(\tilde{Z} = \alpha' + \gamma W\) (using the independence of \(W\) from \(X\) and \(\delta\), and absorbing the constant \((\beta')^T\mathbb{E}[X]\) into \(\alpha'\)).
If we regress \(Y\) on \(\tilde{Z}\), the regression coefficient we obtain is
$$
\begin{align}
\frac{\textrm{Cov}(\tilde{Z},Y)}{\textrm{Var}(\tilde{Z})} & = \frac{\textrm{Cov}(\tilde{Z}, \alpha + \tau Z + \beta^TX + \epsilon)}{\textrm{Var}(\tilde{Z})} \\
& = \frac{\textrm{Cov}(\tilde{Z}, \tau Z)}{\textrm{Var}(\tilde{Z})} \\
& = \tau\frac{\textrm{Cov}(\tilde{Z}, Z)}{\textrm{Var}(\tilde{Z})} \\
& = \tau.
\end{align}
$$
Here, the 2nd equality holds because \(\tilde{Z}\) is a function of \(W\), which is independent of \(X\) and \(\epsilon\), while the 4th equality holds because of a property of conditional expectations (one can also check it by hand by expanding out \(Z = \alpha' + \gamma W + (\beta')^TX + \delta\)).
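For completeness, here is a short verification of that 4th equality, using only the tower property \(\mathbb{E}[\tilde{Z}Z] = \mathbb{E}\big[\mathbb{E}[\tilde{Z}Z|W]\big] = \mathbb{E}[\tilde{Z}^2]\) (valid since \(\tilde{Z}\) is a function of \(W\)) together with \(\mathbb{E}[Z] = \mathbb{E}[\tilde{Z}]\):
$$
\textrm{Cov}(\tilde{Z}, Z) = \mathbb{E}[\tilde{Z}Z] - \mathbb{E}[\tilde{Z}]\mathbb{E}[Z] = \mathbb{E}[\tilde{Z}^2] - \mathbb{E}[\tilde{Z}]^2 = \textrm{Var}(\tilde{Z}),
$$
so the ratio \(\textrm{Cov}(\tilde{Z},Z)/\textrm{Var}(\tilde{Z})\) is exactly 1.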
In finite samples, we thus arrive at the following algorithm:
Two-stage least squares algorithm (2SLS):
Step 1: Regress \(Z\) on \(W\) to get \(\tilde{Z} = W\hat{\gamma} = W(W^TW)^{-1}W^TZ\).
Step 2: Regress \(Y\) on \(\tilde{Z}\) to get \(\hat{\tau}_{2SLS} = (\tilde{Z}^T\tilde{Z})^{-1}\tilde{Z}^TY\).
For the scalar setting, it is easy to see that \(\hat{\tau}_{2SLS} = \hat{\tau}_{IV}\), but the benefit of this formulation is that it directly applies for vector-valued \(Z\) and \(W\).
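Here is a sketch of 2SLS via two calls to least squares, with two treatments and two instruments (the simulated data-generating process is an illustrative assumption; a real analysis would also need IV-appropriate standard errors, e.g. from a package such as `linearmodels`, since \(\tilde{Z}\) is itself estimated):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
tau = np.array([2.0, -1.0])                  # hypothetical effects of two treatments

X = rng.normal(size=n)                       # unobserved confounder
W = rng.normal(size=(n, 2))                  # two instruments, independent of X
Gamma = np.array([[0.8, 0.1],
                  [0.2, 0.9]])               # first-stage coefficients (full rank)
Z = W @ Gamma + np.outer(X, [0.5, 0.7]) + rng.normal(size=(n, 2))
Y = Z @ tau + 3.0 * X + rng.normal(size=n)

ones = np.ones((n, 1))
W1 = np.hstack([ones, W])                    # instrument matrix with intercept

# Step 1: regress each column of Z on the instruments to get Z_tilde
Z_tilde = W1 @ np.linalg.lstsq(W1, Z, rcond=None)[0]

# Step 2: regress Y on Z_tilde (with intercept)
coef = np.linalg.lstsq(np.hstack([ones, Z_tilde]), Y, rcond=None)[0]
print(coef[1:])                              # close to tau = [2.0, -1.0]
```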
(Optional) A non-parametric perspective on instrumental variables#
In this notebook, we have introduced instrumental variables in the context of structural linear models. What if our model is nonlinear?
In an amazing coincidence, for binary treatment \(Z\), the expression
$$
\frac{\mathbb{E}[Y|W=1] - \mathbb{E}[Y|W=0]}{\mathbb{E}[Z|W=1] - \mathbb{E}[Z|W=0]}
$$
has a meaning beyond the linear model setting: under suitable assumptions, it identifies a local average treatment effect. This is the subject of this groundbreaking paper by Angrist and Imbens in 1996.