Instrumental variables#
Review and introduction#
To briefly recap what we have learnt so far:
We defined a non-parametric superpopulation model, i.e. a distribution for \((X,Z,Y(0),Y(1))\):
\(Z\) is the (binary) treatment decision,
\(Y(0)\) and \(Y(1)\) are the potential outcomes in the universes where the unit wasn’t/was treated,
\(X\) is a confounding variable (in other words, it has a causal effect on \(Z\) and on \(Y\))
So far, we haven’t needed to make any assumptions about the distribution in general (only that it exists)
We defined a quantity of interest, the average treatment effect (ATE): \(\tau = \mathbb{E}[Y(1) - Y(0)]\), which is impossible to estimate unless we make further assumptions
We saw that in a randomized experiment, we have the following:
The treatment decisions are random, and independent of the potential outcomes
In other words, \(\big(Y(0),Y(1)\big)\perp\!\!\perp Z\)
In this section, we’ll investigate how we can estimate the ATE in situations where we have unknown confounding variables. We’ll rely on natural experiments to help us. Note that you’ve probably seen natural experiments before in Data 8 when learning about John Snow’s study of cholera.
Linear structural model (LSM)#
In some fields (such as economics), it is typical to work with structural models, which place some restrictions on the joint distribution of all the variables, but also make it easier to estimate the parameters of the model.
Definition: We will work with the linear structural model
$$
Y = \alpha + \tau Z + \beta^T X + \epsilon,
$$
where \(\epsilon\) has mean zero, and is independent of \(Z\) and \(X\) (in economics, we say that \(\epsilon\) is exogenous). We sometimes further assume that \(\epsilon \sim \mathcal{N}(0,\sigma^2)\), but this is not necessary for any of the analysis we’re going to do. \(Y\) is still our outcome variable, \(Z\) is our treatment variable of interest (typically binary), and \(X\) is a confounding variable (or variables).
Note: We usually add the further structural equation \(Z = f(X, \delta)\) where \(\delta\) is an exogenous noise variable, and \(f\) encodes the structural relationship between \(X\) and \(Z\), but all you need to know for now is that we allow for \(\textrm{Cov}(Z,X) \neq 0\).
This is not the same as the linear model that we have seen when we learned about GLMs, and that you’ve seen in previous classes. In those settings, we were assuming a model about the joint distribution of the variables, which is a statement about associations and predictions.
On the other hand, a linear structural model encodes an assumption about intervention. It assumes that for unit \(i\), if we could set \(Z_i = 1\), we would observe \(Y_i(1) = \alpha + \tau + \beta^TX_i + \epsilon_i\), and if we could set \(Z_i = 0\), we would observe \(Y_i(0) = \alpha + \beta^TX_i + \epsilon_i\). If \(Z\) is not binary, then there will be a potential outcome for each possible value of \(Z\). This is a subtle but important point, which also situates the linear structural model as a special case of the potential outcomes framework.
From this, we see that the average treatment effect in this model is \(\tau\) (can you show this is true?), and furthermore, that the individual treatment effect for every unit is
$$
Y_i(1) - Y_i(0) = \tau.
$$
In other words, the linear structural model is making an implicit assumption that the treatment effect is constant.
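To make this concrete, here is a minimal simulation sketch of the linear structural model (the parameter values and the particular choice of \(f\) below are illustrative assumptions, not anything prescribed by the model):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
alpha, tau, beta = 1.0, 2.0, 3.0          # hypothetical structural parameters

X = rng.normal(size=n)                    # confounder
delta = rng.normal(size=n)                # exogenous noise in the treatment equation
Z = (X + delta > 0).astype(float)         # one possible choice of Z = f(X, delta),
                                          # so that Cov(Z, X) != 0
eps = rng.normal(size=n)                  # mean-zero noise, independent of X and Z
Y = alpha + tau * Z + beta * X + eps      # observed outcome

# The structural equation lets us compute *both* potential outcomes for every
# unit by setting Z ourselves (an intervention), instead of letting f generate it:
Y0 = alpha + tau * 0 + beta * X + eps
Y1 = alpha + tau * 1 + beta * X + eps
print(np.allclose(Y1 - Y0, tau))          # True: every individual effect is tau
print((Y1 - Y0).mean())                   # so the ATE is tau = 2.0
```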
Causal graphs and LSMs#
Apart from the causal effect of \(Z\) on \(Y\), the linear structural model also does something new: it asserts the causal relationships between the other variables, i.e. it tells us how \(Z\) and \(Y\) change if we manipulate \(X\). The above linear structural model can be represented graphically as follows:
As a reminder, the arrows from \(X\) into \(Z\) and \(Y\) assert that \(X\) causes both \(Z\) and \(Y\) (i.e. intervening on \(X\) changes the values of \(Z\) and \(Y\)), and the arrow from \(Z\) into \(Y\) asserts that \(Z\) causes \(Y\).
Confounding and omitted variable bias#
In many scenarios, confounding is complicated and involves many different variables, and it may be impossible to describe all of them. In that case (i.e., if we don't observe \(X\)), we are in trouble because of confounding. Since we're assuming a linear model, let's see what happens when we try to fit a linear regression of \(Y\) on \(Z\) alone. Let \(\tau_{OLS}\) be the solution of the least squares problem \(\min_{\tau,\alpha} \mathbb{E}[(\alpha + \tau Z - Y)^2]\). We then get
$$
\begin{align}
\tau_{OLS} & = \frac{\text{Cov}(Y,Z)}{\text{Var}(Z)} \\
& = \frac{\text{Cov}(\alpha + \tau Z + \beta^TX + \epsilon,Z)}{\text{Var}(Z)} \\
& = \frac{\text{Cov}(\tau Z,Z)}{\text{Var}(Z)} + \frac{\text{Cov}(\beta^TX,Z)}{\text{Var}(Z)} \\
& = \tau + \beta^T\frac{\text{Cov}(X,Z)}{\text{Var}(Z)},
\end{align}
$$
where the third equality uses the fact that \(\epsilon\) is independent of \(Z\).
The first term is the true ATE. The second term, \(\beta^T\frac{\text{Cov}(X,Z)}{\text{Var}(Z)}\), is the bias in the \(\tau_{OLS}\) estimator, which we call the omitted variable bias.
Remark: \(\frac{\text{Cov}(Y,Z)}{\text{Var}(Z)}\) is the infinite population version of the typical formula \(\hat{\tau}_{OLS} = (Z^TZ)^{-1}Z^TY\), where we now use \(Z\) and \(Y\) to denote matrices/vectors.
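To see the omitted variable bias numerically, here is a sketch in the same simulated setup as above (parameter values again illustrative): the OLS slope of \(Y\) on \(Z\) alone lands on \(\tau + \beta^T\frac{\text{Cov}(X,Z)}{\text{Var}(Z)}\), not on \(\tau\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
alpha, tau, beta = 1.0, 2.0, 3.0                   # hypothetical structural parameters

X = rng.normal(size=n)                             # confounder (imagine it unobserved)
Z = (X + rng.normal(size=n) > 0).astype(float)     # treatment depends on X
Y = alpha + tau * Z + beta * X + rng.normal(size=n)

# OLS slope of Y on Z (with intercept): Cov(Y, Z) / Var(Z)
tau_ols = np.cov(Y, Z)[0, 1] / np.var(Z, ddof=1)

# The omitted variable bias formula predicts the same (wrong) answer
predicted = tau + beta * np.cov(X, Z)[0, 1] / np.var(Z, ddof=1)

print(tau_ols, predicted)   # both around 5.4 here, far from the true tau = 2.0
```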
Here are some examples:
| Treatment \(Z\) | Outcome \(Y\) | Possible confounder(s) \(X\) |
|---|---|---|
| Health insurance | Health outcomes | Socioeconomic background |
| Military service | Salary | Socioeconomic background |
| Family size | Whether the mother is in the labor force | Socioeconomic background |
| Years of schooling | Salary | Socioeconomic background |
| Smoking | Lung cancer | Socioeconomic background |
Note that in most of these examples, socioeconomic background is a confounder. This is particularly common in economics and econometrics, where most of the methods in this section originated.
Why can't we just adjust for confounding? Such confounders are problematic because, in order to avoid omitted variable bias, we need to have observed them and added them to our regression (collecting such data may not always be feasible, for a number of reasons). Furthermore, there could always be other confounders that we are unaware of, which leaves our causal conclusions under an inescapable cloud of doubt.
Instrumental Variables#
Is there a middle way between a randomized experiment and assuming unconfoundedness, which is sometimes unrealistic?
One way forward is when nature provides us with a “partial” natural experiment, i.e. we have a truly randomized “instrument” that injects an element of partial randomization into the treatment variable of interest. This is the idea of instrumental variables. We will first define the concept mathematically, and then illustrate what it means for a few examples.
Definition: Assume the linear structural model defined above. We further assume that there is a variable \(W\) satisfying \(Z = \alpha' + \gamma W + (\beta')^TX + \delta\), with \(\gamma \neq 0\) (relevance), and with \(W\) independent of \(X\), \(\delta\), and \(\epsilon\) (exogeneity). Such a \(W\) is called an instrumental variable.
Remark: This replaces the earlier equation from before that \(Z = f(X,\delta)\).
Let us now see how to use \(W\) to identify the ATE \(\tau\). We compute
$$
\textrm{Cov}(Y,W) = \textrm{Cov}(\alpha + \tau Z + \beta^TX + \epsilon, W) = \tau\textrm{Cov}(Z,W),
$$
where the second equality follows from the exogeneity of \(W\) (which implies \(\textrm{Cov}(X,W) = 0\) and \(\textrm{Cov}(\epsilon,W) = 0\)). Meanwhile, a similar computation with \(Z\) and \(W\) gives us
$$
\textrm{Cov}(Z,W) = \gamma\textrm{Var}(W).
$$
Putting everything together gives
$$
\tau = \frac{\frac{\textrm{Cov}(Y,W)}{\textrm{Var}(W)}}{\frac{\textrm{Cov}(Z,W)}{\textrm{Var}(W)}}.
$$
In other words, \(\tau\) is the ratio between the (infinite population) regression coefficient of \(W\) on \(Y\), and that of \(W\) on \(Z\).
This motivates the instrumental variable estimator of the ATE in finite samples:
$$
\hat{\tau}_{IV} = \frac{(W^TW)^{-1}W^TY}{(W^TW)^{-1}W^TZ} = \frac{W^TY}{W^TZ},
$$
where again, abusing notation, \(W\), \(Z\) and \(Y\) refer to the vectors of observations. If \(\alpha' = 0\), then this is a plug-in estimator of \(\tau\), and is consistent.
Further interpretation for binary \(W\): When \(W\) is binary, we can show that
$$
\tau = \frac{\mathbb{E}[Y|W=1] - \mathbb{E}[Y|W=0]}{\mathbb{E}[Z|W=1] - \mathbb{E}[Z|W=0]}.
$$
Hence, we can think of IV as measuring the ratio of the prima facie treatment effect of \(W\) on \(Y\) and that of \(W\) on \(Z\).
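Here is a sketch of both forms of the IV estimator on simulated data (the data-generating process below, with a binary instrument and a continuous treatment for simplicity, is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
tau, beta = 2.0, 3.0                            # hypothetical structural parameters

X = rng.normal(size=n)                          # unobserved confounder
W = rng.binomial(1, 0.5, size=n).astype(float)  # binary instrument, independent of X
Z = 0.5 + 0.8 * W + 0.5 * X + rng.normal(size=n)    # treatment: relevant and confounded
Y = 1.0 + tau * Z + beta * X + rng.normal(size=n)

# Ratio of regression coefficients: Cov(Y, W) / Cov(Z, W)
tau_iv = np.cov(Y, W)[0, 1] / np.cov(Z, W)[0, 1]

# Wald form for a binary instrument
wald = (Y[W == 1].mean() - Y[W == 0].mean()) / (Z[W == 1].mean() - Z[W == 0].mean())

print(tau_iv, wald)     # both close to tau = 2.0, even though X was never used
```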
Causal graph for instrumental variables#
The relationships between \(W, Z, X\), and \(Y\) can be represented as the following causal graph:
How to read this graph:
The arrow from \(W\) into \(Z\) shows that \(W\) has a causal effect on \(Z\)
The absence of any arrow into \(W\) means that \(W\) is exogenous, i.e. no variable in the diagram causes \(W\), and in particular \(W\) is independent of \(X\).
The absence of an arrow from \(W\) into \(Y\) means that the only effect of \(W\) on \(Y\) is through \(Z\).
We shaded in \(W\), \(Z\) and \(Y\) because these nodes are observed, but \(X\) is unshaded because it is latent (unobserved).
Note that we do not need to know or even be aware of what \(X\) is in order for instrumental variables to work! It doesn't matter how many confounders there are, or whether we're even able to list all of them: as long as we can guarantee that they have no causal relationship to the instrument (exogeneity), and that the instrument affects the outcome only through the treatment (the exclusion restriction), instrumental variables will work.
Examples of instrumental variables#
Let’s examine what we might use as instrumental variables for the five examples from the table in the previous section. The first four are taken from the econometrics literature:
Example 1: \(Z\) is health insurance, \(Y\) is health outcomes, \(X\) is socioeconomic background. Baicker et al. (2013) used the 2008 expansion of Medicaid in Oregon via lottery. The instrument \(W\) here was the lottery assignment. We previously talked about why this was an imperfect experiment because of compliance reasons (only a fraction of individuals who won the lottery enrolled into Medicaid), so IV provides a way of overcoming this limitation.
Example 2: \(Z\) is military service, \(Y\) is salary, \(X\) is socioeconomic background. Angrist (1990) used the Vietnam era draft lottery as the instrument \(W\), and found that among white veterans, there was a 15% drop in earnings compared to non-veterans.
Example 3: \(Z\) is family size, \(Y\) is mother's employment, \(X\) is socioeconomic background. Angrist and Evans (1998) used sibling sex composition (in other words, the assigned sexes at birth of the first two children) as the IV. This is plausible because sibling sex composition is pseudo-randomized, and it is relevant because parents in the US with two children of the same sex are more likely to have a third child than parents with two children of different sexes.
Example 4: \(Z\) is years of schooling, \(Y\) is salary, \(X\) is socioeconomic background. Card (1993) used geographical variation in college proximity as the instrumental variable.
Example 5: \(Z\) is smoking, \(Y\) is lung cancer, \(X\) is socioeconomic background. Unfortunately, this example does not lend itself well to an instrumental variable: despite decades of searching, nobody has yet found one that is convincing.
As we see in these examples, sometimes you need to be quite ingenious to come up with an appropriate instrumental variable. Joshua Angrist, David Card, and Guido Imbens, who are named in several of these examples, are phenomenally good at this: in fact, they won the Nobel Prize in economics for their collected body of work!
Extensions#
Multiple treatments / instruments, and two-stage least squares#
So far, we have considered scalar treatment and instrumental variables \(Z\) and \(W\). It is also possible to consider vector-valued instruments and treatments. To generalize IV to this setting, we need to recast the IV estimator in the previous sections as follows.
First define the conditional expectation \(\tilde{Z} = \mathbb{E}[Z|W]\), and observe that \(\tilde{Z} = \alpha' + \gamma W\) (using the independence of \(W\) from \(X\) and \(\delta\), and absorbing the constant \((\beta')^T\mathbb{E}[X]\) into \(\alpha'\)).
If we regress \(Y\) on \(\tilde{Z}\), the regression coefficient we obtain is
$$
\begin{align}
\frac{\textrm{Cov}(\tilde{Z},Y)}{\textrm{Var}(\tilde{Z})} & = \frac{\textrm{Cov}(\tilde{Z}, \alpha + \tau Z + \beta^TX + \epsilon)}{\textrm{Var}(\tilde{Z})} \\
& = \frac{\textrm{Cov}(\tilde{Z}, \tau Z)}{\textrm{Var}(\tilde{Z})} \\
& = \tau\frac{\textrm{Cov}(\tilde{Z}, Z)}{\textrm{Var}(\tilde{Z})} \\
& = \tau.
\end{align}
$$
Here, the 2nd equality holds because \(\tilde{Z}\) is a function of \(W\), which is independent of \(X\) and \(\epsilon\), while the 4th equality holds because of a property of conditional expectations (one can also check it by hand by expanding out \(Z = \alpha' + \gamma W + (\beta')^TX + \delta\)).
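For completeness, here is a short verification of that 4th equality, using only the tower property \(\mathbb{E}[\tilde{Z}Z] = \mathbb{E}\big[\mathbb{E}[\tilde{Z}Z|W]\big] = \mathbb{E}[\tilde{Z}^2]\) (valid since \(\tilde{Z}\) is a function of \(W\)) together with \(\mathbb{E}[Z] = \mathbb{E}[\tilde{Z}]\):
$$
\textrm{Cov}(\tilde{Z}, Z) = \mathbb{E}[\tilde{Z}Z] - \mathbb{E}[\tilde{Z}]\mathbb{E}[Z] = \mathbb{E}[\tilde{Z}^2] - \mathbb{E}[\tilde{Z}]^2 = \textrm{Var}(\tilde{Z}),
$$
so the ratio \(\textrm{Cov}(\tilde{Z},Z)/\textrm{Var}(\tilde{Z})\) is exactly 1.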
In finite samples, we thus arrive at the following algorithm:
Two-stage least squares algorithm (2SLS):
Step 1: Regress \(Z\) on \(W\) to get \(\tilde{Z} = W\hat{\gamma} = W(W^TW)^{-1}W^TZ\).
Step 2: Regress \(Y\) on \(\tilde{Z}\) to get \(\hat{\tau}_{2SLS} = (\tilde{Z}^T\tilde{Z})^{-1}\tilde{Z}^TY\).
For the scalar setting, it is easy to see that \(\hat{\tau}_{2SLS} = \hat{\tau}_{IV}\), but the benefit of this formulation is that it directly applies for vector-valued \(Z\) and \(W\).
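Here is a sketch of 2SLS via two calls to least squares, with two treatments and two instruments (the simulated data-generating process is an illustrative assumption; a real analysis would also need IV-appropriate standard errors, e.g. from a package such as `linearmodels`, since \(\tilde{Z}\) is itself estimated):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
tau = np.array([2.0, -1.0])                  # hypothetical effects of two treatments

X = rng.normal(size=n)                       # unobserved confounder
W = rng.normal(size=(n, 2))                  # two instruments, independent of X
Gamma = np.array([[0.8, 0.1],
                  [0.2, 0.9]])               # first-stage coefficients (full rank)
Z = W @ Gamma + np.outer(X, [0.5, 0.7]) + rng.normal(size=(n, 2))
Y = Z @ tau + 3.0 * X + rng.normal(size=n)

ones = np.ones((n, 1))
W1 = np.hstack([ones, W])                    # instrument matrix with intercept

# Step 1: regress each column of Z on the instruments to get Z_tilde
Z_tilde = W1 @ np.linalg.lstsq(W1, Z, rcond=None)[0]

# Step 2: regress Y on Z_tilde (with intercept)
coef = np.linalg.lstsq(np.hstack([ones, Z_tilde]), Y, rcond=None)[0]
print(coef[1:])                              # close to tau = [2.0, -1.0]
```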
(Optional) A non-parametric perspective on instrumental variables#
In this notebook, we have introduced instrumental variables in the context of structural linear models. What if our model is nonlinear?
In an amazing coincidence, for binary treatment \(Z\), the expression
$$
\frac{\mathbb{E}[Y|W=1] - \mathbb{E}[Y|W=0]}{\mathbb{E}[Z|W=1] - \mathbb{E}[Z|W=0]}
$$
has a meaning beyond the linear model setting: under suitable assumptions, it identifies a local average treatment effect. This is the subject of this groundbreaking paper by Angrist and Imbens in 1996.