The Effect: An Introduction to Research Design and Causality

Nick Huntington-Klein

Chapter 21 - Partial Identification

A drawing of a 'leaner' in the game horseshoes.

21.1 How Does It Work?

21.1.1 Strength Through Relaxation

Causal identification comes from a theoretical model of the world. We write down what variables we think are important and how they affect each other, and then the necessary steps to isolate a causal relationship come out of that model. However, the process of writing this model requires that we make a bunch of assumptions: \(A\) affects \(B\) but \(B\) doesn’t affect \(A\), \(C\) doesn’t affect \(A\) or \(B\), and \(D\) doesn’t even belong on the causal diagram. (The process of estimating our effect once we have our model and research design also requires a bunch of assumptions, like “the error term is normally distributed” or “the relationship between these two variables is linear.” That’s not really what this chapter is about, but a lot of the methods here can be applied to statistical assumptions just like we apply them to causal assumptions.)

But if you’ve actually taken the time to sit down and draw a causal diagram (and I hope by this point in the book that you have), you realize how hard it is to make those assumptions. Saying that \(C\) doesn’t affect \(A\), for example, is equivalent to saying that \(C\) has an effect on \(A\) of 0. Wow, exactly zero? Not just a small effect or an unimportant effect or less-than-.00001, but zero. That seems like a strong assumption.

Strong and weak assumptions. A strong assumption is an assumption that we may make out of necessity but seems implausible. A weak assumption is an assumption we make that may not be proven but seems likely to be true.

Figure 21.1: A Basic Back Door

A causal diagram in which C causes both A and B, and A causes B

But it’s an assumption we have to make! Let’s say we are looking at Figure 21.1. If we haven’t measured \(C\) and so can’t control for it, then the only way we can identify the effect of \(A\) on \(B\) is to assume that those arrows from \(C\) to \(A\) and/or \(B\) have an effect of zero.

Or do we have to make that assumption? Let’s consider something analogous, a basic algebra problem:

\[\begin{equation} \tag{21.1} x + y = 10 \end{equation}\]

If you want to know the value of \(x\), then I need to tell you the value of \(y\), right? If \(y = 6\), then \(x = 4\), and so on. Before we say what \(y\) is, \(x\) could be anything. But saying \(y = 6\) narrows down the possible range of values for \(x\) to a single point. Saying \(y = 6\) point identifies \(x\) to be 4. (Mathematicians also call this identification, by the way.) But what if we don’t know what \(y\) is? We’re hopeless, right?

Not entirely. Maybe we can’t say what \(y\) is exactly, but maybe we know enough to be able to say something like “\(y\) is between 4 and 7.” In that case, we won’t know exactly what \(x\) is, but we can confidently claim that \(x\) is between 3 and 6. “\(y\) is between 4 and 7” is a weaker assumption than “\(y = 6\)” and so the results it gets us aren’t quite as conclusive, but we do get results.
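That interval logic is simple enough to sketch in R, just to make it concrete (plain arithmetic, no packages):

# If x + y = 10, then x = 10 - y, so the largest y gives the
# smallest x and vice versa
y_bounds <- c(4, 7)
x_bounds <- rev(10 - y_bounds)
x_bounds
# [1] 3 6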

Partial identification, also known as set identification, is the process of doing this same thing in a causal identification context. Instead of figuring out what set of possibly strong assumptions need to be true in order for us to pin down an exact answer (which will usually back us into a corner of having to assume something we aren’t really sure about), we relax those strong assumptions and instead try to see what weaker assumptions we can make that we actually believe in. Then, we see what claims we can make given those assumptions. (“Partial identification,” like many terms in statistics and causal inference, is a concept with different names across fields: “sensitivity analysis,” which I’ll discuss more below, and “bounding” are closely-related concepts, and many of the methods later in this chapter would fall under those labels, although they’re not exactly the same. It’s also a name that refers to different concepts across fields; for example, epidemiology uses “partial identification” to refer to whenever you’ve identified a treatment effect average other than the average treatment effect. Terminology is the worst. In this book, though, I am using the “partial identification” term to refer to the concept I’m describing here.)

Research done with more-credible assumptions will itself be more trustworthy and credible, even if the results won’t come with quite as much certitude (almost as though it’s honestly reflecting the uncertainty we have about the world!) (Manski 2020). Sometimes, point identification is too much to plausibly ask for. So we ask what we do know, and what we can learn from it.

Point identification. A causal analysis that includes enough assumptions such that there is only a single answer (like how assuming \(y = 6\) and \(x+y = 10\) requires that we conclude \(x = 4\)) is point identified.

So what about that identification problem from Figure 21.1? Perhaps we can’t confidently say that \(C\) has a zero effect on both \(A\) and \(B\). But maybe we are confident that the effects of \(C\) on \(A\) and \(B\) are each no larger than .1, for example. Using one of our partial identification methods, we might end up saying something like “we don’t know exactly what the effect of \(A\) on \(B\) is, but if the effects of \(C\) aren’t any bigger than .1 each, then the effect of \(A\) on \(B\) is somewhere between .03 and .15,” to make up some numbers.

21.1.2 Establishing Clear Boundaries

So how can we convert weak assumptions into ranges of possible answers? There are quite a few ways. There’s no single method for partial identification, but we can start with a demonstration. (Probably the most widely-applicable use case for partial identification concerns unobserved back-door variables, as in the “How Is It Performed?” section, rather than sample selection as we have here. But this case is way easier to show the actual mathematical steps for.) Specifically, let’s demonstrate how partial identification can help us figure out a thorny issue of sample selection that we typically wouldn’t be able to do much about. (See Tamer (2010), section 3.1.3, for a more general version of this example.)

Sample selection occurs when some of the relevant variables on your causal diagram cause you to be included or excluded from the sample. For example, let’s say that we have a nicely randomized treatment, where we randomly give 5,000 parents $1,000 and ask them nicely to keep their kids in school, and we give another 5,000 parents $0 and ask them nicely to keep their kids in school. Then we check in a month later to see if their kids are attending school. However, parents who get nothing may be more likely to quit the study and drop out of the sample, and parents capable of keeping their kids in school may also be more capable of following up with the study. This gives us a diagram like Figure 21.2.

Figure 21.2: Sample Selection

A causal diagram in which Given Money and Kids in School both cause In Sample, and Given Money causes Kids in School

This is a problem because we can only estimate our results based on the people in the sample! This means we are controlling for “in sample” and have collider bias. If we didn’t have to worry about people being in the sample, since the treatment is randomized we could get the effect by just comparing means:

\[\begin{equation} \tag{21.2} \text{Effect of Treatment} = \text{Share in School conditional on being Treated} - \text{Share in School conditional on being Untreated} \end{equation}\]

But we don’t observe either of those values, since we don’t observe all the treated and untreated people. We just see the ones in the sample! But let’s think about what we can see, using some hypothetical numbers.

  • We know that 5,000 people started out treated, and in our final sample we have 4,500 of them.
  • We know that 5,000 people started out untreated, and in our final sample we have 4,000 of them.
  • Among the treated people still in the sample, 80% of their kids are in school.
  • Among the untreated people still in the sample, 60% of their kids are in school.

We can still say something about the results using just these actual observations. To do this, we notice that the actual value is just an average of the observed and unobserved people:

\[\begin{equation} \tag{21.3} \begin{aligned} \text{Share in School conditional on being Treated} = & (Prob(\text{Observed conditional on Treated}) \\ & \times\text{Share in School conditional on Treated and Observed}) \\ & + ((1-Prob(\text{Observed conditional on Treated})) \\ & \times\text{Share in School conditional on Treated and Unobserved}) \end{aligned} \end{equation}\]

We have three numbers here, and conveniently we know two of them! (Note that the \(1 - Prob(\text{Observed conditional on Treated})\) in there is the probability of not being observed, conditional on being treated. 1 minus the probability of a thing is the probability of that thing not happening.) Since we started with 5,000 treated people and still have 4,500 of them, the probability of being observed conditional on being treated is .9. And we know that 80% of kids are in school for the treated group that’s observed.

So we get:

\[\begin{equation} \tag{21.4} \text{Share of Kids in School conditional on being Treated} = .9\times .8 + .1\times\text{Share in School conditional on Treated and Unobserved} \end{equation}\]

Only one piece of the puzzle left! We don’t know the share of kids in school among those who were treated but not observed. But we do know it’s a proportion, so it’s got to be between 0 and 1 by definition. At the lowest it’s 0, and at the highest it’s 1. We get:

\[\begin{equation} \tag{21.5} \begin{aligned} \text{Share in School cond. on being Treated} &\geq .9\times .8 + .1\times 0 = .72 \\ \text{Share in School cond. on being Treated} &\leq .9\times .8 + .1\times 1 = .82 \end{aligned} \end{equation}\]

The share for the treated group must be between .72 and .82, and we didn’t have to make any additional assumptions about our causal diagram or our data to figure it out! We can do the same calculation on the untreated group (try it yourself - it’s not hard!) and find that the share for the untreated group must be between .48 and .68.

So what does this mean for the effect of treatment? The effect is still the difference between the two shares. The biggest it can be is when the treated-group share is at its maximum of .82 and the untreated-group share is at its minimum of .48, for a difference of \(.82-.48 = .34\). The smallest it can be comes from taking the other end of both ranges, for a difference of \(.72 - .68 = .04\).

The effect of that $1,000 check must then be somewhere between a 4 percentage-point increase in the share of kids in school and a 34 percentage-point increase. We can call this the identified set. Values outside that range would be inconsistent with our data, so we can reject them. But any value in that range could be consistent with our data, so we can’t really distinguish between them without making further assumptions. We’ve unsnarled a collider-bias problem, and without having to have data on any of our missing people! Seems impressive to me.
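If you’d like to check the arithmetic (including the untreated-group calculation I left to you), here’s a minimal sketch in R that reproduces the bounds from the four observed numbers. No packages required:

p_obs_t <- 4500/5000   # Prob(observed conditional on treated) = .9
p_obs_u <- 4000/5000   # Prob(observed conditional on untreated) = .8
share_t <- .8          # Share in school among treated and observed
share_u <- .6          # Share in school among untreated and observed

# The unobserved shares are proportions, so they must be between 0 and 1,
# which bounds each group's overall share in school
treated_bounds   <- p_obs_t*share_t + (1 - p_obs_t)*c(0, 1)   # .72 to .82
untreated_bounds <- p_obs_u*share_u + (1 - p_obs_u)*c(0, 1)   # .48 to .68

# The effect is the difference in shares. Its smallest possible value
# pairs the lowest treated share with the highest untreated share,
# and its largest does the reverse
c(treated_bounds[1] - untreated_bounds[2],
  treated_bounds[2] - untreated_bounds[1])
# [1] 0.04 0.34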

Identified set. The range of values that our combination of assumptions and data can support.

Sure, those are still pretty wide bounds. But we could narrow them further by making additional assumptions like “the share of kids in school is higher in the observed group than the unobserved group” (which would narrow the bounds to between .12 and .32), “the percentage-point difference in share of kids in school between observed and unobserved people is the same in the treated and untreated conditions” (bounds between .2 and .26), or “there is no difference on average between the observed and unobserved groups” (this point identifies the effect to be exactly .2). Do you buy any of these additional assumptions? The first one doesn’t seem too strong, and maybe even the second one is okay, but also I’m not sure!
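As a check on the first of those assumptions, here’s a quick sketch in R continuing the arithmetic from above:

# If the unobserved share can't be higher than the observed share,
# the upper end of each group's range shrinks from 1 down to the
# observed share itself
treated_bounds   <- .9*.8 + .1*c(0, .8)   # .72 to .80
untreated_bounds <- .8*.6 + .2*c(0, .6)   # .48 to .60
c(treated_bounds[1] - untreated_bounds[2],
  treated_bounds[2] - untreated_bounds[1])
# [1] 0.12 0.32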

We could add some of these assumptions to narrow our bounds, but the bonus here is that we already have an answer of some sort without having to make them, and we can decide for ourselves on the basis of whether the assumption is plausible enough to justify adding it, instead of thinking “dang, I have to make that assumption to get an answer at all, guess I better make it.” If we really wanted point identification rather than partial identification, we’d have to pick an assumption as strong as “there is no difference between the observed and unobserved groups” to get there. And do we really believe that? I don’t!

21.2 How Is It Performed?

21.2.1 Being Sensitive

Partial identification is an approach more than it is a particular method. As such, you could take pretty much any of the methods covered in this book and make a partial-identification version of it, allowing you to relax some or all of the assumptions that go into the method. Obviously, I can’t actually do that here for everything, but we’ll cover a few things in the “How the Pros Do It” section below.

In this section I will cover two applications of partial identification that I think are very broadly useful for causal inference in cases where you’re not applying some fancy quasi-experimental research design, but just using some covariates to close back doors and isolate a causal effect. The two methods in this section are about how you can use partial identification to get an answer even if you have back doors you can’t close because you can’t measure all the variables you need to control for. In the first subsection we’ll cover how to do this when you’re using regression (by using what we know about omitted variable bias and how big it can be), and in the second, how to do this when you’re using propensity score matching (by using what we know about imperfect matches and how much bias they can introduce).

These approaches fall under the banner of “sensitivity analysis,” which sounds to me like a generic way of describing partial identification as a whole, but in practice often refers to one of these methods. There’s more to partial identification than just sensitivity analysis: in the “How the Pros Do It” section we’ll cover some approaches that are specific to certain research designs like difference-in-differences. However, I think that sensitivity analysis approaches are broadly useful, and you should consider giving them a go nearly every time you’re trying to identify an effect solely by controlling for a set of covariates (or doing things that are basically just that, like fixed effects). I think these ideas are quite powerful! As discussed in Chapter 11, it’s a pretty strong claim to say that you’ve modeled the data generating process accurately enough that you can identify an effect just using control variables. Lots of people instinctively reject that this is even possible outside of the simplest cases. I’m sympathetic to that view.

Suspicion of attempts to identify causal effects using only controls is one reason why designs like regression discontinuity or instrumental variables, which use random-ish variation to mimic experiments, are so popular. But they can only be done in very special and lucky circumstances! Using sensitivity analysis to be able to produce results even without a quasi-experiment, in a way that’s more plausible than just saying “yep, I’m pretty sure I controlled for everything important,” means that we can study all sorts of settings where there is no useful real-world experiment-like shock to the system.

These kinds of approaches go way back, and were highly useful in one of the earliest high-profile cases of causal inference using observational data: smoking. Back in the 1950s, there was a major scientific attempt to establish, to a level of certainty that could be used to justify government policy, that smoking tobacco causes lung cancer. However, it would be unethical to randomly assign people to smoke and then check whether it gave them cancer. This meant scientists were stuck trying to prove the causal relationship using observational data, as we do in this book. Even after adjusting for variables on obvious back doors (income, demographics, and so on), cancer rates were still far higher among smokers, as you’d expect. But cigarette companies argued (correctly!) that the effect still was not identified because some background factors were left out. So Cornfield et al. (1959) used methods much like those in this section (see also Rosenbaum 2010). They admitted that there were still open back doors in the analysis. But they showed that those open back doors would need to have enormous impact in order to explain away the apparent effect of smoking on lung cancer. We don’t know what the effect is exactly, but using weak assumptions we can at least tell it’s quite far above zero!

Estimating a causal effect using only a set of covariates worked for smoking. In my opinion, we should be much more comfortable with covariates-only approaches to estimating causal effects… as long as we can account for the very high likelihood that we’ve left something out. Partial identification can do that.

21.2.2 Resolving Back Doors with Partial Identification in Regression

Remember all the way back in Chapter 13 when we talked about omitted variable bias? Omitted variable bias happens when you run a regression without controlling for the proper set of variables to close all back door paths.

Thankfully, we have a very good picture of how bad our bias will be if we have the proper model but fail to control for something important. Let’s say we have a data generating process:

\[\begin{equation} \tag{21.6} X = \gamma_0 + \gamma_1Z + \nu \end{equation}\]
\[\begin{equation} \tag{21.7} Y = \beta_0 + \beta_1X + \beta_2Z + \varepsilon \end{equation}\]

where \(X\) and \(\varepsilon\) are unrelated. If we just regress \(Y\) on \(X\), then the estimate we’ll get for \(\beta_1\) will be biased, since \(Z\) is clearly on a back door path that we didn’t close. The exact value of the bias in expectation will be

\[\begin{equation} \tag{21.8} \frac{cov(X, \beta_2Z + \varepsilon)}{var(X)} \rightarrow \frac{\beta_2cov(X,Z)}{var(X)} \end{equation}\]

where \(cov()\) is the covariance, and with the \(\varepsilon\) part dropping out in that second step because we assumed that \(X\) and \(\varepsilon\) were unrelated. If you prefer correlation to covariance, we can rewrite this as \(\beta_2 corr(X, Z)\times sd(Z)/sd(X)\). So, our bias is bigger the more strongly \(X\) and \(Z\) are related (the bigger \(\gamma_1\) is in absolute value and thus the bigger \(cov(X, Z)\) is), bigger the more strongly \(Z\) affects \(Y\) (the bigger \(\beta_2\) is), and smaller the more variance there is in \(X\) overall. And, heck, we can measure \(var(X)\). That leaves us only with \(\beta_2\) and \(cov(X, Z)\)!

Since those are the only unknowns, if we can make some reasonable assumptions about how strong those relationships are, that will give us a reasonable range for the bias, which will give us a reasonable range for the estimate. Things might get more complex if there’s more than one \(X\) or more than one \(Z\), but that’s the basic idea.
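To see the formula at work, here’s a small simulation sketch in R. The coefficient values are made up for illustration:

set.seed(1234)
n <- 1e5

# Data generating process: Z sits on a back door between X and Y
z <- rnorm(n)
x <- 1 + 0.5*z + rnorm(n)        # gamma_1 = 0.5
y <- 2 + 3*x + 2*z + rnorm(n)    # beta_1 = 3, beta_2 = 2

# Regressing Y on X alone gives a biased estimate of beta_1 = 3
biased <- unname(coef(lm(y ~ x))['x'])

# The omitted variable bias formula predicts that bias
predicted <- 3 + 2*cov(x, z)/var(x)
c(estimate = biased, formula = predicted)
# Both are roughly 3.8: a bias of 2*.5/1.25 = .8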

This idea for bounding regression estimates goes back a fair ways, appearing perhaps most prominently in Imbens (2003) and Altonji, Elder, and Taber (2005), who do partial identification by trying to pin down what reasonable assumptions about the relationship between \(X\) and \(Z\) might look like. Since then, Oster (2019) has offered a popular variation on Altonji, Elder, and Taber’s method.

However, the specific method we’ll be using is from Cinelli and Hazlett (2020). This method uses a few different approaches to bounding the relationship between \(X\) and \(Z\), and then asks questions like “how strong does that bias need to be for us to think that our estimate is all bias and the true effect might be 0?” or “if we were able to measure all the predictors of \(Y\) other than \(X\), how strong would their relationship with \(X\) need to be for the true effect of \(X\) to be 0?”

Cinelli and Hazlett operate by thinking carefully about partial \(R^2\) values. A partial \(R^2\) value is sort of like a regular \(R^2\) value (see Chapter 13), except that instead of measuring the share of variation statistically explained by all the predictors, it instead asks “after we remove the share of variation explained by one set of predictors, how much of the residual share is explained by this other set of predictors?” For example, to get the partial \(R^2\) for \(X\) in explaining \(Y\), we’d first regress \(Y\) on all the other predictors in the model, take the residuals, regress those residuals on \(X\), and get the \(R^2\) from that final regression.
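If that definition is easier to read as code, here’s a quick sketch with simulated data; the variable names are made up for illustration. (The sensemakr package we’ll use later in this section also has a partial_r2() helper, so you don’t have to do this by hand.)

set.seed(1)
d <- data.frame(w1 = rnorm(500), w2 = rnorm(500))
d$x <- d$w1 + rnorm(500)
d$y <- d$x + d$w1 + d$w2 + rnorm(500)

# Partial R^2 of x in explaining y: residualize y on the other
# predictors, then see how much of what's left over x explains
resid_y <- resid(lm(y ~ w1 + w2, data = d))
summary(lm(resid_y ~ d$x))$r.squared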

Cinelli and Hazlett’s approach is based on thinking about two partial \(R^2\) values: the partial \(R^2\) of the omitted variables in explaining the outcome \(Y\), and also the partial \(R^2\) of the omitted variables in explaining the treatment variable \(X\). From our omitted variable bias calculation above, these conceptually (although not exactly) represent \(\beta_2\) and \(cov(X, Z)\), respectively. If both partial \(R^2\)s are low, then you’ve got low bias. If one is high, you might be able to offset it with the other being low and still have low bias. There’s a tradeoff between the two. We can’t actually measure these relationships in the data, but Cinelli and Hazlett use graphs to show how different hypothetical combinations of partial \(R^2\) values would lead to different levels of bias.

Let’s give Cinelli and Hazlett a spin. For this chapter we’ll be using as an example a study by Pina-Sánchez and Harris (2020). In this study, the authors are interested in whether there are gender disparities in prison sentencing in England and Wales. They look at data from 2011 to 2015 for several different offenses, and ask whether a man or a woman is more likely to be sentenced to custody for the same crime. The “treatment” here is being treated as a male in the courts, and the outcome is whether or not someone is sentenced to custody for the given crime. They also add 43 control variables to adjust for sentencing-relevant factors that might be different between men and women and affect their sentencing for reasons other than their gender, including the severity of the crime, their previous convictions, and whether they are the only caretaker for any children, among other factors.

One of their analyses looks at drug crimes, where men are 13.3 percentage points more likely than women to be taken into custody after adjusting for controls. (We are using linear regression here, while the original study uses logit, so the results differ slightly. But in the case of the drug crime analysis, the marginal effects from logit and the linear regression results are nearly identical.) Our analysis will focus on this drug-crime result. The original paper includes plenty of controls when isolating this gender disparity, but surely they cannot adjust for all relevant factors! How much confidence can we really put in that 13.3 percentage point number, and what might be a more reasonable range of results?

We can take a look at how sensitive these results are in Figure 21.3. How can we read this figure? Along the x-axis we have the partial \(R^2\) of omitted confounders with the treatment, and along the y-axis we have the same for the outcome. Then, on the graph itself we have some curves labeled with treatment effects. Keep in mind those x- and y-axis “partial \(R^2\)” values refer to the amount of variation explained by the omitted confounders after accounting for the other controls we already have.

To interpret this, let’s take as an example the value .1 on the x-axis and .2 on the y-axis, which roughly meet at the line labeled -.1. We can read this as “if there is a set of omitted variables that explain 10% of the residual variation in gender, and 20% of the residual variation in being taken into custody, then controlling for those omitted variables could take the estimated effect all the way down to -.1.” Our original estimate is positive (men are more likely to be taken into custody), shown at the bottom-left of the graph, so the graph shows how adjusting for a control of that level of importance could switch the sign of the effect. (The bias could just as easily be in the other direction, making the effect bigger than the original value instead of smaller and eventually negative. But what we’re worried about is the effect we have going away, so that’s what this method focuses on.)

Figure 21.3: Sensitivity of Gender Disparities in Drug-Crime Sentencing to Omitted Variable

Graph with 'Partial R2 of the confounders with treatment' on the x-axis and 'Partial R2 of the confounders with outcome' on the y-axis. There is an unadjusted effect size estimate in the bottom-left, with a dashed line above it showing the set of partial R2 values that would drive the effect to 0. Right next to Unadjusted is 1x Culpability, a bit above it is 1x Drug Class, and way up to the right, past the dashed line, is 1x All Covariates.

Combinations of x- and y-axis values that push us over that thick dotted 0 line are of particular interest, since if we have omitted variables that contribute that much bias, they can explain away all of the effect we found. As a simple metric, Cinelli and Hazlett suggest a “robustness value,” which is the value on the zero-treatment-effect line where the x-axis value and y-axis value are the same. On this graph, that’s .083. We can then say that our effect is robust to omitted variable bias if the omitted variables explain less than 8.3% of the residual variation of both treatment and outcome. But if it’s higher than 8.3% for both (or perhaps higher for one but lower for the other, for some tradeoff), we’re in trouble. The question then is whether we think there is a set of unobserved controls as strong as that.

Is there? It’s hard to get a sense of what exactly that would mean. That’s where the points on the graph come in. We can say “we don’t know exactly how much the unobserved variables bias us, but we do know how much we’d be biased if we removed some of the controls we do have.” So we take some of the controls already in the model and check how sensitive the results are to there being another set of unobserved controls as strong as the ones we have.

On the graph we see four points, the first for the original Unadjusted estimate. Right next to it we see “\(1\times\) Culpability.” Culpability is whether the suspect had a leading, significant, or lesser role in the drug crime. This point is right next to the unadjusted point, saying that if there were some other set of unobserved controls equally as strong (\(1\times\)) as Culpability, the result wouldn’t change much. The results are robust! Or at least they’re robust to that one. If instead there’s another set of unobserved controls that contribute as much bias as Drug Class (cocaine/marijuana/etc.) does, the actual effect is still positive, with men receiving higher sentences, but now only by 8 percentage points, not 13. What if there are so many omitted variables biasing us that they collectively have the strength of all the 43 covariates we already have? In that case our original positive result doesn’t survive at all, and we find ourselves all the way down at -.37.

This exercise gives us a sense of how robust our original result is to additional sources of omitted variable bias. We can pick one of our controls that we might think would have been a fairly strong contributor to bias, like the type of drug, and say “even if there is another set of confounders out there as strong as this, the effect is still of the same sign we started with.” We might even say that we are willing to bound the effect as being at least as large as .08, if we think that something as strong as drug class is the strongest unmeasured variable still out there. At the very least, we do need to claim that there isn’t a set out there as strong as the full list we already have. But given that our list already covers quite a lot of bias, we may be comfortable saying there isn’t another set out there equally as strong.

Another way to put sensitivity into context is with an extreme-value analysis. This approach imagines that we’ve found variables that fully explain all the residual variation in our outcome, or some other share like 50% of the residual variation. (Heck, something must! The outcome came from somewhere. The question is not whether there’s other stuff out there that explains the outcome; there always is. The question is whether that other stuff is also related to treatment, giving us an open back door.) Then it asks how strongly those variables would need to be related to treatment in order to drive the estimated effect to 0.

We can see this analysis in action in Figure 21.4. This one doesn’t look quite so rosy! Looking at the “100%” line, we see that it starts positive but crosses 0 at a value of only about .01. This means that our result is not robust: a set of omitted variables that explained all of the residual variation in being taken into custody would only need to explain 1% of the residual variation in being male to wipe out the effect. (Note that this is being able to predict 1% of the statistical variation in being male, in other words things that are different on average between men and women. It doesn’t need to be the case that these things cause you to be male.) Maybe this is a claim we’re willing to make. After all, regressing “male” on the rather lengthy and comprehensive set of predictors we have only gives us an \(R^2\) of .099. Would we claim that the remaining variation in custody wouldn’t move that needle by another .01? Maybe. But at the very least, now we know it’s a claim we need to make.

Figure 21.4: Extreme-Value Analysis of Gender Disparities in Drug-Crime Sentencing

Graph with 'Partial R2 of the Confounder with Treatment' on the x-axis and Adjusted Effect Estimate on the y-axis. There is a dashed horizontal line at 0 on the y axis. There are three downward-sloping lines indicating Partial R2 of the Confounder with the Outcome values of 100%, 75%, and 50%. They cross the 0 line between an x-axis value of 0.01 and 0.02.

Implementing a Cinelli and Hazlett sensitivity analysis is fairly straightforward to do in code because the method has downloadable packages in R, Stata, and Python, all called sensemakr, although for Python it is installed with pip install PySensemakr. The below code replicates the analysis and both graphs from this section.

R Code

library(tidyverse); library(sensemakr)
cc <- causaldata::ccdrug

# The . means "include all the variables you haven't
# mentioned yet," which here is everything except custody
m <- lm(custody ~ ., data = cc)

# The full set of covariates, everything but intercept and treatment
all_covs <- m %>% coef() %>% names()
all_covs <- all_covs[3:length(all_covs)]

# Create bounds with strength similar to different covariates
all_bounds <- ovb_bounds(m, treatment = 'male',
                        benchmark_covariates = list(
                          'All Covariates' = all_covs,
                          'Culpability' = all_covs[6:8],
                          'Drug Class' = all_covs[12:16]))

# Make a graph, with a big range for the "All Covariates" adjustment
ovb_contour_plot(m, treatment = 'male', 
                 lim = .3, lim.y = 1)
# And add our adjusted estimates to the plot
add_bound_to_contour(bounds = all_bounds)
# Find the robustness value from the graph
robustness_value(m, 'male')

# Extreme-case analysis
ovb_extreme_plot(m, treatment = 'male')

Stata Code

* ssc install sensemakr
causaldata ccdrug.dta, use clear download

* Any comparison/benchmark covariate needs to be pre-made
* into binary indicators
xi i.drgi_culpability, pre(cul_)
xi i.drgi_class, pre(cla_)
xi i.age, pre(age_)
xi i.offense, pre(off_)
xi i.prev_convictions, pre(pre_)

local allcovs "male age_* off_* first_offense pre_* cla_* cul_* drg_*"

* The Stata version won't do all sets of benchmarks on one graph
* So we run three times. Need a big graph range for AllCovariates
sensemakr custody `allcovs', treat(male) gbenchmark(`allcovs') ///
    gname(AllCovariates) kd(1) contourplot clim(0 1)
sensemakr custody `allcovs', treat(male) gbenchmark(cul_*) ///
    gname(DrugCulpability) kd(1) contourplot
sensemakr custody `allcovs', treat(male) gbenchmark(cla_*) ///
    gname(DrugClass) kd(1) contourplot

* Extreme-case plot
* this will also give us the robustness value in the RV_q table column
sensemakr custody `allcovs', treat(male) extremeplot

Python Code

import pandas as pd
import statsmodels.formula.api as smf
import sensemakr as smkr

from causaldata import ccdrug
cc = ccdrug.load_pandas().data

# The full set of covariates: every column except the outcome
all_covs = list(cc.columns[1:])
# Regress on male and all covariates
m = smf.ols('custody ~ male + ' + ' + '.join(all_covs), 
    data=cc).fit()
# Get covariate names again, this time accounting for factor names
covs = list(m.params.index)
covs.remove('male')
covs.remove('Intercept')


bounds = smkr.Sensemakr(m,treatment = "male",
                              benchmark_covariates={
                          'All Covariates': covs,
                          'Culpability': covs[15:17],
                          'Drug Class': covs[10:15]})
# Make a graph, with a big range for the "All Covariates" adjustment
bounds.plot(treatment = 'male', lim = .3, lim_y = 1)
# Find the robustness value from the graph
bounds.sensitivity_stats['rv_q']

# Extreme-case analysis
bounds.plot('extreme', treatment = 'male')

21.2.3 What About Matching?

Allowing for the possibility that some back doors are still open is not limited to linear regression approaches. However, your humble textbook author comes to a crossroads. The most common way of doing partial identification with matching is called Rosenbaum bounds, from Rosenbaum and Rubin (1983) and Rosenbaum (2010). It’s also highly intuitive and easy to explain. A slam dunk for being the method that I choose to explain to you.

However! Rosenbaum bounds are specifically designed to apply only when you select good matches using a propensity score. This is a problem, since I spent much of Chapter 14 telling you not to select matches with a propensity score! If you’re selecting matches, use a different distance measure, and if you’re using a propensity score, use inverse probability weighting. Worse, while there are ways of doing partial identification with those methods, they’re newer and less tested, as well as more complex. So what’s an author to do?

I’ve decided to go ahead with telling you about Rosenbaum bounds. However, in a “How the Pros Do It” mini-edition, I’ll also point you to some of those other methods. Liu, Kuramoto, and Stuart (2013) offer an accessible introduction to seven approaches to partial identification in matching (three of which are variations on Rosenbaum bounds), as well as pointing the way to many others. In more recent developments, Chernozhukov et al. (2022) cover a range of ways that matching is used in causal machine learning, and they offer approaches to performing partial identification in those settings. Many of these machine learning methods are implemented in the dml.sensemakr package in R, which works much like the sensemakr package covered in the previous section. Crucially, the concept of what is going on is very similar across these methods, even if the calculations are different (necessarily so; you’ll see how heavily the Rosenbaum bounds calculations rely on the use of matched pairs), so much of it ports over anyway.

On to Rosenbaum bounds! Like other kinds of partial identification, Rosenbaum bounds look at a particularly strong assumption and then ask “if we weaken this, what can we still say?” Rosenbaum bounds do this by looking at a single matched pair at a time.

The idea behind matching in general is that you are constructing a treated sample and a control sample that do not differ in terms of any variables that would be on a back door. If, say, income is a variable on a back door path, but your treated and control groups have exactly the same income, then income can’t really explain any differences between the groups, can it? So income doesn’t cause treatment any more and you’ve closed the back door.

Another way to think about this approach, when you’re matching by choosing matched pairs, is that for any given pair of individual observations that are matched to each other, they should have the exact same probability of being treated. Let’s call the probability of being treated \(\pi_i\) for a given individual \(i\). So if we have a treated individual \(t\) and an untreated individual \(u\) who are matched on the basis of all their matching variables, then \(\pi_t\) should equal \(\pi_u\). Maybe both probabilities are, for example, 50%, but the coin flip just happened to land on “treated” for \(t\) and “untreated” for \(u\).

In this case, the probabilities are equal, so the odds (\(\pi_i/(1-\pi_i)\)) will also be equal, and the odds ratio (one of the odds divided by the other) will be 1. When this happens, matching will point identify the causal effect of treatment.

\[\begin{equation} \tag{21.9} \frac{\pi_t/(1-\pi_t)}{\pi_u/(1-\pi_u)} = 1 \end{equation}\]

Naturally, the researcher must ask what happens if they’re not exactly equal, perhaps because some back-door variable wasn’t measured and couldn’t be matched on. Rosenbaum bounds propose a “gamma” parameter \(\Gamma\) that is greater than 1 and represents the worst possible odds ratio we’re willing to consider.

\[\begin{equation} \tag{21.10} \frac{1}{\Gamma} \leq \frac{\pi_t/(1-\pi_t)}{\pi_u/(1-\pi_u)} \leq \Gamma \end{equation}\]

In other words, we’re allowing there to be a mismatch, but the odds ratio can’t be any farther away from 1 than \(\Gamma\) is. This is our weaker assumption: considering a \(\Gamma\) above 1 instead of forcing it to be 1.

From there, we’re pretty much home free! Something important about that equation is that it’s not just true over your whole sample, it’s true for every matched pair individually. So for each pair in your data, you have this equation. You can then estimate your bounds by letting the degree of mismatch range from a worst-case where the treated individual was far less likely to be treated than their matched partner who was actually untreated (so the odds ratio is \(1/\Gamma\)), all the way up to the worst-case where the treated individual was far more likely to be treated than the matched partner who was actually untreated (so the odds ratio is \(\Gamma\)).

At each of these extremes you can get the effect, and with some additional calculations you end up with a range of effect estimates, as well as a range of p-values testing the effect against zero. And there you have your Rosenbaum bounds for a given \(\Gamma\). You can say something like “if the odds ratio is no worse than \(\Gamma\), then the estimated effect is between (A) and (B).” Or, alternately, you can try several different values of \(\Gamma\) and say “the estimated effect is larger than 0 as long as \(\Gamma\) is smaller than (C).” (You may have noticed that we got all the way here without making assumptions about how strongly related an omitted matching variable is to the outcome. We only care about its influence on treatment! However, the relationship between the omitted variable and the outcome shows up once we start looking at effects, which are themselves differences in the outcome between the treated and control groups.)

Odds are, we have a problem here. We have our nice calculation, and it tells us the range of effects for a given \(\Gamma\). But \(\Gamma\) is a restriction on the odds ratio. Odds aren’t too difficult to understand, but odds ratios are just about impossible to interpret intuitively. (In preparing this section I tried very hard to intuitively explain, for example, how big an odds ratio of 1.2 is. But I couldn’t come up with anything good, despite intuitive explanations of statistics being, like, my whole thing. Then I scoured the internet and couldn’t turn up anybody else able to really do it either. I don’t feel too bad conceding defeat on this one.) It’s even common for people who work with them all the time to mistake them for risk ratios (“the risk for something is X times higher in group A than group B,” which is not what an odds ratio is) (Holcomb Jr et al. 2001). How weak of an assumption is \(\Gamma=1.2\), exactly? Or \(\Gamma=2\)? Or 4? The whole point of this is to say that we know how strong our assumptions need to be to generate useful conclusions. If we don’t actually know how strong those assumptions are, there’s no point!

One approach we can take here is to just do the conversion ourselves! It’s not too hard. Want to know whether your results are robust to the mismatch between pairs being as big as one member of the pair having a treatment probability of .4 while the other is .45? Calculate the odds ratio yourself: \((.45/(1-.45))/(.4/(1-.4))=1.23\). So we know a \(\Gamma\) of 1.23 covers a gap in treatment rates of that size. (Importantly, if you do this, keep in mind that you can’t just say “ah, .4 to .45 is an odds ratio of 1.23, so I’m good for a gap of five percentage points.” Odds ratios get bigger the farther the base rate is from 50%. The odds ratio from .05 to .1, also a five-percentage-point gap, is 2.11, and from .75 to .8 it’s 1.33. You can pick your base rate using the actual treatment rate in your data.)

Alternatively, we can take a further step in our Rosenbaum bounds calculation, one that uses the fact that our matched pair is one treated and one control observation. We rewrite our condition for each pair as:

\[\begin{equation} \tag{21.11} \frac{1}{1+\Gamma} \leq \frac{\pi_t}{\pi_t+\pi_u}\leq \frac{\Gamma}{1+\Gamma} \end{equation}\]

This doesn’t seem much better at first, but that \(\pi_t/(\pi_t+\pi_u)\) in the middle looks a lot easier to interpret to me! In a given matched pair, you’re always matching a treated person \(t\) to an untreated person \(u\). So, if we know that exactly one of these two people is going to be treated (\(\pi_t+\pi_u\)), what’s the probability that it was going to be the person who did get treated (\(\pi_t\))? Keep in mind that if matching is perfect, it should be a coin flip, since you’re matching people with equal probabilities of treatment. Still not the easiest interpretation in the world, but easier.

This allows us to convert our \(\Gamma\) of 1.2 (for example) into a lower bound of \(1/(1+1.2)=.455\) and an upper bound of \(1.2/(1+1.2)=.545\), which are both .045 away from a “fair coin” of .5. Now we can interpret our \(\Gamma\) of 1.2 as “if I took a matched pair and then somehow learned about all the ways they are different despite the matching, I still wouldn’t be able to guess which one of the pair was the treated one more than 54.5% of the time.”
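Both conversions are easy enough to script yourself. Here’s a quick sketch in R, just arithmetic:

# Odds ratio implied by a gap in treatment probabilities
odds_ratio <- function(p1, p2) (p2/(1 - p2))/(p1/(1 - p1))
odds_ratio(.4, .45)    # about 1.23
odds_ratio(.05, .1)    # about 2.11: same gap, bigger ratio far from 50%

# Convert a Gamma into bounds on the probability that the treated
# member of a matched pair was the one who would get treated
gamma_to_coinflip <- function(gamma) c(1/(1 + gamma), gamma/(1 + gamma))
gamma_to_coinflip(1.2)
# [1] 0.4545455 0.5454545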

That’s… better I think? Still one to sit with a while. But hey, you’ve got time. What else are you doing anyway?

Let’s code up some Rosenbaum bounds. Packages to calculate Rosenbaum bounds are readily available! In fact there are many, for the many different flavors of Rosenbaum bounds. I will show a demonstration in this section, but before copy/pasting this code for your own application, it’s likely worth a little internet searching to see if there’s an implementation out there that more closely matches what you’re doing.

For the basics, in R and Stata there are packages both called rbounds. (No relation; they are written by different people and are used differently.) Inconveniently, however, the specific kind of Rosenbaum-style bounds you need to calculate depends on whether your outcome variable is continuous or binary. So I’ll show both, keeping in mind that the version for a continuous outcome will be incorrect here since our example data has a binary outcome, but you can use it if you have a continuous outcome. In the case of Stata, the rbounds package will only do continuous outcomes, and the binary-outcome approach will require a switch to the mhbounds package (see also the slightly updated rmhbounds package). In Python there does not appear to be a mature and maintained package that calculates Rosenbaum bounds, so we are out of luck there.

We’ll continue our application from Pina-Sánchez and Harris (2020) as used in the previous section on regression. So we’re looking at people who have been arrested for drug crimes in England and Wales, and we want to know whether men are more likely than women to be taken into custody, after adjusting for a bunch of variables likely to differ on average between sexes and also influence the decision to take someone into custody.

R Code

library(rbounds); library(MatchIt); library(tidyverse)
cc <- causaldata::ccdrug

# We are using MatchIt to match, as that's the original way rbounds was 
# designed to work, but anything that makes pairs should work fine.
# use everything but custody to match
m <- matchit(male ~ . - custody,
                 data = cc, replace = TRUE)
# Get the outcomes for each pair, match.matrix contains
# row numbers for the treated and control observations
# Make sure that your rownames() are 1, 2, 3, etc.
treated_rows <- as.numeric(rownames(m$match.matrix))
control_rows <- as.numeric(m$match.matrix)
pairs <- data.frame(Treated = cc$custody[treated_rows],
                    Control = cc$custody[control_rows]) %>%
  # don't include unmatched treated observations
  filter(!is.na(Control))

# For a binary outcome, we count up the number of 
# "discordant pairs" with different outcomes
true_only_treated = sum(pairs$Treated == 1 & pairs$Control == 0)
true_only_control = sum(pairs$Treated == 0 & pairs$Control == 1)

# For our binary outcome. Select which Gammas to evaluate
binarysens(true_only_control, true_only_treated,
           Gamma = 2, GammaInc = .1)

# Pretending we have a continuous outcome
psens(pairs$Treated, pairs$Control, Gamma = 2, GammaInc = .1)

# For continuous outcomes, we can use the sensitivitymw
# package to get a confidence interval for our effect
library(sensitivitymw)
senmwCI(as.matrix(pairs), gamma = 1.6)

Stata Code

* ssc install rbounds
* ssc install rmhbounds
* ssc install psmatch2
causaldata ccdrug.dta, use clear download

* rmhbounds expects our matching to be done by psmatch2
psmatch2 male i.drgi_culpability i.drgi_class i.age i.offense ///
    i.prev_convictions first_offense, outcome(custody)
* Difference in the outcomes for rbounds later
g diff = custody - _custody if _treated==1 & _support==1

* psmatch2 produced the _weight matching variable indicating the number of times
* each observation was matched, which we need for rmhbounds, and 
* _treated, which we already have male for, but _treated is automatically used
rmhbounds custody, gamma(1(.1)2)

* Pretending we have a continuous outcome
* Note the outcome here is garbage; rbounds is very confused
* by a binary outcome. Just take this as example code.
rbounds diff, gamma(1(.1)2)

From this, it looks like our effect loses significance somewhere around \(\Gamma = 1.5\). (That’s in R. If you run the Stata code, you will find it loses significance around \(\Gamma = 1.4\) instead. This is because matchit uses slightly different defaults than psmatch2, and binarysens is doing a slightly different flavor of Rosenbaum bounds than mhbounds. Always read the documentation of your software to check what exactly it is doing!) Using our calculation from the previous section, we get that \(\pi_t/(\pi_t+\pi_u)\) is bounded between \(1/(1+1.5) = .4\) and \(1.5/(1+1.5) = .6\). So if, for a given matched pair, we could use their unobserved data to guess which of the two is the treated one more than 60% of the time, then we cannot reject an effect of 0, given our range of plausible estimates. If we can’t guess that well, then we can still reject 0.

21.3 How the Pros Do It

21.3.1 Formal Partial Identification

When thinking about how “the pros” do partial identification, there are two extremely different kinds of things I can recommend. The first is formal partial identification. This really isn’t so much a specific tool or estimator as it is, well… just sorta doing everything by hand.

We saw a little bit of this in the “How Does it Work?” section when we walked through the selection-bias calculation. In that calculation, we wrote out what we planned to estimate, and plugged in what it represented theoretically. That process made fairly clear what assumptions we needed to make about things we couldn’t see in our data, and we put in a reasonable range of assumptions to get a reasonable range of estimates.

That same process can be repeated for any causal estimation, not just ones as simple as we walked through. This process is highly flexible and lets you see how sensitive your results are to the relaxation of all kinds of assumptions, to all different degrees of relaxation. Pretty neat!

Unfortunately, with that freedom comes complexity. This section is fairly short because I can’t really do justice to the process of formal partial identification without very quickly getting into the weeds, and showing you a far deeper level of technical detail than you’d find in the rest of this book. This generally looks like walking very carefully through complex conditional probability calculations. That example from “How Does It Work” is doing exactly this, but it was very specifically chosen to be an example I could walk through fairly simply. Most applications are much trickier.

So I won’t attempt to show you how to do this. But I will give you a few places to look in case you’re interested. One straightforward introduction (as these things go) is the classic Charles Manski (1995) book Identification Problems in the Social Sciences. Tamer (2010) provides a somewhat more updated, but brief, overview. If you really like what you see, you can begin your journey into the deep end with Manski (2003) or Gangl (2013).

21.3.2 Design-Specific Partial Identification

That short and breezy overview of formal partial identification leads to another way the pros approach partial identification that probably gets used way more often, although it is much less general in its application.

All the different standard research designs I’ve described in this book, from Fixed Effects to Regression Discontinuity Design, rely on key identifying assumptions. Or rather, they rely on a lot of identifying assumptions of many kinds, but there are often a few that get the most attention or are viewed as being the most difficult to support despite being key to the method. Somewhat like in a Hollywood movie, the star of the show when it comes to building trust in your causal inference study is the underdog nobody really believes in at first, until they get a chance to prove themselves. These are things like parallel trends in Difference-in-Differences, or the validity assumption in Instrumental Variables.

As you might expect, this is prime territory for partial identification to move in on. Have a key identifying assumption that you can never really prove is true, and you know readers of your study will be suspicious of? Well, why not do some partial identification so that readers can know how much your results really do rely on that assumption, and what you get if it is weakened?

A great example of this kind of thing is the “honest Difference-in-Differences” estimator for, well, difference-in-differences (DID) (Rambachan and Roth 2023). A key assumption in DID studies is parallel trends. In DID, you are taking a group that got a treatment at a particular period of time and seeing how their outcomes changed from before treatment to after. Then, you compare those changes against a control group that never received the treatment. For this analysis to give you the causal effect of treatment, you must assume that, if the treatment had never been given to anyone, then the gap between the treated and control groups would have been exactly the same before treatment and after.

Exactly the same is a pretty high bar, isn’t it? Truly zero change in the gap between two groups? C’mon now. I’d probably never believe that parallel trends holds exactly. But what if it doesn’t hold exactly but rather is pretty close to holding, but not so close that we can justify ignoring the problem? Maybe just a tiny change in the gap would have happened. Surely we can work with that? And we can. Obviously, if the gap would have grown by .01, then our estimated effect would also be off by .01. So we have a pretty direct relationship between “amount of parallel trends violation” and “bias in our estimate.” Honest DID starts with a range of reasonable parallel trends violation values, based on a few different approaches including prior-trends tests (see the DID chapter), and turns those into adjusted effect estimates, including confidence intervals. This lets you say stuff like “if the parallel trends violation is no worse than \(2\times\) the size of our prior trends violation, then we can bound the effect estimate between A and B.” Honest DID can be implemented in R using the HonestDiD package, or in Stata using honestdid.
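To give a flavor of what this looks like in practice, here’s a rough sketch using the HonestDiD R package. The event-study coefficients and variance matrix below are entirely invented; in a real application you’d pull them from your estimated DID model, and you should check the package documentation for the current interface:

# install.packages('HonestDiD')
library(HonestDiD)

# Hypothetical event-study estimates: three pre-treatment coefficients
# (hovering near zero, as we'd hope) and two post-treatment ones,
# with a made-up variance-covariance matrix
betahat <- c(0.01, -0.005, 0.002, 0.10, 0.12)
sigma <- diag(0.02^2, 5)

# Bound the post-treatment effect, allowing the parallel-trends violation
# to be up to Mbar times the worst violation seen in the pre-period
createSensitivityResults_relativeMagnitudes(
  betahat = betahat, sigma = sigma,
  numPrePeriods = 3, numPostPeriods = 2,
  Mbarvec = c(0.5, 1, 1.5, 2))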

In regression discontinuity designs, there are several key assumptions to be concerned about, but a big one has to do with manipulation of the running variable. Regression discontinuity is based on the idea that there is a cutoff value of some variable that assigns you to treatment (or at least makes you much more likely to be treated on one side of the cutoff than the other), and that for those really close to the cutoff, it’s sort of happenstance whether you end up just on one side of the cutoff or just on the other, making people just above and just below the cutoff good comparisons who really only differ based on random noise. But there might be some reasons why it’s not just random noise which side you end up on. If you know where the cutoff is, maybe you can change your behavior to just barely nudge yourself over that line. In that case, getting treated or not isn’t quite so random! In Chapter 20 I showed how we can test for the presence of manipulation, but not what to do if it happens.

Gerard, Rokkanen, and Rothe (2020) show how the extent of manipulation in the running variable translates into a range of effect estimates. You can either specify an upper bound on how much manipulation you expect there to be, or allow it to vary, and translate that into a range of effect estimates that allow for the potential presence of manipulation in your running variable. (You may be seeing a pattern here in how this stuff works! The concept, across all of these partial identification methods, is quite similar. Spot an assumption, relax it, calculate how that violation passes through to the estimate, then check a range of violation values. The hard part is the calculation to get from violation to estimate, and that’s the part these methods do for you.) Their method has packages available for R and Stata, both of which can be downloaded from https://francoisgerard.github.io/rdbounds/.

How about instrumental variables? I’m leaving this one for last because, well, I already covered it back in Chapter 19 in the “Okay, You Can Have a Little Validity Violation” section. In instrumental variables, a key difficult-to-justify assumption is validity, which says that there are no open back doors from your instrumental variable to your outcome, except for those going through your treatment variable.

Partial identification in instrumental variables starts basically how you’d expect: allow a range of validity-violation amounts and get a range of estimates. Other approaches add more assumptions. If you assume, for example, that the instrument affects the treatment in the same direction in all subgroups, you can narrow your bounds. Or if you can argue that the part of the instrument violating validity has a known positive or negative relationship with the outcome, you can narrow your bounds.
