So far, this book has covered the concepts of data generating processes and causality. We’ve discussed how to identify the paths we’re interested in, either by shutting down the back door paths we don’t want or by isolating the paths we want directly.
But that’s all conceptual. How do we actually do those things?
While the first half of this book covered concepts and intuition, the second half covers execution. We’ll be delving into the toolbox of methods that are commonly used by researchers.
Many of these methods, especially those past the “Regression,” “Matching,” and “Simulation” chapters, are based around the idea of simplifying a causal diagram. That is, in the real world, causal diagrams get so complex and intricate that it would be very difficult to measure and adjust for all the variables we need, like I talked about in Chapter 11.
But there are certain kinds of what we might call template causal diagrams that can be solved easily. (I attribute this concept of “template” diagrams to researcher Jason Abaluck, who has successfully used it to deeply annoy causal-diagram purists.) We can ask ourselves whether our context and research question of interest fit one of those templates. If they do, the associated method will give us a shortcut to identification that may be more plausible to people reading our work than trying to convince them we’ve really thought of and closed every back door. (Depending on who you talk to, these methods might be called “reduced form” or “quasiexperimental.”)
So how do we use these methods?
As always, first we want to model our data generating process, and draw a causal diagram! These methods are no replacement for understanding our data generating process, and indeed knowing which method to use relies on it.
Second, we want to ask ourselves: does our diagram look the way it needs to look to use one of these methods? For example, as you’ll read in Chapter 19, to use the instrumental variables method there must be an “instrumental” variable that causes our treatment, and for which all paths from the instrumental variable to the outcome go through the treatment.
If it does, we can use the method. We’ve solved our research design problem. From that point, we can start concerning ourselves with statistical issues rather than design issues.
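The instrumental variables condition just described can be sketched as a quick check on a diagram written down as a list of arrows. This is only a rough illustration, not a full identification test: the variable names (Z, D, Y, W) and the function names are made up for the example, it only looks for a direct arrow from the instrument to the treatment, and it treats every path as potentially open regardless of arrow direction, so it will conservatively flag paths that a collider would actually close.

```python
from collections import defaultdict


def bypasses_treatment(edges, instrument, treatment, outcome):
    """True if some path from instrument to outcome avoids the treatment.

    Paths are followed ignoring arrow direction (a conservative assumption).
    """
    neighbors = defaultdict(set)
    for cause, effect in edges:
        if treatment in (cause, effect):
            continue  # drop every arrow that touches the treatment
        neighbors[cause].add(effect)
        neighbors[effect].add(cause)

    # Depth-first search from the instrument in the diagram without the treatment
    seen = {instrument}
    stack = [instrument]
    while stack:
        node = stack.pop()
        if node == outcome:
            return True
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def fits_iv_template(edges, instrument, treatment, outcome):
    """Instrument points at the treatment, and no path skips the treatment."""
    causes_treatment = (instrument, treatment) in set(edges)
    return causes_treatment and not bypasses_treatment(
        edges, instrument, treatment, outcome
    )


# Hypothetical diagram: Z -> D -> Y, with W confounding D and Y
good_diagram = [("Z", "D"), ("D", "Y"), ("W", "D"), ("W", "Y")]
# Same diagram, but Z also affects Y directly, which breaks the template
bad_diagram = good_diagram + [("Z", "Y")]

print(fits_iv_template(good_diagram, "Z", "D", "Y"))
print(fits_iv_template(bad_diagram, "Z", "D", "Y"))
```

Removing the treatment and asking whether the outcome can still be reached is just a compact way of asking whether any path gets around the treatment. A real application would still need the careful thinking about the data generating process that produced the diagram in the first place.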
The Toolbox chapters from Chapter 16 through Chapter 20 focus on “template” research designs in which the same sort of causal diagram, and thus design, applies in lots of different settings. These chapters will all be structured the same way, with three sub-chapters.
Because this book focuses more on research design than econometrics proper, there is little in the way of statistical proofs. If you are interested in these, I recommend the excellent textbooks by Jeffrey Wooldridge (2016). Or, if you want the real advanced stuff, William Greene (2003).
It’s also an undeniable fact that the How the Pros Do It sections do not tell you all the information you need to actually do it like the pros do it. This is because the way the pros actually do it is to read a voluminous and ever-changing literature on the newest approaches to these methods, or at least just read a bunch of other studies using the same general method and then largely follow their lead. (There are, of course, some pros who never move beyond what they learned in their textbooks. Depending on what they’re doing, this is sometimes fine. Some research questions can be answered handily with tools that have been around long enough to make it into textbooks. Other times, well… there’s plenty of work out there by pros that could be better.) Trying to keep up in textbook form would be fruitless, and would require nearly a whole book on each method.
Rather, the How the Pros Do It sections focus on highlighting some of the most important caveats and extensions, and giving you what you need to go learn about the state of the art on your own. If you are hoping for additional up-to-date applications of these methods, or information on their history, I recommend Causal Inference: The Mixtape by Scott Cunningham (2021).
All of the chapters in The Toolbox will include code examples in R, Stata, and Python, showing you how methods can be executed in code.
These code chunks may rely on packages that you have to install. Anywhere you see `X::` in R or `import X` or `from X import` in Python, that’s a package X that will need to be installed if it isn’t already installed. You can do this with `install.packages('X')` in R, or using a package manager like pip or conda in Python. In Stata, packages don’t need to be loaded each time they’re used, so I’ll always specify in the code example if there’s a package that might need to be installed. In all three languages, you only have to install each package once, and then you can load it as many times as you want.
One additional package you’ll want to install to run these code examples is causaldata, which is a package of data sets I’ve made for this book (and several other books) and is available for all three languages. Do `install.packages('causaldata')` in R, `ssc install causaldata` in Stata, or `pip install causaldata` (if using pip) in Python.
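If you’re not sure which Python packages you already have, a small check like the following can tell you before you run an example. This is a minimal sketch using only the standard library; the function name `missing_packages` and the list of package names are illustrative, and it assumes the name you import matches the name you install, which is true for causaldata but not for every package.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that can't be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Illustrative list; swap in whatever the code example you're running imports
needed = ["causaldata", "pandas"]

for name in missing_packages(needed):
    print(f"{name} is not installed; try: pip install {name}")
```

Nothing is installed by this code. It only reports what is missing, so you can install it with pip or conda yourself.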
The datasets all come with documentation. Using the `mortgages` data as an example: in R, see the description with `help(mortgages, package = 'causaldata')` (or just `help(mortgages)` if you already loaded the package with `library(causaldata)`). You can also see the description of each variable as you work with `library(vtable)` and then `vtable(mortgages)` after loading the data. In Stata, variable labels can be seen in the Variable Explorer as normal, and you can get a description of the data set with the command `causaldata mortgages`. In Python, after loading the data with `from causaldata import mortgages`, you can see the data and variable descriptions with, respectively,
These code examples have been run using R 4.1, Stata 15.1, and Python 3.8. If your version of the language is at that level or newer, you should be good to go! If it’s older than that, you may want to upgrade. I haven’t tested the code examples on all old versions, but you’re probably still fine as long as your R is version 3+ and your Stata is 14+ (except for the one example that relies on 16+, but I’ll warn you about it). For Python, it’s strongly recommended that you at least use 3.0+, as there are a number of major syntax changes from Python 2 to Python 3. For all three languages, you may still get somewhat different results based on updates to downloadable packages that occur after the publication of this book. (For example, the modelsummary R package updated just before publication of this book and changed the significance star levels it displays, so I had to change all the example code to keep producing the results I already had in the book! Who knows what package updates will occur after publication.) If you spot such a change, please feel free to contact me.
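If you want to confirm your Python version from inside Python, something like this works. It’s only a sketch: the 3.8 threshold comes from the version the examples were run on, and the helper name `version_ok` is made up for illustration.

```python
import sys

# The version the book's examples were run on
MIN_PYTHON = (3, 8)


def version_ok(current, minimum=MIN_PYTHON):
    """Compare (major, minor) of the running Python against a minimum."""
    return tuple(current[:2]) >= tuple(minimum)


if version_ok(sys.version_info):
    print("Python version matches or exceeds the one used in the book.")
else:
    print("Older Python than the book used; examples may still run on 3.0+.")
```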
One final note on code, specifically in Stata: Stata doesn’t naturally allow you to split one command onto two lines. However, some lines of code are going to be too long to fit on one line of the book and so must be split! This can be accomplished with the use of `///` at the end of a line, which means “the line isn’t over yet, keep reading on to the next one.” However, for some reason, Stata has decided that this will only work if you are running code using the “Execute” button in the do-file editor. It doesn’t work if you’re just copy/pasting code into the Stata console. So if you see a `///` at the end of a line, either be sure to run that code using Execute from your do-file editor, or just erase the `///` and combine that line with the following line of code (and keep going until you hit a line that doesn’t end in `///`).
Page built: 2022-05-17 using R version 4.2.0 (2022-04-22 ucrt)