The Effect is a textbook all about causal inference, specifically causal inference done with observational data. We want to know whether \(X\) causes \(Y\), and by how much, but we can’t or don’t want to run an experiment. How can we design a research study to answer that question? A tricky task, one that I’ve had more than one researcher tell me to my face was impossible and not worth trying. (I think they were just trying to get out of it.)
And that’s what this book is for. In this book I’ll cover what a causal research question even is, and how we can do the hard work of answering that causal research question once we have it.
I’ll do that while scaling far back on equations and proofs. There’s absolutely a technical element to causal inference, and we’ll get to some of that in this book. But when you talk to people who actually do causal research, they think of this stuff intuitively first, not mathematically. They talk about assumptions about the real world and whether they’re reasonable, and what the story is behind the data. After they’ve got that settled, then they worry about equations and statistical properties. Designing good research and proving (or even understanding) statistical theorems are separate tasks. I think they should be introduced in that order.
In the first part of the book, The Design of Research, I’ll go over the concept of identification—the process of figuring out what part of the data has your answer in it, so you can start to work on digging it out. This will require us to use what we know about how the world works to learn just a little more. You’ll come out of the first half with an idea of what it is you need to do to answer a research question—what your research design is! Or, if you prefer, how you can tell whether to believe causal claims that you’ve seen other people make. What is it that they needed to do to support that claim, and did they do it?
The first part of the book is great. You’re gonna love it. I’ve read a lot of causal inference books at this point, and I don’t think there’s anything quite like it out there. My mom loved the first part of this book, and she is allergic to statistics.
The second part of the book, The Toolbox, is more technical. In The Toolbox I go over the standard set of tools that someone doing causal inference is likely to reach for. Some of these are statistical tools like regression. Others are common research designs that have turned out to be handy in answering lots of research questions, like difference-in-differences.
Of course, I say it’s more technical, but the emphasis is still heavily on intuition. I’m never going to try to sell you on a method by proving mathematically that it works. Instead, my goal is to get you to understand what these methods are trying to do, why they’re useful, and when they can be used. Then, I want to help you learn how to carry out these methods, which in the 2020s means how to code them up in R, Stata, or Python. Plenty of code examples in the second part of this book.
I want you to come out of the second part of this book feeling competent—ready to implement these methods and understand what’s going on when they’re used. With a little work, I think you can be.
I’m definitely biased, but I think this book is a lot of fun. It was fun to write, and I think it will be about as fun to read as a causal inference textbook can be. And it will take you on a tour of all kinds of methods and all kinds of research. Causal inference is a mutt of a field, with important contributions from medicine, from epidemiology, from economics, from sociology, from political science, from finance, from data science, and so on and so on. My home is in economics, so you know where I’m coming from, but I can guarantee I’ll be visiting all of these fields in the course of this book. I hope you’ll come with me.
When I was in college, I was introduced to the very barest of causal inference methods. But even those felt to me like a real kind of power. These tools and ideas are the kinds of things that, if you use them right, can turn you from a consumer of knowledge into a producer. You can find the answer to questions nobody else has the answer to. You can figure out how the world really works on your own. I think that’s pretty darn cool!
That’s the kind of power I want you to keep in mind as you read this book. The point I want to make clear on every page is this: the methods in this book are not blunt instruments to be whacked against some data until an answer emerges. (Good thing they’re not: if these methods could just be applied blindly without real understanding, then before you finished reading this book someone would have just written a computer program that would do all this causal inference stuff for you, and you’d have wasted your time.) They’re designed to be used as an extension of the researcher’s understanding about the world. They take what we know and tell us how we can learn more, and what assumptions we need to make to do so.
So don’t treat causal inference as a technical task. There are technical elements, and you’ll need to do some technical work, but that’s not the main point. Think of it as a reasoning task. What do you know? What theory can you rely on? And how can you use that to turn data from something confusing and jumbled into something useful and insightful?
I’m a professor. People have tried to sell me textbooks, so I know how it is. Every textbook is new, and different, and really provides real-world examples that students can latch onto, not like those stuffy old books that kids hate! Then you read it and it’s the exact same thing as the other books, but with different stock photos and New Yorker cartoons in the margins this time.
This book really is different in a few major ways, though. I promise! For one, no stock photos. But the differences aren’t just cosmetic, they’re structural. I suspect you’re either going to think this book is the perfect teaching tool you’ve been waiting for all these years, or you’re going to think it’s completely wrong-headed and focusing entirely on the wrong things. (Not to mention all the stuff I left out. Apologies if your favorite causal or statistical method didn’t make it in. Every cut from the outline was agony, I assure you.)
What we end up with is a book that is, for its subject, a fairly easy read without cutting back on rigor or breadth. The difficulty level is such that it is most appropriate for an undergraduate causality, observational methods, or applied econometrics course. Depending on the program, it could also be used in master’s-level versions of those courses. In conjunction with more technical materials you may also find it useful reading for PhD courses. Readings from the first part of the book, The Design of Research, would also be acceptable for high school statistics classes that want to discuss causality.
There are many ways to use the book, but the way I organize the course myself is to spend the first third of my time discussing the concept of identification and how to work with causal diagrams to figure out identification. Then, the latter part of the course goes into specific methods, with plenty of opportunities to read and replicate existing research that uses those methods. Assignments and video materials are available on the textbook website at theeffectbook.net.
What makes this book so different, then?
The first point of difference is the level of mathematics. Compared to the existing crop of causal inference textbooks (and certainly to the existing crop of econometrics textbooks), this one is very light on equations. In my experience, even among students who are good at math and can answer mathematical questions, it’s a pretty slim portion of students who understand a method because of an equation. (But the students who do gain real understanding from equations, myself included, are more likely to become professors, and thus a problem…) But if you know what a method is trying to do, then when you do get to the equations they take on meaning beyond just being a homework problem to conquer.
The priorities in this book place a conceptual understanding of research design far, far above anything else. The second priority is the ability to implement. That means writing code to perform these methods so you can see first-hand what they do and how they behave. The upside: if it works, students will understand what they’re doing and how to do it. Plus, I can introduce more advanced and up-to-date methods than a typical textbook that would expect to have to lay out the whole mathematical foundation.
A downside, and it is a real downside, is that this won’t prepare students to write proofs in a graduate statistical methods course, or develop their own estimators. However, I suspect that’s not where a lot of students, even the ones going into research, are heading anyway.
The second point of difference is the theoretical approach to causality. This book focuses heavily not just on causal inference methods but also on causal inference concepts. As such, it has a theoretical underpinning for those concepts!
There are two main theoretical frameworks to choose from in causal inference. One is potential outcomes, associated primarily with Donald Rubin, and the other is the structural causal model/causal diagram framework associated with Judea Pearl.
I make two potentially controversial choices here. The first is to omit almost entirely the potential outcomes framework. The logic of potential outcomes certainly makes its way into the book several times, but I never introduce the model formally. Why? Because the stuff that potential outcomes is great at (clarifying the “missing-data” problem, handling treatment-effect averages, expressing ignorability conditions) I either don’t do, or I do in ways I think are more intuitive for students. I’ve taught potential outcomes to undergraduates before. The intuition is helpful; the math is a barrier. I take what I like!
So I’m largely using the causal diagram framework. The second controversial choice is to use what I think of as “causal diagrams lite.” No do-calculus, and I do some things that are helpful for clarity but not part of the formal causal diagram setup, like occasionally including functional form terms on the diagram.
Both of these choices mean that there will be some additional work to do for students who want to continue on to advanced study of these methods. But hopefully they’ll understand what they’re trying to do extremely well. I hope you’ll agree with me that, while the things I’ve left out are valuable and worth knowing in the long run, it’s the right call to leave them for later. Better to learn one thing well than two things poorly.
The mark of a truly excellent textbook is when someone chooses to read it even when it’s not assigned. A truly, truly excellent textbook is one that someone just wants to sit down and read the whole way through. If you do this, please let me know. My ego doesn’t need the boost, but it does want the boost.
While I’m sure I’m leaving some people out, I can imagine three kinds of people who might be likely to read this outside of a classroom. And for those three kinds of people, I have some reading recommendations.
For the data scientist or business analyst with little background in causality who wants to answer causal questions: glad you’re here! This book will be taking a rather different approach to data analysis than you’re probably used to. For the most part, data science and business analytics are both fields that are first data-driven. (Not always! But usually.) You look for patterns in the data and see what it tells you. Your goal, usually, is to make some prediction or measurement with the data.
Causal research, on the other hand, is theory-driven. You start with what you know and use that to interpret the data. Your goal is to use the data to uncover some broader truth about the processes and laws that generated the data.
Coming into this book, you are going to learn not only some new methods but also a whole new approach to thinking about research! Using a whole different mental framework is tough to do; I find it very difficult when I try to go in the other direction to read data science results, for example. For you, the key chapters of the book are going to be 2 and 5. Maybe even read those ones a couple times until you really get them. Once you do, the rest is methods. Given your background you can pick those up in a snap. Open your mind and step inside.
For the non-researcher who wants to understand how causal inference works or get better at interpreting and evaluating studies that use causal inference: this book is conveniently arranged in such a way that you can learn what you need without getting in over your head. Chapters 1 through 9 will give you a look into what studies that use causal inference, as a whole, are trying to do. They also give you a leg up in determining how studies (or people) who make causal claims can support, or fail to support, the claims they’re making. Even as a non-researcher, you’ll become fully capable of drawing your own causal diagrams (Chapter 7), and then thinking about what must be done to support (identify) a causal claim (Chapters 8-9). Then you can ask yourself if they did that or not! If you find the prospect of reading about math terrifying, you can probably get away with skipping Chapters 3-4, but do give them a shot and see how far you get before skipping to Chapter 5.
The second part of the book may also be handy for you. If you’re not planning to do your own statistical research, you’re not going to need to read any chapter all the way through. But if there’s a study out there you want to be able to understand that uses one of the designs in these chapters, you can look through the “How Does It Work?” section at the beginning of many of the Toolbox chapters to see what that design is trying to actually do. And if you wanna get real fancy and interpret some of their results more directly, the “How Is It Performed?” sections will help with that too.
For the researcher with experience in causal inference who wants to review the standard methods or get a better sense of how they work, the second part of this book, The Toolbox, is perfect for you. Each chapter on a standard causal inference design is split into three parts: a “How Does It Work?” section that will refresh you on the concepts and theory underlying the design, a “How Is It Performed?” section that probably treads the closest to the econometrics-textbook introduction to these methods you’ve already seen, and a “How the Pros Do It” section that tracks modern tweaks, concerns, and fixes that you probably want to know about. And while we’re at it, those “How the Pros Do It” sections are perfect for the researcher who learned methods long ago and needs to get up to speed on recent developments.
That said, for both those kinds of researchers I’d also recommend checking out the material from the first half of the book on causal diagrams. (And Chapter 5 on Identification, too: you already know this stuff, but I think the chapter turned out really well; you might pick up a few new perspectives on identification.) The marginal value of learning causal diagrams for someone already trained in potential outcomes isn’t enormous in my opinion, but it’s still a powerful tool to have in your belt. And they really are magical little things when it comes to teaching. If you teach, you may find yourself adding causal diagrams in class even if you don’t teach methods and statistics. I always used them in my Economics of the Education System class to help students understand empirical papers above their heads.
Writing this book has been a wild time, in every sense of the word. I started this thing in February 2020, when my kid was six months old and almost exactly one month before the coronavirus pandemic hit the United States. Now, as I finish the book, my kid is calling every animal “puppy” and, well, the pandemic isn’t exactly over, but I did just get my second vaccine shot on Friday.
Having a book to write has been a distraction, a passion project, and a way of keeping time during this weird and very tense year. If you’re reading this in the future and have no idea what I’m talking about, I’m sure there are many history books about 2020 you can read. Or, by the standards of your time, 2020 isn’t even that notably weird, in which case I hope you’re doing okay and am shocked you have the time to read causal inference textbooks.
You can also consider this book a strange and minor outgrowth of the economics community on Twitter (and the other academic communities it overlaps with). Not only did I learn quite a lot about causal inference from Twitter, but it’s just a great intellectual environment to be in, and contributing to it has been a great motivator. There’s a real rush in sharing some of your teaching materials and having a thousand people tell you they like them. So I spent a year writing a book. I guess flattery will get you anything.
Thanks to Scott Cunningham for encouraging me to write this book in the first place. Thanks to Joan for being a good enough sleeper to turn midnight-to-three into writing time. Thanks to Spike for everything.
Chapter-heading and cover illustrations are by Sara Jean.
Many thanks to Katelyn Trujillo for editing work. Thanks also to Megha Joshi who wrote many of the chapter questions that have now been moved out of the book itself and into supplementary material, and to the reviewers (anonymous and otherwise) of each draft.
Figures and tables in the main text of this book were generated using the R packages ggplot2 by Wickham (2016), Cairo by Urbanek and Horner (2020), ggpubr by Kassambara (2020), modelsummary by Arel-Bundock (2020), and vtable by Huntington-Klein (2020). Causal diagrams were generated using shinyDAG by Creed, Aden-Buie, and Gerke (2020), and the LaTeX package TikZ by Tantau (2013). Many other packages were used in performing analysis. Citations for all those not mentioned in the text can be found in the source code for this book, which is available at the book’s GitHub code repository.
This Bookdown version was created using the R packages bookdown (Xie 2016), tufte (Xie and Allaire 2020), and msmbstyle (Smith 2021), and with additional table-creation packages knitr by Xie (2015) and kableExtra by Zhu (2021).
Page built: 2021-12-21 using R version 4.1.2 (2021-11-01)