March 18, 2025
Abstract

Difference-in-Differences (DiD) is arguably the most popular quasi-experimental research design. Its canonical form, with two groups and two periods, is well understood. However, empirical practices can be ad hoc when researchers go beyond that simple case. This article provides an organizing framework for discussing different types of DiD designs and their associated DiD estimators. It discusses covariates, weights, handling multiple periods, and staggered treatments. The organizational framework, however, applies to other extensions of DiD methods as well.
Dating to the 1840s, Difference-in-Differences (DiD) is now the most common research design for estimating causal effects in the social sciences.(1) A basic DiD design requires two time periods, one before and one after some treatment begins, and two groups, one that receives a treatment and one that does not. The DiD estimate equals the change in outcomes for the treated group minus the change in outcomes for the untreated group: the difference of two differences. If the average change in the outcomes would have been the same in the two groups had treatment not occurred, a so-called parallel trends assumption, this comparison estimates the average treatment effect among treated units.
∗ University of California, Berkeley. † University of Georgia. ‡ Baylor University. § Opportunity and Inclusive Growth Institute, Federal Reserve Bank of Minneapolis. ¶ Emory University.

(1) Currie, Kleven and Zwiers (2020) find that almost 25% of all NBER empirical working papers in 2018 and 17% of empirical articles in economics' top 5 journals mention them. The earliest DiD applications we are aware of are from Ignaz Semmelweis from the 1840s (Semmelweis, 1983) and Snow (1855). For a brief overview of the long history of DiD in economics, see Section 2 of Lechner (2011).
In practice, however, researchers apply DiD methods to situations that are more complicated than the classic two-period and two-group (2×2) setup. Most datasets cover multiple periods, and units may enter (or exit) treatment at different times. Treatment might also vary in its amount or intensity. Other variables are often used to make treated and untreated units more comparable. Today's typical DiD study includes at least one of these deviations from the canonical 2×2 setup.

For many years, the common practice in applied research was to estimate complex DiD designs using linear regressions with unit and time fixed effects (two-way fixed effects, henceforth TWFE). Their identifying assumptions and interpretation were informally traced to the fact that, in the 2×2 case, a TWFE estimator gives the same estimate as a DiD estimator calculated directly from sample means, and thus inherits a clear causal interpretation under a specific parallel trends identification assumption. This appeared to justify the use of a single technique for any type of design or specification. Recent research, however, has shown that simple regressions can fail to estimate meaningful causal parameters when DiD designs are complex and treatment effects vary, producing estimates that are not only misleading in their magnitudes but potentially of the wrong sign; for recent overviews, see, e.g., Roth, Sant'Anna, Bilinski and Poe (2023), de Chaisemartin and D'Haultfoeuille (2023b), and Callaway (2023). The significance of these findings is substantial; given the prevalence of DiD analysis in modern applied econometrics work, common empirical practices have almost certainly yielded misleading results in several concrete cases (Baker, Larcker and Wang, 2022).(2)

So, what should applied researchers do instead? This paper proposes a unified framework for discussing and conducting DiD studies that is rooted in the principles of causal inference in the presence of treatment effect heterogeneity. The central conclusion of recent methodological research is that even complex DiD studies can be understood as aggregations of 2×2 comparisons between one set of units for whom treatment changes and another set for whom it does not. This fact links a wide variety of DiD designs used in practice and guides methodological choices about estimating them. Viewing DiD studies through the lens of 2×2 "building blocks" aids in interpretability by clarifying that they yield causal quantities that aggregate the treatment effects identified by each 2×2 component. It also means that identification comes from the simple parallel trends assumptions required for each 2×2 building block. Practically, the building block framework suggests first estimating each 2×2 and then aggregating them. As long as the effective sample size is large, this approach allows for asymptotically valid inference using standard techniques.

This framework is a "forward-engineering" approach to DiD that embraces treatment effect heterogeneity and constructs estimators that recover well-motivated causal parameters under explicitly stated assumptions. By fixing the goals of the study (the target parameters) and deriving analytical techniques, forward engineering provides clear benefits over "reverse-engineering" approaches that begin with a familiar regression specification and derive the assumptions under which it has
(2) Braghieri, Levy and Makarin (2022) also show how newer DiD methods can lead to first-order different results when compared to standard TWFE regressions.
Medicaid after 2014, but several have not done so as of 2024. Columns 1 and 2 of Table 1 illustrate the variation in Medicaid expansion dates.

Table 1: Medicaid Expansion Under the Affordable Care Act

Expansion Year | States | Share of States | Share of Counties | Share of Adults (2013)
Pre-2014 | DE, MA, NY, VT | 0.08 | 0.03 | 0.
2014 | AR, AZ, CA, CO, CT, HI, IA, IL, KY, MD, MI, MN, ND, NH, NJ, NM, NV, OH, OR, RI, WA, WV | 0.44 | 0.36 | 0.45
2015 | AK, IN, PA | 0.06 | 0.06 | 0.
2016 | LA, MT | 0.04 | 0.04 | 0.
2019 | ME, VA | 0.04 | 0.05 | 0.
2020 | ID, NE, UT | 0.06 | 0.04 | 0.
2021 | MO, OK | 0.04 | 0.06 | 0.
2023 | NC, SD | 0.04 | 0.05 | 0.
Non-Expansion | AL, FL, GA, KS, MS, SC, TN, TX, WI, WY | 0.20 | 0.31 | 0.

Notes: The table shows which states adopted the ACA's Medicaid expansion in each year as well as the share of all states, counties, and adults in each expansion year.
States also expanded Medicaid largely because of economic and political considerations (Sommers and Epstein, 2013), creating observable differences between expansion and non-expansion states. For instance, just four out of the 22 states that expanded Medicaid in 2014 are in the southern Census region compared to seven out of ten non-expansion states. This suggests a potential role for covariates when analyzing Medicaid expansion.

Finally, mortality is measured in jurisdictions like states and counties, which are of very different sizes. Choices about (population) weights not only determine how different estimation approaches average the units within a given expansion group but also how a given estimation technique averages estimated effects across those groups. California, for example, represented 4.5 percent of the states that expanded Medicaid in 2014, 5 percent of the counties, but 23.1 percent of the adults ages 20-64; its contribution to "the" average outcome for the 2014 expansion group is very different with weights than without. The final three columns of Table 1 show that in our data the entire 2014 expansion group contains 44 percent of the states, 36 percent of the counties, but 45 percent of all adults. Weighting will, therefore, change how important the estimated treatment effects are for the 2014 group.

Several recent papers study the effect of ACA Medicaid expansion on mortality rates for lower-income adults who are most likely to gain insurance through Medicaid. Miller, Johnson and Wherry (2021) and Wyse and Meyer (2024) use simple DiD methods to provide credible evidence that Medicaid reduced adult mortality rates for targeted sub-populations. Unfortunately, their analyses require restricted links between income and mortality data, which are important to overcome the low statistical power in studies using aggregate mortality data (Black, Hollingsworth, Nunes and Simon, 2022). Our goal is to pursue a replicable and shareable example based on a related analysis
by Borgschulte and Vogler (2020). They use a sophisticated strategy to select and use covariates in a weighted TWFE regression using restricted-access data, and find that Medicaid expansion reduced aggregate county-level mortality rates. We use only publicly available data, which allows us to share a fully reproducible replication package, and consider only a handful of intuitive demographic and economic covariates sufficient to illustrate several practical challenges that can arise with DiD. This empirical exercise is meant solely to illustrate how to tackle several common features of DiD designs. The results are pedagogical in spirit and do not represent the best possible estimates of Medicaid's effect on adult mortality.

Our outcome variable is the crude adult mortality rate, Y_{i,t}, for people ages 20-64 (measured per 100,000) by county (i) from 2009 to 2019, released by the Centers for Disease Control and Prevention (2024).(3) We denote county i's adult population in 2013 as W_i and its socioeconomic covariates in year t (discussed below) as X_{i,t}. The information in Table 1 defines the treatment group variable G_i, which equals the year in which county i's state expanded Medicaid, with G_i = ∞ for the non-expansion states. Our final sample contains 2,604 counties in states with complete data on mortality rates from 2009 to 2019 and covariates for 2013 and 2014.

Faced with a setup such as this, researchers need to make a range of tightly related choices. Which treatment groups in Table 1 should be compared to each other and over what time horizons? What must be true for those comparisons to identify causal effects, and how should one empirically evaluate their plausibility? How can other information, such as covariates or pre-period outcomes, be used to improve the credibility of the design? How do these methodological choices affect the causal interpretation of a given analysis? The aim of this review is to demonstrate to practitioners using DiD in realistic scenarios why and how to use state-of-the-art econometric tools to answer these questions.
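Before turning to the 2×2 case, the sketch below shows one way such an analysis panel could be assembled in Python. The file name and column labels (county_year.csv, mort_rate, W, G, D) are our own illustrative choices, not names taken from the paper's replication package.

```python
import pandas as pd

# Hypothetical county-by-year input with raw counts; column names are ours.
df = pd.read_csv("county_year.csv")  # county, state, year, deaths, pop_20_64

# Outcome Y_{i,t}: crude adult mortality rate per 100,000 adults aged 20-64.
df["mort_rate"] = 100_000 * df["deaths"] / df["pop_20_64"]

# Weights W_i: the county's 2013 adult population, merged back to all years.
w2013 = (df.loc[df["year"] == 2013, ["county", "pop_20_64"]]
           .rename(columns={"pop_20_64": "W"}))
df = df.merge(w2013, on="county")

# Treatment group G_i: the year the county's state expanded Medicaid
# (Table 1), with infinity coding the non-expansion states.
expansion_year = {"AR": 2014, "AK": 2015, "LA": 2016, "ME": 2019,
                  "TX": float("inf")}  # abbreviated; see Table 1 for all states
df["G"] = df["state"].map(expansion_year)

# 2x2 ingredients used in the next section: the 2014 expansion group
# versus counties in states that had not expanded as of 2019.
panel = df[(df["G"] == 2014) | (df["G"] > 2019)].copy()
panel["D"] = (panel["G"] == 2014).astype(int)
```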
3 2×2 DiD designs
We begin our discussion by focusing on the canonical 2×2 DiD setup, which has two time periods—one before and one after treatment—and two groups—one that remains untreated in both periods and one that becomes treated in the second period. In our Medicaid example, we focus on comparisons between the 2014 expansion group (978 counties) and the non-expansion group as of 2019 (1,222 counties) in 2014 and 2013. When we consider more complex designs, this kind of comparison will still play a role: it will be one 2×2 "building block" among many.

Using these basic ingredients, we can now define a 2×2 DiD design, composed of a causal target parameter, a treatment variable, an assumption under which it is identified, and an estimation
(3) It is common to adjust mortality rates by the county age distribution. Unfortunately, the CDC measurements of age-specific deaths are restricted for many counties because they have fewer than ten annual deaths. Because we aim to use publicly available and shareable data for pedagogical purposes, we follow Borgschulte and Vogler (2020) and use the crude mortality rate.
dates. For instance, if the announcement of Medicaid expansion affects mortality before its actual expansion, “treatment” begins when the policy is announced rather than implemented. We formally state this assumption for completeness and maintain it throughout the paper.
Assumption NA (No-Anticipation). For all treated units i and all pre-treatment periods t, Y_{i,t}(1) = Y_{i,t}(0).
The potential outcomes define a causal effect for every unit in every time period, Y_{i,t}(1) − Y_{i,t}(0). These describe what Medicaid expansion did to mortality rates in a specific treated county or what it would have done in a specific untreated county. This framework allows for arbitrary heterogeneity in the effects across units and time, i.e., the effect of Medicaid expansion can be different in every county and year. But it is hard to learn about this degree of rich heterogeneity without additional strong restrictions. Instead, DiD analyses typically seek to estimate averages of heterogeneous treatment effects. In particular, most DiD designs target the average treatment effect on the treated at time t, or ATT(t):
ATT(t) = E_ω[Y_{i,t}(1) − Y_{i,t}(0) | D_i = 1] = E_ω[Y_{i,t} | D_i = 1] − E_ω[Y_{i,t}(0) | D_i = 1].   (3.2)
ATT(t) compares the (weighted) average observed post-expansion mortality rates among treated counties (E_ω[Y_{i,t} | D_i = 1]) to the (weighted) average untreated mortality rates for the same treated counties (E_ω[Y_{i,t}(0) | D_i = 1]). The second quantity is counterfactual because untreated outcomes are never observed for treated counties. Note that, by the no-anticipation assumption, ATT(t) = 0 for all pre-treatment periods, i.e., ATT(2013) = 0 in our two-period Medicaid example. This ensures that cross-group outcome comparisons before treatment begins reflect untreated potential outcome gaps, which is central to the logic of DiD. Note that we abuse notation and omit the weight index when defining ATT's; we do that to unclutter notation throughout the paper.

Equation (3.2) shows that weighting enters the analysis early on, as part of the definition of the causal parameter. If interest lies in the average treatment effect of Medicaid on mortality in the average treated county, sometimes motivated by a view of jurisdictions as "laboratories of democracy," none to be prioritized over another, then the relevant target parameter is an equally weighted average (ω = 1). If, on the other hand, the parameter of interest is the average treatment effect of Medicaid on mortality in the county in which the average treated adult lives, then population weights are appropriate (ω = W). When treatment effect heterogeneity is related to the weights, weighted and unweighted target parameters differ meaningfully (Solon, Haider and Wooldridge, 2015). We conduct some of our empirical exercises with and without population weights to highlight how weights affect a given DiD result. From a policy-relevance perspective, we argue that using population weights in our Medicaid expansion application is probably more appropriate.

In the Medicaid context, the unweighted ATT(2014) answers the question, "What was the average causal effect of Medicaid expansion on 2014 mortality rates among the 2014 expansion
state counties?" Whether this parameter (or any causal parameter) is "of interest" is an argument about theoretical importance, policy relevance, and how one is planning to use it. Other target parameters are also possible. Designs other than DiD identify different kinds of average treatment effects, and some DiD methods use quantile (Athey and Imbens, 2006; Callaway and Li, 2019) or distribution regression (Fernández-Val, Meier, van Vuuren and Vella, 2024a) approaches to target features of the marginal distributions of Y_{i,t}(1) and Y_{i,t}(0) among treated units. We focus on identification and estimation strategies that target ATT parameters but emphasize that the 2×2 building block framework applies to DiD methods more broadly; see our appendix for more discussion of distributional parameters.
A research design is a strategy—a set of assumptions—to identify and estimate specific target parameters. Many different assumptions can identify the missing counterfactual for ATT(2014) in the Medicaid example. For example, mean independence between Y_{i,2014}(0) and D_i implies that the counterfactual equals average 2014 mortality rates in non-expansion counties (E_ω[Y_{i,2014}(0) | D_i = 0]). Under this assumption, which essentially entails assuming that Medicaid expansion is as-good-as-random, the cross-sectional mortality gap in 2014 between expansion and non-expansion counties is the ATT(2014). Similarly, time invariance of Y_{i,t}(0) among expansion counties (plus the fact that we ruled out anticipatory behavior) implies that the counterfactual equals 2013 mortality rates in expansion counties (E_ω[Y_{i,2013}(0) | D_i = 1]). Under this assumption, which essentially rules out non-treatment-related changes in the outcome variable, the "time trend" in average mortality in expansion counties is the ATT(2014).

DiD comes from an alternative assumption that identifies the relevant counterfactual even when the mean of Y_{i,t=2}(0) differs across treatment groups (which violates mean independence) and changes over time (which violates time invariance). The so-called parallel trends assumption states that, in the absence of treatment, the average outcome evolution is the same among treated and comparison groups.
Assumption PT (2×2 Parallel Trends). The (weighted) average change of Y_{i,t=2}(0) from Y_{i,t=1}(0) is the same between treated and comparison groups, i.e.,

E_ω[Y_{i,t=2}(0) | D_i = 1] − E_ω[Y_{i,t=1}(0) | D_i = 1] = E_ω[Y_{i,t=2}(0) | D_i = 0] − E_ω[Y_{i,t=1}(0) | D_i = 0].   (3.3)

If parallel trends holds, then it is easy to construct E_ω[Y_{i,t=2}(0) | D_i = 1] from observable quantities—that is, to identify it:

E_ω[Y_{i,t=2}(0) | D_i = 1] = E_ω[Y_{i,t=1} | D_i = 1] + ( E_ω[Y_{i,t=2} | D_i = 0] − E_ω[Y_{i,t=1} | D_i = 0] ).
In the Medicaid example, Assumption PT says that to calculate expansion counties' average 2014 mortality rate in a counterfactual world without Medicaid expansion, start with their average 2013 mortality rate and add the observed change in average mortality rates in non-expansion counties.
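The arithmetic behind these three identification strategies is worth seeing side by side. The snippet below computes the cross-sectional, before-after, and DiD counterfactuals from group-by-period mean mortality rates (the unweighted values that appear in Table 2 below); only the last two lines invoke Assumption PT. This is a pedagogical sketch, not code from the paper.

```python
# Unweighted group-by-period means (cf. Table 2 below).
y_treat_2013, y_treat_2014 = 419.2, 428.5   # expansion counties
y_ctrl_2013, y_ctrl_2014 = 474.0, 483.1     # non-expansion counties

# Mean independence (expansion as-good-as-random): cross-sectional gap.
att_cross_section = y_treat_2014 - y_ctrl_2014       # -54.6

# Time invariance of untreated outcomes: before-after comparison.
att_before_after = y_treat_2014 - y_treat_2013       # 9.3

# Parallel trends: counterfactual = treated pre-period mean plus the
# observed change among non-expansion counties.
y0_counterfactual = y_treat_2013 + (y_ctrl_2014 - y_ctrl_2013)   # 428.3
att_did = y_treat_2014 - y0_counterfactual   # 0.2 here; 0.1 in the text, from unrounded means
```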
more efficient estimators exist (Roth and Sant'Anna, 2023a). Against the political and economic backdrop of the early 2010s, the claim that state choices about expanding Medicaid were random is also implausible. Therefore, in realistic scenarios, parallel trends can only hold under some restrictions on the way untreated outcomes enter the treatment selection mechanism.

As one example, imagine that treatment selection depends on the permanent component of Y_{i,t}(0) (fixed effects) but not on shorter-term fluctuations ("shocks"). For instance, if state legislatures only knew and considered their long-run mortality levels when making their expansion decision, they would be following this kind of selection mechanism. Expansion and non-expansion states would then have large differences in the permanent part of untreated outcomes, which difference out in equation (3.3), and so parallel trends would hold if shocks to Y_{i,t}(0) had a stable mean.(7) State legislatures, however, may have also known whether their 2013 mortality rates were especially high or low when considering expanding Medicaid. If the expansion choice is related to these 2013 mortality shocks as well as to the fixed effects, parallel trends would hold only if one imposes stronger time-series restrictions on Y_{i,t}(0). Ghanem et al. (2023b) provides a fuller discussion of the selection/time-series trade-off and theory-driven templates to assess parallel trends, while Marx et al. (2024) discusses economic models that are and are not compatible with parallel trends.

Another implication of the fact that DiD does not rely on statistical independence between Y_{i,t}(0) and treatment status is that there is no guarantee that parallel trends holds across different transformations of Y_{i,t}(0). As stated, it is simply an assumption about averages for a particular Y_{i,t}(0). Roth and Sant'Anna (2023b) show that parallel trends is insensitive to functional form if and only if it holds between groups and across the distribution of Y_{i,t}(0). This would entail assuming that either Medicaid adoption is random, the mortality distribution is constant between 2013 and 2014, or a mixture of the two cases. As these conditions are arguably ex-ante restrictive, our DiD analysis may depend on our choice to measure Y_{i,t} in rates (deaths per 100,000) as opposed to logs, for example. One way to evaluate this measurement choice is to propose a theory that delivers it, though we recognize that this is not always possible. To assess whether parallel trends holding for one functional form comes at the cost of ruling out other transformations, we recommend that researchers use Roth and Sant'Anna's (2023b) falsification tests for the null that parallel trends is insensitive to functional form. In our application, we do not reject the null that parallel trends is insensitive to functional form, with p-values above 0.80.

The interplay between treatment selection and the properties of the outcome variable characterizes the structural basis for a DiD analysis (see DiNardo and Lee, 2011), and engaging with them is essential to any DiD application. While every study will have its own institutions, choices, and outcomes to consider, a rigorous DiD analysis must provide a transparent discussion about the reliability of the underlying identification assumptions. If parallel trends is not plausible, one may be better off using an alternative research design.

(7) Some researchers may find it easier to understand these as "parallel changes" rather than "parallel trends". However, the use of "parallel trends" is now firmly established in the literature, and other influential work has used "changes-in-changes" to refer to an alternative estimator to classical DiD estimation (Athey and Imbens, 2006). To avoid confusion, we use parallel trends throughout this paper.
Mapping the DiD estimand in equation (3.5) to the canonical 2×2 DiD estimator follows immediately from replacing population expectations with their sample analogs:

ÂTT(2014) = (Ȳ_{ω,D=1,t=2014} − Ȳ_{ω,D=1,t=2013}) − (Ȳ_{ω,D=0,t=2014} − Ȳ_{ω,D=0,t=2013}),   (3.6)

where Ȳ_{ω,D=g,t=t′} = Σ_{i=1}^{n} ω_i 1{D_i = g} Y_{i,t′} / Σ_{i=1}^{n} ω_i 1{D_i = g} is the ω-weighted sample mean of Y for treatment group g in period t′. Equation (3.6) is the classic difference of two differences written in terms of sample means. It is a direct recipe for actually estimating ATT(t) and can be read directly from the following table of average mortality rates in 2013 and 2014 by expansion group.
Table 2: Simple 2×2 DiD

          | Unweighted Averages                  | Weighted Averages
          | Expansion | No Expansion | Gap/DiD   | Expansion | No Expansion | Gap/DiD
2013      |   419.2   |    474.0     |  -54.8    |   322.7   |    376.4     |  -53.7
2014      |   428.5   |    483.1     |  -54.7    |   326.5   |    382.7     |  -56.2
Trend/DiD |     9.3   |      9.1     |    0.1    |     3.7   |      6.3     |   -2.6

Notes: This table reports average county-level mortality rates (deaths among adults aged 20-64 per 100,000 adults) in 2013 and 2014 by expansion group, along with the cross-group gaps (columns 3 and 6) and across-time trends (row 3); the bottom-right cell of each panel is the DiD estimate.
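Equation (3.6) translates almost line for line into code. A minimal sketch, assuming a long-format DataFrame like the hypothetical panel built earlier, with outcome, treatment-group, year, and weight columns:

```python
import pandas as pd

def did_2x2(df, y="mort_rate", d="D", t="year", w=None, pre=2013, post=2014):
    """(Weighted) 2x2 DiD computed from sample means, as in equation (3.6)."""
    def ybar(g, period):
        sub = df[(df[d] == g) & (df[t] == period)]
        if w is None:
            return sub[y].mean()                       # unweighted mean
        return (sub[w] * sub[y]).sum() / sub[w].sum()  # omega-weighted mean
    return (ybar(1, post) - ybar(1, pre)) - (ybar(0, post) - ybar(0, pre))

# Cf. Table 2: did_2x2(panel) is about 0.1 unweighted, while
# did_2x2(panel, w="W") is about -2.6 with population weights.
```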
The two across-time changes in equation (3.6) are in the third row of the table. Without weighting, average county-level mortality rates rose by 9.3 deaths per 100,000 in expansion states and by 9.1 deaths in non-expansion states, so, after rounding, the DiD estimate of ATT(2014) is 0.1 deaths per 100,000, implying that the average effect Medicaid expansion had on mortality in 2014 among counties that are part of an expansion state was an increase of 0.1 deaths per 100,000. In contrast, the DiD result using population weights suggests that Medicaid expansion caused a reduction of 2.6 deaths per 100,000 for the average adult in expansion states.(8) The same result can be obtained as the (weighted) least squares estimate of β^{2×2} in the following linear regression specification (that only has data for t = 2013 and t = 2014):
Y_{i,t} = β_0 + β_1 1{D_i = 1} + β_2 1{t = 2014} + β^{2×2} (1{D_i = 1} × 1{t = 2014}) + ε_{i,t},   (3.7)
(8) Columns 3 and 6 show cross-group gaps in average mortality in each year. These can also be used to construct the DiD estimate by rearranging equation (3.6): (Ȳ_{D=1,t=2014} − Ȳ_{D=0,t=2014}) − (Ȳ_{D=1,t=2013} − Ȳ_{D=0,t=2013}).
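For completeness, here is a sketch of estimating (3.7) with statsmodels, again using the hypothetical column names from earlier. The coefficient on the interaction term is the 2×2 DiD estimate, and clustering the standard errors at the county level anticipates the inference discussion below.

```python
import statsmodels.formula.api as smf

# Two-period sample for the 2x2 regression (3.7).
sub = panel[panel["year"].isin([2013, 2014])].copy()
sub["post"] = (sub["year"] == 2014).astype(int)

# 'D * post' expands to D + post + D:post; the D:post coefficient is beta^{2x2}.
ols = smf.ols("mort_rate ~ D * post", data=sub).fit(
    cov_type="cluster", cov_kwds={"groups": sub["county"]})

# Weighted least squares with 2013 population weights gives the weighted analog.
wls = smf.wls("mort_rate ~ D * post", data=sub, weights=sub["W"]).fit(
    cov_type="cluster", cov_kwds={"groups": sub["county"]})

print(ols.params["D:post"], wls.params["D:post"])
```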
which are themselves the subject of a large econometrics literature that is particularly important when it comes to clustering decisions (Wooldridge, 2003; Bertrand, Duflo and Mullainathan, 2004; Donald and Lang, 2007; Cameron, Gelbach and Miller, 2008; Conley and Taber, 2011; Abadie, Athey, Imbens and Wooldridge, 2020, 2023). Many inference procedures exist for DiD-type analyses, arising from a combination of choices about the target parameter, details of the data structure and sampling process, and maintained assumptions about the structure of outcomes. In practice, one needs to determine and discuss the forms of uncertainty the standard errors are designed to capture; that is, what is (conceptually) being resampled and what may or may not vary across those resamples. As discussed in Abadie et al. (2020), these details come from the nature of the parameter of interest—whether the focus is on sample-specific average treatment effects or population-level average treatment effects—and the stochastic elements of the model that make the estimator random. Heuristically, this involves a thought experiment (or stochastic process) hypothesized to generate the random components of the model (or the data-generating process). Different inferential frameworks highlight different sources of uncertainty by resampling distinct model components and treating other components as fixed (non-random).

Inferential frameworks on two extremes help cement these concepts. Design-based frameworks treat potential outcomes and covariates as non-random, focus on finite-population parameters (e.g., sample average treatment effects), and consider the allocation of treatment as the only source of randomness in the model (Imbens and Rubin, 2015).(10) The only thing that is random, and thus varies across the hypothetical resamples from this point of view, is the treatment allocation. On the other hand, a traditional sampling-based approach to inference presumes that we independently sample units from a superpopulation. In this case, it is customary to focus on population parameters (like ATT(2)), treat all variables in the model as random variables, and cluster standard errors at the level at which the (hypothesized) sampling was conducted. In this framework, every variable in the analysis—outcomes, covariates, and treatment—is randomly redrawn across the hypothetical resamples. A drawback of the sampling approach is that it is sometimes unnatural to think of the data as a random sample from a well-defined population.

A third popular approach to inference—the model-based approach—is more structural and involves taking a stand on the structure of the error component of the model (e.g., imposing a putative model for how shocks affect outcomes and their relationship with treatment and other variables in the model). The uncertainty reflected in this model-based setting entails a thought experiment in which different values of these shocks and the other random variables in the model are drawn from their joint distribution (Abadie et al., 2023). This model-based approach is common in econometrics, and it almost always takes the linear regression specification (or model, in this
(10) Traditionally, design-based inference procedures are justified when treatment assignment is fully random, which is a much stronger requirement than parallel trends. See Rambachan and Roth (2024) for a discussion of design-based inference for quasi-experimental designs, including a discussion of the Medicaid expansion.
case) as the starting point of the analysis. Although this is often convenient, it is important to note that imposing restrictions on the error component of the model necessarily imposes restrictions on treatment effect heterogeneity and on the relationship between potential outcomes; see Section 5 and Appendix A of Roth et al. (2023) for a discussion. Another challenge with the model-based approach is that it is hard to use this framework when adopting estimation strategies other than linear regressions, e.g., the inverse probability weighting or doubly robust procedures that we will discuss in Section 4.4.

Ultimately, as the discussion above highlights, each approach has pros and cons, and discussions about the best way to compute standard errors are complex and often ambiguous. As such, a detailed treatment of the topic is outside the scope of this paper. We emphasize that such a discussion is intrinsically context-specific, requiring information about the sampling process, the research design and target parameters, what is treated as fixed and random, and the structure of the error components of the models, among other factors. We refer interested readers to Abadie et al. (2020, 2023) and Section 5 of Roth et al. (2023) for discussions on these topics, though we also emphasize that further methodological research in this area is warranted.

For the remainder of this article, we adopt a sampling perspective for uncertainty and cluster our standard errors at the county level. In our context, this is compatible with treating all variables as random, including treatment groups and potential outcomes. It also allows us to avoid (a) making time-series dependence restrictions on potential (and realized) outcomes—as we are in a short-panel framework with a large number of units and a fixed number of time periods—and (b) taking an explicit stand on the structure of the error components of the model, which is particularly appealing as the starting point of our analysis is potential outcomes and not regression models. It is also worth mentioning that, as the treatment in our empirical example is assigned at the state level, clustering at the county level would also be compatible with treating state-specific shocks as fixed (or conditioned on) and assessing whether they lead to violations of parallel trends (Roth et al., 2023, Section 5.1). Clustering at the state level would be justified using a design-based perspective (Rambachan and Roth, 2024), though that would require us to treat potential outcomes as fixed (which we do not in this paper). Our choice of inference procedure is not without controversy, and other inferential approaches may also be rationalized under additional auxiliary structures. However, we do not follow that path in this paper.

We conclude this section by stressing that the appeal of using regressions like (3.7) to estimate the ATT in DiD designs comes from the fact that the regression is numerically equivalent to the "by-hand" DiD estimator (3.6), which was explicitly derived from the ATT(t) and the parallel trends assumption. This ensures that the regression specification respects the underlying identifying assumptions and estimates the desired target parameter. Unfortunately, the tight connection between (TWFE) regressions and DiD designs breaks under more complex setups that are ubiquitous in practice. We now turn to some of these issues and how approaching them from the point of view of 2×2 building blocks can guide good econometric practices.
also report a measure of imbalance that is comparable across variables: the normalized difference in means between treatment and comparison group (Imbens and Rubin, 2015, Chapter 14),

Norm. Diff_ω = (X̄_{ω,T} − X̄_{ω,C}) / √( (S²_{ω,T} + S²_{ω,C}) / 2 ),

where X̄_{ω,T} and X̄_{ω,C} are the sample weighted or unweighted averages of a covariate for the treatment and comparison groups, respectively, and S²_{ω,T} and S²_{ω,C} are the sample weighted or unweighted variances of the covariate for the treatment and comparison group. As a general rule of thumb, values of the normalized difference in excess of 0.25 in absolute value indicate a potentially problematic imbalance between the two groups (Imbens and Rubin, 2015, page 277).
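The statistic is straightforward to compute. Below is a small helper of our own (not from the paper's replication code) that handles both the weighted and unweighted cases:

```python
import numpy as np

def norm_diff(x_treat, x_ctrl, w_treat=None, w_ctrl=None):
    """Normalized difference in means (Imbens and Rubin, 2015, Chapter 14)."""
    def moments(x, w):
        x = np.asarray(x, dtype=float)
        if w is None:
            return x.mean(), x.var(ddof=1)
        m = np.average(x, weights=w)
        return m, np.average((x - m) ** 2, weights=w)  # weighted variance
    m_t, v_t = moments(x_treat, w_treat)
    m_c, v_c = moments(x_ctrl, w_ctrl)
    return (m_t - m_c) / np.sqrt((v_t + v_c) / 2)

# Rule of thumb: |norm_diff(...)| > 0.25 flags problematic imbalance.
```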
Table 4: Covariate Balance Statistics

                                  | Unweighted                          | Weighted
Variable                          | Non-Adopt | Adopt | Norm. Diff.     | Non-Adopt | Adopt | Norm. Diff.

2013 Covariate Levels
% Female                          |  49.43    | 49.33 |  -0.03          |  50.48    | 50.07 |  -0.
% White                           |  81.64    | 90.48 |   0.59          |  77.91    | 79.54 |   0.
% Hispanic                        |   9.64    |  8.23 |  -0.10          |  17.01    | 18.86 |   0.
Unemployment Rate                 |   7.61    |  8.01 |   0.16          |   7.00    |  8.01 |   0.
Poverty Rate                      |  19.28    | 16.53 |  -0.42          |  17.24    | 15.29 |  -0.
Median Income                     |  43.04    | 47.97 |   0.43          |  49.31    | 57.86 |   0.

2014 - 2013 Covariate Differences
% Female                          |  -0.02    | -0.02 |   0.00          |   0.02    |  0.01 |  -0.
% White                           |  -0.21    | -0.21 |   0.01          |  -0.32    | -0.33 |  -0.
% Hispanic                        |   0.20    |  0.21 |   0.04          |   0.25    |  0.33 |   0.
Unemployment Rate                 |  -1.16    | -1.30 |  -0.21          |  -1.08    | -1.36 |  -0.
Poverty Rate                      |  -0.55    | -0.28 |   0.14          |  -0.41    | -0.35 |   0.
Median Income                     |   0.98    |  1.11 |   0.06          |   1.10    |  1.74 |   0.

Notes: This table reports the covariate balance between adopting and non-adopting states. In the top panel, we report the averages and standardized differences of each variable, measured in 2013, by adoption status. All variables are measured in percentage values, except for median household income, which is measured in thousands of U.S. dollars. In the bottom panel, we report the averages and standardized differences of the county-level long differences of each variable between 2014 and 2013. We report both weighted and unweighted measures of the averages to correspond to the different estimation methods for including covariates in a 2×2 setting.
We find meaningful imbalance in several baseline measures. Expansion counties in 2013 were whiter and had a higher unemployment rate, despite a lower poverty rate and higher median income, than non-expansion counties. Because DiD uses changes in outcomes, researchers sometimes argue that the effect of pre-treatment variables is differenced out. This logic does not hold, though, if baseline covariates are related to untreated potential outcome trends themselves. The imbalance in panel A of Table 4 will lead to violations of parallel trends to the extent that counties starting out with different racial compositions or income distributions would have had different mortality trends even absent Medicaid expansion.
Nevertheless, balance in covariate changes can be informative about parallel trends as well. Panel B of Table 4 reports average changes by group between 2013 and 2014 as well as normalized differences. Many of the imbalances evident in baseline levels change, or even flip signs, when measured in changes. Unemployment, for example, was higher in expansion states in 2013 but fell faster. To the extent that these changes are important determinants of ΔY_{i,t}(0), these results could suggest that Assumption PT is violated.

Why do we say "could"? A major challenge in interpreting cross-group gaps in ΔX_{i,t} involves deciding which variables are truly covariates and which are mechanisms/outcomes. If an element of X_{i,t} cannot be affected by the treatment, it is a (strictly exogenous) covariate, and differential changes in exogenous covariates may indicate a PT violation: since the treatment cannot have caused X_{i,t} to change (by assumption), something else that differs across groups and over time must have. Since little research suggests an effect of Medicaid expansion on unemployment, this may be a good assumption. On the other hand, if Medicaid expansion can change the demographic and economic composition of its counties, then differential changes in these variables may actually be a consequence of the expansion itself.(12) If so, then differential post-treatment changes in them would not necessarily indicate a parallel trends violation; they could partially reflect a causal effect. Like the plausibility of Assumption PT itself, whether something is a covariate or a mechanism is not a data question per se. It requires context-specific knowledge (or assumptions) about how treatment works.
Having detected covariate imbalance that casts doubt on Assumption PT, how should we proceed to estimate ATT(2)? Because the imbalance documented in Table 4 suggested that unconditional parallel trends may not hold, our goal is to develop a DiD identification strategy based on an assumption that accounts for this imbalance. Working from a conditional parallel trends assumption shows how to construct ATT(2) from 2×2 comparisons that are each conditioned on specific covariate values, thus addressing the imbalance problem.

Let X_i be a vector of observed determinants of changes in Y_{i,t}(0). Here, we purposefully omit the time subscript on X_i because the covariates in this section can be time-invariant, such as fixed characteristics or baseline values (X_{i,t=1}), or time-varying in the sense of including values from the second period, X_{i,t=2}. The empirical content of a "new" identification assumption that incorporates X_i, henceforth the conditional parallel trends (CPT) assumption, is formalized as follows.
Assumption CPT (2×2 Conditional Parallel Trends). The (weighted) average change of Y_{i,t=2}(0) from Y_{i,t=1}(0) is the same between treated and comparison units that share the same covariate values.

(12) In fact, comparing mean covariate changes in expansion and non-expansion counties is the same as using X_{i,t} as the outcome in a 2×2 DiD estimator.

assumptions. This expression has a clear intuition: the ATT(2) is equal to the path of outcomes experienced by the treated group (the term on the left) minus the average path of outcomes in the comparison group for each value of the covariates, averaged over the treated group's distribution of covariates (the term on the right).
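One concrete way to operationalize this expression is regression adjustment: fit a model for the comparison group's outcome changes given covariates, predict those changes for the treated units, and average. The sketch below is our own unweighted illustration of that logic, with a linear outcome model chosen purely for simplicity; it is not necessarily the paper's preferred estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def att_regression_adjustment(dy, D, X):
    """ATT(2) as the treated group's average outcome change minus the
    comparison group's covariate-specific change, averaged over the
    treated units' covariate distribution."""
    dy, D, X = np.asarray(dy), np.asarray(D), np.asarray(X)
    # Model E[dY | X, D=0] using comparison units only.
    trend_model = LinearRegression().fit(X[D == 0], dy[D == 0])
    # Average predicted untreated trends over the treated covariate values.
    return dy[D == 1].mean() - trend_model.predict(X[D == 1]).mean()
```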
Unlike in an unconditional DiD, moving from the population identification result in equation (4.2) to sample analogs is a challenge unless the covariates are discrete and the conditional expectations themselves are easily calculable. With continuous covariates, or many discrete ones, it may not be feasible to construct E_ω[ΔY_{i,t=2} | X_i, D_i = 0]. Conditional DiD estimation, therefore, uses additional econometric techniques to bridge this gap.

We begin, however, by discussing how regression DiD estimators that include covariates relate to the assumptions used for identification in (4.2). Because the TWFE specification in (3.7) recovers the ATT(2) in 2×2 DiD setups without covariates, it is natural to extend this logic to regressions with covariates. Indeed, this is by far the most popular approach adopted by practitioners, arguably because it is both easy and familiar. A typical regression specification is
Y_{i,t} = θ_t + η_i + β_treat D_{i,t} + X′_{i,t} β_covs + e_{i,t},   (4.3)
where the unit and time fixed effects, treatment status, and covariates have already been defined, e_{i,t} is an error term, and β_treat is interpreted as the parameter of interest. A related specification explicitly controls for baseline covariates by replacing X_{i,t} with interactions of the pre-treatment covariates and a post-treatment dummy,
Y_{i,t} = θ_t + η_i + β_{treat,2} D_{i,t} + (1{t = 2} X_{i,t=1})′ β_{covs,2} + e_{i,t}.   (4.4)
In Table 5, we report the OLS and weighted least squares estimates of the unconditional 2×2 DiD, of β_treat from (4.3), and of β_{treat,2} from (4.4), along with their cluster-robust standard errors, using the covariates from Table 4.
Table 5: Regression 2×2 DiD with Covariates

                   | Unweighted                          | Weighted
                   | No Covs | X_{i,t=2013} | X_{i,t}    | No Covs | X_{i,t=2013} | X_{i,t}
                   |   (1)   |     (2)      |   (3)      |   (4)   |     (5)      |   (6)
Medicaid Expansion |  0.12   |    -2.35     |  -0.49     | -2.56*  |    -2.56     |  -1.37
                   | (3.75)  |    (4.29)    |  (3.83)    | (1.49)  |    (1.78)    |  (1.62)

Notes: This table reports the regression 2×2 DiD estimates comparing counties that expanded Medicaid in 2014 to counties that did not expand Medicaid, adjusting for the inclusion of covariates (percent female, percent white, percent Hispanic, the unemployment rate, the poverty rate, and median household income). Columns 1-3 report unweighted regression results, while columns 4-6 weight by county population aged 20-64 in 2013. Columns 1 and 4 report results without covariates, columns 2 and 5 adjust for the baseline levels of the covariates in 2013, and columns 3 and 6 control for the time-varying covariate values in 2014 and 2013. Standard errors (in parentheses) are clustered at the county level.
Although only one covariate-adjusted estimate in Table 5 is (marginally) statistically significant, the point estimates differ noticeably. In the unweighted case, adjusting for the 2013 levels of the covariates decreases the estimated effect of Medicaid expansion on short-run mortality rates from a point estimate of roughly 0.12 to -2.35. However, if we include their time-varying values instead, we estimate an effect of -0.49, a large difference. We find a similar result when using weighted regressions; while the coefficient remains fairly constant (-2.56) when using 2013 values of the covariates, it attenuates to -1.37 if we use (4.3).

The jump from the conditional DiD identification result in (4.2) to the TWFE estimators in (4.3) and (4.4) skips a crucial question about β_treat or β_{treat,2}: do they equal the target parameter ATT(2) under the conditional parallel trends assumption? It turns out that the close relationship between regression DiD, ATT(2), and parallel trends in a design without covariates does not hold with covariates. The issues come from exactly what kinds of covariates are effectively being "controlled for" in these specifications and how the regression estimator combines outcome trends for covariate sub-groups. Note that in our two-period setup, (4.3) and (4.4) are respectively equivalent to (with some abuse of notation),
ΔY_{i,t=2} = α + β_treat D_i + ΔX′_{i,t=2} β_covs + Δe_{i,t=2},
ΔY_{i,t=2} = α + β_{treat,2} D_i + X′_{i,t=1} β_{covs,2} + Δe_{i,t=2}.
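These differenced forms are straightforward to estimate directly. Below is a sketch with statsmodels, assuming a frame with one row per county and our own (hypothetical) column names for the outcome change, the covariate changes, and the 2013 covariate levels:

```python
import statsmodels.formula.api as smf

# 'wide' has one row per county: d_mort is the 2013-to-2014 mortality change,
# D the 2014-expansion dummy; covariates appear in changes and 2013 levels.

# Differenced form of (4.3): controls enter as covariate *changes*.
m43 = smf.ols("d_mort ~ D + d_female + d_white + d_hispanic + d_unemp"
              " + d_poverty + d_income", data=wide).fit(cov_type="HC1")

# Differenced form of (4.4): controls enter as *baseline 2013 levels*.
m44 = smf.ols("d_mort ~ D + female_2013 + white_2013 + hispanic_2013"
              " + unemp_2013 + poverty_2013 + income_2013",
              data=wide).fit(cov_type="HC1")

# With one observation per county, heteroskedasticity-robust standard errors
# coincide with county-level clustering. Compare the D coefficients with
# columns 3 and 2 of Table 5, respectively.
print(m43.params["D"], m44.params["D"])
```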
The first thing that is clear from these representations is that, because time-invariant variables drop out of equation (4.3), a TWFE specification can only account for differential trends related to baseline covariate levels if they enter as interactions with the post-treatment dummy, as in equation (4.4). The exact regression specification, therefore, determines the implied conditional parallel trends assumption. Controlling for annual poverty rates really means controlling for poverty changes, and areas that are poor are not the same as areas that are becoming poor.

Another limitation evident in (4.3) relates to "bad controls." Whenever X_{i,t=2} is affected by the treatment, conditioning on it (in any way) can bias estimates of the ATT(2). If Medicaid expansion lowered poverty rates, for example, then including 2014 poverty rates or the 2013-2014 change in poverty rates as a covariate is problematic. This echoes our discussion about testing balance in ΔX_{i,t} in the sense that time-varying covariates must be unaffected by treatment in order to interpret imbalance in their trends as a source of bias, and to be able to control for them to address that bias. See Caetano, Callaway, Payne and Rodrigues (2022) for a discussion.

Suppose we have decided on which variables to include in a conditional parallel trends assumption and whether to measure them in levels or changes. If Assumptions CPT and SO hold with respect to this set of covariates, does β_treat recover the ATT(2)? In the DiD context, Caetano and Callaway (2024) tackle exactly this question. They show that β_treat equals a weighted average of conditional average treatment effects, defined as ATT_{x_k}(2) ≡ E_ω[Y_{i,t=2}(1) − Y_{i,t=2}(0) | D_i = 1, X_i = x_k], with weights that may not be convex, plus three bias terms reflecting misspecification either