The NonlinearDiD package implements difference-in-differences estimators for staggered treatment adoption with binary, count, and other nonlinear outcomes.
The landmark Callaway & Sant’Anna (2021) framework assumes that parallel trends holds on the probability scale:
\[E[Y_{it}(0) - Y_{it-1}(0) \mid G = g] = E[Y_{it}(0) - Y_{it-1}(0) \mid G = \infty]\]
For continuous outcomes (wages, test scores), this is natural. But for binary outcomes (employment, hospitalization, default), parallel trends in probabilities implies parallel trends in log-odds only under a very specific functional form restriction.
Roth & Sant’Anna (2023) show this creates two distinct problems:
Sensitivity to functional form: Whether pre-treatment trends appear parallel depends entirely on which scale you test.
Treatment effect heterogeneity conflation: In nonlinear models, the “treatment effect” includes both the true effect and Jensen’s inequality adjustments.
This package addresses both problems.
# From CRAN (forthcoming)
install.packages("NonlinearDiD")
# Development version from GitHub
remotes::install_github("example/NonlinearDiD")dat <- sim_binary_panel(
n = 800,
nperiods = 8,
prop_treated = 0.6,
n_cohorts = 3,
true_att = c(0.15, 0.25, 0.20), # heterogeneous treatment effects
base_prob = 0.3,
seed = 42
)
head(dat)
#> id period y g D alpha_i x1 x2
#> 1 1 1 1 0 0 0.8921 0.6888 1
#> 2 1 2 1 0 0 0.8921 0.6888 1
#> 3 1 3 0 0 0 0.8921 0.6888 1
#> 4 1 4 1 0 0 0.8921 0.6888 1
#> 5 1 5 0 0 0 0.8921 0.6888 1
#> 6 1 6 0 0 0 0.8921 0.6888 1
cat("Treatment cohorts:", table(dat$g[dat$period == 1]), "\n")
#> Treatment cohorts: 320 160 160 160
cat("Baseline outcome rate:", round(mean(dat$y[dat$D == 0]), 3), "\n")
#> Baseline outcome rate: 0.343res <- nonlinear_attgt(
data = dat,
yname = "y",
tname = "period",
idname = "id",
gname = "g",
xformla = ~ x1 + x2,
outcome_model = "logit",
estimand = "att",
control_group = "nevertreated",
doubly_robust = TRUE
)
summary(res)
#>
#> === Nonlinear DiD Summary ===
#>
#> Model: logit
#> Estimand: att
#> Control group: nevertreated
#>
#> --- Post-treatment ATT(g,t) ---
#> Mean ATT: 0.0327
#> Median ATT: 0.0287
#> Min ATT: -0.0531 | Max ATT: 0.1634
#> N(g,t) cells: 14
#>
#> --- Pre-treatment ATT(g,t) (should be ~0 under PT) ---
#> Mean: 0.0399 | SD: 0.0568
#> N(g,t) cells: 10
#>
#> --- Convergence ---
#> Converged: 24 / 24The nonlinear_attgt() function estimates a separate ATT
for each (group, time) pair. The key difference from the linear case: we
model the change in log-odds for control units as the
counterfactual, not the change in probability.
agg_dynamic <- nonlinear_aggte(res, type = "dynamic")
print(agg_dynamic)
#>
#> === Nonlinear DiD: Aggregated ATT ===
#>
#> Aggregation type: dynamic
#> Outcome model: logit
#>
#> Overall ATT: 0.0327 (SE: 0.0130, t = 2.508, p = 0.0121)
#> 95% CI: [0.0071, 0.0582]
#>
#> label att se ci_lo ci_hi n_groups post
#> -6 0.155126 0.04903 0.059034 0.25122 1 FALSE
#> -5 0.082694 0.04779 -0.010974 0.17636 1 FALSE
#> -4 0.108083 0.04850 0.013032 0.20313 1 FALSE
#> -3 0.002701 0.03444 -0.064793 0.07019 2 FALSE
#> -2 0.024059 0.03406 -0.042694 0.09081 2 FALSE
#> -1 0.000000 0.00000 0.000000 0.00000 3 FALSE
#> 0 0.005890 0.02791 -0.048816 0.06060 3 TRUE
#> 1 0.062026 0.02836 0.006445 0.11761 3 TRUE
#> 2 0.002272 0.03493 -0.066189 0.07073 2 TRUE
#> 3 0.065044 0.03432 -0.002223 0.13231 2 TRUE
#> 4 0.045172 0.03519 -0.023797 0.11414 2 TRUE
#> 5 -0.001089 0.04710 -0.093395 0.09122 1 TRUE
#> 6 0.029827 0.04741 -0.063086 0.12274 1 TRUE
plot(agg_dynamic)agg_group <- nonlinear_aggte(res, type = "group")
print(agg_group)
#>
#> === Nonlinear DiD: Aggregated ATT ===
#>
#> Aggregation type: group
#> Outcome model: logit
#>
#> Overall ATT: 0.0327 (SE: 0.0130, t = 2.508, p = 0.0121)
#> 95% CI: [0.0071, 0.0582]
#>
#> label att se ci_lo ci_hi n_periods post
#> 2 0.003981 0.01821 -0.0317023 0.03966 7 TRUE
#> 4 0.043094 0.02214 -0.0003072 0.08650 5 TRUE
#> 7 0.107062 0.03449 0.0394586 0.17467 2 TRUEThe group-level ATT tells us: cohort 4 experienced the largest treatment effect (they adopted treatment earlier, when baseline rates were lower).
pt <- nonlinear_pretest(res, plot = FALSE)
print(pt)
#>
#> === Nonlinear DiD Pre-Treatment Parallel Trends Test ===
#>
#> Test type: joint
#> Significance level: 0.05
#>
#> --- Pre-period ATT(g,t) estimates ---
#> group time att se tstat pval sig
#> 1 2 1 0.00000000 0.00000000 NaN NaN NA
#> 9 4 1 0.02766112 0.04872714 0.5676738 0.570256489 FALSE
#> 10 4 2 0.02104178 0.04841993 0.4345685 0.663875632 FALSE
#> 11 4 3 0.00000000 0.00000000 NaN NaN NA
#> 17 7 1 0.15512568 0.04902738 3.1640623 0.001555835 TRUE
#> 18 7 2 0.08269389 0.04779045 1.7303435 0.083568925 FALSE
#> 19 7 3 0.10808258 0.04849612 2.2286851 0.025834862 TRUE
#> 20 7 4 -0.02225973 0.04867279 -0.4573342 0.647430846 FALSE
#> 21 7 5 0.02707614 0.04790974 0.5651490 0.571972408 FALSE
#> 22 7 6 0.00000000 0.00000000 NaN NaN NA
#>
#> --- Joint Test ---
#> REJECT parallel trends in pre-period (joint chi2(7) = 19.012, p = 0.0081 < 0.05).
#> Estimates may be biased.The pre-trends test examines whether ATT(g,t) = 0 for all pre-treatment periods. Critical caveat: A failure to reject pre-trends does NOT validate parallel trends — it only fails to find evidence against it. For binary outcomes, the test is conducted on the same scale as the estimator.
For some binary outcome applications, the odds-ratio DiD is preferable because: - It is invariant to the choice of reference group - It does not require parallel trends in probabilities (only in log-odds) - It has a natural multiplicative interpretation
# Use simple 2-period case for clarity
dat2 <- dat[dat$period %in% c(3, 5), ]
res_or <- odds_ratio_did(
data = dat2,
yname = "y",
tname = "period",
idname = "id",
treat_period = 5,
control_period = 3,
gname = "g"
)
print(res_or)
#>
#> === Odds-Ratio DiD ===
#>
#> Interpretation: OR-DiD = 0.964 [95% CI: 0.634, 1.464]
#>
#> Log(OR-DiD): -0.0372 (SE: 0.2135)
#> 95% CI [OR]: [0.6340, 1.4642]
#> t-stat: -0.174 | p-value: 0.8618An OR-DiD of 1 means no treatment effect. Values > 1 indicate the treatment increased the odds of Y = 1.
For count outcomes (patents, hospital visits, crimes), we assume parallel trends on the log scale (multiplicative parallel trends):
\[\frac{E[Y_{it}(0)]}{E[Y_{it-1}(0)]} = \frac{E[Y_{jt}(0)]}{E[Y_{jt-1}(0)]}\]
count_dat <- sim_count_panel(
n = 600,
nperiods = 6,
prop_treated = 0.5,
true_rr = c(1.5, 2.0, 1.3), # rate ratios per cohort
base_rate = 10,
seed = 7
)
summary(count_dat$y)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.00 10.00 14.00 15.53 19.00 63.00# Staggered Poisson DiD
res_count <- nonlinear_attgt(
data = count_dat,
yname = "y",
tname = "period",
idname = "id",
gname = "g",
outcome_model = "poisson",
estimand = "att",
control_group = "nevertreated"
)
agg_count <- nonlinear_aggte(res_count, type = "dynamic")
plot(agg_count)# Simple 2x2 Poisson DiD (Wooldridge QMLE)
count_sub <- count_dat[count_dat$period %in% c(2, 4), ]
res_pois <- count_did_poisson(
count_sub,
yname = "y",
tname = "period",
idname = "id",
treat_period = 4,
control_period = 2,
gname = "g"
)
print(res_pois)
#>
#> === Count DiD (Poisson QMLE) ===
#>
#> ATT (log rate ratio): 0.2643 (SE: 0.0510)
#> Rate ratio: 1.3026 [95% CI: 1.1787, 1.4394]
#> ATT (APE, count scale): 3.9909
#> t-stat: 5.185 | p-value: 0.0000The doubly-robust estimator combines an outcome model with a propensity score model. It is consistent if either (but not necessarily both) is correctly specified.
res_dr <- binary_did_dr(
data = dat[dat$period %in% c(3, 5), ],
yname = "y",
tname = "period",
idname = "id",
treat_period = 5,
control_period = 3,
gname = "g",
xformla = ~ x1 + x2,
outcome_model = "logit"
)
print(res_dr)
#>
#> === Doubly-Robust Binary DiD ===
#>
#> ATT: -0.0080 (SE: 0.0475)
#> 95% CI: [-0.1010, 0.0851]
#> t-stat: -0.168 | p-value: 0.8670
#> N (treated): 480 | N (control): 320When we are uncertain about functional form, we can compute sharp bounds on the ATT. Under parallel trends alone (with no functional form assumption), the ATT for binary outcomes is point identified. But under weaker assumptions (e.g., only monotone treatment response), we get intervals.
bounds <- nonlinear_bounds(
data = dat,
yname = "y",
tname = "period",
idname = "id",
gname = "g",
bound_type = "manski", # widest (no assumptions)
control_group = "nevertreated"
)
# Show post-treatment bounds
post_bounds <- bounds[bounds$post, ]
head(post_bounds[, c("group", "time", "lb", "ub", "identified")])
#> group time lb ub identified
#> 2 2 2 -0.67500 0.32500 TRUE
#> 3 2 3 -0.64375 0.35625 TRUE
#> 4 2 4 -0.62500 0.37500 TRUE
#> 5 2 5 -0.58750 0.41250 TRUE
#> 6 2 6 -0.56250 0.43750 TRUE
#> 7 2 7 -0.58750 0.41250 TRUEA critical question: does it matter? When baseline probabilities are far from 0 and 1, the answer is yes — the linear DiD can be severely biased.
# Linear comparison (for illustration)
res_linear <- nonlinear_attgt(
data = dat,
yname = "y",
tname = "period",
idname = "id",
gname = "g",
outcome_model = "linear",
estimand = "att"
)
agg_lin <- nonlinear_aggte(res_linear, type = "dynamic")
agg_logit <- nonlinear_aggte(res, type = "dynamic")
# The two produce different estimates when baseline rates are moderate
cat("Linear overall ATT:", round(agg_lin$overall_att, 4), "\n")
cat("Logit overall ATT:", round(agg_logit$overall_att, 4), "\n")
cat("True ATT (avg): ", round(mean(c(0.15, 0.25, 0.20)), 4), "\n")For the logit-scale ATT(g,t) estimator, the key identifying assumption is:
Parallel Trends on the Logit Scale: For all \(t \geq 2\) and all treated cohorts \(g\):
\[[\text{logit}(E[Y_{it}(0)]) - \text{logit}(E[Y_{it-1}(0)])] \cdot \mathbf{1}_{G_i = g}\] \[= [\text{logit}(E[Y_{it}(0)]) - \text{logit}(E[Y_{it-1}(0)])] \cdot \mathbf{1}_{G_i = \infty}\]
This is a stronger assumption than CS2021’s linear parallel trends when treatment effects are heterogeneous (it restricts both the means and the full distribution).
The doubly-robust estimator for binary outcomes uses the influence function:
\[\hat{\psi}^{DR}_{g,t} = \frac{1}{n} \sum_i \left[ \frac{D_i}{\hat{p}} (\Delta Y_i - \hat{\mu}(X_i)) - \frac{(1-D_i)\hat{\pi}(X_i)}{1-\hat{\pi}(X_i)} \cdot \frac{1}{\hat{p}} (\Delta Y_i - \hat{\mu}(X_i)) \right]\]
where \(\Delta Y_i = Y_{it} - Y_{it-1}\), \(\hat{\pi}\) is the propensity score, and \(\hat{\mu}\) is the outcome regression on controls.
Callaway, B., & Sant’Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2), 200-230.
Roth, J., & Sant’Anna, P. H. C. (2023). When is parallel trends sensitive to functional form? Econometrica, 91(2), 737-747.
Wooldridge, J. M. (2023). Simple approaches to nonlinear difference-in- differences with panel data. The Econometrics Journal, 26(3), 31-66.
Sant’Anna, P. H. C., & Zhao, J. (2020). Doubly robust difference-in-differences estimators. Journal of Econometrics, 219(1), 101-122.
Manski, C. F. (1990). Nonparametric bounds on treatment effects. American Economic Review, 80(2), 319-323.