Negative-Binomial, Hurdle, Zero-Inflation, and Firth Logistic Regression with ‘fastglm’

Jared Huling

2026-05-04

Beyond family-driven GLMs, fastglm provides fitting functions for four model types that arise often in practice but require model-specific machinery on top of the standard IRLS solver:

- negative-binomial regression with jointly estimated dispersion (fastglm_nb());
- Firth bias-reduced logistic regression (fastglm(..., firth = TRUE));
- hurdle count models (fastglm_hurdle());
- zero-inflated count models (fastglm_zi()).

All four reuse the same C++ IRLS solver as fastglm() itself; the outer iterations (the joint NB MLE, the EM loop for zero-inflation, and the inner Brent search for an unknown theta in the NB hurdle / zero-inflated models) likewise live in C++. Cameron and Trivedi (1998) is the standard textbook reference for the count-data theory underlying this vignette, and Zeileis, Kleiber, and Jackman (2008) describes the pscl implementation we benchmark against.

library(fastglm)
suppressPackageStartupMessages({
    library(MASS)        # glm.nb, rnegbin, sex2-style separation data
    library(pscl)        # hurdle, zeroinfl, bioChemists
    library(logistf)     # canonical Firth implementation
    library(microbenchmark)
})

Negative-binomial regression

fastglm_nb() is a drop-in alternative to MASS::glm.nb() (Venables and Ripley, 2002) for the NB2 model with Var(Y) = mu + mu^2 / theta. Both theta and beta are estimated by maximum likelihood, alternating an IRLS update for beta at fixed theta with a 1-D Brent root-find for theta at fixed beta. Both loops run in C++.
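The alternating scheme can be sketched in a few lines of plain R, with stats::glm.fit standing in for the C++ IRLS step and optimize() standing in for the Brent root-find. This is illustrative only (nb_alternate is not a package function), and it omits the convergence checks the real driver performs:

```r
library(MASS)  # negative.binomial family, rnegbin

set.seed(1)
n  <- 2000
X  <- cbind(1, rnorm(n))
mu <- exp(X %*% c(0.5, 0.3))
y  <- rnegbin(n, mu = mu, theta = 2)

nb_alternate <- function(X, y, theta = 1, iter = 25) {
    for (k in seq_len(iter)) {
        # beta-step: IRLS at the current fixed theta
        fit <- glm.fit(X, y, family = negative.binomial(theta, link = "log"))
        mu  <- fit$fitted.values
        # theta-step: 1-D likelihood maximization at the current mu
        theta <- optimize(function(th)
            sum(dnbinom(y, size = th, mu = mu, log = TRUE)),
            interval = c(1e-3, 1e3), maximum = TRUE)$maximum
    }
    list(coef = coef(fit), theta = theta)
}
nb_alternate(X, y)
```

On this simulated data the recovered coefficients land near (0.5, 0.3) and theta near 2, mirroring what fastglm_nb() computes in C++.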

A small-sample comparison on MASS::quine (school absences):

data(quine)
X <- model.matrix(~ Eth + Sex + Age + Lrn, data = quine)
y <- quine$Days

fit_f <- fastglm_nb(X, y)
fit_m <- MASS::glm.nb(Days ~ Eth + Sex + Age + Lrn, data = quine)

c(theta_fastglm = fit_f$theta, theta_glm.nb = fit_m$theta)
#> theta_fastglm  theta_glm.nb 
#>      1.274893      1.274893

max(abs(unname(coef(fit_f)) - unname(coef(fit_m))))
#> [1] 3.004626e-08
abs(as.numeric(logLik(fit_f)) - as.numeric(logLik(fit_m)))
#> [1] 0

Coefficients and theta agree to about 1e-8: both routines converge to the same MLE.

A timing comparison on a moderately sized simulated NB(theta = 2) example:

set.seed(1)
n  <- 5e4
X  <- cbind(1, matrix(rnorm(n * 3), n, 3))
mu <- exp(X %*% c(0.5, 0.4, -0.2, 0.3))
y  <- MASS::rnegbin(n, mu = mu, theta = 2)
df <- data.frame(y = y, x1 = X[, 2], x2 = X[, 3], x3 = X[, 4])

mb_nb <- microbenchmark(
    fastglm_nb = fastglm_nb(X, y),
    glm.nb     = MASS::glm.nb(y ~ x1 + x2 + x3, data = df),
    times = 5L
)
print(mb_nb)
#> Unit: milliseconds
#>        expr       min        lq      mean    median        uq       max neval
#>  fastglm_nb  27.27906  27.54245  28.20274  28.64723  28.66466  28.88032     5
#>      glm.nb 257.68812 300.93619 294.54855 303.13879 304.43976 306.53990     5
#>  cld
#>   a 
#>    b

The native NB family kernel is also exposed directly through negbin(theta, link) for cases where theta is known (or estimated separately). Holding theta fixed at the joint MLE recovers the fastglm_nb() regression coefficients up to solver tolerance, since fastglm_nb() is just IRLS at the converged theta:

fit_joint <- fastglm_nb(X, y)
fit_known <- fastglm(X, y, family = negbin(theta = fit_joint$theta, link = "log"))
max(abs(unname(coef(fit_known)) - unname(coef(fit_joint))))
#> [1] 1.624203e-07

Firth bias-reduced logistic regression

Logistic regression breaks down under (quasi-)separation: the MLE lies at infinity and the Wald standard errors blow up. Firth (1993) modifies the score by adding (1/2) ∂ log|I(beta)| / ∂beta, which both removes the leading O(1/n) bias and yields finite estimates under separation. fastglm exposes the penalty as a single flag:

fastglm(x, y, family = binomial(), firth = TRUE)

For the binomial logit, the penalty collapses to a per-iteration adjustment of the working response by h_i (0.5 - mu_i) / (mu_i (1 - mu_i)), where h_i is the leverage of observation i; no new family code is needed, and the cost per IRLS iteration is one extra streaming pass to compute the leverages.
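A pure-R sketch of this adjusted IRLS makes the mechanics concrete (firth_logit is a hypothetical helper, not a package function; the package does the equivalent in C++):

```r
firth_logit <- function(X, y, maxit = 100, tol = 1e-10) {
    beta <- numeric(ncol(X))
    for (k in seq_len(maxit)) {
        eta <- drop(X %*% beta)
        mu  <- plogis(eta)
        w   <- mu * (1 - mu)
        # leverages of the weighted design: diag of W^{1/2} X (X'WX)^{-1} X' W^{1/2}
        XW <- X * sqrt(w)
        h  <- rowSums((XW %*% solve(crossprod(XW))) * XW)
        # working response with the Firth adjustment h (0.5 - mu) / w
        z  <- eta + (y - mu) / w + h * (0.5 - mu) / w
        beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
        if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
        beta <- beta_new
    }
    beta
}

# perfectly separated toy data: ordinary logistic regression diverges,
# but the Firth-adjusted iteration returns finite estimates
x <- c(-2, -1, -0.5, 0.5, 1, 2)
y <- c( 0,  0,    0,   1, 1, 1)
firth_logit(cbind(1, x), y)
```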

The canonical demonstration uses the sex2 dataset shipped with logistf (Heinze and Schemper, 2002):

data(sex2, package = "logistf")
X <- model.matrix(~ age + oc + vic + vicl + vis + dia, data = sex2)
y <- sex2$case

fit_f <- fastglm(X, y, family = binomial(), firth = TRUE)
fit_l <- logistf(case ~ age + oc + vic + vicl + vis + dia, data = sex2)

cbind(fastglm = unname(coef(fit_f)), logistf = unname(coef(fit_l)))
#>          fastglm     logistf
#> [1,]  0.12025405  0.12025405
#> [2,] -1.10598133 -1.10598131
#> [3,] -0.06881673 -0.06881673
#> [4,]  2.26887465  2.26887464
#> [5,] -2.11140819 -2.11140817
#> [6,] -0.78831695 -0.78831694
#> [7,]  3.09601183  3.09601166
max(abs(unname(coef(fit_f)) - unname(coef(fit_l))))
#> [1] 1.67938e-07

The two implementations land on the same penalized MLE to about 2e-7. The runtime difference is substantial because fastglm’s Firth fitting reuses the existing C++ solver:

mb_firth <- microbenchmark(
    fastglm = fastglm(X, y, family = binomial(), firth = TRUE),
    logistf = logistf(case ~ age + oc + vic + vicl + vis + dia, data = sex2),
    times = 10L
)
print(mb_firth)
#> Unit: microseconds
#>     expr      min       lq      mean    median       uq      max neval cld
#>  fastglm  181.753  195.283  214.3357  211.2935  225.951  282.408    10  a 
#>  logistf 6785.336 6794.684 6853.8183 6863.2565 6894.519 6922.932    10   b

firth = TRUE is currently restricted to family = binomial(link = "logit") on a dense matrix x; probit / cloglog use a different score correction and will be added separately.

Hurdle models

Hurdle models, introduced by Mullahy (1986), factorize a count distribution into two independent pieces:

- a binary model for zero versus positive counts (the “hurdle”), and
- a zero-truncated count model for the strictly positive outcomes.

Because the two parts share no parameters, the joint likelihood factorizes and they can be fit independently. fastglm’s C++ driver fits both parts using the same IRLS solver as fastglm(), with new FAM_POIS_TRUNC_* / FAM_NB_TRUNC_* family codes that handle the truncation correction stably (expm1, log1p near mu = 0).
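The effect of the stable parameterization is easy to see directly. The helper below is an illustrative stand-in for the truncated-family density, not package code: the truncated-Poisson normalizer log(1 - exp(-mu)) is written as log(-expm1(-mu)), which avoids catastrophic cancellation when mu is tiny:

```r
# stable zero-truncated Poisson log-density:
# log f(y | y > 0) = dpois(y, mu, log = TRUE) - log(1 - exp(-mu))
dpois_trunc_log <- function(y, mu) {
    dpois(y, mu, log = TRUE) - log(-expm1(-mu))
}

# the naive normalizer loses all precision for small mu:
mu <- 1e-12
c(stable = log(-expm1(-mu)),   # accurate, about log(mu)
  naive  = log(1 - exp(-mu)))  # dominated by rounding error in 1 - exp(-mu)
```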

The formula interface uses the Formula package convention y ~ x1 + x2 | z1 + z2: the right-hand side after | specifies the zero-part design, which defaults to the count-part design when absent.
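For instance, a two-sided formula can give the zero part a smaller design than the count part (an illustrative call; fit2 is not used elsewhere in this vignette):

```r
data(bioChemists, package = "pscl")
# count part: fem + kid5 + ment; zero part (after `|`): ment only
fit2 <- fastglm_hurdle(art ~ fem + kid5 + ment | ment,
                       data = bioChemists, dist = "poisson")
```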

data(bioChemists, package = "pscl")
fit_f <- fastglm_hurdle(art ~ ., data = bioChemists, dist = "poisson")
fit_p <- pscl::hurdle (art ~ ., data = bioChemists, dist = "poisson")

max(abs(unname(coef(fit_f, "count")) - unname(fit_p$coefficients$count)))
#> [1] 3.144497e-07
max(abs(unname(coef(fit_f, "zero"))  - unname(fit_p$coefficients$zero)))
#> [1] 1.27043e-10
abs(as.numeric(logLik(fit_f)) - as.numeric(logLik(fit_p)))
#> [1] 1.07093e-10

coef() and vcov() accept a model = c("full", "count", "zero") argument so each part can be inspected separately:

coef(fit_f, model = "count")
#> (Intercept)    femWomen  marMarried        kid5         phd        ment 
#>  0.67113934 -0.22858262  0.09648498 -0.14218724 -0.01272657  0.01874550
coef(fit_f, model = "zero")
#> (Intercept)    femWomen  marMarried        kid5         phd        ment 
#>  0.23679601 -0.25115113  0.32623358 -0.28524872  0.02221940  0.08012135

For NB count parts, the dispersion is estimated by an inner Brent MLE that runs between outer IRLS iterations; the outer-loop tolerance is controlled by outer.tol / outer.maxit.

A timing comparison on a 4000-observation simulated example:

set.seed(11)
n  <- 4000
x1 <- rnorm(n);  x2 <- rnorm(n)
lam    <- exp(0.7 + 0.4 * x1 - 0.3 * x2)
is_pos <- rbinom(n, 1, plogis(-0.4 + 0.5 * x1 + 0.2 * x2))
yt     <- integer(n)
for (i in seq_len(n)) {
    repeat { v <- rpois(1, lam[i]); if (v > 0) { yt[i] <- v; break } }
}
y  <- ifelse(is_pos == 1, yt, 0L)
df <- data.frame(y = y, x1 = x1, x2 = x2)

mb_hurdle <- microbenchmark(
    fastglm_hurdle = fastglm_hurdle(y ~ x1 + x2, data = df, dist = "poisson"),
    pscl_hurdle    = pscl::hurdle (y ~ x1 + x2, data = df, dist = "poisson"),
    times = 5L
)
print(mb_hurdle)
#> Unit: milliseconds
#>            expr       min        lq      mean    median        uq       max
#>  fastglm_hurdle  1.187319  1.191624  1.285629  1.234223  1.390966  1.424012
#>     pscl_hurdle 15.933543 16.948949 16.986948 17.167397 17.327912 17.556938
#>  neval cld
#>      5  a 
#>      5   b

Zero-inflated models

Zero inflation, introduced by Lambert (1992), differs from a hurdle model in that the two components are tied together by a latent mixture indicator: a y = 0 outcome can come either from the inflation component (with probability pi_i) or from the count component (with probability 1 - pi_i, as a count-side zero). The likelihood contribution for y = 0 is

\[ \Pr(Y_i = 0) = \pi_i + (1 - \pi_i) f(0; \mu_i) \]

and for y > 0 it is (1 - pi_i) f(y; mu_i). The two-component mixture rules out a closed-form factorization, so fastglm uses an EM algorithm:

- E-step: for each observed zero, compute the posterior probability z_i that it came from the inflation component, pi_i / (pi_i + (1 - pi_i) f(0; mu_i));
- M-step: refit the zero part as a weighted logistic regression with response z, and the count part as a count regression with weights 1 - z_i (plus the inner theta MLE in the NB case).

The final observed-information vcov comes from a numerical Jacobian of the analytical observed score at the EM fixed point, which is stable and cheap (block-diagonal per (gamma, beta, theta)). The complete EM driver — E-step, both M-steps, the inner theta MLE, and the score Jacobian — runs in C++.
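For the Poisson case the whole loop fits in a short pure-R sketch (zip_em is illustrative, not a package function; the C++ driver additionally handles the NB dispersion and the score Jacobian):

```r
# EM for the zero-inflated Poisson: Xc is the count design, Xz the zero design.
zip_em <- function(Xc, Xz, y, maxit = 500, tol = 1e-10) {
    gamma <- numeric(ncol(Xz)); beta <- numeric(ncol(Xc)); ll_old <- -Inf
    for (k in seq_len(maxit)) {
        pi_ <- plogis(drop(Xz %*% gamma)); mu <- exp(drop(Xc %*% beta))
        # E-step: posterior probability that an observed zero is "structural"
        z <- ifelse(y == 0, pi_ / (pi_ + (1 - pi_) * exp(-mu)), 0)
        # M-steps: weighted logistic fit for gamma (quasibinomial accepts the
        # fractional response z), weighted Poisson fit for beta
        gamma <- coef(glm.fit(Xz, z, family = quasibinomial()))
        beta  <- coef(glm.fit(Xc, y, weights = 1 - z, family = poisson()))
        # observed-data log-likelihood; EM increases this monotonically
        pi_ <- plogis(drop(Xz %*% gamma)); mu <- exp(drop(Xc %*% beta))
        ll  <- sum(ifelse(y == 0, log(pi_ + (1 - pi_) * exp(-mu)),
                          log1p(-pi_) + dpois(y, mu, log = TRUE)))
        if (ll - ll_old < tol) break
        ll_old <- ll
    }
    list(count = beta, zero = gamma, loglik = ll)
}
```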

The formula syntax matches fastglm_hurdle() exactly:

fit_f <- fastglm_zi(art ~ ., data = bioChemists, dist = "poisson",
                    em.tol = 1e-10, em.maxit = 300L)
fit_p <- pscl::zeroinfl(art ~ ., data = bioChemists, dist = "poisson")

max(abs(unname(coef(fit_f, "count")) - unname(fit_p$coefficients$count)))
#> [1] 2.019915e-05
max(abs(unname(coef(fit_f, "zero"))  - unname(fit_p$coefficients$zero)))
#> [1] 4.132504e-05
abs(as.numeric(logLik(fit_f)) - as.numeric(logLik(fit_p)))
#> [1] 3.170896e-07

EM is iterative, so the agreement is slightly looser than for hurdle (where the two parts are closed-form independent fits), but coefficients and the joint log-likelihood still match to about 1e-5.

A timing comparison on a simulated zero-inflated Poisson:

set.seed(21)
n  <- 3000
x1 <- rnorm(n);  x2 <- rnorm(n)
eta_c <- 0.7 + 0.4 * x1 - 0.3 * x2
eta_z <- -0.4 + 0.5 * x1 + 0.2 * x2
z     <- rbinom(n, 1, plogis(eta_z))
y     <- ifelse(z == 1, 0L, rpois(n, exp(eta_c)))
df    <- data.frame(y = y, x1 = x1, x2 = x2)

mb_zi <- microbenchmark(
    fastglm_zi = fastglm_zi(y ~ x1 + x2, data = df, dist = "poisson"),
    pscl_zi    = pscl::zeroinfl(y ~ x1 + x2, data = df, dist = "poisson"),
    times = 5L
)
print(mb_zi)
#> Unit: milliseconds
#>        expr       min        lq      mean    median        uq      max neval
#>  fastglm_zi  9.547096  9.674073  9.760296  9.742502  9.753695 10.08411     5
#>     pscl_zi 25.020619 25.383756 26.361393 26.381737 27.159507 27.86135     5
#>  cld
#>   a 
#>    b

References