--- title: "Stability-Selection for GAMLSS with SelectBoost.gamlss" shorttitle: "Stability-Selection for GAMLSS" author: - name: "Frédéric Bertrand" affiliation: - Cedric, Cnam, Paris email: frederic.bertrand@lecnam.net date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Stability-Selection for GAMLSS with SelectBoost.gamlss} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} #file.edit(normalizePath("~/.Renviron")) LOCAL <- identical(Sys.getenv("LOCAL"), "TRUE") #LOCAL=TRUE knitr::opts_chunk$set(purl = LOCAL) knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` This vignette demonstrates bootstrap stability-selection for GAMLSS models. ## What you'll learn * How to launch `sb_gamlss()` with correlated-selectBoost bootstraps for μ and σ. * How to interpret `selection_table()`, `plot_sb_gamlss()`, and `effect_plot()` outputs. * How to mix engines across parameters (`engine`, `engine_sigma`, `engine_tau`) and keep factor/spline terms grouped. * How to automate c0 exploration with `sb_gamlss_c0_grid()`, `autoboost_gamlss()`, and `fastboost_gamlss()`. > **Conference highlight.** SelectBoost for GAMLSS was presented at the Joint Statistics Meetings 2024 (Portland, OR) in the talk *"An Improvement for Variable Selection for Generalized Additive Models for Location, Shape and Scale and Quantile Regression"* by Frédéric Bertrand and M. Maumy. The presentation stressed how correlation-aware resampling sharpens variable selection for GAMLSS and quantile regression problems even when predictors are numerous and strongly correlated. ```{r, cache=TRUE, eval=LOCAL} set.seed(1) library(gamlss) library(SelectBoost.gamlss) n <- 400 x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n) mu <- 1 + 1.5*x1 - 1.2*x3 y <- gamlss.dist::rNO(n, mu = mu, sigma = 1) dat <- data.frame(y, x1, x2, x3) fit <- sb_gamlss( y ~ 1, mu_scope = ~ x1 + x2 + x3, sigma_scope = ~ x1 + x2, family = gamlss.dist::NO(), data = dat, B = 50, sample_fraction = 0.7, pi_thr = 0.6, k = 2, pre_standardize = TRUE, trace = FALSE ) fit$final_formula sel <- selection_table(fit) sel <- sel[order(sel$parameter, -sel$prop), , drop = FALSE] head(sel, n = min(8L, nrow(sel))) stable <- sel[sel$prop >= fit$pi_thr, , drop = FALSE] if (nrow(stable)) { stable } else { cat("No terms reached the stability threshold of", fit$pi_thr, "for this run.\n") } plot_sb_gamlss(fit, top = 10) if (requireNamespace("ggplot2", quietly = TRUE)) { print(effect_plot(fit, "x1", dat, what = "mu")) } ``` `selection_table()` ranks stable terms for each parameter; the chunk above prints the eight strongest effects (sorted by stability) alongside the terms exceeding the stability threshold. `plot_sb_gamlss()` overlays selection frequency vs. coefficient magnitude, and `effect_plot()` visualizes partial effects (using **ggplot2** if available, otherwise base R). ## Grouped penalties and parameter-specific engines SelectBoost.gamlss lets you mix-and-match engines across parameters. Grouped penalties (via **grpreg** and **SGL**) keep whole factors/spline bases/interactions together, while `engine`, `engine_sigma`, `engine_nu`, and `engine_tau` control the selection backend individually. ```{r, cache=TRUE, eval=LOCAL} set.seed(2) f <- factor(sample(letters[1:3], n, TRUE)) dat2 <- transform(dat, f = f, x4 = rnorm(n)) fit_grouped <- sb_gamlss( y ~ 1, mu_scope = ~ f + x1 + pb(x2) + x3 + x1:f, sigma_scope = ~ f + x1 + pb(x2), family = gamlss.dist::NO(), data = dat2, engine = "grpreg", # μ: grouped penalty engine_sigma = "sgl", # σ: sparse group lasso engine_tau = "glmnet", # τ (if present) would fall back automatically grpreg_penalty = "grLasso", sgl_alpha = 0.7, B = 40, pi_thr = 0.6, pre_standardize = TRUE, trace = FALSE ) sel_grouped <- selection_table(fit_grouped) sel_grouped <- sel_grouped[order(sel_grouped$parameter, -sel_grouped$prop), , drop = FALSE] head(sel_grouped, n = min(8L, nrow(sel_grouped))) stable_grouped <- sel_grouped[sel_grouped$prop >= fit_grouped$pi_thr, , drop = FALSE] if (nrow(stable_grouped)) { stable_grouped } else { cat("No terms reached the stability threshold of", fit_grouped$pi_thr, "for this run.\n") } ``` Grouped engines build design matrices internally (safely handling factors/splines) and retain original formulas for the final `gamlss` refit. ## SelectBoost integrations: c0 grids, autoboost, and fastboost The SelectBoost wrappers expose correlation-aware grouping and automated c0 choice. ```{r, cache=TRUE, eval=LOCAL} grid_obj <- sb_gamlss_c0_grid( y ~ 1, data = dat, family = gamlss.dist::NO(), mu_scope = ~ x1 + x2 + x3, sigma_scope = ~ x1 + x2, c0_grid = seq(0.2, 0.8, by = 0.2), B = 30, pi_thr = 0.6, pre_standardize = TRUE, trace = FALSE, progress = TRUE ) plot(grid_obj) confidence_table(grid_obj) auto_obj <- autoboost_gamlss( y ~ 1, data = dat, family = gamlss.dist::NO(), mu_scope = ~ x1 + x2 + x3, sigma_scope = ~ x1 + x2, c0_grid = seq(0.2, 0.8, by = 0.2), B = 30, pi_thr = 0.6, pre_standardize = TRUE, trace = FALSE ) attr(auto_obj, "chosen_c0") fast_obj <- fastboost_gamlss( y ~ 1, data = dat, family = gamlss.dist::NO(), mu_scope = ~ x1 + x2 + x3, sigma_scope = ~ x1 + x2, B = 30, sample_fraction = 0.6, pi_thr = 0.6, pre_standardize = TRUE, trace = FALSE ) plot_sb_gamlss(fast_obj) ``` Use `progress = TRUE` on any of these helpers to monitor grid/config iteration. ## Confidence summaries and effect diagnostics `confidence_functionals()` collapses stability curves across the c0 grid into scalar summaries (AUSC, coverage, quantiles, weighted, conservative). The output ranks terms and integrates seamlessly with `plot()` and `plot_stability_curves()`. ```{r, cache=TRUE, eval=LOCAL} cf <- confidence_functionals( grid_obj, pi_thr = 0.6, q = c(0.5, 0.8, 0.9), conservative = TRUE, B = 30 ) head(cf) plot(cf, top = 8) plot_stability_curves(grid_obj, terms = c("x1", "x3"), parameter = "mu") ``` For model interpretation, pair the stability summaries with partial-effect diagnostics via `effect_plot()` (see the quick-start example above) or use `selection_table()` to inspect the final refit. ## Tuning engines and penalties `tune_sb_gamlss()` evaluates different engine/penalty configurations using either stability or deviance metrics (with optional K-fold CV). Supply a list of configurations and a shared baseline of arguments. ```{r, cache=TRUE, eval=LOCAL} configs <- list( list(engine = "stepGAIC"), list(engine = "glmnet", glmnet_alpha = 1), list(engine = "grpreg", grpreg_penalty = "grLasso", engine_sigma = "sgl", sgl_alpha = 0.8) ) baseline <- list( formula = y ~ 1, data = dat2, family = gamlss.dist::NO(), mu_scope = ~ f + x1 + pb(x2) + x3 + x1:f, sigma_scope = ~ f + x1 + pb(x2), pi_thr = 0.6, pre_standardize = TRUE, sample_fraction = 0.7, trace = FALSE, parallel = "auto" ) tuned <- tune_sb_gamlss(configs, base_args = baseline, B_small = 25, metric = "stability", progress = TRUE) tuned$best_config ``` Switch `metric = "deviance"` and set `K` to engage fast deviance CV with the optimized density routes. ## Fast deviance utilities and diagnostics Deviance-based tuning reuses `loglik_gamlss_newdata_fast()` under the hood. Use `fast_vs_generic_ll()` to benchmark the speedup relative to the generic density evaluator, and `check_fast_vs_generic()` to verify numerical agreement (with per-family tolerances). ```{r eval=FALSE} # Example: compare deviance routes on a fitted model fit_ga <- gamlss::gamlss(y ~ x1 + x2, sigma.formula = ~ x1, data = dat, family = gamlss.dist::NO()) fast_vs_generic_ll(fit_ga, newdata = dat, reps = 30) check_fast_vs_generic(fit_ga, newdata = dat, tol = 1e-6) ``` See the dedicated fast-deviance vignettes for expanded results across many `gamlss.dist` families. ## Parallel bootstraps, progress, and long tests All bootstrap helpers accept `parallel = "auto"` to leverage **future.apply** (or any plan you set via `future::plan()`). Progress bars are enabled via `progress = TRUE` on `sb_gamlss()`, grid/autoboost helpers, and `tune_sb_gamlss()`. ```{r eval=FALSE} future::plan(multisession, workers = 4) fit_parallel <- sb_gamlss( y ~ 1, mu_scope = ~ x1 + x2 + x3, sigma_scope = ~ x1 + x2, family = gamlss.dist::NO(), data = dat, B = 200, pi_thr = 0.6, pre_standardize = TRUE, parallel = "auto", progress = TRUE, trace = FALSE ) ``` To opt into the full suite of equality/benchmark tests locally, set `options(SelectBoost.gamlss.run_long_tests = TRUE)` or `Sys.setenv(RUN_LONG_TESTS = "true")` before running `devtools::test()`.