Fast and Flexible Predictor Pruning for Data Analysis and Modeling
The corrselect package provides simple, high-level
functions for predictor pruning using association-based
and model-based approaches. Whether you need to reduce multicollinearity
before modeling or clean correlated predictors in your dataset,
corrselect offers fast, deterministic solutions with
minimal code.
library(corrselect)
data(mtcars)
# Association-based pruning (model-free)
pruned <- corrPrune(mtcars, threshold = 0.7)
names(pruned)
# Model-based pruning (VIF)
pruned <- modelPrune(mpg ~ ., data = mtcars, limit = 5)
attr(pruned, "selected_vars")

Variable selection is a central task in statistics and machine learning, particularly when working with high-dimensional or collinear data. In many applications, users aim to retain sets of variables that are weakly associated with one another to avoid redundancy and reduce overfitting. Common approaches such as greedy filtering or regularized regression either discard useful features or do not guarantee bounded pairwise associations.
This package addresses the admissible set problem:
selecting all maximal subsets of variables such that no pairwise
association within a subset exceeds a user-defined threshold. It
generalizes to mixed-type data, supports multiple association metrics,
and allows constrained subset selection via force_in (e.g., always
including key predictors).
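
To make the admissible set problem concrete, here is a brute-force sketch in base R. It is an illustration only, not the algorithm corrselect uses: `maximal_admissible` is a hypothetical helper, the toy association matrix is made up, and the cost grows exponentially with the number of variables.

```r
# Brute-force illustration (base R only, hypothetical helper):
# list every maximal subset whose pairwise |association| stays at or
# below the threshold.
maximal_admissible <- function(assoc, threshold) {
  vars <- colnames(assoc)
  # All non-empty subsets of the variable names, smallest first
  subsets <- unlist(
    lapply(seq_along(vars), function(k) combn(vars, k, simplify = FALSE)),
    recursive = FALSE
  )
  # A subset is admissible when no pair exceeds the threshold
  ok <- vapply(subsets, function(s) {
    m <- abs(assoc[s, s, drop = FALSE])
    all(m[upper.tri(m)] <= threshold)
  }, logical(1))
  admissible <- subsets[ok]
  # Maximal = not contained in any larger admissible subset
  is_max <- vapply(admissible, function(s) {
    !any(vapply(admissible, function(t) {
      length(t) > length(s) && all(s %in% t)
    }, logical(1)))
  }, logical(1))
  admissible[is_max]
}

# Toy example: "b" is strongly associated with every other variable,
# while "a", "c" and "d" are mutually unassociated.
vars <- c("a", "b", "c", "d")
assoc <- diag(4)
dimnames(assoc) <- list(vars, vars)
assoc["b", c("a", "c", "d")] <- 0.9
assoc[c("a", "c", "d"), "b"] <- 0.9

res <- maximal_admissible(assoc, threshold = 0.7)
print(res)
```

In this toy case there are two maximal subsets: the singleton containing "b", and the set of "a", "c" and "d". A filter that happens to keep "b" first ends up with one variable instead of three, which is the kind of loss the exhaustive approach avoids.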
These features make the package useful in domains like:

- corrPrune(): Association-based predictor pruning
  - force_in
- modelPrune(): Model-based predictor pruning
  - lm, glm, lme4, glmmTMB engines
- corrPrune(mode = "exact")
- "pearson", "spearman", "kendall"
- "bicor" (WGCNA), "distance" (energy), "maximal" (minerva)
- "eta", "cramersv" for mixed-type data
- force_in: protect variables from removal

# Install from GitHub
remotes::install_github("gcol33/corrselect")

Association-based pruning (corrPrune)

library(corrselect)
data(mtcars)
# Basic: Remove correlated predictors
pruned <- corrPrune(mtcars, threshold = 0.7)
names(pruned)
# Protect important variables
pruned <- corrPrune(mtcars, threshold = 0.7, force_in = "mpg")
# Use exact mode (slower, guaranteed optimal)
pruned <- corrPrune(mtcars, threshold = 0.7, mode = "exact")
# Use greedy mode (faster for large datasets)
pruned <- corrPrune(mtcars, threshold = 0.7, mode = "greedy")
# Check which variables were retained
attr(pruned, "selected_vars")

Model-based pruning (modelPrune)

# Linear model with VIF threshold
pruned <- modelPrune(mpg ~ cyl + disp + hp + wt, data = mtcars, limit = 5)
attr(pruned, "removed_vars")
# GLM with binomial family
mtcars$am_binary <- as.factor(mtcars$am)
pruned <- modelPrune(am_binary ~ cyl + disp + hp,
                     data = mtcars, engine = "glm",
                     family = binomial(), limit = 5)
# Mixed model (requires lme4)
# Simulated data, created outside the lme4 check so that later
# examples can reuse `df` even when lme4 is not installed
df <- data.frame(
  y = rnorm(100),
  x1 = rnorm(100),
  x2 = rnorm(100),
  group = rep(1:10, each = 10)
)
if (requireNamespace("lme4", quietly = TRUE)) {
  pruned <- modelPrune(y ~ x1 + x2 + (1 | group),
                       data = df, engine = "lme4", limit = 5)
}
# Custom engine (advanced: works with any modeling package)
# Example: INLA-based pruning
if (requireNamespace("INLA", quietly = TRUE)) {
  inla_engine <- list(
    name = "inla",
    fit = function(formula, data, ...) {
      INLA::inla(formula = formula, data = data,
                 family = "gaussian", ...)
    },
    diagnostics = function(model, fixed_effects) {
      # Use posterior SD as badness metric
      scores <- model$summary.fixed[, "sd"]
      names(scores) <- rownames(model$summary.fixed)
      scores[fixed_effects]
    }
  )
  pruned <- modelPrune(y ~ x1 + x2, data = df,
                       engine = inla_engine, limit = 0.5)
}

# Find ALL maximal subsets
res <- corrSelect(mtcars, threshold = 0.7)
show(res)
# Extract a specific subset
subset1 <- corrSubset(res, mtcars, which = 1)
# Convert to data frame
as.data.frame(res)

Comparing corrPrune and modelPrune

| Feature | corrPrune() | modelPrune() |
|---|---|---|
| Requires model specification? | No | Yes |
| Based on | Pairwise correlations/associations | Model diagnostics (VIF) |
| Speed | Fast (greedy mode) | Moderate (refits models) |
| Works without response? | Yes | No |
| Supports mixed models? | No | Yes (lme4, glmmTMB) |
| Best for | Exploratory analysis, large p | Regression workflows, VIF reduction |
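
For readers unfamiliar with the VIF diagnostic listed in the table, the sketch below computes it by hand in base R using the textbook definition: regress each predictor on the others and take 1 / (1 - R^2). `vif_by_hand` is a hypothetical helper for illustration; how modelPrune() computes its diagnostic internally is not shown here.

```r
# Textbook variance inflation factor (base R only, hypothetical helper).
# For each predictor: regress it on the remaining predictors and
# convert the resulting R^2 into 1 / (1 - R^2).
vif_by_hand <- function(data, predictors) {
  vapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  }, numeric(1))
}

predictors <- c("cyl", "disp", "hp", "wt")
vifs <- vif_by_hand(mtcars, predictors)
print(vifs)

# Predictors that a limit of 5 would flag as too collinear
names(vifs)[vifs > 5]
```

For numeric predictors in a linear model with an intercept, the same values appear on the diagonal of the inverse predictor correlation matrix, `diag(solve(cor(mtcars[, predictors])))`, which is a quick way to cross-check the result.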
Tip: Use corrPrune() first to reduce
dimensionality, then modelPrune() for final cleanup within
a modeling framework.
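
The tip above could be wired together roughly as follows. This is a sketch under assumptions: `two_stage_prune` is a hypothetical helper, it relies only on the corrPrune() and modelPrune() arguments shown elsewhere in this README, and it assumes the result of modelPrune() carries a "selected_vars" attribute as in the earlier examples.

```r
# Two-stage pruning sketch (hypothetical helper, not part of corrselect).
two_stage_prune <- function(data, response, threshold = 0.7, limit = 5) {
  if (!requireNamespace("corrselect", quietly = TRUE)) {
    stop("corrselect is required for this sketch")
  }
  # Stage 1: association-based pruning on the predictors only, so that
  # a predictor is never dropped for being correlated with the response.
  predictors <- data[, setdiff(names(data), response), drop = FALSE]
  stage1 <- corrselect::corrPrune(predictors, threshold = threshold)

  # Stage 2: model-based pruning on the surviving predictors.
  reduced <- cbind(data[response], stage1)
  form <- reformulate(names(stage1), response = response)
  corrselect::modelPrune(form, data = reduced, limit = limit)
}

# Only run the example when corrselect is installed
if (requireNamespace("corrselect", quietly = TRUE)) {
  final <- two_stage_prune(mtcars, response = "mpg")
  print(attr(final, "selected_vars"))
}
```

Keeping the response out of the first stage is a deliberate choice in this sketch: pruning on a matrix that includes the response would tend to discard exactly the predictors most associated with it.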
Use assocSelect() for exact enumeration with mixed data
types:
df <- data.frame(
  height = rnorm(30, 170, 10),
  weight = rnorm(30, 70, 12),
  group = factor(sample(c("A", "B"), 30, TRUE)),
  # Set levels explicitly; the default would sort them alphabetically
  rating = ordered(sample(c("low", "med", "high"), 30, TRUE),
                   levels = c("low", "med", "high"))
)
res <- assocSelect(df, threshold = 0.6)
show(res)

Work directly with correlation matrices:
mat <- cor(mtcars)
res <- MatSelect(mat, threshold = 0.7, method = "els")

This repository includes a short paper prepared for submission to the
Journal of Open Source Software (JOSS). You can find the
manuscript and references in the paper/ directory:
- paper/paper.md: main text
- paper/paper.bib: bibliography

License: MIT (see the LICENSE.md file)