--- title: "Comparison with Alternatives" author: "Gilles Colling" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Comparison with Alternatives} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 8, fig.height = 5, dev = "svglite", fig.ext = "svg" ) library(corrselect) ``` ## Overview This vignette compares corrselect's graph-theoretic approach to four established methods for multicollinearity management and variable selection: 1. **caret::findCorrelation()** - Greedy correlation-based pruning 2. **Boruta** - Random forest permutation importance 3. **glmnet** - L1/L2 regularization (LASSO/Ridge) 4. **Manual VIF removal** - Iterative variance inflation factor thresholding Each comparison examines algorithmic differences, performance characteristics, and appropriate use cases. Evaluations use the `bioclim_example` dataset (19 bioclimatic variables, $n = 100$). See `vignette("theory")` for mathematical foundations. See `vignette("quickstart")` for usage examples. --- ## Evaluation Dataset ```{r} data(bioclim_example) predictors <- bioclim_example[, -1] # Exclude response response <- bioclim_example[, 1] cat("Variables:", ncol(predictors), "\n") cat("Observations:", nrow(predictors), "\n") cat("Response: species_richness (continuous)\n") ``` ```{r, fig.width=8, fig.height=6, fig.alt="Correlation heatmap of 19 bioclimatic variables displayed as a color-coded matrix. Blue indicates negative correlations, white indicates near-zero correlations, and red indicates positive correlations. Numerical correlation values are overlaid on each cell. The heatmap reveals block structure with correlations ranging from -0.15 to 0.97, showing strong correlations among temperature-related variables and precipitation-related variables."} cor_matrix <- cor(predictors) # Correlation heatmap col_pal <- colorRampPalette(c("#3B4992", "white", "#EE0000"))(100) par(mar = c(1, 1, 3, 1)) nc <- ncol(cor_matrix) nr <- nrow(cor_matrix) image(seq_len(nc), seq_len(nr), t(cor_matrix[nr:1, ]), col = col_pal, xlab = "", ylab = "", axes = FALSE, main = "Bioclimatic Variable Correlations (p = 19)", zlim = c(-1, 1)) axis(1, at = seq_len(nc), labels = colnames(cor_matrix), las = 2, cex.axis = 0.7) axis(2, at = nc:1, labels = colnames(cor_matrix), las = 2, cex.axis = 0.7) for (i in seq_len(nc)) { for (j in seq_len(nr)) { text_col <- if (abs(cor_matrix[j, i]) > 0.6) "white" else "black" text(i, nr - j + 1, sprintf("%.2f", cor_matrix[j, i]), cex = 0.5, col = text_col) } } ``` Block structure present: correlations range from -0.15 to 0.97. --- ## Comparison 1: caret::findCorrelation() ### Method caret's `findCorrelation()` applies greedy iterative removal: 1. Identify pair with maximum $|r_{ij}|$ 2. Remove variable with larger mean absolute correlation 3. Repeat until all $|r_{ij}| < \tau$ Non-deterministic: results depend on internal ordering. Typically removes more variables than graph-theoretic methods. ### Execution ```{r} if (requireNamespace("caret", quietly = TRUE)) { # Apply caret's greedy algorithm to_remove_caret <- caret::findCorrelation(cor_matrix, cutoff = 0.7) result_caret <- predictors[, -to_remove_caret] cat("caret results:\n") cat(" Variables retained:", ncol(result_caret), "\n") cat(" Variables removed:", length(to_remove_caret), "\n") cat(" Removed:", paste(colnames(predictors)[to_remove_caret], collapse = ", "), "\n") } ``` ```{r} # Apply corrselect (exact mode) result_corrselect <- corrPrune(predictors, threshold = 0.7, mode = "exact") cat("\ncorrselect results:\n") cat(" Variables retained:", ncol(result_corrselect), "\n") cat(" Variables removed:", length(attr(result_corrselect, "removed_vars")), "\n") cat(" Removed:", paste(attr(result_corrselect, "removed_vars"), collapse = ", "), "\n") ``` corrselect retains more variables ($|S_{\text{corrselect}}| \ge |S_{\text{caret}}|$) via maximal clique enumeration while satisfying identical threshold constraint. ### Distribution Comparison ```{r, fig.width=7, fig.height=5, fig.alt="Overlaid histogram comparing absolute correlation distributions across three methods: original data (gray bars), caret's findCorrelation (red bars), and corrselect (blue bars). Black vertical dashed line marks the 0.7 threshold. All methods successfully reduce correlations below the threshold, but corrselect retains more variables than caret while still satisfying the constraint, demonstrating the advantage of maximal clique enumeration over greedy removal."} # Extract correlations cor_orig <- cor(predictors) cor_corrselect <- cor(result_corrselect) if (requireNamespace("caret", quietly = TRUE)) { cor_caret <- cor(result_caret) # Overlaid histogram comparing all three hist(abs(cor_orig[upper.tri(cor_orig)]), breaks = 30, main = "Distribution of Absolute Correlations", xlab = "Absolute Correlation", col = rgb(0.5, 0.5, 0.5, 0.4), xlim = c(0, 1)) hist(abs(cor_caret[upper.tri(cor_caret)]), breaks = 30, col = rgb(0.8, 0.2, 0.2, 0.4), add = TRUE) hist(abs(cor_corrselect[upper.tri(cor_corrselect)]), breaks = 30, col = rgb(0.2, 0.5, 0.8, 0.4), add = TRUE) abline(v = 0.7, col = "black", lwd = 2, lty = 2) legend("topright", legend = c( paste0("Original (", ncol(predictors), " vars)"), paste0("caret (", ncol(result_caret), " vars)"), paste0("corrselect (", ncol(result_corrselect), " vars)"), "Threshold" ), fill = c( rgb(0.5, 0.5, 0.5, 0.4), rgb(0.8, 0.2, 0.2, 0.4), rgb(0.2, 0.5, 0.8, 0.4), NA ), border = c("white", "white", "white", NA), lty = c(NA, NA, NA, 2), lwd = c(NA, NA, NA, 2), col = c(NA, NA, NA, "black"), bty = "o", bg = "white") } ``` ### Comparison | Feature | caret | corrselect | |---------|-------|------------| | **Algorithm** | Greedy iterative removal | Maximal clique enumeration | | **Optimality** | Heuristic | Exact (mode = "exact") | | **Reproducibility** | Non-deterministic | Deterministic | | **Variables retained** | $\le$ optimal | Maximal | | **Forced variables** | No | Yes (`force_in`) | | **Mixed data** | No | Yes (`assocSelect`) | | **Complexity** | $O(p^2)$ | $O(p^2)$ greedy, $O(3^{p/3})$ exact | ### Applications **caret**: Exploratory analysis, non-critical reproducibility requirements. **corrselect**: Reproducible research, maximal variable retention, forced variable constraints, mixed data types. --- ## Comparison 2: Boruta ### Method Boruta tests variable importance via random forest permutation: 1. Create shadow features (permuted copies) 2. Fit random forest on original + shadow features 3. Test: $\text{importance}(X_i) > \max(\text{importance}(\text{shadow}))$ 4. Iteratively confirm/reject until convergence **Orthogonal objective**: Boruta selects predictive variables (supervised). corrselect removes redundant variables (unsupervised). ### Execution ```{r, eval=requireNamespace("Boruta", quietly=TRUE)} if (requireNamespace("Boruta", quietly = TRUE)) { # Boruta: "Which variables predict species_richness?" set.seed(123) boruta_result <- Boruta::Boruta( species_richness ~ ., data = bioclim_example, maxRuns = 100 ) cat("Boruta variable importance screening:\n") print(table(boruta_result$finalDecision)) important_vars <- names(boruta_result$finalDecision[ boruta_result$finalDecision == "Confirmed" ]) cat("\n Confirmed predictors:", length(important_vars), "\n") cat(" ", paste(important_vars, collapse = ", "), "\n") } ``` ```{r} # corrselect: "Which variables are redundant?" corrselect_result <- corrPrune(predictors, threshold = 0.7) cat("\ncorrselect multicollinearity pruning:\n") cat(" Non-redundant variables:", ncol(corrselect_result), "\n") cat(" ", paste(names(corrselect_result), collapse = ", "), "\n") ``` Different variable sets selected: Boruta optimizes prediction; corrselect minimizes redundancy. ### Comparison | Criterion | Boruta | corrselect | |-----------|--------|------------| | **Objective** | Predictive power | Redundancy removal | | **Criterion** | Permutation importance | $\|r_{ij}\| < \tau$ | | **Response** | Required | Not required | | **Multicollinearity** | Indirect | Direct | | **Stochastic** | Yes | No | | **Complexity** | High ($\ge 100$ forests) | Low (graph) | ### Sequential Application ```{r, eval=FALSE} # Stage 1: Correlation-based pruning data_pruned <- corrPrune(raw_data, threshold = 0.7) # Stage 2: Importance testing boruta_result <- Boruta::Boruta(response ~ ., data = cbind(response, data_pruned)) final_vars <- names(boruta_result$finalDecision[ boruta_result$finalDecision == "Confirmed" ]) # Stage 3: Final model final_model <- lm(response ~ ., data = cbind(response, data_pruned)[, c("response", final_vars)]) ``` Stage 1 removes redundancy (reproducible). Stage 2 tests importance (stochastic). Stage 3 fits model with non-redundant, predictive variables. ### Applications **Boruta**: Prediction-focused analysis with response variable. **corrselect**: Multicollinearity removal, exploratory analysis without response, reproducible selection. **Sequential**: High-dimensional correlated data requiring both redundancy removal and importance testing. --- ## Comparison 3: glmnet (LASSO/Ridge) ### Method glmnet minimizes regularized loss: $$ \min_{\beta} \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda \left[\alpha \|\beta\|_1 + (1-\alpha) \|\beta\|_2^2\right] $$ - $\alpha = 1$: LASSO (L1 penalty, sparse $\beta$) - $\alpha = 0$: Ridge (L2 penalty, shrinkage) - $\lambda$: Cross-validation selected **Difference**: glmnet performs soft selection (shrinkage) optimizing prediction. corrselect performs hard selection (removal) based on correlation structure. ### Execution ```{r, eval=requireNamespace("glmnet", quietly=TRUE)} if (requireNamespace("glmnet", quietly = TRUE)) { # Fit LASSO with cross-validation X <- as.matrix(predictors) y <- response set.seed(123) cv_lasso <- glmnet::cv.glmnet(X, y, alpha = 1) # Extract non-zero coefficients at lambda.1se (conservative choice) coef_lasso <- stats::coef(cv_lasso, s = "lambda.1se") selected_lasso <- rownames(coef_lasso)[coef_lasso[, 1] != 0][-1] # Remove intercept cat("glmnet (LASSO, λ = lambda.1se):\n") cat(" Variables retained:", length(selected_lasso), "\n") cat(" ", paste(selected_lasso, collapse = ", "), "\n") } ``` ```{r, eval=requireNamespace("glmnet", quietly=TRUE)} if (requireNamespace("glmnet", quietly = TRUE)) { # Compare model performance model_glmnet <- lm(species_richness ~ ., data = bioclim_example[, c("species_richness", selected_lasso)]) model_corrselect <- lm(species_richness ~ ., data = cbind(species_richness = response, result_corrselect)) cat("\nModel comparison (OLS on selected variables):\n") cat(" glmnet: R² =", round(summary(model_glmnet)$r.squared, 3), "with", length(selected_lasso), "predictors\n") cat(" corrselect: R² =", round(summary(model_corrselect)$r.squared, 3), "with", ncol(result_corrselect), "predictors\n") } ``` glmnet selects fewer variables ($|S_{\text{glmnet}}| \le |S_{\text{corrselect}}|$) optimizing prediction. corrselect maximizes retention under correlation constraint. ### Coefficient Comparison ```{r, eval=requireNamespace("glmnet", quietly=TRUE), fig.width=10, fig.height=5, fig.alt="Side-by-side barplots comparing coefficient magnitudes between glmnet (left panel, salmon bars) and corrselect (right panel, blue bars). Left panel shows glmnet's shrunk coefficients affected by L1 penalty, biased toward zero. Right panel shows corrselect's unbiased OLS coefficients on pruned variables with preserved effect sizes. The comparison illustrates the tradeoff between prediction-focused shrinkage and interpretation-focused hard selection."} if (requireNamespace("glmnet", quietly = TRUE)) { par(mfrow = c(1, 2), mar = c(8, 4, 3, 2)) # glmnet coefficients (shrinkage) coef_vals <- coef_lasso[coef_lasso[, 1] != 0, ][-1] barplot(sort(abs(coef_vals), decreasing = TRUE), las = 2, main = "glmnet: Shrunk Coefficients", ylab = "Absolute Coefficient Value", col = "salmon", cex.names = 0.7) # corrselect: unbiased OLS coefficients coef_corrselect <- coef(model_corrselect)[-1] # Remove intercept barplot(sort(abs(coef_corrselect), decreasing = TRUE), las = 2, main = "corrselect: Unbiased OLS Coefficients", ylab = "Absolute Coefficient Value", col = rgb(0.2, 0.5, 0.8, 0.7), cex.names = 0.7) } ``` Left: L1 penalty shrinks coefficients toward zero (biased). Right: OLS on pruned variables (unbiased). glmnet optimizes prediction with shrinkage. corrselect preserves effect sizes. ### Comparison | Criterion | glmnet | corrselect | |-----------|--------|------------| | **Objective** | Prediction accuracy | Multicollinearity removal | | **Selection** | Soft (shrinkage) | Hard (removal) | | **Coefficient bias** | Yes (L1/L2) | No | | **Multicollinearity** | Regularization | Pruning | | **Response** | Required | Not required | | **Tuning** | $\lambda$ (cross-validation) | $\tau$ (user-specified) | | **Interpretability** | Shrunk effects | Direct effects | ### Applications **glmnet**: Prediction-focused, high-dimensional ($p > n$), accepts biased coefficients. **corrselect**: Interpretable coefficients, exploratory analysis, explicit correlation constraint, unregularized modeling. **Sequential**: Correlation pruning (corrselect) followed by sparse prediction (glmnet). --- ## Comparison 4: modelPrune() vs Manual VIF Removal ### Method Variance Inflation Factor quantifies predictor multicollinearity: $$ \text{VIF}_j = \frac{1}{1 - R^2_j} $$ where $R^2_j$ results from regressing $X_j$ on remaining predictors. Thresholds: - VIF < 5: Low collinearity - VIF < 10: Moderate (acceptable) - VIF ≥ 10: High (problematic) Manual approach: iteratively remove max(VIF) until all VIF < threshold. ### Manual Implementation ```{r} # Manual iterative VIF removal manual_vif_removal <- function(formula, data, threshold = 5, max_iter = 10) { require(car) # Get response variable name response_var <- all.vars(formula)[1] # Get predictor names (handles ~ . notation) model <- lm(formula, data = data) current_vars <- names(coef(model))[-1] # Exclude intercept removed_vars <- character(0) vif_vals <- car::vif(model) while (max(vif_vals) > threshold && length(current_vars) > 1 && length(removed_vars) < max_iter) { # Remove variable with highest VIF var_to_remove <- names(which.max(vif_vals)) removed_vars <- c(removed_vars, var_to_remove) cat("Iteration", length(removed_vars), ": Removing", var_to_remove, "(VIF =", round(max(vif_vals), 2), ")\n") # Update variable list and refit current_vars <- setdiff(current_vars, var_to_remove) new_formula <- as.formula(paste(response_var, "~", paste(current_vars, collapse = " + "))) model <- lm(new_formula, data = data) vif_vals <- car::vif(model) } list(model = model, iterations = length(removed_vars), vif = vif_vals, removed = removed_vars, converged = max(vif_vals) <= threshold) } # Run manual VIF removal if (requireNamespace("car", quietly = TRUE)) { cat("Manual VIF removal (iterative):\n") manual_result <- manual_vif_removal(species_richness ~ ., data = bioclim_example, threshold = 5) cat("\nVariables kept:", length(manual_result$vif), "\n") if (!manual_result$converged) { cat("(Stopped at max_iter = 10; VIF threshold not yet reached)\n") } } ``` ## modelPrune() Comparison ```{r} # Run modelPrune modelprune_result <- modelPrune(species_richness ~ ., data = bioclim_example, limit = 5) cat("\nmodelPrune results:\n") cat("Variables removed:", attr(modelprune_result, "removed_vars"), "\n") cat("Variables kept:", length(attr(modelprune_result, "selected_vars")), "\n") # Extract final model final_model <- attr(modelprune_result, "final_model") if (requireNamespace("car", quietly = TRUE)) { cat("\nFinal VIF values:\n") print(round(car::vif(final_model), 2)) } ``` ## Visual: VIF Comparison ```{r, fig.width=8, fig.height=5, fig.alt="Side-by-side barplot showing VIF values before (red bars) and after (blue bars) applying modelPrune() for the top 15 variables ordered by initial VIF. Black horizontal dashed line marks the VIF limit of 5. Before pruning, many variables show high VIF values indicating severe multicollinearity. After modelPrune(), all retained variables have VIF below the threshold, and high-VIF variables are completely removed (shown as red-only bars), demonstrating automated and effective multicollinearity reduction."} if (requireNamespace("car", quietly = TRUE)) { # Compute VIF for original model model_full <- lm(species_richness ~ ., data = bioclim_example) vif_before <- car::vif(model_full) # VIF after modelPrune vif_after <- car::vif(final_model) # Combined barplot par(mar = c(8, 4, 4, 2)) all_vars <- unique(c(names(vif_before), names(vif_after))) vif_combined <- data.frame( before = vif_before[match(all_vars, names(vif_before))], after = vif_after[match(all_vars, names(vif_after))] ) vif_combined[is.na(vif_combined)] <- 0 vif_combined <- vif_combined[order(vif_combined$before, decreasing = TRUE), ] # Show top 15 n_show <- min(15, nrow(vif_combined)) barplot(t(as.matrix(vif_combined[1:n_show, ])), beside = TRUE, las = 2, main = "VIF Before and After modelPrune()", ylab = "VIF", col = c(rgb(0.8, 0.2, 0.2, 0.7), rgb(0.2, 0.5, 0.8, 0.7)), cex.names = 0.6, names.arg = rownames(vif_combined)[1:n_show]) abline(h = 5, col = "black", lwd = 2, lty = 2) legend("topright", legend = c("Before", "After", "Limit = 5"), fill = c(rgb(0.8, 0.2, 0.2, 0.7), rgb(0.2, 0.5, 0.8, 0.7), NA), border = c("white", "white", NA), lty = c(NA, NA, 2), lwd = c(NA, NA, 2), col = c(NA, NA, "black"), bty = "o", bg = "white") } ``` ### Comparison | Criterion | Manual VIF | modelPrune() | |-----------|------------|--------------| | **Algorithm** | Iterative greedy | Graph-based | | **Automation** | Manual | Automated | | **Optimality** | Heuristic | Maximal subset | | **Forced variables** | Manual exclusion | `force_in` | | **Output** | Verbose log | Summary | ### Applications **Manual VIF**: Educational use, diagnostic understanding, legacy workflows. **modelPrune()**: Production pipelines, forced variable constraints, reproducible documentation. --- ## Summary ### Method Selection | Goal | Primary Method | Alternative | |------|---------------|-------------| | Redundancy removal (unsupervised) | corrPrune() | caret::findCorrelation() | | VIF reduction (regression) | modelPrune() | Manual VIF | | Predictive variable selection | Boruta | RF importance | | Prediction accuracy | glmnet | Elastic net | | Mixed data types | assocSelect() | Manual metrics | | Forced variable constraints | corrselect (`force_in`) | N/A | | Exploratory (fast) | caret | corrPrune (greedy) | ### corrselect Distinguishing Features 1. **Maximal clique enumeration**: Optimal retention under $|r_{ij}| < \tau$ constraint 2. **Deterministic**: ELS and Bron-Kerbosch algorithms guarantee reproducibility 3. **Flexible**: Forced variables, mixed types, greedy/exact modes 4. **Unbiased estimates**: Hard removal preserves coefficient interpretability 5. **Model-agnostic**: Correlation-based preprocessing ### Integrated Workflow ```{r, eval=FALSE} # Correlation pruning data_pruned <- corrPrune(raw_data, threshold = 0.7) # VIF refinement model_data <- modelPrune(response ~ ., data = data_pruned, limit = 5) # Importance testing (optional) if (requireNamespace("Boruta", quietly = TRUE)) { boruta_result <- Boruta::Boruta(response ~ ., data = model_data) important_vars <- names(boruta_result$finalDecision[ boruta_result$finalDecision == "Confirmed" ]) } # Final model: OLS (interpretable) or glmnet (prediction) final_model <- lm(response ~ ., data = model_data[, c("response", important_vars)]) ``` ## References - **caret**: Kuhn, M. (2008). Building predictive models in R using the caret package. *Journal of Statistical Software*, 28(5), 1-26. [doi:10.18637/jss.v028.i05](https://doi.org/10.18637/jss.v028.i05) - **Boruta**: Kursa, M. B., & Rudnicki, W. R. (2010). Feature selection with the Boruta package. *Journal of Statistical Software*, 36(11), 1-13. [doi:10.18637/jss.v036.i11](https://doi.org/10.18637/jss.v036.i11) - **glmnet**: Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. *Journal of Statistical Software*, 33(1), 1-22. [doi:10.18637/jss.v033.i01](https://doi.org/10.18637/jss.v033.i01) - **VIF**: Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). *Regression Diagnostics: Identifying Influential Data and Sources of Collinearity*. Wiley. [doi:10.1002/0471725153](https://doi.org/10.1002/0471725153) ## See Also - `vignette("quickstart")` - Interface overview and usage examples - `vignette("workflows")` - Domain-specific workflows (genomics, ecology, surveys) - `vignette("advanced")` - Algorithm selection and performance tuning - `vignette("theory")` - Graph-theoretic foundations and formal proofs ## Session Info ```{r} sessionInfo() ```