--- title: "Advanced Topics" author: "Gilles Colling" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Advanced Topics} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, dev = "svglite", fig.ext = "svg" ) library(corrselect) ``` ## Overview This vignette covers advanced topics for power users, researchers, and method developers: 1. **Understanding the Algorithms** - Exact vs greedy, complexity analysis 2. **Custom Engines** - Integrate any modeling package (INLA, mgcv, brms) 3. **Exact Subset Enumeration** - Multiple maximal subsets 4. **Performance Optimization** - Speed and memory considerations 5. **Troubleshooting** - Common issues and solutions **Target audience**: Users comfortable with R programming and statistical methods **Estimated time**: 15-20 minutes --- # 1. Understanding the Algorithms ## 1.1 Exact vs Greedy: When to Use Each corrselect offers two algorithmic approaches for `corrPrune()`: ### Exact Mode (Graph-Theoretic) **Algorithm**: Eppstein–Löffler–Strash (ELS) or Bron–Kerbosch **Complexity**: O(2^p) - exponential in number of predictors **Guarantee**: Finds the **largest** maximal independent set ```{r} data(mtcars) # Exact mode: guaranteed optimal exact_result <- corrPrune(mtcars, threshold = 0.7, mode = "exact") cat("Exact mode kept:", ncol(exact_result), "variables\n") ``` **Use exact mode when**: - p ≤ 20 (feasible runtime) - You need guaranteed optimal solution - Reproducibility is critical - You're writing a paper (justify optimality) ### Greedy Mode (Heuristic) **Algorithm**: Deterministic iterative removal **Complexity**: O(p² × k) where k = iterations **Guarantee**: Near-optimal, deterministic ```{r} # Greedy mode: fast approximation greedy_result <- corrPrune(mtcars, threshold = 0.7, mode = "greedy") cat("Greedy mode kept:", ncol(greedy_result), "variables\n") ``` **Use greedy mode when**: - p > 20 (exact becomes slow) - Speed is priority - Near-optimal is acceptable - High-dimensional data (p > 100) ### Auto Mode (Recommended) Automatically selects based on p: ```{r} # Auto mode: smart switching (exact if p ≤ 20, greedy otherwise) auto_result <- corrPrune(mtcars, threshold = 0.7, mode = "auto") cat("Auto mode kept:", ncol(auto_result), "variables\n") ``` ## 1.2 Complexity Analysis with Benchmarks Let's measure runtime scaling: ```{r} # Generate datasets with increasing p library(microbenchmark) benchmark_corrPrune <- function(p_values) { results <- data.frame( p = integer(), exact_time_ms = numeric(), greedy_time_ms = numeric() ) for (p in p_values) { # Generate correlated data set.seed(123) cor_mat <- 0.5^abs(outer(1:p, 1:p, "-")) data <- as.data.frame(MASS::mvrnorm(n = 100, mu = rep(0, p), Sigma = cor_mat)) # Exact mode (skip if p too large) exact_time_ms <- if (p <= 500) { median(microbenchmark( corrPrune(data, threshold = 0.7, mode = "exact"), times = 3, # Few iterations for speed unit = "ms" )$time) / 1e6 # Convert nanoseconds to milliseconds } else { NA } # Greedy mode greedy_time_ms <- median(microbenchmark( corrPrune(data, threshold = 0.7, mode = "greedy"), times = 3, # Few iterations for speed unit = "ms" )$time) / 1e6 # Convert nanoseconds to milliseconds results <- rbind(results, data.frame( p = p, exact_time_ms = round(exact_time_ms, 1), greedy_time_ms = round(greedy_time_ms, 1) )) } results } # Benchmark (extended range to show comprehensive scaling behavior) p_values <- c(10, 20, 50, 100, 200, 300, 500, 1000) benchmark <- benchmark_corrPrune(p_values) print(benchmark) ``` ```{r, fig.width=8, fig.height=5, fig.alt="Log-scale plot showing runtime (milliseconds) versus number of predictors for exact mode (red line) and greedy mode (blue line). Exact mode shows exponential growth with runtime increasing sharply after p=100, becoming impractical beyond p=500 (marked by vertical gray dashed line). Greedy mode shows linear scaling, remaining fast even at p=1000, demonstrating orders-of-magnitude performance advantage for high-dimensional data."} # Visualize scaling # Separate data for exact mode (only where available) and greedy mode (all points) exact_valid <- !is.na(benchmark$exact_time_ms) & benchmark$exact_time_ms > 0 greedy_valid <- !is.na(benchmark$greedy_time_ms) & benchmark$greedy_time_ms > 0 # Replace any zeros with small positive value for log scale exact_times <- benchmark$exact_time_ms greedy_times <- benchmark$greedy_time_ms exact_times[exact_times <= 0 & !is.na(exact_times)] <- 0.01 greedy_times[greedy_times <= 0 & !is.na(greedy_times)] <- 0.01 # Determine y-axis limits from valid data all_valid_times <- c(exact_times[exact_valid], greedy_times[greedy_valid]) ylim <- c(min(all_valid_times) * 0.5, max(all_valid_times) * 2) # Plot greedy mode (all points) plot(benchmark$p[greedy_valid], greedy_times[greedy_valid], type = "b", col = rgb(0.2, 0.5, 0.8, 1), pch = 19, lwd = 2, xlab = "Number of Predictors (p)", ylab = "Time (milliseconds, log scale)", main = "Exact vs Greedy Scaling", ylim = ylim, xlim = range(benchmark$p), log = "y") # Add exact mode (only where available) lines(benchmark$p[exact_valid], exact_times[exact_valid], type = "b", col = rgb(0.8, 0.2, 0.2, 1), pch = 19, lwd = 2) # Mark exact mode limit abline(v = 500, lty = 2, col = "gray50") text(500, ylim[2] * 0.5, "Exact mode limit", pos = 4, col = "gray30") legend("topleft", legend = c("Exact", "Greedy"), col = c(rgb(0.8, 0.2, 0.2, 1), rgb(0.2, 0.5, 0.8, 1)), pch = 19, lwd = 2, bty = "o", bg = "white") ``` **Key insight**: The scaling behavior depends heavily on correlation structure. For this moderately correlated dataset (ρ = 0.5^|i-j|), exact mode remains practical up to p ≈ 500, while greedy mode scales efficiently beyond p = 1000. The exact mode provides guaranteed optimality at the cost of computational complexity, while greedy mode offers substantial speed advantages for large p with near-optimal results in most practical scenarios. ## 1.3 Deterministic Tie-Breaking When multiple variables have identical correlation profiles, corrselect uses deterministic tie-breaking: ```{r} # Create data with identical correlations set.seed(123) x1 <- rnorm(100) x2 <- x1 + rnorm(100, sd = 0.1) # Almost identical to x1 x3 <- x1 + rnorm(100, sd = 0.1) # Also almost identical to x1 x4 <- rnorm(100) # Independent data_ties <- data.frame(x1, x2, x3, x4) # Run multiple times - always same result result1 <- corrPrune(data_ties, threshold = 0.95) result2 <- corrPrune(data_ties, threshold = 0.95) cat("Run 1 selected:", names(result1), "\n") cat("Run 2 selected:", names(result2), "\n") cat("Identical:", identical(names(result1), names(result2)), "\n") ``` **Tie-breaking rules**: 1. Prefer variables with lower **mean absolute correlation** with others 2. If still tied, prefer **lexicographically first** (alphabetical) This ensures reproducibility across runs, machines, and R versions. --- # 2. Custom Engines for modelPrune() ## 2.1 Understanding Custom Engines ### What is a Custom Engine? A custom engine allows you to integrate **any** modeling framework with `modelPrune()`, not just base R's `lm()` or `lme4`. This enables VIF-based pruning for: - **Bayesian models** (INLA, brms, Stan) - **Additive models** (mgcv GAMs) - **Survival models** (coxph, flexsurv) - **Custom metrics** (AIC, BIC, posterior uncertainty) ### How Custom Engines Work The `modelPrune()` algorithm follows this iterative process: 1. **Fit** the model with current predictors 2. **Diagnose** each predictor (compute a "badness" metric) 3. **Identify** the worst predictor (highest diagnostic value) 4. **Remove** if it exceeds the `limit` threshold 5. **Repeat** until all predictors satisfy the limit Your custom engine defines steps 1 and 2; `modelPrune()` handles the iteration logic. ### Engine Structure Requirements A custom engine is a named list with two required functions: ```{r, eval=FALSE} my_engine <- list( # Required: How to fit the model fit = function(formula, data, ...) { # Your model fitting code # Must return a fitted model object }, # Required: How to compute diagnostics diagnostics = function(model, fixed_effects) { # Compute diagnostic scores for each fixed effect # Higher values = worse (more likely to be removed) # Must return a named numeric vector }, # Optional: Name for error messages name = "my_custom_engine" # Defaults to "custom" ) ``` **Key principle**: The `diagnostics` function must return **higher values for worse predictors**. This inverted metric ensures the algorithm removes the most problematic variables first. ## 2.2 Example: INLA Engine (Bayesian Spatial Models) ### Background INLA (Integrated Nested Laplace Approximations) is a popular package for fast Bayesian inference, especially for spatial and temporal models. Unlike traditional VIF, we can use **posterior uncertainty** as a pruning criterion: variables with high posterior standard deviation contribute less information to the model. ### Implementation ```{r, eval=FALSE} # Custom engine for INLA inla_engine <- list( name = "inla", fit = function(formula, data, ...) { # Fit INLA model INLA::inla( formula = formula, data = data, family = list(...)$family %||% "gaussian", control.compute = list(config = TRUE), ... ) }, diagnostics = function(model, fixed_effects) { # Use posterior SD as "badness" metric # Higher SD = more uncertain = candidate for removal summary_fixed <- model$summary.fixed scores <- summary_fixed[, "sd"] names(scores) <- rownames(summary_fixed) # Return scores for fixed effects only scores[fixed_effects] } ) # Usage example pruned <- modelPrune( y ~ x1 + x2 + x3 + x4, data = my_data, engine = inla_engine, limit = 0.5 # Remove if posterior SD > 0.5 ) ``` ### How It Works 1. **fit**: Calls `INLA::inla()` to compute posterior distributions 2. **diagnostics**: Extracts posterior standard deviations from `model$summary.fixed` 3. **limit**: Threshold for acceptable uncertainty (e.g., 0.5 means remove predictors with posterior SD > 0.5) **Interpretation**: Variables with high posterior SD have coefficients that are uncertain given the data. Removing them reduces model complexity without losing much information. **When to use**: Spatial models, hierarchical models, disease mapping, ecological modeling with INLA. ## 2.3 Example: mgcv Engine (GAMs) ### Background Generalized Additive Models (GAMs) allow non-linear relationships via smooth terms. When pruning GAMs, we want to remove **parametric (linear) terms** with weak evidence while preserving smooth terms that model non-linear patterns. ### Implementation ```{r, eval=FALSE} # Custom engine for mgcv GAMs mgcv_engine <- list( name = "mgcv_gam", fit = function(formula, data, ...) { mgcv::gam(formula, data = data, ...) }, diagnostics = function(model, fixed_effects) { # Use p-values as badness metric # Higher p-value = less significant = candidate for removal summary_obj <- summary(model) # Extract parametric term p-values pvals <- summary_obj$p.pv # Return p-values for fixed effects pvals[fixed_effects] } ) # Usage example pruned <- modelPrune( y ~ x1 + x2 + s(x3), # s() for smooth terms data = my_data, engine = mgcv_engine, limit = 0.05 # Remove if p > 0.05 ) ``` ### How It Works 1. **fit**: Calls `mgcv::gam()` to fit the additive model 2. **diagnostics**: Extracts p-values for **parametric terms only** (not smooth terms) 3. **limit**: Significance threshold (e.g., 0.05 for traditional α = 0.05) **Important**: `modelPrune()` only removes parametric (linear) terms. Smooth terms specified with `s()`, `te()`, etc. are **never removed** automatically, as they're part of the model structure. **When to use**: Non-linear regression, ecological response curves, time series with trends. ## 2.4 Example: Custom Criterion (AIC-Based) ### Background Sometimes you want to prune based on **model comparison metrics** rather than coefficient-level diagnostics. This example shows how to remove variables that don't improve model fit according to AIC. ### Implementation ```{r, eval=FALSE} # AIC-based pruning engine aic_engine <- list( name = "aic_pruner", fit = function(formula, data, ...) { lm(formula, data = data) }, diagnostics = function(model, fixed_effects) { # For each predictor, compute ΔAIC if removed full_aic <- AIC(model) scores <- numeric(length(fixed_effects)) names(scores) <- fixed_effects for (var in fixed_effects) { # Refit without this variable reduced_formula <- update(formula(model), paste("~ . -", var)) reduced_model <- lm(reduced_formula, data = model$model) # ΔAIC: negative means removing improves model # We negate so "higher = worse" convention holds scores[var] <- -(AIC(reduced_model) - full_aic) } scores } ) # Usage: Remove predictors with ΔAIC < -2 (improve AIC by > 2 when removed) pruned <- modelPrune( y ~ x1 + x2 + x3, data = my_data, engine = aic_engine, limit = -2 ) ``` ### How It Works 1. **fit**: Standard linear model 2. **diagnostics**: For each variable, refit the model **without** that variable and compute ΔAIC 3. **limit**: Threshold for ΔAIC (e.g., -2 means "remove if AIC improves by more than 2 points") **Key insight**: The score is **negated** (multiplied by -1) to maintain the "higher is worse" convention. A variable with score > -2 means removing it would worsen AIC by more than 2, so it's kept. **When to use**: Model selection focused on parsimony, comparing nested models, when VIF isn't the right metric. **Alternative metrics**: You can adapt this pattern for BIC, likelihood ratio tests, or any other model comparison criterion. ## 2.5 Validation and Error Handling ### Automatic Validation `modelPrune()` automatically validates custom engines to catch common errors early: ```{r, error=TRUE} # Invalid engine: missing 'diagnostics' bad_engine <- list( fit = function(formula, data, ...) lm(formula, data = data) # Missing 'diagnostics' ) tryCatch({ modelPrune(mpg ~ ., data = mtcars, engine = bad_engine, limit = 5) }, error = function(e) { cat("Error:", e$message, "\n") }) ``` ### Validation Checklist Your custom engine must satisfy these requirements: 1. **Structure**: Named list with `fit` and `diagnostics` functions 2. **fit returns model object**: Must return something the `diagnostics` function can consume 3. **diagnostics returns numeric vector**: No characters, factors, or other types 4. **Correct length**: One diagnostic value per fixed effect (excluding intercept) 5. **Named vector**: Names must match variable names exactly 6. **No missing values**: NA, NaN, or Inf will cause errors 7. **Higher = worse**: Diagnostic values must increase for more problematic predictors ### Debugging Tips If your custom engine fails, check: ```{r, eval=FALSE} # Test your fit function in isolation test_model <- my_engine$fit(y ~ x1 + x2, data = my_data) summary(test_model) # Does it work? # Test your diagnostics function test_diag <- my_engine$diagnostics(test_model, c("x1", "x2")) print(test_diag) # Numeric? Named? Correct length? # Check that "higher = worse" convention is satisfied # The variable with the highest score should be the one you'd remove first ``` --- # 3. Exact Subset Enumeration ## 3.1 When You Need ALL Maximal Subsets `corrPrune()` returns a **single** subset. Sometimes you want **all** maximal subsets: ```{r} # corrPrune: Single subset single <- corrPrune(mtcars, threshold = 0.7) cat("corrPrune returned:", ncol(single), "variables\n") # corrSelect: ALL maximal subsets (use higher threshold to ensure subsets exist) all_subsets <- corrSelect(mtcars, threshold = 0.9) show(all_subsets) ``` ## 3.2 Exploring Multiple Subsets When multiple maximal subsets exist, you can explore them: ```{r} if (length(all_subsets@subset_list) > 0) { # Display first few subsets cat("First few maximal subsets:\n") for (i in seq_len(min(3, length(all_subsets@subset_list)))) { cat(sprintf("\nSubset %d (avg corr = %.3f):\n", i, all_subsets@avg_corr[i])) cat(" ", paste(all_subsets@subset_list[[i]], collapse = ", "), "\n") } # Analyze subset characteristics subset_sizes <- lengths(all_subsets@subset_list) cat("\nSubset sizes:\n") print(table(subset_sizes)) cat("\nAverage correlations:\n") print(summary(all_subsets@avg_corr)) } else { cat("No subsets found at threshold 0.9\n") } ``` ## 3.3 Extracting Specific Subsets ```{r} if (length(all_subsets@subset_list) > 0) { # Extract subset with lowest average correlation best_idx <- which.min(all_subsets@avg_corr) best_subset <- corrSubset(all_subsets, mtcars, which = best_idx) cat("Best subset (lowest avg correlation):\n") print(names(best_subset)) # Extract subset with most predictors subset_sizes <- lengths(all_subsets@subset_list) largest_idx <- which.max(subset_sizes) largest_subset <- corrSubset(all_subsets, mtcars, which = largest_idx) cat("\nLargest subset:\n") print(names(largest_subset)) } else { cat("No subsets to extract at threshold 0.9\n") } ``` ## 3.4 Advanced: Domain-Specific Subset Selection You can define custom criteria for choosing among multiple subsets: ```{r} if (length(all_subsets@subset_list) > 0) { # Custom criterion: Prefer subsets with specific variables preferred_vars <- c("mpg", "hp", "wt") # Compute score: number of preferred variables in each subset scores <- sapply(all_subsets@subset_list, function(vars) { sum(preferred_vars %in% vars) }) # Select subset with most preferred variables best_idx <- which.max(scores) cat("Subset with most preferred variables (score:", scores[best_idx], "):\n") cat(paste(all_subsets@subset_list[[best_idx]], collapse = ", "), "\n") # Extract as data frame preferred_subset <- corrSubset(all_subsets, mtcars, which = best_idx) print(head(preferred_subset)) } else { cat("No subsets available for custom selection\n") } ``` --- # 4. Performance Optimization ## 4.1 Precomputed Correlation Matrices For repeated pruning with different thresholds, precompute the correlation matrix: ```{r} # Create larger dataset for meaningful timing comparison set.seed(123) large_data <- as.data.frame(matrix(rnorm(100 * 50), ncol = 50)) # Benchmark: Recompute correlation every time time1 <- median(microbenchmark( { result1 <- corrPrune(large_data, threshold = 0.7) result2 <- corrPrune(large_data, threshold = 0.8) result3 <- corrPrune(large_data, threshold = 0.9) }, times = 3, unit = "ms" )$time) / 1e6 # Convert nanoseconds to milliseconds # Benchmark: Compute correlation once, reuse cor_matrix <- cor(large_data) time2 <- median(microbenchmark( { result1 <- MatSelect(cor_matrix, threshold = 0.7) result2 <- MatSelect(cor_matrix, threshold = 0.8) result3 <- MatSelect(cor_matrix, threshold = 0.9) }, times = 3, unit = "ms" )$time) / 1e6 # Convert nanoseconds to milliseconds cat(sprintf("Recomputing each time: %.1f ms\n", time1)) cat(sprintf("Precomputed matrix: %.1f ms\n", time2)) cat(sprintf("Speedup: %.1fx faster\n", time1 / time2)) ``` **Use precomputed matrices when**: - Testing multiple thresholds - Cross-validation workflows - Sensitivity analysis ## 4.2 Memory Considerations for Large Data For large datasets (n > 10,000, p > 500): ### Memory-Efficient Correlation Computation ```{r, eval=FALSE} # Standard (memory-intensive for large n) cor_matrix <- cor(large_data) # Memory-efficient alternative (for very large n) # Process in chunks if needed compute_cor_chunked <- function(data, chunk_size = 1000) { n <- nrow(data) n_chunks <- ceiling(n / chunk_size) # Use online algorithm or chunked computation # (Implementation depends on your data size) } ``` ### Sparse Correlation Matrices If most correlations are low, consider sparse storage: ```{r, eval=FALSE} # Convert to sparse format (requires Matrix package) library(Matrix) # Compute correlation cor_mat <- cor(data) # Threshold and convert to sparse cor_sparse <- cor_mat cor_sparse[abs(cor_sparse) < 0.3] <- 0 # Set low correlations to 0 cor_sparse <- Matrix(cor_sparse, sparse = TRUE) # Memory savings object.size(cor_mat) object.size(cor_sparse) ``` ## 4.3 Parallel Processing Strategies For multiple independent pruning operations: ```{r, eval=FALSE} library(parallel) # Create cluster cl <- makeCluster(detectCores() - 1) # Export functions to cluster clusterEvalQ(cl, library(corrselect)) # Parallel pruning with different thresholds thresholds <- seq(0.5, 0.9, by = 0.1) results <- parLapply(cl, thresholds, function(thresh) { corrPrune(my_data, threshold = thresh) }) # Clean up stopCluster(cl) ``` **Note**: corrselect itself doesn't parallelize internally (for reproducibility), but you can parallelize **across** multiple analyses. ## 4.4 Choosing the Right Mode Decision tree for mode selection: ``` p ≤ 15: Use "exact" (fast enough, guaranteed optimal) 15 < p ≤ 25: Use "exact" if time permits, "greedy" if speed critical p > 25: Use "greedy" or "auto" p > 100: Always use "greedy" ``` --- # 5. Troubleshooting ## 5.1 Common Errors and Solutions ### Error: "No valid subsets found" **Cause**: Threshold too strict - all variables exceed it ```{r, error=TRUE} # Example: All correlations > 0.9 set.seed(123) x <- rnorm(100) high_cor_data <- data.frame( x1 = x, x2 = x + rnorm(100, sd = 0.01), x3 = x + rnorm(100, sd = 0.01) ) tryCatch({ corrPrune(high_cor_data, threshold = 0.5) }, error = function(e) { cat("Error:", e$message, "\n") }) ``` **Solutions**: 1. Increase threshold 2. Use `force_in` to keep at least one variable 3. Check data for near-duplicates ```{r, error=TRUE} # Solution 1: Increase threshold result <- corrPrune(high_cor_data, threshold = 0.95) print(names(result)) # Solution 2: Force keep one variable result <- corrPrune(high_cor_data, threshold = 0.5, force_in = "x1") print(names(result)) ``` ### Error: force_in variables conflict with threshold **Cause**: Variables in `force_in` have |correlation| > threshold ```{r, error=TRUE} # x1 and x2 are highly correlated tryCatch({ corrPrune(high_cor_data, threshold = 0.5, force_in = c("x1", "x2")) }, error = function(e) { cat("Error:", e$message, "\n") }) ``` **Solution**: Either increase threshold or reduce `force_in` set ### Error: VIF computation fails in modelPrune() **Cause**: Perfect multicollinearity (R² = 1) ```{r, error=TRUE} # Create perfect multicollinearity perfect_data <- data.frame( y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100) ) perfect_data$x3 <- perfect_data$x1 + perfect_data$x2 # Perfect collinearity tryCatch({ modelPrune(y ~ ., data = perfect_data, limit = 5) }, error = function(e) { cat("Error:", e$message, "\n") }) ``` **Solution**: Use `corrPrune()` first to remove perfect collinearity: ```{r} # Two-step approach step1 <- corrPrune(perfect_data[, -1], threshold = 0.99) step2_data <- data.frame(y = perfect_data$y, step1) result <- modelPrune(y ~ ., data = step2_data, limit = 5) print(attr(result, "selected_vars")) ``` ## 5.2 Threshold Selection Guidance ### For corrPrune() (Correlation Threshold) **Conservative (strict)**: - threshold = 0.5: Very low redundancy, may lose information - Use when: Interpretability is critical, small sample size **Moderate (recommended)**: - threshold = 0.7: Balances redundancy and information retention - Use when: Standard regression, general analysis **Lenient**: - threshold = 0.9: Only removes near-duplicates - Use when: Large sample size, prediction focus ### For modelPrune() (VIF Limit) **Strict**: - limit = 2: Very low multicollinearity, may over-prune - Use when: Small sample size, interpretability critical **Moderate (recommended)**: - limit = 5: Standard threshold in literature - Use when: General regression analysis **Lenient**: - limit = 10: Tolerates more multicollinearity - Use when: Large sample size, prediction focus ### Empirical Approach: Visualize First ```{r, fig.width=10, fig.height=4, fig.alt="Two side-by-side plots for threshold selection guidance. Left panel: histogram of absolute correlations in mtcars showing distribution with vertical lines marking strict (0.5, red), moderate (0.7, blue), and lenient (0.9, green) thresholds. Right panel: line plot showing number of variables retained versus threshold, demonstrating sensitivity analysis with plateau beginning around 0.7, helping identify optimal threshold for balancing redundancy reduction and information retention."} data(mtcars) # Visualize correlation distribution cor_mat <- cor(mtcars) cor_vec <- cor_mat[upper.tri(cor_mat)] par(mfrow = c(1, 2)) # Histogram of correlations hist(abs(cor_vec), breaks = 30, main = "Distribution of |Correlations|", xlab = "|Correlation|", col = "lightblue") abline(v = c(0.5, 0.7, 0.9), col = c("red", "blue", "green"), lwd = 2, lty = 2) legend("topright", legend = c("0.5 (strict)", "0.7 (moderate)", "0.9 (lenient)"), col = c("red", "blue", "green"), lwd = 2, lty = 2, bty = "o", bg = "white") # Subset size vs threshold thresholds <- seq(0.3, 0.95, by = 0.05) sizes <- sapply(thresholds, function(t) { tryCatch({ ncol(corrPrune(mtcars, threshold = t)) }, error = function(e) NA) }) plot(thresholds, sizes, type = "b", xlab = "Threshold", ylab = "Number of Variables Retained", main = "Threshold Sensitivity", col = "blue", lwd = 2) abline(h = ncol(mtcars), lty = 2, col = "gray") text(0.3, ncol(mtcars), "Original", pos = 3) ``` **Strategy**: Choose threshold where curve begins to plateau. ## 5.3 Handling Edge Cases ### Single Predictor After Pruning ```{r} # Very strict threshold may leave only 1 variable strict_result <- corrPrune(mtcars, threshold = 0.3) cat("Variables remaining:", ncol(strict_result), "\n") # Check if result is usable if (ncol(strict_result) < 2) { cat("Warning: Only 1 variable remaining. Consider:\n") cat(" 1. Increasing threshold\n") cat(" 2. Using force_in to keep important variables\n") } ``` ### All Variables Removed ```{r, error=TRUE} # Impossible threshold tryCatch({ corrPrune(mtcars, threshold = 0.0) }, error = function(e) { cat("Error:", e$message, "\n") }) ``` ### Mixed-Type Data ```{r} # Create mixed data mixed_data <- mtcars mixed_data$cyl <- factor(mixed_data$cyl) mixed_data$am <- factor(mixed_data$am) # Use assocSelect for mixed types result <- assocSelect(mixed_data, threshold = 0.6) show(result) ``` --- # 6. Best Practices ## 6.1 Workflow Recommendations ### For Exploratory Analysis ```{r, eval=FALSE} # 1. Visualize correlations corrplot::corrplot(cor(data), method = "circle") # 2. Try multiple thresholds results <- lapply(c(0.5, 0.7, 0.9), function(t) { corrPrune(data, threshold = t) }) # 3. Compare subset sizes sapply(results, ncol) # 4. Choose based on your needs final_data <- results[[2]] # threshold = 0.7 ``` ### For Publication-Quality Analysis ```{r, eval=FALSE} # 1. Use exact mode for reproducibility and optimality data_pruned <- corrPrune(data, threshold = 0.7, mode = "exact") # 2. Document in methods section cat(sprintf( "Variables were pruned using corrselect::corrPrune() with threshold = 0.7, ", "exact mode, retaining %d of %d original predictors.", ncol(data_pruned), ncol(data) )) # 3. Report which variables were removed removed <- attr(data_pruned, "removed_vars") cat("Removed variables:", paste(removed, collapse = ", ")) ``` ## 6.2 Combining with Other Methods ```{r, eval=FALSE} # Comprehensive variable selection pipeline pipeline <- function(data, response) { # Step 1: Remove correlations step1 <- corrPrune(data, threshold = 0.7, mode = "auto") # Step 2: VIF cleanup step2_data <- data.frame(response = response, step1) step2 <- modelPrune(response ~ ., data = step2_data, limit = 5) # Step 3: Feature importance (optional) if (requireNamespace("Boruta", quietly = TRUE)) { boruta_result <- Boruta::Boruta(response ~ ., data = step2) important <- Boruta::getSelectedAttributes(boruta_result) final_data <- step2[, c("response", important)] } else { final_data <- step2 } final_data } ``` --- # 7. Summary ## Key Takeaways ### Algorithms - Use **exact mode** for p ≤ 20 (optimal, reproducible) - Use **greedy mode** for p > 20 (fast, near-optimal) - Use **auto mode** to let corrselect decide ### Custom Engines - Integrate any modeling package (INLA, mgcv, brms) - Define custom pruning criteria (AIC, BIC, p-values) - Two required functions: `fit` and `diagnostics` ### Optimization - Precompute correlation matrices for multiple thresholds - Use greedy mode for large p - Parallelize across analyses, not within ### Troubleshooting - Visualize correlation distribution before choosing threshold - Use `force_in` to protect important variables - Two-step pruning (corrPrune → modelPrune) for robustness --- # 8. References **Algorithms**: - Eppstein, D., Löffler, M., & Strash, D. (2010). Listing all maximal cliques in sparse graphs in near-optimal time. *Symposium on Algorithms and Computation*. - Bron, C., & Kerbosch, J. (1973). Algorithm 457: Finding all cliques of an undirected graph. *Communications of the ACM*, 16(9), 575-577. **Multicollinearity**: - O'Brien, R. M. (2007). A caution regarding rules of thumb for variance inflation factors. *Quality & Quantity*, 41(5), 673-690. - Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). *Regression Diagnostics*. Wiley. **Software**: - INLA: Rue, H., Martino, S., & Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models. *Journal of the Royal Statistical Society: Series B*, 71(2), 319-392. - mgcv: Wood, S. N. (2017). *Generalized Additive Models: An Introduction with R* (2nd ed.). Chapman and Hall/CRC. --- ## See Also - `vignette("quickstart")` - 5-minute introduction - `vignette("workflows")` - Real-world examples - `vignette("comparison")` - vs caret, Boruta, glmnet - `vignette("corrselect_vignette")` - Original exact methods vignette - `?corrPrune` - Association-based pruning - `?modelPrune` - Model-based pruning - `?corrSelect` - Exact subset enumeration ## Session Info ```{r} sessionInfo() ```