---
title: "Advanced Topics"
author: "Gilles Colling"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Advanced Topics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  dev = "svglite",
  fig.ext = "svg"
)
library(corrselect)
```

## Overview

This vignette covers advanced topics for power users, researchers, and method developers:

1. **Understanding the Algorithms** - Exact vs greedy, complexity analysis
2. **Custom Engines** - Integrate any modeling package (INLA, mgcv, brms)
3. **Exact Subset Enumeration** - Multiple maximal subsets
4. **Performance Optimization** - Speed and memory considerations
5. **Troubleshooting** - Common issues and solutions

**Target audience**: Users comfortable with R programming and statistical methods

**Estimated time**: 15-20 minutes

---

# 1. Understanding the Algorithms

## 1.1 Exact vs Greedy: When to Use Each

corrselect offers two algorithmic approaches for `corrPrune()`:

### Exact Mode (Graph-Theoretic)

**Algorithm**: Eppstein–Löffler–Strash (ELS) or Bron–Kerbosch
**Complexity**: O(2^p) - exponential in number of predictors
**Guarantee**: Finds the **largest** maximal independent set

```{r}
data(mtcars)

# Exact mode: guaranteed optimal
exact_result <- corrPrune(mtcars, threshold = 0.7, mode = "exact")
cat("Exact mode kept:", ncol(exact_result), "variables\n")
```

**Use exact mode when**:

- p ≤ 20 (feasible runtime)
- You need guaranteed optimal solution
- Reproducibility is critical
- You're writing a paper (justify optimality)

### Greedy Mode (Heuristic)

**Algorithm**: Deterministic iterative removal
**Complexity**: O(p² × k) where k = iterations
**Guarantee**: Near-optimal, deterministic

```{r}
# Greedy mode: fast approximation
greedy_result <- corrPrune(mtcars, threshold = 0.7, mode = "greedy")
cat("Greedy mode kept:", ncol(greedy_result), "variables\n")
```

**Use greedy mode when**:

- p > 20 (exact becomes slow)
- Speed is priority
- Near-optimal is acceptable
- High-dimensional data (p > 100)

### Auto Mode (Recommended)

Automatically selects based on p:

```{r}
# Auto mode: smart switching (exact if p ≤ 20, greedy otherwise)
auto_result <- corrPrune(mtcars, threshold = 0.7, mode = "auto")
cat("Auto mode kept:", ncol(auto_result), "variables\n")
```

## 1.2 Complexity Analysis with Benchmarks

Let's measure runtime scaling:

```{r}
# Generate datasets with increasing p
library(microbenchmark)

benchmark_corrPrune <- function(p_values) {
  results <- data.frame(
    p = integer(),
    exact_time_ms = numeric(),
    greedy_time_ms = numeric()
  )

  for (p in p_values) {
    # Generate correlated data
    set.seed(123)
    cor_mat <- 0.5^abs(outer(1:p, 1:p, "-"))
    data <- as.data.frame(MASS::mvrnorm(n = 100, mu = rep(0, p), Sigma = cor_mat))

    # Exact mode (skip if p too large)
    exact_time_ms <- if (p <= 500) {
      median(microbenchmark(
        corrPrune(data, threshold = 0.7, mode = "exact"),
        times = 3,  # Few iterations for speed
        unit = "ms"
      )$time) / 1e6  # Convert nanoseconds to milliseconds
    } else {
      NA
    }

    # Greedy mode
    greedy_time_ms <- median(microbenchmark(
      corrPrune(data, threshold = 0.7, mode = "greedy"),
      times = 3,  # Few iterations for speed
      unit = "ms"
    )$time) / 1e6  # Convert nanoseconds to milliseconds

    results <- rbind(results, data.frame(
      p = p,
      exact_time_ms = round(exact_time_ms, 1),
      greedy_time_ms = round(greedy_time_ms, 1)
    ))
  }

  results
}

# Benchmark (extended range to show comprehensive scaling behavior)
p_values <- c(10, 20, 50, 100, 200, 300, 500, 1000)
benchmark <- benchmark_corrPrune(p_values)
print(benchmark)
```

```{r, fig.width=8, fig.height=5, fig.alt="Log-scale plot showing runtime (milliseconds) versus number of predictors for exact mode (red line) and greedy mode (blue line). Exact mode shows exponential growth with runtime increasing sharply after p=100, becoming impractical beyond p=500 (marked by vertical gray dashed line). Greedy mode shows linear scaling, remaining fast even at p=1000, demonstrating orders-of-magnitude performance advantage for high-dimensional data."}
# Visualize scaling
# Separate data for exact mode (only where available) and greedy mode (all points)
exact_valid <- !is.na(benchmark$exact_time_ms) & benchmark$exact_time_ms > 0
greedy_valid <- !is.na(benchmark$greedy_time_ms) & benchmark$greedy_time_ms > 0

# Replace any zeros with small positive value for log scale
exact_times <- benchmark$exact_time_ms
greedy_times <- benchmark$greedy_time_ms
exact_times[exact_times <= 0 & !is.na(exact_times)] <- 0.01
greedy_times[greedy_times <= 0 & !is.na(greedy_times)] <- 0.01

# Determine y-axis limits from valid data
all_valid_times <- c(exact_times[exact_valid], greedy_times[greedy_valid])
ylim <- c(min(all_valid_times) * 0.5, max(all_valid_times) * 2)

# Plot greedy mode (all points)
plot(benchmark$p[greedy_valid], greedy_times[greedy_valid],
     type = "b", col = rgb(0.2, 0.5, 0.8, 1), pch = 19, lwd = 2,
     xlab = "Number of Predictors (p)",
     ylab = "Time (milliseconds, log scale)",
     main = "Exact vs Greedy Scaling",
     ylim = ylim,
     xlim = range(benchmark$p),
     log = "y")

# Add exact mode (only where available)
lines(benchmark$p[exact_valid], exact_times[exact_valid],
      type = "b", col = rgb(0.8, 0.2, 0.2, 1), pch = 19, lwd = 2)

# Mark exact mode limit
abline(v = 500, lty = 2, col = "gray50")
text(500, ylim[2] * 0.5, "Exact mode limit", pos = 4, col = "gray30")

legend("topleft",
       legend = c("Exact", "Greedy"),
       col = c(rgb(0.8, 0.2, 0.2, 1), rgb(0.2, 0.5, 0.8, 1)),
       pch = 19,
       lwd = 2,
       bty = "o",
       bg = "white")
```

**Key insight**: The scaling behavior depends heavily on correlation structure. For this moderately correlated dataset (ρ = 0.5^|i-j|), exact mode remains practical up to p ≈ 500, while greedy mode scales efficiently beyond p = 1000. The exact mode provides guaranteed optimality at the cost of computational complexity, while greedy mode offers substantial speed advantages for large p with near-optimal results in most practical scenarios.

## 1.3 Deterministic Tie-Breaking

When multiple variables have identical correlation profiles, corrselect uses deterministic tie-breaking:

```{r}
# Create data with identical correlations
set.seed(123)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)  # Almost identical to x1
x3 <- x1 + rnorm(100, sd = 0.1)  # Also almost identical to x1
x4 <- rnorm(100)                  # Independent

data_ties <- data.frame(x1, x2, x3, x4)

# Run multiple times - always same result
result1 <- corrPrune(data_ties, threshold = 0.95)
result2 <- corrPrune(data_ties, threshold = 0.95)

cat("Run 1 selected:", names(result1), "\n")
cat("Run 2 selected:", names(result2), "\n")
cat("Identical:", identical(names(result1), names(result2)), "\n")
```

**Tie-breaking rules**:
1. Prefer variables with lower **mean absolute correlation** with others
2. If still tied, prefer **lexicographically first** (alphabetical)

This ensures reproducibility across runs, machines, and R versions.

---

# 2. Custom Engines for modelPrune()

## 2.1 Understanding Custom Engines

### What is a Custom Engine?

A custom engine allows you to integrate **any** modeling framework with `modelPrune()`, not just base R's `lm()` or `lme4`. This enables VIF-based pruning for:

- **Bayesian models** (INLA, brms, Stan)
- **Additive models** (mgcv GAMs)
- **Survival models** (coxph, flexsurv)
- **Custom metrics** (AIC, BIC, posterior uncertainty)

### How Custom Engines Work

The `modelPrune()` algorithm follows this iterative process:

1. **Fit** the model with current predictors
2. **Diagnose** each predictor (compute a "badness" metric)
3. **Identify** the worst predictor (highest diagnostic value)
4. **Remove** if it exceeds the `limit` threshold
5. **Repeat** until all predictors satisfy the limit

Your custom engine defines steps 1 and 2; `modelPrune()` handles the iteration logic.

### Engine Structure Requirements

A custom engine is a named list with two required functions:

```{r, eval=FALSE}
my_engine <- list(
  # Required: How to fit the model
  fit = function(formula, data, ...) {
    # Your model fitting code
    # Must return a fitted model object
  },

  # Required: How to compute diagnostics
  diagnostics = function(model, fixed_effects) {
    # Compute diagnostic scores for each fixed effect
    # Higher values = worse (more likely to be removed)
    # Must return a named numeric vector
  },

  # Optional: Name for error messages
  name = "my_custom_engine"  # Defaults to "custom"
)
```

**Key principle**: The `diagnostics` function must return **higher values for worse predictors**. This inverted metric ensures the algorithm removes the most problematic variables first.

## 2.2 Example: INLA Engine (Bayesian Spatial Models)

### Background

INLA (Integrated Nested Laplace Approximations) is a popular package for fast Bayesian inference, especially for spatial and temporal models. Unlike traditional VIF, we can use **posterior uncertainty** as a pruning criterion: variables with high posterior standard deviation contribute less information to the model.

### Implementation

```{r, eval=FALSE}
# Custom engine for INLA
inla_engine <- list(
  name = "inla",

  fit = function(formula, data, ...) {
    # Fit INLA model
    INLA::inla(
      formula = formula,
      data = data,
      family = list(...)$family %||% "gaussian",
      control.compute = list(config = TRUE),
      ...
    )
  },

  diagnostics = function(model, fixed_effects) {
    # Use posterior SD as "badness" metric
    # Higher SD = more uncertain = candidate for removal
    summary_fixed <- model$summary.fixed
    scores <- summary_fixed[, "sd"]
    names(scores) <- rownames(summary_fixed)

    # Return scores for fixed effects only
    scores[fixed_effects]
  }
)

# Usage example
pruned <- modelPrune(
  y ~ x1 + x2 + x3 + x4,
  data = my_data,
  engine = inla_engine,
  limit = 0.5  # Remove if posterior SD > 0.5
)
```

### How It Works

1. **fit**: Calls `INLA::inla()` to compute posterior distributions
2. **diagnostics**: Extracts posterior standard deviations from `model$summary.fixed`
3. **limit**: Threshold for acceptable uncertainty (e.g., 0.5 means remove predictors with posterior SD > 0.5)

**Interpretation**: Variables with high posterior SD have coefficients that are uncertain given the data. Removing them reduces model complexity without losing much information.

**When to use**: Spatial models, hierarchical models, disease mapping, ecological modeling with INLA.

## 2.3 Example: mgcv Engine (GAMs)

### Background

Generalized Additive Models (GAMs) allow non-linear relationships via smooth terms. When pruning GAMs, we want to remove **parametric (linear) terms** with weak evidence while preserving smooth terms that model non-linear patterns.

### Implementation

```{r, eval=FALSE}
# Custom engine for mgcv GAMs
mgcv_engine <- list(
  name = "mgcv_gam",

  fit = function(formula, data, ...) {
    mgcv::gam(formula, data = data, ...)
  },

  diagnostics = function(model, fixed_effects) {
    # Use p-values as badness metric
    # Higher p-value = less significant = candidate for removal
    summary_obj <- summary(model)

    # Extract parametric term p-values
    pvals <- summary_obj$p.pv

    # Return p-values for fixed effects
    pvals[fixed_effects]
  }
)

# Usage example
pruned <- modelPrune(
  y ~ x1 + x2 + s(x3),  # s() for smooth terms
  data = my_data,
  engine = mgcv_engine,
  limit = 0.05  # Remove if p > 0.05
)
```

### How It Works

1. **fit**: Calls `mgcv::gam()` to fit the additive model
2. **diagnostics**: Extracts p-values for **parametric terms only** (not smooth terms)
3. **limit**: Significance threshold (e.g., 0.05 for traditional α = 0.05)

**Important**: `modelPrune()` only removes parametric (linear) terms. Smooth terms specified with `s()`, `te()`, etc. are **never removed** automatically, as they're part of the model structure.

**When to use**: Non-linear regression, ecological response curves, time series with trends.

## 2.4 Example: Custom Criterion (AIC-Based)

### Background

Sometimes you want to prune based on **model comparison metrics** rather than coefficient-level diagnostics. This example shows how to remove variables that don't improve model fit according to AIC.

### Implementation

```{r, eval=FALSE}
# AIC-based pruning engine
aic_engine <- list(
  name = "aic_pruner",

  fit = function(formula, data, ...) {
    lm(formula, data = data)
  },

  diagnostics = function(model, fixed_effects) {
    # For each predictor, compute ΔAIC if removed
    full_aic <- AIC(model)

    scores <- numeric(length(fixed_effects))
    names(scores) <- fixed_effects

    for (var in fixed_effects) {
      # Refit without this variable
      reduced_formula <- update(formula(model), paste("~ . -", var))
      reduced_model <- lm(reduced_formula, data = model$model)

      # ΔAIC: negative means removing improves model
      # We negate so "higher = worse" convention holds
      scores[var] <- -(AIC(reduced_model) - full_aic)
    }

    scores
  }
)

# Usage: Remove predictors with ΔAIC < -2 (improve AIC by > 2 when removed)
pruned <- modelPrune(
  y ~ x1 + x2 + x3,
  data = my_data,
  engine = aic_engine,
  limit = -2
)
```

### How It Works

1. **fit**: Standard linear model
2. **diagnostics**: For each variable, refit the model **without** that variable and compute ΔAIC
3. **limit**: Threshold for ΔAIC (e.g., -2 means "remove if AIC improves by more than 2 points")

**Key insight**: The score is **negated** (multiplied by -1) to maintain the "higher is worse" convention. A variable with score > -2 means removing it would worsen AIC by more than 2, so it's kept.

**When to use**: Model selection focused on parsimony, comparing nested models, when VIF isn't the right metric.

**Alternative metrics**: You can adapt this pattern for BIC, likelihood ratio tests, or any other model comparison criterion.

## 2.5 Validation and Error Handling

### Automatic Validation

`modelPrune()` automatically validates custom engines to catch common errors early:

```{r, error=TRUE}
# Invalid engine: missing 'diagnostics'
bad_engine <- list(
  fit = function(formula, data, ...) lm(formula, data = data)
  # Missing 'diagnostics'
)

tryCatch({
  modelPrune(mpg ~ ., data = mtcars, engine = bad_engine, limit = 5)
}, error = function(e) {
  cat("Error:", e$message, "\n")
})
```

### Validation Checklist

Your custom engine must satisfy these requirements:

1. **Structure**: Named list with `fit` and `diagnostics` functions
2. **fit returns model object**: Must return something the `diagnostics` function can consume
3. **diagnostics returns numeric vector**: No characters, factors, or other types
4. **Correct length**: One diagnostic value per fixed effect (excluding intercept)
5. **Named vector**: Names must match variable names exactly
6. **No missing values**: NA, NaN, or Inf will cause errors
7. **Higher = worse**: Diagnostic values must increase for more problematic predictors

### Debugging Tips

If your custom engine fails, check:

```{r, eval=FALSE}
# Test your fit function in isolation
test_model <- my_engine$fit(y ~ x1 + x2, data = my_data)
summary(test_model)  # Does it work?

# Test your diagnostics function
test_diag <- my_engine$diagnostics(test_model, c("x1", "x2"))
print(test_diag)  # Numeric? Named? Correct length?

# Check that "higher = worse" convention is satisfied
# The variable with the highest score should be the one you'd remove first
```

---

# 3. Exact Subset Enumeration

## 3.1 When You Need ALL Maximal Subsets

`corrPrune()` returns a **single** subset. Sometimes you want **all** maximal subsets:

```{r}
# corrPrune: Single subset
single <- corrPrune(mtcars, threshold = 0.7)
cat("corrPrune returned:", ncol(single), "variables\n")

# corrSelect: ALL maximal subsets (use higher threshold to ensure subsets exist)
all_subsets <- corrSelect(mtcars, threshold = 0.9)
show(all_subsets)
```

## 3.2 Exploring Multiple Subsets

When multiple maximal subsets exist, you can explore them:

```{r}
if (length(all_subsets@subset_list) > 0) {
  # Display first few subsets
  cat("First few maximal subsets:\n")
  for (i in seq_len(min(3, length(all_subsets@subset_list)))) {
    cat(sprintf("\nSubset %d (avg corr = %.3f):\n", i, all_subsets@avg_corr[i]))
    cat(" ", paste(all_subsets@subset_list[[i]], collapse = ", "), "\n")
  }

  # Analyze subset characteristics
  subset_sizes <- lengths(all_subsets@subset_list)
  cat("\nSubset sizes:\n")
  print(table(subset_sizes))

  cat("\nAverage correlations:\n")
  print(summary(all_subsets@avg_corr))
} else {
  cat("No subsets found at threshold 0.9\n")
}
```

## 3.3 Extracting Specific Subsets

```{r}
if (length(all_subsets@subset_list) > 0) {
  # Extract subset with lowest average correlation
  best_idx <- which.min(all_subsets@avg_corr)
  best_subset <- corrSubset(all_subsets, mtcars, which = best_idx)

  cat("Best subset (lowest avg correlation):\n")
  print(names(best_subset))

  # Extract subset with most predictors
  subset_sizes <- lengths(all_subsets@subset_list)
  largest_idx <- which.max(subset_sizes)
  largest_subset <- corrSubset(all_subsets, mtcars, which = largest_idx)

  cat("\nLargest subset:\n")
  print(names(largest_subset))
} else {
  cat("No subsets to extract at threshold 0.9\n")
}
```

## 3.4 Advanced: Domain-Specific Subset Selection

You can define custom criteria for choosing among multiple subsets:

```{r}
if (length(all_subsets@subset_list) > 0) {
  # Custom criterion: Prefer subsets with specific variables
  preferred_vars <- c("mpg", "hp", "wt")

  # Compute score: number of preferred variables in each subset
  scores <- sapply(all_subsets@subset_list, function(vars) {
    sum(preferred_vars %in% vars)
  })

  # Select subset with most preferred variables
  best_idx <- which.max(scores)
  cat("Subset with most preferred variables (score:", scores[best_idx], "):\n")
  cat(paste(all_subsets@subset_list[[best_idx]], collapse = ", "), "\n")

  # Extract as data frame
  preferred_subset <- corrSubset(all_subsets, mtcars, which = best_idx)
  print(head(preferred_subset))
} else {
  cat("No subsets available for custom selection\n")
}
```

---

# 4. Performance Optimization

## 4.1 Precomputed Correlation Matrices

For repeated pruning with different thresholds, precompute the correlation matrix:

```{r}
# Create larger dataset for meaningful timing comparison
set.seed(123)
large_data <- as.data.frame(matrix(rnorm(100 * 50), ncol = 50))

# Benchmark: Recompute correlation every time
time1 <- median(microbenchmark(
  {
    result1 <- corrPrune(large_data, threshold = 0.7)
    result2 <- corrPrune(large_data, threshold = 0.8)
    result3 <- corrPrune(large_data, threshold = 0.9)
  },
  times = 3,
  unit = "ms"
)$time) / 1e6  # Convert nanoseconds to milliseconds

# Benchmark: Compute correlation once, reuse
cor_matrix <- cor(large_data)
time2 <- median(microbenchmark(
  {
    result1 <- MatSelect(cor_matrix, threshold = 0.7)
    result2 <- MatSelect(cor_matrix, threshold = 0.8)
    result3 <- MatSelect(cor_matrix, threshold = 0.9)
  },
  times = 3,
  unit = "ms"
)$time) / 1e6  # Convert nanoseconds to milliseconds

cat(sprintf("Recomputing each time: %.1f ms\n", time1))
cat(sprintf("Precomputed matrix: %.1f ms\n", time2))
cat(sprintf("Speedup: %.1fx faster\n", time1 / time2))
```

**Use precomputed matrices when**:

- Testing multiple thresholds
- Cross-validation workflows
- Sensitivity analysis

## 4.2 Memory Considerations for Large Data

For large datasets (n > 10,000, p > 500):

### Memory-Efficient Correlation Computation

```{r, eval=FALSE}
# Standard (memory-intensive for large n)
cor_matrix <- cor(large_data)

# Memory-efficient alternative (for very large n)
# Process in chunks if needed
compute_cor_chunked <- function(data, chunk_size = 1000) {
  n <- nrow(data)
  n_chunks <- ceiling(n / chunk_size)

  # Use online algorithm or chunked computation
  # (Implementation depends on your data size)
}
```

### Sparse Correlation Matrices

If most correlations are low, consider sparse storage:

```{r, eval=FALSE}
# Convert to sparse format (requires Matrix package)
library(Matrix)

# Compute correlation
cor_mat <- cor(data)

# Threshold and convert to sparse
cor_sparse <- cor_mat
cor_sparse[abs(cor_sparse) < 0.3] <- 0  # Set low correlations to 0
cor_sparse <- Matrix(cor_sparse, sparse = TRUE)

# Memory savings
object.size(cor_mat)
object.size(cor_sparse)
```

## 4.3 Parallel Processing Strategies

For multiple independent pruning operations:

```{r, eval=FALSE}
library(parallel)

# Create cluster
cl <- makeCluster(detectCores() - 1)

# Export functions to cluster
clusterEvalQ(cl, library(corrselect))

# Parallel pruning with different thresholds
thresholds <- seq(0.5, 0.9, by = 0.1)
results <- parLapply(cl, thresholds, function(thresh) {
  corrPrune(my_data, threshold = thresh)
})

# Clean up
stopCluster(cl)
```

**Note**: corrselect itself doesn't parallelize internally (for reproducibility), but you can parallelize **across** multiple analyses.

## 4.4 Choosing the Right Mode

Decision tree for mode selection:

```
p ≤ 15:  Use "exact" (fast enough, guaranteed optimal)
15 < p ≤ 25:  Use "exact" if time permits, "greedy" if speed critical
p > 25:  Use "greedy" or "auto"
p > 100: Always use "greedy"
```

---

# 5. Troubleshooting

## 5.1 Common Errors and Solutions

### Error: "No valid subsets found"

**Cause**: Threshold too strict - all variables exceed it

```{r, error=TRUE}
# Example: All correlations > 0.9
set.seed(123)
x <- rnorm(100)
high_cor_data <- data.frame(
  x1 = x,
  x2 = x + rnorm(100, sd = 0.01),
  x3 = x + rnorm(100, sd = 0.01)
)

tryCatch({
  corrPrune(high_cor_data, threshold = 0.5)
}, error = function(e) {
  cat("Error:", e$message, "\n")
})
```

**Solutions**:
1. Increase threshold
2. Use `force_in` to keep at least one variable
3. Check data for near-duplicates

```{r, error=TRUE}
# Solution 1: Increase threshold
result <- corrPrune(high_cor_data, threshold = 0.95)
print(names(result))

# Solution 2: Force keep one variable
result <- corrPrune(high_cor_data, threshold = 0.5, force_in = "x1")
print(names(result))
```

### Error: force_in variables conflict with threshold

**Cause**: Variables in `force_in` have |correlation| > threshold

```{r, error=TRUE}
# x1 and x2 are highly correlated
tryCatch({
  corrPrune(high_cor_data, threshold = 0.5, force_in = c("x1", "x2"))
}, error = function(e) {
  cat("Error:", e$message, "\n")
})
```

**Solution**: Either increase threshold or reduce `force_in` set

### Error: VIF computation fails in modelPrune()

**Cause**: Perfect multicollinearity (R² = 1)

```{r, error=TRUE}
# Create perfect multicollinearity
perfect_data <- data.frame(
  y = rnorm(100),
  x1 = rnorm(100),
  x2 = rnorm(100)
)
perfect_data$x3 <- perfect_data$x1 + perfect_data$x2  # Perfect collinearity

tryCatch({
  modelPrune(y ~ ., data = perfect_data, limit = 5)
}, error = function(e) {
  cat("Error:", e$message, "\n")
})
```

**Solution**: Use `corrPrune()` first to remove perfect collinearity:

```{r}
# Two-step approach
step1 <- corrPrune(perfect_data[, -1], threshold = 0.99)
step2_data <- data.frame(y = perfect_data$y, step1)
result <- modelPrune(y ~ ., data = step2_data, limit = 5)
print(attr(result, "selected_vars"))
```

## 5.2 Threshold Selection Guidance

### For corrPrune() (Correlation Threshold)

**Conservative (strict)**:

- threshold = 0.5: Very low redundancy, may lose information
- Use when: Interpretability is critical, small sample size

**Moderate (recommended)**:

- threshold = 0.7: Balances redundancy and information retention
- Use when: Standard regression, general analysis

**Lenient**:

- threshold = 0.9: Only removes near-duplicates
- Use when: Large sample size, prediction focus

### For modelPrune() (VIF Limit)

**Strict**:

- limit = 2: Very low multicollinearity, may over-prune
- Use when: Small sample size, interpretability critical

**Moderate (recommended)**:

- limit = 5: Standard threshold in literature
- Use when: General regression analysis

**Lenient**:

- limit = 10: Tolerates more multicollinearity
- Use when: Large sample size, prediction focus

### Empirical Approach: Visualize First

```{r, fig.width=10, fig.height=4, fig.alt="Two side-by-side plots for threshold selection guidance. Left panel: histogram of absolute correlations in mtcars showing distribution with vertical lines marking strict (0.5, red), moderate (0.7, blue), and lenient (0.9, green) thresholds. Right panel: line plot showing number of variables retained versus threshold, demonstrating sensitivity analysis with plateau beginning around 0.7, helping identify optimal threshold for balancing redundancy reduction and information retention."}
data(mtcars)

# Visualize correlation distribution
cor_mat <- cor(mtcars)
cor_vec <- cor_mat[upper.tri(cor_mat)]

par(mfrow = c(1, 2))

# Histogram of correlations
hist(abs(cor_vec), breaks = 30,
     main = "Distribution of |Correlations|",
     xlab = "|Correlation|",
     col = "lightblue")
abline(v = c(0.5, 0.7, 0.9), col = c("red", "blue", "green"), lwd = 2, lty = 2)
legend("topright",
       legend = c("0.5 (strict)", "0.7 (moderate)", "0.9 (lenient)"),
       col = c("red", "blue", "green"), lwd = 2, lty = 2,
       bty = "o", bg = "white")

# Subset size vs threshold
thresholds <- seq(0.3, 0.95, by = 0.05)
sizes <- sapply(thresholds, function(t) {
  tryCatch({
    ncol(corrPrune(mtcars, threshold = t))
  }, error = function(e) NA)
})

plot(thresholds, sizes, type = "b",
     xlab = "Threshold",
     ylab = "Number of Variables Retained",
     main = "Threshold Sensitivity",
     col = "blue", lwd = 2)
abline(h = ncol(mtcars), lty = 2, col = "gray")
text(0.3, ncol(mtcars), "Original", pos = 3)
```

**Strategy**: Choose threshold where curve begins to plateau.

## 5.3 Handling Edge Cases

### Single Predictor After Pruning

```{r}
# Very strict threshold may leave only 1 variable
strict_result <- corrPrune(mtcars, threshold = 0.3)
cat("Variables remaining:", ncol(strict_result), "\n")

# Check if result is usable
if (ncol(strict_result) < 2) {
  cat("Warning: Only 1 variable remaining. Consider:\n")
  cat("  1. Increasing threshold\n")
  cat("  2. Using force_in to keep important variables\n")
}
```

### All Variables Removed

```{r, error=TRUE}
# Impossible threshold
tryCatch({
  corrPrune(mtcars, threshold = 0.0)
}, error = function(e) {
  cat("Error:", e$message, "\n")
})
```

### Mixed-Type Data

```{r}
# Create mixed data
mixed_data <- mtcars
mixed_data$cyl <- factor(mixed_data$cyl)
mixed_data$am <- factor(mixed_data$am)

# Use assocSelect for mixed types
result <- assocSelect(mixed_data, threshold = 0.6)
show(result)
```

---

# 6. Best Practices

## 6.1 Workflow Recommendations

### For Exploratory Analysis

```{r, eval=FALSE}
# 1. Visualize correlations
corrplot::corrplot(cor(data), method = "circle")

# 2. Try multiple thresholds
results <- lapply(c(0.5, 0.7, 0.9), function(t) {
  corrPrune(data, threshold = t)
})

# 3. Compare subset sizes
sapply(results, ncol)

# 4. Choose based on your needs
final_data <- results[[2]]  # threshold = 0.7
```

### For Publication-Quality Analysis

```{r, eval=FALSE}
# 1. Use exact mode for reproducibility and optimality
data_pruned <- corrPrune(data, threshold = 0.7, mode = "exact")

# 2. Document in methods section
cat(sprintf(
  "Variables were pruned using corrselect::corrPrune() with threshold = 0.7, ",
  "exact mode, retaining %d of %d original predictors.",
  ncol(data_pruned), ncol(data)
))

# 3. Report which variables were removed
removed <- attr(data_pruned, "removed_vars")
cat("Removed variables:", paste(removed, collapse = ", "))
```

## 6.2 Combining with Other Methods

```{r, eval=FALSE}
# Comprehensive variable selection pipeline
pipeline <- function(data, response) {
  # Step 1: Remove correlations
  step1 <- corrPrune(data, threshold = 0.7, mode = "auto")

  # Step 2: VIF cleanup
  step2_data <- data.frame(response = response, step1)
  step2 <- modelPrune(response ~ ., data = step2_data, limit = 5)

  # Step 3: Feature importance (optional)
  if (requireNamespace("Boruta", quietly = TRUE)) {
    boruta_result <- Boruta::Boruta(response ~ ., data = step2)
    important <- Boruta::getSelectedAttributes(boruta_result)
    final_data <- step2[, c("response", important)]
  } else {
    final_data <- step2
  }

  final_data
}
```

---

# 7. Summary

## Key Takeaways

### Algorithms
- Use **exact mode** for p ≤ 20 (optimal, reproducible)
- Use **greedy mode** for p > 20 (fast, near-optimal)
- Use **auto mode** to let corrselect decide

### Custom Engines
- Integrate any modeling package (INLA, mgcv, brms)
- Define custom pruning criteria (AIC, BIC, p-values)
- Two required functions: `fit` and `diagnostics`

### Optimization
- Precompute correlation matrices for multiple thresholds
- Use greedy mode for large p
- Parallelize across analyses, not within

### Troubleshooting
- Visualize correlation distribution before choosing threshold
- Use `force_in` to protect important variables
- Two-step pruning (corrPrune → modelPrune) for robustness

---

# 8. References

**Algorithms**:

- Eppstein, D., Löffler, M., & Strash, D. (2010). Listing all maximal cliques in sparse graphs in near-optimal time. *Symposium on Algorithms and Computation*.
- Bron, C., & Kerbosch, J. (1973). Algorithm 457: Finding all cliques of an undirected graph. *Communications of the ACM*, 16(9), 575-577.

**Multicollinearity**:

- O'Brien, R. M. (2007). A caution regarding rules of thumb for variance inflation factors. *Quality & Quantity*, 41(5), 673-690.
- Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). *Regression Diagnostics*. Wiley.

**Software**:

- INLA: Rue, H., Martino, S., & Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models. *Journal of the Royal Statistical Society: Series B*, 71(2), 319-392.
- mgcv: Wood, S. N. (2017). *Generalized Additive Models: An Introduction with R* (2nd ed.). Chapman and Hall/CRC.

---

## See Also

- `vignette("quickstart")` - 5-minute introduction
- `vignette("workflows")` - Real-world examples
- `vignette("comparison")` - vs caret, Boruta, glmnet
- `vignette("corrselect_vignette")` - Original exact methods vignette
- `?corrPrune` - Association-based pruning
- `?modelPrune` - Model-based pruning
- `?corrSelect` - Exact subset enumeration

## Session Info

```{r}
sessionInfo()
```