--- title: "Bayesian variable selection with geomc.vs" author: "Vivekananda Roy" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true pdf_document: toc: true toc_depth: 2 fig_width: 6 fig_height: 4 dev: 'pdf' keep_tex: false bibliography: references.bib link-citations: true vignette: > %\VignetteIndexEntry{Bayesian variable selection with geomc.vs} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = FALSE, tidy = FALSE, comment = "#>", progress = FALSE, echo = TRUE, dev = 'pdf', dpi = 120, fig.width = 6, fig.height = 4, out.width = "100%" ) ``` ## Introduction Bayesian variable selection is an approach to identifying important predictors in regression models, particularly when the number of predictors $p$ is large relative to the number of observations $n$. The `geomc.vs()` function implements geometric MCMC [@roy2024] for high-dimensional Bayesian variable selection in linear regression. In particular, `geomc.vs` implements the geometric MH algorithm [@roy2024] with specific choices for $f$ and $g$ for sampling from a popular 'spike and slab' Bayesian variable selection model. The details of the variable selection model and the MH algorithm can be found in @roy2024 (see also the appendix of this document for a description of the model). This vignette illustrates the use of `geomc.vs`. ### The Variable Selection Problem Given: - Response vector $\mathbf{y} \in \mathbb{R}^n$ - Predictor matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ We want to identify which subset of the $p$ predictors is truly associated with the response. This is especially challenging when $p \gg n$ (high-dimensional setting). ### Why Bayesian Variable Selection? **Advantages:** - Quantifies uncertainty about which variables to include - Provides a posterior inclusion probability for each variable - Handles model uncertainty systematically ```{r setup} library(geommc) ``` ## Basic Usage The `geomc.vs()` function requires minimal inputs: ```{r basic-usage, eval=FALSE} result <- geomc.vs( X = X, # n x p predictor matrix y = y # n-vector response ) ``` # Example 1: ## Small Sparse Model Let's start with a simple example where only 3 out of 100 predictors are truly important.
## Data Generation ```{r data-1} # Set parameters n <- 50 # Sample size p <- 100 # Number of predictors nonzero <- 3 # Number of true predictors trueidx <- 1:3 # Indices of true predictors nonzero.value <- 4 # Effect size # True regression coefficients TrueBeta <- numeric(p) TrueBeta[trueidx] <- nonzero.value # Generate correlated predictors rho <- 0.5 set.seed(3) xone <- matrix(rnorm(n * p), n, p) X <- sqrt(1 - rho) * xone + sqrt(rho) * rnorm(n) # Generate response intercept <- 0.5 sigma_true <- 1 y <- intercept + X %*% TrueBeta + rnorm(n, 0, sigma_true) cat("Data dimensions:\n") cat(" n (observations):", n, "\n") cat(" p (predictors):", p, "\n") cat(" Number of true non-zero predictors:", nonzero, "\n") cat(" True coefficient indices:", paste(trueidx, collapse = ", "), "\n") cat(" True coefficient values:", nonzero.value, "\n") ``` ## Run geomc.vs for Variable Selection ```{r run-vs-1} # Run geomc.vs set.seed(123) result <- geomc.vs( X = X, y = y, model.summary = TRUE # Compute model summaries ) ``` ## Examining the Results If `model.summary` is set to TRUE, the `geomc.vs()` function returns a comprehensive list of results: ```{r results-structure} names(result) ``` Key components: - **`samples`**: MCMC samples of model indicators (which variables are included) returned as an `n.iter` $\times$ `p` sparse `lgCMatrix`. - **`acceptance.rate`**: Proportion of accepted proposals - **`log.p`**: Log unnormalized posterior probabilities of sampled models - **`mip`**: Marginal inclusion probabilities for each variable - **`wmip`**: Weighted marginal inclusion probabilities for each variable - **`median.model`**: The median probability model - **`wam`**: The weighted average model - **`beta.mean`**: Posterior mean of regression coefficients - **`beta.wam`**: An alternative (model probability weighted) estimate of the posterior mean of the regression coefficients On the other hand, if `model.summary` is set to FALSE (the default), `geomc.vs` returns only `samples`, `acceptance.rate`, and `log.p`. ## Marginal Inclusion Probabilities (MIP) The MIP for variable $j$ is the posterior probability that variable $j$ should be included in the model: $$\text{MIP}_j = P(\beta_j \neq 0 | \mathbf{y}, \mathbf{X})$$ The formula used by `geomc.vs` to compute MIP is given in the appendix.
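Since `samples` stores one row of inclusion indicators per iteration, the (unweighted) MIPs can also be approximated directly as column means of the sampled indicator matrix. The small sketch below is only a sanity check against the `mip` component returned by `geomc.vs`; the exact formula the function uses (and how burn-in enters) is given in the appendix.

```{r mip-check}
# Approximate sanity check: MIPs as column means of the sampled inclusion
# indicators. This mirrors, but need not exactly equal, the formula
# geomc.vs uses internally (see the appendix).
mip_colmeans <- colMeans(as.matrix(result$samples))
head(cbind(mip = result$mip, col.means = mip_colmeans))
```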
```{r mip-1} # Top 10 variables by MIP top_vars <- order(result$mip, decreasing = TRUE)[1:10] cat("Top 10 variables by Marginal Inclusion Probability:\n\n") mip_df <- data.frame( Variable = top_vars, MIP = round(result$mip[top_vars], 4), True = ifelse(top_vars %in% trueidx, "Yes", "No") ) print(mip_df, row.names = FALSE) ``` ## Visualizing Marginal Inclusion Probabilities ```{r plot-mip-1, fig.width=6, fig.height=4} par(mfrow = c(1, 1)) # Plot MIPs plot(1:p, result$mip, type = "h", lwd = 2, col = "gray", xlab = "Variable Index", ylab = "MIP", main = "Marginal Inclusion Probabilities", ylim = c(0, 1)) # Highlight true variables points(trueidx, result$mip[trueidx], col = "red", pch = 19, cex = 2) # Add threshold line abline(h = 0.5, col = "blue", lty = 2, lwd = 2) legend("topright", legend = c("True variables", "MIP > 0.5 threshold"), col = c("red", "blue"), pch = c(19, NA), lty = c(NA, 2), lwd = 2) ``` ## The Median Probability Model The returned median probability model includes all variables with `mip` > `model.threshold` (Default 0.5): ```{r median-model-1} cat("Median Probability Model:\n") cat(" Number of variables selected:", length(result$median.model), "\n") cat(" Variable indices:", paste(result$median.model, collapse = ", "), "\n\n") # Check if we recovered the true model correctly_identified <- all(trueidx %in% result$median.model) false_positives <- setdiff(result$median.model, trueidx) false_negatives <- setdiff(trueidx, result$median.model) cat("Model Recovery:\n") cat(" All true variables identified:", correctly_identified, "\n") cat(" Number of false positives:", length(false_positives), "\n") cat(" Number of false negatives:", length(false_negatives), "\n") if (length(false_positives) > 0) { cat(" False positive indices:", paste(false_positives, collapse = ", "), "\n") } ``` ## Posterior Mean of Coefficients ```{r coef-estimates-1} # Compare estimated vs true coefficients for true variables cat("Coefficient Estimates (True Variables):\n\n") coef_comparison <- data.frame( Variable = trueidx, True_Beta = TrueBeta[trueidx], Posterior_Mean = round(result$beta.mean[(1+trueidx)], 3), Beta_WAM = round(result$beta.wam[(1+trueidx)], 3), MIP = round(result$mip[trueidx], 4) ) print(coef_comparison) intercept_comparison<- data.frame( True_intercept =intercept, Posterior_Mean = round(result$beta.mean[1], 3), Beta_WAM = round(result$beta.wam[1], 3) ) print(intercept_comparison) ``` The formulas used by `geomc.vs` to compute `beta.mean` and `beta.wam` are given in the appendix. ## Model Space Exploration ```{r model-space, fig.width=7, fig.height=5} # Model sizes visited model_sizes <- apply(result$samples, 1, sum) hist(model_sizes, breaks = 20, main = "Distribution of Model Sizes Visited", xlab = "Number of Variables in Model", col = "lightblue", border = "white") abline(v = nonzero, col = "red", lwd = 3, lty = 2) legend("topright", legend = "True model size", col = "red", lty = 2, lwd = 2) cat("\nModel Size Summary:\n") cat(" Mean model size:", round(mean(model_sizes), 2), "\n") cat(" Median model size:", median(model_sizes), "\n") cat(" True model size:", nonzero, "\n") ``` # Example 2: ## Another Example to Illustrate Different Data Structures and Tuning Parameters This example demonstrates how to use `geomc.vs()` for variable selection with sparse matrices, such as document term matrices commonly found in text mining applications. Different tuning parameters allowed by `geomc.vs()` will also be discussed. 
In particular, we will: - Generate realistic sparse data - Handle zero-variance columns - Discuss the tuning parameters of `geomc.vs()` - Run Bayesian variable selection - Analyze and visualize the results ## Data Generation We'll create a document-term matrix where most entries are zero (sparse), simulating a text classification problem. ```{r generate-sparse-data} # Problem dimensions n <- 80 # Number of documents p <- 200 # Vocabulary size (number of words) sparsity <- 0.95 # 95% of entries are zero # True important words (only 5 words matter) true_words <- c(10, 25, 50, 100, 150) true_effects <- c(2.5, -2.0, 3.0, -1.8, 2.2) # Create sparse design matrix (document-term matrix) # Most words don't appear in most documents X_dense <- matrix(0, n, p) # Add word counts (Poisson distributed when word appears) set.seed(3) for (i in 1:n) { # Each document has about 5% of words present present_words <- sample(1:p, size = round(p * (1 - sparsity))) X_dense[i, present_words] <- rpois(length(present_words), lambda = 2) } # Ensure true words appear with higher frequency for (word in true_words) { appears_in <- sample(1:n, size = round(0.7 * n)) X_dense[appears_in, word] <- rpois(length(appears_in), lambda = 3) } # Remove zero-variance columns col_vars <- apply(X_dense, 2, var) nonzero_cols <- which(col_vars > 0) cat("Original number of columns:", p, "\n") cat("Columns with zero variance:", sum(col_vars == 0), "\n") cat("Columns retained:", length(nonzero_cols), "\n\n") # Keep mapping from new to old indices old_to_new <- rep(NA, p) old_to_new[nonzero_cols] <- 1:length(nonzero_cols) # Update X_dense to only include non-zero variance columns X_dense <- X_dense[, nonzero_cols, drop = FALSE] # Update true_words indices to reflect new column positions true_words_old <- true_words true_words <- match(true_words_old, nonzero_cols) true_words <- true_words[!is.na(true_words)] # Keep only those that survived cat("Original true words:", paste(true_words_old, collapse = ", "), "\n") cat("New true words indices:", paste(true_words, collapse = ", "), "\n") cat("True words retained:", length(true_words), "out of", length(true_words_old), "\n\n") # Update true_effects to match retained true words true_effects <- true_effects[!is.na(match(true_words_old, nonzero_cols))] # Update p to reflect actual number of columns p <- ncol(X_dense) cat("Updated p:", p, "\n\n") library(Matrix) # Convert to sparse matrix (dgCMatrix) X_sparse <- Matrix(X_dense, sparse = TRUE) # Check sparsity actual_sparsity <- 1 - (nnzero(X_sparse) / (n * p)) cat("Actual sparsity:", round(actual_sparsity * 100, 1), "%\n") cat("Non-zero elements:", nnzero(X_sparse), "\n") cat("Matrix class:", class(X_sparse), "\n") ``` ## Visualize Sparse Matrix Structure ```{r visualize-sparse, fig.width=6, fig.height=4} # Visualize the sparse matrix pattern image( x = 1:min(100, p), y = 1:min(50, n), z = as.matrix(t(X_sparse[1:min(50, n), 1:min(100, p)])), main = "Sparse Matrix Pattern (50 docs × 100 words)", ylab = "Documents", xlab = "Words", col = colorRampPalette(c("white", "blue", "darkblue"))(100) ) ``` ## Memory Comparison One of the main advantages of sparse matrices is memory efficiency: ```{r memory-comparison} # Compare memory usage size_sparse <- object.size(X_sparse) size_dense <- object.size(X_dense) cat("Memory usage:\n") cat(" Sparse matrix:", format(size_sparse, units = "Kb"), "\n") cat(" Dense matrix:", format(size_dense, units = "Kb"), "\n") cat(" Reduction:", round(100 * (1 - as.numeric(size_sparse) /
as.numeric(size_dense)), 1), "%\n") ``` ## Generate Response Variable ```{r generate-response} # Create coefficient vector (now with updated p) beta_true <- numeric(p) beta_true[true_words] <- true_effects # Generate response (document sentiment/category) set.seed(3) linear_predictor <- X_sparse %*% beta_true y <- as.numeric(linear_predictor) + rnorm(n, sd = 1.5) cat("Response variable summary:\n") print(summary(y)) ``` ## Variable Selection using geomc.vs with Symmetric Random Walk Base By default (`symm = TRUE`), `geomc.vs` uses the symmetric random walk base (for details, see @roy2024). Now, we apply `geomc.vs()` to the sparse matrix: ```{r run-geomc-vs} # Run variable selection set.seed(123) result <- geomc.vs( X = X_sparse, # Sparse dgCMatrix y = y, model.summary = TRUE # Compute model summaries (default: FALSE) ) ``` ## Understanding the Output ### MCMC Samples The `samples` matrix contains binary indicators for which variables are in the model at each iteration: ```{r samples-explained} # Dimensions cat("Sample matrix dimensions:", dim(result$samples), "\n") cat("(rows = iterations, columns = variables)\n\n") # First few iterations cat("First 5 iterations (first 20 variables):\n") print(result$samples[1:5, 1:20]) # Each row is a model cat("\nIteration 1 includes variables:", which(result$samples[1, ] == 1), "\n") cat("Iteration 2 includes variables:", which(result$samples[2, ] == 1), "\n") # The proportion of accepted proposals cat("\nAcceptance rate:", round(result$acceptance.rate, 3), "\n") ``` ### Log Posterior Values The `log.p` vector contains the log posterior probability (up to an additive constant) for each sampled model: ```{r logpost, fig.width=6, fig.height=4} # Trace plot of log posterior plot(result$log.p, type = "l", xlab = "Iteration", ylab = "Log Posterior", main = "Trace of Log Posterior Probability") cat("\nLog posterior summary:\n") cat(" Min:", round(min(result$log.p), 2), "\n") cat(" Max:", round(max(result$log.p), 2), "\n") cat(" Mean:", round(mean(result$log.p), 2), "\n") ``` ## Results Analysis ## Marginal Inclusion Probabilities As mentioned before, MIP represents how often each variable was included in the sampled models.
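Before ranking variables by their MIPs, it can also be informative to pull out the single best model among those visited by the chain. The sketch below uses only the `samples` and `log.p` components shown above; it reports the best *sampled* model, not a formal search over all $2^p$ models.

```{r best-sampled-model}
# Best model among those sampled, judged by the (unnormalized) log posterior
best_iter <- which.max(result$log.p)
best_model <- which(result$samples[best_iter, ])
cat("Best sampled model found at iteration", best_iter, "\n")
cat("Variables included:", paste(best_model, collapse = ", "), "\n")
cat("Log posterior (up to an additive constant):",
    round(result$log.p[best_iter], 2), "\n")
```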
```{r mip-analysis} # Identify top words by MIP top_n <- 20 top_indices <- order(result$mip, decreasing = TRUE)[1:top_n] cat("Top", top_n, "words by Marginal Inclusion Probability:\n\n") cat("(Note: Word indices are in the reduced matrix space)\n\n") mip_table <- data.frame( Rank = 1:top_n, Word_Index = top_indices, MIP = round(result$mip[top_indices], 4), True_Word = ifelse(top_indices %in% true_words, "Yes", "No"), True_Effect = ifelse( top_indices %in% true_words, sprintf("%.2f", beta_true[top_indices]), "" ) ) print(mip_table, row.names = FALSE) # Check recovery cat("\nRecovery of true words:\n") for (i in seq_along(true_words)) { word_idx <- true_words[i] cat(sprintf(" Word %3d (original %3d): MIP = %.4f, Rank = %2d\n", word_idx, true_words_old[i], result$mip[word_idx], which(order(result$mip, decreasing = TRUE) == word_idx))) } ``` ## Visualizing MIPs ```{r plot-mips, fig.width=7, fig.height=6} par(mar = c(4, 4, 3, 1)) # All MIPs plot( 1:p, result$mip, type = "h", col = "gray70", lwd = 1, xlab = "Word Index (in reduced matrix)", ylab = "MIP", main = "Marginal Inclusion Probabilities (All Words)", ylim = c(0, 1) ) points(true_words, result$mip[true_words], col = "red", pch = 19, cex = 2) abline(h = 0.5, col = "blue", lty = 2, lwd = 2) legend("topright", legend = c("True word", "MIP > 0.5"), col = c("red", "blue"), pch = c(19, NA), lty = c(NA, 2), lwd = 2) ``` ## Visualizing WMIPs `geomc.vs` also returns the weighted marginal inclusion probabilities (`wmip`). ```{r plot-wmips, fig.width=7, fig.height=6} par(mar = c(4, 4, 3, 1)) # All WMIPs plot( 1:p, result$wmip, type = "h", col = "gray70", lwd = 1, xlab = "Word Index (in reduced matrix)", ylab = "WMIP", main = "Weighted Marginal Inclusion Probabilities (All Words)", ylim = c(0, 1) ) points(true_words, result$wmip[true_words], col = "red", pch = 19, cex = 2) abline(h = 0.5, col = "blue", lty = 2, lwd = 2) legend("topright", legend = c("True word", "WMIP > 0.5"), col = c("red", "blue"), pch = c(19, NA), lty = c(NA, 2), lwd = 2) ``` ## Model Selection Performance ```{r model-performance} # Median probability model selected_words <- result$median.model cat("Median Probability Model:\n") cat(" Words selected:", length(selected_words), "\n") cat(" Word indices (in reduced matrix):", paste(selected_words, collapse = ", "), "\n\n") # Confusion matrix true_positive <- sum(true_words %in% selected_words) false_positive <- length(selected_words) - true_positive false_negative <- length(true_words) - true_positive true_negative <- p - length(true_words) - false_positive cat("Classification Performance:\n") cat(" True Positives:", true_positive, "/", length(true_words), "\n") cat(" False Positives:", false_positive, "\n") cat(" False Negatives:", false_negative, "\n") cat(" Sensitivity:", round(true_positive / length(true_words), 3), "\n") cat(" Precision:", round(true_positive / max(1, length(selected_words)), 3), "\n") ``` ## Weighted Average Model `geomc.vs` also returns the weighted average model, which includes all variables with `wmip` > `model.threshold` (Default 0.5).
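With the default `model.threshold` of 0.5, the weighted average model is simply the set of variables whose WMIP exceeds 0.5, so it can be reproduced by thresholding `wmip` directly; the one-line check below just restates that definition.

```{r wam-check}
# The WAM should coincide with the variables whose WMIP exceeds the threshold
setequal(which(result$wmip > 0.5), result$wam)
```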
```{r wam-model-performance} # Weighted average model selected_words <- result$wam cat("Weighted Average Model:\n") cat(" Words selected:", length(selected_words), "\n") cat(" Word indices (in reduced matrix):", paste(selected_words, collapse = ", "), "\n\n") # Confusion matrix true_positive <- sum(true_words %in% selected_words) false_positive <- length(selected_words) - true_positive false_negative <- length(true_words) - true_positive true_negative <- p - length(true_words) - false_positive cat("Classification Performance:\n") cat(" True Positives:", true_positive, "/", length(true_words), "\n") cat(" False Positives:", false_positive, "\n") cat(" False Negatives:", false_negative, "\n") cat(" Sensitivity:", round(true_positive / length(true_words), 3), "\n") cat(" Precision:", round(true_positive / max(1, length(selected_words)), 3), "\n") ``` ## Coefficient Estimates ```{r coefficient-estimates} cat("Coefficient Estimates for True Words:\n\n") coef_table <- data.frame( Word_New = true_words, Word_Original = true_words_old[1:length(true_words)], True_Effect = true_effects, Post_Mean = round(result$beta.mean[(1+true_words)], 3), Beta_WAM = round(result$beta.wam[(1+true_words)], 3), MIP = round(result$mip[true_words], 4), WMIP = round(result$wmip[true_words], 4), Selected = ifelse(true_words %in% selected_words, "Yes", "") ) print(coef_table, row.names = FALSE) ``` ## Model Space Exploration ```{r sparse-model-space, fig.width=7, fig.height=6} # Sizes of visited models model_sizes <- apply(result$samples, 1, sum) hist( model_sizes, breaks = 30, main = "Distribution of Model Sizes Visited", xlab = "Number of Words in Model", ylab = "Frequency", col = "lightblue", border = "white" ) abline(v = length(true_words), col = "red", lwd = 3, lty = 2) abline(v = median(model_sizes), col = "blue", lwd = 2, lty = 2) legend("topleft", legend = c( paste("True size =", length(true_words)), paste("Median =", median(model_sizes)) ), col = c("red", "blue"), lty = 2, lwd = 2) cat("\nModel size summary:\n") cat(" Mean:", round(mean(model_sizes), 2), "\n") cat(" Median:", median(model_sizes), "\n") cat(" Range:", range(model_sizes), "\n") ``` ## Key Takeaways - Sparse matrices provide significant memory savings for high-dimensional data - Removing zero-variance columns is essential for numerical stability - The median probability model and MIPs help identify important variables - The method successfully recovers all true variables while controlling false positives ## Variable Selection using geomc.vs with an Asymmetric Random Walk Base `geomc.vs` can also implement an asymmetric random walk base. This is specified by setting `symm=FALSE` and indicating the vector of ('addition', 'deletion', 'swap') move probabilities via the argument `move.prob`. 
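To make the three move types concrete, here is a small, self-contained sketch of what an 'addition', 'deletion', or 'swap' proposal does to a model represented as a set of active variable indices, with the move drawn according to `move.prob`. This is purely illustrative and is not the internal code of `geomc.vs`.

```{r move-types-sketch}
# Illustrative sketch (not geomc.vs internals): one random walk move on the
# model space. 'addition' activates an inactive variable, 'deletion' drops an
# active one, and 'swap' exchanges an active variable for an inactive one.
propose_model <- function(active, p, move.prob = c(0.4, 0.4, 0.2)) {
  pick <- function(x) x[sample.int(length(x), 1)]  # safe single draw
  inactive <- setdiff(seq_len(p), active)
  move <- sample(c("addition", "deletion", "swap"), 1, prob = move.prob)
  if (move == "addition" && length(inactive) > 0) {
    active <- c(active, pick(inactive))
  } else if (move == "deletion" && length(active) > 0) {
    active <- setdiff(active, pick(active))
  } else if (move == "swap" && length(active) > 0 && length(inactive) > 0) {
    active <- c(setdiff(active, pick(active)), pick(inactive))
  }
  sort(active)
}
set.seed(1)
propose_model(active = c(3, 10, 25), p = 50)
```

In `geomc.vs` itself these probabilities only shape the base proposal; the full proposal and acceptance step follow the geometric MH algorithm of @roy2024.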
```{r run-geomc-vs-asym} # Run variable selection set.seed(123) result_asym <- geomc.vs( X = X_sparse, # Sparse dgCMatrix y = y, symm = FALSE, move.prob = c(0.4, 0.4, 0.2), model.summary = TRUE # Compute model summaries (default: FALSE) ) ``` ```{r asym-analysis} cat("\nAcceptance rate:", round(result_asym$acceptance.rate, 3), "\n") # Identify top words by MIP top_n <- 20 top_indices <- order(result_asym$mip, decreasing = TRUE)[1:top_n] cat("Top", top_n, "words by Marginal Inclusion Probability:\n\n") cat("(Note: Word indices are in the reduced matrix space)\n\n") mip_table <- data.frame( Rank = 1:top_n, Word_Index = top_indices, MIP = round(result_asym$mip[top_indices], 4), True_Word = ifelse(top_indices %in% true_words, "Yes", "No"), True_Effect = ifelse( top_indices %in% true_words, sprintf("%.2f", beta_true[top_indices]), "" ) ) print(mip_table, row.names = FALSE) # Check recovery cat("\nRecovery of true words:\n") for (i in seq_along(true_words)) { word_idx <- true_words[i] cat(sprintf(" Word %3d (original %3d): MIP = %.4f, Rank = %2d\n", word_idx, true_words_old[i], result_asym$mip[word_idx], which(order(result_asym$mip, decreasing = TRUE) == word_idx))) } ``` ## Visualizing MIPs ```{r asym-plot-mips, fig.width=7, fig.height=6} par(mar = c(4, 4, 3, 1)) # All MIPs plot( 1:p, result_asym$mip, type = "h", col = "gray70", lwd = 1, xlab = "Word Index (in reduced matrix)", ylab = "MIP", main = "Marginal Inclusion Probabilities (All Words)", ylim = c(0, 1) ) points(true_words, result_asym$mip[true_words], col = "red", pch = 19, cex = 2) abline(h = 0.5, col = "blue", lty = 2, lwd = 2) legend("topright", legend = c("True word", "MIP > 0.5"), col = c("red", "blue"), pch = c(19, NA), lty = c(NA, 2), lwd = 2) ``` ## Model Selection Performance ```{r asym-model-performance} # Median probability model selected_words <- result_asym$median.model cat("Median Probability Model:\n") cat(" Words selected:", length(selected_words), "\n") cat(" Word indices (in reduced matrix):", paste(selected_words, collapse = ", "), "\n\n") # Confusion matrix true_positive <- sum(true_words %in% selected_words) false_positive <- length(selected_words) - true_positive false_negative <- length(true_words) - true_positive true_negative <- p - length(true_words) - false_positive cat("Classification Performance:\n") cat(" True Positives:", true_positive, "/", length(true_words), "\n") cat(" False Positives:", false_positive, "\n") cat(" False Negatives:", false_negative, "\n") cat(" Sensitivity:", round(true_positive / length(true_words), 3), "\n") cat(" Precision:", round(true_positive / max(1, length(selected_words)), 3), "\n") ``` ## Coefficient Estimates ```{r asym-coefficient-estimates} cat("Coefficient Estimates for True Words:\n\n") coef_table <- data.frame( Word_New = true_words, Word_Original = true_words_old[1:length(true_words)], True_Effect = true_effects, Post_Mean = round(result_asym$beta.mean[(1+true_words)], 3), Beta_WAM = round(result_asym$beta.wam[(1+true_words)], 3), MIP = round(result_asym$mip[true_words], 4), WMIP = round(result_asym$wmip[true_words], 4), Selected = ifelse(true_words %in% selected_words, "Yes", "") ) print(coef_table, row.names = FALSE) ``` ## Model Space Exploration ```{r asym-model-space, fig.width=7, fig.height=6} # Sizes of visited models model_sizes <- apply(result_asym$samples, 1, sum) hist( model_sizes, breaks = 30, main = "Distribution of Model Sizes Visited", xlab = "Number of Words in Model", ylab = "Frequency", col = "lightblue", border = "white" )
abline(v = length(true_words), col = "red", lwd = 3, lty = 2) abline(v = median(model_sizes), col = "blue", lwd = 2, lty = 2) legend("topleft", legend = c( paste("True size =", length(true_words)), paste("Median =", median(model_sizes)) ), col = c("red", "blue"), lty = 2, lwd = 2) cat("\nModel size summary:\n") cat(" Mean:", round(mean(model_sizes), 2), "\n") cat(" Median:", median(model_sizes), "\n") cat(" Range:", range(model_sizes), "\n") ``` We see that `geomc.vs`, with either of the two random walk base pmfs, successfully recovers all true variables while controlling false positives. For a detailed analysis of the performance of `geomc.vs` in several ultra-high-dimensional simulated and real data examples, see @roy2024. ## Additional Parameters The `geomc.vs()` function accepts several additional arguments. `initial` is used to specify the initial model (the set of active variables) for running the Markov chain (Default: Null model). `n.iter` is the number of MCMC iterations (Default: 50). `burnin` is the number of initial samples discarded; the median probability model and other model summaries are computed from the post-burnin samples only (Default: 1). `eps` is the epsilon perturbation parameter (Default: 0.5). `lam0` is the precision parameter for the normal prior of $\beta_0$ (Default: 0, corresponding to the improper uniform prior). The shape and scale parameters for the Inverse-Gamma prior on $\sigma^2$ are specified by `a0` and `b0`, respectively (the default values are 0 and 0, corresponding to the prior $1/\sigma^2$). `lam` is the slab precision parameter for the normal prior of the $\beta_i$'s (Default: $n/p^2$ as suggested by the theoretical results of @liduttroy2023). `w` is the prior inclusion probability of each variable (Default: $\sqrt{n}/p$ as suggested by @liduttroy2023). ### Key Parameters ```{r tuning, eval=FALSE} result <- geomc.vs( X = X, y = y, initial = c(1, 2), # Initial model (the set of active variables) n.iter = 100, # Number of MCMC iterations burnin = 2, # Burn-in period (default: 1) eps = 0.5, # Perturbation parameter (default: 0.5) symm = FALSE, # Use the symmetric random walk base (default: TRUE) move.prob = c(.3, .3, .4), # Addition/deletion/swap move probabilities (used when symm = FALSE) lam0 = 0, # Precision parameter for the intercept prior (default: 0) a0 = 0, # Inverse-Gamma shape parameter for the sigma^2 prior (default: 0) b0 = 0, # Inverse-Gamma scale parameter for the sigma^2 prior (default: 0) lam = n / p^2, # Slab precision parameter (default: n/p^2) w = sqrt(n) / p, # Prior inclusion probability (default: sqrt(n)/p) model.summary = TRUE, # Compute model summaries (default: FALSE) model.threshold = 0.5, # Threshold for the median probability model and WAM (default: 0.5) show.progress = TRUE # Show progress during sampling ) ``` ## Summary 1. `geomc.vs()` performs Bayesian variable selection using geometric MCMC 2. Returns marginal inclusion probabilities (MIPs) for each variable 3. Provides multiple model summaries: median model, WAM, posterior means 4. Scales to high-dimensional problems ($p \gg n$) 5.
Quantifies uncertainty about variable selection ## Appendix: The Model @roy2024 considers the following linear regression for variable selection: $$y_i | \mathbf{X}, \beta_0,\mathbf{\beta},\mathbf{\gamma},\sigma^2,w,\lambda \stackrel{ind}{\sim} N(\beta_0 + \sum_{j=1}^p \gamma_j \beta_j x_{ij}, \sigma^2), i=1,\dots, n,$$ or, in vector notation, $$\mathbf{y} | \mathbf{X}, \beta_0,\mathbf{\beta},\mathbf{\gamma},\sigma^2,w,\lambda \sim N(\beta_0\mathbf{1} + \mathbf{X}_{\mathbf{\gamma}}\mathbf{\beta}_{\mathbf{\gamma}},\sigma^2 I_n),$$ where $\mathbf{\gamma} = (\gamma_1, \ldots, \gamma_p) \in \{0, 1\}^p$ are binary indicators, $\mathbf{\beta} = (\beta_1, \ldots, \beta_p)$ is the vector of regression coefficients, $\mathbf{X}_{\mathbf{\gamma}}$ is the $n \times |\mathbf{\gamma}|$ submatrix of $\mathbf{X}$ consisting of those columns of $\mathbf{X}$ for which $\gamma_j=1$, and similarly, $\mathbf{\beta}_{\mathbf{\gamma}}$ is the $|\mathbf{\gamma}|$ subvector of $\mathbf{\beta}$ corresponding to $\mathbf{\gamma}$, with $|\mathbf{\gamma}| = \sum_{j=1}^p \gamma_j$ being the model size. Next, the hierarchical Gaussian linear model in @roy2024 places priors on the regression coefficients as well as on the model space as follows: ### Prior Specification - **Coefficient prior**: $\beta_i|\beta_0,\mathbf{\gamma},\sigma^2,w,\lambda \stackrel{ind}{\sim} N(0, \gamma_i\sigma^2/\lambda),~i=1,\ldots,p,$ - **Intercept prior**: $\beta_0|\mathbf{\gamma},\sigma^2,w,\lambda \sim N(0, \sigma^2/\lambda_0)$ - **Error variance**: $\sigma^2|\mathbf{\gamma},w,\lambda \sim \text{Inverse-Gamma}(a_0, b_0)$ - **Model prior**: $\gamma_i|w,\lambda \stackrel{iid}{\sim} \text{Bernoulli}(w), i=1,\dots,p$ The density $\pi(\sigma^2)$ of $\sigma^2 \sim$ Inv-Gamma $(a_0, b_0)$ has the form $$\pi(\sigma^2) \propto (\sigma^2)^{-a_0-1} \exp(-b_0/\sigma^2).$$ `geomc.vs` also allows the non-informative prior $$\pi(\beta_{0}, \sigma^2|\mathbf{\gamma}, w) \propto 1 / \sigma^{2},$$ which is obtained by setting $\lambda_0=a_0=b_0=0$. ### Posterior Computation The posterior is: $$P(\mathbf{\gamma} | \mathbf{y}) \propto P(\mathbf{y} | \mathbf{\gamma}) P(\mathbf{\gamma}).$$ `geomc.vs` explores this posterior using the geometric MCMC proposed in @roy2024. `geomc.vs` returns the Markov chain samples, the empirical MH acceptance rate, and the log-posterior values (up to an additive constant) at the observed MCMC samples. If `model.summary` is set to TRUE, `geomc.vs` also returns other model summaries. In particular, it returns the marginal inclusion probabilities (`mip`) computed as a Monte Carlo average as well as the weighted marginal inclusion probabilities (`wmip`) computed with weights $$w_i = P(\gamma^{(i)}|\mathbf{y})/\sum_{k=1}^K P(\gamma^{(k)}|\mathbf{y}), i=1,2,...,K,$$ where $\gamma^{(k)}, k=1,2,...,K$ are the distinct models sampled. Thus, if $N_k$ is the number of times the $k$th distinct model $\gamma^{(k)}$ is repeated in the MCMC samples, the `mip` for the $j$th variable is $$\texttt{mip}_j =\sum_{k=1}^{K} N_k I(\gamma^{(k)}_j = 1)/\texttt{n.iter}$$ and `wmip` for the $j$th variable is $$\texttt{wmip}_j = \sum_{k=1}^K w_k I(\gamma^{(k)}_j = 1).$$ The `median.model` is the model containing variables $j$ with $$\texttt{mip}_j >\texttt{model.threshold}$$ and the `wam` is the model containing variables $j$ with $$\texttt{wmip}_j > \texttt{model.threshold}.$$ Note that $E(\beta|\gamma, \mathbf{y})$, the conditional posterior mean of $\mathbf{\beta}$ given a model $\gamma$, is available in closed form (see @liduttroy2023 for details).
`geomc.vs` returns two estimates (`beta.mean`, `beta.wam`) of the posterior mean of $\mathbf{\beta}$, computed as $$ \texttt{beta.mean} = \sum_{k=1}^{K} N_k E(\beta|\gamma^{(k)},\mathbf{y})/\texttt{n.iter}$$ and $$\texttt{beta.wam} = \sum_{k=1}^K w_k E(\beta|\gamma^{(k)},\mathbf{y}),$$ respectively. ## References