| Type: | Package |
| Title: | Cross-Validation for Linear and Ridge Regression Models |
| Version: | 2.0.0 |
| Date: | 2026-02-01 |
| Description: | Implements cross-validation methods for linear and ridge regression models. The package provides grid-based selection of the ridge penalty parameter using Singular Value Decomposition (SVD) and supports K-fold cross-validation, Leave-One-Out Cross-Validation (LOOCV), and Generalized Cross-Validation (GCV). Computations are implemented in C++ via 'RcppArmadillo' with optional parallelization using 'RcppParallel'. The methods are suitable for high-dimensional settings where the number of predictors exceeds the number of observations. |
| License: | MIT + file LICENSE |
| Imports: | stats, Rcpp (≥ 1.0.13), RcppParallel (≥ 5.1.8) |
| Suggests: | boot, RhpcBLASctl, testthat (≥ 3.0.0) |
| LinkingTo: | Rcpp, RcppArmadillo, RcppParallel |
| SystemRequirements: | GNU make, C++17 |
| URL: | https://github.com/phipnye/CV-LM |
| Config/testthat/edition: | 3 |
| NeedsCompilation: | yes |
| Packaged: | 2026-02-02 05:39:57 UTC; philip |
| Author: | Philip Nye [aut, cre] |
| Maintainer: | Philip Nye <phipnye@proton.me> |
| Repository: | CRAN |
| Date/Publication: | 2026-02-03 09:20:23 UTC |
Cross-validation for linear and ridge regression models
Description
cvLM provides a high-performance framework for cross-validating linear models using
RcppArmadillo and RcppParallel. It implements numerically stable algorithms to compute K-fold,
Leave-One-Out (LOO), and Generalized Cross-Validation (GCV) metrics for both linear and ridge regression.
Usage
cvLM(object, ...)
## S3 method for class 'formula'
cvLM(object, data, subset, na.action, K.vals = 10L,
lambda = 0, generalized = FALSE, seed = 1L, n.threads = 1L,
tol = 1e-7, center = TRUE, ...)
## S3 method for class 'lm'
cvLM(object, data, K.vals = 10L, lambda = 0,
generalized = FALSE, seed = 1L, n.threads = 1L,
tol = 1e-7, center = TRUE, ...)
## S3 method for class 'glm'
cvLM(object, data, K.vals = 10L, lambda = 0,
generalized = FALSE, seed = 1L, n.threads = 1L,
tol = 1e-7, center = TRUE, ...)
Arguments
object |
an object of class formula, lm, or glm specifying the model to be cross-validated. |
data |
a data frame, list or environment containing the variables in the model. |
subset |
a specification of the rows/observations to be used. Applies to the formula method only. |
na.action |
a function indicating how missing values (NAs) should be handled. |
K.vals |
an integer vector specifying the numbers of folds to evaluate. For Leave-One-Out CV, set K.vals to the number of observations (e.g., nrow(data)). |
lambda |
a non-negative numeric scalar for the ridge regularization parameter. Defaults to 0 (OLS). |
generalized |
if TRUE, Generalized Cross-Validation (GCV) is computed instead of ordinary cross-validation. |
seed |
an integer used to initialize the random number generator for reproducible fold splits. |
n.threads |
an integer specifying the number of threads for parallel computation. Set to -1 to use the maximum number of available threads. |
tol |
a numeric scalar specifying the tolerance for rank estimation in the underlying orthogonal and singular value decompositions. See 'Details' below. |
center |
if TRUE (the default), the predictors and response are centered by their means before fitting, which excludes the intercept from the ridge penalty. See 'Details'. |
... |
additional arguments passed to or from other methods (currently disregarded). |
Details
Numerical Stability and Underdetermined Systems
The newest release of cvLM prioritizes numerical robustness. For Ordinary Least Squares
(\lambda = 0), the package employs Complete Orthogonal Decomposition (COD). When the design matrix is
column rank-deficient (predictors are perfectly collinear, or the number of parameters p exceeds the
number of observations n), COD identifies the unique minimum L_2-norm solution, providing a stable
basis for cross-validation in high-dimensional or ill-conditioned settings.
The numerical rank is determined by the tol argument; a pivot is considered nonzero if its absolute
value is strictly greater than the threshold relative to the largest pivot:
|r_{ii}| > \text{tol} \times |r_{\text{max}}|
This approach differs from that of R's lm, which handles column rank-deficiency by dropping redundant
columns (their coefficients are returned as NA), and it may therefore produce different coefficient
estimates and predictions when the design matrix is not of full column rank.
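As a quick illustration (a minimal sketch with simulated data, not taken from the package's own examples), cross-validation remains well defined even when p exceeds n:
# Sketch: CV on an underdetermined design (p > n); data simulated for illustration
set.seed(1)
n <- 20; p <- 40
X <- matrix(rnorm(n * p), n, p)
dat <- data.frame(y = rnorm(n), X)
cvLM(y ~ ., data = dat, K.vals = 5L)  # COD yields the minimum-norm fit within each fold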
For ridge regression (\lambda > 0), the package utilizes Singular Value Decomposition (SVD). By
decomposing the design matrix into X = UDV^T, the ridge solution can be computed with high precision,
avoiding the potential information loss associated with forming the cross-product matrix X^TX required
for Cholesky-based methods used in previous versions. Similar to the COD, a singular value is considered
nonzero only if it satisfies:
\sigma_i > \text{tol} \times \sigma_{\text{max}}
While this numerical stability comes at some speed cost, it keeps the cross-validation routines
consistent with the SVD-based shrinkage-parameter search performed by grid.search.
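A minimal base-R sketch of the SVD route (for illustration only; the package carries out this computation in C++ via RcppArmadillo, and ridge_svd below is a hypothetical helper, not a package function):
# Base-R sketch of the SVD-based ridge solution (illustration only)
ridge_svd <- function(X, y, lambda, tol = 1e-7) {
  s <- svd(X)                           # X = U D V^T
  keep <- s$d > tol * max(s$d)          # singular values deemed nonzero
  d <- s$d[keep]
  shrink <- d / (d^2 + lambda)          # diagonal shrinkage factors
  s$v[, keep, drop = FALSE] %*%
    (shrink * crossprod(s$u[, keep, drop = FALSE], y))
}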
Regularization and Data Centering
Following the methodology in The Elements of Statistical Learning (Hastie et al.), cvLM now
defaults to centering the data (center = TRUE). Centering the predictors X and the response
y by their respective means is mathematically equivalent to excluding the intercept term from the
regularization penalty.
In the uncentered case (center = FALSE), the objective function penalizes all coefficients,
including the intercept \beta_0:
\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|^2 + \lambda \|\beta\|^2
In the centered case (center = TRUE), the optimization problem is effectively reformulated as:
\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}
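The equivalence can be checked directly in base R (a hedged sketch, not package code): center X and y, solve the penalized system without an intercept, and recover the intercept as the response mean.
# Sketch: with centered data the intercept escapes the penalty (illustration only)
Xc <- scale(as.matrix(mtcars[, -1]), center = TRUE, scale = FALSE)
yc <- mtcars$mpg - mean(mtcars$mpg)
lambda <- 0.5
beta <- solve(crossprod(Xc) + lambda * diag(ncol(Xc)), crossprod(Xc, yc))
beta0 <- mean(mtcars$mpg)  # intercept on the centered scale, left unpenalized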
Efficiency of LOOCV and GCV
For Leave-One-Out Cross-Validation (LOOCV) and Generalized Cross-Validation (GCV), cvLM avoids the
computational cost of n separate refits by using closed-form expressions based on the hat matrix:
LOOCV rescales each residual by its leverage, while GCV replaces the individual leverages with their
average, tr(H)/n.
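These are the standard hat-matrix identities; a base-R sketch for the OLS case (illustration only, not the package's C++ implementation):
# Sketch of the closed-form LOOCV/GCV shortcuts from a single fit (base R)
X <- model.matrix(mpg ~ ., data = mtcars)
y <- mtcars$mpg
H <- X %*% solve(crossprod(X), t(X))        # hat matrix
r <- y - H %*% y                            # residuals, one fit only
loocv <- mean((r / (1 - diag(H)))^2)        # leverage-rescaled residuals
gcv   <- mean((r / (1 - mean(diag(H))))^2)  # average leverage = tr(H)/n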
Parallel Computation and Oversubscription
cvLM uses RcppParallel to distribute K-fold cross-validation tasks across multiple threads.
However, many modern R installations link to multi-threaded BLAS libraries. Running multi-threaded BLAS
operations within multi-threaded folds can lead to oversubscription, where the total number of active
threads exceeds the physical CPU capacity, causing significant performance penalties.
To mitigate this, if the RhpcBLASctl package is installed, cvLM will automatically attempt to
set the BLAS thread count to 1 during parallel execution. It is highly recommended to install
RhpcBLASctl when using n.threads > 1.
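For example (assuming RhpcBLASctl is available), the BLAS thread count can also be capped manually before a parallel run:
# Optional manual cap on BLAS threads to avoid oversubscription
if (requireNamespace("RhpcBLASctl", quietly = TRUE)) {
  RhpcBLASctl::blas_set_num_threads(1)
}
cvLM(mpg ~ ., data = mtcars, K.vals = 10L, n.threads = 4L)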
Value
A data.frame with the following columns:
- K
the number of folds requested by the user. Note that if the internal algorithm determines a more appropriate number of folds to use (based on the number of rows), a warning is issued and a modified fold count is used for the computation; however, this column retains the input value.
- CV
the cross-validation metric calculated during the procedure.
- seed
the integer seed used for fold randomization.
Author(s)
Philip Nye, phipnye@proton.me
References
Eddelbuettel, D., & Sanderson, C. (2014). RcppArmadillo: Accelerating R with high-performance C++ linear algebra. Computational Statistics and Data Analysis, 71, 1054-1063. doi:10.1016/j.csda.2013.02.005.
Aggarwal, C. C. (2020). Linear Algebra and Optimization for Machine Learning: A Textbook. Springer Cham. doi:10.1007/978-3-030-40344-7.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer New York, NY. doi:10.1007/978-0-387-84858-7.
Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press, Baltimore, MD. doi:10.56021/9781421407944.
See Also
grid.search for efficient selection of the ridge penalty parameter; reg.table for formatted regression tables.
Examples
## Not run:
data(mtcars)
n <- nrow(mtcars)
# 10-fold CV for a linear regression model
cvLM(mpg ~ ., data = mtcars, K.vals = 10)
# Comparing 5-fold, 10-fold, and Leave-One-Out CV configurations using 2 threads
cvLM(mpg ~ ., data = mtcars, K.vals = c(5, 10, nrow(mtcars)), n.threads = 2)
# Ridge regression with analytic GCV (using lm interface)
fitted.lm <- lm(mpg ~ ., data = mtcars)
cvLM(fitted.lm, data = mtcars, lambda = 0.5, generalized = TRUE)
## End(Not run)
Efficient Grid Search for Optimal Ridge Regularization
Description
Performs an optimized grid search to find the regularization parameter \lambda that minimizes the
cross-validation metric for ridge regression.
Usage
grid.search(formula, data, subset, na.action, K = 10L,
generalized = FALSE, seed = 1L, n.threads = 1L,
tol = 1e-7, max.lambda = 10000, precision = 0.1,
center = TRUE)
Arguments
formula |
a formula specifying the model to be fitted. |
data |
a data frame, list or environment containing the variables in the model. |
subset |
a specification of the rows/observations to be used. |
na.action |
a function indicating how missing values (NAs) should be handled. |
K |
an integer specifying the number of folds. For Leave-One-Out CV, set K to the number of observations (e.g., nrow(data)). |
generalized |
if TRUE, Generalized Cross-Validation (GCV) is computed instead of ordinary cross-validation. |
seed |
an integer used to initialize the random number generator for reproducible K-fold splits. |
n.threads |
an integer specifying the number of threads. For K-fold CV, parallelization occurs across
folds; for GCV/LOOCV, it occurs across the \lambda grid. |
tol |
a numeric scalar specifying the tolerance for rank estimation in the SVD. See cvLM. |
max.lambda |
a numeric scalar specifying the maximum \lambda value to include in the search grid. |
precision |
a numeric scalar specifying the step size (increment) between successive \lambda values in the grid. |
center |
if TRUE (the default), the predictors and response are centered by their means before fitting. See cvLM. |
Details
grid.search is designed for high-performance parameter tuning. Unlike naive implementations that
refit the model for every grid point, this function utilizes Singular Value Decomposition (SVD) of the
design matrix to evaluate the entire grid analytically.
For Generalized Cross-Validation (GCV) and Leave-One-Out (LOOCV), the SVD is computed once. Each
\lambda in the grid is then evaluated by updating the diagonal "shrinkage" matrix, reducing the cost
of each grid point evaluation from O(np^2) to O(\min(n,p)).
The search begins at \lambda = 0 and increments by precision until max.lambda is reached
(inclusive). The function returns the \lambda that achieves the minimum cross-validation metric across
the grid.
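The idea can be sketched in base R for the GCV case (illustration only; gcv_at below is a hypothetical helper, and the package's C++ implementation differs in detail):
# One SVD, many lambdas: analytic GCV over the grid (illustrative sketch)
X <- scale(model.matrix(mpg ~ . - 1, data = mtcars), scale = FALSE)
y <- mtcars$mpg - mean(mtcars$mpg)
n <- nrow(X)
s <- svd(X); d2 <- s$d^2
uty <- crossprod(s$u, y)              # computed once, reused for every lambda
gcv_at <- function(lambda) {
  shrink <- d2 / (d2 + lambda)        # O(min(n, p)) diagonal update
  fitted <- s$u %*% (shrink * uty)
  mean(((y - fitted) / (1 - sum(shrink) / n))^2)
}
lambdas <- seq(0, 100, by = 0.1)
lambdas[which.min(vapply(lambdas, gcv_at, numeric(1)))]  # best lambda on the grid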
Value
A list with the following components:
- CV
the minimum cross-validation metric.
- lambda
the value of \lambda associated with the minimum metric.
See Also
cvLM for the underlying cross-validation routines.
Examples
## Not run:
data(mtcars)
grid.search(
formula = mpg ~ .,
data = mtcars,
K = 5L, # Use 5-fold CV
max.lambda = 100, # Search values between 0 and 100
precision = 0.01 # Increment in steps of 0.01
)
grid.search(
formula = mpg ~ .,
data = mtcars,
K = nrow(mtcars), # Use LOOCV
max.lambda = 10000, # Search values between 0 and 10000
precision = 0.5 # Increment in steps of 0.5
)
## End(Not run)
Create Regression Tables in LaTeX or HTML
Description
reg.table generates formatted regression tables in either LaTeX or HTML format for one or more linear regression models.
The tables include coefficient estimates, standard errors, and various model statistics.
Usage
reg.table(models, type = c("latex", "html"), split.size = 4L, n.digits = 3L,
big.mark = "", caption = "Regression Results", spacing = 5L,
stats = c("AIC", "BIC", "CV", "r.squared", "adj.r.squared",
"fstatistic", "nobs"),
...)
Arguments
models |
A list of linear regression models (objects of class lm). |
type |
A string specifying the output format. Must be one of "latex" or "html". |
split.size |
An integer specifying the number of models per table. Tables with more models than this number will be split. |
n.digits |
An integer specifying the number of decimal places for numerical values. |
big.mark |
A string to use for thousands separators in numeric formatting. |
caption |
A string for the table caption. |
spacing |
A numeric specifying the spacing between columns in the output table. |
stats |
A character vector specifying which model statistics to include in the table. Options include "AIC", "BIC", "CV", "r.squared", "adj.r.squared", "fstatistic", and "nobs". |
... |
Additional arguments passed to cvLM when "CV" is included in stats (e.g., data, K.vals, seed). |
Details
The function generates tables summarizing linear regression models, either in LaTeX or HTML format. The tables include coefficients, standard errors, and optional statistics such as AIC, BIC, cross-validation scores, and more. The output is formatted based on the specified type and options. The names of the models list will be used as column headers if provided.
Value
A character vector of length equal to the number of tables with each element containing the LaTeX or HTML code for a regression table.
Examples
## Not run:
data(mtcars)
# Create a list of 6 different linear regression models with names
# Names get used for column names of the table
models <- list(
`Model 1` = lm(mpg ~ wt + hp, data = mtcars),
`Model 2` = lm(mpg ~ wt + qsec, data = mtcars),
`Model 3` = lm(mpg ~ wt + hp + qsec, data = mtcars),
`Model 4` = lm(mpg ~ wt + cyl, data = mtcars),
`Model 5` = lm(mpg ~ wt + hp + cyl, data = mtcars),
`Model 6` = lm(mpg ~ hp + qsec + drat, data = mtcars)
)
# Example usage for LaTeX
reg.table(
models,
"latex",
split.size = 3L,
n.digits = 3L,
big.mark = ",",
caption = "My regression results",
spacing = 7.5,
stats = c("AIC", "BIC", "CV"),
data = mtcars,
K.vals = 5L,
seed = 123L
)
# Example usage for HTML
reg.table(
models,
"html",
split.size = 3L,
n.digits = 3L,
big.mark = ",",
caption = "My regression results",
spacing = 7.5,
stats = c("AIC", "BIC", "CV"),
data = mtcars,
K.vals = 5L,
seed = 123L
)
## End(Not run)