---
title: "Big-memory workflows with bigPLScox"
shorttitle: "Big-memory workflows"
author:
- name: "Frédéric Bertrand"
  affiliation:
  - Cedric, Cnam, Paris
  email: frederic.bertrand@lecnam.net
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Big-memory workflows with bigPLScox}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "figures/bigmemory-",
  fig.width = 7,
  fig.height = 4.5,
  dpi = 150,
  message = FALSE,
  warning = FALSE
)
```

# Motivation

A central feature of **bigPLScox** is the ability to operate on file-backed
`bigmemory::big.matrix` objects. This vignette demonstrates how to prepare such
datasets, fit models with `big_pls_cox()` and `big_pls_cox_gd()`, and integrate
them with the cross-validation helpers. The examples complement the
introductory vignette "Getting started with bigPLScox".

# Preparing a big.matrix

We simulate a moderately large design matrix and persist it to disk via
`bigmemory::filebacked.big.matrix()`. File-backed storage allows models to
train on datasets that exceed the available RAM.

```{r simulate-bigmatrix}
library(bigPLScox)
library(bigmemory)

set.seed(2024)
n_obs <- 5000
n_pred <- 100
X_dense <- matrix(rnorm(n_obs * n_pred), nrow = n_obs)
time <- rexp(n_obs, rate = 0.2)
status <- rbinom(n_obs, 1, 0.7)

big_dir <- tempfile("bigPLScox-")
dir.create(big_dir)
X_big <- filebacked.big.matrix(
  nrow = n_obs,
  ncol = n_pred,
  backingpath = big_dir,
  backingfile = "X.bin",
  descriptorfile = "X.desc",
  init = X_dense
)
```

The resulting `big.matrix` can be reopened in future sessions via its
descriptor file. All big-memory modelling functions accept either an in-memory
matrix or a `big.matrix` reference.

# Fitting big-memory models

`big_pls_cox()` runs the classical PLS-Cox algorithm while streaming data from
disk.
```{r big-pls-cox}
fit_big <- big_pls_cox(
  X = X_big,
  time = time,
  status = status,
  ncomp = 5
)
head(fit_big$scores)
str(fit_big)
```

The gradient-descent variant `big_pls_cox_gd()` uses stochastic optimisation
and is well suited to very large datasets.

```{r big-pls-cox-gd}
fit_big_gd <- big_pls_cox_gd(
  X = X_big,
  time = time,
  status = status,
  ncomp = 5,
  max_iter = 100,
  tol = 1e-4
)
head(fit_big_gd$scores)
str(fit_big_gd)
```

Both functions return objects that expose the latent scores and loading
vectors, so downstream visualisations and diagnostics work exactly as for
their in-memory counterparts.

# Cross-validation on big matrices

Cross-validation for big-memory models is supported through the list
interface, which enables streaming each fold directly from disk.

```{r big-cv, eval = FALSE}
set.seed(2024)
data_big <- list(x = X_big, time = time, status = status)
cv_big <- cv.coxgpls(
  data_big,
  nt = 5,
  ncores = 1,
  ind.block.x = c(10, 40)
)
cv_big$opt_nt
```

For large experiments, consider combining `foreach::foreach()` with
`doParallel::registerDoParallel()` to parallelise the folds.

# Timing snapshot

The native C++ solvers substantially reduce wall-clock time compared with
fitting through the R interface alone. The `bench` package provides convenient
instrumentation; the chunk below only runs when it is available.

```{r big-timing}
if (requireNamespace("bench", quietly = TRUE)) {
  bench::mark(
    streaming = big_pls_cox(X_big, time, status, ncomp = 5, keepX = 0),
    gd = big_pls_cox_gd(X_big, time, status, ncomp = 5, max_iter = 150),
    iterations = 5,
    check = FALSE
  )
}
```

# Deviance residuals with big matrices

Once a model has been fitted, we can evaluate deviance residuals using the new
C++ backend. Supplying the linear predictor avoids recomputing it in R and
works with any matrix backend.
```{r big-deviance}
eta_big <- predict(fit_big, type = "link")
dr_cpp <- computeDR(time, status, engine = "cpp", eta = eta_big)
max(abs(dr_cpp - computeDR(time, status)))
```

# Cleaning up

Temporary backing files can be removed after the analysis. In production
pipelines you will typically keep the descriptor file alongside the binary
data.

```{r cleanup}
rm(X_big)
file.remove(file.path(big_dir, "X.bin"))
file.remove(file.path(big_dir, "X.desc"))
unlink(big_dir, recursive = TRUE)
```

# Additional resources

* `help(big_pls_cox)` and `help(big_pls_cox_gd)` document all tuning
  parameters for the big-memory solvers.
* The benchmarking vignette demonstrates how to measure performance
  improvements obtained with file-backed matrices.
* Consider persisting fitted objects with `saveRDS()` to avoid recomputing
  large models when iterating on analyses.
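The persistence tip above can be sketched as follows. The chunk assumes you
have kept the backing and descriptor files (i.e. skipped the cleanup step) and
reuses the `fit_big`, `big_dir`, and descriptor names from earlier in this
vignette; the `model_path` name is introduced here purely for illustration.

```{r persist, eval = FALSE}
# Cache the fitted model next to the backing files so later sessions can
# reload it instead of refitting.
model_path <- file.path(big_dir, "fit_big.rds")
saveRDS(fit_big, model_path)

# In a later session: re-attach the matrix from its descriptor file and
# restore the fitted model.
X_big <- bigmemory::attach.big.matrix("X.desc", path = big_dir)
fit_big <- readRDS(model_path)
```

Because `attach.big.matrix()` only maps the existing backing file into memory,
reloading is cheap even when the design matrix itself is very large.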