---
title: "Benchmarking bigPLScox"
shorttitle: "Benchmarking bigPLScox"
author:
  - name: "Frédéric Bertrand"
    affiliation:
      - Cedric, Cnam, Paris
    email: frederic.bertrand@lecnam.net
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Benchmarking bigPLScox}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "figures/benchmarking-",
  fig.width = 7,
  fig.height = 5,
  dpi = 150,
  message = FALSE,
  warning = FALSE
)
```

# Motivation

High-dimensional survival datasets can be computationally demanding. **bigPLScox** implements algorithms that scale to large numbers of predictors and observations via component-based models, sparse penalties, and stochastic gradient descent routines. This vignette demonstrates how to benchmark the package against baseline approaches using the [`bench`](https://bench.r-lib.org) package. We focus on simulated data to illustrate reproducible comparisons between the classical `coxgpls()` solver, its big-memory counterparts, and the `survival::coxph()` implementation.

# Setup

The examples below require recent versions of **bench** and **bigmemory**, together with **survival** and **plsRcox**.

```{r packages, include=FALSE}
needed <- c("plsRcox", "boot", "survival", "glmnet", "bench", "bigmemory", "bigPLScox")
has <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
if (!all(has)) {
  missing <- paste(needed[!has], collapse = ", ")
  knitr::opts_chunk$set(eval = FALSE)
  message("Note: skipping code execution because these packages are missing: ", missing)
}
```

```{r packages_lib}
library(bigPLScox)
library(survival)
library(bench)
library(bigmemory)
library(plsRcox)
```

The helper `dataCox()` simulates survival outcomes with right censoring. We work with a moderately sized problem here, but larger values of `n` and `p` can be used to stress-test performance.

```{r simulate-data}
set.seed(2024)
sim_design <- dataCox(
  n = 2000,
  lambda = 2,
  rho = 1.5,
  x = matrix(rnorm(2000 * 50), ncol = 50),
  beta = c(1, 3, rep(0, 48)),
  cens.rate = 5
)

cox_data <- list(
  x = as.matrix(sim_design[, -(1:3)]),
  time = sim_design$time,
  status = sim_design$status
)

# Store the design matrix as a bigmemory::big.matrix for the big-memory solvers.
X_big <- bigmemory::as.big.matrix(cox_data$x)
```

# Running the benchmark

We compare the classical Cox proportional hazards model with `coxgpls()` and the two big-memory solvers, `big_pls_cox()` and `big_pls_cox_gd()`. The `bench::mark()` helper executes each estimator multiple times and records timing statistics alongside memory usage information.

```{r run-benchmarks}
bench_res <- bench::mark(
  coxgpls = coxgpls(
    cox_data$x,
    cox_data$time,
    cox_data$status,
    ncomp = 5,
    ind.block.x = c(3, 10)
  ),
  big_pls = big_pls_cox(X_big, cox_data$time, cox_data$status, ncomp = 5),
  big_pls_gd = big_pls_cox_gd(X_big, cox_data$time, cox_data$status,
                              ncomp = 5, max_iter = 100),
  survival = coxph(Surv(cox_data$time, cox_data$status) ~ cox_data$x,
                   ties = "breslow"),
  iterations = 100,
  check = FALSE
)
bench_res

bench_summary <- bench_res[, c("expression", "median", "itr/sec")]
bench_summary
```

The resulting tibble reports elapsed times, memory allocations, and garbage collection statistics for each estimator. The `itr/sec` column is often the most useful indicator when comparing multiple implementations. The `bench_summary` object summarises the median runtime and the number of iterations per second.
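To rank the estimators at a glance, the bench columns can be reduced to plain numeric values. The chunk below is a minimal sketch rather than part of the benchmarking workflow proper: it assumes that `bench_res` keeps the four expressions in the order given above and that `as.numeric()` on the `median` column returns seconds, which holds for current releases of **bench**.

```{r relative-timings, eval = FALSE}
# Minimal sketch: express each method's median runtime relative to the fastest one.
# The method labels repeat the names used in the bench::mark() call above.
timings <- data.frame(
  method   = c("coxgpls", "big_pls", "big_pls_gd", "survival"),
  median_s = as.numeric(bench_res$median)  # median runtimes in seconds
)
timings$relative <- timings$median_s / min(timings$median_s)
timings[order(timings$relative), ]
```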
# Visualising the results

`bench` provides ggplot-based helpers to visualise the distributions of elapsed times and memory usage.

```{r bench-plot}
plot(bench_res, type = "jitter")
```

Additional geometries, such as ridge plots, are available via `autoplot(bench_res, type = "ridge")`.

# Accuracy on a shared test set

We use a common train/test split across the three estimators and compare predictive performance using the survival concordance index. The test set is identical for each method to ensure a fair comparison.

```{r accuracy-shared}
set.seed(4242)
split_id <- sample(c("train", "test"), size = nrow(cox_data$x),
                   replace = TRUE, prob = c(0.7, 0.3))
train_idx <- split_id == "train"
test_idx <- !train_idx

train_x <- cox_data$x[train_idx, , drop = FALSE]
test_x <- cox_data$x[test_idx, , drop = FALSE]
train_time <- cox_data$time[train_idx]
test_time <- cox_data$time[test_idx]
train_status <- cox_data$status[train_idx]
test_status <- cox_data$status[test_idx]

# Fit each estimator on the training data only.
big_fit <- big_pls_cox(bigmemory::as.big.matrix(train_x), train_time, train_status, ncomp = 5)
gd_fit <- big_pls_cox_gd(bigmemory::as.big.matrix(train_x), train_time, train_status,
                         ncomp = 5, max_iter = 120)
pls_fit <- plsRcox::plsRcox(train_x, train_time, train_status, nt = 5, verbose = FALSE)

# Linear predictors on the common test set.
lp_big <- predict(big_fit, newdata = test_x, type = "link")
lp_gd <- predict(gd_fit, newdata = test_x, type = "link")
lp_pls <- predict(pls_fit, newdata = test_x)

surv_test <- survival::Surv(test_time, test_status)
acc_results <- data.frame(
  method = c("big_pls_cox", "big_pls_cox_gd", "plsRcox"),
  c_index = c(
    survival::concordance(surv_test ~ lp_big)$concordance,
    survival::concordance(surv_test ~ lp_gd)$concordance,
    survival::concordance(surv_test ~ lp_pls)$concordance
  )
)
acc_results
```

All methods operate on the same train/test split, and the concordance index allows a direct comparison with the reference implementation from **plsRcox**.

# Exporting benchmark results

Use `write.csv()` to store the benchmarking table as part of a reproducible pipeline. For larger studies, consider varying the number of latent components, the sparsity constraints, or the dataset dimensions.

```{r export-benchmark, eval = FALSE}
if (!dir.exists("inst/benchmarks/results")) {
  dir.create("inst/benchmarks/results", recursive = TRUE)
}
write.csv(
  bench_res[, 1:9],
  file = "inst/benchmarks/results/benchmarking-demo.csv",
  row.names = FALSE
)
```

# Reusing the benchmarking scripts

The package also ships with standalone scripts under `inst/benchmarks/` that mirror this vignette while exposing additional configuration points. Run them from the repository root as:

```bash
Rscript inst/benchmarks/cox-benchmark.R
Rscript inst/benchmarks/benchmark_bigPLScox.R
Rscript inst/benchmarks/cox_pls_benchmark.R
```

Each script accepts environment variables to adjust the problem size and stores its results under `inst/benchmarks/results/` with time-stamped filenames.
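If several benchmark runs have been exported over time, the stored tables can be gathered back into a single data frame for side-by-side inspection. The chunk below is a minimal sketch that assumes every CSV file under `inst/benchmarks/results/` was written with `write.csv()` as above and therefore shares the same columns.

```{r collect-results, eval = FALSE}
# Minimal sketch: collect previously exported benchmark tables into one data frame.
result_files <- list.files(
  "inst/benchmarks/results",
  pattern = "\\.csv$",
  full.names = TRUE
)
all_results <- do.call(rbind, lapply(result_files, read.csv))
head(all_results)
```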