---
title: "Benchmarking bigPLScox"
shorttitle: "Benchmarking bigPLScox"
author:
  - name: "Frédéric Bertrand"
    affiliation:
      - Cedric, Cnam, Paris
    email: frederic.bertrand@lecnam.net
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Benchmarking bigPLScox}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "figures/benchmarking-",
  fig.width = 7,
  fig.height = 5,
  dpi = 150,
  message = FALSE,
  warning = FALSE
)
```

# Motivation

High-dimensional survival datasets can be computationally demanding. **bigPLScox** implements algorithms that scale to large numbers of predictors and observations via component-based models, sparse penalties, and stochastic gradient descent routines. This vignette demonstrates how to benchmark the package against baseline approaches using the [`bench`](https://bench.r-lib.org) package. We focus on simulated data to illustrate reproducible comparisons between the classical `coxgpls()` solver, its big-memory counterparts, and the `survival::coxph()` implementation.

# Setup

The examples below require recent versions of **bench** and **bigmemory**, together with **survival** and **plsRcox**.

```{r packages, include=FALSE}
needed <- c("plsRcox", "boot", "survival", "glmnet", "bench", "bigmemory", "bigPLScox")
has <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
if (!all(has)) {
  missing <- paste(needed[!has], collapse = ", ")
  knitr::opts_chunk$set(eval = FALSE)
  message("Note: skipping code execution because these packages are missing: ", missing)
}
```

```{r packages_lib}
library(bigPLScox)
library(survival)
library(bench)
library(bigmemory)
library(plsRcox)
```

The helper `dataCox()` simulates survival outcomes with right censoring. We work with a moderately sized problem here, but larger values of `n` and `p` can be used to stress-test performance.

```{r simulate-data}
set.seed(2024)
sim_design <- dataCox(
  n = 2000,
  lambda = 2,
  rho = 1.5,
  x = matrix(rnorm(2000 * 50), ncol = 50),
  beta = c(1, 3, rep(0, 48)),
  cens.rate = 5
)

cox_data <- list(
  x = as.matrix(sim_design[, -(1:3)]),
  time = sim_design$time,
  status = sim_design$status
)

# Store the design matrix as a bigmemory::big.matrix for the big-memory solvers.
X_big <- bigmemory::as.big.matrix(cox_data$x)
```

# Running the benchmark

We compare the classical Cox proportional hazards model with `coxgpls()` and the two big-memory solvers, `big_pls_cox()` and `big_pls_cox_gd()`. The `bench::mark()` helper executes each estimator multiple times and records timing statistics alongside memory usage information.

```{r run-benchmarks}
bench_res <- bench::mark(
  coxgpls = coxgpls(
    cox_data$x,
    cox_data$time,
    cox_data$status,
    ncomp = 5,
    ind.block.x = c(3, 10)
  ),
  big_pls = big_pls_cox(X_big, cox_data$time, cox_data$status, ncomp = 5),
  big_pls_gd = big_pls_cox_gd(X_big, cox_data$time, cox_data$status,
                              ncomp = 5, max_iter = 100),
  survival = coxph(Surv(cox_data$time, cox_data$status) ~ cox_data$x,
                   ties = "breslow"),
  iterations = 100,
  check = FALSE
)
bench_res

bench_summary <- bench_res[, c("expression", "median", "itr/sec")]
bench_summary
```

The resulting tibble reports elapsed times, memory allocations, and garbage collection statistics for each estimator. The `itr/sec` column is often the most useful indicator when comparing multiple implementations. The `bench_summary` object summarises the median runtime and the number of iterations per second.
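To rank the estimators at a glance, the bench columns can be reduced to plain numeric values. The chunk below is a minimal sketch rather than part of the benchmarking workflow proper: it assumes that `bench_res` keeps the four expressions in the order given above and that `as.numeric()` on the `median` column returns seconds, which holds for current releases of **bench**.

```{r relative-timings, eval = FALSE}
# Minimal sketch: express each method's median runtime relative to the fastest one.
# The method labels repeat the names used in the bench::mark() call above.
timings <- data.frame(
  method   = c("coxgpls", "big_pls", "big_pls_gd", "survival"),
  median_s = as.numeric(bench_res$median)  # median runtimes in seconds
)
timings$relative <- timings$median_s / min(timings$median_s)
timings[order(timings$relative), ]
```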
# Visualising the results

`bench` provides ggplot-based helpers to visualise the distributions of elapsed times and memory usage.

```{r bench-plot}
plot(bench_res, type = "jitter")
```

Additional geometries, such as ridge plots, are available via `autoplot(bench_res, type = "ridge")`.

# Accuracy on a shared test set

We use a common train/test split across the three estimators and compare predictive performance using the survival concordance index. The test set is identical for each method to ensure a fair comparison.

```{r accuracy-shared}
set.seed(4242)
split_id <- sample(c("train", "test"), size = nrow(cox_data$x),
                   replace = TRUE, prob = c(0.7, 0.3))
train_idx <- split_id == "train"
test_idx <- !train_idx

train_x <- cox_data$x[train_idx, , drop = FALSE]
test_x <- cox_data$x[test_idx, , drop = FALSE]
train_time <- cox_data$time[train_idx]
test_time <- cox_data$time[test_idx]
train_status <- cox_data$status[train_idx]
test_status <- cox_data$status[test_idx]

# Fit each estimator on the training data only.
big_fit <- big_pls_cox(bigmemory::as.big.matrix(train_x), train_time, train_status, ncomp = 5)
gd_fit <- big_pls_cox_gd(bigmemory::as.big.matrix(train_x), train_time, train_status,
                         ncomp = 5, max_iter = 120)
pls_fit <- plsRcox::plsRcox(train_x, train_time, train_status, nt = 5, verbose = FALSE)

# Linear predictors on the common test set.
lp_big <- predict(big_fit, newdata = test_x, type = "link")
lp_gd <- predict(gd_fit, newdata = test_x, type = "link")
lp_pls <- predict(pls_fit, newdata = test_x)

surv_test <- survival::Surv(test_time, test_status)
acc_results <- data.frame(
  method = c("big_pls_cox", "big_pls_cox_gd", "plsRcox"),
  c_index = c(
    survival::concordance(surv_test ~ lp_big)$concordance,
    survival::concordance(surv_test ~ lp_gd)$concordance,
    survival::concordance(surv_test ~ lp_pls)$concordance
  )
)
acc_results
```

All methods operate on the same train/test split, and the concordance index allows a direct comparison with the reference implementation from **plsRcox**.

# Exporting benchmark results

Use `write.csv()` to store the benchmarking table as part of a reproducible pipeline. For larger studies, consider varying the number of latent components, the sparsity constraints, or the dataset dimensions.

```{r export-benchmark, eval = FALSE}
if (!dir.exists("inst/benchmarks/results")) {
  dir.create("inst/benchmarks/results", recursive = TRUE)
}
write.csv(
  bench_res[, 1:9],
  file = "inst/benchmarks/results/benchmarking-demo.csv",
  row.names = FALSE
)
```

# Reusing the benchmarking scripts

The package also ships with standalone scripts under `inst/benchmarks/` that mirror this vignette while exposing additional configuration points. Run them from the repository root as:

```bash
Rscript inst/benchmarks/cox-benchmark.R
Rscript inst/benchmarks/benchmark_bigPLScox.R
Rscript inst/benchmarks/cox_pls_benchmark.R
```

Each script accepts environment variables to adjust the problem size and stores its results under `inst/benchmarks/results/` with time-stamped filenames.
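If several benchmark runs have been exported over time, the stored tables can be gathered back into a single data frame for side-by-side inspection. The chunk below is a minimal sketch that assumes every CSV file under `inst/benchmarks/results/` was written with `write.csv()` as above and therefore shares the same columns.

```{r collect-results, eval = FALSE}
# Minimal sketch: collect previously exported benchmark tables into one data frame.
result_files <- list.files(
  "inst/benchmarks/results",
  pattern = "\\.csv$",
  full.names = TRUE
)
all_results <- do.call(rbind, lapply(result_files, read.csv))
head(all_results)
```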