---
title: "Using `parSim` on a supercomputer"
date: 'October 26th, 2024'
author: "Mihai Constantin and Sacha Epskamp"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using `parSim` on a supercomputer}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Introduction

In this short vignette we illustrate the main steps for using the `parSim`
package to execute a simulation on a supercomputer. We use as example the job
scheduler employed by the Dutch [Lisa Computing
Cluster](https://www.surf.nl/en/lisa-computing-cluster-extra-computing-power-for-research),
however the steps are similar for other supercomputers.

## Writing the Simulation Script

The first step is to write a simulation script compatible with the `parSim`
package. Suppose we have have the following simulation conditions:

```r
# Sample size.
sample_size = c(50, 100, 250)

# Beta values.
beta = c(0, 0.5, 1)

# Sigma values.
sigma = c(0.25, 0.5, 1)
```

Then, our simulation script could look like the following, where we are
interested in computing the bias:

```r
# The function to evaluate for each simulation condition.
bias <- function(x, y) {
    # Perform some computation.
    result <- abs(x - y)

    # Return the result.
    return(result)
}

# The simulation expression to evaluate for each simulation condition.
{
    # Generate the data.
    x <- rnorm(sample_size)
    y <- beta * x + rnorm(sample_size, sigma)

    # Fit the model.
    fit <- lm(y ~ x)

    # Compute the relevant quantities.
    beta_estimate <- coef(fit)[2]
    r_squared <- summary(fit)$r.squared
    bias <- bias(beta, beta_estimate)

    # Return in a compatible format.
    list(
        beta_estimate = beta_estimate,
        r_squared = r_squared,
        bias = bias
    )
}
```

If we put these pieces of code together, then our `parSim` simulation function
looks like the following:

```r
# Run the simulation.
results <- parSim(
    # The simulation conditions.
    sample_size = c(50, 100, 250),
    beta = c(0, 0.5, 1),
    sigma = c(0.25, 0.5, 1),

    # The expression to evaluate for each simulation condition.
    expression = {
        # Generate the data.
        x <- rnorm(sample_size)
        y <- beta * x + rnorm(sample_size, sigma)

        # Fit the model.
        fit <- lm(y ~ x)

        # Compute the relevant quantities.
        beta_estimate <- coef(fit)[2]
        r_squared <- summary(fit)$r.squared
        bias <- bias(beta, beta_estimate)

        # Return in a compatible format.
        list(
            beta_estimate = beta_estimate,
            r_squared = r_squared,
            bias = bias
        )
    },

    # The variables to export.
    exports = c("bias")
)
```

There are a few adjustments to make to the `parSim` function call above before
we are ready to place it in an `R` script file and deploy it on a supercomputer.
At a bare minimum, we need to:

- Decide on the number of `replications` to run for each simulation condition,
  e.g., `replications = 1000`.
- Export any external objects (i.e., via the `exports` argument) or `R` packages
  (i.e., via the `packages` argument) that are needed for the simulation script
  to run. In the example above, we need to export the `bias` function, so we
  would set `exports = c("bias")`, and no additional packages are required.
- Decide on the number of `cores` to use for the simulation. The Lisa Compute
  Cluster has `16` cores per node, so we could set `cores = 15`. Note that the
  parallel backend [`parabar`](https://parabar.mihaiconstantin.com) used by
  `parSim` will ensure that at lease one core is available for the main process.
- Finally, we need to save the results of each `parSim` function execution to a
  file. Considering the intricacies of supercomputers, it is best to specify the
  file name and path ourselves. To do so, we can pass the file name as a string
  to the `save` argument and ensure that the file name is unique for each
  `parSim` function execution (e.g., by appending the job or task ID).

*Note.* Check out the documentation for `parSim` for more information on the
function arguments (e.g., on how to exclude certain simulation conditions via
the `exclude` argument).

With the considerations above in mind, we can now write our `simulation.R` file
as follows:

```r
#!/usr/bin/env Rscript

# Clear the environment to ensure a fresh start.
rm(list = ls(all.names = TRUE))

# Get the temporary directory from the environment.
TMP_DIR <- Sys.getenv("TMP_DIR")

# Create the output path.
out_dir <- paste0(TMP_DIR, "/output")

# Load the current job ID in the batch array.
job_id <- Sys.getenv("SLURM_ARRAY_TASK_ID")

# Prepare a unique file name for the results.
file_name <- paste0(out_dir, "/simulation_results_id_", job_id)

# Load libraries.
library(parSim)
library(flexiblas)

# For `R` complied with the `foss` (i.e., `flexiblas`) toolchain, we set the
# number of threads to one.
flexiblas::flexiblas_set_num_threads(1)

# Run the simulation.
results <- parSim(
    # The simulation conditions.
    sample_size = c(50, 100, 250),
    beta = c(0, 0.5, 1),
    sigma = c(0.25, 0.5, 1),

    # The expression to evaluate for each simulation condition.
    expression = {
        # Generate the data.
        x <- rnorm(sample_size)
        y <- beta * x + rnorm(sample_size, sigma)

        # Fit the model.
        fit <- lm(y ~ x)

        # Compute the relevant quantities.
        beta_estimate <- coef(fit)[2]
        r_squared <- summary(fit)$r.squared
        bias <- bias(beta, beta_estimate)

        # Return in a compatible format.
        list(
            beta_estimate = beta_estimate,
            r_squared = r_squared,
            bias = bias
        )
    },

    # The variables to export.
    exports = c("bias"),

    # The number of replications.
    replications = 1000,

    # The variables to export.
    exports = c("bias"),

    # Save the results.
    save = file_name,

    # Execute the simulation on a single core.
    cores = 16
)
```

It is always a good idea to test the `simulation.R` script locally before
deploying it on a supercomputer (i.e., debugging could be much easier on your
local machine).

## Preparing the Job Script

The next step consist in preparing the job script that will be used to submit
the `simulation.R` script to the supercomputer. The job script is a shell script
that specifies the resources needed for the job. For more information on the
specifics of the job script, please refer to the documentation of the
supercomputer you are using.

Let's assume we are using the Lisa Compute Cluster and our `simulation.R` script
is located in the `~/simulation-example` directory on the `login` node. Then,
our job script could look like the following:

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -p normal
#SBATCH -t 01:00:00

# Load modules.
module load 2021
module load R/4.1.0-foss-2021a

# Set the user.
export USERNAME=mihaiconstantin

# Define the simulation path on the login node.
export SIM_DIR=$HOME/simulation-example

# Export packages on the login node.
export R_LIBS=$HOME/R/x86_64-pc-linux-gnu-library/4.1

# Define temporary directory.
export TMP_DIR=$TMPDIR/$USERNAME

# Make output directory.
mkdir -p $TMP_DIR/output

# Run the script (i.e., using 15 cores).
# Note. The script sets the number of threads to 1.
Rscript $SIM_DIR/simulation.R

# Copy output from scratch.
cp -r $TMP_DIR/output/. $SIM_DIR/output

# Exit message.
echo "Job $SLURM_ARRAY_JOB_ID ($SLURM_ARRAY_TASK_ID / $SLURM_ARRAY_TASK_COUNT) honored."
```

Please make sure to adjust the job script above to:

- Load the correct version of `R`.
- Set the correct paths for the `SIM_DIR` and `TMP_DIR` variables.
- Replace `mihaiconstantin` with your username for the `USERNAME` variable.
- Adjust the `#SBATCH` directives to match the resources needed for your job.
  This part is crucial if you want to avoid having your job killed by the
  scheduler. For instance, the `#SBATCH -t 01:00:00` directive refers to the
  *wall time* and specifies that the job will run for one hour. This value
  should be adjusted according to the expected execution duration of the
  `parSim` function specified in the `simulation.R` script.

## Submitting the Job

At this point, we are ready to submit the job to the supercomputer. To do so, we
can dispatch an array of jobs using the `sbatch` command. For instance, to
submit the job script above `100` times to the Lisa Compute Cluster, we can use
the following command:

```bash
sbatch --array=1-100 job.sh
```

This will result in replicating the `simulation.R` script `100` times, each time
with a different task ID. The results of each replication will be saved in a
unique file in the `$SIM_DIR/output` directory. In total, each simulation
condition will be replicated `1000 * 100 = 100000` times.

Assuming the username `mihaiconstantin`, we can monitor the job queue using the
following command:

```bash
squeue -u mihaiconstantin
```

Finally, we can cancel a job (i.e., or task in an job array) using the `scancel`
command as follows:

```bash
scancel job_id
```

If `job_id` is an array of tasks (i.e., `job_id_1`, `job_id_2`, etc.), then
`scancel job_id` will cancel all tasks.

## Folder Structure

Following the example above, our `~/simulation-example` directory on the Lisa
Compute Cluster should have the following structure:

```txt
~/simulation-example/
├── job.sh
├── simulation.R
└── output/
```

## Downloading the Results

Once the job is completed, we can download the results from the supercomputer to
our local machine using the `scp` command. For instance, to download the results
from the Lisa Compute Cluster, we can use the following command:

```bash
# Navigate on your local machine to a directory of your choice.
cd ~/Downloads/simulation-results/

# Make a directory where to download the results from the supercomputer.
mkdir output

# Download the results.
scp mihaiconstantin@lisa.surfsara.nl:~/simulation-example/output/simulation_results* ./output/
```

*Note.* Make sure to replace `mihaiconstantin` with your username and adjust the
paths accordingly.

## Processing the Results

Finally, we can process the results on our local machine. Suppose we have a
script called `analysis.R` that contains the following code:

```r
# Load relevant libraries.
library(ggplot2)
library(tidyr)

# Set the working directory.
setwd("~/Downloads/simulation-results/")

# All simulation result files.
files <- list.files("output/", pattern = "simulation_results_", full.names = TRUE)

# Read all files and bind them together.
results <- do.call(
    rbind, lapply(files, read.table, header = TRUE)
)

# Pre-process the results in long format for plotting.
results_long <- tidyr::gather(results, metric, value, beta_estimate:bias)

# Make factors with nice labels for plotting.
results_long$sigma_factor <- factor(
    x = results_long$sigma,
    levels = c(0.25, 0.5, 1),
    labels = c("Sigma: 0.025", "Sigma: 0.5", "Sigma: 1")
)

# Plot.
ggplot2::ggplot(
    results_long, ggplot2::aes(
        x = factor(sample_size), y = value, fill = factor(beta))
    ) +
    ggplot2::facet_grid(
        metric ~ sigma_factor, scales = "free"
    ) +
    ggplot2::geom_boxplot() +
    ggplot2::theme_bw() +
    ggplot2::xlab("Sample size") +
    ggplot2::ylab("") +
    ggplot2::scale_fill_discrete("Beta")
```

We hope this helps you get started with running simulations on a supercomputer
using the `parSim` package. Good luck with your simulations!