---
title: "Introduction to GSABenchmark"
author: "Andrei-Florian Stoica"
package: GSABenchmark
date: March 4, 2025
output: BiocStyle::html_document
vignette: >
    %\VignetteIndexEntry{GSABenchmark}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

GSABenchmark is a package designed for benchmarking scRNA-seq gene 
set analysis methods. It provides both traditional and novel benchmark 
metrics as well as visualization tools. Currently, GSABenchmark supports 
17 gene set analysis methods. 

# Installation
To install GSABenchmark, run the following commands in an R session:

```{r setup, eval=FALSE}
if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")

BiocManager::install("GSABenchmark")
```

# Prerequisites
In addition to GSABenchmark, you need to install 
[scRNAseq](https://bioconductor.org/packages/release/data/experiment/html/scRNAseq.html) 
and 
[scater](https://bioconductor.org/packages/release/bioc/html/scater.html) 
for this tutorial.

# Loading data

This tutorial uses an scRNA-seq human pancreas dataset with provided cell type 
annotations. 

After loading the required packages, download the dataset using the 
`BaronPancreasData` function from `scRNAseq`. The dataset will be stored as 
a `SingleCellExperiment` object. 

```{r message=FALSE, warning=FALSE, results=FALSE}
library(GSABenchmark)
library(scater)
library(scRNAseq)

scObj <- BaronPancreasData('human')

```

GSABenchmark requires normalized and log-transformed data and PCA coordinates.
We will use the functions `logNormCounts` from `scuttle` (loaded with `scater`)
and runPCA from `scater` for these steps:

```{r}
scObj <- logNormCounts(scObj)
scObj <- runPCA(scObj)
```

# Creating gene sets

First, we will take a look at the cell types found in this dataset:

```{r}
table(colData(scObj)$label)
```

We will create gene sets for a cell type found in the data, using
cell type markers from [PanglaoDB](https://www.panglaodb.se/markers.html):

```{r}
alphaMarkers <- c('GCG', 'TTR', 'PCSK2', 'FXYD5', 'LDB2', 'MAFB',
                  'CHGA', 'SCGB2A1', 'GLS', 'FAP', 'DPP4', 'GPR119',
                  'PAX6', 'NEUROD1', 'LOXL4', 'PLCE1', 'GC', 'KLHL41',
                  'FEV', 'PTGER3', 'RFX6', 'SMARCA1', 'PGR', 'IRX1',
                  'UCP2', 'RGS4', 'KCNK16', 'GLP1R', 'ARX', 'POU3F4',
                  'RESP18', 'PYY', 'SLC38A5', 'TM4SF4', 'CRYBA2', 'SH3GL2',
                  'PCSK1', 'PRRG2', 'IRX2', 'ALDH1A1','PEMT', 'SMIM24',
                  'F10', 'SCGN', 'SLC30A8')


geneSets <- setNames(list(alphaMarkers), 'alpha')
```

**Note**: The gene set names cannot contain spaces or underscores, and must all
be found in the column of assessed class identities. For this dataset, 
this column is named `label`:

```{r}
setdiff(names(geneSets), colData(scObj)[['label']])
```

# Running the methods

The `supportedMethods` function lists all the methods currently available in 
`GSABenchmark`:

```{r}
supportedMethods()
```

For the purposes of this tutorial, we will select three of these methods:

```{r}
gsaMethods <- c('CSOA', 'PLAGE', 'Zscore')
```

We now run the three methods for each gene set:

```{r}
scObj <- runGSAMethods(scObj, 'label', geneSets, gsaMethods)
```
# Running the benchmark

The benchmark is run in a single line of code. In this vignette, we will only 
perform the correctness benchmark. To achieve this, we will skip the speed and 
memory benchmarks by setting `runEFBenchmark` to `FALSE`:


```{r}
smr <- runBenchmark(scObj, 'label', geneSets, gsaMethods, runEFBenchmark=FALSE)
```

**Note**: This tutorial illustrates using `GSABenchmark` with 
Bioconductor's native `SingleCellExperiment` class. However, `GSABenchmark` 
is also fully compatible with `Seurat`, as `runGSAMethods`, `runBenchmark`, 
`runMethodShuffle` and `runBenchmarkShuffle` can all take a `Seurat` object as
their first argument.

# Exploring the results

`runBenchmark` returns a list of lists of data frames. We can further explore
the structure of this object:

```{r}
names(smr)
names(smr$boundary)
names(smr$MCC)
names(smr$global)
names(smr$predictions)
```

We can access the main summaries of the results as follows:

```{r}
smr$boundary$metricSummary
smr$MCC
smr$global$metricSummary
```

To visualize the results, we will call the `allBenchmarkPlots` function. This
function plots each summary result data frame:

```{r}
plots <- allBenchmarkPlots(smr)
length(plots)
```

Below, we display one of the resulting plots:

```{r, out.height='80%', out.width='80%', fig.height=5, fig.width=5}
plots[[3]]
```

The output of `runBenchmark` can also be used to create other plots. 

We will visualize the aggregate ranks of the methods using the 
`aggregateRankPlot` function:

```{r, out.height='80%', out.width='80%', fig.height=5, fig.width=5}
aggregateRankPlot(smr)
```

Next, we will use the `ratioRank` function to visualize the top results among 
all method-metric-gene set combinations as ranked by the ratio of the 
maximum score and the mean:

```{r, out.height='100%', out.width='100%', fig.height=5, fig.width=5}
ratioPlot(smr, labelSize=2.5)
```

To illustrate the similarity of the results obtained by different methods, 
we will create MDS plots for each gene set, as well as an aggregate MDS plot. 
We showcase the aggregate MDS plot below:

```{r, out.height='100%', out.width='100%', fig.height=5, fig.width=5}
plots <- mdsPlots(scObj, smr)
plots[[length(geneSets) + 1]]
```

Another way to look at the similarity between the results of different methods
is through correlation plots. Below, we display the aggregate correlation plot:

```{r, out.height='80%', out.width='80%', fig.height=5, fig.width=5}
plots <- corrPlots(scObj, smr)
plots[[length(geneSets) + 1]]
```

We can also display the Jaccard tile plots of the binary predictions made by
each method. Here, we show the aggregate Jaccard tile plot:

```{r, out.height='80%', out.width='80%', fig.height=5, fig.width=5}
plots <- predJaccardPlots(smr$predictions)
plots[[length(geneSets) + 1]]
```

# Benchmarking a single method at different choices of gene loss and noise

GSABenchmark also offers the option of benchmarking a single method at 
different choices of:

- Gene loss (the fraction of genes removed from each gene
signature)
- Noise (the fraction of retained genes in the final adjusted
gene signatures after random genes have been added).

Each condition determined by a gene loss-noise combination can be run in
replicates (that is, different gene signatures will be constructed with the
same gene loss and noise parameters for each replicate). The `seeds` argument
controls the random seeds used in the generation of gene sets. Its length
determines the number of replicates. To save execution time, we will use only
one replicate:

```{r}
scObj <- runMethodShuffle(scObj, 'label', geneSets, 'CSOA',
                          loss=1/3,
                          noise=0.5,
                          seeds=20,
                          averageReplicates=FALSE)

```

Subsequently, the same benchmark as before can be run with 
`runBenchmarkShuffle`, a thin wrapper around `runBenchmark`:

```{r}
smr <- runBenchmarkShuffle(scObj, 'label', geneSets, 'CSOA',
                           runEFBenchmark=FALSE)
```

The output of `runBenchmarkShuffle` can be inspected as before, e.g.:

```{r}
smr$boundary$metricSummary
```

# Session information {-}
```{r}
sessionInfo()
```