--- title: "Introduction to GSABenchmark" author: "Andrei-Florian Stoica" package: GSABenchmark date: March 4, 2025 output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{GSABenchmark} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction GSABenchmark is a package designed for benchmarking scRNA-seq gene set analysis methods. It provides both traditional and novel benchmark metrics as well as visualization tools. Currently, GSABenchmark supports 17 gene set analysis methods. # Installation To install GSABenchmark, run the following commands in an R session: ```{r setup, eval=FALSE} if (!require("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("GSABenchmark") ``` # Prerequisites In addition to GSABenchmark, you need to install [scRNAseq](https://bioconductor.org/packages/release/data/experiment/html/scRNAseq.html) and [scater](https://bioconductor.org/packages/release/bioc/html/scater.html) for this tutorial. # Loading data This tutorial uses an scRNA-seq human pancreas dataset with provided cell type annotations. After loading the required packages, download the dataset using the `BaronPancreasData` function from `scRNAseq`. The dataset will be stored as a `SingleCellExperiment` object. ```{r message=FALSE, warning=FALSE, results=FALSE} library(GSABenchmark) library(scater) library(scRNAseq) scObj <- BaronPancreasData('human') ``` GSABenchmark requires normalized and log-transformed data and PCA coordinates. We will use the functions `logNormCounts` from `scuttle` (loaded with `scater`) and runPCA from `scater` for these steps: ```{r} scObj <- logNormCounts(scObj) scObj <- runPCA(scObj) ``` # Creating gene sets First, we will take a look at the cell types found in this dataset: ```{r} table(colData(scObj)$label) ``` We will create gene sets for a cell type found in the data, using cell type markers from [PanglaoDB](https://www.panglaodb.se/markers.html): ```{r} alphaMarkers <- c('GCG', 'TTR', 'PCSK2', 'FXYD5', 'LDB2', 'MAFB', 'CHGA', 'SCGB2A1', 'GLS', 'FAP', 'DPP4', 'GPR119', 'PAX6', 'NEUROD1', 'LOXL4', 'PLCE1', 'GC', 'KLHL41', 'FEV', 'PTGER3', 'RFX6', 'SMARCA1', 'PGR', 'IRX1', 'UCP2', 'RGS4', 'KCNK16', 'GLP1R', 'ARX', 'POU3F4', 'RESP18', 'PYY', 'SLC38A5', 'TM4SF4', 'CRYBA2', 'SH3GL2', 'PCSK1', 'PRRG2', 'IRX2', 'ALDH1A1','PEMT', 'SMIM24', 'F10', 'SCGN', 'SLC30A8') geneSets <- setNames(list(alphaMarkers), 'alpha') ``` **Note**: The gene set names cannot contain spaces or underscores, and must all be found in the column of assessed class identities. For this dataset, this column is named `label`: ```{r} setdiff(names(geneSets), colData(scObj)[['label']]) ``` # Running the methods The `supportedMethods` function lists all the methods currently available in `GSABenchmark`: ```{r} supportedMethods() ``` For the purposes of this tutorial, we will select three of these methods: ```{r} gsaMethods <- c('CSOA', 'PLAGE', 'Zscore') ``` We now run the three methods for each gene set: ```{r} scObj <- runGSAMethods(scObj, 'label', geneSets, gsaMethods) ``` # Running the benchmark The benchmark is run in a single line of code. In this vignette, we will only perform the correctness benchmark. To achieve this, we will skip the speed and memory benchmarks by setting `runEFBenchmark` to `FALSE`: ```{r} smr <- runBenchmark(scObj, 'label', geneSets, gsaMethods, runEFBenchmark=FALSE) ``` **Note**: This tutorial illustrates using `GSABenchmark` with Bioconductor's native `SingleCellExperiment` class. However, `GSABenchmark` is also fully compatible with `Seurat`, as `runGSAMethods`, `runBenchmark`, `runMethodShuffle` and `runBenchmarkShuffle` can all take a `Seurat` object as their first argument. # Exploring the results `runBenchmark` returns a list of lists of data frames. We can further explore the structure of this object: ```{r} names(smr) names(smr$boundary) names(smr$MCC) names(smr$global) names(smr$predictions) ``` We can access the main summaries of the results as follows: ```{r} smr$boundary$metricSummary smr$MCC smr$global$metricSummary ``` To visualize the results, we will call the `allBenchmarkPlots` function. This function plots each summary result data frame: ```{r} plots <- allBenchmarkPlots(smr) length(plots) ``` Below, we display one of the resulting plots: ```{r, out.height='80%', out.width='80%', fig.height=5, fig.width=5} plots[[3]] ``` The output of `runBenchmark` can also be used to create other plots. We will visualize the aggregate ranks of the methods using the `aggregateRankPlot` function: ```{r, out.height='80%', out.width='80%', fig.height=5, fig.width=5} aggregateRankPlot(smr) ``` Next, we will use the `ratioRank` function to visualize the top results among all method-metric-gene set combinations as ranked by the ratio of the maximum score and the mean: ```{r, out.height='100%', out.width='100%', fig.height=5, fig.width=5} ratioPlot(smr, labelSize=2.5) ``` To illustrate the similarity of the results obtained by different methods, we will create MDS plots for each gene set, as well as an aggregate MDS plot. We showcase the aggregate MDS plot below: ```{r, out.height='100%', out.width='100%', fig.height=5, fig.width=5} plots <- mdsPlots(scObj, smr) plots[[length(geneSets) + 1]] ``` Another way to look at the similarity between the results of different methods is through correlation plots. Below, we display the aggregate correlation plot: ```{r, out.height='80%', out.width='80%', fig.height=5, fig.width=5} plots <- corrPlots(scObj, smr) plots[[length(geneSets) + 1]] ``` We can also display the Jaccard tile plots of the binary predictions made by each method. Here, we show the aggregate Jaccard tile plot: ```{r, out.height='80%', out.width='80%', fig.height=5, fig.width=5} plots <- predJaccardPlots(smr$predictions) plots[[length(geneSets) + 1]] ``` # Benchmarking a single method at different choices of gene loss and noise GSABenchmark also offers the option of benchmarking a single method at different choices of: - Gene loss (the fraction of genes removed from each gene signature) - Noise (the fraction of retained genes in the final adjusted gene signatures after random genes have been added). Each condition determined by a gene loss-noise combination can be run in replicates (that is, different gene signatures will be constructed with the same gene loss and noise parameters for each replicate). The `seeds` argument controls the random seeds used in the generation of gene sets. Its length determines the number of replicates. To save execution time, we will use only one replicate: ```{r} scObj <- runMethodShuffle(scObj, 'label', geneSets, 'CSOA', loss=1/3, noise=0.5, seeds=20, averageReplicates=FALSE) ``` Subsequently, the same benchmark as before can be run with `runBenchmarkShuffle`, a thin wrapper around `runBenchmark`: ```{r} smr <- runBenchmarkShuffle(scObj, 'label', geneSets, 'CSOA', runEFBenchmark=FALSE) ``` The output of `runBenchmarkShuffle` can be inspected as before, e.g.: ```{r} smr$boundary$metricSummary ``` # Session information {-} ```{r} sessionInfo() ```