[["index.html", "Basics of Single-Cell Analysis with Bioconductor Welcome", " Basics of Single-Cell Analysis with Bioconductor Authors: Robert Amezquita [aut], Aaron Lun [aut, cre], Stephanie Hicks [aut], Raphael Gottardo [aut] Version: 1.2.0 Modified: 2021-05-16 Compiled: 2021-10-27 Environment: R version 4.1.1 (2021-08-10), Bioconductor 3.14 License: CC BY 4.0 Copyright: Bioconductor, 2021 Source: https://github.com/OSCA-source/OSCA.basic Welcome This site contains the basic analysis chapters for the “Orchestrating Single-Cell Analysis with Bioconductor” book. This describes the steps of a simple single-cell RNA-seq analysis, involving quality control, normalization, various forms of dimensionality reduction, clustering into subpopulations, detection of marker genes, and annotation of cell types. It is intended for users who already have some familiarity with R and want to get hands-on with some basic single-cell analyses. "],["quality-control.html", "Chapter 1 Quality Control 1.1 Motivation 1.2 Common choices of QC metrics 1.3 Identifying low-quality cells 1.4 Checking diagnostic plots 1.5 Removing low-quality cells Session Info", " Chapter 1 Quality Control 1.1 Motivation Low-quality libraries in scRNA-seq data can arise from a variety of sources such as cell damage during dissociation or failure in library preparation (e.g., inefficient reverse transcription or PCR amplification). These usually manifest as “cells” with low total counts, few expressed genes and high mitochondrial or spike-in proportions. These low-quality libraries are problematic as they can contribute to misleading results in downstream analyses: They form their own distinct cluster(s), complicating interpretation of the results. 
This is most obviously driven by increased mitochondrial proportions or enrichment for nuclear RNAs after cell damage. In the worst case, low-quality libraries generated from different cell types can cluster together based on similarities in the damage-induced expression profiles, creating artificial intermediate states or trajectories between otherwise distinct subpopulations. Additionally, very small libraries can form their own clusters due to shifts in the mean upon transformation (Lun 2018). They interfere with characterization of population heterogeneity during variance estimation or principal components analysis. The first few principal components will capture differences in quality rather than biology, reducing the effectiveness of dimensionality reduction. Similarly, genes with the largest variances will be driven by differences between low- and high-quality cells. The most obvious example involves low-quality libraries with very low counts where scaling normalization inflates the apparent variance of genes that happen to have a non-zero count in those libraries. They contain genes that appear to be strongly “upregulated” due to aggressive scaling to normalize for small library sizes. This is most problematic for contaminating transcripts (e.g., from the ambient solution) that are present in all libraries at low but constant levels. Increased scaling in low-quality libraries transforms small counts for these transcripts into large normalized expression values, resulting in apparent upregulation compared to other cells. This can be misleading as the affected genes are often biologically sensible but are actually expressed in another subpopulation. To avoid - or at least mitigate - these problems, we need to remove the problematic cells at the start of the analysis. This step is commonly referred to as quality control (QC) on the cells. 
(We will use “library” and “cell” rather interchangeably here, though the distinction will become important when dealing with droplet-based data.) We demonstrate using a small scRNA-seq dataset from Lun et al. (2017), which is provided with no prior QC so that we can apply our own procedures. View set-up code (Workflow Chapter 1) #--- loading ---# library(scRNAseq) sce.416b &lt;- LunSpikeInData(which=&quot;416b&quot;) sce.416b$block &lt;- factor(sce.416b$block) sce.416b ## class: SingleCellExperiment ## dim: 46604 192 ## metadata(0): ## assays(1): counts ## rownames(46604): ENSMUSG00000102693 ENSMUSG00000064842 ... ## ENSMUSG00000095742 CBFB-MYH11-mcherry ## rowData names(1): Length ## colnames(192): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ... ## SLX-11312.N712_S508.H5H5YBBXX.s_8.r_1 ## SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1 ## colData names(9): Source Name cell line ... spike-in addition block ## reducedDimNames(0): ## mainExpName: endogenous ## altExpNames(2): ERCC SIRV 1.2 Common choices of QC metrics We use several common QC metrics to identify low-quality cells based on their expression profiles. These metrics are described below in terms of reads for SMART-seq2 data, but the same definitions apply to UMI data generated by other technologies like MARS-seq and droplet-based protocols. The library size is defined as the total sum of counts across all relevant features for each cell. Here, we will consider the relevant features to be the endogenous genes. Cells with small library sizes are of low quality as the RNA has been lost at some point during library preparation, either due to cell lysis or inefficient cDNA capture and amplification. The number of expressed features in each cell is defined as the number of endogenous genes with non-zero counts for that cell. Any cell with very few expressed genes is likely to be of poor quality as the diverse transcript population has not been successfully captured. 
The proportion of reads mapped to spike-in transcripts is calculated relative to the total count across all features (including spike-ins) for each cell. As the same amount of spike-in RNA should have been added to each cell, any enrichment in spike-in counts is symptomatic of loss of endogenous RNA. Thus, high proportions are indicative of poor-quality cells where endogenous RNA has been lost due to, e.g., partial cell lysis or RNA degradation during dissociation. In the absence of spike-in transcripts, the proportion of reads mapped to genes in the mitochondrial genome can be used. High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), presumably because of loss of cytoplasmic RNA from perforated cells. The reasoning is that, in the presence of modest damage, the holes in the cell membrane permit efflux of individual transcript molecules but are too small to allow mitochondria to escape, leading to a relative enrichment of mitochondrial transcripts. For single-nuclei RNA-seq experiments, high proportions are also useful as they can mark cells where the cytoplasm has not been successfully stripped. For each cell, we calculate these QC metrics using the perCellQCMetrics() function from the scater package (McCarthy et al. 2017). The sum column contains the total count for each cell and the detected column contains the number of detected genes. The subsets_Mito_percent column contains the percentage of reads mapped to mitochondrial transcripts. Finally, the altexps_ERCC_percent column contains the percentage of reads mapped to ERCC transcripts. # Identifying the mitochondrial transcripts in our SingleCellExperiment. location &lt;- rowRanges(sce.416b) is.mito &lt;- any(seqnames(location)==&quot;MT&quot;) library(scuttle) df &lt;- perCellQCMetrics(sce.416b, subsets=list(Mito=is.mito)) summary(df$sum) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 27084 856350 1111252 1165865 1328301 4398883 summary(df$detected) ## Min. 1st Qu. 
Median Mean 3rd Qu. Max. ## 5609 7502 8341 8397 9208 11380 summary(df$subsets_Mito_percent) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 4.59 7.29 8.14 8.15 9.04 15.62 summary(df$altexps_ERCC_percent) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 2.24 4.29 6.03 6.41 8.13 19.43 Alternatively, users may prefer to use the addPerCellQC() function. This computes and appends the per-cell QC statistics to the colData of the SingleCellExperiment object, allowing us to retain all relevant information in a single object for later manipulation. sce.416b &lt;- addPerCellQC(sce.416b, subsets=list(Mito=is.mito)) colnames(colData(sce.416b)) ## [1] &quot;Source Name&quot; &quot;cell line&quot; ## [3] &quot;cell type&quot; &quot;single cell well quality&quot; ## [5] &quot;genotype&quot; &quot;phenotype&quot; ## [7] &quot;strain&quot; &quot;spike-in addition&quot; ## [9] &quot;block&quot; &quot;sum&quot; ## [11] &quot;detected&quot; &quot;subsets_Mito_sum&quot; ## [13] &quot;subsets_Mito_detected&quot; &quot;subsets_Mito_percent&quot; ## [15] &quot;altexps_ERCC_sum&quot; &quot;altexps_ERCC_detected&quot; ## [17] &quot;altexps_ERCC_percent&quot; &quot;altexps_SIRV_sum&quot; ## [19] &quot;altexps_SIRV_detected&quot; &quot;altexps_SIRV_percent&quot; ## [21] &quot;total&quot; A key assumption here is that the QC metrics are independent of the biological state of each cell. Poor values (e.g., low library sizes, high mitochondrial proportions) are presumed to be driven by technical factors rather than biological processes, meaning that the subsequent removal of cells will not misrepresent the biology in downstream analyses. Major violations of this assumption would potentially result in the loss of cell types that have, say, systematically low RNA content or high numbers of mitochondria. We can check for such violations using diagnostic plots described in Section 1.4 and Advanced Section 1.5. 
1.3 Identifying low-quality cells 1.3.1 With fixed thresholds The simplest approach to identifying low-quality cells involves applying fixed thresholds to the QC metrics. For example, we might consider cells to be low quality if they have library sizes below 100,000 reads; express fewer than 5,000 genes; have spike-in proportions above 10%; or have mitochondrial proportions above 10%. qc.lib &lt;- df$sum &lt; 1e5 qc.nexprs &lt;- df$detected &lt; 5e3 qc.spike &lt;- df$altexps_ERCC_percent &gt; 10 qc.mito &lt;- df$subsets_Mito_percent &gt; 10 discard &lt;- qc.lib | qc.nexprs | qc.spike | qc.mito # Summarize the number of cells removed for each reason. DataFrame(LibSize=sum(qc.lib), NExprs=sum(qc.nexprs), SpikeProp=sum(qc.spike), MitoProp=sum(qc.mito), Total=sum(discard)) ## DataFrame with 1 row and 5 columns ## LibSize NExprs SpikeProp MitoProp Total ## &lt;integer&gt; &lt;integer&gt; &lt;integer&gt; &lt;integer&gt; &lt;integer&gt; ## 1 3 0 19 14 33 While simple, this strategy requires considerable experience to determine appropriate thresholds for each experimental protocol and biological system. Thresholds for read count-based data are not applicable for UMI-based data, and vice versa. Differences in mitochondrial activity or total RNA content require constant adjustment of the mitochondrial and spike-in thresholds, respectively, for different biological systems. Indeed, even with the same protocol and system, the appropriate threshold can vary from run to run due to the vagaries of cDNA capture efficiency and sequencing depth per cell. 1.3.2 With adaptive thresholds Here, we assume that most of the dataset consists of high-quality cells. We then identify cells that are outliers for the various QC metrics, based on the median absolute deviation (MAD) from the median value of each metric across all cells. By default, we consider a value to be an outlier if it is more than 3 MADs from the median in the “problematic” direction. 
This is loosely motivated by the fact that such a filter will retain 99% of non-outlier values that follow a normal distribution. We demonstrate by using the perCellQCFilters() function on the QC metrics from the 416B dataset. reasons &lt;- perCellQCFilters(df, sub.fields=c(&quot;subsets_Mito_percent&quot;, &quot;altexps_ERCC_percent&quot;)) colSums(as.matrix(reasons)) ## low_lib_size low_n_features high_subsets_Mito_percent ## 4 0 2 ## high_altexps_ERCC_percent discard ## 1 6 This function will identify cells with log-transformed library sizes that are more than 3 MADs below the median. A log-transformation is used to improve resolution at small values when type=\"lower\" and to avoid negative thresholds that would be meaningless for a non-negative metric. Furthermore, it is not uncommon for the distribution of library sizes to exhibit a heavy right tail; the log-transformation avoids inflation of the MAD in a manner that might compromise outlier detection on the left tail. (More generally, it makes the distribution seem more normal to justify the 99% rationale mentioned above.) The function will also do the same for the log-transformed number of expressed genes. perCellQCFilters() will also identify outliers for the proportion-based metrics specified in the sub.fields= argument. These distributions frequently exhibit a heavy right tail, but unlike the two previous metrics, it is the right tail itself that contains the putative low-quality cells. Thus, we do not perform any transformation to shrink the tail - rather, our hope is that the cells in the tail are identified as large outliers. (While it is theoretically possible to obtain a meaningless threshold above 100%, this is rare enough to not be of practical concern.) A cell that is an outlier for any of these metrics is considered to be of low quality and discarded. This is captured in the discard column, which can be used for later filtering (Section 1.5). 
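The 3-MAD rule described above can be sketched in base R on a made-up vector of library sizes (the numbers are invented for illustration; in practice, isOutlier() or perCellQCFilters() applies this rule and records the thresholds for you):

```r
# Toy library sizes (assumed values, not the 416B data): one failed cell.
lib.sizes <- c(250, 47500, 48000, 49000, 50500, 51000, 52000, 53000)

# Work on the log scale, as perCellQCFilters() does for library sizes.
log.sizes <- log10(lib.sizes)

# Flag values more than 3 MADs below the median ("lower" direction only).
# mad() applies the usual 1.4826 scaling for consistency with the normal sd.
lower.limit <- median(log.sizes) - 3 * mad(log.sizes)
low.lib <- log.sizes < lower.limit

lib.sizes[low.lib]  # only the 250-read library is flagged
```

Note how the healthy libraries set both the location (median) and spread (MAD) of the threshold, which is why no user-supplied cut-off is needed.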
summary(reasons$discard) ## Mode FALSE TRUE ## logical 186 6 We can also extract the exact filter thresholds from the attributes of each of the logical vectors. This may be useful for checking whether the automatically selected thresholds are appropriate. attr(reasons$low_lib_size, &quot;thresholds&quot;) ## lower higher ## 434083 Inf attr(reasons$low_n_features, &quot;thresholds&quot;) ## lower higher ## 5231 Inf With this strategy, the thresholds adapt to both the location and spread of the distribution of values for a given metric. This allows the QC procedure to adjust to changes in sequencing depth, cDNA capture efficiency, mitochondrial content, etc. without requiring any user intervention or prior experience. However, the underlying assumption of a high-quality majority may not always be appropriate, which is discussed in more detail in Advanced Section 1.3. 1.3.3 Other approaches Another strategy is to identify outliers in high-dimensional space based on the QC metrics for each cell. We use methods from robustbase to quantify the “outlyingness” of each cell based on its QC metrics, and then use isOutlier() to identify low-quality cells that exhibit unusually high levels of outlyingness. stats &lt;- cbind(log10(df$sum), log10(df$detected), df$subsets_Mito_percent, df$altexps_ERCC_percent) library(robustbase) outlying &lt;- adjOutlyingness(stats, only.outlyingness = TRUE) multi.outlier &lt;- isOutlier(outlying, type = &quot;higher&quot;) summary(multi.outlier) ## Mode FALSE TRUE ## logical 180 12 This and related approaches like PCA-based outlier detection and support vector machines can provide more power to distinguish low-quality cells from high-quality counterparts (Ilicic et al. 2016) as they can exploit patterns across many QC metrics. However, this comes at some cost to interpretability, as the reason for removing a given cell may not always be obvious. 
For completeness, we note that outliers can also be identified from the gene expression profiles, rather than QC metrics. We consider this to be a risky strategy as it can remove high-quality cells in rare populations. 1.4 Checking diagnostic plots It is good practice to inspect the distributions of QC metrics (Figure 1.1) to identify possible problems. In the most ideal case, we would see normal distributions that would justify the 3 MAD threshold used in outlier detection. A large proportion of cells in another mode suggests that the QC metrics might be correlated with some biological state, potentially leading to the loss of distinct cell types during filtering; or that there were inconsistencies with library preparation for a subset of cells, a not-uncommon phenomenon in plate-based protocols. colData(sce.416b) &lt;- cbind(colData(sce.416b), df) sce.416b$block &lt;- factor(sce.416b$block) sce.416b$phenotype &lt;- ifelse(grepl(&quot;induced&quot;, sce.416b$phenotype), &quot;induced&quot;, &quot;wild type&quot;) sce.416b$discard &lt;- reasons$discard library(scater) gridExtra::grid.arrange( plotColData(sce.416b, x=&quot;block&quot;, y=&quot;sum&quot;, colour_by=&quot;discard&quot;, other_fields=&quot;phenotype&quot;) + facet_wrap(~phenotype) + scale_y_log10() + ggtitle(&quot;Total count&quot;), plotColData(sce.416b, x=&quot;block&quot;, y=&quot;detected&quot;, colour_by=&quot;discard&quot;, other_fields=&quot;phenotype&quot;) + facet_wrap(~phenotype) + scale_y_log10() + ggtitle(&quot;Detected features&quot;), plotColData(sce.416b, x=&quot;block&quot;, y=&quot;subsets_Mito_percent&quot;, colour_by=&quot;discard&quot;, other_fields=&quot;phenotype&quot;) + facet_wrap(~phenotype) + ggtitle(&quot;Mito percent&quot;), plotColData(sce.416b, x=&quot;block&quot;, y=&quot;altexps_ERCC_percent&quot;, colour_by=&quot;discard&quot;, other_fields=&quot;phenotype&quot;) + facet_wrap(~phenotype) + ggtitle(&quot;ERCC percent&quot;), ncol=1 ) Figure 1.1: Distribution of QC 
metrics for each batch and phenotype in the 416B dataset. Each point represents a cell and is colored according to whether it was discarded. Another useful diagnostic involves plotting the proportion of mitochondrial counts against some of the other QC metrics. The aim is to confirm that there are no cells with both large total counts and large mitochondrial counts, to ensure that we are not inadvertently removing high-quality cells that happen to be highly metabolically active (e.g., hepatocytes). We demonstrate using data from a larger experiment involving the mouse brain (Zeisel et al. 2015); in this case, we do not observe any points in the top-right corner in Figure 1.2 that might potentially correspond to metabolically active, undamaged cells. View set-up code (Workflow Chapter 2) #--- loading ---# library(scRNAseq) sce.zeisel &lt;- ZeiselBrainData() library(scater) sce.zeisel &lt;- aggregateAcrossFeatures(sce.zeisel, id=sub(&quot;_loc[0-9]+$&quot;, &quot;&quot;, rownames(sce.zeisel))) #--- gene-annotation ---# library(org.Mm.eg.db) rowData(sce.zeisel)$Ensembl &lt;- mapIds(org.Mm.eg.db, keys=rownames(sce.zeisel), keytype=&quot;SYMBOL&quot;, column=&quot;ENSEMBL&quot;) sce.zeisel &lt;- addPerCellQC(sce.zeisel, subsets=list(Mt=rowData(sce.zeisel)$featureType==&quot;mito&quot;)) qc &lt;- quickPerCellQC(colData(sce.zeisel), sub.fields=c(&quot;altexps_ERCC_percent&quot;, &quot;subsets_Mt_percent&quot;)) sce.zeisel$discard &lt;- qc$discard plotColData(sce.zeisel, x=&quot;sum&quot;, y=&quot;subsets_Mt_percent&quot;, colour_by=&quot;discard&quot;) Figure 1.2: Percentage of UMIs assigned to mitochondrial transcripts in the Zeisel brain dataset, plotted against the total number of UMIs. Each point represents a cell and is colored according to whether it was considered low-quality and discarded. Comparison of the ERCC and mitochondrial percentages can also be informative (Figure 1.3). 
Low-quality cells with small mitochondrial percentages, large spike-in percentages and small library sizes are likely to be stripped nuclei, i.e., they have been so extensively damaged that they have lost all cytoplasmic content. On the other hand, cells with high mitochondrial percentages and low ERCC percentages may represent undamaged cells that are metabolically active. This interpretation also applies for single-nuclei studies but with a switch of focus: the stripped nuclei become the libraries of interest while the undamaged cells are considered to be low quality. plotColData(sce.zeisel, x=&quot;altexps_ERCC_percent&quot;, y=&quot;subsets_Mt_percent&quot;, colour_by=&quot;discard&quot;) Figure 1.3: Percentage of UMIs assigned to mitochondrial transcripts in the Zeisel brain dataset, plotted against the percentage of UMIs assigned to spike-in transcripts. Each point represents a cell and is colored according to whether it was considered low-quality and discarded. We see that all of these metrics exhibit weak correlations with each other, presumably a manifestation of a common underlying effect of cell damage. The weakness of the correlations motivates the use of several metrics to capture different aspects of technical quality. Of course, the flipside is that these metrics may also represent different aspects of biology, increasing the risk of inadvertently discarding entire cell types. 1.5 Removing low-quality cells Once low-quality cells have been identified, we can choose to either remove them or mark them. Removal is the most straightforward option and is achieved by subsetting the SingleCellExperiment by column. In this case, we use the low-quality calls from Section 1.3.2 to generate a subsetted SingleCellExperiment that we would use for downstream analyses. # Keeping the columns we DON&#39;T want to discard. 
filtered &lt;- sce.416b[,!reasons$discard] The other option is to simply mark the low-quality cells as such and retain them in the downstream analysis. The aim here is to allow clusters of low-quality cells to form, and then to identify and ignore such clusters during interpretation of the results. This approach avoids discarding cell types that have poor values for the QC metrics, deferring the decision on whether a cluster of such cells represents a genuine biological state. marked &lt;- sce.416b marked$discard &lt;- reasons$discard The downside is that it shifts the burden of QC to the manual interpretation of the clusters, which is already a major bottleneck in scRNA-seq data analysis (Chapters 5, 6 and 7). Indeed, if we do not trust the QC metrics, we would have to distinguish between genuine cell types and low-quality cells based only on marker genes, and this is not always easy due to the tendency of the latter to “express” interesting genes (Section 1.1). Retention of low-quality cells also compromises the accuracy of the variance modelling, requiring, e.g., use of more PCs to offset the fact that the early PCs are driven by differences between low-quality and other cells. For routine analyses, we suggest performing removal by default to avoid complications from low-quality cells. This allows most of the population structure to be characterized with no - or, at least, fewer - concerns about its validity. Once the initial analysis is done, and if there are any concerns about discarded cell types (Advanced Section 1.5), a more thorough re-analysis can be performed where the low-quality cells are only marked. This recovers cell types with low RNA content, high mitochondrial proportions, etc. that only need to be interpreted insofar as they “fill the gaps” in the initial analysis. 
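If removal is chosen, a quick sanity check in the spirit of Advanced Section 1.5 is to compare the average expression of the discarded cells to that of the retained cells; genes that are much more abundant in the discarded group may hint at a lost cell type rather than damage. A minimal base-R sketch on a simulated count matrix (the matrix, group sizes and pseudo-count are all assumptions for illustration):

```r
set.seed(100)

# Simulated counts: 100 genes x 20 cells, with the first 5 cells discarded.
counts <- matrix(rpois(100 * 20, lambda = 5), nrow = 100,
    dimnames = list(sprintf("gene%03d", 1:100), NULL))
discard <- c(rep(TRUE, 5), rep(FALSE, 15))

# Average expression in each group, with a pseudo-count to stabilize ratios.
lost <- rowMeans(counts[, discard])
kept <- rowMeans(counts[, !discard])
logfc <- log2((lost + 1) / (kept + 1))

# Genes with large positive values would deserve a closer look.
head(sort(logfc, decreasing = TRUE), 3)
```

Here the counts are exchangeable by construction, so no gene should stand out; with real data, a coherent set of markers enriched in the discarded group is the warning sign.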
Session Info View session info R version 4.1.1 (2021-08-10) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.3 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] scater_1.22.0 ggplot2_3.3.5 [3] robustbase_0.93-9 scuttle_1.4.0 [5] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0 [7] Biobase_2.54.0 GenomicRanges_1.46.0 [9] GenomeInfoDb_1.30.0 IRanges_2.28.0 [11] S4Vectors_0.32.0 BiocGenerics_0.40.0 [13] MatrixGenerics_1.6.0 matrixStats_0.61.0 [15] BiocStyle_2.22.0 rebook_1.4.0 loaded via a namespace (and not attached): [1] bitops_1.0-7 filelock_1.0.2 [3] tools_4.1.1 bslib_0.3.1 [5] utf8_1.2.2 R6_2.5.1 [7] irlba_2.3.3 vipor_0.4.5 [9] DBI_1.1.1 colorspace_2.0-2 [11] withr_2.4.2 tidyselect_1.1.1 [13] gridExtra_2.3 compiler_4.1.1 [15] graph_1.72.0 BiocNeighbors_1.12.0 [17] DelayedArray_0.20.0 labeling_0.4.2 [19] bookdown_0.24 sass_0.4.0 [21] scales_1.1.1 DEoptimR_1.0-9 [23] rappdirs_0.3.3 stringr_1.4.0 [25] digest_0.6.28 rmarkdown_2.11 [27] XVector_0.34.0 pkgconfig_2.0.3 [29] htmltools_0.5.2 sparseMatrixStats_1.6.0 [31] highr_0.9 fastmap_1.1.0 [33] rlang_0.4.12 DelayedMatrixStats_1.16.0 [35] farver_2.1.0 jquerylib_0.1.4 [37] generics_0.1.1 jsonlite_1.7.2 [39] BiocParallel_1.28.0 dplyr_1.0.7 [41] RCurl_1.98-1.5 magrittr_2.0.1 [43] BiocSingular_1.10.0 GenomeInfoDbData_1.2.7 [45] Matrix_1.3-4 Rcpp_1.0.7 [47] ggbeeswarm_0.6.0 munsell_0.5.0 [49] fansi_0.5.0 viridis_0.6.2 [51] lifecycle_1.0.1 stringi_1.7.5 [53] yaml_2.2.1 zlibbioc_1.40.0 [55] grid_4.1.1 parallel_4.1.1 [57] ggrepel_0.9.1 crayon_1.4.1 
[59] dir.expiry_1.2.0 lattice_0.20-45 [61] cowplot_1.1.1 beachmat_2.10.0 [63] CodeDepends_0.6.5 knitr_1.36 [65] pillar_1.6.4 codetools_0.2-18 [67] ScaledMatrix_1.2.0 XML_3.99-0.8 [69] glue_1.4.2 evaluate_0.14 [71] BiocManager_1.30.16 vctrs_0.3.8 [73] gtable_0.3.0 purrr_0.3.4 [75] assertthat_0.2.1 xfun_0.27 [77] rsvd_1.0.5 viridisLite_0.4.0 [79] tibble_3.1.5 beeswarm_0.4.0 [81] ellipsis_0.3.2 References "],["normalization.html", "Chapter 2 Normalization 2.1 Motivation 2.2 Library size normalization 2.3 Normalization by deconvolution 2.4 Normalization by spike-ins 2.5 Scaling and log-transforming Session Info", " Chapter 2 Normalization 2.1 Motivation Systematic differences in sequencing coverage between libraries are often observed in single-cell RNA sequencing data (Stegle, Teichmann, and Marioni 2015). They typically arise from technical differences in cDNA capture or PCR amplification efficiency across cells, attributable to the difficulty of achieving consistent library preparation with minimal starting material. Normalization aims to remove these differences such that they do not interfere with comparisons of the expression profiles between cells. This ensures that any observed heterogeneity or differential expression within the cell population is driven by biology and not technical biases. We will mostly focus our attention on scaling normalization, which is the simplest and most commonly used class of normalization strategies. This involves dividing all counts for each cell by a cell-specific scaling factor, often called a “size factor” (Anders and Huber 2010). 
The assumption here is that any cell-specific bias (e.g., in capture or amplification efficiency) affects all genes equally via scaling of the expected mean count for that cell. The size factor for each cell represents the estimate of the relative bias in that cell, so division of its counts by its size factor should remove that bias. The resulting “normalized expression values” can then be used for downstream analyses such as clustering and dimensionality reduction. To demonstrate, we will use the Zeisel et al. (2015) dataset from the scRNAseq package. View set-up code (Workflow Chapter 2) #--- loading ---# library(scRNAseq) sce.zeisel &lt;- ZeiselBrainData() library(scater) sce.zeisel &lt;- aggregateAcrossFeatures(sce.zeisel, id=sub(&quot;_loc[0-9]+$&quot;, &quot;&quot;, rownames(sce.zeisel))) #--- gene-annotation ---# library(org.Mm.eg.db) rowData(sce.zeisel)$Ensembl &lt;- mapIds(org.Mm.eg.db, keys=rownames(sce.zeisel), keytype=&quot;SYMBOL&quot;, column=&quot;ENSEMBL&quot;) #--- quality-control ---# stats &lt;- perCellQCMetrics(sce.zeisel, subsets=list( Mt=rowData(sce.zeisel)$featureType==&quot;mito&quot;)) qc &lt;- quickPerCellQC(stats, percent_subsets=c(&quot;altexps_ERCC_percent&quot;, &quot;subsets_Mt_percent&quot;)) sce.zeisel &lt;- sce.zeisel[,!qc$discard] sce.zeisel ## class: SingleCellExperiment ## dim: 19839 2816 ## metadata(0): ## assays(1): counts ## rownames(19839): 0610005C13Rik 0610007N19Rik ... mt-Tw mt-Ty ## rowData names(2): featureType Ensembl ## colnames(2816): 1772071015_C02 1772071017_G12 ... 1772063068_D01 ## 1772066098_A12 ## colData names(10): tissue group # ... level1class level2class ## reducedDimNames(0): ## mainExpName: endogenous ## altExpNames(2): ERCC repeat 2.2 Library size normalization Library size normalization is the simplest strategy for performing scaling normalization. 
We define the library size as the total sum of counts across all genes for each cell, the expected value of which is assumed to scale with any cell-specific biases. The “library size factor” for each cell is then directly proportional to its library size, where the proportionality constant is defined such that the mean size factor across all cells is equal to 1. This definition ensures that the normalized expression values are on the same scale as the original counts, which is useful for interpretation - especially when dealing with transformed data (see Section 2.5). library(scater) lib.sf.zeisel &lt;- librarySizeFactors(sce.zeisel) summary(lib.sf.zeisel) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.176 0.568 0.868 1.000 1.278 4.084 In the Zeisel brain data, the library size factors differ by more than 20-fold across cells, from about 0.18 to 4.1 (Figure 2.1). This is typical of the variability in coverage in scRNA-seq data. hist(log10(lib.sf.zeisel), xlab=&quot;Log10[Size factor]&quot;, col=&#39;grey80&#39;) Figure 2.1: Distribution of size factors derived from the library size in the Zeisel brain dataset. Strictly speaking, the use of library size factors assumes that there is no “imbalance” in the differentially expressed (DE) genes between any pair of cells. That is, any upregulation for a subset of genes is cancelled out by the same magnitude of downregulation in a different subset of genes. This ensures that the library size is an unbiased estimate of the relative cell-specific bias by avoiding composition effects (Robinson and Oshlack 2010). However, balanced DE is not generally present in scRNA-seq applications, which means that library size normalization may not yield accurate normalized expression values for downstream analyses. In practice, normalization accuracy is not a major consideration for exploratory scRNA-seq data analyses. 
Composition biases do not usually affect the separation of clusters, only the magnitude - and to a lesser extent, direction - of the log-fold changes between clusters or cell types. As such, library size normalization is usually sufficient in many applications where the aim is to identify clusters and the top markers that define each cluster. 2.3 Normalization by deconvolution As previously mentioned, composition biases will be present when any unbalanced differential expression exists between samples. Consider the simple example of two cells where a single gene \\(X\\) is upregulated in one cell \\(A\\) compared to the other cell \\(B\\). This upregulation means that either (i) more sequencing resources are devoted to \\(X\\) in \\(A\\), thus decreasing coverage of all other non-DE genes when the total library size of each cell is experimentally fixed (e.g., due to library quantification); or (ii) the library size of \\(A\\) increases when \\(X\\) is assigned more reads or UMIs, increasing the library size factor and yielding smaller normalized expression values for all non-DE genes. In both cases, the net effect is that non-DE genes in \\(A\\) will incorrectly appear to be downregulated compared to \\(B\\). The removal of composition biases is a well-studied problem for bulk RNA sequencing data analysis. Normalization can be performed with the estimateSizeFactorsFromMatrix() function in the DESeq2 package (Anders and Huber 2010; Love, Huber, and Anders 2014) or with the calcNormFactors() function (Robinson and Oshlack 2010) in the edgeR package. These assume that most genes are not DE between cells. Any systematic difference in count size across the non-DE majority of genes between two cells is assumed to represent bias that is used to compute an appropriate size factor for its removal. However, single-cell data can be problematic for these bulk normalization methods due to the dominance of low and zero counts. 
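The two-cell example above can be made concrete with some assumed counts: gene X is genuinely upregulated in cell A, every other gene is not DE, and yet library size normalization makes the non-DE genes appear downregulated in A. A base-R sketch (the counts are invented):

```r
# Ten genes; only the first (gene X) is truly upregulated in cell A.
B <- c(10, rep(10, 9))   # cell B: all genes at 10 counts
A <- c(100, rep(10, 9))  # cell A: 10-fold upregulation of gene X only

# Library size factors, scaled so that the mean factor is 1.
ls <- c(A = sum(A), B = sum(B))  # 190 and 100
sf <- ls / mean(ls)

# After normalization, a non-DE gene appears ~1.9-fold lower in A.
(A / sf["A"])[2] / (B / sf["B"])[2]  # equals 100/190, about 0.53
```

The ratio for every non-DE gene collapses to sum(B)/sum(A), which is exactly the composition effect that deconvolution-based size factors aim to correct.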
To overcome this, we pool counts from many cells to increase the size of the counts for accurate size factor estimation (Lun, Bach, and Marioni 2016). Pool-based size factors are then “deconvolved” into cell-based factors for normalization of each cell’s expression profile. This is performed using the calculateSumFactors() function from scran, as shown below. library(scran) set.seed(100) clust.zeisel <- quickCluster(sce.zeisel) table(clust.zeisel) ## clust.zeisel ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ## 170 254 441 178 393 148 219 240 189 123 112 103 135 111 deconv.sf.zeisel <- calculateSumFactors(sce.zeisel, cluster=clust.zeisel) summary(deconv.sf.zeisel) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.119 0.486 0.831 1.000 1.321 4.509 We use a pre-clustering step with quickCluster() where cells in each cluster are normalized separately and the size factors are rescaled to be comparable across clusters. This avoids the assumption that most genes are non-DE across the entire population - only a non-DE majority is required between pairs of clusters, which is a weaker assumption for highly heterogeneous populations. By default, quickCluster() will use an approximate algorithm for PCA based on methods from the irlba package. The approximation relies on stochastic initialization so we need to set the random seed (via set.seed()) for reproducibility. We see that the deconvolution size factors exhibit cell type-specific deviations from the library size factors in Figure 2.2. This is consistent with the presence of composition biases that are introduced by strong differential expression between cell types. Use of the deconvolution size factors adjusts for these biases to improve normalization accuracy for downstream applications. 
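As an aside, the pool-and-deconvolve idea can be sketched in base R. This is not scran's actual algorithm - which uses many overlapping pools and robust estimation - and the cells, profile and pool choices below are invented; it only shows how summed pool coverage can be solved back into per-cell factors:

```r
true.sf <- c(0.5, 1, 1.5, 2)             # true per-cell biases (unknown in practice)
profile <- c(5, 10, 20)                  # shared expression profile, no noise
counts <- outer(profile, true.sf)        # 3 genes x 4 cells
pools <- list(c(1,2), c(2,3), c(3,4), c(4,1), c(1,2,3))
# each pool's summed coverage estimates the sum of its members' biases
pool.est <- sapply(pools, function(p) sum(counts[, p]) / sum(profile))
# design matrix: which cells contribute to each pool
design <- t(sapply(pools, function(p) as.integer(1:4 %in% p)))
cell.est <- qr.solve(design, pool.est)   # deconvolve pools into per-cell factors
cell.est / mean(cell.est)                # relative size factors, mean 1
```

Because pooled counts are larger, each pool-level estimate is far more stable than any single cell's, and the linear system still recovers the per-cell factors.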
plot(lib.sf.zeisel, deconv.sf.zeisel, xlab="Library size factor", ylab="Deconvolution size factor", log='xy', pch=16, col=as.integer(factor(sce.zeisel$level1class))) abline(a=0, b=1, col="red") Figure 2.2: Deconvolution size factor for each cell in the Zeisel brain dataset, compared to the equivalent size factor derived from the library size. The red line corresponds to identity between the two size factors. Accurate normalization is most important for procedures that involve estimation and interpretation of per-gene statistics. For example, composition biases can compromise DE analyses by systematically shifting the log-fold changes in one direction or another. However, accurate normalization tends to provide less benefit over simple library size normalization for cell-based analyses such as clustering. The presence of composition biases already implies strong differences in expression profiles, so changing the normalization strategy is unlikely to affect the outcome of a clustering procedure. 2.4 Normalization by spike-ins Spike-in normalization is based on the assumption that the same amount of spike-in RNA was added to each cell (Lun et al. 2017). Systematic differences in the coverage of the spike-in transcripts can only be due to cell-specific biases, e.g., in capture efficiency or sequencing depth. To remove these biases, we equalize spike-in coverage across cells by scaling with “spike-in size factors”. Compared to the previous methods, spike-in normalization requires no assumption about the biology of the system (i.e., the absence of many DE genes). Instead, it assumes that the spike-in transcripts were (i) added at a constant level to each cell, and (ii) respond to biases in the same relative manner as endogenous genes. Practically, spike-in normalization should be used if differences in the total RNA content of individual cells are of interest and must be preserved in downstream analyses. 
For a given cell, an increase in its overall amount of endogenous RNA will not increase its spike-in size factor. This ensures that the effects of total RNA content on expression across the population will not be removed upon scaling. By comparison, the other normalization methods described above will simply interpret any change in total RNA content as part of the bias and remove it. We demonstrate the use of spike-in normalization on a different dataset involving T cell activation after stimulation with T cell receptor ligands of varying affinity (Richard et al. 2018). library(scRNAseq) sce.richard <- RichardTCellData() sce.richard <- sce.richard[,sce.richard$`single cell quality`=="OK"] sce.richard ## class: SingleCellExperiment ## dim: 46603 528 ## metadata(0): ## assays(1): counts ## rownames(46603): ENSMUSG00000102693 ENSMUSG00000064842 ... ## ENSMUSG00000096730 ENSMUSG00000095742 ## rowData names(0): ## colnames(528): SLX-12611.N701_S502. SLX-12611.N702_S502. ... ## SLX-12612.i712_i522. SLX-12612.i714_i522. ## colData names(13): age individual ... stimulus time ## reducedDimNames(0): ## mainExpName: endogenous ## altExpNames(1): ERCC We apply the computeSpikeFactors() method to estimate spike-in size factors for all cells. This is defined by converting the total spike-in count per cell into a size factor, using the same reasoning as in librarySizeFactors(). Scaling will subsequently remove any differences in spike-in coverage across cells. sce.richard <- computeSpikeFactors(sce.richard, "ERCC") summary(sizeFactors(sce.richard)) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.125 0.428 0.627 1.000 1.070 23.316 We observe a positive correlation between the spike-in size factors and deconvolution size factors within each treatment condition (Figure 2.3), indicating that they are capturing similar technical biases in sequencing depth and capture efficiency. 
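Stepping away from the dataset for a moment, the preservation of total RNA content can be sketched with a hypothetical pair of cells: cell B contains twice the endogenous RNA of cell A but receives the same amount of spike-in.

```r
endo  <- cbind(cellA=c(100, 200, 300), cellB=2*c(100, 200, 300))
spike <- cbind(cellA=c(50, 50),        cellB=c(50, 50))
lib.sf   <- colSums(endo)  / mean(colSums(endo))   # responds to total RNA
spike.sf <- colSums(spike) / mean(colSums(spike))  # does not
lib.sf    # cellA < cellB: library size factors absorb the RNA content change
spike.sf  # equal: spike-in factors leave that difference in the data
```

Scaling by `lib.sf` would equalize the two cells and erase the twofold difference in RNA content, whereas scaling by `spike.sf` leaves it intact.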
However, we also observe that increasing stimulation of the T cell receptor - in terms of increasing affinity or time - results in a decrease in the spike-in factors relative to the library size factors. This is consistent with an increase in biosynthetic activity and total RNA content during stimulation, which reduces the relative spike-in coverage in each library (thereby decreasing the spike-in size factors) but increases the coverage of endogenous genes (thus increasing the library size factors). to.plot <- data.frame( DeconvFactor=calculateSumFactors(sce.richard), SpikeFactor=sizeFactors(sce.richard), Stimulus=sce.richard$stimulus, Time=sce.richard$time ) ggplot(to.plot, aes(x=DeconvFactor, y=SpikeFactor, color=Time)) + geom_point() + facet_wrap(~Stimulus) + scale_x_log10() + scale_y_log10() + geom_abline(intercept=0, slope=1, color="red") Figure 2.3: Size factors from spike-in normalization, plotted against the library size factors for all cells in the T cell dataset. Each plot represents a different ligand treatment and each point is a cell coloured according to the time from stimulation. The differences between these two sets of size factors have real consequences for downstream interpretation. If the spike-in size factors were applied to the counts, the expression values in unstimulated cells would be scaled up while expression in stimulated cells would be scaled down. However, the opposite would occur if the deconvolution size factors were used. This can manifest as shifts in the magnitude and direction of DE between conditions when we switch between normalization strategies, as shown below for Malat1 (Figure 2.4). # See below for explanation of logNormCounts(). 
sce.richard.deconv <- logNormCounts(sce.richard, size_factors=to.plot$DeconvFactor) sce.richard.spike <- logNormCounts(sce.richard, size_factors=to.plot$SpikeFactor) gridExtra::grid.arrange( plotExpression(sce.richard.deconv, x="stimulus", colour_by="time", features="ENSMUSG00000092341") + theme(axis.text.x = element_text(angle = 90)) + ggtitle("After deconvolution"), plotExpression(sce.richard.spike, x="stimulus", colour_by="time", features="ENSMUSG00000092341") + theme(axis.text.x = element_text(angle = 90)) + ggtitle("After spike-in normalization"), ncol=2 ) Figure 2.4: Distribution of log-normalized expression values for Malat1 after normalization with the deconvolution size factors (left) or spike-in size factors (right). Cells are stratified by the ligand affinity and colored by the time after stimulation. Whether or not total RNA content is relevant – and thus, the choice of normalization strategy – depends on the biological hypothesis. In most cases, changes in total RNA content are not interesting and can be normalized out by applying the library size or deconvolution factors. However, this may not always be appropriate if differences in total RNA are associated with a biological process of interest, e.g., cell cycle activity or T cell activation. Spike-in normalization will preserve these differences such that any changes in expression between biological groups have the correct sign. Regardless of whether we care about total RNA content, however, it is critical that the spike-in transcripts are normalized using the spike-in size factors. Size factors computed from the counts for endogenous genes should not be applied to the spike-in transcripts, precisely because the former capture differences in total RNA content that are not experienced by the latter. 
Attempting to normalize the spike-in counts with the gene-based size factors will lead to over-normalization and incorrect quantification. Thus, if normalized spike-in data is required, we must compute a separate set of size factors for the spike-in transcripts; this is automatically performed by functions such as modelGeneVarWithSpikes(). 2.5 Scaling and log-transforming Once we have computed the size factors, we use the logNormCounts() function from scater to compute normalized expression values for each cell. This is done by dividing the count for each gene/spike-in transcript by the appropriate size factor for that cell. The function also log-transforms the normalized values, creating a new assay called "logcounts". (Technically, these are “log-transformed normalized expression values”, but that’s too much of a mouthful to fit into the assay name.) These log-values will be the basis of our downstream analyses in the following chapters. set.seed(100) clust.zeisel <- quickCluster(sce.zeisel) sce.zeisel <- computeSumFactors(sce.zeisel, cluster=clust.zeisel, min.mean=0.1) sce.zeisel <- logNormCounts(sce.zeisel) assayNames(sce.zeisel) ## [1] "counts" "logcounts" The log-transformation is useful as differences in the log-values represent log-fold changes in expression. This is important in downstream procedures based on Euclidean distances, which includes many forms of clustering and dimensionality reduction. By operating on log-transformed data, we ensure that these procedures measure distances between cells based on log-fold changes in expression. Or in other words, which is more interesting - a gene that is expressed at an average count of 50 in cell type A and 10 in cell type B, or a gene that is expressed at an average count of 1100 in A and 1000 in B? Log-transformation focuses on the former by promoting contributions from genes with strong relative differences. 
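The 50-versus-10 example can be checked directly. By default, logNormCounts() computes log2 of the size-factor-scaled counts plus a pseudo-count of 1, so differences in log-values approximate log2-fold changes:

```r
# Differences on the log2 scale approximate log-fold changes:
d.small <- diff(log2(c(10, 50) + 1))      # the 5-fold change: ~2.2 log2 units
d.big   <- diff(log2(c(1000, 1100) + 1))  # the 1.1-fold change: ~0.14 log2 units
c(d.small, d.big)
```

The 50-versus-10 gene contributes a much larger squared difference to any Euclidean distance than the 1100-versus-1000 gene, which is exactly the behaviour we want.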
See Advanced Chapter 2 for further comments on transformation strategies. Session Info R version 4.1.1 (2021-08-10), Bioconductor 3.14, running under Ubuntu 20.04.3 LTS. Chapter 3 Feature selection 3.1 Motivation We often use scRNA-seq data in exploratory analyses to 
characterize heterogeneity across cells. Procedures like clustering and dimensionality reduction compare cells based on their gene expression profiles, which involves aggregating per-gene differences into a single (dis)similarity metric between a pair of cells. The choice of genes to use in this calculation has a major impact on the behavior of the metric and the performance of downstream methods. We want to select genes that contain useful information about the biology of the system while removing genes that contain random noise. This aims to preserve interesting biological structure without the variance that obscures that structure, and to reduce the size of the data to improve computational efficiency of later steps. The simplest approach to feature selection is to select the most variable genes based on their expression across the population. This assumes that genuine biological differences will manifest as increased variation in the affected genes, compared to other genes that are only affected by technical noise or a baseline level of “uninteresting” biological variation (e.g., from transcriptional bursting). Several methods are available to quantify the variation per gene and to select an appropriate set of highly variable genes (HVGs). 
We will discuss these below using the 10X PBMC dataset for demonstration: View set-up code (Workflow Chapter 3) #--- loading ---# library(DropletTestFiles) raw.path <- getTestFile("tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz") out.path <- file.path(tempdir(), "pbmc4k") untar(raw.path, exdir=out.path) library(DropletUtils) fname <- file.path(out.path, "raw_gene_bc_matrices/GRCh38") sce.pbmc <- read10xCounts(fname, col.names=TRUE) #--- gene-annotation ---# library(scater) rownames(sce.pbmc) <- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol) library(EnsDb.Hsapiens.v86) location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, column="SEQNAME", keytype="GENEID") #--- cell-detection ---# set.seed(100) e.out <- emptyDrops(counts(sce.pbmc)) sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)] #--- quality-control ---# stats <- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location=="MT"))) high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher") sce.pbmc <- sce.pbmc[,!high.mito] #--- normalization ---# library(scran) set.seed(1000) clusters <- quickCluster(sce.pbmc) sce.pbmc <- computeSumFactors(sce.pbmc, cluster=clusters) sce.pbmc <- logNormCounts(sce.pbmc) sce.pbmc ## class: SingleCellExperiment ## dim: 33694 3985 ## metadata(1): Samples ## assays(2): counts logcounts ## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B ## rowData names(2): ID Symbol ## colnames(3985): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ... 
## TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1 ## colData names(3): Sample Barcode sizeFactor ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): As well as the 416B dataset: View set-up code (Workflow Chapter 1) #--- loading ---# library(scRNAseq) sce.416b <- LunSpikeInData(which="416b") sce.416b$block <- factor(sce.416b$block) #--- gene-annotation ---# library(AnnotationHub) ens.mm.v97 <- AnnotationHub()[["AH73905"]] rowData(sce.416b)$ENSEMBL <- rownames(sce.416b) rowData(sce.416b)$SYMBOL <- mapIds(ens.mm.v97, keys=rownames(sce.416b), keytype="GENEID", column="SYMBOL") rowData(sce.416b)$SEQNAME <- mapIds(ens.mm.v97, keys=rownames(sce.416b), keytype="GENEID", column="SEQNAME") library(scater) rownames(sce.416b) <- uniquifyFeatureNames(rowData(sce.416b)$ENSEMBL, rowData(sce.416b)$SYMBOL) #--- quality-control ---# mito <- which(rowData(sce.416b)$SEQNAME=="MT") stats <- perCellQCMetrics(sce.416b, subsets=list(Mt=mito)) qc <- quickPerCellQC(stats, percent_subsets=c("subsets_Mt_percent", "altexps_ERCC_percent"), batch=sce.416b$block) sce.416b <- sce.416b[,!qc$discard] #--- normalization ---# library(scran) sce.416b <- computeSumFactors(sce.416b) sce.416b <- logNormCounts(sce.416b) sce.416b ## class: SingleCellExperiment ## dim: 46604 185 ## metadata(0): ## assays(2): counts logcounts ## rownames(46604): 4933401J01Rik Gm26206 ... CAAA01147332.1 ## CBFB-MYH11-mcherry ## rowData names(4): Length ENSEMBL SYMBOL SEQNAME ## colnames(185): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1 ## SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ... ## SLX-11312.N712_S507.H5H5YBBXX.s_8.r_1 ## SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1 ## colData names(10): Source Name cell line ... 
block sizeFactor ## reducedDimNames(0): ## mainExpName: endogenous ## altExpNames(2): ERCC SIRV 3.2 Quantifying per-gene variation The simplest approach to quantifying per-gene variation is to compute the variance of the log-normalized expression values (i.e., “log-counts”) for each gene across all cells (A. T. L. Lun, McCarthy, and Marioni 2016). The advantage of this approach is that the feature selection is based on the same log-values that are used for later downstream steps. In particular, genes with the largest variances in log-values will contribute most to the Euclidean distances between cells during procedures like clustering and dimensionality reduction. By using log-values here, we ensure that our quantitative definition of heterogeneity is consistent throughout the entire analysis. Calculation of the per-gene variance is simple but feature selection requires modelling of the mean-variance relationship. The log-transformation is not a variance stabilizing transformation in most cases, which means that the total variance of a gene is driven more by its abundance than by its underlying biological heterogeneity. To account for this effect, we use the modelGeneVar() function to fit a trend to the variance with respect to abundance across all genes (Figure 3.1). library(scran) dec.pbmc <- modelGeneVar(sce.pbmc) # Visualizing the fit: fit.pbmc <- metadata(dec.pbmc) plot(fit.pbmc$mean, fit.pbmc$var, xlab="Mean of log-expression", ylab="Variance of log-expression") curve(fit.pbmc$trend(x), col="dodgerblue", add=TRUE, lwd=2) Figure 3.1: Variance in the PBMC data set as a function of the mean. Each point represents a gene while the blue line represents the trend fitted to all genes. At any given abundance, we assume that the variation in expression for most genes is driven by uninteresting processes like sampling noise. 
Under this assumption, the fitted value of the trend at any given gene’s abundance represents an estimate of its uninteresting variation, which we call the technical component. We then define the biological component for each gene as the difference between its total variance and the technical component. This biological component represents the “interesting” variation for each gene and can be used as the metric for HVG selection. # Ordering by most interesting genes for inspection. dec.pbmc[order(dec.pbmc$bio, decreasing=TRUE),] ## DataFrame with 33694 rows and 6 columns ## mean total tech bio p.value FDR ## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> ## LYZ 1.95605 5.05854 0.835343 4.22320 1.10535e-270 2.17411e-266 ## S100A9 1.93416 4.53551 0.835439 3.70007 2.71037e-208 7.61576e-205 ## S100A8 1.69961 4.41084 0.824342 3.58650 4.31572e-201 9.43177e-198 ## HLA-DRA 2.09785 3.75174 0.831239 2.92050 5.93941e-132 4.86760e-129 ## CD74 2.90176 3.36879 0.793188 2.57560 4.83931e-113 2.50485e-110 ## ... ... ... ... ... ... ... ## TMSB4X 6.08142 0.441718 0.679215 -0.237497 0.992447 1 ## PTMA 3.82978 0.486454 0.731275 -0.244821 0.990002 1 ## HLA-B 4.50032 0.486130 0.739577 -0.253447 0.991376 1 ## EIF1 3.23488 0.482869 0.768946 -0.286078 0.995135 1 ## B2M 5.95196 0.314948 0.654228 -0.339280 0.999843 1 (Careful readers will notice that some genes have negative biological components, which have no obvious interpretation and can be ignored in most applications. They are inevitable when fitting a trend to the per-gene variances as approximately half of the genes will lie below the trend.) Strictly speaking, the interpretation of the fitted trend as the technical component assumes that the expression profiles of most genes are dominated by random technical noise. In practice, all expressed genes will exhibit some non-zero level of biological variability due to events like transcriptional bursting. 
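Given a fitted trend, the decomposition itself is just a subtraction. The means and total variances below are toy values and trend() stands in for a hypothetical fitted curve:

```r
trend <- function(m) 0.8 * m / (1 + m)          # hypothetical fitted trend
means     <- c(LYZ=1.96, S100A9=1.93, B2M=5.95)
total.var <- c(LYZ=5.06, S100A9=4.54, B2M=0.31)
tech <- trend(means)                # technical (plus uninteresting) component
bio  <- total.var - tech            # biological component; can be negative
names(sort(bio, decreasing=TRUE))   # ranking used for HVG selection
```

Genes sitting below the trend, like the toy B2M here, naturally end up with negative biological components.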
Thus, it would be more appropriate to consider these estimates as technical noise plus “uninteresting” biological variation, under the assumption that most genes do not participate in the processes driving interesting heterogeneity across the population. 3.3 Quantifying technical noise The assumption in Section 3.2 may be problematic in rare scenarios where many genes at a particular abundance are affected by a biological process. For example, strong upregulation of cell type-specific genes may result in an enrichment of HVGs at high abundances. This would inflate the fitted trend in that abundance interval and compromise the detection of the relevant genes. We can avoid this problem by fitting a mean-dependent trend to the variance of the spike-in transcripts (Figure 3.2), if they are available. The premise here is that spike-ins should not be affected by biological variation, so the fitted value of the spike-in trend should represent a better estimate of the technical component for each gene. dec.spike.416b <- modelGeneVarWithSpikes(sce.416b, "ERCC") dec.spike.416b[order(dec.spike.416b$bio, decreasing=TRUE),] ## DataFrame with 46604 rows and 6 columns ## mean total tech bio p.value FDR ## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> ## Lyz2 6.61097 13.8497 1.57131 12.2784 1.48993e-186 1.54156e-183 ## Ccl9 6.67846 13.1869 1.50035 11.6866 2.21855e-185 2.19979e-182 ## Top2a 5.81024 14.1787 2.54776 11.6310 3.80016e-65 1.13040e-62 ## Cd200r3 4.83180 15.5613 4.22984 11.3314 9.46221e-24 6.08574e-22 ## Ccnb2 5.97776 13.1393 2.30177 10.8375 3.68706e-69 1.20193e-66 ## ... ... ... ... ... ... ... 
## Rpl5-ps2 3.60625 0.612623 6.32853 -5.71590 0.999616 0.999726 ## Gm11942 3.38768 0.798570 6.51473 -5.71616 0.999459 0.999726 ## Gm12816 2.91276 0.838670 6.57364 -5.73497 0.999422 0.999726 ## Gm13623 2.72844 0.708071 6.45448 -5.74641 0.999544 0.999726 ## Rps12l1 3.15420 0.746615 6.59332 -5.84670 0.999522 0.999726 plot(dec.spike.416b$mean, dec.spike.416b$total, xlab="Mean of log-expression", ylab="Variance of log-expression") fit.spike.416b <- metadata(dec.spike.416b) points(fit.spike.416b$mean, fit.spike.416b$var, col="red", pch=16) curve(fit.spike.416b$trend(x), col="dodgerblue", add=TRUE, lwd=2) Figure 3.2: Variance in the 416B data set as a function of the mean. Each point represents a gene (black) or spike-in transcript (red) and the blue line represents the trend fitted to all spike-ins. In the absence of spike-in data, one can attempt to create a trend by making some distributional assumptions about the noise. For example, UMI counts typically exhibit near-Poisson variation if we only consider technical noise from library preparation and sequencing. This can be used to construct a mean-variance trend in the log-counts (Figure 3.3) with the modelGeneVarByPoisson() function. Note the increased residuals of the high-abundance genes, which can be interpreted as the amount of biological variation that was assumed to be “uninteresting” when fitting the gene-based trend in Figure 3.1. 
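The shape of a purely Poisson-driven trend can be seen by simulation: draw Poisson counts over a range of means and record the variance of the log2-counts. This is only a sketch of the idea behind a Poisson-based trend, not modelGeneVarByPoisson()'s implementation:

```r
set.seed(42)
mu <- 2^seq(-2, 6, length.out=20)   # range of mean counts
log.var <- sapply(mu, function(m) var(log2(rpois(1e4, m) + 1)))
# variance of log-counts rises from zero, peaks at moderate
# abundance, then declines at high abundance
```

Any gene whose observed variance sits well above this curve at its abundance carries variation beyond pure sequencing noise.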
set.seed(0010101) dec.pois.pbmc <- modelGeneVarByPoisson(sce.pbmc) dec.pois.pbmc <- dec.pois.pbmc[order(dec.pois.pbmc$bio, decreasing=TRUE),] head(dec.pois.pbmc) ## DataFrame with 6 rows and 6 columns ## mean total tech bio p.value FDR ## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> ## LYZ 1.95605 5.05854 0.631190 4.42735 0 0 ## S100A9 1.93416 4.53551 0.635102 3.90040 0 0 ## S100A8 1.69961 4.41084 0.671491 3.73935 0 0 ## HLA-DRA 2.09785 3.75174 0.604448 3.14730 0 0 ## CD74 2.90176 3.36879 0.444928 2.92386 0 0 ## CST3 1.47546 2.95646 0.691386 2.26507 0 0 plot(dec.pois.pbmc$mean, dec.pois.pbmc$total, pch=16, xlab="Mean of log-expression", ylab="Variance of log-expression") curve(metadata(dec.pois.pbmc)$trend(x), col="dodgerblue", add=TRUE) Figure 3.3: Variance of normalized log-expression values for each gene in the PBMC dataset, plotted against the mean log-expression. The blue line represents the mean-variance relationship corresponding to Poisson noise. Interestingly, trends based purely on technical noise tend to yield large biological components for highly-expressed genes. This often includes so-called “house-keeping” genes coding for essential cellular components such as ribosomal proteins, which are considered uninteresting for characterizing cellular heterogeneity. These observations suggest that a more accurate noise model does not necessarily yield a better ranking of HVGs, though one should keep an open mind - house-keeping genes are regularly DE in a variety of conditions (Glare et al. 2002; Nazari, Parham, and Maleki 2015; Guimaraes and Zavolan 2016), and the fact that they have large biological components indicates that there is strong variation across cells that may not be completely irrelevant. 3.4 Handling batch effects Data containing multiple batches will often exhibit batch effects - see Multi-sample Chapter 1 for more details. 
We are usually not interested in HVGs that are driven by batch effects; instead, we want to focus on genes that are highly variable within each batch. This is naturally achieved by performing trend fitting and variance decomposition separately for each batch. We demonstrate this approach by treating each plate (block) in the 416B dataset as a different batch, using the block= argument of the modelGeneVarWithSpikes() function. (The same argument is available in all other variance-modelling functions.) dec.block.416b <- modelGeneVarWithSpikes(sce.416b, "ERCC", block=sce.416b$block) head(dec.block.416b[order(dec.block.416b$bio, decreasing=TRUE),1:6]) ## DataFrame with 6 rows and 6 columns ## mean total tech bio p.value FDR ## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> ## Lyz2 6.61235 13.8619 1.58416 12.2777 0.00000e+00 0.00000e+00 ## Ccl9 6.67841 13.2599 1.44553 11.8143 0.00000e+00 0.00000e+00 ## Top2a 5.81275 14.0192 2.74571 11.2734 3.89855e-137 8.43398e-135 ## Cd200r3 4.83305 15.5909 4.31892 11.2719 1.17783e-54 7.00722e-53 ## Ccnb2 5.97999 13.0256 2.46647 10.5591 1.20380e-151 2.98405e-149 ## Hbb-bt 4.91683 14.6539 4.12156 10.5323 2.52639e-49 1.34197e-47 The use of a batch-specific trend fit is useful as it accommodates differences in the mean-variance trends between batches. This is especially important if batches exhibit systematic technical differences, e.g., differences in coverage or in the amount of spike-in RNA added. In this case, there are only minor differences between the trends in Figure 3.4, which indicates that the experiment was tightly replicated across plates. The analysis of each plate yields estimates of the biological and technical components for each gene, which are averaged across plates to take advantage of information from multiple batches. 
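The motivation for per-batch decomposition can be illustrated with a single simulated gene whose only variation is a shift between two hypothetical batches; its overall variance is inflated while the within-batch variances stay near 1:

```r
set.seed(1)
batch <- rep(c("A", "B"), each=50)
g <- rnorm(100, mean=ifelse(batch == "A", 0, 3))  # pure batch effect
var(g)                        # inflated by the between-batch shift
mean(tapply(g, batch, var))   # ~1 within each batch
```

A naive genome-wide variance calculation would flag this gene as "highly variable", whereas the per-batch estimates correctly show nothing interesting within either batch.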
par(mfrow=c(1,2)) blocked.stats <- dec.block.416b$per.block for (i in colnames(blocked.stats)) { current <- blocked.stats[[i]] plot(current$mean, current$total, main=i, pch=16, cex=0.5, xlab="Mean of log-expression", ylab="Variance of log-expression") curfit <- metadata(current) points(curfit$mean, curfit$var, col="red", pch=16) curve(curfit$trend(x), col='dodgerblue', add=TRUE, lwd=2) } Figure 3.4: Variance in the 416B data set as a function of the mean after blocking on the plate of origin. Each plot represents the results for a single plate, each point represents a gene (black) or spike-in transcript (red) and the blue line represents the trend fitted to all spike-ins. Alternatively, we might consider using a linear model to account for batch effects and other unwanted factors of variation. This is more flexible as it can handle multiple factors and continuous covariates, though it is less accurate than block= in the special case of a multi-batch design. See Advanced Section 3.3 for more details. As an aside, the wave-like shape observed above is typical of the mean-variance trend for log-expression values. (The same wave is present but much less pronounced for UMI data.) A linear increase in the variance is observed as the mean increases from zero, as larger variances are obviously possible when the counts are not all equal to zero. In contrast, the relative contribution of sampling noise decreases at high abundances, resulting in a downward trend. The peak represents the point at which these two competing effects cancel each other out. 3.5 Selecting highly variable genes Once we have quantified the per-gene variation, the next step is to select the subset of HVGs to use in downstream analyses. A larger subset will reduce the risk of discarding interesting biological signal by retaining more potentially relevant genes, at the cost of increasing noise from irrelevant genes that might obscure said signal. 
It is difficult to determine the optimal trade-off for any given application as noise in one context may be useful signal in another. For example, heterogeneity in T cell activation responses is an interesting phenomenon (Richard et al. 2018) but may be irrelevant noise in studies that only care about distinguishing the major immunophenotypes. The most obvious selection strategy is to take the top \\(n\\) genes with the largest values for the relevant variance metric. The main advantage of this approach is that the user can directly control the number of genes retained, which ensures that the computational complexity of downstream calculations is easily predicted. For modelGeneVar() and modelGeneVarWithSpikes(), we would select the genes with the largest biological components. This is conveniently done for us via getTopHVGs(), as shown below with \\(n=1000\\). # Taking the top 1000 genes here: hvg.pbmc.var &lt;- getTopHVGs(dec.pbmc, n=1000) str(hvg.pbmc.var) ## chr [1:1000] &quot;LYZ&quot; &quot;S100A9&quot; &quot;S100A8&quot; &quot;HLA-DRA&quot; &quot;CD74&quot; &quot;CST3&quot; &quot;TYROBP&quot; ... The choice of \\(n\\) also has a fairly straightforward biological interpretation. Recall our trend-fitting assumption that most genes do not exhibit biological heterogeneity; this implies that they are not differentially expressed between cell types or states in our population. If we quantify this assumption into a statement that, e.g., no more than 5% of genes are differentially expressed, we can naturally set \\(n\\) to 5% of the number of genes. In practice, we usually do not know the proportion of DE genes beforehand so this interpretation just exchanges one unknown for another. Nonetheless, it is still useful as it implies that we should lower \\(n\\) for less heterogeneous datasets, retaining most of the biological signal without unnecessary noise from irrelevant genes. 
Conversely, more heterogeneous datasets should use larger values of \\(n\\) to preserve secondary factors of variation beyond those driving the most obvious HVGs. The main disadvantage of this approach is that it turns HVG selection into a competition between genes, whereby a subset of very highly variable genes can push other informative genes out of the top set. This can be problematic for analyses of highly heterogeneous populations if the loss of important markers prevents the resolution of certain subpopulations. In the most extreme example, consider a situation where a single subpopulation is very different from the others. In such cases, the top set will be dominated by differentially expressed genes involving that distinct subpopulation, compromising resolution of heterogeneity between the other populations. (This can be recovered with a nested analysis, as discussed in Section 5.5, but we would prefer to avoid the problem in the first place.) Another potential concern with this approach is the fact that the choice of \\(n\\) is fairly arbitrary, with any value from 500 to 5000 considered “reasonable”. We have chosen \\(n=1000\\) in the code above though there is no particular a priori reason for doing so. Our recommendation is to simply pick an arbitrary \\(n\\) and proceed with the rest of the analysis, with the intention of testing other choices later, rather than spending much time worrying about obtaining the “optimal” value. Alternatively, we may pick one of the other selection strategies discussed in Advanced Section 3.5. 3.6 Putting it all together The code chunk below will select the top 10% of genes with the highest biological components. dec.pbmc &lt;- modelGeneVar(sce.pbmc) chosen &lt;- getTopHVGs(dec.pbmc, prop=0.1) str(chosen) ## chr [1:1274] &quot;LYZ&quot; &quot;S100A9&quot; &quot;S100A8&quot; &quot;HLA-DRA&quot; &quot;CD74&quot; &quot;CST3&quot; &quot;TYROBP&quot; ... 
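The selection strategies discussed above can be compared side by side. This is a hedged sketch (not from the book) reusing the dec.pbmc results; n= and prop= are shown in the text above, while fdr.threshold= is an additional significance-based filter that we believe getTopHVGs() supports in recent scran versions.

```r
# Sketch: three ways of asking getTopHVGs() for a subset, assuming 'dec.pbmc'
# from modelGeneVar() above is available.
library(scran)
hvg.n    <- getTopHVGs(dec.pbmc, n=1000)             # fixed number of genes
hvg.prop <- getTopHVGs(dec.pbmc, prop=0.05)          # e.g., assuming <=5% of genes are DE
hvg.fdr  <- getTopHVGs(dec.pbmc, fdr.threshold=0.05) # keep genes significant at 5% FDR
lengths(list(n=hvg.n, prop=hvg.prop, fdr=hvg.fdr))
```

The prop= form directly encodes the "no more than X% of genes are DE" assumption from the previous section, which makes the choice easier to justify in writing than an arbitrary n.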
We then have several options to enforce our HVG selection on the rest of the analysis. We can subset the SingleCellExperiment to only retain our selection of HVGs. This ensures that downstream methods will only use these genes for their calculations. The downside is that the non-HVGs are discarded from the new SingleCellExperiment, making it slightly more inconvenient to interrogate the full dataset for interesting genes that are not HVGs. sce.pbmc.hvg &lt;- sce.pbmc[chosen,] dim(sce.pbmc.hvg) ## [1] 1274 3985 We can keep the original SingleCellExperiment object and specify the genes to use for downstream functions via an extra argument like subset.row=. This is useful if the analysis uses multiple sets of HVGs at different steps, whereby one set of HVGs can be easily swapped for another in specific steps. # Performing PCA only on the chosen HVGs. library(scater) sce.pbmc &lt;- runPCA(sce.pbmc, subset_row=chosen) reducedDimNames(sce.pbmc) ## [1] &quot;PCA&quot; This approach is facilitated by the rowSubset() utility, which allows us to easily store one or more sets of interest in our SingleCellExperiment. By doing so, we avoid the need to keep track of a separate chosen variable and ensure that our HVG set is synchronized with any downstream row subsetting of sce.pbmc. rowSubset(sce.pbmc) &lt;- chosen # stored in the default &#39;subset&#39;. rowSubset(sce.pbmc, &quot;HVGs.more&quot;) &lt;- getTopHVGs(dec.pbmc, prop=0.2) rowSubset(sce.pbmc, &quot;HVGs.less&quot;) &lt;- getTopHVGs(dec.pbmc, prop=0.05) colnames(rowData(sce.pbmc)) ## [1] &quot;ID&quot; &quot;Symbol&quot; &quot;subset&quot; &quot;HVGs.more&quot; &quot;HVGs.less&quot; It can be inconvenient to repeatedly specify the desired feature set across steps, so some downstream functions will automatically subset to the default rowSubset() if present in the SingleCellExperiment. However, we find that it is generally safest to be explicit about which set is being used for a particular step. 
We can have our cake and eat it too by (ab)using the “alternative Experiment” system in the SingleCellExperiment class. Initially designed for storing alternative features like spike-ins or antibody tags, we can instead use it to hold our full dataset while we perform our downstream operations conveniently on the HVG subset. This avoids book-keeping problems in long analyses when the original dataset is not synchronized with the HVG subsetted data. # Recycling the class above. altExp(sce.pbmc.hvg, &quot;original&quot;) &lt;- sce.pbmc altExpNames(sce.pbmc.hvg) ## [1] &quot;original&quot; # No need for explicit subset_row= specification in downstream operations. sce.pbmc.hvg &lt;- runPCA(sce.pbmc.hvg) # Recover original data: sce.pbmc.original &lt;- altExp(sce.pbmc.hvg, &quot;original&quot;, withColData=TRUE) Session Info View session info R version 4.1.1 (2021-08-10) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.3 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] scater_1.22.0 ggplot2_3.3.5 [3] scran_1.22.0 scuttle_1.4.0 [5] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0 [7] Biobase_2.54.0 GenomicRanges_1.46.0 [9] GenomeInfoDb_1.30.0 IRanges_2.28.0 [11] S4Vectors_0.32.0 BiocGenerics_0.40.0 [13] MatrixGenerics_1.6.0 matrixStats_0.61.0 [15] BiocStyle_2.22.0 rebook_1.4.0 loaded via a namespace (and not attached): [1] bitops_1.0-7 filelock_1.0.2 [3] tools_4.1.1 bslib_0.3.1 [5] utf8_1.2.2 R6_2.5.1 [7] irlba_2.3.3 vipor_0.4.5 [9] DBI_1.1.1 colorspace_2.0-2 [11] withr_2.4.2 gridExtra_2.3 [13] 
tidyselect_1.1.1 compiler_4.1.1 [15] graph_1.72.0 BiocNeighbors_1.12.0 [17] DelayedArray_0.20.0 bookdown_0.24 [19] sass_0.4.0 scales_1.1.1 [21] rappdirs_0.3.3 stringr_1.4.0 [23] digest_0.6.28 rmarkdown_2.11 [25] XVector_0.34.0 pkgconfig_2.0.3 [27] htmltools_0.5.2 sparseMatrixStats_1.6.0 [29] fastmap_1.1.0 limma_3.50.0 [31] highr_0.9 rlang_0.4.12 [33] DelayedMatrixStats_1.16.0 jquerylib_0.1.4 [35] generics_0.1.1 jsonlite_1.7.2 [37] BiocParallel_1.28.0 dplyr_1.0.7 [39] RCurl_1.98-1.5 magrittr_2.0.1 [41] BiocSingular_1.10.0 GenomeInfoDbData_1.2.7 [43] Matrix_1.3-4 ggbeeswarm_0.6.0 [45] Rcpp_1.0.7 munsell_0.5.0 [47] fansi_0.5.0 viridis_0.6.2 [49] lifecycle_1.0.1 stringi_1.7.5 [51] yaml_2.2.1 edgeR_3.36.0 [53] zlibbioc_1.40.0 grid_4.1.1 [55] ggrepel_0.9.1 parallel_4.1.1 [57] dqrng_0.3.0 crayon_1.4.1 [59] dir.expiry_1.2.0 lattice_0.20-45 [61] beachmat_2.10.0 locfit_1.5-9.4 [63] CodeDepends_0.6.5 metapod_1.2.0 [65] knitr_1.36 pillar_1.6.4 [67] igraph_1.2.7 codetools_0.2-18 [69] ScaledMatrix_1.2.0 XML_3.99-0.8 [71] glue_1.4.2 evaluate_0.14 [73] BiocManager_1.30.16 vctrs_0.3.8 [75] purrr_0.3.4 gtable_0.3.0 [77] assertthat_0.2.1 xfun_0.27 [79] rsvd_1.0.5 viridisLite_0.4.0 [81] tibble_3.1.5 beeswarm_0.4.0 [83] cluster_2.1.2 bluster_1.4.0 [85] statmod_1.4.36 ellipsis_0.3.2 References Chapter 4 Dimensionality reduction 4.1 Overview Many scRNA-seq analysis procedures involve comparing cells based on their expression values across multiple genes. 
For example, clustering aims to identify cells with similar transcriptomic profiles by computing Euclidean distances across genes. In these applications, each individual gene represents a dimension of the data. More intuitively, if we had a scRNA-seq data set with two genes, we could make a two-dimensional plot where each axis represents the expression of one gene and each point in the plot represents a cell. This concept can be extended to data sets with thousands of genes where each cell’s expression profile defines its location in the high-dimensional expression space. As the name suggests, dimensionality reduction aims to reduce the number of separate dimensions in the data. This is possible because different genes are correlated if they are affected by the same biological process. Thus, we do not need to store separate information for individual genes, but can instead compress multiple features into a single dimension, e.g., an “eigengene” (Langfelder and Horvath 2007). This reduces computational work in downstream analyses like clustering, as calculations only need to be performed for a few dimensions rather than thousands of genes; reduces noise by averaging across multiple genes to obtain a more precise representation of the patterns in the data; and enables effective plotting of the data, for those of us who are not capable of visualizing more than 3 dimensions. We will use the Zeisel et al. (2015) dataset to demonstrate the applications of various dimensionality reduction methods in this chapter. 
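The "each gene is a dimension" idea can be made concrete with a few lines of base R on a toy matrix (not the Zeisel data): cells are compared by computing Euclidean distances across all genes at once.

```r
# Toy illustration: a small genes-by-cells count matrix.
set.seed(1)
mat <- matrix(rpois(5 * 3, lambda=10), nrow=5,
    dimnames=list(paste0("Gene", 1:5), paste0("Cell", 1:3)))

# dist() computes distances between rows, so we transpose the usual
# genes-by-cells layout to obtain cell-to-cell Euclidean distances,
# where each of the 5 genes contributes one dimension:
dist(t(mat))
```

With real data the same computation simply runs over thousands of gene-dimensions instead of five, which is exactly what dimensionality reduction aims to avoid.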
View set-up code (Workflow Chapter 2) #--- loading ---# library(scRNAseq) sce.zeisel &lt;- ZeiselBrainData() library(scater) sce.zeisel &lt;- aggregateAcrossFeatures(sce.zeisel, id=sub(&quot;_loc[0-9]+$&quot;, &quot;&quot;, rownames(sce.zeisel))) #--- gene-annotation ---# library(org.Mm.eg.db) rowData(sce.zeisel)$Ensembl &lt;- mapIds(org.Mm.eg.db, keys=rownames(sce.zeisel), keytype=&quot;SYMBOL&quot;, column=&quot;ENSEMBL&quot;) #--- quality-control ---# stats &lt;- perCellQCMetrics(sce.zeisel, subsets=list( Mt=rowData(sce.zeisel)$featureType==&quot;mito&quot;)) qc &lt;- quickPerCellQC(stats, percent_subsets=c(&quot;altexps_ERCC_percent&quot;, &quot;subsets_Mt_percent&quot;)) sce.zeisel &lt;- sce.zeisel[,!qc$discard] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.zeisel) sce.zeisel &lt;- computeSumFactors(sce.zeisel, cluster=clusters) sce.zeisel &lt;- logNormCounts(sce.zeisel) #--- variance-modelling ---# dec.zeisel &lt;- modelGeneVarWithSpikes(sce.zeisel, &quot;ERCC&quot;) top.hvgs &lt;- getTopHVGs(dec.zeisel, prop=0.1) sce.zeisel ## class: SingleCellExperiment ## dim: 19839 2816 ## metadata(0): ## assays(2): counts logcounts ## rownames(19839): 0610005C13Rik 0610007N19Rik ... mt-Tw mt-Ty ## rowData names(2): featureType Ensembl ## colnames(2816): 1772071015_C02 1772071017_G12 ... 1772063068_D01 ## 1772066098_A12 ## colData names(11): tissue group # ... level2class sizeFactor ## reducedDimNames(0): ## mainExpName: endogenous ## altExpNames(2): ERCC repeat 4.2 Principal components analysis Principal components analysis (PCA) discovers axes in high-dimensional space that capture the largest amount of variation. This is best understood by imagining each axis as a line. Say we draw a line anywhere, and we move each cell in our data set onto the closest position on the line. The variance captured by this axis is defined as the variance in the positions of cells along that line. 
In PCA, the first axis (or “principal component”, PC) is chosen such that it maximizes this variance. The next PC is chosen such that it is orthogonal to the first and captures the greatest remaining amount of variation, and so on. By definition, the top PCs capture the dominant factors of heterogeneity in the data set. In the context of scRNA-seq, our assumption is that biological processes affect multiple genes in a coordinated manner. This means that the earlier PCs are likely to represent biological structure as more variation can be captured by considering the correlated behavior of many genes. By comparison, random technical or biological noise is expected to affect each gene independently. There is unlikely to be an axis that can capture random variation across many genes, meaning that noise should mostly be concentrated in the later PCs. This motivates the use of the earlier PCs in our downstream analyses, which concentrates the biological signal to simultaneously reduce computational work and remove noise. The use of the earlier PCs for denoising and data compaction is a strategy that is simple, highly effective and widely used in a variety of fields. It takes advantage of the well-studied theoretical properties of PCA - namely, that a low-rank approximation formed from the top PCs is the optimal approximation of the original data for a given matrix rank. Indeed, the Euclidean distances between cells in PC space can be treated as an approximation of the same distances in the original dataset. The literature for PCA also provides us with a range of fast implementations for scalable and efficient data analysis. We perform a PCA on the log-normalized expression values using the fixedPCA() function from scran. By default, fixedPCA() will compute the first 50 PCs and store them in the reducedDims() of the output SingleCellExperiment object, as shown below. 
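Before running the scran code below, the variance-maximization and low-rank-approximation claims can be checked with base R's prcomp() on simulated data (illustrative only; all names here are made up). The data are built from two hidden factors plus noise, so the top two PCs should capture almost all of the variance and yield a near-perfect approximation.

```r
# Simulate a cells-by-genes matrix driven by 2 latent factors plus noise.
set.seed(42)
ncells <- 200; ngenes <- 100
truth <- matrix(rnorm(ncells * 2), ncol=2) %*% matrix(rnorm(2 * ngenes), nrow=2)
mat <- truth + matrix(rnorm(ncells * ngenes, sd=0.1), ncol=ngenes)

pca <- prcomp(mat)
head(pca$sdev^2 / sum(pca$sdev^2))  # variance explained drops sharply after PC2

# Rank-2 approximation from the top 2 PCs (scores %*% t(loadings) + centers):
approx2 <- sweep(pca$x[, 1:2] %*% t(pca$rotation[, 1:2]), 2, pca$center, "+")
sqrt(mean((approx2 - mat)^2))  # residual is on the scale of the added noise
```

The same logic underlies the scRNA-seq workflow: distances computed in the space of the top PCs approximate distances in the full expression space, minus much of the per-gene noise.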
Here, we use only the top 2000 genes with the largest biological components to reduce both computational work and high-dimensional random noise. In particular, while PCA is robust to random noise, an excess of it may cause the earlier PCs to capture noise instead of biological structure (Johnstone and Lu 2009). This effect can be mitigated by restricting the PCA to a subset of HVGs, for which we can use any of the strategies described in Chapter 3. library(scran) top.zeisel &lt;- getTopHVGs(dec.zeisel, n=2000) set.seed(100) # See below. sce.zeisel &lt;- fixedPCA(sce.zeisel, subset.row=top.zeisel) reducedDimNames(sce.zeisel) ## [1] &quot;PCA&quot; dim(reducedDim(sce.zeisel, &quot;PCA&quot;)) ## [1] 2816 50 For large data sets, greater efficiency is obtained by using approximate SVD algorithms that only compute the top PCs. By default, most PCA-related functions in scater and scran will use methods from the irlba or rsvd packages to perform the SVD. We can explicitly specify the SVD algorithm to use by passing a BiocSingularParam object to the BSPARAM= argument (see Advanced Section 14.2.2 for more details). Many of these approximate algorithms are based on randomization and thus require set.seed() to obtain reproducible results. library(BiocSingular) set.seed(1000) sce.zeisel &lt;- fixedPCA(sce.zeisel, subset.row=top.zeisel, BSPARAM=RandomParam(), name=&quot;randomized&quot;) reducedDimNames(sce.zeisel) ## [1] &quot;PCA&quot; &quot;randomized&quot; 4.3 Choosing the number of PCs How many of the top PCs should we retain for downstream analyses? The choice of the number of PCs \\(d\\) is a decision that is analogous to the choice of the number of HVGs to use. Using more PCs will retain more biological signal at the cost of including more noise that might mask said signal. 
On the other hand, using fewer PCs will introduce competition between different factors of variation, where weaker (but still interesting) factors may be pushed down into lower PCs and inadvertently discarded from downstream analyses. Much like the choice of the number of HVGs, it is hard to determine whether an “optimal” choice exists for the number of PCs. Certainly, we could attempt to remove the technical variation that is almost always uninteresting. However, even if we were only left with biological variation, there is no straightforward way to automatically determine which aspects of this variation are relevant. One analyst’s biological signal may be irrelevant noise to another analyst with a different scientific question. For example, heterogeneity within a population might be interesting when studying continuous processes like metabolic flux or differentiation potential, but is comparable to noise in applications that only aim to distinguish between distinct cell types. Most practitioners will simply set \\(d\\) to a “reasonable” but arbitrary value, typically ranging from 10 to 50. This is often satisfactory as the later PCs explain so little variance that their inclusion or omission has no major effect. For example, in the Zeisel dataset, few PCs explain more than 1% of the variance in the entire dataset (Figure 4.1) and choosing between, say, 20 and 40 PCs would not even amount to four percentage points’ worth of difference in variance. In fact, the main consequence of using more PCs is simply that downstream calculations take longer as they need to compute over more dimensions, but most PC-related calculations are fast enough that this is not a practical concern. 
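If one later decides to commit to a smaller \\(d\\), the stored PC matrix can simply be column-subsetted. This is a sketch (the "PCA.trimmed" name is invented for illustration) assuming the sce.zeisel object computed above:

```r
# Keep only the first d PCs as a separate reduced-dimension entry, leaving the
# full 50-PC "PCA" entry untouched for later comparison.
d <- 20
reducedDim(sce.zeisel, "PCA.trimmed") <- reducedDim(sce.zeisel, "PCA")[, 1:d]
ncol(reducedDim(sce.zeisel, "PCA.trimmed"))
```

Storing the trimmed matrix under its own name makes it easy to repeat downstream steps with several choices of \\(d\\), in line with the exploratory approach suggested below.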
percent.var &lt;- attr(reducedDim(sce.zeisel), &quot;percentVar&quot;) plot(percent.var, log=&quot;y&quot;, xlab=&quot;PC&quot;, ylab=&quot;Variance explained (%)&quot;) Figure 4.1: Percentage of variance explained by successive PCs in the Zeisel dataset, shown on a log-scale for visualization purposes. Nonetheless, Advanced Section 4.2 describes some more data-driven strategies to guide a suitable choice of \\(d\\). These automated choices are best treated as guidelines as they make assumptions about what variation is “interesting”. Indeed, the concepts in Advanced Section 4.2.3 could even be used to provide some justification for an arbitrarily chosen \\(d\\). More diligent readers may consider repeating the analysis with a variety of choices of \\(d\\) to explore other perspectives of the dataset at a different bias-variance trade-off, though this tends to be unnecessary work in most applications. 4.4 Visualizing the PCs Algorithms are more than happy to operate on 10-50 PCs, but these are still too many dimensions for human comprehension. To visualize the data, we could take the top 2 PCs for plotting (Figure 4.2). library(scater) plotReducedDim(sce.zeisel, dimred=&quot;PCA&quot;, colour_by=&quot;level1class&quot;) Figure 4.2: PCA plot of the first two PCs in the Zeisel brain data. Each point is a cell, coloured according to the annotation provided by the original authors. The problem is that PCA is a linear technique, i.e., only variation along a line in high-dimensional space is captured by each PC. As such, it cannot efficiently pack differences in \\(d\\) dimensions into the first 2 PCs. This is demonstrated in Figure 4.2 where the top two PCs fail to resolve some subpopulations identified by Zeisel et al. (2015). If the first PC is devoted to resolving the biggest difference between subpopulations, and the second PC is devoted to resolving the next biggest difference, then the remaining differences will not be visible in the plot. 
One workaround is to plot several of the top PCs against each other in pairwise plots (Figure 4.3). However, it is difficult to interpret multiple plots simultaneously, and even this approach is not sufficient to separate some of the annotated subpopulations. plotReducedDim(sce.zeisel, dimred=&quot;PCA&quot;, ncomponents=4, colour_by=&quot;level1class&quot;) Figure 4.3: Pairwise PCA plots of the first four PCs in the Zeisel brain data. Each point is a cell, coloured according to the annotation provided by the original authors. Thus, plotting the top few PCs is not satisfactory for visualization of complex populations. That said, the PCA itself is still of great value in visualization as it compacts and denoises the data prior to downstream steps. The top PCs are often used as input to more sophisticated (and computationally intensive) algorithms for dimensionality reduction. 
Each point represents a cell, coloured according to the published annotation. One of the main disadvantages of \\(t\\)-SNE is that it is much more computationally intensive than other visualization methods. We mitigate this effect by setting dimred=\"PCA\" in runTSNE(), which instructs the function to perform the \\(t\\)-SNE calculations on the top PCs in sce.zeisel. This exploits the data compaction and noise removal of the PCA for faster and cleaner results in the \\(t\\)-SNE. It is also possible to run \\(t\\)-SNE on the original expression matrix but this is less efficient. Another issue with \\(t\\)-SNE is that it requires the user to be aware of additional parameters (discussed here in some depth). It involves a random initialization so we need to set the seed to ensure that the chosen results are reproducible. We may also wish to repeat the visualization several times to ensure that the results are representative. The “perplexity” is another important parameter that determines the granularity of the visualization (Figure 4.5). Low perplexities will favor resolution of finer structure, possibly to the point that the visualization is compromised by random noise. Thus, it is advisable to test different perplexity values to ensure that the choice of perplexity does not drive the interpretation of the plot. 
set.seed(100) sce.zeisel &lt;- runTSNE(sce.zeisel, dimred=&quot;PCA&quot;, perplexity=5) out5 &lt;- plotReducedDim(sce.zeisel, dimred=&quot;TSNE&quot;, colour_by=&quot;level1class&quot;) + ggtitle(&quot;perplexity = 5&quot;) set.seed(100) sce.zeisel &lt;- runTSNE(sce.zeisel, dimred=&quot;PCA&quot;, perplexity=20) out20 &lt;- plotReducedDim(sce.zeisel, dimred=&quot;TSNE&quot;, colour_by=&quot;level1class&quot;) + ggtitle(&quot;perplexity = 20&quot;) set.seed(100) sce.zeisel &lt;- runTSNE(sce.zeisel, dimred=&quot;PCA&quot;, perplexity=80) out80 &lt;- plotReducedDim(sce.zeisel, dimred=&quot;TSNE&quot;, colour_by=&quot;level1class&quot;) + ggtitle(&quot;perplexity = 80&quot;) gridExtra::grid.arrange(out5, out20, out80, ncol=3) Figure 4.5: \\(t\\)-SNE plots constructed from the top PCs in the Zeisel brain dataset, using a range of perplexity values. Each point represents a cell, coloured according to its annotation. Finally, it is unwise to read too much into the relative sizes and positions of the visual clusters. \\(t\\)-SNE will inflate dense clusters and compress sparse ones, such that we cannot use the size as a measure of subpopulation heterogeneity. In addition, \\(t\\)-SNE is not obliged to preserve the relative locations of non-neighboring clusters, such that we cannot use their positions to determine relationships between distant clusters. Despite its shortcomings, \\(t\\)-SNE is a proven tool for general-purpose visualization of scRNA-seq data and remains a popular choice in many analysis pipelines. In particular, this author enjoys looking at \\(t\\)-SNEs as they remind him of histology slides, which allows him to pretend that he is looking at real data. 4.5.2 Uniform manifold approximation and projection The uniform manifold approximation and projection (UMAP) method (McInnes, Healy, and Melville 2018) is an alternative to \\(t\\)-SNE for non-linear dimensionality reduction. 
It is roughly similar to \\(t\\)-SNE in that it also tries to find a low-dimensional representation that preserves relationships between neighbors in high-dimensional space. However, the two methods are based on different theory, represented by differences in the various graph weighting equations. This manifests as a different visualization as shown in Figure 4.6. set.seed(1100101001) sce.zeisel &lt;- runUMAP(sce.zeisel, dimred=&quot;PCA&quot;) plotReducedDim(sce.zeisel, dimred=&quot;UMAP&quot;, colour_by=&quot;level1class&quot;) Figure 4.6: UMAP plots constructed from the top PCs in the Zeisel brain dataset. Each point represents a cell, coloured according to the published annotation. Compared to \\(t\\)-SNE, the UMAP visualization tends to have more compact visual clusters with more empty space between them. It also attempts to preserve more of the global structure than \\(t\\)-SNE. From a practical perspective, UMAP is much faster than \\(t\\)-SNE, which may be an important consideration for large datasets. (Nonetheless, we have still run UMAP on the top PCs here for consistency.) UMAP also involves a series of randomization steps so setting the seed is critical. Like \\(t\\)-SNE, UMAP has its own suite of hyperparameters that affect the visualization (see the documentation here). Of these, the number of neighbors (n_neighbors) and the minimum distance between embedded points (min_dist) have the greatest effect on the granularity of the output. If these values are too low, random noise will be incorrectly treated as high-resolution structure, while values that are too high will discard fine structure altogether in favor of obtaining an accurate overview of the entire dataset. Again, it is a good idea to test a range of values for these parameters to ensure that they do not compromise any conclusions drawn from a UMAP plot. It is arguable whether the UMAP or \\(t\\)-SNE visualizations are more useful or aesthetically pleasing. 
UMAP aims to preserve more global structure but this necessarily reduces resolution within each visual cluster. However, UMAP is unarguably much faster, and for that reason alone, it is increasingly displacing \\(t\\)-SNE as the method of choice for visualizing large scRNA-seq data sets. 4.5.3 Interpreting the plots Dimensionality reduction for visualization necessarily involves discarding information and distorting the distances between cells to fit high-dimensional data into a 2-dimensional space. One might wonder whether the results of such extreme data compression can be trusted. Indeed, some of our more quantitative colleagues consider such visualizations to be more artistic than scientific, fit for little but impressing collaborators and reviewers! Perhaps this perspective is not entirely invalid, but we suggest that there is some value to be extracted from them provided that they are accompanied by an analysis of a higher-rank representation. As a general rule, focusing on local neighborhoods provides the safest interpretation of \\(t\\)-SNE and UMAP plots. These methods spend considerable effort to ensure that each cell’s nearest neighbors in the input high-dimensional space are still its neighbors in the output two-dimensional embedding. Thus, if we see multiple cell types or clusters in a single unbroken “island” in the embedding, we could infer that those populations were also close neighbors in higher-dimensional space. However, less can be said about the distances between non-neighboring cells; there is no guarantee that large distances are faithfully recapitulated in the embedding, given the distortions necessary for this type of dimensionality reduction. It would be courageous to use the distances between islands (seen to be measured, on occasion, with a ruler!) to make statements about the relative similarity of distinct cell types. 
On a related note, we prefer to restrict the \\(t\\)-SNE/UMAP coordinates to visualization and use the higher-rank representation for any quantitative analyses. To illustrate, consider the interaction between clustering and \\(t\\)-SNE. We do not perform clustering on the \\(t\\)-SNE coordinates, but rather, we cluster on the first 10-50 PCs (Chapter 5) and then visualize the cluster identities on \\(t\\)-SNE plots like that in Figure 4.4. This ensures that clustering makes use of the information that would otherwise be lost during compression into two dimensions for visualization. The plot can then be used for a diagnostic inspection of the clustering output, e.g., to check which clusters are close neighbors or whether a cluster can be split into further subclusters; this follows the aforementioned theme of focusing on local structure. From a naive perspective, using the \\(t\\)-SNE coordinates directly for clustering is tempting as it ensures that any results are immediately consistent with the visualization. Given that clustering is rather arbitrary anyway, there is nothing inherently wrong with this strategy - in fact, it can be treated as a rather circuitous implementation of graph-based clustering (Section 5.2). However, the enforced consistency can actually be considered a disservice as it masks the ambiguity of the conclusions, either due to the loss of information from dimensionality reduction or the uncertainty of the clustering. Rather than being errors, major discrepancies can instead be useful for motivating further investigation into the less obvious aspects of the dataset; conversely, the lack of discrepancies increases trust in the conclusions. 
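This division of labor - cluster on the PCs, display on the \\(t\\)-SNE - can be sketched with the Zeisel objects from this chapter. Graph-based clustering via buildSNNGraph() and walktrap is one of several options covered in the clustering chapter; treat this as an illustrative recipe rather than the definitive pipeline.

```r
# Cluster in PC space, not on the 2D embedding.
library(scran)
g <- buildSNNGraph(sce.zeisel, use.dimred="PCA")   # SNN graph built from top PCs
clust <- igraph::cluster_walktrap(g)$membership
colLabels(sce.zeisel) <- factor(clust)             # stored as colData 'label'

# Use the previously computed t-SNE coordinates purely for display.
library(scater)
plotReducedDim(sce.zeisel, dimred="TSNE", colour_by="label")
```

Any disagreement between the cluster labels and the visual islands on the plot is then informative in itself, as discussed above.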
Session Info View session info R version 4.1.1 (2021-08-10) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.3 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] scater_1.22.0 ggplot2_3.3.5 [3] BiocSingular_1.10.0 scran_1.22.0 [5] scuttle_1.4.0 SingleCellExperiment_1.16.0 [7] SummarizedExperiment_1.24.0 Biobase_2.54.0 [9] GenomicRanges_1.46.0 GenomeInfoDb_1.30.0 [11] IRanges_2.28.0 S4Vectors_0.32.0 [13] BiocGenerics_0.40.0 MatrixGenerics_1.6.0 [15] matrixStats_0.61.0 BiocStyle_2.22.0 [17] rebook_1.4.0 loaded via a namespace (and not attached): [1] bitops_1.0-7 filelock_1.0.2 [3] tools_4.1.1 bslib_0.3.1 [5] utf8_1.2.2 R6_2.5.1 [7] irlba_2.3.3 vipor_0.4.5 [9] uwot_0.1.10 DBI_1.1.1 [11] colorspace_2.0-2 withr_2.4.2 [13] gridExtra_2.3 tidyselect_1.1.1 [15] compiler_4.1.1 graph_1.72.0 [17] BiocNeighbors_1.12.0 DelayedArray_0.20.0 [19] labeling_0.4.2 bookdown_0.24 [21] sass_0.4.0 scales_1.1.1 [23] rappdirs_0.3.3 stringr_1.4.0 [25] digest_0.6.28 rmarkdown_2.11 [27] XVector_0.34.0 pkgconfig_2.0.3 [29] htmltools_0.5.2 sparseMatrixStats_1.6.0 [31] fastmap_1.1.0 limma_3.50.0 [33] highr_0.9 rlang_0.4.12 [35] FNN_1.1.3 DelayedMatrixStats_1.16.0 [37] farver_2.1.0 jquerylib_0.1.4 [39] generics_0.1.1 jsonlite_1.7.2 [41] BiocParallel_1.28.0 dplyr_1.0.7 [43] RCurl_1.98-1.5 magrittr_2.0.1 [45] GenomeInfoDbData_1.2.7 Matrix_1.3-4 [47] ggbeeswarm_0.6.0 Rcpp_1.0.7 [49] munsell_0.5.0 fansi_0.5.0 [51] viridis_0.6.2 lifecycle_1.0.1 [53] stringi_1.7.5 yaml_2.2.1 [55] edgeR_3.36.0 zlibbioc_1.40.0 [57] Rtsne_0.15 grid_4.1.1 
[59] ggrepel_0.9.1             parallel_4.1.1
[61] dqrng_0.3.0               crayon_1.4.1
[63] dir.expiry_1.2.0          lattice_0.20-45
[65] cowplot_1.1.1             beachmat_2.10.0
[67] locfit_1.5-9.4            CodeDepends_0.6.5
[69] metapod_1.2.0             knitr_1.36
[71] pillar_1.6.4              igraph_1.2.7
[73] codetools_0.2-18          ScaledMatrix_1.2.0
[75] XML_3.99-0.8              glue_1.4.2
[77] evaluate_0.14             BiocManager_1.30.16
[79] vctrs_0.3.8               purrr_0.3.4
[81] gtable_0.3.0              assertthat_0.2.1
[83] xfun_0.27                 rsvd_1.0.5
[85] RSpectra_0.16-0           viridisLite_0.4.0
[87] tibble_3.1.5              beeswarm_0.4.0
[89] cluster_2.1.2             bluster_1.4.0
[91] statmod_1.4.36            ellipsis_0.3.2

References "],["clustering.html", "Chapter 5 Clustering 5.1 Overview 5.2 Graph-based clustering 5.3 Vector quantization with \\(k\\)-means 5.4 Hierarchical clustering 5.5 Subclustering Session Info", " Chapter 5 Clustering

5.1 Overview

Clustering is an unsupervised learning procedure that is used to empirically define groups of cells with similar expression profiles. Its primary purpose is to summarize complex scRNA-seq data into a digestible format for human interpretation. This allows us to describe population heterogeneity in terms of discrete labels that are easily understood, rather than attempting to comprehend the high-dimensional manifold on which the cells truly reside. After annotation based on marker genes, the clusters can be treated as proxies for more abstract biological concepts such as cell types or states.

At this point, it is helpful to realize that clustering, like a microscope, is simply a tool to explore the data. We can zoom in and out by changing the resolution of the clustering parameters, and we can experiment with different clustering algorithms to obtain alternative perspectives of the data.
This iterative approach is entirely permissible given that data exploration constitutes the majority of the scRNA-seq data analysis workflow. As such, questions about the “correctness” of the clusters or the “true” number of clusters are usually meaningless. We can define as many clusters as we like, with whatever algorithm we like - each clustering will represent its own partitioning of the high-dimensional expression space, and is as “real” as any other clustering. A more relevant question is “how well do the clusters approximate the cell types or states of interest?” Unfortunately, this is difficult to answer given the context-dependent interpretation of the underlying biology. Some analysts will be satisfied with resolution of the major cell types; other analysts may want resolution of subtypes; and others still may require resolution of different states (e.g., metabolic activity, stress) within those subtypes. Moreover, two clusterings can be highly inconsistent yet both valid, simply partitioning the cells based on different aspects of biology. Indeed, asking for an unqualified “best” clustering is akin to asking for the best magnification on a microscope without any context. Regardless of the exact method used, clustering is a critical step for extracting biological insights from scRNA-seq data. Here, we demonstrate the application of several commonly used methods with the 10X PBMC dataset. 
View set-up code (Workflow Chapter 3)

#--- loading ---#
library(DropletTestFiles)
raw.path <- getTestFile("tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz")
out.path <- file.path(tempdir(), "pbmc4k")
untar(raw.path, exdir=out.path)

library(DropletUtils)
fname <- file.path(out.path, "raw_gene_bc_matrices/GRCh38")
sce.pbmc <- read10xCounts(fname, col.names=TRUE)

#--- gene-annotation ---#
library(scater)
rownames(sce.pbmc) <- uniquifyFeatureNames(
    rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol)

library(EnsDb.Hsapiens.v86)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID,
    column="SEQNAME", keytype="GENEID")

#--- cell-detection ---#
set.seed(100)
e.out <- emptyDrops(counts(sce.pbmc))
sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]

#--- quality-control ---#
stats <- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location=="MT")))
high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher")
sce.pbmc <- sce.pbmc[,!high.mito]

#--- normalization ---#
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.pbmc)
sce.pbmc <- computeSumFactors(sce.pbmc, cluster=clusters)
sce.pbmc <- logNormCounts(sce.pbmc)

#--- variance-modelling ---#
set.seed(1001)
dec.pbmc <- modelGeneVarByPoisson(sce.pbmc)
top.pbmc <- getTopHVGs(dec.pbmc, prop=0.1)

#--- dimensionality-reduction ---#
set.seed(10000)
sce.pbmc <- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc)

set.seed(100000)
sce.pbmc <- runTSNE(sce.pbmc, dimred="PCA")

set.seed(1000000)
sce.pbmc <- runUMAP(sce.pbmc, dimred="PCA")

sce.pbmc
## class: SingleCellExperiment
## dim: 33694 3985
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
## rowData names(2): ID Symbol
## colnames(3985): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
##   TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(3): Sample Barcode sizeFactor
## reducedDimNames(3): PCA TSNE UMAP
## mainExpName: NULL
## altExpNames(0):

5.2 Graph-based clustering

5.2.1 Background

Popularized by its use in Seurat, graph-based clustering is a flexible and scalable technique for clustering large scRNA-seq datasets. We first build a graph where each node is a cell that is connected to its nearest neighbors in the high-dimensional space. Edges are weighted based on the similarity between the cells involved, with higher weight given to cells that are more closely related. We then apply algorithms to identify “communities” of cells that are more connected to cells in the same community than they are to cells of different communities. Each community represents a cluster that we can use for downstream interpretation.

The major advantage of graph-based clustering lies in its scalability. It only requires a \\(k\\)-nearest neighbor search that can be done in log-linear time on average, in contrast to hierarchical clustering methods with runtimes that are quadratic with respect to the number of cells. Graph construction avoids making strong assumptions about the shape of the clusters or the distribution of cells within each cluster, compared to other methods like \\(k\\)-means (that favor spherical clusters) or Gaussian mixture models (that require normality). From a practical perspective, each cell is forcibly connected to a minimum number of neighboring cells, which reduces the risk of generating many uninformative clusters consisting of one or two outlier cells.

The main drawback of graph-based methods is that, after graph construction, no information is retained about relationships beyond the neighboring cells. This has some practical consequences in datasets that exhibit differences in cell density, as more steps through the graph are required to move the same distance through a region of higher cell density.
From the perspective of community detection algorithms, this effect “inflates” the high-density regions such that any internal substructure or noise is more likely to cause formation of subclusters. The resolution of clustering thus becomes dependent on the density of cells, which can occasionally be misleading if it overstates the heterogeneity in the data.

5.2.2 Implementation

To demonstrate, we use the clusterCells() function in scran on the PBMC dataset. All calculations are performed using the top PCs to take advantage of data compression and denoising. This function returns a vector containing cluster assignments for each cell in our SingleCellExperiment object.

library(scran)
nn.clusters <- clusterCells(sce.pbmc, use.dimred="PCA")
table(nn.clusters)
## nn.clusters
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
## 205 508 541  56 374 125  46 432 302 867  47 155 166  61  84  16

We assign the cluster assignments back into our SingleCellExperiment object as a factor in the column metadata. This allows us to conveniently visualize the distribution of clusters in a \\(t\\)-SNE plot (Figure 5.1).

library(scater)
colLabels(sce.pbmc) <- nn.clusters
plotReducedDim(sce.pbmc, "TSNE", colour_by="label")

Figure 5.1: \\(t\\)-SNE plot of the 10X PBMC dataset, where each point represents a cell and is coloured according to the identity of the assigned cluster from graph-based clustering.

By default, clusterCells() uses the 10 nearest neighbors of each cell to construct a shared nearest neighbor graph. Two cells are connected by an edge if any of their nearest neighbors are shared, with the edge weight defined from the highest average rank of the shared neighbors (Xu and Su 2015). The Walktrap method from the igraph package is then used to identify communities. If we wanted to explicitly specify all of these parameters, we would use the more verbose call below.
This uses a SNNGraphParam object from the bluster package to instruct clusterCells() to detect communities from a shared nearest-neighbor graph with the specified parameters. The appeal of this interface is that it allows us to easily switch to a different clustering algorithm by simply changing the BLUSPARAM= argument, as we will demonstrate later in the chapter.

library(bluster)
nn.clusters2 <- clusterCells(sce.pbmc, use.dimred="PCA",
    BLUSPARAM=SNNGraphParam(k=10, type="rank", cluster.fun="walktrap"))
table(nn.clusters2)
## nn.clusters2
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
## 205 508 541  56 374 125  46 432 302 867  47 155 166  61  84  16

We can also obtain the graph itself by specifying full=TRUE in the clusterCells() call. Doing so will return all intermediate structures that are used during clustering, including a graph object from the igraph package. This graph can be visualized using a force-directed layout (Figure 5.2), closely related to \\(t\\)-SNE and UMAP, though which of these is the most aesthetically pleasing is left to the eye of the beholder.

nn.clust.info <- clusterCells(sce.pbmc, use.dimred="PCA", full=TRUE)
nn.clust.info$objects$graph
## IGRAPH 3006a09 U-W- 3985 133130 --
## + attr: weight (e/n)
## + edges from 3006a09:
##  [1]  4--12  8--13 11--15 12--17  4--17 12--18 17--18  8--20  1--21  8--23
## [11]  2--25 17--27 18--27 11--30 15--30 10--31 25--33  2--33 19--37  9--38
## [21] 23--41  4--43 12--43 17--43 21--45  2--47 33--47 22--48 18--49 22--51
## [31] 48--51 16--52 11--54 15--54 30--54 26--57 21--58 47--58 52--59 13--59
## [41] 23--60  9--60 41--60 52--60 22--61 51--61 48--61 55--63 22--64 61--64
## [51] 48--64 51--64  1--65 42--67 16--68 55--69 50--70 59--70 40--71 55--72
## [61] 63--73 39--75 69--75 30--76 11--76 15--76  8--78 41--78 20--79  8--79
## [71]  6--82 76--82 15--82 11--85 30--85 76--85 15--85 54--85 14--87 66--87
## + ... omitted several edges

set.seed(11000)
reducedDim(sce.pbmc, "force") <- igraph::layout_with_fr(nn.clust.info$objects$graph)
plotReducedDim(sce.pbmc, colour_by="label", dimred="force")

Figure 5.2: Force-directed layout for the shared nearest-neighbor graph of the PBMC dataset. Each point represents a cell and is coloured according to its assigned cluster identity.

In addition, the graph can be used to generate detailed diagnostics on the behavior of the graph-based clustering (link("graph-diagnostics", "OSCA.advanced")).

5.2.3 Adjusting the parameters

A graph-based clustering method has several key parameters:

How many neighbors are considered when constructing the graph.
What scheme is used to weight the edges.
Which community detection algorithm is used to define the clusters.

One of the most important parameters is k, the number of nearest neighbors used to construct the graph. This controls the resolution of the clustering, where higher k yields a more inter-connected graph and broader clusters. Users can exploit this by experimenting with different values of k to obtain a satisfactory resolution.

# More resolved.
clust.5 <- clusterCells(sce.pbmc, use.dimred="PCA", BLUSPARAM=NNGraphParam(k=5))
table(clust.5)
## clust.5
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
## 523 302 125  45 172 573 249 439 293  95 772 142  38  18  62  38  30  16  15   9
##  21  22
##  16  13

# Less resolved.
clust.50 <- clusterCells(sce.pbmc, use.dimred="PCA", BLUSPARAM=NNGraphParam(k=50))
table(clust.50)
## clust.50
##   1   2   3   4   5   6   7   8   9  10
## 869 514 194 478 539 944 138 175  89  45

Further tweaking can be performed by changing the edge weighting scheme during graph construction. Setting type="number" will weight edges based on the number of nearest neighbors that are shared between two cells. Similarly, type="jaccard" will weight edges according to the Jaccard index of the two sets of neighbors.
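To make the Jaccard weighting scheme concrete, here is a toy base-R illustration (not part of the chapter's workflow; the neighbor sets are made up) of the index that type="jaccard" would assign to an edge between two cells with partially overlapping neighborhoods:

```r
# Hypothetical neighbor sets for two cells, by cell index.
neighbors.a <- c(1, 2, 3, 4, 5)
neighbors.b <- c(4, 5, 6, 7, 8)

# Jaccard index: size of the intersection over size of the union.
jaccard <- length(intersect(neighbors.a, neighbors.b)) /
    length(union(neighbors.a, neighbors.b))
jaccard
## [1] 0.25
```

Cells with near-identical neighborhoods thus share edges with weights close to 1, while cells sharing few neighbors receive weights near 0.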
We can also disable weighting altogether by using a simple \\(k\\)-nearest neighbor graph, which is occasionally useful for downstream graph operations that do not support weights.

clust.num <- clusterCells(sce.pbmc, use.dimred="PCA",
    BLUSPARAM=NNGraphParam(type="number"))
table(clust.num)
## clust.num
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
## 199 541 354 294 458 123  47  45 170 397 838 150  40  80 136  50  31  16  16

clust.jaccard <- clusterCells(sce.pbmc, use.dimred="PCA",
    BLUSPARAM=NNGraphParam(type="jaccard"))
table(clust.jaccard)
## clust.jaccard
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
## 201 541 740 233 352  47 124 841  45 361  40 154  80  61 133  16  16

clust.none <- clusterCells(sce.pbmc, use.dimred="PCA", BLUSPARAM=KNNGraphParam())
table(clust.none)
## clust.none
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
## 541 446  54 170 699  45 126 172 128 907 286  47 158 137  54  15

The community detection can be performed by using any of the algorithms provided by igraph.
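As an aside not covered in the original text, the bluster package also provides pairwiseRand() to summarize the agreement between two such partitionings in a single number; a minimal sketch, assuming the clust.num and clust.jaccard vectors computed above:

```r
library(bluster)

# Adjusted Rand index between the "number" and "jaccard" weighting schemes;
# values close to 1 indicate near-identical clusterings.
pairwiseRand(clust.num, clust.jaccard, mode="index")
```

This complements the heatmap-based comparisons of cluster tables shown elsewhere in this chapter.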
We have already mentioned the Walktrap approach, but many others are available to choose from:

clust.walktrap <- clusterCells(sce.pbmc, use.dimred="PCA",
    BLUSPARAM=NNGraphParam(cluster.fun="walktrap"))

clust.louvain <- clusterCells(sce.pbmc, use.dimred="PCA",
    BLUSPARAM=NNGraphParam(cluster.fun="louvain"))

clust.infomap <- clusterCells(sce.pbmc, use.dimred="PCA",
    BLUSPARAM=NNGraphParam(cluster.fun="infomap"))

clust.fast <- clusterCells(sce.pbmc, use.dimred="PCA",
    BLUSPARAM=NNGraphParam(cluster.fun="fast_greedy"))

clust.labprop <- clusterCells(sce.pbmc, use.dimred="PCA",
    BLUSPARAM=NNGraphParam(cluster.fun="label_prop"))

clust.eigen <- clusterCells(sce.pbmc, use.dimred="PCA",
    BLUSPARAM=NNGraphParam(cluster.fun="leading_eigen"))

It is straightforward to compare two clustering strategies to see how they differ (link("comparing-different-clusterings", "OSCA.advanced")). For example, Figure 5.3 suggests that Infomap yields finer clusters than Walktrap while fast-greedy yields coarser clusters.

library(pheatmap)

# Using a large pseudo-count for a smoother color transition
# between 0 and 1 cell in each 'tab'.
tab <- table(paste("Infomap", clust.infomap), paste("Walktrap", clust.walktrap))
ivw <- pheatmap(log10(tab+10), main="Infomap vs Walktrap",
    color=viridis::viridis(100), silent=TRUE)

tab <- table(paste("Fast", clust.fast), paste("Walktrap", clust.walktrap))
fvw <- pheatmap(log10(tab+10), main="Fast-greedy vs Walktrap",
    color=viridis::viridis(100), silent=TRUE)

gridExtra::grid.arrange(ivw[[4]], fvw[[4]])

Figure 5.3: Number of cells assigned to combinations of cluster labels with different community detection algorithms in the PBMC dataset.
Each entry of each heatmap represents a pair of labels, coloured proportionally to the log-number of cells with those labels.

Pipelines involving scran default to rank-based weights followed by Walktrap clustering. In contrast, Seurat uses Jaccard-based weights followed by Louvain clustering. Both of these strategies work well, and it is likely that the same could be said for many other combinations of weighting schemes and community detection algorithms.

5.3 Vector quantization with \\(k\\)-means

5.3.1 Background

Vector quantization partitions observations into groups where each group is associated with a representative point, i.e., a vector in the coordinate space. This is a type of clustering that primarily aims to compress data by replacing many points with a single representative. The representatives can then be treated as “samples” for further analysis, reducing the number of samples and computational work in later steps such as trajectory reconstruction (Ji and Ji 2016). This approach will also eliminate differences in cell density across the expression space, ensuring that the most abundant cell type does not dominate downstream results.

\\(k\\)-means clustering is a classic vector quantization technique that divides cells into \\(k\\) clusters. Each cell is assigned to the cluster with the closest centroid, which is done by minimizing the within-cluster sum of squares using a random starting configuration for the \\(k\\) centroids. We usually set \\(k\\) to a large value such as the square root of the number of cells to obtain fine-grained clusters. These are not meant to be interpreted directly, but rather, the centroids are used in downstream steps for faster computation. The main advantage of this approach lies in its speed, given the simplicity and ease of implementation of the algorithm.

5.3.2 Implementation

We supply a KmeansParam object in clusterCells() to perform \\(k\\)-means clustering with the specified number of clusters in centers=.
We again use our top PCs after setting the random seed to ensure that the results are reproducible. In general, the \\(k\\)-means clusters correspond to the visual clusters on the \\(t\\)-SNE plot in Figure 5.4, though there are some divergences that are not observed in, say, Figure 5.1. (This is at least partially due to the fact that \\(t\\)-SNE is itself graph-based and so will naturally agree more with a graph-based clustering strategy.)

set.seed(100)
clust.kmeans <- clusterCells(sce.pbmc, use.dimred="PCA",
    BLUSPARAM=KmeansParam(centers=10))
table(clust.kmeans)
## clust.kmeans
##   1   2   3   4   5   6   7   8   9  10
## 548  46 408 270 539 199 148 783 163 881

colLabels(sce.pbmc) <- clust.kmeans
plotReducedDim(sce.pbmc, "TSNE", colour_by="label")

Figure 5.4: \\(t\\)-SNE plot of the 10X PBMC dataset, where each point represents a cell and is coloured according to the identity of the assigned cluster from \\(k\\)-means clustering.

If we were so inclined, we could obtain a “reasonable” choice of \\(k\\) by computing the gap statistic using methods from the cluster package. A more practical use of \\(k\\)-means is to deliberately set \\(k\\) to a large value to achieve overclustering. This will forcibly partition cells inside broad clusters that do not have well-defined internal structure. For example, we might be interested in the change in expression from one “side” of a cluster to the other, but the lack of any clear separation within the cluster makes it difficult to separate with graph-based methods, even at the highest resolution. \\(k\\)-means has no such problems and will readily split these broad clusters for greater resolution.
set.seed(100)
clust.kmeans2 <- clusterCells(sce.pbmc, use.dimred="PCA",
    BLUSPARAM=KmeansParam(centers=20))
table(clust.kmeans2)
## clust.kmeans2
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
## 243  28 202 361 282 166 388 150 114 537 170  96  46 131 162 118 201 257 288  45

colLabels(sce.pbmc) <- clust.kmeans2
plotTSNE(sce.pbmc, colour_by="label", text_by="label")

Figure 5.5: \\(t\\)-SNE plot of the 10X PBMC dataset, where each point represents a cell and is coloured according to the identity of the assigned cluster from \\(k\\)-means clustering with \\(k=20\\).

For larger datasets, we can use a variant of this approach named mini-batch \\(k\\)-means from the mbkmeans package. At each iteration, we only update the cluster assignments and centroid positions for a small subset of the observations. This reduces memory usage and computational time - especially when not all of the observations are informative for convergence - and supports parallelization via BiocParallel. Using this variant is as simple as switching to a MbkmeansParam() object in our clusterCells() call:

set.seed(100)
clust.mbkmeans <- clusterCells(sce.pbmc, use.dimred="PCA",
    BLUSPARAM=MbkmeansParam(centers=10))
table(clust.mbkmeans)
## clust.mbkmeans
##    1    2    3    4    5    6    7    8    9   10
##  210  158  441  138 1385  709  540  243   46  115

5.3.3 In two-step procedures

By itself, \\(k\\)-means suffers from several shortcomings that reduce its appeal for obtaining interpretable clusters:

It implicitly favors spherical clusters of equal radius. This can lead to unintuitive partitionings on real datasets that contain groupings with irregular sizes and shapes.

The number of clusters \\(k\\) must be specified beforehand and represents a hard cap on the resolution of the clustering. For example, setting \\(k\\) to be below the number of cell types will always lead to co-clustering of two cell types, regardless of how well separated they are.
In contrast, other methods like graph-based clustering will respect strong separation even if the relevant resolution parameter is set to a low value.

It is dependent on the randomly chosen initial coordinates. This requires multiple runs to verify that the clustering is stable.

However, these concerns are less relevant when \\(k\\)-means is being used for vector quantization. In this application, \\(k\\)-means is used as a prelude to more sophisticated and interpretable - but computationally expensive - clustering algorithms. The clusterCells() function supports a “two-step” mode where \\(k\\)-means is initially used to obtain representative centroids that are subjected to graph-based clustering. Each cell is then placed in the same graph-based cluster that its \\(k\\)-means centroid was assigned to (Figure 5.6).

# Setting the seed due to the randomness of k-means.
set.seed(0101010)
kgraph.clusters <- clusterCells(sce.pbmc, use.dimred="PCA",
    BLUSPARAM=TwoStepParam(
        first=KmeansParam(centers=1000),
        second=NNGraphParam(k=5)
    )
)
table(kgraph.clusters)
## kgraph.clusters
##   1   2   3   4   5   6   7   8   9  10  11  12
## 191 854 506 541 541 892  46 120  29 132  47  86

plotTSNE(sce.pbmc, colour_by=I(kgraph.clusters))

Figure 5.6: \\(t\\)-SNE plot of the PBMC dataset, where each point represents a cell and is coloured according to the identity of the assigned cluster from combined \\(k\\)-means/graph-based clustering.

The obvious benefit of this approach over direct graph-based clustering is the speed improvement. We avoid the need to identify nearest neighbors for each cell and to construct a large intermediate graph, while benefiting from the relative interpretability of graph-based clusters compared to those from \\(k\\)-means. This approach also mitigates the “inflation” effect discussed in Section 5.2.
Each centroid serves as a representative of a region of space that is roughly similar in volume, ameliorating differences in cell density that can cause (potentially undesirable) differences in resolution.

The choice of the number of \\(k\\)-means clusters determines the trade-off between speed and fidelity. Larger values provide a more faithful representation of the underlying distribution of cells, at the cost of requiring more computational work by the second-step clustering procedure. Note that the second step operates on the centroids, so increasing centers= may have further implications if the second-stage procedure is sensitive to the total number of input observations. For example, increasing the number of centroids would require a concomitant increase in k= (the number of neighbors in graph construction) to maintain the same level of resolution in the final output.

5.4 Hierarchical clustering

5.4.1 Background

Hierarchical clustering is an old technique that arranges samples into a hierarchy based on their relative similarity to each other. Most implementations do so by joining the most similar samples into a new cluster, then joining similar clusters into larger clusters, and so on, until all samples belong to a single cluster. This process yields a dendrogram that defines clusters with progressively increasing granularity. Variants of hierarchical clustering methods primarily differ in how they choose to perform the agglomerations. For example, complete linkage aims to merge clusters with the smallest maximum distance between their elements, while Ward’s method aims to minimize the increase in within-cluster variance.

In the context of scRNA-seq, the main advantage of hierarchical clustering lies in the production of the dendrogram. This is a rich summary that quantitatively captures the relationships between subpopulations at various resolutions.
Cutting the dendrogram at high resolution is also guaranteed to yield clusters that are nested within those obtained at a low-resolution cut; this can be helpful for interpretation, as discussed in Section 5.5. The dendrogram is also a natural representation of the data in situations where cells have descended from a relatively recent common ancestor.

In practice, hierarchical clustering is too slow to be used for anything but the smallest scRNA-seq datasets. Most implementations require a cell-cell distance matrix that is prohibitively expensive to compute for a large number of cells. Greedy agglomeration is also likely to result in a quantitatively suboptimal partitioning (as defined by the agglomeration measure) at higher levels of the dendrogram when the number of cells and merge steps is high. Nonetheless, we will still demonstrate the application of hierarchical clustering here, as it can be useful when combined with vector quantization techniques like \\(k\\)-means.

5.4.2 Implementation

The PBMC dataset is too large to use directly in hierarchical clustering, requiring a two-step approach to compress the observations instead (Section 5.3.3). For the sake of simplicity, we will demonstrate on the smaller 416B dataset.
View set-up code (Workflow Chapter 1)

#--- loading ---#
library(scRNAseq)
sce.416b <- LunSpikeInData(which="416b")
sce.416b$block <- factor(sce.416b$block)

#--- gene-annotation ---#
library(AnnotationHub)
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
rowData(sce.416b)$ENSEMBL <- rownames(sce.416b)
rowData(sce.416b)$SYMBOL <- mapIds(ens.mm.v97, keys=rownames(sce.416b),
    keytype="GENEID", column="SYMBOL")
rowData(sce.416b)$SEQNAME <- mapIds(ens.mm.v97, keys=rownames(sce.416b),
    keytype="GENEID", column="SEQNAME")

library(scater)
rownames(sce.416b) <- uniquifyFeatureNames(rowData(sce.416b)$ENSEMBL,
    rowData(sce.416b)$SYMBOL)

#--- quality-control ---#
mito <- which(rowData(sce.416b)$SEQNAME=="MT")
stats <- perCellQCMetrics(sce.416b, subsets=list(Mt=mito))
qc <- quickPerCellQC(stats, percent_subsets=c("subsets_Mt_percent",
    "altexps_ERCC_percent"), batch=sce.416b$block)
sce.416b <- sce.416b[,!qc$discard]

#--- normalization ---#
library(scran)
sce.416b <- computeSumFactors(sce.416b)
sce.416b <- logNormCounts(sce.416b)

#--- variance-modelling ---#
dec.416b <- modelGeneVarWithSpikes(sce.416b, "ERCC", block=sce.416b$block)
chosen.hvgs <- getTopHVGs(dec.416b, prop=0.1)

#--- batch-correction ---#
library(limma)
assay(sce.416b, "corrected") <- removeBatchEffect(logcounts(sce.416b),
    design=model.matrix(~sce.416b$phenotype), batch=sce.416b$block)

#--- dimensionality-reduction ---#
sce.416b <- runPCA(sce.416b, ncomponents=10, subset_row=chosen.hvgs,
    exprs_values="corrected", BSPARAM=BiocSingular::ExactParam())

set.seed(1010)
sce.416b <- runTSNE(sce.416b, dimred="PCA", perplexity=10)

sce.416b
## class: SingleCellExperiment
## dim: 46604 185
## metadata(0):
## assays(3): counts logcounts corrected
## rownames(46604): 4933401J01Rik Gm26206 ... CAAA01147332.1
##   CBFB-MYH11-mcherry
## rowData names(4): Length ENSEMBL SYMBOL SEQNAME
## colnames(185): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1
##   SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ...
##   SLX-11312.N712_S507.H5H5YBBXX.s_8.r_1
##   SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1
## colData names(10): Source Name cell line ... block sizeFactor
## reducedDimNames(2): PCA TSNE
## mainExpName: endogenous
## altExpNames(2): ERCC SIRV

We use a HclustParam object to instruct clusterCells() to perform hierarchical clustering on the top PCs. Specifically, it computes a cell-cell distance matrix using the top PCs and then applies Ward’s minimum variance method to obtain a dendrogram. When visualized in Figure 5.7, we see a clear split in the population caused by oncogene induction. While both Ward’s method and the default complete linkage yield compact clusters, we prefer the former as it is less affected by differences in variance between clusters.

hclust.416b <- clusterCells(sce.416b, use.dimred="PCA",
    BLUSPARAM=HclustParam(method="ward.D2"), full=TRUE)
tree.416b <- hclust.416b$objects$hclust

# Making a prettier dendrogram.
library(dendextend)
tree.416b$labels <- seq_along(tree.416b$labels)
dend <- as.dendrogram(tree.416b, hang=0.1)

combined.fac <- paste0(sce.416b$block, ".",
    sub(" .*", "", sce.416b$phenotype))
labels_colors(dend) <- c(
    "20160113.wild"="blue",
    "20160113.induced"="red",
    "20160325.wild"="dodgerblue",
    "20160325.induced"="salmon"
)[combined.fac][order.dendrogram(dend)]

plot(dend)

Figure 5.7: Hierarchy of cells in the 416B data set after hierarchical clustering, where each leaf node is a cell that is coloured according to its oncogene induction status (red is induced, blue is control) and plate of origin (light or dark).
To obtain explicit clusters, we “cut” the tree by removing internal branches such that every subtree represents a distinct cluster. This is most simply done by removing internal branches above a certain height of the tree, as performed by the cutree() function. A more sophisticated variant of this approach is implemented in the dynamicTreeCut package, which uses the shape of the branches to obtain a better partitioning for complex dendrograms (Figure 5.8). We enable this option by setting cut.dynamic=TRUE, with additional tweaking of the deepSplit= parameter to control the resolution of the resulting clusters.

hclust.dyn <- clusterCells(sce.416b, use.dimred="PCA",
    BLUSPARAM=HclustParam(method="ward.D2", cut.dynamic=TRUE,
        cut.params=list(minClusterSize=10, deepSplit=1)))
table(hclust.dyn)
## hclust.dyn
##  1  2  3  4
## 78 69 24 14

labels_colors(dend) <- as.integer(hclust.dyn)[order.dendrogram(dend)]
plot(dend)

Figure 5.8: Hierarchy of cells in the 416B data set after hierarchical clustering, where each leaf node is a cell that is coloured according to its assigned cluster identity from a dynamic tree cut.

This generally corresponds well to the grouping of cells on a \\(t\\)-SNE plot (Figure 5.9). Cluster 2 is split across two visual clusters in the plot, but we attribute this to a distortion introduced by \\(t\\)-SNE, given that this cluster actually has the highest average silhouette width (link("silhouette-width", "OSCA.advanced")).

colLabels(sce.416b) <- factor(hclust.dyn)
plotReducedDim(sce.416b, "TSNE", colour_by="label")

Figure 5.9: \\(t\\)-SNE plot of the 416B dataset, where each point represents a cell and is coloured according to the identity of the assigned cluster from hierarchical clustering.

5.4.3 In two-step procedures, again

Returning to our PBMC example, we can use a two-step approach to perform hierarchical clustering on the representative centroids (Figure 5.10).
This avoids the construction of a distance matrix across all cells for faster computation. # Setting the seed due to the randomness of k-means. set.seed(1111) khclust.info &lt;- clusterCells(sce.pbmc, use.dimred=&quot;PCA&quot;, BLUSPARAM=TwoStepParam( first=KmeansParam(centers=1000), second=HclustParam(method=&quot;ward.D2&quot;, cut.dynamic=TRUE, cut.params=list(deepSplit=3)) # for higher resolution. ), full=TRUE ) table(khclust.info$clusters) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ## 649 374 347 328 281 243 218 223 161 157 172 133 135 167 93 117 72 115 plotTSNE(sce.pbmc, colour_by=I(khclust.info$clusters), text_by=I(khclust.info$clusters)) Figure 5.10: \\(t\\)-SNE plot of the PBMC dataset, where each point represents a cell and is coloured according to the identity of the assigned cluster from combined \\(k\\)-means/hierarchical clustering. With a little bit of work, we can also examine the dendrogram constructed on the centroids (Figure 5.11). This provides a more quantitative visualization of the relative similarities between the different subpopulations. k.stats &lt;- khclust.info$objects$first tree.pbmc &lt;- khclust.info$objects$second$hclust m &lt;- match(as.integer(tree.pbmc$labels), k.stats$cluster) final.clusters &lt;- khclust.info$clusters[m] # TODO: expose scater color palette for easier re-use, # given that the default colors start getting recycled. dend &lt;- as.dendrogram(tree.pbmc, hang=0.1) labels_colors(dend) &lt;- as.integer(final.clusters)[order.dendrogram(dend)] plot(dend) Figure 5.11: Dendrogram of the \\(k\\)-means centroids after hierarchical clustering in the PBMC dataset. Each leaf node represents a representative cluster of cells generated by \\(k\\)-means clustering. As an aside, the same approach can be used to speed up any clustering method based on a distance matrix. For example, we could subject our \\(k\\)-means centroids to clustering by affinity propagation (Frey and Dueck 2007). 
In this procedure, each sample (i.e., centroid) chooses itself or another sample as its “exemplar”, with the suitability of the choice dependent on the distance between the samples, other potential exemplars for each sample, and the other samples with the same chosen exemplar. Iterative updates of these choices yield a set of clusters where each cluster is defined from the samples assigned to the same exemplar (Figure 5.12). Unlike hierarchical clustering, this does not provide a dendrogram, but it also avoids the extra complication of a tree cut - resolution is primarily controlled via the q= parameter, which defines the strength with which a sample considers itself as an exemplar and thus forms its own cluster. # Setting the seed due to the randomness of k-means. set.seed(1111) kaclust.info &lt;- clusterCells(sce.pbmc, use.dimred=&quot;PCA&quot;, BLUSPARAM=TwoStepParam( first=KmeansParam(centers=1000), second=AffinityParam(q=0.1) # larger q =&gt; more clusters ), full=TRUE ) table(kaclust.info$clusters) ## ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ## 847 144 392 93 405 264 170 209 379 187 24 20 45 121 536 27 64 58 plotTSNE(sce.pbmc, colour_by=I(kaclust.info$clusters), text_by=I(kaclust.info$clusters)) Figure 5.12: \\(t\\)-SNE plot of the PBMC dataset, where each point represents a cell and is coloured according to the identity of the assigned cluster from combined \\(k\\)-means/affinity propagation clustering. 5.5 Subclustering Another simple approach to improving resolution is to repeat the feature selection and clustering within a single cluster. This aims to select HVGs and PCs that are more relevant to internal structure, improving resolution by avoiding noise from unnecessary features. Subsetting also encourages clustering methods to separate cells according to more modest heterogeneity in the absence of distinct subpopulations. 
We demonstrate with a cluster of putative memory T cells from the PBMC dataset, identified according to several markers (Figure 5.13). clust.full &lt;- clusterCells(sce.pbmc, use.dimred=&quot;PCA&quot;) plotExpression(sce.pbmc, features=c(&quot;CD3E&quot;, &quot;CCR7&quot;, &quot;CD69&quot;, &quot;CD44&quot;), x=I(clust.full), colour_by=I(clust.full)) Figure 5.13: Distribution of log-normalized expression values for several T cell markers within each cluster in the 10X PBMC dataset. Each cluster is color-coded for convenience. # Repeating modelling and PCA on the subset. memory &lt;- 10L sce.memory &lt;- sce.pbmc[,clust.full==memory] dec.memory &lt;- modelGeneVar(sce.memory) sce.memory &lt;- denoisePCA(sce.memory, technical=dec.memory, subset.row=getTopHVGs(dec.memory, n=5000)) We apply graph-based clustering within this memory subset to obtain CD4+ and CD8+ subclusters (Figure 5.14). Admittedly, the expression of CD4 is so low that the change is rather modest, but the interpretation is clear enough. g.memory &lt;- buildSNNGraph(sce.memory, use.dimred=&quot;PCA&quot;) clust.memory &lt;- igraph::cluster_walktrap(g.memory)$membership plotExpression(sce.memory, features=c(&quot;CD8A&quot;, &quot;CD4&quot;), x=I(factor(clust.memory))) Figure 5.14: Distribution of CD4 and CD8A log-normalized expression values within each cluster in the memory T cell subset of the 10X PBMC dataset. For subclustering analyses, it is helpful to define a customized function that calls our desired algorithms to obtain a clustering from a given SingleCellExperiment. This function can then be applied multiple times on different subsets without having to repeatedly copy and modify the code for each subset. For example, quickSubCluster() loops over all subsets and executes this user-specified function to generate a list of SingleCellExperiment objects containing the subclustering results. (Of course, the downside is that this assumes that a similar analysis is appropriate for each subset. 
If different subsets require extensive reparametrization, copying the code may actually be more straightforward.) set.seed(1000010) subcluster.out &lt;- quickSubCluster(sce.pbmc, groups=clust.full, prepFUN=function(x) { # Preparing the subsetted SCE for clustering. dec &lt;- modelGeneVar(x) input &lt;- denoisePCA(x, technical=dec, subset.row=getTopHVGs(dec, prop=0.1), BSPARAM=BiocSingular::IrlbaParam()) }, clusterFUN=function(x) { # Performing the subclustering in the subset. g &lt;- buildSNNGraph(x, use.dimred=&quot;PCA&quot;, k=20) igraph::cluster_walktrap(g)$membership } ) # One SingleCellExperiment object per parent cluster: names(subcluster.out) ## [1] &quot;1&quot; &quot;2&quot; &quot;3&quot; &quot;4&quot; &quot;5&quot; &quot;6&quot; &quot;7&quot; &quot;8&quot; &quot;9&quot; &quot;10&quot; &quot;11&quot; &quot;12&quot; &quot;13&quot; &quot;14&quot; &quot;15&quot; ## [16] &quot;16&quot; # Looking at the subclustering for one example: table(subcluster.out[[1]]$subcluster) ## ## 1.1 1.2 1.3 1.4 1.5 1.6 ## 28 22 34 62 11 48 Subclustering is a general and conceptually straightforward procedure for increasing resolution. It can also simplify the interpretation of the subclusters, which only need to be considered in the context of the parent cluster’s identity - for example, we did not have to re-identify the cells in cluster 10 as T cells. However, this is a double-edged sword as it is difficult for practitioners to consider the uncertainty of identification for parent clusters when working with deep nesting. If cell types or states span cluster boundaries, conditioning on the putative cell type identity of the parent cluster can encourage the construction of a “house of cards” of cell type assignments, e.g., where a subcluster of one parent cluster is actually contamination from a cell type in a separate parent cluster. 
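The looping pattern behind quickSubCluster() can be sketched as follows (a hypothetical Python analogue, not the scran API): apply a user-supplied clustering function within each parent cluster and emit nested labels.

```python
def quick_subcluster(values, parents, cluster_fun):
    # Loop over parent clusters, recluster within each subset, and emit
    # nested labels of the form "parent.subcluster".
    out = {}
    for p in sorted(set(parents)):
        subset = [v for v, lab in zip(values, parents) if lab == p]
        subs = cluster_fun(subset)
        out[p] = [f"{p}.{s}" for s in subs]
    return out

# Toy data: parent cluster 1 hides two subgroups; cluster 2 is uniform.
values = [1, 2, 9, 10, 5, 5]
parents = [1, 1, 1, 1, 2, 2]
split_at_mean = lambda xs: [1 if x <= sum(xs) / len(xs) else 2 for x in xs]
out = quick_subcluster(values, parents, split_at_mean)
print(out)  # {1: ['1.1', '1.1', '1.2', '1.2'], 2: ['2.1', '2.1']}
```

The nested labels make the parentage explicit, which is convenient when reporting subclusters alongside their parent cluster's identity.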
Session Info View session info R version 4.1.1 (2021-08-10) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.3 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] apcluster_1.4.8 dendextend_1.15.1 [3] pheatmap_1.0.12 bluster_1.4.0 [5] scater_1.22.0 ggplot2_3.3.5 [7] scran_1.22.0 scuttle_1.4.0 [9] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0 [11] Biobase_2.54.0 GenomicRanges_1.46.0 [13] GenomeInfoDb_1.30.0 IRanges_2.28.0 [15] S4Vectors_0.32.0 BiocGenerics_0.40.0 [17] MatrixGenerics_1.6.0 matrixStats_0.61.0 [19] BiocStyle_2.22.0 rebook_1.4.0 loaded via a namespace (and not attached): [1] ggbeeswarm_0.6.0 colorspace_2.0-2 [3] ellipsis_0.3.2 dynamicTreeCut_1.63-1 [5] benchmarkme_1.0.7 XVector_0.34.0 [7] BiocNeighbors_1.12.0 farver_2.1.0 [9] ggrepel_0.9.1 fansi_0.5.0 [11] ClusterR_1.2.5 codetools_0.2-18 [13] sparseMatrixStats_1.6.0 doParallel_1.0.16 [15] knitr_1.36 jsonlite_1.7.2 [17] cluster_2.1.2 graph_1.72.0 [19] httr_1.4.2 BiocManager_1.30.16 [21] compiler_4.1.1 dqrng_0.3.0 [23] assertthat_0.2.1 Matrix_1.3-4 [25] fastmap_1.1.0 limma_3.50.0 [27] BiocSingular_1.10.0 htmltools_0.5.2 [29] tools_4.1.1 gmp_0.6-2 [31] rsvd_1.0.5 igraph_1.2.7 [33] gtable_0.3.0 glue_1.4.2 [35] GenomeInfoDbData_1.2.7 dplyr_1.0.7 [37] rappdirs_0.3.3 Rcpp_1.0.7 [39] jquerylib_0.1.4 vctrs_0.3.8 [41] iterators_1.0.13 DelayedMatrixStats_1.16.0 [43] xfun_0.27 stringr_1.4.0 [45] beachmat_2.10.0 lifecycle_1.0.1 [47] irlba_2.3.3 gtools_3.9.2 [49] statmod_1.4.36 XML_3.99-0.8 [51] edgeR_3.36.0 zlibbioc_1.40.0 [53] 
scales_1.1.1 parallel_4.1.1 [55] RColorBrewer_1.1-2 yaml_2.2.1 [57] gridExtra_2.3 sass_0.4.0 [59] stringi_1.7.5 highr_0.9 [61] foreach_1.5.1 ScaledMatrix_1.2.0 [63] filelock_1.0.2 BiocParallel_1.28.0 [65] benchmarkmeData_1.0.4 rlang_0.4.12 [67] pkgconfig_2.0.3 bitops_1.0-7 [69] evaluate_0.14 lattice_0.20-45 [71] purrr_0.3.4 CodeDepends_0.6.5 [73] labeling_0.4.2 cowplot_1.1.1 [75] tidyselect_1.1.1 magrittr_2.0.1 [77] bookdown_0.24 R6_2.5.1 [79] generics_0.1.1 metapod_1.2.0 [81] DelayedArray_0.20.0 DBI_1.1.1 [83] pillar_1.6.4 withr_2.4.2 [85] mbkmeans_1.10.0 RCurl_1.98-1.5 [87] tibble_3.1.5 dir.expiry_1.2.0 [89] crayon_1.4.1 utf8_1.2.2 [91] rmarkdown_2.11 viridis_0.6.2 [93] locfit_1.5-9.4 grid_4.1.1 [95] digest_0.6.28 munsell_0.5.0 [97] beeswarm_0.4.0 viridisLite_0.4.0 [99] vipor_0.4.5 bslib_0.3.1 References Chapter 6 Marker gene detection 6.1 Motivation To interpret our clustering results from Chapter 5, we identify the genes that drive separation between clusters. These marker genes allow us to assign biological meaning to each cluster based on their functional annotation. In the simplest case, we have a priori knowledge of the marker genes associated with particular cell types, allowing us to treat the clustering as a proxy for cell type identity. 
The same principle can be applied to discover more subtle differences between clusters (e.g., changes in activation or differentiation state) based on the behavior of genes in the affected pathways. The most straightforward approach to marker gene detection involves testing for differential expression between clusters. If a gene is strongly DE between clusters, it is likely to have driven the separation of cells in the clustering algorithm. Several methods are available to quantify the differences in expression profiles between clusters and obtain a single ranking of genes for each cluster. We will demonstrate some of these choices in this chapter using the 10X PBMC dataset: View set-up code (Workflow Chapter 3) #--- loading ---# library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/GRCh38&quot;) sce.pbmc &lt;- read10xCounts(fname, col.names=TRUE) #--- gene-annotation ---# library(scater) rownames(sce.pbmc) &lt;- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol) library(EnsDb.Hsapiens.v86) location &lt;- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, column=&quot;SEQNAME&quot;, keytype=&quot;GENEID&quot;) #--- cell-detection ---# set.seed(100) e.out &lt;- emptyDrops(counts(sce.pbmc)) sce.pbmc &lt;- sce.pbmc[,which(e.out$FDR &lt;= 0.001)] #--- quality-control ---# stats &lt;- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location==&quot;MT&quot;))) high.mito &lt;- isOutlier(stats$subsets_Mito_percent, type=&quot;higher&quot;) sce.pbmc &lt;- sce.pbmc[,!high.mito] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.pbmc) sce.pbmc &lt;- computeSumFactors(sce.pbmc, cluster=clusters) sce.pbmc &lt;- logNormCounts(sce.pbmc) #--- variance-modelling ---# set.seed(1001) dec.pbmc &lt;- 
modelGeneVarByPoisson(sce.pbmc) top.pbmc &lt;- getTopHVGs(dec.pbmc, prop=0.1) #--- dimensionality-reduction ---# set.seed(10000) sce.pbmc &lt;- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc) set.seed(100000) sce.pbmc &lt;- runTSNE(sce.pbmc, dimred=&quot;PCA&quot;) set.seed(1000000) sce.pbmc &lt;- runUMAP(sce.pbmc, dimred=&quot;PCA&quot;) #--- clustering ---# g &lt;- buildSNNGraph(sce.pbmc, k=10, use.dimred = &#39;PCA&#39;) clust &lt;- igraph::cluster_walktrap(g)$membership colLabels(sce.pbmc) &lt;- factor(clust) sce.pbmc ## class: SingleCellExperiment ## dim: 33694 3985 ## metadata(1): Samples ## assays(2): counts logcounts ## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B ## rowData names(2): ID Symbol ## colnames(3985): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ... ## TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1 ## colData names(4): Sample Barcode sizeFactor label ## reducedDimNames(3): PCA TSNE UMAP ## mainExpName: NULL ## altExpNames(0): 6.2 Scoring markers by pairwise comparisons Our general strategy is to compare each pair of clusters and compute scores quantifying the differences in the expression distributions between clusters. The scores for all pairwise comparisons involving a particular cluster are then consolidated into a single DataFrame for that cluster. The scoreMarkers() function from scran returns a list of DataFrames where each DataFrame corresponds to a cluster and each row of the DataFrame corresponds to a gene. In the DataFrame for cluster \\(X\\), the columns contain the self.average, the mean log-expression in \\(X\\); other.average, the grand mean across all other clusters; self.detected, the proportion of cells with detected expression in \\(X\\); other.detected, the mean detected proportion across all other clusters; and finally, a variety of effect size summaries generated from all pairwise comparisons involving \\(X\\). 
library(scran) marker.info &lt;- scoreMarkers(sce.pbmc, colLabels(sce.pbmc)) marker.info ## List of length 16 ## names(16): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 colnames(marker.info[[&quot;1&quot;]]) # statistics for cluster 1. ## [1] &quot;self.average&quot; &quot;other.average&quot; &quot;self.detected&quot; ## [4] &quot;other.detected&quot; &quot;mean.logFC.cohen&quot; &quot;min.logFC.cohen&quot; ## [7] &quot;median.logFC.cohen&quot; &quot;max.logFC.cohen&quot; &quot;rank.logFC.cohen&quot; ## [10] &quot;mean.AUC&quot; &quot;min.AUC&quot; &quot;median.AUC&quot; ## [13] &quot;max.AUC&quot; &quot;rank.AUC&quot; &quot;mean.logFC.detected&quot; ## [16] &quot;min.logFC.detected&quot; &quot;median.logFC.detected&quot; &quot;max.logFC.detected&quot; ## [19] &quot;rank.logFC.detected&quot; For each cluster, we can then rank candidate markers based on one of these effect size summaries. We demonstrate below with the mean AUC for cluster 1, which probably contains NK cells based on the top genes in Figure 6.1 (and no CD3E expression). The next section will go into more detail on the differences between the various columns. chosen &lt;- marker.info[[&quot;1&quot;]] ordered &lt;- chosen[order(chosen$mean.AUC, decreasing=TRUE),] head(ordered[,1:4]) # showing basic stats only, for brevity. 
## DataFrame with 6 rows and 4 columns ## self.average other.average self.detected other.detected ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## NKG7 4.39503 0.7486412 0.985366 0.3768438 ## GNLY 4.40275 0.3489056 0.956098 0.2010344 ## CTSW 2.60281 0.4618797 0.960976 0.2898534 ## HOPX 1.99060 0.1607469 0.907317 0.1160658 ## PRF1 2.22297 0.1659853 0.887805 0.1263180 ## KLRF1 1.60598 0.0379703 0.858537 0.0346691 library(scater) plotExpression(sce.pbmc, features=head(rownames(ordered)), x=&quot;label&quot;, colour_by=&quot;label&quot;) Figure 6.1: Distribution of expression values across clusters for the top potential marker genes (as determined by the mean AUC) for cluster 1 in the PBMC dataset. We deliberately use pairwise comparisons rather than comparing each cluster to the average of all other cells. The latter approach is sensitive to the population composition, which introduces an element of unpredictability to the marker sets due to variation in cell type abundances. (In the worst case, the presence of one subpopulation containing a majority of the cells will drive the selection of top markers for every other cluster, pushing out useful genes that can distinguish between the smaller subpopulations.) Moreover, pairwise comparisons naturally provide more information to interpret the utility of a marker, e.g., by providing log-fold changes to indicate which clusters are distinguished by each gene (Section 6.5). Previous editions of this chapter used \\(p\\)-values from the tests corresponding to each effect size, e.g., Welch’s \\(t\\)-test, the Wilcoxon ranked sum test. While this is fine for ranking genes, the \\(p\\)-values themselves are statistically flawed and are of little use for inference - see Advanced Section 6.4 for more details. The scoreMarkers() function simplifies the marker detection procedure by omitting the \\(p\\)-values altogether, instead focusing on the underlying effect sizes. 
6.3 Effect sizes for pairwise comparisons In the context of marker detection, the area under the curve (AUC) quantifies our ability to distinguish between two distributions in a pairwise comparison. The AUC represents the probability that a randomly chosen observation from our cluster of interest is greater than a randomly chosen observation from the other cluster. A value of 1 corresponds to upregulation, where all values of our cluster of interest are greater than any value from the other cluster; a value of 0.5 means that there is no net difference in the location of the distributions; and a value of 0 corresponds to downregulation. The AUC is closely related to the \\(U\\) statistic in the Wilcoxon ranked sum test (a.k.a., Mann-Whitney U-test). auc.only &lt;- chosen[,grepl(&quot;AUC&quot;, colnames(chosen))] auc.only[order(auc.only$mean.AUC,decreasing=TRUE),] ## DataFrame with 33694 rows and 5 columns ## mean.AUC min.AUC median.AUC max.AUC rank.AUC ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;integer&gt; ## NKG7 0.963726 0.805885 0.988947 0.991803 1 ## GNLY 0.958868 0.877404 0.973347 0.974913 2 ## CTSW 0.932352 0.709079 0.966638 0.978873 2 ## HOPX 0.928138 0.780016 0.949129 0.953659 3 ## PRF1 0.924690 0.837420 0.940300 0.943902 5 ## ... ... ... ... ... ... ## RPS13 0.174413 0.00684690 0.0806435 0.695732 426 ## RPL26 0.170493 0.01334117 0.0724876 0.813720 104 ## RPL18A 0.169516 0.01090122 0.0760620 0.749735 264 ## RPL39 0.152346 0.00405525 0.0626341 0.774848 185 ## FTH1 0.147289 0.00121951 0.0639979 0.645732 766 Cohen’s \\(d\\) is a standardized log-fold change where the difference in the mean log-expression between groups is scaled by the average standard deviation across groups. In other words, it is the number of standard deviations that separate the means of the two groups. 
The interpretation is similar to the log-fold change; positive values indicate that the gene is upregulated in our cluster of interest, negative values indicate downregulation and values close to zero indicate that there is little difference. Cohen’s \\(d\\) is roughly analogous to the \\(t\\)-statistic in various two-sample \\(t\\)-tests. cohen.only &lt;- chosen[,grepl(&quot;logFC.cohen&quot;, colnames(chosen))] cohen.only[order(cohen.only$mean.logFC.cohen,decreasing=TRUE),] ## DataFrame with 33694 rows and 5 columns ## mean.logFC.cohen min.logFC.cohen median.logFC.cohen max.logFC.cohen ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## NKG7 4.84346 1.025337 5.84606 6.30987 ## GNLY 3.71574 1.793853 4.04410 4.36929 ## CTSW 2.96940 0.699433 3.19968 3.96973 ## GZMA 2.69683 0.399487 3.18392 3.44040 ## HOPX 2.67330 1.108548 2.92242 3.06690 ## ... ... ... ... ... ## FTH1 -2.28562 -5.19176 -2.533685 0.2995322 ## HLA-DRA -2.39933 -7.13493 -2.032812 0.0146072 ## FTL -2.40544 -5.82525 -1.285601 0.2569453 ## CST3 -2.56767 -7.92584 -0.950339 0.0336350 ## LYZ -2.84045 -9.00815 -0.341198 -0.1797338 ## rank.logFC.cohen ## &lt;integer&gt; ## NKG7 1 ## GNLY 1 ## CTSW 3 ## GZMA 2 ## HOPX 4 ## ... ... ## FTH1 4362 ## HLA-DRA 6202 ## FTL 5319 ## CST3 5608 ## LYZ 30966 Finally, we also compute the log-fold change in the proportion of cells with detected expression between clusters. This ignores any information about the magnitude of expression, only considering whether any expression is detected at all. Again, positive values indicate that a greater proportion of cells express the gene in our cluster of interest compared to the other cluster. Note that a pseudo-count is added to avoid undefined log-fold changes when no cells express the gene in either group. 
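To make these three effect sizes concrete, here is a minimal Python sketch of each (outside the book's R workflow; the exact conventions used by scoreMarkers(), such as its pseudo-count scheme, may differ).

```python
import math
import statistics

def auc(x, y):
    # P(random value from x exceeds random value from y); ties count 1/2.
    pairs = [(xi, yi) for xi in x for yi in y]
    gt = sum(xi > yi for xi, yi in pairs)
    ties = sum(xi == yi for xi, yi in pairs)
    return (gt + 0.5 * ties) / len(pairs)

def cohen_d(x, y):
    # Difference in means scaled by the average standard deviation.
    s = (statistics.stdev(x) + statistics.stdev(y)) / 2
    return (statistics.mean(x) - statistics.mean(y)) / s

def logfc_detected(x, y, pseudo=1):
    # log2 change in detected proportions; the pseudo-count avoids log(0).
    px = (sum(v > 0 for v in x) + pseudo) / (len(x) + 2 * pseudo)
    py = (sum(v > 0 for v in y) + pseudo) / (len(y) + 2 * pseudo)
    return math.log2(px / py)

up = [2.0, 3.0, 2.5, 3.5]   # log-expression in the cluster of interest
lo = [0.0, 0.5, 0.0, 1.0]   # log-expression in the other cluster
print(auc(up, lo), round(cohen_d(up, lo), 2), round(logfc_detected(up, lo), 2))
```

Note how the AUC saturates at 1 once the two distributions no longer overlap, while Cohen's \(d\) keeps growing as the separation widens relative to the spread.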
detect.only &lt;- chosen[,grepl(&quot;logFC.detected&quot;, colnames(chosen))] detect.only[order(detect.only$mean.logFC.detected,decreasing=TRUE),] ## DataFrame with 33694 rows and 5 columns ## mean.logFC.detected min.logFC.detected median.logFC.detected ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## KLRF1 4.80539 2.532196 5.33959 ## PRSS23 4.43836 2.354538 4.57967 ## XCL1 4.42946 1.398559 4.91134 ## XCL2 4.42099 1.304208 4.87151 ## SH2D1B 4.17329 0.804099 4.54737 ## ... ... ... ... ## MARCKS -3.14645 -6.96050 -2.00000 ## RAB31 -3.23328 -6.63557 -2.58496 ## SLC7A7 -3.28244 -6.95262 -2.64295 ## RAB32 -3.42074 -6.72160 -3.32193 ## NCF2 -3.76139 -7.00693 -3.38863 ## max.logFC.detected rank.logFC.detected ## &lt;numeric&gt; &lt;integer&gt; ## KLRF1 6.50482 1 ## PRSS23 5.98530 2 ## XCL1 6.16993 1 ## XCL2 6.21524 1 ## SH2D1B 5.85007 3 ## ... ... ... ## MARCKS 0 11805 ## RAB31 -1 31423 ## SLC7A7 0 11796 ## RAB32 0 11805 ## NCF2 0 11805 The AUC or Cohen’s \\(d\\) is usually the best choice for general purpose marker detection, as they are effective regardless of the magnitude of the expression values. The log-fold change in the detected proportion is specifically useful for identifying binary changes in expression. See Advanced Section 6.2 for more information about the practical differences between the effect sizes. 6.4 Summarizing pairwise effects In a dataset with \\(N\\) clusters, each cluster is associated with \\(N-1\\) values for each type of effect size described in the previous section. To simplify interpretation, we summarize the effects for each cluster into some key statistics such as the mean and median. Each summary statistic has a different interpretation when used for ranking: The most obvious summary statistic is the mean. For cluster \\(X\\), a large mean effect size (&gt;0 for the log-fold changes, &gt;0.5 for the AUCs) indicates that the gene is upregulated in \\(X\\) compared to the average of the other groups. 
Another summary statistic is the median, where a large value indicates that the gene is upregulated in \\(X\\) compared to most (&gt;50%) other clusters. The median provides greater robustness to outliers than the mean, which may or may not be desirable. On one hand, the median avoids an inflated effect size if only a minority of comparisons have large effects; on the other hand, it will also overstate the effect size by ignoring a minority of comparisons that have opposing effects. The minimum value (min.*) is the most stringent summary for identifying upregulated genes, as a large value indicates that the gene is upregulated in \\(X\\) compared to all other clusters. Conversely, if the minimum is small (&lt;0 for the log-fold changes, &lt;0.5 for the AUCs), we can conclude that the gene is downregulated in \\(X\\) compared to at least one other cluster. The maximum value (max.*) is the least stringent summary for identifying upregulated genes, as a large value can be obtained if there is strong upregulation in \\(X\\) compared to any other cluster. Conversely, if the maximum is small, we can conclude that the gene is downregulated in \\(X\\) compared to all other clusters. The minimum rank, a.k.a., “min-rank” (rank.*) is the smallest rank of each gene across all pairwise comparisons. Specifically, genes are ranked within each pairwise comparison based on decreasing effect size, and then the smallest rank across all comparisons is reported for each gene. If a gene has a small min-rank, we can conclude that it is one of the top upregulated genes in at least one comparison of \\(X\\) to another cluster. Each of these summaries is computed for each effect size, for each gene, and for each cluster. Our next step is to choose one of these summary statistics for one of the effect sizes and to use it to rank the rows of the DataFrame. The choice of summary determines the stringency of the marker selection strategy, i.e., how many other clusters must we differ from? 
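As an aside, the min-rank summary described above can be sketched in a few lines of Python (a hypothetical helper, outside the book's R workflow): rank genes within each pairwise comparison by decreasing effect size, then keep each gene's best rank across comparisons.

```python
def min_rank(effects):
    # effects[gene] -> one effect size per pairwise comparison.
    genes = list(effects)
    n_cmp = len(next(iter(effects.values())))
    best = {g: len(genes) for g in genes}
    for j in range(n_cmp):
        # Rank genes within comparison j by decreasing effect size.
        ordered = sorted(genes, key=lambda g: effects[g][j], reverse=True)
        for rank, g in enumerate(ordered, start=1):
            best[g] = min(best[g], rank)
    return best

# Made-up Cohen's d values for three genes against two other clusters.
effects = {"GeneA": [3.0, 0.1],   # top in the first comparison only
           "GeneB": [0.2, 2.5],   # top in the second comparison only
           "GeneC": [1.0, 1.0]}   # consistently middling
print(min_rank(effects))  # {'GeneA': 1, 'GeneB': 1, 'GeneC': 2}
```

Selecting genes with a min-rank of 1 returns GeneA and GeneB, the union of the top gene from each comparison, which is exactly the behaviour of the rank.* columns.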
For identifying upregulated genes, ranking by the minimum is the most stringent and the maximum is the least stringent; the mean and median fall somewhere in between and are reasonable defaults for most applications. The example below uses the median Cohen’s \\(d\\) to obtain a ranking of upregulated markers for cluster 4 (Figure 6.2), which probably contains monocytes. chosen &lt;- marker.info[[&quot;4&quot;]] # using another cluster, for some variety. ordered &lt;- chosen[order(chosen$median.logFC.cohen,decreasing=TRUE),] head(ordered[,1:4]) # showing basic stats only, for brevity. ## DataFrame with 6 rows and 4 columns ## self.average other.average self.detected other.detected ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## LYZ 4.97589 2.188256 1.000000 0.657832 ## S100A9 4.94461 2.059524 1.000000 0.705775 ## FTL 5.58358 4.140706 1.000000 0.961428 ## S100A8 4.61729 1.834030 1.000000 0.639031 ## CTSS 3.28383 1.568992 1.000000 0.630574 ## CSTA 1.76876 0.685007 0.982143 0.337978 plotExpression(sce.pbmc, features=head(rownames(ordered)), x=&quot;label&quot;, colour_by=&quot;label&quot;) Figure 6.2: Distribution of expression values across clusters for the top potential marker genes (as determined by the median Cohen’s \\(d\\)) for cluster 4 in the PBMC dataset. On some occasions, ranking by the minimum can be highly effective as it yields a concise set of highly cluster-specific markers. However, any gene that is expressed at the same level in two or more clusters will simply not be detected. This is likely to discard many interesting genes, especially if the clusters are finely resolved with weak separation. To give a concrete example, consider a mixed population of CD4+-only, CD8+-only, double-positive and double-negative T cells. Neither Cd4 nor Cd8 would be detected as subpopulation-specific markers because each gene is expressed in two subpopulations such that the minimum effect would be small. 
In practice, the minimum and maximum are most helpful for diagnosing discrepancies between the mean and median, rather than being used directly for ranking. Ranking genes by the min-rank is similar in stringency to ranking by the maximum effect size, in that both will respond to strong DE in a single comparison. However, the min-rank is more useful as it ensures that a single comparison to another cluster with consistently large effects does not dominate the ranking. If we select all genes with min-ranks less than or equal to \\(T\\), the resulting set is the union of the top \\(T\\) genes from all pairwise comparisons. This guarantees that our set contains at least \\(T\\) genes that can distinguish our cluster of interest from any other cluster, which permits a comprehensive determination of a cluster’s identity. We demonstrate below for cluster 4, taking the top \\(T=5\\) genes with the largest Cohen’s \\(d\\) from each comparison to display in Figure 6.3. ordered &lt;- chosen[order(chosen$rank.logFC.cohen),] top.ranked &lt;- ordered[ordered$rank.logFC.cohen &lt;= 5,] rownames(top.ranked) ## [1] &quot;S100A8&quot; &quot;S100A4&quot; &quot;RPS27&quot; &quot;MALAT1&quot; ## [5] &quot;LYZ&quot; &quot;TYROBP&quot; &quot;CTSS&quot; &quot;RPL31&quot; ## [9] &quot;EEF1B2&quot; &quot;LTB&quot; &quot;RPS21&quot; &quot;FTL&quot; ## [13] &quot;S100A9&quot; &quot;RPSA&quot; &quot;GPX1&quot; &quot;RPS29&quot; ## [17] &quot;RPLP1&quot; &quot;RPL17&quot; &quot;CST3&quot; &quot;S100A11&quot; ## [21] &quot;RPS27A&quot; &quot;RPS15A&quot; &quot;MT-ND2&quot; &quot;FCER1G&quot; ## [25] &quot;RPS12&quot; &quot;RPL36A&quot; &quot;RP11-1143G9.4&quot; &quot;IL32&quot; ## [29] &quot;RPL23A&quot; plotGroupedHeatmap(sce.pbmc, features=rownames(top.ranked), group=&quot;label&quot;, center=TRUE, zlim=c(-3, 3)) Figure 6.3: Heatmap of the centered average log-expression values for the top potential marker genes for cluster 4 in the PBMC dataset. 
The set of markers was selected as those genes with Cohen’s \\(d\\)-derived min-ranks less than or equal to 5. Our discussion above has focused mainly on potential markers that are upregulated in our cluster of interest, as these are the easiest to interpret and experimentally validate. However, it also means that any cluster defined by downregulation of a marker will not contain that gene among the top features. This is occasionally relevant for subtypes or other states that are defined by low expression of particular genes. In such cases, focusing on upregulation may yield a disappointing set of markers, and it may be worth examining some of the lowest-ranked genes to see if there is any consistent downregulation compared to other clusters. # Omitting the decreasing=TRUE to focus on negative effects. ordered &lt;- chosen[order(chosen$median.logFC.cohen),1:4] head(ordered) ## DataFrame with 6 rows and 4 columns ## self.average other.average self.detected other.detected ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## HLA-B 4.013806 4.438053 1.000000 0.971760 ## HLA-E 1.789158 2.093695 0.982143 0.871183 ## RPSA 2.685950 2.563913 1.000000 0.829648 ## HLA-C 3.116699 3.527693 1.000000 0.934941 ## BIN2 0.410471 0.667019 0.464286 0.456757 ## RAC2 0.823508 0.982662 0.714286 0.630389 6.5 Obtaining the full effects For more complex questions, we may need to interrogate effect sizes from specific comparisons of interest. To do so, we set full.stats=TRUE to obtain the effect sizes for all pairwise comparisons involving a particular cluster. This is returned in the form of a nested DataFrame for each effect size type - in the example below, full.AUC contains the AUCs for the comparisons between cluster 4 and every other cluster. 
marker.info &lt;- scoreMarkers(sce.pbmc, colLabels(sce.pbmc), full.stats=TRUE) chosen &lt;- marker.info[[&quot;4&quot;]] chosen$full.AUC ## DataFrame with 33694 rows and 15 columns ## 1 2 3 5 6 7 ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## RP11-34P13.3 0.5 0.500000 0.500000 0.500000 0.5 0.5 ## FAM138A 0.5 0.500000 0.500000 0.500000 0.5 0.5 ## OR4F5 0.5 0.500000 0.500000 0.500000 0.5 0.5 ## RP11-34P13.7 0.5 0.499016 0.499076 0.497326 0.5 0.5 ## RP11-34P13.8 0.5 0.500000 0.500000 0.500000 0.5 0.5 ## ... ... ... ... ... ... ... ## AC233755.2 0.500000 0.500000 0.500000 0.500000 0.500 0.5 ## AC233755.1 0.500000 0.500000 0.500000 0.500000 0.500 0.5 ## AC240274.1 0.492683 0.496063 0.494455 0.495989 0.496 0.5 ## AC213203.1 0.500000 0.500000 0.500000 0.500000 0.500 0.5 ## FAM231B 0.500000 0.500000 0.500000 0.500000 0.500 0.5 ## 8 9 10 11 12 13 ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## RP11-34P13.3 0.500000 0.500000 0.5 0.500000 0.5 0.5 ## FAM138A 0.500000 0.500000 0.5 0.500000 0.5 0.5 ## OR4F5 0.500000 0.500000 0.5 0.500000 0.5 0.5 ## RP11-34P13.7 0.498843 0.495033 0.5 0.489362 0.5 0.5 ## RP11-34P13.8 0.498843 0.498344 0.5 0.500000 0.5 0.5 ## ... ... ... ... ... ... ... ## AC233755.2 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 ## AC233755.1 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 ## AC240274.1 0.496528 0.496689 0.494233 0.489362 0.496774 0.496988 ## AC213203.1 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 ## FAM231B 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000 ## 14 15 16 ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## RP11-34P13.3 0.5 0.5 0.5 ## FAM138A 0.5 0.5 0.5 ## OR4F5 0.5 0.5 0.5 ## RP11-34P13.7 0.5 0.5 0.5 ## RP11-34P13.8 0.5 0.5 0.5 ## ... ... ... ... 
## AC233755.2 0.5 0.500000 0.5 ## AC233755.1 0.5 0.500000 0.5 ## AC240274.1 0.5 0.482143 0.5 ## AC213203.1 0.5 0.500000 0.5 ## FAM231B 0.5 0.500000 0.5 Say we want to identify the genes that distinguish cluster 4 from other clusters with high LYZ expression. We subset full.AUC to the relevant comparisons and sort on our summary statistic of choice to obtain a ranking of markers within this subset. This allows us to easily characterize subtle differences between closely related clusters. To illustrate, we use the smallest rank from computeMinRank() to identify the top DE genes in cluster 4 compared to the other LYZ-high clusters (Figure 6.4). lyz.high &lt;- c(&quot;4&quot;, &quot;6&quot;, &quot;8&quot;, &quot;9&quot;, &quot;14&quot;) # based on inspection of the previous Figure. subset &lt;- chosen$full.AUC[,colnames(chosen$full.AUC) %in% lyz.high] to.show &lt;- subset[computeMinRank(subset) &lt;= 10,] to.show ## DataFrame with 27 rows and 4 columns ## 6 8 9 14 ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## CTSS 0.942286 0.195230 0.203288 0.208138 ## S100A12 0.885571 0.204034 0.605428 0.287178 ## S100A8 0.898143 0.156457 0.639368 0.112705 ## RPS27 0.906286 0.943576 0.905866 0.949063 ## RPS27A 0.810000 0.862021 0.832663 0.932084 ## ... ... ... ... ... ## FTL 0.941857 0.115369 0.118023 0.0954333 ## RPL13A 0.637143 0.858466 0.813683 0.9396956 ## RPL3 0.383143 0.819651 0.764250 0.8934426 ## MT-ND2 0.922143 0.576100 0.536128 0.7391686 ## MT-ND3 0.906857 0.432746 0.514487 0.5386417 plotGroupedHeatmap(sce.pbmc[,colLabels(sce.pbmc) %in% lyz.high], features=rownames(to.show), group=&quot;label&quot;, center=TRUE, zlim=c(-3, 3)) Figure 6.4: Heatmap of the centered average log-expression values for the top potential marker genes for cluster 4 relative to other LYZ-high clusters in the PBMC dataset. The set of markers was selected as those genes with AUC-derived min-ranks less than or equal to 10. 
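The min-rank logic is easy to state outside of R as well: rank every gene within each pairwise comparison, then keep each gene's best rank. The Python sketch below is a simplified illustration of this idea (ties are ignored; scran's computeMinRank() is the authoritative implementation):

```python
import numpy as np

def min_rank(effects):
    """Smallest per-comparison rank for each gene.

    `effects` is a genes-by-comparisons matrix of effect sizes (e.g., the
    AUCs of one cluster against each other cluster). Within each comparison,
    genes are ranked so that rank 1 is the strongest effect; each gene then
    keeps its best rank across all comparisons. Ties are ignored here.
    """
    effects = np.asarray(effects, dtype=float)
    ngenes, ncomp = effects.shape
    ranks = np.empty((ngenes, ncomp), dtype=int)
    for j in range(ncomp):
        order = np.argsort(-effects[:, j])  # decreasing effect size
        ranks[order, j] = np.arange(1, ngenes + 1)
    return ranks.min(axis=1)

# Three genes, two pairwise comparisons: gene 0 dominates the first
# comparison and gene 1 the second, so both get a min-rank of 1.
auc = [[0.9, 0.5],
       [0.6, 0.8],
       [0.5, 0.6]]
print(min_rank(auc))
```

Taking all genes with min-rank at most T then yields the union of the top T genes from every comparison, guaranteeing that each pairwise distinction is represented in the selected set.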
Similarly, we can use the full set of effect sizes to define our own summary statistic if the precomputed measures are too coarse. For example, we may be interested in markers that are upregulated against some percentage - say, 80% - of other clusters. This improves the cluster specificity of the ranking by being more stringent than the median yet not as stringent as the minimum. We achieve this by computing and sorting on the 20th percentile of effect sizes, as shown below. stat &lt;- rowQuantiles(as.matrix(chosen$full.AUC), p=0.2) chosen[order(stat, decreasing=TRUE), 1:4] # just showing the basic stats for brevity. ## DataFrame with 33694 rows and 4 columns ## self.average other.average self.detected other.detected ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## S100A12 1.94300 0.609079 0.928571 0.263675 ## JUND 1.42774 0.619368 0.910714 0.442422 ## S100A9 4.94461 2.059524 1.000000 0.705775 ## VCAN 1.63794 0.436597 0.875000 0.222359 ## S100A8 4.61729 1.834030 1.000000 0.639031 ## ... ... ... ... ... ## RPL23A 3.96991 3.74428 1 0.905175 ## RPSA 2.68595 2.56391 1 0.829648 ## HLA-C 3.11670 3.52769 1 0.934941 ## RPS29 4.57983 4.27407 1 0.919797 ## HLA-B 4.01381 4.43805 1 0.971760 6.6 Using a log-fold change threshold The Cohen’s \\(d\\) and AUC calculations consider both the magnitude of the difference between clusters as well as the variability within each cluster. If the variability is lower, it is possible for a gene to have a large effect size even if the magnitude of the difference is small. These genes tend to be somewhat uninformative for cell type identification despite their strong differential expression (e.g., ribosomal protein genes). We would prefer genes with larger log-fold changes between clusters, even if they have higher variability. To favor the detection of such genes, we can compute the effect sizes relative to a log-fold change threshold by setting lfc= in scoreMarkers(). 
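The percentile-based summary described above amounts to a row-wise quantile over the matrix of pairwise effect sizes. A minimal Python sketch with hypothetical AUC values (numpy's default quantile interpolation may differ slightly from matrixStats::rowQuantiles, but the principle is the same):

```python
import numpy as np

# Hypothetical per-gene AUCs against five other clusters.
auc = np.array([
    [0.9, 0.9, 0.8, 0.4, 0.3],  # strong in most comparisons, weak in two
    [0.7, 0.7, 0.7, 0.7, 0.7],  # moderately strong everywhere
])

# The 20th percentile is only high if the gene beats ~80% of other clusters.
stat = np.quantile(auc, 0.2, axis=1)
print(stat)
```

The first gene has the higher median AUC (0.8 versus 0.7), but the second gene wins on the 20th percentile, reflecting its more consistent upregulation across clusters - exactly the stringency that the percentile-based summary is meant to provide.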
The definition of Cohen’s \\(d\\) is generalized to the standardized difference between the observed log-fold change and the specified lfc threshold. Similarly, the AUC is redefined as the probability of randomly picking an expression value from one cluster that is greater than a random value from the other cluster plus lfc. A large positive Cohen’s \\(d\\) and an AUC above 0.5 can only be obtained if the observed log-fold change between clusters is significantly greater than lfc. We demonstrate below by obtaining the top markers for cluster 5 in the PBMC dataset with lfc=2 (Figure 6.5). marker.info.lfc &lt;- scoreMarkers(sce.pbmc, colLabels(sce.pbmc), lfc=2) chosen2 &lt;- marker.info.lfc[[&quot;5&quot;]] # another cluster for some variety. chosen2 &lt;- chosen2[order(chosen2$mean.AUC, decreasing=TRUE),] chosen2[,c(&quot;self.average&quot;, &quot;other.average&quot;, &quot;mean.AUC&quot;)] ## DataFrame with 33694 rows and 3 columns ## self.average other.average mean.AUC ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## CCL5 3.81358 1.047986 0.747609 ## NKG7 3.17038 0.830285 0.654736 ## IL32 3.29770 1.076383 0.633969 ## GZMA 1.92877 0.426848 0.454187 ## TRAC 2.25881 0.848025 0.434530 ## ... ... ... ... ## AC233755.2 0.00000000 0.00000000 0 ## AC233755.1 0.00000000 0.00000000 0 ## AC240274.1 0.00942441 0.00800457 0 ## AC213203.1 0.00000000 0.00000000 0 ## FAM231B 0.00000000 0.00000000 0 plotDots(sce.pbmc, rownames(chosen2)[1:10], group=&quot;label&quot;) Figure 6.5: Dot plot of the top potential marker genes (as determined by the mean AUC) for cluster 5 in the PBMC dataset. Each row corresponds to a marker gene and each column corresponds to a cluster. The size of each dot represents the proportion of cells with detected expression of the gene in the cluster, while the color is proportional to the average expression across all cells in that cluster. 
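The thresholded effect sizes can be written down directly. The Python sketch below is illustrative only: it ignores ties in the AUC and uses a simple average-of-variances pooling for Cohen's d, which may differ in detail from what scoreMarkers() computes.

```python
import numpy as np

def cohen_d_lfc(x, y, lfc=0.0):
    # Standardized difference between the observed log-fold change and lfc.
    x, y = np.asarray(x, float), np.asarray(y, float)
    pooled_sd = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
    return (x.mean() - y.mean() - lfc) / pooled_sd

def auc_lfc(x, y, lfc=0.0):
    # Probability that a random value from x exceeds a random value
    # from y plus lfc (ties ignored for simplicity).
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.mean(x[:, None] > y[None, :] + lfc))

rng = np.random.default_rng(0)
x = rng.normal(3, 0.5, 100)  # log-expression in the cluster of interest
y = rng.normal(1, 0.5, 100)  # another cluster; true log-fold change is 2
print(auc_lfc(x, y))         # far above 0.5
print(auc_lfc(x, y, lfc=2))  # near 0.5: the change barely clears lfc=2
```

With lfc=0 this gene looks like an excellent marker, but against a threshold equal to its true log-fold change the AUC collapses to roughly 0.5, illustrating why only effects well above the threshold stand out.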
Note that the interpretation of the AUC and Cohen’s \\(d\\) becomes slightly more complicated when lfc is non-zero. If lfc is positive, a positive Cohen’s \\(d\\) and an AUC above 0.5 represents upregulation. However, a negative Cohen’s \\(d\\) or AUC below 0.5 may not represent downregulation; it may just indicate that the observed log-fold change is less than the specified lfc. The converse applies when lfc is negative, where the only conclusive interpretation occurs for downregulated genes. For the most part, this complication is not too problematic for routine marker detection, as we are mostly interested in upregulated genes with large positive Cohen’s \\(d\\) and AUCs above 0.5. 6.7 Handling blocking factors Large studies may contain factors of variation that are known and not interesting (e.g., batch effects, sex differences). If these are not modelled, they can interfere with marker gene detection - most obviously by inflating the variance within each cluster, but also by distorting the log-fold changes if the cluster composition varies across levels of the blocking factor. To avoid these issues, we specify the blocking factor via the block= argument, as demonstrated below for the 416B data set. 
View set-up code (Workflow Chapter 1) #--- loading ---# library(scRNAseq) sce.416b &lt;- LunSpikeInData(which=&quot;416b&quot;) sce.416b$block &lt;- factor(sce.416b$block) #--- gene-annotation ---# library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] rowData(sce.416b)$ENSEMBL &lt;- rownames(sce.416b) rowData(sce.416b)$SYMBOL &lt;- mapIds(ens.mm.v97, keys=rownames(sce.416b), keytype=&quot;GENEID&quot;, column=&quot;SYMBOL&quot;) rowData(sce.416b)$SEQNAME &lt;- mapIds(ens.mm.v97, keys=rownames(sce.416b), keytype=&quot;GENEID&quot;, column=&quot;SEQNAME&quot;) library(scater) rownames(sce.416b) &lt;- uniquifyFeatureNames(rowData(sce.416b)$ENSEMBL, rowData(sce.416b)$SYMBOL) #--- quality-control ---# mito &lt;- which(rowData(sce.416b)$SEQNAME==&quot;MT&quot;) stats &lt;- perCellQCMetrics(sce.416b, subsets=list(Mt=mito)) qc &lt;- quickPerCellQC(stats, percent_subsets=c(&quot;subsets_Mt_percent&quot;, &quot;altexps_ERCC_percent&quot;), batch=sce.416b$block) sce.416b &lt;- sce.416b[,!qc$discard] #--- normalization ---# library(scran) sce.416b &lt;- computeSumFactors(sce.416b) sce.416b &lt;- logNormCounts(sce.416b) #--- variance-modelling ---# dec.416b &lt;- modelGeneVarWithSpikes(sce.416b, &quot;ERCC&quot;, block=sce.416b$block) chosen.hvgs &lt;- getTopHVGs(dec.416b, prop=0.1) #--- batch-correction ---# library(limma) assay(sce.416b, &quot;corrected&quot;) &lt;- removeBatchEffect(logcounts(sce.416b), design=model.matrix(~sce.416b$phenotype), batch=sce.416b$block) #--- dimensionality-reduction ---# sce.416b &lt;- runPCA(sce.416b, ncomponents=10, subset_row=chosen.hvgs, exprs_values=&quot;corrected&quot;, BSPARAM=BiocSingular::ExactParam()) set.seed(1010) sce.416b &lt;- runTSNE(sce.416b, dimred=&quot;PCA&quot;, perplexity=10) #--- clustering ---# my.dist &lt;- dist(reducedDim(sce.416b, &quot;PCA&quot;)) my.tree &lt;- hclust(my.dist, method=&quot;ward.D2&quot;) library(dynamicTreeCut) my.clusters &lt;- unname(cutreeDynamic(my.tree, 
distM=as.matrix(my.dist), minClusterSize=10, verbose=0)) colLabels(sce.416b) &lt;- factor(my.clusters) m.out &lt;- scoreMarkers(sce.416b, colLabels(sce.416b), block=sce.416b$block) For each gene, each pairwise comparison between clusters is performed separately in each level of the blocking factor - in this case, the plate of origin. By comparing within each batch, we cancel out any batch effects so that they are not conflated with the biological differences between subpopulations. The effect sizes are then averaged across batches to obtain a single value per comparison, using a weighted mean that accounts for the number of cells involved in the comparison in each batch. A similar correction is applied to the mean log-expression and proportion of detected cells inside and outside each cluster. demo &lt;- m.out[[&quot;1&quot;]] ordered &lt;- demo[order(demo$median.logFC.cohen, decreasing=TRUE),] ordered[,1:4] ## DataFrame with 46604 rows and 4 columns ## self.average other.average self.detected other.detected ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Myh11 4.03436 0.861019 0.988132 0.303097 ## Cd200r3 7.97667 3.524762 0.977675 0.624507 ## Pi16 6.27654 2.644421 0.957126 0.530395 ## Actb 15.48533 14.808584 1.000000 1.000000 ## Ctsd 11.61247 9.130141 1.000000 1.000000 ## ... ... ... ... ... ## Spc24 0.4772577 5.03548 0.222281 0.862153 ## Ska1 0.0787421 4.43426 0.118743 0.773950 ## Pimreg 0.5263611 5.35494 0.258150 0.910706 ## Birc5 1.5580536 7.07230 0.698746 0.976929 ## Ccna2 0.9664521 6.55243 0.554104 0.948520 plotExpression(sce.416b, features=rownames(ordered)[1:6], x=&quot;label&quot;, colour_by=&quot;block&quot;) Figure 6.6: Distribution of expression values across clusters for the top potential marker genes from cluster 1 in the 416B dataset. Each point represents a cell and is colored by the batch of origin. 
The block= argument works for all effect sizes shown above and is robust to differences in the log-fold changes or variance between batches. However, it assumes that each pair of clusters is present in at least one batch. In scenarios where cells from two clusters never co-occur in the same batch, the associated pairwise comparison will be impossible and is ignored during calculation of summary statistics. Session Info View session info R version 4.1.1 (2021-08-10) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.3 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] scater_1.22.0 ggplot2_3.3.5 [3] scran_1.22.0 scuttle_1.4.0 [5] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0 [7] Biobase_2.54.0 GenomicRanges_1.46.0 [9] GenomeInfoDb_1.30.0 IRanges_2.28.0 [11] S4Vectors_0.32.0 BiocGenerics_0.40.0 [13] MatrixGenerics_1.6.0 matrixStats_0.61.0 [15] BiocStyle_2.22.0 rebook_1.4.0 loaded via a namespace (and not attached): [1] bitops_1.0-7 RColorBrewer_1.1-2 [3] filelock_1.0.2 tools_4.1.1 [5] bslib_0.3.1 utf8_1.2.2 [7] R6_2.5.1 irlba_2.3.3 [9] vipor_0.4.5 DBI_1.1.1 [11] colorspace_2.0-2 withr_2.4.2 [13] gridExtra_2.3 tidyselect_1.1.1 [15] compiler_4.1.1 graph_1.72.0 [17] BiocNeighbors_1.12.0 DelayedArray_0.20.0 [19] labeling_0.4.2 bookdown_0.24 [21] sass_0.4.0 scales_1.1.1 [23] rappdirs_0.3.3 stringr_1.4.0 [25] digest_0.6.28 rmarkdown_2.11 [27] XVector_0.34.0 pkgconfig_2.0.3 [29] htmltools_0.5.2 sparseMatrixStats_1.6.0 [31] highr_0.9 fastmap_1.1.0 [33] limma_3.50.0 rlang_0.4.12 [35] 
DelayedMatrixStats_1.16.0 farver_2.1.0 [37] jquerylib_0.1.4 generics_0.1.1 [39] jsonlite_1.7.2 BiocParallel_1.28.0 [41] dplyr_1.0.7 RCurl_1.98-1.5 [43] magrittr_2.0.1 BiocSingular_1.10.0 [45] GenomeInfoDbData_1.2.7 Matrix_1.3-4 [47] ggbeeswarm_0.6.0 Rcpp_1.0.7 [49] munsell_0.5.0 fansi_0.5.0 [51] viridis_0.6.2 lifecycle_1.0.1 [53] stringi_1.7.5 yaml_2.2.1 [55] edgeR_3.36.0 zlibbioc_1.40.0 [57] grid_4.1.1 ggrepel_0.9.1 [59] parallel_4.1.1 dqrng_0.3.0 [61] crayon_1.4.1 dir.expiry_1.2.0 [63] lattice_0.20-45 cowplot_1.1.1 [65] beachmat_2.10.0 locfit_1.5-9.4 [67] CodeDepends_0.6.5 metapod_1.2.0 [69] knitr_1.36 pillar_1.6.4 [71] igraph_1.2.7 codetools_0.2-18 [73] ScaledMatrix_1.2.0 XML_3.99-0.8 [75] glue_1.4.2 evaluate_0.14 [77] BiocManager_1.30.16 vctrs_0.3.8 [79] gtable_0.3.0 purrr_0.3.4 [81] assertthat_0.2.1 xfun_0.27 [83] rsvd_1.0.5 viridisLite_0.4.0 [85] pheatmap_1.0.12 tibble_3.1.5 [87] beeswarm_0.4.0 cluster_2.1.2 [89] bluster_1.4.0 statmod_1.4.36 [91] ellipsis_0.3.2 "],["cell-type-annotation.html", "Chapter 7 Cell type annotation 7.1 Motivation 7.2 Assigning cell labels from reference data 7.3 Assigning cell labels from gene sets 7.4 Assigning cluster labels from markers 7.5 Computing gene set activities Session Info", " Chapter 7 Cell type annotation 7.1 Motivation The most challenging task in scRNA-seq data analysis is arguably the interpretation of the results. Obtaining clusters of cells is fairly straightforward, but it is more difficult to determine what biological state is represented by each of those clusters. Doing so requires us to bridge the gap between the current dataset and prior biological knowledge, and the latter is not always available in a consistent and quantitative manner. 
Indeed, even the concept of a “cell type” is not clearly defined, with most practitioners possessing an “I’ll know it when I see it” intuition that is not amenable to computational analysis. As such, interpretation of scRNA-seq data is often manual and a common bottleneck in the analysis workflow. To expedite this step, we can use various computational approaches that exploit prior information to assign meaning to an uncharacterized scRNA-seq dataset. The most obvious sources of prior information are the curated gene sets associated with particular biological processes, e.g., from the Gene Ontology (GO) or the Kyoto Encyclopedia of Genes and Genomes (KEGG) collections. Alternatively, we can directly compare our expression profiles to published reference datasets where each sample or cell has already been annotated with its putative biological state by domain experts. Here, we will demonstrate both approaches with several different scRNA-seq datasets. 7.2 Assigning cell labels from reference data 7.2.1 Overview A conceptually straightforward annotation approach is to compare the single-cell expression profiles with previously annotated reference datasets. Labels can then be assigned to each cell in our uncharacterized test dataset based on the most similar reference sample(s), for some definition of “similar”. This is a classification challenge that can be tackled by standard machine learning techniques such as random forests and support vector machines. Any published and labelled RNA-seq dataset (bulk or single-cell) can be used as a reference, though its reliability depends greatly on the expertise of the original authors who assigned the labels in the first place. In this section, we will demonstrate the use of the SingleR method (Aran et al. 2019) for cell type annotation. 
This method assigns labels to cells based on the reference samples with the highest Spearman rank correlations, using only the marker genes between pairs of labels to focus on the relevant differences between cell types. It also performs a fine-tuning step for each cell where the correlations are recomputed with just the marker genes for the top-scoring labels. This aims to resolve any ambiguity between those labels by removing noise from irrelevant markers for other labels. Further details can be found in the SingleR book from which most of the examples here are derived. 7.2.2 Using existing references For demonstration purposes, we will use one of the 10X PBMC datasets as our test. While we have already applied quality control, normalization and clustering for this dataset, this is not strictly necessary. It is entirely possible to run SingleR() on the raw counts without any a priori quality control and filter on the annotation results at one’s leisure - see the book for an explanation. 
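The core of the correlation-based assignment can be sketched in a few lines. This is a deliberately simplified Python illustration with hypothetical names and data: real SingleR aggregates scores across many reference samples per label (by a quantile) and then fine-tunes among the top-scoring labels.

```python
import numpy as np

def rankdata(v):
    # Simple ranks (ties ignored), sufficient for this illustration.
    order = np.argsort(v)
    r = np.empty(len(v))
    r[order] = np.arange(len(v))
    return r

def spearman(a, b):
    # Spearman correlation = Pearson correlation of the ranks.
    ra, rb = rankdata(a), rankdata(b)
    ra -= ra.mean()
    rb -= rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))

def assign_label(cell, ref_profiles, labels, marker_genes):
    # Correlate the cell against each reference profile using only the
    # marker genes, and report the best-scoring label.
    rho = [spearman(cell[marker_genes], ref[marker_genes])
           for ref in ref_profiles]
    return labels[int(np.argmax(rho))]

rng = np.random.default_rng(1)
ref_T = rng.lognormal(size=50)          # hypothetical T cell reference profile
ref_B = rng.lognormal(size=50)          # hypothetical B cell reference profile
cell = ref_T + rng.normal(0, 0.01, 50)  # a slightly noisy copy of the T profile
markers = np.arange(20)                 # pretend the first 20 genes are markers
print(assign_label(cell, [ref_T, ref_B], ["T", "B"], markers))
```

Restricting the correlation to marker genes is what makes the method robust to batch effects between the test and reference data, since rank correlations on informative genes are insensitive to monotonic shifts in scale.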
View set-up code (Workflow Chapter 3) #--- loading ---# library(DropletTestFiles) raw.path &lt;- getTestFile(&quot;tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz&quot;) out.path &lt;- file.path(tempdir(), &quot;pbmc4k&quot;) untar(raw.path, exdir=out.path) library(DropletUtils) fname &lt;- file.path(out.path, &quot;raw_gene_bc_matrices/GRCh38&quot;) sce.pbmc &lt;- read10xCounts(fname, col.names=TRUE) #--- gene-annotation ---# library(scater) rownames(sce.pbmc) &lt;- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol) library(EnsDb.Hsapiens.v86) location &lt;- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, column=&quot;SEQNAME&quot;, keytype=&quot;GENEID&quot;) #--- cell-detection ---# set.seed(100) e.out &lt;- emptyDrops(counts(sce.pbmc)) sce.pbmc &lt;- sce.pbmc[,which(e.out$FDR &lt;= 0.001)] #--- quality-control ---# stats &lt;- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location==&quot;MT&quot;))) high.mito &lt;- isOutlier(stats$subsets_Mito_percent, type=&quot;higher&quot;) sce.pbmc &lt;- sce.pbmc[,!high.mito] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.pbmc) sce.pbmc &lt;- computeSumFactors(sce.pbmc, cluster=clusters) sce.pbmc &lt;- logNormCounts(sce.pbmc) #--- variance-modelling ---# set.seed(1001) dec.pbmc &lt;- modelGeneVarByPoisson(sce.pbmc) top.pbmc &lt;- getTopHVGs(dec.pbmc, prop=0.1) #--- dimensionality-reduction ---# set.seed(10000) sce.pbmc &lt;- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc) set.seed(100000) sce.pbmc &lt;- runTSNE(sce.pbmc, dimred=&quot;PCA&quot;) set.seed(1000000) sce.pbmc &lt;- runUMAP(sce.pbmc, dimred=&quot;PCA&quot;) #--- clustering ---# g &lt;- buildSNNGraph(sce.pbmc, k=10, use.dimred = &#39;PCA&#39;) clust &lt;- igraph::cluster_walktrap(g)$membership colLabels(sce.pbmc) &lt;- factor(clust) sce.pbmc ## class: SingleCellExperiment ## dim: 33694 3985 ## metadata(1): Samples ## assays(2): counts logcounts ## rownames(33694): RP11-34P13.3 FAM138A ... 
AC213203.1 FAM231B ## rowData names(2): ID Symbol ## colnames(3985): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ... ## TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1 ## colData names(4): Sample Barcode sizeFactor label ## reducedDimNames(3): PCA TSNE UMAP ## mainExpName: NULL ## altExpNames(0): The celldex package contains a number of curated reference datasets, mostly assembled from bulk RNA-seq or microarray data of sorted cell types. These references are often good enough for most applications provided that they contain the cell types that are expected in the test population. Here, we will use a reference constructed from Blueprint and ENCODE data (Martens and Stunnenberg 2013; The ENCODE Project Consortium 2012); this is obtained by calling the BlueprintEncodeData() function to construct a SummarizedExperiment containing log-expression values with curated labels for each sample. library(celldex) ref &lt;- BlueprintEncodeData() ref ## class: SummarizedExperiment ## dim: 19859 259 ## metadata(0): ## assays(1): logcounts ## rownames(19859): TSPAN6 TNMD ... LINC00550 GIMAP1-GIMAP5 ## rowData names(0): ## colnames(259): mature.neutrophil ## CD14.positive..CD16.negative.classical.monocyte ... ## epithelial.cell.of.umbilical.artery.1 ## dermis.lymphatic.vessel.endothelial.cell.1 ## colData names(3): label.main label.fine label.ont We call the SingleR() function to annotate each of our PBMCs with the main cell type labels from the Blueprint/ENCODE reference. This returns a DataFrame where each row corresponds to a cell in the test dataset and contains its label assignments. Alternatively, we could use the labels in ref$label.fine, which provide more resolution at the cost of speed and increased ambiguity in the assignments. 
library(SingleR) pred &lt;- SingleR(test=sce.pbmc, ref=ref, labels=ref$label.main) table(pred$labels) ## ## B-cells CD4+ T-cells CD8+ T-cells DC Eosinophils Erythrocytes ## 549 773 1274 1 1 5 ## HSC Monocytes NK cells ## 14 1117 251 We inspect the results using a heatmap of the per-cell and label scores (Figure 7.1). Ideally, each cell should exhibit a high score in one label relative to all of the others, indicating that the assignment to that label was unambiguous. This is largely the case for monocytes and B cells, whereas we see more ambiguity between CD4+ and CD8+ T cells (and to a lesser extent, NK cells). plotScoreHeatmap(pred) Figure 7.1: Heatmap of the assignment score for each cell (column) and label (row). Scores are shown before any fine-tuning and are normalized to [0, 1] within each cell. We compare the assignments with the clustering results to determine the identity of each cluster. Here, several clusters are nested within the monocyte and B cell labels (Figure 7.2), indicating that the clustering represents finer subdivisions within the cell types. Interestingly, our clustering does not effectively distinguish between CD4+ and CD8+ T cell labels. This is probably due to the presence of other factors of heterogeneity within the T cell subpopulation (e.g., activation) that have a stronger influence on unsupervised methods than the a priori expected CD4+/CD8+ distinction. tab &lt;- table(Assigned=pred$pruned.labels, Cluster=colLabels(sce.pbmc)) # Adding a pseudo-count of 10 to avoid strong color jumps with just 1 cell. library(pheatmap) pheatmap(log2(tab+10), color=colorRampPalette(c(&quot;white&quot;, &quot;blue&quot;))(101)) Figure 7.2: Heatmap of the distribution of cells across labels and clusters in the 10X PBMC dataset. The color scale is proportional to the log2-transformed number of cells for each cluster-label combination. This episode highlights some of the differences between reference-based annotation and unsupervised clustering. 
The former explicitly focuses on aspects of the data that are known to be interesting, simplifying the process of biological interpretation. However, the cost is that the downstream analysis is restricted by the diversity and resolution of the available labels, a problem that is largely avoided by de novo identification of clusters. We suggest applying both strategies to examine the agreement (or lack thereof) between reference label and cluster assignments. Any inconsistencies are not necessarily problematic due to the conceptual differences between the two approaches; indeed, one could use those discrepancies as the basis for further investigation to discover novel factors of variation in the data. 7.2.3 Using custom references We can also apply SingleR to single-cell reference datasets that are curated and supplied by the user. This is most obviously useful when we have an existing dataset that was previously (manually) annotated and we want to use that knowledge to annotate a new dataset in an automated manner. To illustrate, we will use the Muraro et al. (2016) human pancreas dataset as our reference. View set-up code (Workflow Chapter 6) #--- loading ---# library(scRNAseq) sce.muraro &lt;- MuraroPancreasData() #--- gene-annotation ---# library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] gene.symb &lt;- sub(&quot;__chr.*$&quot;, &quot;&quot;, rownames(sce.muraro)) gene.ids &lt;- mapIds(edb, keys=gene.symb, keytype=&quot;SYMBOL&quot;, column=&quot;GENEID&quot;) # Removing duplicated genes or genes without Ensembl IDs. 
keep &lt;- !is.na(gene.ids) &amp; !duplicated(gene.ids) sce.muraro &lt;- sce.muraro[keep,] rownames(sce.muraro) &lt;- gene.ids[keep] #--- quality-control ---# library(scater) stats &lt;- perCellQCMetrics(sce.muraro) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sce.muraro$donor, subset=sce.muraro$donor!=&quot;D28&quot;) sce.muraro &lt;- sce.muraro[,!qc$discard] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.muraro) sce.muraro &lt;- computeSumFactors(sce.muraro, clusters=clusters) sce.muraro &lt;- logNormCounts(sce.muraro) sce.muraro ## class: SingleCellExperiment ## dim: 16940 2299 ## metadata(0): ## assays(2): counts logcounts ## rownames(16940): ENSG00000268895 ENSG00000121410 ... ENSG00000159840 ## ENSG00000074755 ## rowData names(2): symbol chr ## colnames(2299): D28-1_1 D28-1_2 ... D30-8_93 D30-8_94 ## colData names(4): label donor plate sizeFactor ## reducedDimNames(0): ## mainExpName: endogenous ## altExpNames(1): ERCC # Pruning out unknown or unclear labels. sce.muraro &lt;- sce.muraro[,!is.na(sce.muraro$label) &amp; sce.muraro$label!=&quot;unclear&quot;] table(sce.muraro$label) ## ## acinar alpha beta delta duct endothelial ## 217 795 442 189 239 18 ## epsilon mesenchymal pp ## 3 80 96 Our aim is to assign labels to our test dataset from Segerstolpe et al. (2016). We use the same call to SingleR() but with de.method=\"wilcox\" to identify markers via pairwise Wilcoxon ranked sum tests between labels in the reference Muraro dataset. This re-uses the same machinery from Chapter 6; further options to fine-tune the test procedure can be passed via the de.args argument. 
View set-up code (Workflow Chapter 8) #--- loading ---# library(scRNAseq) sce.seger &lt;- SegerstolpePancreasData() #--- gene-annotation ---# library(AnnotationHub) edb &lt;- AnnotationHub()[[&quot;AH73881&quot;]] symbols &lt;- rowData(sce.seger)$symbol ens.id &lt;- mapIds(edb, keys=symbols, keytype=&quot;SYMBOL&quot;, column=&quot;GENEID&quot;) ens.id &lt;- ifelse(is.na(ens.id), symbols, ens.id) # Removing duplicated rows. keep &lt;- !duplicated(ens.id) sce.seger &lt;- sce.seger[keep,] rownames(sce.seger) &lt;- ens.id[keep] #--- sample-annotation ---# emtab.meta &lt;- colData(sce.seger)[,c(&quot;cell type&quot;, &quot;disease&quot;, &quot;individual&quot;, &quot;single cell well quality&quot;)] colnames(emtab.meta) &lt;- c(&quot;CellType&quot;, &quot;Disease&quot;, &quot;Donor&quot;, &quot;Quality&quot;) colData(sce.seger) &lt;- emtab.meta sce.seger$CellType &lt;- gsub(&quot; cell&quot;, &quot;&quot;, sce.seger$CellType) sce.seger$CellType &lt;- paste0( toupper(substr(sce.seger$CellType, 1, 1)), substring(sce.seger$CellType, 2)) #--- quality-control ---# low.qual &lt;- sce.seger$Quality == &quot;low quality cell&quot; library(scater) stats &lt;- perCellQCMetrics(sce.seger) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;altexps_ERCC_percent&quot;, batch=sce.seger$Donor, subset=!sce.seger$Donor %in% c(&quot;HP1504901&quot;, &quot;HP1509101&quot;)) sce.seger &lt;- sce.seger[,!(qc$discard | low.qual)] #--- normalization ---# library(scran) clusters &lt;- quickCluster(sce.seger) sce.seger &lt;- computeSumFactors(sce.seger, clusters=clusters) sce.seger &lt;- logNormCounts(sce.seger) # Converting to FPKM for a more like-for-like comparison to UMI counts. # However, results are often still good even when this step is skipped. 
library(AnnotationHub) hs.db &lt;- AnnotationHub()[[&quot;AH73881&quot;]] hs.exons &lt;- exonsBy(hs.db, by=&quot;gene&quot;) hs.exons &lt;- reduce(hs.exons) hs.len &lt;- sum(width(hs.exons)) library(scuttle) available &lt;- intersect(rownames(sce.seger), names(hs.len)) fpkm.seger &lt;- calculateFPKM(sce.seger[available,], hs.len[available]) pred.seger &lt;- SingleR(test=fpkm.seger, ref=sce.muraro, labels=sce.muraro$label, de.method=&quot;wilcox&quot;) table(pred.seger$labels) ## ## acinar alpha beta delta duct endothelial ## 192 892 273 106 381 18 ## epsilon mesenchymal pp ## 5 52 171 As it so happens, we are in the fortunate position where our test dataset also contains independently defined labels. We see strong consistency between the two sets of labels (Figure 7.3), indicating that our automatic annotation is comparable to that generated manually by domain experts. tab &lt;- table(pred.seger$pruned.labels, sce.seger$CellType) library(pheatmap) pheatmap(log2(tab+10), color=colorRampPalette(c(&quot;white&quot;, &quot;blue&quot;))(101)) Figure 7.3: Heatmap of the confusion matrix between the predicted labels (rows) and the independently defined labels (columns) in the Segerstolpe dataset. The color is proportional to the log-transformed number of cells with a given combination of labels from each set. An interesting question is - given a single-cell reference dataset, is it better to use it directly or convert it to pseudo-bulk values? A single-cell reference preserves the “shape” of the subpopulation in high-dimensional expression space, potentially yielding more accurate predictions when the differences between labels are subtle (or at least capturing ambiguity more accurately to avoid grossly incorrect predictions). However, it also requires more computational work to assign each cell in the test dataset. We refer to the other book for more details on how to achieve a compromise between these two concerns. 
7.3 Assigning cell labels from gene sets A related strategy is to explicitly identify sets of marker genes that are highly expressed in each individual cell. This does not require matching of individual cells to the expression values of the reference dataset, which is faster and more convenient when only the identities of the markers are available. We demonstrate this approach using neuronal cell type markers derived from the Zeisel et al. (2015) study. View set-up code (Workflow Chapter 2) #--- loading ---# library(scRNAseq) sce.zeisel &lt;- ZeiselBrainData() library(scater) sce.zeisel &lt;- aggregateAcrossFeatures(sce.zeisel, id=sub(&quot;_loc[0-9]+$&quot;, &quot;&quot;, rownames(sce.zeisel))) #--- gene-annotation ---# library(org.Mm.eg.db) rowData(sce.zeisel)$Ensembl &lt;- mapIds(org.Mm.eg.db, keys=rownames(sce.zeisel), keytype=&quot;SYMBOL&quot;, column=&quot;ENSEMBL&quot;) #--- quality-control ---# stats &lt;- perCellQCMetrics(sce.zeisel, subsets=list( Mt=rowData(sce.zeisel)$featureType==&quot;mito&quot;)) qc &lt;- quickPerCellQC(stats, percent_subsets=c(&quot;altexps_ERCC_percent&quot;, &quot;subsets_Mt_percent&quot;)) sce.zeisel &lt;- sce.zeisel[,!qc$discard] #--- normalization ---# library(scran) set.seed(1000) clusters &lt;- quickCluster(sce.zeisel) sce.zeisel &lt;- computeSumFactors(sce.zeisel, cluster=clusters) sce.zeisel &lt;- logNormCounts(sce.zeisel) library(scran) wilcox.z &lt;- pairwiseWilcox(sce.zeisel, sce.zeisel$level1class, lfc=1, direction=&quot;up&quot;) markers.z &lt;- getTopMarkers(wilcox.z$statistics, wilcox.z$pairs, pairwise=FALSE, n=50) lengths(markers.z) ## astrocytes_ependymal endothelial-mural interneurons ## 79 83 118 ## microglia oligodendrocytes pyramidal CA1 ## 69 81 125 ## pyramidal SS ## 149 Our test dataset will be another brain scRNA-seq experiment from Tasic et al. (2016). 
library(scRNAseq) sce.tasic &lt;- TasicBrainData() sce.tasic ## class: SingleCellExperiment ## dim: 24058 1809 ## metadata(0): ## assays(1): counts ## rownames(24058): 0610005C13Rik 0610007C21Rik ... mt_X57780 tdTomato ## rowData names(0): ## colnames(1809): Calb2_tdTpositive_cell_1 Calb2_tdTpositive_cell_2 ... ## Rbp4_CTX_250ng_2 Trib2_CTX_250ng_1 ## colData names(13): sample_title mouse_line ... secondary_type ## aibs_vignette_id ## reducedDimNames(0): ## mainExpName: endogenous ## altExpNames(1): ERCC We use the AUCell package to identify marker sets that are highly expressed in each cell. This method ranks genes by their expression values within each cell and constructs a response curve of the number of genes from each marker set that are present with increasing rank. It then computes the area under the curve (AUC) for each marker set, quantifying the enrichment of those markers among the most highly expressed genes in that cell. This is roughly similar to performing a Wilcoxon rank sum test between genes in and outside of the set, but involving only the top ranking genes by expression in each cell. 
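Before applying the real thing, the intuition behind the AUC can be sketched in base R on a toy example. This is purely illustrative - the actual AUCell implementation only integrates over the top-ranking genes, so the numbers here will not match AUCell's output; the gene names and marker set below are made up.

```r
# Toy illustration of the AUCell idea: rank genes by expression within
# one cell, then measure how quickly a marker set is "recovered" as we
# walk down the ranking.
set.seed(42)
expr <- rpois(1000, lambda=5)             # expression values for one cell
names(expr) <- paste0("gene", seq_along(expr))
markers <- paste0("gene", 1:20)           # hypothetical marker set

rnk <- rank(-expr, ties.method="random")  # rank 1 = most highly expressed
hits <- sort(rnk[markers])                # ranks at which markers appear

# Recovery curve: cumulative number of markers found at each rank.
recovery <- cumsum(seq_along(expr) %in% hits)

# Integrating the curve (here, over the full ranking) gives an AUC-style
# score, scaled to [0, 1]; a random set sits near 0.5, a set enriched
# among the top-ranking genes scores higher.
auc <- mean(recovery) / length(markers)
auc
```

Because the score depends only on ranks, it is insensitive to normalization of the raw counts, which is why raw `counts()` can be passed to `AUCell_buildRankings()` above.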
library(GSEABase) all.sets &lt;- lapply(names(markers.z), function(x) { GeneSet(markers.z[[x]], setName=x) }) all.sets &lt;- GeneSetCollection(all.sets) library(AUCell) rankings &lt;- AUCell_buildRankings(counts(sce.tasic), plotStats=FALSE, verbose=FALSE) cell.aucs &lt;- AUCell_calcAUC(all.sets, rankings) results &lt;- t(assay(cell.aucs)) head(results) ## gene sets ## cells astrocytes_ependymal endothelial-mural interneurons ## Calb2_tdTpositive_cell_1 0.1387 0.04264 0.5306 ## Calb2_tdTpositive_cell_2 0.1366 0.04885 0.4538 ## Calb2_tdTpositive_cell_3 0.1087 0.07270 0.3459 ## Calb2_tdTpositive_cell_4 0.1322 0.04993 0.5113 ## Calb2_tdTpositive_cell_5 0.1513 0.07161 0.4930 ## Calb2_tdTpositive_cell_6 0.1342 0.09161 0.3378 ## gene sets ## cells microglia oligodendrocytes pyramidal CA1 ## Calb2_tdTpositive_cell_1 0.04845 0.1318 0.2318 ## Calb2_tdTpositive_cell_2 0.02683 0.1211 0.2063 ## Calb2_tdTpositive_cell_3 0.03583 0.1567 0.3219 ## Calb2_tdTpositive_cell_4 0.05388 0.1481 0.2547 ## Calb2_tdTpositive_cell_5 0.06656 0.1386 0.2088 ## Calb2_tdTpositive_cell_6 0.03201 0.1553 0.4011 ## gene sets ## cells pyramidal SS ## Calb2_tdTpositive_cell_1 0.3477 ## Calb2_tdTpositive_cell_2 0.2762 ## Calb2_tdTpositive_cell_3 0.5244 ## Calb2_tdTpositive_cell_4 0.3506 ## Calb2_tdTpositive_cell_5 0.3010 ## Calb2_tdTpositive_cell_6 0.5393 We assign cell type identity to each cell in the test dataset by taking the marker set with the top AUC as the label for that cell. Our new labels mostly agree with the original annotation from Tasic et al. (2016), which is encouraging. The only exception involves misassignment of oligodendrocyte precursors to astrocytes, which may be understandable given that they are derived from a common lineage. 
In the absence of prior annotation, a more general diagnostic check is to compare the assigned labels to cluster identities, under the expectation that most cells of a single cluster would have the same label (or, if multiple labels are present, they should at least represent closely related cell states). new.labels &lt;- colnames(results)[max.col(results)] tab &lt;- table(new.labels, sce.tasic$broad_type) tab ## ## new.labels Astrocyte Endothelial Cell GABA-ergic Neuron ## astrocytes_ependymal 43 2 0 ## endothelial-mural 0 27 0 ## interneurons 0 0 759 ## microglia 0 0 0 ## oligodendrocytes 0 0 1 ## pyramidal SS 0 0 1 ## ## new.labels Glutamatergic Neuron Microglia Oligodendrocyte ## astrocytes_ependymal 0 0 0 ## endothelial-mural 0 0 0 ## interneurons 2 0 0 ## microglia 0 22 0 ## oligodendrocytes 0 0 38 ## pyramidal SS 810 0 0 ## ## new.labels Oligodendrocyte Precursor Cell Unclassified ## astrocytes_ependymal 20 4 ## endothelial-mural 0 2 ## interneurons 0 15 ## microglia 0 1 ## oligodendrocytes 2 0 ## pyramidal SS 0 60 As a diagnostic measure, we examine the distribution of AUCs across cells for each label (Figure 7.4). In heterogeneous populations, the distribution for each label should be bimodal with one high-scoring peak containing cells of that cell type and a low-scoring peak containing cells of other types. The gap between these two peaks can be used to derive a threshold for whether a label is “active” for a particular cell. (In this case, we simply take the single highest-scoring label per cell as the labels should be mutually exclusive.) In populations where a particular cell type is expected, lack of clear bimodality for the corresponding label may indicate that its gene set is not sufficiently informative. par(mfrow=c(3,3)) AUCell_exploreThresholds(cell.aucs, plotHist=TRUE, assign=TRUE) Figure 7.4: Distribution of AUCs in the Tasic brain dataset for each label in the Zeisel dataset. 
The blue curve represents the density estimate, the red curve represents a fitted two-component mixture of normals, the pink curve represents a fitted three-component mixture, and the grey curve represents a fitted normal distribution. Vertical lines represent threshold estimates corresponding to each estimate of the distribution. Interpretation of the AUCell results is most straightforward when the marker sets are mutually exclusive, as shown above for the cell type markers. In other applications, one might consider computing AUCs for gene sets associated with signalling or metabolic pathways. It is likely that multiple pathways will be active in any given cell, and it is tempting to use the AUCs to quantify this activity for comparison across cells. However, such comparisons must be interpreted with much caution as the AUCs are competitive values - any increase in one pathway’s activity will naturally reduce the AUCs for all other pathways, potentially resulting in spurious differences across the population. As we mentioned previously, the advantage of the AUCell approach is that it does not require reference expression values. This is particularly useful when dealing with gene sets derived from the literature or other qualitative forms of biological knowledge. For example, we might instead use single-cell signatures defined from MSigDB, obtained as shown below. # Downloading the signatures and caching them locally. library(BiocFileCache) bfc &lt;- BiocFileCache(ask=FALSE) scsig.path &lt;- bfcrpath(bfc, file.path(&quot;http://software.broadinstitute.org&quot;, &quot;gsea/msigdb/supplemental/scsig.all.v1.0.symbols.gmt&quot;)) scsigs &lt;- getGmt(scsig.path) The flipside is that information on relative expression is lost when only the marker identities are used. 
The net effect of ignoring expression values is difficult to predict; for example, it may reduce performance for resolving more subtle cell types, but may also improve performance if the per-cell expression was too noisy to be useful. Performance is also highly dependent on the gene sets themselves, which may not be defined in the same context in which they are used. For example, applying all of the MSigDB signatures on the Muraro dataset is rather disappointing (Figure 7.5), while restricting to the subset of pancreas signatures is more promising.

muraro.mat <- counts(sce.muraro)
rownames(muraro.mat) <- rowData(sce.muraro)$symbol
muraro.rankings <- AUCell_buildRankings(muraro.mat,
    plotStats=FALSE, verbose=FALSE)

# Applying all MSigDB signatures to the Muraro dataset, because it's human:
scsig.aucs <- AUCell_calcAUC(scsigs, muraro.rankings)
scsig.results <- t(assay(scsig.aucs))
full.labels <- colnames(scsig.results)[max.col(scsig.results)]
tab <- table(full.labels, sce.muraro$label)
fullheat <- pheatmap(log10(tab+10), color=viridis::viridis(100), silent=TRUE)

# Restricting to the subset of pancreas-related gene sets:
scsigs.sub <- scsigs[grep("Pancreas", names(scsigs))]
sub.aucs <- AUCell_calcAUC(scsigs.sub, muraro.rankings)
sub.results <- t(assay(sub.aucs))
sub.labels <- colnames(sub.results)[max.col(sub.results)]
tab <- table(sub.labels, sce.muraro$label)
subheat <- pheatmap(log10(tab+10), color=viridis::viridis(100), silent=TRUE)

gridExtra::grid.arrange(fullheat[[4]], subheat[[4]])

Figure 7.5: Heatmaps of the log-number of cells with each combination of known labels (columns) and assigned MSigDB signatures (rows) in the Muraro data set. The signature assigned to each cell was defined as that with the highest AUC across all signatures (top) or across only the pancreas-related signatures (bottom). 
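A quick sanity check in the same vein - not part of the original workflow, and assuming the `scsigs` and `muraro.mat` objects from above - is to ask how many genes in each externally derived signature are actually present in the dataset. Signatures with poor coverage (e.g., due to mismatched gene identifiers or species) will produce unreliable AUCs regardless of their biological relevance.

```r
# Sketch: fraction of each signature's genes found among the dataset's
# row names; geneIds() on a GeneSetCollection returns a list of
# character vectors, one per signature.
library(GSEABase)
coverage <- vapply(geneIds(scsigs), function(genes) {
    mean(genes %in% rownames(muraro.mat))
}, numeric(1))
summary(coverage)

# Flag signatures where fewer than half the genes are present.
head(names(scsigs)[coverage < 0.5])
```

Low-coverage signatures are candidates for removal before computing AUCs, or at least for cautious interpretation of their scores.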
7.4 Assigning cluster labels from markers Yet another strategy for annotation is to perform a gene set enrichment analysis on the marker genes defining each cluster. This identifies the pathways and processes that are (relatively) active in each cluster based on upregulation of the associated genes compared to other clusters. We demonstrate on the mouse mammary dataset from Bach et al. (2017), obtaining annotations for the marker genes that define cluster 2. Specifically, we define our marker subset as the top 100 genes with the largest median Cohen’s \\(d\\) (Chapter 6). View set-up code (Workflow Chapter 12) #--- loading ---# library(scRNAseq) sce.mam &lt;- BachMammaryData(samples=&quot;G_1&quot;) #--- gene-annotation ---# library(scater) rownames(sce.mam) &lt;- uniquifyFeatureNames( rowData(sce.mam)$Ensembl, rowData(sce.mam)$Symbol) library(AnnotationHub) ens.mm.v97 &lt;- AnnotationHub()[[&quot;AH73905&quot;]] rowData(sce.mam)$SEQNAME &lt;- mapIds(ens.mm.v97, keys=rowData(sce.mam)$Ensembl, keytype=&quot;GENEID&quot;, column=&quot;SEQNAME&quot;) #--- quality-control ---# is.mito &lt;- rowData(sce.mam)$SEQNAME == &quot;MT&quot; stats &lt;- perCellQCMetrics(sce.mam, subsets=list(Mito=which(is.mito))) qc &lt;- quickPerCellQC(stats, percent_subsets=&quot;subsets_Mito_percent&quot;) sce.mam &lt;- sce.mam[,!qc$discard] #--- normalization ---# library(scran) set.seed(101000110) clusters &lt;- quickCluster(sce.mam) sce.mam &lt;- computeSumFactors(sce.mam, clusters=clusters) sce.mam &lt;- logNormCounts(sce.mam) #--- variance-modelling ---# set.seed(00010101) dec.mam &lt;- modelGeneVarByPoisson(sce.mam) top.mam &lt;- getTopHVGs(dec.mam, prop=0.1) #--- dimensionality-reduction ---# library(BiocSingular) set.seed(101010011) sce.mam &lt;- denoisePCA(sce.mam, technical=dec.mam, subset.row=top.mam) sce.mam &lt;- runTSNE(sce.mam, dimred=&quot;PCA&quot;) #--- clustering ---# snn.gr &lt;- buildSNNGraph(sce.mam, use.dimred=&quot;PCA&quot;, k=25) colLabels(sce.mam) &lt;- 
factor(igraph::cluster_walktrap(snn.gr)$membership) markers.mam &lt;- scoreMarkers(sce.mam, lfc=1) chosen &lt;- &quot;2&quot; cur.markers &lt;- markers.mam[[chosen]] is.de &lt;- order(cur.markers$median.logFC.cohen, decreasing=TRUE)[1:100] cur.markers[is.de,1:4] ## DataFrame with 100 rows and 4 columns ## self.average other.average self.detected other.detected ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Csn2 8.91675 4.34354 1.000000 0.989293 ## Wfdc18 7.89056 3.42930 1.000000 0.836166 ## Csn1s1 7.83704 3.64794 1.000000 0.935088 ## Csn1s2a 9.21795 3.91963 1.000000 0.960138 ## Muc15 4.77990 1.57096 0.998748 0.446035 ## ... ... ... ... ... ## Ppp1cb 3.00484 1.921274 0.993742 0.820306 ## Cib1 2.38490 1.212979 0.942428 0.588033 ## Timm13 2.29062 1.542706 0.948686 0.749735 ## Prdx1 4.06969 3.364301 0.997497 0.975200 ## Pycard 1.88745 0.938662 0.881101 0.521319 We test for enrichment of gene sets defined by the Gene Ontology (GO) project, which describe a comprehensive range of biological processes and functions. The simplest implementation of this approach involves calling the goana() function from the limma package. This performs a hypergeometric test to identify GO terms that are overrepresented in our marker subset. # goana() requires Entrez IDs, some of which map to multiple # symbols - hence the unique() in the call below. library(org.Mm.eg.db) entrez.ids &lt;- mapIds(org.Mm.eg.db, keys=rownames(cur.markers), column=&quot;ENTREZID&quot;, keytype=&quot;SYMBOL&quot;) library(limma) go.out &lt;- goana(unique(entrez.ids[is.de]), species=&quot;Mm&quot;, universe=unique(entrez.ids)) # Only keeping biological process terms that are not overly general. 
go.out &lt;- go.out[order(go.out$P.DE),] go.useful &lt;- go.out[go.out$Ont==&quot;BP&quot; &amp; go.out$N &lt;= 200,] head(go.useful[,c(1,3,4)], 30) ## Term N DE ## GO:0006641 triglyceride metabolic process 106 8 ## GO:0006639 acylglycerol metabolic process 140 8 ## GO:0009060 aerobic respiration 141 8 ## GO:0006638 neutral lipid metabolic process 142 8 ## GO:1905954 positive regulation of lipid localization 120 7 ## GO:0045333 cellular respiration 193 8 ## GO:0006119 oxidative phosphorylation 94 6 ## GO:0010884 positive regulation of lipid storage 25 4 ## GO:1905952 regulation of lipid localization 177 7 ## GO:0070542 response to fatty acid 34 4 ## GO:0019432 triglyceride biosynthetic process 38 4 ## GO:0009152 purine ribonucleotide biosynthetic process 138 6 ## GO:0019646 aerobic electron transport chain 43 4 ## GO:0019915 lipid storage 89 5 ## GO:1902600 proton transmembrane transport 90 5 ## GO:0006857 oligopeptide transport 16 3 ## GO:0009260 ribonucleotide biosynthetic process 150 6 ## GO:0015985 energy coupled proton transport, down electrochemical gradient 17 3 ## GO:0015986 ATP synthesis coupled proton transport 17 3 ## GO:0001838 embryonic epithelial tube formation 155 6 ## GO:0006164 purine nucleotide biosynthetic process 156 6 ## GO:0055095 lipoprotein particle mediated signaling 3 2 ## GO:0055096 low-density lipoprotein particle mediated signaling 3 2 ## GO:0046390 ribose phosphate biosynthetic process 159 6 ## GO:0046460 neutral lipid biosynthetic process 50 4 ## GO:0046463 acylglycerol biosynthetic process 50 4 ## GO:0072522 purine-containing compound biosynthetic process 162 6 ## GO:0044539 long-chain fatty acid import into cell 19 3 ## GO:0140354 lipid import into cell 19 3 ## GO:0072175 epithelial tube formation 165 6 We see an enrichment for genes involved in lipid storage, lipid synthesis and tube formation. 
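The hypergeometric test that goana() performs can be sketched for a single term using the counts reported above: 8 of our roughly 100 markers fall in the 106-gene "triglyceride metabolic process" term (GO:0006641). The universe size below is a made-up placeholder - goana() uses the actual number of genes supplied via its universe= argument - so the p-value is illustrative only.

```r
# Manual hypergeometric test for one GO term, mirroring goana()'s logic.
n.universe <- 15000   # assumed universe size (placeholder)
n.set      <- 106     # genes annotated to GO:0006641 (from the table above)
n.markers  <- 100     # size of our marker subset
n.overlap  <- 8       # markers annotated to the term (from the table above)

# P(overlap >= 8) when drawing n.markers genes without replacement
# from a universe containing n.set annotated genes.
p <- phyper(n.overlap - 1, n.set, n.universe - n.set,
            n.markers, lower.tail=FALSE)
p
```

Under these assumptions the expected overlap by chance is under one gene, so an overlap of 8 yields a very small p-value, consistent with the term's high ranking in the goana() output.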
Given that this is a mammary gland experiment, we might guess that cluster 2 contains luminal epithelial cells responsible for milk production and secretion. Indeed, a closer examination of the marker list indicates that this cluster upregulates milk proteins Csn2 and Csn3 (Figure 7.6). library(scater) plotExpression(sce.mam, features=c(&quot;Csn2&quot;, &quot;Csn3&quot;), x=&quot;label&quot;, colour_by=&quot;label&quot;) Figure 7.6: Distribution of log-expression values for Csn2 and Csn3 in each cluster. Further inspection of interesting GO terms is achieved by extracting the relevant genes. This is usually desirable to confirm that the interpretation of the annotated biological process is appropriate. Many terms have overlapping gene sets, so a term may only be highly ranked because it shares genes with a more relevant term that represents the active pathway. # Extract symbols for each GO term; done once. tab &lt;- select(org.Mm.eg.db, keytype=&quot;SYMBOL&quot;, keys=rownames(sce.mam), columns=&quot;GOALL&quot;) by.go &lt;- split(tab[,1], tab[,2]) # Identify genes associated with an interesting term. 
interesting &lt;- unique(by.go[[&quot;GO:0019432&quot;]]) interesting.markers &lt;- cur.markers[rownames(cur.markers) %in% interesting,] head(interesting.markers[order(-interesting.markers$median.logFC.cohen),1:4], 10) ## DataFrame with 10 rows and 4 columns ## self.average other.average self.detected other.detected ## &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; &lt;numeric&gt; ## Thrsp 2.694174 0.538882 0.914894 0.259116 ## Acsl1 1.445731 0.394331 0.763454 0.263945 ## C3 1.835239 0.711535 0.834793 0.374558 ## Lpl 2.123419 1.256942 0.909887 0.517217 ## Srebf1 0.897006 0.364144 0.578223 0.278124 ## Tcf7l2 1.200509 0.793351 0.693367 0.466360 ## Acsl4 0.727051 0.332819 0.513141 0.253460 ## Gpat4 0.747920 0.490298 0.546934 0.355464 ## Lpin1 0.467945 0.222698 0.376721 0.189425 ## Lpgat1 0.468181 0.416764 0.372966 0.291568 Gene set testing of marker lists is a reliable approach for determining if pathways are up- or down-regulated between clusters. As the top marker genes are simply DEGs, we can directly apply well-established procedures for testing gene enrichment in DEG lists (see here for relevant packages). This contrasts with the AUCell approach where scores are not easily comparable across cells. The downside is that all conclusions are made relative to the other clusters, making it more difficult to determine cell identity if an “outgroup” is not present in the same study. 7.5 Computing gene set activities For the sake of completeness, we should mention that we can also quantify gene set activity on a per-cell level and test for differences in activity. This inverts the standard gene set testing procedure by combining information across genes first and then testing for differences afterwards. To avoid the pitfalls mentioned previously for the AUCs, we simply compute the average of the log-expression values across all genes in the set for each cell. 
This is less sensitive to the behavior of other genes in that cell (aside from composition biases, as discussed in Chapter 2). aggregated &lt;- sumCountsAcrossFeatures(sce.mam, by.go, exprs_values=&quot;logcounts&quot;, average=TRUE) dim(aggregated) # rows are gene sets, columns are cells ## [1] 22607 2772 aggregated[1:10,1:5] ## [,1] [,2] [,3] [,4] [,5] ## GO:0000002 0.33417 0.3324 0.08332 0.2714 0.2458 ## GO:0000003 0.24231 0.2503 0.19379 0.1708 0.1945 ## GO:0000009 0.00000 0.0000 0.00000 0.0000 0.0000 ## GO:0000010 0.00000 0.0000 0.00000 0.0000 0.0000 ## GO:0000012 0.31107 0.4095 0.15848 0.0000 0.3071 ## GO:0000014 0.07573 0.1744 0.23772 0.2535 0.2844 ## GO:0000015 0.76686 0.4295 0.47544 0.5071 0.8043 ## GO:0000016 0.00000 0.0000 0.00000 0.0000 0.0000 ## GO:0000017 0.00000 0.0000 0.00000 0.0000 0.6636 ## GO:0000018 0.24713 0.2661 0.06876 0.1207 0.1561 We can then identify “differential gene set activity” between clusters by looking for significant differences in the per-set averages of the relevant cells. For example, we observe that cluster 2 has the highest average expression for the triacylglycerol biosynthesis GO term (Figure 7.7), consistent with the proposed identity of those cells. plotColData(sce.mam, y=I(aggregated[&quot;GO:0019432&quot;,]), x=&quot;label&quot;) Figure 7.7: Distribution of average log-normalized expression for genes involved in triacylglycerol biosynthesis, for all cells in each cluster of the mammary gland dataset. The obvious disadvantage of this approach is that not all genes in the set may exhibit the same pattern of differences. Non-DE genes will add noise to the per-set average, “diluting” the strength of any differences compared to an analysis that focuses directly on the DE genes (Figure 7.8). At worst, a gene set may contain subsets of DE genes that change in opposite directions, cancelling out any differences in the per-set average. 
This is not uncommon for gene sets that contain both positive and negative regulators of a particular biological process or pathway. # Choose the top-ranking gene in GO:0019432. plotExpression(sce.mam, &quot;Thrsp&quot;, x=&quot;label&quot;) Figure 7.8: Distribution of log-normalized expression values for Thrsp across all cells in each cluster of the mammary gland dataset. We could attempt to use the per-set averages to identify gene sets of interest via differential testing across all possible sets, e.g., with findMarkers(). However, the highest ranking gene sets in this approach tend to be very small and uninteresting because - by definition - the pitfalls mentioned above are avoided when there is only one gene in the set. This is compounded by the fact that the log-fold changes in the per-set averages are difficult to interpret. For these reasons, we generally reserve the use of this gene set summary statistic for visualization rather than any real statistical analysis. Session Info View session info R version 4.1.1 (2021-08-10) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.3 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_GB [4] LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] scater_1.22.0 ggplot2_3.3.5 limma_3.50.0 [4] org.Mm.eg.db_3.14.0 AUCell_1.16.0 GSEABase_1.56.0 [7] graph_1.72.0 annotate_1.72.0 XML_3.99-0.8 [10] scRNAseq_2.7.2 scran_1.22.0 bluster_1.4.0 [13] scuttle_1.4.0 ensembldb_2.18.0 AnnotationFilter_1.18.0 [16] GenomicFeatures_1.46.0 AnnotationDbi_1.56.0 AnnotationHub_3.2.0 [19] BiocFileCache_2.2.0 dbplyr_2.1.1 pheatmap_1.0.12 [22] 
SingleR_1.8.0 celldex_1.3.0 SingleCellExperiment_1.16.0 [25] SummarizedExperiment_1.24.0 Biobase_2.54.0 GenomicRanges_1.46.0 [28] GenomeInfoDb_1.30.0 IRanges_2.28.0 S4Vectors_0.32.0 [31] BiocGenerics_0.40.0 MatrixGenerics_1.6.0 matrixStats_0.61.0 [34] BiocStyle_2.22.0 rebook_1.4.0 loaded via a namespace (and not attached): [1] utf8_1.2.2 R.utils_2.11.0 tidyselect_1.1.1 [4] RSQLite_2.2.8 grid_4.1.1 BiocParallel_1.28.0 [7] munsell_0.5.0 ScaledMatrix_1.2.0 codetools_0.2-18 [10] statmod_1.4.36 withr_2.4.2 colorspace_2.0-2 [13] filelock_1.0.2 highr_0.9 knitr_1.36 [16] labeling_0.4.2 GenomeInfoDbData_1.2.7 bit64_4.0.5 [19] farver_2.1.0 vctrs_0.3.8 generics_0.1.1 [22] xfun_0.27 R6_2.5.1 ggbeeswarm_0.6.0 [25] rsvd_1.0.5 locfit_1.5-9.4 bitops_1.0-7 [28] cachem_1.0.6 DelayedArray_0.20.0 assertthat_0.2.1 [31] promises_1.2.0.1 BiocIO_1.4.0 scales_1.1.1 [34] beeswarm_0.4.0 gtable_0.3.0 beachmat_2.10.0 [37] rlang_0.4.12 splines_4.1.1 rtracklayer_1.54.0 [40] lazyeval_0.2.2 BiocManager_1.30.16 yaml_2.2.1 [43] httpuv_1.6.3 tools_4.1.1 bookdown_0.24 [46] ellipsis_0.3.2 jquerylib_0.1.4 RColorBrewer_1.1-2 [49] Rcpp_1.0.7 sparseMatrixStats_1.6.0 progress_1.2.2 [52] zlibbioc_1.40.0 purrr_0.3.4 RCurl_1.98-1.5 [55] prettyunits_1.1.1 viridis_0.6.2 cowplot_1.1.1 [58] ggrepel_0.9.1 cluster_2.1.2 magrittr_2.0.1 [61] data.table_1.14.2 ProtGenerics_1.26.0 hms_1.1.1 [64] mime_0.12 evaluate_0.14 xtable_1.8-4 [67] gridExtra_2.3 compiler_4.1.1 biomaRt_2.50.0 [70] tibble_3.1.5 crayon_1.4.1 R.oo_1.24.0 [73] htmltools_0.5.2 segmented_1.3-4 later_1.3.0 [76] DBI_1.1.1 ExperimentHub_2.2.0 MASS_7.3-54 [79] rappdirs_0.3.3 Matrix_1.3-4 R.methodsS3_1.8.1 [82] parallel_4.1.1 metapod_1.2.0 igraph_1.2.7 [85] pkgconfig_2.0.3 GenomicAlignments_1.30.0 dir.expiry_1.2.0 [88] xml2_1.3.2 vipor_0.4.5 bslib_0.3.1 [91] dqrng_0.3.0 XVector_0.34.0 stringr_1.4.0 [94] digest_0.6.28 Biostrings_2.62.0 rmarkdown_2.11 [97] edgeR_3.36.0 DelayedMatrixStats_1.16.0 restfulr_0.0.13 [100] curl_4.3.2 kernlab_0.9-29 shiny_1.7.1 [103] 
Rsamtools_2.10.0 rjson_0.2.20 lifecycle_1.0.1 [106] jsonlite_1.7.2 BiocNeighbors_1.12.0 CodeDepends_0.6.5 [109] viridisLite_0.4.0 fansi_0.5.0 pillar_1.6.4 [112] lattice_0.20-45 KEGGREST_1.34.0 fastmap_1.1.0 [115] httr_1.4.2 survival_3.2-13 GO.db_3.14.0 [118] interactiveDisplayBase_1.32.0 glue_1.4.2 png_0.1-7 [121] BiocVersion_3.14.0 bit_4.0.4 stringi_1.7.5 [124] sass_0.4.0 mixtools_1.2.0 blob_1.2.2 [127] BiocSingular_1.10.0 memoise_2.0.0 dplyr_1.0.7 [130] irlba_2.3.3 "]]
