UMAP is commonly used in scRNA-seq data analysis as a visualization tool
projecting high dimensional data onto 2 dimensions to visualize cell clustering.
However, UMAP is prone to showing spurious clustering and distorting distances
(Chari, Banerjee, and Pachter 2021). Moreover, UMAP shows clustering that seems to correspond to
graph-based clusters from Louvain and Leiden because the k nearest neighbor
graph is used in both clustering and UMAP. We have developed concordex as a
quantitative alternative to UMAP cluster visualization without the misleading
problems of UMAP. This package is the R implementation of the original Python
command line tool.
In a nutshell, concordex finds the proportion of cells among the k-nearest
neighbors of each cell with the same cluster or label as the cell itself. This
is computed across all labels and the average of all labels is returned as a
metric that indicates the quality of clustering. If the clustering separates cells
well, then the observed similarity matrix should be diagonally dominant.
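As a rough illustration of the idea described above, the following base R sketch computes the fraction of same-label neighbors on a small toy dataset. It is not the concordexR implementation: the toy coordinates, the brute-force neighbor search, and the names `knn_idx`, `frac_same`, `per_label`, and `toy_score` are all assumptions made for this example.

```r
# Toy data (hypothetical): two groups of 20 points each, far apart in 2D
x <- rbind(cbind(seq(0, 1, length.out = 20), 0),
           cbind(seq(0, 1, length.out = 20), 10))
labels <- rep(c("a", "b"), each = 20)
k <- 5

# Brute-force k nearest neighbors from the pairwise distance matrix
d <- as.matrix(dist(x))
diag(d) <- Inf  # a point is not its own neighbor
knn_idx <- t(apply(d, 1, function(row) order(row)[seq_len(k)]))

# Fraction of each point's neighbors that share its label
same_label <- matrix(labels[knn_idx], nrow = nrow(knn_idx)) == labels
frac_same <- rowMeans(same_label)

# Average within each label, then average across labels
per_label <- tapply(frac_same, labels, mean)
toy_score <- mean(per_label)
toy_score
```

Because the two toy groups are far apart, every neighbor shares its point's label and the score is at its maximum; labels that cut across neighborhoods would pull it down.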
library(concordexR)
library(TENxPBMCData)
library(BiocNeighbors)
library(bluster)
library(scater)
library(patchwork)
library(ggplot2)
theme_set(theme_bw())
In this vignette, we demonstrate the usage of concordex on a human peripheral
blood mononuclear cells (PBMC) scRNA-seq dataset from 10X Genomics. The data is
loaded as a SingleCellExperiment object.
sce <- TENxPBMCData("pbmc3k")
#> see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation
#> loading from cache
Here we plot the standard QC metrics: total number of UMIs detected per cell
(nCounts), number of genes detected (nGenes), and percentage of UMIs from
mitochondrially encoded genes (pct_mito).
sce$nCounts <- colSums(counts(sce))
sce$nGenes <- colSums(counts(sce) > 0)
mito_inds <- grepl("^MT-", rowData(sce)$Symbol_TENx)
sce$pct_mito <- colSums(counts(sce)[mito_inds,])/sce$nCounts * 100
plotColData(sce, "nCounts") +
  plotColData(sce, "nGenes") +
  plotColData(sce, "pct_mito")
p1 <- plotColData(sce, x = "nCounts", y = "nGenes") +
  geom_density2d()
p2 <- plotColData(sce, x = "nCounts", y = "pct_mito") +
  geom_density2d()
p1 + p2
Remove the outliers and cells with a high percentage of mitochondrial counts, as a high percentage is not expected biologically for this cell type:
sce <- sce[, sce$nCounts < 10000 & sce$pct_mito < 8]
sce <- sce[rowSums(counts(sce)) > 0,]
Then normalize the data:
sce <- logNormCounts(sce)
For simplicity, the top 500 highly variable genes are used to perform PCA:
sce <- runPCA(sce, ncomponents = 30, ntop = 500, scale = TRUE)
Use the elbow plot to choose the number of PCs to use later:
plot(attr(reducedDim(sce, "PCA"), "percentVar"), ylab = "Percentage of variance explained")
The percentage of variance explained drops sharply from PC1 to PC5 and
levels off after PC10, so we use the top 10 PCs for clustering here. The
graph-based Leiden clustering uses a k nearest neighbor graph. For demonstration
here, we use k = 10.
set.seed(29)
sce$cluster <- clusterRows(reducedDim(sce, "PCA")[,seq_len(10)],
                           NNGraphParam(k = 10, cluster.fun = "leiden",
                                        cluster.args = list(
                                          objective_function = "modularity"
                                        )))
See what the clusters look like in PCA space:
plotPCA(sce, color_by = "cluster", ncomponents = 4)
#> Warning in data.frame(gg1$all, df_to_plot[, -reddim_cols]): row names were
#> found from a short variable and have been discarded
Some of the clusters seem well-separated along the first 4 PCs.
Since UMAP is commonly used to visualize the clusters, we plot UMAP here
although we don’t recommend UMAP because it’s prone to showing spurious clusters
and distorting distances. UMAP also uses a k nearest neighbor graph, and we use
the same k = 10 here:
sce <- runUMAP(sce, dimred = "PCA", n_dimred = 10, n_neighbors = 10)
plotUMAP(sce, color_by = "cluster")
For the most part, the clusters are clearly separated on UMAP.

concordex

Since UMAP is prone to showing spurious clusters, we’ll see what the concordex
metric says about the clustering and whether it agrees with UMAP visualization.
Here we explicitly obtain the k nearest neighbor graph, as clustering and UMAP
above did not store the graph itself.
g <- findKNN(reducedDim(sce, "PCA")[,seq_len(10)], k = 10)
The result here is a list of two n (number of cells) by k matrices. The first
holds the indices of each cell’s neighbors, as in an adjacency list that can be
stored as a matrix here because the number of neighbors is fixed, and the second
holds the distances between each cell and its neighbors. For concordex, only the
first matrix is relevant. An adjacency matrix, either sparse or dense, as stored in
the Seurat object, can also be used. Here the cluster labels are permuted 100
times.
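To make the relation between the two neighbor representations concrete, here is a minimal base R sketch that turns an n by k neighbor-index matrix into a dense n by n adjacency matrix. The toy matrix `knn_idx` is hypothetical and stands in for the index matrix described above; this is only an illustration, not how concordexR handles its inputs internally.

```r
# Hypothetical neighbor-index matrix: 4 cells, k = 2 neighbors each
knn_idx <- rbind(c(2, 3),
                 c(1, 3),
                 c(2, 4),
                 c(3, 2))
n <- nrow(knn_idx)
k <- ncol(knn_idx)

# Dense adjacency matrix: adj[i, j] is 1 if cell j is a neighbor of cell i
adj <- matrix(0, nrow = n, ncol = n)
adj[cbind(rep(seq_len(n), times = k), as.vector(knn_idx))] <- 1

# Every row has exactly k ones because the number of neighbors is fixed
rowSums(adj)
```

For real datasets a sparse representation is usually preferable, since only n times k of the n squared entries are nonzero.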
res <- calculateConcordex(
    sce, 
    labels="cluster", 
    use.dimred="PCA",
    compute_similarity=TRUE
)
Here the argument compute_similarity indicates that concordex will return
the cluster-cluster similarity matrix. The entries of this matrix represent
the proportion of cells with each label among the neighbors of cells with a
given label, so the diagonal holds the proportion of neighbors sharing the same label.
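The structure of such a label-by-label matrix can be sketched in base R on toy inputs. Everything here is an assumption for illustration (the toy `labels`, the toy `knn_idx`, and the row normalization chosen so that each row sums to 1); it is not taken from the concordexR source.

```r
# Hypothetical toy inputs: 6 cells, k = 2 neighbors each, two labels
labels <- factor(c("a", "a", "a", "b", "b", "b"))
knn_idx <- rbind(c(2, 3),
                 c(1, 3),
                 c(2, 4),  # cell 3 has one neighbor with the other label
                 c(5, 6),
                 c(4, 6),
                 c(4, 5))

# Cross-tabulate each cell's label (rows) against its neighbors' labels (columns)
cell_label <- labels[row(knn_idx)]
neighbor_label <- labels[knn_idx]
counts <- table(cell = cell_label, neighbor = neighbor_label)

# Row-normalize so that each row sums to 1
toy_sim <- unclass(prop.table(counts, margin = 1))
round(toy_sim, 2)

# Diagonal dominance by rows: each diagonal entry is at least the sum of the
# other entries in its row
off_diag_sum <- rowSums(toy_sim) - diag(toy_sim)
diag_dominant <- all(diag(toy_sim) >= off_diag_sum)
diag_dominant
```

In this toy case label "a" keeps 5 of its 6 neighbor slots within the label and label "b" keeps all of them, so the matrix is diagonally dominant.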
sim <- attr(res, "similarity")
round(sim, 2)
#>      1    2    3    4    5    6    7    8
#> 1 0.80 0.00 0.00 0.00 0.03 0.00 0.18 0.00
#> 2 0.01 0.98 0.00 0.00 0.00 0.00 0.00 0.00
#> 3 0.00 0.00 0.97 0.00 0.00 0.03 0.00 0.00
#> 4 0.00 0.00 0.00 0.94 0.06 0.00 0.00 0.00
#> 5 0.15 0.00 0.00 0.02 0.76 0.00 0.07 0.00
#> 6 0.00 0.00 0.07 0.00 0.00 0.93 0.00 0.00
#> 7 0.12 0.00 0.00 0.00 0.00 0.00 0.88 0.00
#> 8 0.00 0.00 0.28 0.29 0.00 0.08 0.00 0.33
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] patchwork_1.3.0             scater_1.34.0              
#>  [3] ggplot2_3.5.1               scuttle_1.16.0             
#>  [5] bluster_1.16.0              BiocNeighbors_2.0.0        
#>  [7] TENxPBMCData_1.23.0         HDF5Array_1.34.0           
#>  [9] rhdf5_2.50.0                DelayedArray_0.32.0        
#> [11] SparseArray_1.6.0           S4Arrays_1.6.0             
#> [13] abind_1.4-8                 Matrix_1.7-1               
#> [15] SingleCellExperiment_1.28.0 SummarizedExperiment_1.36.0
#> [17] Biobase_2.66.0              GenomicRanges_1.58.0       
#> [19] GenomeInfoDb_1.42.0         IRanges_2.40.0             
#> [21] S4Vectors_0.44.0            BiocGenerics_0.52.0        
#> [23] MatrixGenerics_1.18.0       matrixStats_1.4.1          
#> [25] concordexR_1.6.0            BiocStyle_2.34.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] DBI_1.2.3                gridExtra_2.3            rlang_1.1.4             
#>  [4] magrittr_2.0.3           compiler_4.4.1           RSQLite_2.3.7           
#>  [7] png_0.1-8                vctrs_0.6.5              pkgconfig_2.0.3         
#> [10] SpatialExperiment_1.16.0 crayon_1.5.3             fastmap_1.2.0           
#> [13] dbplyr_2.5.0             magick_2.8.5             XVector_0.46.0          
#> [16] labeling_0.4.3           utf8_1.2.4               rmarkdown_2.28          
#> [19] ggbeeswarm_0.7.2         UCSC.utils_1.2.0         tinytex_0.53            
#> [22] purrr_1.0.2              bit_4.5.0                xfun_0.48               
#> [25] zlibbioc_1.52.0          cachem_1.1.0             beachmat_2.22.0         
#> [28] jsonlite_1.8.9           blob_1.2.4               highr_0.11              
#> [31] rhdf5filters_1.18.0      Rhdf5lib_1.28.0          BiocParallel_1.40.0     
#> [34] irlba_2.3.5.1            parallel_4.4.1           cluster_2.1.6           
#> [37] R6_2.5.1                 bslib_0.8.0              jquerylib_0.1.4         
#> [40] Rcpp_1.0.13              bookdown_0.41            knitr_1.48              
#> [43] FNN_1.1.4.1              igraph_2.1.1             tidyselect_1.2.1        
#> [46] viridis_0.6.5            yaml_2.3.10              codetools_0.2-20        
#> [49] curl_5.2.3               lattice_0.22-6           tibble_3.2.1            
#> [52] withr_3.0.2              KEGGREST_1.46.0          evaluate_1.0.1          
#> [55] isoband_0.2.7            BiocFileCache_2.14.0     ExperimentHub_2.14.0    
#> [58] Biostrings_2.74.0        pillar_1.9.0             BiocManager_1.30.25     
#> [61] filelock_1.0.3           generics_0.1.3           BiocVersion_3.20.0      
#> [64] sparseMatrixStats_1.18.0 munsell_0.5.1            scales_1.3.0            
#> [67] glue_1.8.0               tools_4.4.1              AnnotationHub_3.14.0    
#> [70] ScaledMatrix_1.14.0      cowplot_1.1.3            grid_4.4.1              
#> [73] AnnotationDbi_1.68.0     colorspace_2.1-1         GenomeInfoDbData_1.2.13 
#> [76] beeswarm_0.4.0           BiocSingular_1.22.0      vipor_0.4.7             
#> [79] rsvd_1.0.5               cli_3.6.3                rappdirs_0.3.3          
#> [82] fansi_1.0.6              viridisLite_0.4.2        dplyr_1.1.4             
#> [85] uwot_0.2.2               gtable_0.3.6             sass_0.4.9              
#> [88] digest_0.6.37            ggrepel_0.9.6            farver_2.1.2            
#> [91] rjson_0.2.23             memoise_2.0.1            htmltools_0.5.8.1       
#> [94] lifecycle_1.0.4          httr_1.4.7               mime_0.12               
#> [97] MASS_7.3-61              bit64_4.5.2

Chari, Tara, Joeyta Banerjee, and Lior Pachter. 2021. “The Specious Art of Single-Cell Genomics.” bioRxiv.