The scECODA package provides functions for single-cell Exploratory COmpositional Data Analysis at the population scale. It enables users to explore their dataset by visualizing the sample structure and how groups of samples relate to each other by providing numerous convenient plotting and clustering functions. Additionally, it provides metrics to quantify how strongly groups of samples separate and which cell types are driving this separation (differential abundance analysis).
While many tools exist for supervised comparative single-cell analysis, there is
a distinct need for a standardized framework that handles unsupervised
exploratory compositional data analysis (ECODA) specifically for large-scale
cohorts. scECODA is motivated by the following goals for the Bioconductor
ecosystem:
SingleCellExperiment and SummarizedExperiment,
allowing seamless integration into existing pipelines.DESeq2 for pseudobulk normalization.You can install the latest stable release of scECODA from Bioconductor.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("scECODA")
library(scECODA)
You can create an SummarizedExperiment object from a single-cell data (Seurat or
SingleCellExperiment object) or from a data frame containing cell counts or
frequencies (in % per sample) with ecoda().
Based on the input, the following data and transformations will be automatically calculated:
Additionally, it will calculate useful metrics, such as:
NOTE: To calculate CLR-transformed cell type abundances, zero counts or relative abundances are not allowed (log of zero is not defined). By default, scECODA adds a pseudocount of +0.5 to all values if a counts matrix is provided; when the input is a frequency matrix, we add 2/3 times the smallest frequency value as suggested in (Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn 2003).
Load an example dataset from the scRNAseq BioC library.
# Loading an example SingleCellExperiment dataset
library(scRNAseq)
all.ds <- as.data.frame(surveyDatasets())
ds <- 1
sce_object <- fetchDataset(
all.ds[["name"]][ds],
all.ds[["version"]][ds]
)
names(colData(sce_object))
## [1] "sample" "DevelopmentalStage" "DaysPostAmputation"
## [4] "cluster" "CellCyclePhase" "Lane"
## [7] "Condition" "batch"
To represent this dataset in terms of cell type composition, we will use the
cell type labels provided by the authors and stored in the “cluster” metadata
field. We will do this for each individual sample (“sample” metadata field). The
ecoda() function will do this for you:
# Create ECODA object
se <- ecoda(
sce_object,
sample_col = "sample",
celltype_col = "cluster"
)
The resulting object stores row counts, frequencies and several transformations
for the cell type composition by sample. For instance, the clr assay contains
centered log-ratio values for the cell types in each sample:
assay(se, "clr")[1:5,1:5]
## SIGAA5 SIGAA8 SIGAA10 SIGAA11 SIGAB5
## Alpha ionocyte -1.4902081 -1.8109108 0.1697487 -1.4273711 0.7809205
## Anterior notochord 1.9437791 1.0223025 -2.6634647 -1.4273711 -2.5149163
## Beta ionocyte -1.4902081 -0.2014729 1.5992152 -1.4273711 1.7477635
## Dermomyotome 0.1192298 0.1349993 -2.6634647 -0.3287588 -0.9054784
## Differentiating myocyte -1.4902081 -0.7122985 -2.6634647 -0.9165454 -2.5149163
Alternatively, scECODA also supports generating compositional matrices from Seurat objects:
# Commented out because Seurat depends on igraph which has not been updated
# to work with R 4.6 yet.
# library(Seurat)
#
# counts <- counts(sce_object)
# metadata <- as.data.frame(SummarizedExperiment::colData(sce_object))
#
# seurat_object <- CreateSeuratObject(counts = counts, meta.data = metadata)
#
# se <- ecoda(
# seurat_object,
# sample_col = "sample",
# celltype_col = "cluster"
# )
If you have a pre-calculated sample x cell type compositional matrix (either as counts or raw frequencies), you can use them directly as input to scECODA. Metadata can be optionally provided and used downstream for visualization.
In this example, we load the immune cell composition in samples from a breast cancer cohort (Zhang et al. 2021)
NOTES:
# Load example datasets
data(example_data)
Zhang <- example_data$Zhang
counts <- Zhang$cell_counts_lowresolution
freq <- calc_freq(counts)
metadata <- Zhang$metadata
se <- ecoda(data = counts, metadata = metadata)
se <- ecoda(data = freq, metadata = metadata)
For explorative analysis, Principal Component Analysis (PCA) can be very useful
to find the major axes of variation between samples. You can plot a PCA of your
se with plot_pca(se).
se <- ecoda(
data = example_data$Zhang$cell_counts_lowresolution,
metadata = example_data$Zhang$metadata,
)
plot_pca(
se,
label_col = "Tissue",
title = "PCA based on cell type composition",
n_hv_feat_show = 5 # Shows the most highly variable features (cell types)
)
Not all cell types have informative composition (some may be very rare, others constant across samples). It is often more robust to restrict the analysis on highly-variable cell types (HVCs); for example, PCA can be calculated on HVCs:
plot_pca(
se,
assay = "clr_hvc",
label_col = "Tissue",
title = "PCA based on highly variable cell types",
n_hv_feat_show = nrow(metadata(se)$clr_hvc)
)
Optionally, we can also add an extra dimension and plot the first 3 principal components:
plot_pca3d(se, label_col = "Tissue")
Sample separation (based on a user-defined grouping) can be quantified with different metrics, for example:
# You can get the separation scores like this:
dist_mat <- dist(assay(se, "clr"))
# Or use the pre-calculated distance matrix based on the CLR abundances:
dist_mat <- metadata(se)$sample_distances
labels <- colData(se)$Tissue
anosim_r_score <- calc_anosim(dist_mat, labels)
ari_score <- calc_ari(dist_mat, labels)
# modularity_score <- calc_modularity(dist_mat, labels)
silhouette_score <- calc_sil(dist_mat, labels)
print(paste("ANOSIM R:", anosim_r_score))
## [1] "ANOSIM R: 0.511"
print(paste("ARI:", ari_score))
## [1] "ARI: 0.639"
# print(paste("Modularity:", modularity_score))
print(paste("Silhouette:", silhouette_score))
## [1] "Silhouette: 0.226"
Box plots are another useful tool to visualize the distribution of cell types across samples:
plot_boxplot(se, title = "CLR Abundance by Cell Type")
If there are biological groupings of interest, we can directly perform statistical tests between the per-sample cell type compositions:
plot_boxplot(
se,
label_col = "Tissue",
title = "CLR Abundance by Cell Type (with Wilcoxon Test)"
)
This can also be plotted as stacked bar plots but it is a bit harder to read.
# Plotting average cell type abundance by experimental group
plot_barplot(
se,
label_col = "Tissue",
plot_by = "group",
title = "Mean Relative Abundance by Condition"
)
# Plotting cell type abundance for each sample separately
plot_barplot(
se,
label_col = "Tissue",
plot_by = "sample",
title = "Relative Abundance for Each Sample"
)
NOTE: It is highly recommended to only do statistical testing on log-ratio-transformed compositional data and not on counts or relative abundances in % (Greenacre 2021).
The main focus of scECODA is exploratory (unsupervised) analysis of samples based on cell type composition. If you want to do down-stream comparative (supervised) differential abundance testing of a priori known groups (including statistical designs with multiple known covariates), there are other tools available for that purpose, e.g. scCODA (Büttner et al. 2021) or tascCODA (“TascCODA: Bayesian Tree-Aggregated Analysis of Compositional Amplicon and Single-Cell Data - Pmc,” n.d.).
The clustered heatmap plot is a good alternative to not only visualize how samples and cell types correlate but also to cluster samples. It is important not to re-scale in order to avoid amplifying tiny differences.
Also, it is possible that you see your samples forming separate clusters in the PCA or heatmap even though you do not find any (or only few) significant differences in cell type composition. However, there might be small differences among a large number of cell types which, taken all together, can still drive sample separation and explain differences between your biological groups.
NOTE: you can specify multiple metadata columns in “label_col” to visualize how samples group across multiple covariates simultaneously.
Here is an example for a simple dataset:
plot_heatmap(
se,
label_col = c("Clinical.efficacy.", "Tissue"),
cutree_rows = 3,
cutree_cols = 3
)
We see, for example, an enrichment of plasma cells in the tumor samples, while neutrophils and platelets are more abundant in blood samples.
As a second example, here’s the same analysis using a large healthy cohort with 868 PBMC samples and 69 cell types (Gong et al. 2024):
se <- ecoda(
data = example_data$GongSharma_full$cell_counts_highresolution,
metadata = example_data$GongSharma_full$metadata
)
plot_heatmap(
se,
label_col = c("subject.cmv", "age_group"),
# Additional arguments for pheatmap:
cutree_rows = 3,
cutree_cols = 5,
show_colnames = FALSE
)
Subsetting to only the most highly variable cell types (HVCs) yields a very similar sample clustering, while being much more interpretable.
plot_heatmap(
se,
assay = "clr_hvc",
label_col = c("subject.cmv", "age_group"),
# Additional arguments for pheatmap:
cutree_rows = 3,
cutree_cols = 4,
show_colnames = FALSE
)
Interestingly, we see that cell type composition tends to cluster subjects, in an unsupervised manner, based on their age group and their CMV status. For example, naive cell types appear to be enriched in younger subjects that were not exposed to CMV infection.
The cell type correlation plot offers the possibility to visualize which cell types correlate with each other, highlighting possible microenvironment hubs that might be of relevance to explain possible groups of samples in your data:
plot_corr(se)
While the correlation plot below showing only the HVCs is easier to read, it is nevertheless interesting to see the full correlation plot above. It seems that there is additional information that might be overlooked:
Besides the main cell type clusters of effector memory cells (separating samples by cytomegalovirus (CMV) infection status) and SOX4+ naive T cells (separating samples by age), there seems to be a cluster of ISG+ cell types, memory T cells and monocytes/DCs whose abundance strongly correlate, respectively.
plot_corr(se, assay = "clr_hvc")
As demonstrated in the heatmap of the extensive Gong & Sharma dataset, not all cell subtypes contribute equally to sample separation.
Cell types at the top exhibit high variance across samples, indicating they are the main drivers of group separation, while those toward the bottom show minor variation.
This principle is directly analogous to the selection of highly variable genes (HVGs) in (sc)RNA-seq data: high variance is a powerful proxy for biological relevance.
We expect cell types that are differentially abundant between key biological groups to display this increased variance. scECODA provides convenient functions to calculate and visualize these cell type variances and mean abundance across all samples:
library(ggplot2)
plot_varmean(se, labels = "none") +
ggtitle("Default - cell types explaining 50% variance")
se_new <- find_hvcs(se, variance_explained = 0.8)
plot_varmean(se_new) +
ggtitle("Cell types explaining 80% variance")
se_new <- find_hvcs(se, top_n_hvcs = 5)
plot_varmean(se_new) +
ggtitle("Top 5 HVCs")
ECODA is much more interpretable (10-100 cell types) compared to gene expression (2’000 - 20’000 genes). However, pseudobulk analysis has the advantage of not requiring cell type annotations.
The scECODA package provides convenient functions to explore sample structure at the population level based on cell type composition but also by default automatically calculates DESeq2-normalized (Love et al. 2015) sample pseudobulk gene expression.
A PCA plot based on pseudobulk can be plotted like this:
# Loading an example SingleCellExperiment dataset
library(scRNAseq)
all.ds <- as.data.frame(surveyDatasets())
ds <- 1
sce_object <- fetchDataset(
all.ds[["name"]][ds],
all.ds[["version"]][ds]
)
se <- ecoda(
data = sce_object,
sample_col = "sample",
celltype_col = "cluster",
get_pb = TRUE
)
plot_pca(
se,
label_col = "Condition",
title = "PCA based on cell type composition",
n_hv_feat_show = 5
)
plot_pca(
se,
assay = "pb",
label_col = "Condition",
title = "PCA based on pseudobulk gene expression"
#, n_hv_feat_show = 5 # optionally, show top n most highly variable genes
)
Yes, ECODA can be influenced by technical variation between batches. A “batch effect” can manifests as an artificial inflation or depletion of specific cell types due to sample handling rather than biology.
Common examples include:
Removing these “nuisance” populations can help mitigating noise and/or batch effect, and to better capture true biological variation.
Yes, we recommended to exclude cell types that introduce technical noise rather than biological signal.
The recommended workflow depends on your data source and whether you require pseudobulk analysis:
# Load example datasets
data(example_data)
Stephenson <- example_data$Stephenson
counts <- Stephenson$cell_counts_lowresolution
metadata <- Stephenson$metadata
se <- ecoda(data = counts, metadata = metadata)
# Specify cell types to drop
cts_to_drop <- c("RBC", "Platelets", "Neutrophils")
keep <- !rownames(assay(se, "counts")) %in% cts_to_drop
counts_subset <- assay(se, "counts")[, keep]
colData_subset <- as.data.frame(colData(se))[keep, ]
# Re-initialize a new subset object
# This ensures all transformations (CLR, HVCs, etc.) are recalculated correctly
se_subset <- ecoda(data = counts_subset, metadata = colData_subset)
Büttner, M., J. Ostner, C. L. Müller, F. J. Theis, and B. Schubert. 2021. “ScCODA Is a Bayesian Model for Compositional Single-Cell Data Analysis.” Nature Communications 12 (1): 6876. https://doi.org/10.1038/s41467-021-27150-6.
Gong, Qiuyu, Mehul Sharma, Emma L. Kuan, Marla C. Glass, Aishwarya Chander, Mansi Singh, Lucas T. Graybuck, et al. 2024. “Longitudinal Multi-Omic Immune Profiling Reveals Age-Related Immune Cell Dynamics in Healthy Adults.” bioRxiv: The Preprint Server for Biology, September, 2024.09.10.612119. https://doi.org/10.1101/2024.09.10.612119.
Greenacre, Michael. 2021. “Compositional Data Analysis.” Annual Review of Statistics and Its Application 8 (1): 271–99. https://doi.org/10.1146/annurev-statistics-042720-124436.
Love, Michael I., Simon Anders, Vladislav Kim, and Wolfgang Huber. 2015. “RNA-Seq Workflow: Gene-Level Exploratory Analysis and Differential Expression.” F1000Research 4 (October): 1070. https://doi.org/10.12688/f1000research.7035.1.
Martín-Fernández, J. A., C. Barceló-Vidal, and V. Pawlowsky-Glahn. 2003. “Dealing with Zeros and Missing Values in Compositional Data Sets Using Nonparametric Imputation.” Mathematical Geology 35 (3): 253–78. https://doi.org/10.1023/A:1023866030544.
“TascCODA: Bayesian Tree-Aggregated Analysis of Compositional Amplicon and Single-Cell Data - Pmc.” n.d. https://pmc.ncbi.nlm.nih.gov/articles/PMC8689185/.
Zhang, Yuanyuan, Hongyan Chen, Hongnan Mo, Xueda Hu, Ranran Gao, Yahui Zhao, Baolin Liu, et al. 2021. “Single-Cell Analyses Reveal Key Immune Cell Subsets Associated with Response to Pd-L1 Blockade in Triple-Negative Breast Cancer.” Cancer Cell 39 (12): 1578–1593.e8. https://doi.org/10.1016/j.ccell.2021.09.010.
sessionInfo()
## R version 4.6.0 alpha (2026-04-05 r89794)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggplot2_4.0.2 scRNAseq_2.25.0
## [3] SingleCellExperiment_1.33.2 SummarizedExperiment_1.41.1
## [5] Biobase_2.71.0 GenomicRanges_1.63.2
## [7] Seqinfo_1.1.0 IRanges_2.45.0
## [9] S4Vectors_0.49.2 BiocGenerics_0.57.1
## [11] generics_0.1.4 MatrixGenerics_1.23.0
## [13] matrixStats_1.5.0 scECODA_0.99.9
## [15] BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 jsonlite_2.0.0 magrittr_2.0.5
## [4] gypsum_1.7.0 GenomicFeatures_1.63.2 farver_2.1.2
## [7] corrplot_0.95 rmarkdown_2.31 BiocIO_1.21.0
## [10] vctrs_0.7.3 memoise_2.0.1 Rsamtools_2.27.2
## [13] RCurl_1.98-1.18 rstatix_0.7.3 htmltools_0.5.9
## [16] S4Arrays_1.11.1 AnnotationHub_4.1.0 curl_7.0.0
## [19] broom_1.0.12 Rhdf5lib_1.33.6 rhdf5_2.55.16
## [22] SparseArray_1.11.13 Formula_1.2-5 sass_0.4.10
## [25] alabaster.base_1.11.4 bslib_0.10.0 alabaster.sce_1.11.0
## [28] htmlwidgets_1.6.4 httr2_1.2.2 plotly_4.12.0
## [31] cachem_1.1.0 GenomicAlignments_1.47.0 lifecycle_1.0.5
## [34] pkgconfig_2.0.3 Matrix_1.7-5 R6_2.6.1
## [37] fastmap_1.2.0 digest_0.6.39 AnnotationDbi_1.73.1
## [40] DESeq2_1.51.7 ExperimentHub_3.1.0 crosstalk_1.2.2
## [43] RSQLite_2.4.6 ggpubr_0.6.3 vegan_2.7-3
## [46] labeling_0.4.3 filelock_1.0.3 httr_1.4.8
## [49] abind_1.4-8 mgcv_1.9-4 compiler_4.6.0
## [52] withr_3.0.2 bit64_4.6.0-1 S7_0.2.1-1
## [55] backports_1.5.1 BiocParallel_1.45.0 carData_3.0-6
## [58] DBI_1.3.0 alabaster.ranges_1.11.0 HDF5Array_1.39.1
## [61] alabaster.schemas_1.11.0 ggsignif_0.6.4 MASS_7.3-65
## [64] rappdirs_0.3.4 DelayedArray_0.37.1 rjson_0.2.23
## [67] gtools_3.9.5 permute_0.9-10 tools_4.6.0
## [70] otel_0.2.0 glue_1.8.1 h5mread_1.3.3
## [73] restfulr_0.0.16 rhdf5filters_1.23.3 nlme_3.1-169
## [76] grid_4.6.0 cluster_2.1.8.2 gtable_0.3.6
## [79] ensembldb_2.35.0 tidyr_1.3.2 data.table_1.18.2.1
## [82] car_3.1-5 XVector_0.51.0 ggrepel_0.9.8
## [85] BiocVersion_3.23.1 pillar_1.11.1 stringr_1.6.0
## [88] splines_4.6.0 dplyr_1.2.1 BiocFileCache_3.1.0
## [91] lattice_0.22-9 rtracklayer_1.71.3 bit_4.6.0
## [94] tidyselect_1.2.1 locfit_1.5-9.12 Biostrings_2.79.5
## [97] knitr_1.51 bookdown_0.46 ProtGenerics_1.43.0
## [100] xfun_0.57 factoextra_2.0.0 pheatmap_1.0.13
## [103] UCSC.utils_1.7.1 stringi_1.8.7 lazyeval_0.2.3
## [106] yaml_2.3.12 evaluate_1.0.5 codetools_0.2-20
## [109] cigarillo_1.1.0 tibble_3.3.1 alabaster.matrix_1.11.0
## [112] BiocManager_1.30.27 cli_3.6.6 jquerylib_0.1.4
## [115] GenomeInfoDb_1.47.2 dichromat_2.0-0.1 Rcpp_1.1.1-1
## [118] dbplyr_2.5.2 png_0.1-9 XML_3.99-0.23
## [121] parallel_4.6.0 blob_1.3.0 mclust_6.1.2
## [124] AnnotationFilter_1.35.0 bitops_1.0-9 alabaster.se_1.11.0
## [127] viridisLite_0.4.3 scales_1.4.0 purrr_1.2.2
## [130] crayon_1.5.3 rlang_1.2.0 KEGGREST_1.51.1