BatchSVG 1.3.2
A standard task in the analysis of spatially resolved transcriptomics (SRT) data is to identify spatially variable genes (SVGs). This is most commonly done within one tissue section at a time because the spatial relationships between the tissue sections are typically unknown. One challenge is how to identify and remove SVGs that are associated with a known bias or technical artifact, such as the slide or capture area, which can lead to poor performance in downstream analyses, such as spatial domain detection. Here, we introduce BatchSVG, a tool to identify biased features associated with batch effect(s)
(e.g. sample, slide, and sex) in a set of SVGs. Our approach compares the rank of per-gene deviance under a binomial model (i) with and (ii) without including a covariate in the model that is associated with the known bias or technical artifact. If the rank of a feature changes significantly between these, then we infer that this gene is likely associated with the bias or technical artifact and should be removed from the downstream analysis. The package is compatible with SpatialExperiment objects.
if (!requireNamespace("BiocManager")) {
install.packages("BiocManager")
}
BiocManager::install("BatchSVG")
To install the development version from GitHub instead:
remotes::install("christinehou11/BatchSVG")
In this section, we will demonstrate the workflow for using
BatchSVG and show how the method helps to detect and visualize batch-biased SVGs. We will use an SRT dataset from
the spatialLIBD package.
library(BatchSVG)
library(spatialLIBD)
library(cowplot)
spatialLIBD_spe <- fetch_data(type = "spe")
spatialLIBD_spe
class: SpatialExperiment
dim: 33538 47681
metadata(0):
assays(2): counts logcounts
rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
ENSG00000268674
rowData names(9): source type ... gene_search is_top_hvg
colnames(47681): AAACAACGAATAGTTC-1 AAACAAGTATCTCCCA-1 ...
TTGTTTCCATACAACT-1 TTGTTTGTGTAAATTC-1
colData names(69): sample_id Cluster ... array_row array_col
reducedDimNames(6): PCA TSNE_perplexity50 ... TSNE_perplexity80
UMAP_neighbors15
mainExpName: NULL
altExpNames(0):
spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
imgData names(4): sample_id image_id data scaleFactor
We will use an SVG set that was previously generated using the nnSVG package.
libd_nnsvgs <- read.csv(
system.file("extdata","libd-all_nnSVG_p-05-features-df.csv",
package = "BatchSVG"),
row.names = 1, check.names = FALSE)
featureSelect()This function performs feature selection on a chosen subset of SRT data (input) using a predefined set of SVGs (VGs).
To assess the impact of the batch variable in the SRT data, we fit a binomial model per gene (i) with and (ii) without including a covariate in the model that is associated with the batch effect or unwanted variation. Using this model output, we define \(d_{i, \text{ default}}\) and \(d_{i, \text{ batch name}}\) as the residual deviance for gene \(i\) using a binomial model with and without the batch effect, respectively. We calculate a per-gene relative change in deviance (RCD) as \({RCD}_i = \frac{d_{i, \text{ default}} - d_{i, \text{ batch name}}}{d_{i, \text{ batch name}}}\). Generally, a higher per-gene deviance \(d_i\) suggests that the gene’s expression is more likely to be biologically meaningful. Therefore, a reduction in deviance after accounting for the batch covariate indicates that the batch explains a portion of the variation in gene expression that was previously attributed to biological differences.
In addition to the residual deviance itself, we also consider the ranks of the residual deviances where top-ranked genes have the largest residual deviance and are considered more important. Here, an increase in rank (e.g. from a rank of 1 to a rank of 500) when including the batch variable indicates that the relative importance of the feature is diminished once the batch variable is accounted for. Therefore, in addition to RCD, we also evaluated the rank deviance (RD), which is defined as \({RD}_i = r_{i, \text{ batch name}} - r_{i, \text{ default}}\), where \(r_{i, \text{ default}}\) and \(r_{i, \text{ batch name}}\) are the per-gene rank when the binomial deviance model is run with and without the batch variable, respectively.
The featureSelect() function enables feature selection while
accounting for multiple batch effects. It returns a list of data
frames, where each batch effect is associated with a corresponding data
frame containing key results, including:
Relative change in deviance before and after batch effect adjustment (RCD)
Rank differences before and after batch effect adjustment (RD)
Number of standard deviations (nSD) for both relative change in deviance and rank difference
We use apply featureSelect() to the example
dataset while adjusting for the batch effect of subject.
list_batch_df <- featureSelect(input = spatialLIBD_spe,
batch_effect = "subject", VGs = libd_nnsvgs$gene_id)
Running feature selection with batch...
Batch Effect: subject
Running feature selection without batch...
Calculating deviance and rank difference...
list_batch_df <- featureSelect(input = spatialLIBD_spe,
batch_effect = "subject", VGs = libd_nnsvgs$gene_id, verbose = FALSE)
class(list_batch_df)
[1] "list"
head(list_batch_df$subject)
gene_id gene_name dev_default rank_default dev_subject rank_subject
1 ENSG00000187608 ISG15 43591.43 1151 43304.39 1157
2 ENSG00000131584 ACAP3 39115.97 1498 38826.05 1509
3 ENSG00000242485 MRPL20 47074.23 920 46912.53 918
4 ENSG00000160075 SSU72 47815.13 862 47742.10 852
5 ENSG00000078369 GNB1 54464.75 477 54385.81 462
6 ENSG00000187730 GABRD 51521.13 646 51099.35 648
d_diff nSD_dev_subject r_diff nSD_rank_subject
1 0.006628419 -0.04505973 6 0.15937488
2 0.007467317 -0.03697398 11 0.29218728
3 0.003446929 -0.07572463 -2 -0.05312496
4 0.001529688 -0.09420402 -10 -0.26562480
5 0.001451448 -0.09495814 -15 -0.39843720
6 0.008254157 -0.02939000 2 0.05312496
svg_nSD()The svg_nSD() function generates visualizations to assess the impact of the batch
variable(s) on the SVGs. It produces bar charts
of the distribution of SVGs based on relative change in deviance
and rank difference, with colors representing different nSD intervals.
Additionally, the scatterplots compare deviance and rank values with and
without batch effects. Using these plots, we can determine appropriate nSD thresholds
for filtering batch-biased SVGs.
The SVGs in red and very far off the line are most likely to be batch-biased SVGs since their deviance and/or rank values are more changed compared to the other SVGs.
plots <- svg_nSD(list_batch_df = list_batch_df,
sd_interval_dev = 3, sd_interval_rank = 3)
Figure 1. Visualizations of nSD_dev and nSD_rank threshold selection
plots$subject
biasDetect()The function biasDetect() is designed to identify and filter out
batch-biased SVGs across different batch effects. Using threshold values
selected from the visualization results generated by svg_nSD(), this
function detects outliers that exceed a specified number
of standard deviation (nSD) threshold in either relative change in deviance, rank difference, or both.
The function outputs visualizations comparing deviance and rank values with and without batch effects. Genes with high deviations, highlighted in color, are identified as potentially biased and can be excluded based on the selected nSD thresholds.
The function offers flexibility in customizing plot aesthetics, allowing users to adjust the point size (plot_point_size), shape (plot_point_shape), annotated text size (plot_text_size), and color palette (plot_palette). Default values are provided for these parameters if not specified. Users should refer to the ggplot2 aesthetic guidelines to ensure appropriate values are assigned for each parameter.
We will use nSD_dev = 3 and nSD_rank = 3 for the example. Users
should adjust these values based on their datasets.
Usage of Different Threshold Options
threshold = "dev": Filters batch-biased SVGs based only on the relative
change in deviance. Genes with deviance changes exceeding the
specified nSD_dev threshold are identified as batch-biased and
can be removed.bias_dev <- biasDetect(list_batch_df = list_batch_df,
threshold = "dev", nSD_dev = 3)
Table 1. Outlier Genes defined by nSD_dev only
head(bias_dev$subject$Table)
gene_id gene_name dev_default rank_default dev_subject rank_subject
1 ENSG00000255823 MTRNR2L8 286927.9 5 67738.95 77
2 ENSG00000087250 MT3 115108.2 21 86093.50 31
3 ENSG00000256618 MTRNR2L1 167381.9 14 40299.65 1368
4 ENSG00000198840 MT-ND3 170192.2 12 120647.03 17
d_diff nSD_dev_subject r_diff nSD_rank_subject nSD_bin_dev dev_outlier
1 3.2357886 31.079310 72 1.9124986 [30,33] TRUE
2 0.3370142 3.139375 10 0.2656248 [3,6) TRUE
3 3.1534316 30.285509 1354 35.9655982 [30,33] TRUE
4 0.4106625 3.849237 5 0.1328124 [3,6) TRUE
We can change point size using plot_point_size.
# size default = 3
bias_dev_size <- biasDetect(list_batch_df = list_batch_df,
threshold = "dev", nSD_dev = 3, plot_point_size = 4)
Figure 2. Customize point size
plot_grid(bias_dev$subject$Plot, bias_dev_size$subject$Plot)
threshold = "rank": Identifies biased genes based only on rank
difference. Genes with rank shifts exceeding nSD_rank are
considered batch-biased.bias_rank <- biasDetect(list_batch_df = list_batch_df,
threshold = "rank", nSD_rank = 3)
Table 2. Outlier Genes defined by nSD_rank only
head(bias_rank$subject$Table)
gene_id gene_name dev_default rank_default dev_subject rank_subject
1 ENSG00000184144 CNTN2 44388.43 1101 42016.77 1258
2 ENSG00000136541 ERMN 48106.50 844 44879.30 1036
3 ENSG00000013297 CLDN11 53385.84 539 49776.21 732
4 ENSG00000164124 TMEM144 39407.07 1471 37587.88 1605
5 ENSG00000204655 MOG 37293.40 1648 35613.38 1762
6 ENSG00000091129 NRCAM 51388.42 655 49006.18 776
d_diff nSD_dev_subject r_diff nSD_rank_subject nSD_bin_rank rank_outlier
1 0.05644564 0.4351052 157 4.170309 [3,6) TRUE
2 0.07190853 0.5841448 192 5.099996 [3,6) TRUE
3 0.07251725 0.5900120 193 5.126559 [3,6) TRUE
4 0.04839817 0.3575395 134 3.559372 [3,6) TRUE
5 0.04717397 0.3457399 114 3.028123 [3,6) TRUE
6 0.04861096 0.3595904 121 3.214060 [3,6) TRUE
We can change point shape using plot_point_shape.
Figure 3. Customize point shape
# shape default = 16
bias_rank_shape <- biasDetect(list_batch_df = list_batch_df,
threshold = "rank", nSD_rank = 3, plot_point_shape = 2)
plot_grid(bias_rank$subject$Plot, bias_rank_shape$subject$Plot)
threshold = "both": Detects biased genes based on both relative change in deviance and rank difference, providing a more stringent filtering
approach.bias_both <- biasDetect(list_batch_df = list_batch_df, threshold = "both",
nSD_dev = 3, nSD_rank = 3)
Table 3. Outlier Genes defined by nSD_dev and nSD_rank
head(bias_both$subject$Table)
gene_id gene_name dev_default rank_default dev_subject rank_subject
1 ENSG00000184144 CNTN2 44388.43 1101 42016.77 1258
2 ENSG00000136541 ERMN 48106.50 844 44879.30 1036
3 ENSG00000013297 CLDN11 53385.84 539 49776.21 732
4 ENSG00000164124 TMEM144 39407.07 1471 37587.88 1605
5 ENSG00000204655 MOG 37293.40 1648 35613.38 1762
6 ENSG00000091129 NRCAM 51388.42 655 49006.18 776
d_diff nSD_dev_subject r_diff nSD_rank_subject nSD_bin_dev dev_outlier
1 0.05644564 0.4351052 157 4.170309 [0,3) FALSE
2 0.07190853 0.5841448 192 5.099996 [0,3) FALSE
3 0.07251725 0.5900120 193 5.126559 [0,3) FALSE
4 0.04839817 0.3575395 134 3.559372 [0,3) FALSE
5 0.04717397 0.3457399 114 3.028123 [0,3) FALSE
6 0.04861096 0.3595904 121 3.214060 [0,3) FALSE
nSD_bin_rank rank_outlier
1 [3,6) TRUE
2 [3,6) TRUE
3 [3,6) TRUE
4 [3,6) TRUE
5 [3,6) TRUE
6 [3,6) TRUE
We can change point color using plot_palette. The color
palette
here can
be referenced as the function uses RColorBrewer to generate
colors.
Figure 4. Customize the palette color
# color default = "YlOrRd"
bias_both_color <- biasDetect(list_batch_df = list_batch_df,
threshold = "both", nSD_dev = 3, nSD_rank = 3, plot_palette = "Greens")
plot_grid(bias_both$subject$Plot, bias_both_color$subject$Plot,nrow = 2)
We can change text size using plot_text_size. We can also specify unique color palettes for both batch effects.
Figure 5. Customize text size and color palette
# text size default = 3
bias_both_color_text <- biasDetect(list_batch_df = list_batch_df,
threshold = "both", nSD_dev = 3, nSD_rank = 3,
plot_palette = c("Blues"), plot_text_size = 4)
plot_grid(bias_both$subject$Plot, bias_both_color_text$subject$Plot,nrow = 2)
Finally, we obtain a refined set of SVGs by
removing the batch-biased SVGs based on user-defined thresholds for
nSD_dev and nSD_rank.
Here, we use the results from bias_both, which applied
threshold = "both" to account for both deviance and rank differences.
bias_both_df <- bias_both$subject$Table
svgs_filt <- setdiff(libd_nnsvgs$gene_id, bias_both_df$gene_id)
svgs_filt_spe <- libd_nnsvgs[libd_nnsvgs$gene_id %in% svgs_filt, ]
nrow(svgs_filt_spe)
[1] 1951
After obtaining the refined set of SVGs, these genes can be further analyzed using established SRT clustering algorithms to explore tissue layers and spatial organization.
R session information## Session info
sessionInfo()
#> R Under development (unstable) (2025-10-20 r88955)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] cowplot_1.2.0 spatialLIBD_1.23.0
#> [3] SpatialExperiment_1.21.0 SingleCellExperiment_1.33.0
#> [5] SummarizedExperiment_1.41.0 Biobase_2.71.0
#> [7] GenomicRanges_1.63.0 Seqinfo_1.1.0
#> [9] IRanges_2.45.0 S4Vectors_0.49.0
#> [11] BiocGenerics_0.57.0 generics_0.1.4
#> [13] MatrixGenerics_1.23.0 matrixStats_1.5.0
#> [15] BatchSVG_1.3.2 BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 jsonlite_2.0.0 shape_1.4.6.1
#> [4] magrittr_2.0.4 ggbeeswarm_0.7.3 magick_2.9.0
#> [7] farver_2.1.2 rmarkdown_2.30 BiocIO_1.21.0
#> [10] GlobalOptions_0.1.3 vctrs_0.6.5 Rsamtools_2.27.0
#> [13] memoise_2.0.1 config_0.3.2 RCurl_1.98-1.17
#> [16] paletteer_1.6.0 benchmarkme_1.0.8 tinytex_0.58
#> [19] htmltools_0.5.8.1 S4Arrays_1.11.1 AnnotationHub_4.1.0
#> [22] curl_7.0.0 BiocNeighbors_2.5.0 SparseArray_1.11.6
#> [25] sass_0.4.10 bslib_0.9.0 htmlwidgets_1.6.4
#> [28] httr2_1.2.1 plotly_4.11.0 cachem_1.1.0
#> [31] GenomicAlignments_1.47.0 mime_0.13 lifecycle_1.0.4
#> [34] iterators_1.0.14 pkgconfig_2.0.3 rsvd_1.0.5
#> [37] Matrix_1.7-4 R6_2.6.1 fastmap_1.2.0
#> [40] shiny_1.11.1 clue_0.3-66 digest_0.6.39
#> [43] colorspace_2.1-2 rematch2_2.1.2 AnnotationDbi_1.73.0
#> [46] scater_1.39.0 irlba_2.3.5.1 ExperimentHub_3.1.0
#> [49] RSQLite_2.4.5 beachmat_2.27.0 labeling_0.4.3
#> [52] filelock_1.0.3 httr_1.4.7 abind_1.4-8
#> [55] compiler_4.6.0 withr_3.0.2 bit64_4.6.0-1
#> [58] doParallel_1.0.17 attempt_0.3.1 S7_0.2.1
#> [61] BiocParallel_1.45.0 viridis_0.6.5 DBI_1.2.3
#> [64] sessioninfo_1.2.3 rappdirs_0.3.3 DelayedArray_0.37.0
#> [67] rjson_0.2.23 tools_4.6.0 vipor_0.4.7
#> [70] otel_0.2.0 beeswarm_0.4.0 httpuv_1.6.16
#> [73] glue_1.8.0 restfulr_0.0.16 promises_1.5.0
#> [76] grid_4.6.0 cluster_2.1.8.1 gtable_0.3.6
#> [79] tidyr_1.3.1 data.table_1.17.8 BiocSingular_1.27.1
#> [82] ScaledMatrix_1.19.0 XVector_0.51.0 stringr_1.6.0
#> [85] ggrepel_0.9.6 BiocVersion_3.23.1 foreach_1.5.2
#> [88] pillar_1.11.1 limma_3.67.0 later_1.4.4
#> [91] circlize_0.4.16 benchmarkmeData_1.0.4 dplyr_1.1.4
#> [94] BiocFileCache_3.1.0 lattice_0.22-7 rtracklayer_1.71.0
#> [97] bit_4.6.0 tidyselect_1.2.1 ComplexHeatmap_2.27.0
#> [100] locfit_1.5-9.12 scuttle_1.21.0 Biostrings_2.79.2
#> [103] knitr_1.50 gridExtra_2.3 bookdown_0.45
#> [106] edgeR_4.9.0 xfun_0.54 statmod_1.5.1
#> [109] DT_0.34.0 stringi_1.8.7 lazyeval_0.2.2
#> [112] yaml_2.3.11 shinyWidgets_0.9.0 cigarillo_1.1.0
#> [115] evaluate_1.0.5 codetools_0.2-20 tibble_3.3.0
#> [118] BiocManager_1.30.27 cli_3.6.5 xtable_1.8-4
#> [121] jquerylib_0.1.4 golem_0.5.1 dichromat_2.0-0.1
#> [124] Rcpp_1.1.0 dbplyr_2.5.1 scry_1.23.0
#> [127] png_0.1-8 XML_3.99-0.20 parallel_4.6.0
#> [130] ggplot2_4.0.1 blob_1.2.4 bitops_1.0-9
#> [133] viridisLite_0.4.2 scales_1.4.0 purrr_1.2.0
#> [136] crayon_1.5.3 GetoptLong_1.1.0 rlang_1.1.6
#> [139] KEGGREST_1.51.1