--- title: "shinyDSP internal data processing pipeline explained" author: - name: Seung J. Kim affiliation: - Interstitial Lung Disease Lab, London Health Sciences Center email: skim823@uwo.ca - name: Marco Mura affiliation: - Interstitial Lung Disease Lab, London Health Sciences Center - Division of Respirology, Department of Medicine, Western University email: marco.mura@respirology.site.co package: "`r pkg_ver('shinyDSP')`" date: "`r doc_date()`" output: BiocStyle::html_document: toc: true number_sections: false toc_float: smooth_scroll: true toc_depth: 2 self_contained: yes vignette: > %\VignetteIndexEntry{shinyDSP_secondary} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} bibliography: references.bib link-citations: true editor_options: markdown: wrap: 80 --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = FALSE, message = FALSE, warning = FALSE, echo = TRUE, eval = TRUE, cache = TRUE, fig.align = 'center', out.width="65%" ) knitr::knit_hooks$set(time_it = local({ now <- NULL function(before, options) { if (before) { # record the current time before each chunk now <<- Sys.time() } else { # calculate the time difference after a chunk res <- difftime(Sys.time(), now, units = "secs") # return a character string to show the time msg <- paste("Time for this code chunk to run:", round(res, 2), "seconds") message(msg) # <-- print to console return(msg) # <-- also return to appear in knitted output } } })) ``` ## Introduction The purpose of this vignette is to look under the hood and explain what a few underlying functions do to make this app work. ## Data import The human kidney data from [Nanostring](https://nanostring.com/products/geomx-digital-spatial-profiler/spatial-organ-atlas/human-kidney/) is loaded from `ExperimentHub()`. Two files are required: a count matrix and matching annotation file. Let's read those files. ```{r load, time_it = TRUE} library(standR) library(SummarizedExperiment) library(ExperimentHub) library(readr) library(dplyr) library(stats) eh <- ExperimentHub() AnnotationHub::query(eh, "standR") countFilePath <- eh[["EH7364"]] sampleAnnoFilePath <- eh[["EH7365"]] countFile <- readr::read_delim(unname(countFilePath), na = character()) sampleAnnoFile <- readr::read_delim(unname(sampleAnnoFilePath), na = character()) ``` ## Variable(s) selection A key step in analysis is to select a biological group(s) of interest. Let's inspect `sampleAnnoFile`. ```{r} colnames(sampleAnnoFile) sampleAnnoFile$disease_status %>% unique() sampleAnnoFile$region %>% unique() ``` "disease_status" and "region" look like interesting variables. For example, we might be interested in comparing "normal_glomerulus" to "DKD_glomerulus". The app can create a new `sampleAnnoFile` where two or more variables of interest can be combined into one grouped variable (under "Variable(s) of interest" in the main `side bar`). ```{r new_sampleAnnoFile, time_it = TRUE} new_sampleAnnoFile <- sampleAnnoFile %>% tidyr::unite( "disease_region", # newly created grouped variable c("disease_status","region"), # variables to combine sep = "_" ) new_sampleAnnoFile$disease_region %>% unique() ``` Finally a spatial experiment object can be created. ```{r readGeoMx, time_it = TRUE} spe <- standR::readGeoMx(as.data.frame(countFile), as.data.frame(new_sampleAnnoFile)) ``` ## Selecting groups to analyze We merged "disease_status" and "region" to create a new group called, "disease_region". The app allows users to pick any subset of groups of interest. In this example, the four possible groups are "DKD_glomerulus", "DKD_tubule","normal_tubule", and "normal_glomerulus". The following script can subset `spe` to keep certain groups. ```{r selectedTypes, time_it = TRUE} selectedTypes <- c("DKD_glomerulus", "DKD_tubule", "normal_tubule", "normal_glomerulus") toKeep <- colData(spe) %>% tibble::as_tibble() %>% pull(disease_region) spe <- spe[, grepl(paste(selectedTypes, collapse = "|"), toKeep)] ``` ## Applying QC filters Regions of interest(ROIs) can be filtered out based on any quantitative variable in `colData(spe)` (same as `colnames(new_sampleAnnoFile)`). These options can be found under the "QC" `nav panel`'s `side bar`. Let's say I want to keep ROIs whose "SequencingSaturation" is at least 85. ```{r filter, time_it = TRUE} filter <- lapply("SequencingSaturation", function(column) { cutoff_value <- 85 if(!is.na(cutoff_value)) { return(colData(spe)[[column]] > cutoff_value) } else { return(NULL) } }) filtered_spe <- spe[,unlist(filter)] colData(spe) %>% dim() colData(filtered_spe) %>% dim() ``` We can see that 14 ROIs have been filtered out. ## PCA Two PCA plots (colour-coded by biological groups or batch) for three normalization schemes are automatically created in the "PCA" `nav panel`. Let's take Q3 and RUV4 normalization as an example. ```{r speQ3, time_it=TRUE} speQ3 <- standR::geomxNorm(filtered_spe, method = "upperquartile") speQ3 <- scater::runPCA(filtered_spe) speQ3_compute <- SingleCellExperiment::reducedDim(speQ3, "PCA") ``` ```{r speRUV, time_it=TRUE} speRuv_NCGs <- standR::findNCGs(filtered_spe, batch_name = "SlideName", top_n = 200) speRuvBatchCorrection <- standR::geomxBatchCorrection(speRuv_NCGs, factors = "disease_region", NCGs = S4Vectors::metadata(speRuv_NCGs)$NCGs, k = 2 ) speRuv <- scater::runPCA(speRuvBatchCorrection) speRuv_compute <- SingleCellExperiment::reducedDim(speRuv, "PCA") ``` Then, `.PCAFunction()` is called to generate the plot. We can see that biological replicates group better with RUV4 (top) compared to Q3 (bottom). ```{r .PCAFunction, echo=FALSE} .PCAFunction <- function(spe, precomputed, colourShapeBy, selectedVar, ROIshapes, ROIcolours) { standR::drawPCA(spe, precomputed = precomputed) + ## really need as.name() plus !! because of factor() ggplot2::geom_point(ggplot2::aes( shape = factor(!!as.name(colourShapeBy), levels = selectedVar), fill = factor(!!as.name(colourShapeBy), levels = selectedVar) ), size = 3, colour = "black", stroke = 0.5) + ggplot2::scale_shape_manual(colourShapeBy, values = as.integer(unlist(ROIshapes)) # as.integer() crucial ) + ggplot2::scale_fill_manual(colourShapeBy, ## colour() is a base function, so must be avoided values = unlist(ROIcolours) ) + ggplot2::scale_y_continuous( labels = scales::number_format(accuracy = 0.1) ) + ggplot2::scale_x_continuous( labels = scales::number_format(accuracy = 0.1) ) + ggplot2::theme( panel.grid.minor = ggplot2::element_blank(), panel.grid.major = ggplot2::element_blank(), axis.text = ggplot2::element_text(color = "black", size = 16), axis.line = ggplot2::element_blank(), axis.ticks = ggplot2::element_line(colour = "black"), axis.title = ggplot2::element_text(size = 16), legend.title = ggplot2::element_text( size = 16, vjust = 0.5, hjust = 0.5, face = "bold", family = "sans" ), legend.text = ggplot2::element_text( size = 16, vjust = 0.5, hjust = 0, face = "bold", family = "sans" ), plot.margin = grid::unit(c(1, 1, 1, 1), "mm"), plot.background = ggplot2::element_rect( fill = "transparent", colour = NA ), plot.title = ggplot2::element_text( size = 16, hjust = 0.5, face = "bold", family = "sans" ), panel.border = ggplot2::element_rect( colour = "black", linewidth = 0.4 ), panel.background = ggplot2::element_rect( fill = "transparent", colour = NA ), legend.background = ggplot2::element_rect( fill = "transparent", colour = NA ), legend.box.background = ggplot2::element_rect( fill = "transparent", colour = NA ), legend.key = ggplot2::element_rect( fill = "transparent", colour = NA ), legend.position = "bottom", aspect.ratio = 1 ) + ggplot2::guides( fill = ggplot2::guide_legend( nrow = length(selectedVar), title.position = "top" ), shape = ggplot2::guide_legend( nrow = length(selectedVar), title.position = "top" ) ) } ``` ```{r PCAFunction, time_it = TRUE} .PCAFunction(speRuv, speRuv_compute, "disease_region", selectedTypes, c(21,22,23,24), c("blue","dodgerblue1","black","grey75")) .PCAFunction(speQ3, speQ3_compute, "disease_region", selectedTypes, c(21,22,23,24), c("blue","dodgerblue1","black","grey75")) ``` ## Design matrix Let's proceed with RUV4 normalization. A design matrix is created next. ```{r design, time_it = TRUE} design <- model.matrix(~0+disease_region+ruv_W1+ruv_W2, data = colData(speRuv)) # Clean up column name colnames(design) <- gsub("disease_region", "", colnames(design)) ``` If confounding variables are chosen in the main `side bar`, those would be added to `model.matrix` as `~0 + disease_region + confounder1 + confounder2)`. ## Creating a DGEList `edgeR` is used to convert `SpatialExperiment` into `DGEList`, filter and estimate dispersion. ```{r convert, time_it = TRUE} library(edgeR) dge <- SE2DGEList(speRuv) keep <- filterByExpr(dge, design) dge <- dge[keep, , keep.lib.sizes = FALSE] dge <- estimateDisp(dge, design = design, robust = TRUE) ``` ## Comparison between all biological groups Recall our "selectedTypes" from above: "DKD_glomerulus", "DKD_tubule", "normal_tubule", "normal_glomerulus". The following code creates all pairwise comparisons between them. ```{r contrast, time_it = TRUE} # In case there are spaces selectedTypes_underscore <- gsub(" ", "_", selectedTypes) comparisons <- list() comparisons <- lapply( seq_len(choose(length(selectedTypes_underscore), 2)), function(i) { noquote( paste0( utils::combn(selectedTypes_underscore, 2, simplify = FALSE )[[i]][1], "-", utils::combn(selectedTypes_underscore, 2, simplify = FALSE )[[i]][2] ) ) } ) con <- makeContrasts( # Must use as.character() contrasts = as.character(unlist(comparisons)), levels = colnames(design) ) colnames(con) <- sub("-", "_vs_", colnames(con)) con ``` ## Fitting a linear regression model with `limma` The app uses `duplicateCorrelation()` "[s]ince we need to make comparisons both within and between subjects, it is necessary to treat Patient as a random effect" [limma user's guide](https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf) [@ritchie_limma_2015]. `limma-voom` method is used as `standR` package recommends [@liu_standr_2024]. ```{r limma, time_it = TRUE} library(limma) block_by <- colData(speRuv)[["SlideName"]] v <- voom(dge, design) corfit <- duplicateCorrelation(v, design, block = block_by ) v2 <- voom(dge, design, block = block_by, correlation = corfit$consensus ) corfit2 <- duplicateCorrelation(v, design, block = block_by ) fit <- lmFit(v, design, block = block_by, correlation = corfit2$consensus ) fit_contrast <- contrasts.fit(fit, contrasts = con ) efit <- eBayes(fit_contrast, robust = TRUE) ``` ## Tables of top differentially expressed genes For each contrast (a column in `con`), the app creates a table of top differentially expressed genes sorted by their adjusted P value. ```{r tables, time_it = TRUE} # Keep track of how many comparisons there are numeric_vector <- seq_len(ncol(con)) new_list <- as.list(numeric_vector) # This adds n+1th index to new_list where n = ncol(con) # This index contains seq_len(ncol(con)) # ex. new_list[[7]] = 1 2 3 4 5 6 # This coefficient allows ANOVA-like comparison in toptable() if (length(selectedTypes) > 2) { new_list[[length(new_list) + 1]] <- numeric_vector } topTabDF <- lapply(new_list, function(i) { limma::topTable(efit, coef = i, number = Inf, p.value = 0.05, adjust.method = "BH", lfc = 1 ) %>% tibble::rownames_to_column(var = "Gene") }) # Adds names to topTabDF if (length(selectedTypes) > 2) { names(topTabDF) <- c( colnames(con), colnames(con) %>% stringr::str_split(., "_vs_") %>% unlist() %>% unique() %>% paste(., collapse = "_vs_") ) } else { names(topTabDF) <- colnames(con) } ``` `topTabDF` is now a list of tables where the list index corresponds to that of columns in `con`. ```{r tableExample} colnames(con) head(topTabDF[[1]]) ``` These tables are then shown to the user in the "Tables" `nav panel`. ## Volcano plot and heatmap No further *serious* data processing is performed at this point. The numbers in `topTabDF` is now parsed to create volcano plots and heatmaps in their respective `nav panel`s. ```{r .volcanoFunction, echo = FALSE} .volcanoFunction <- function(volcano, delabSize, maxOverlap, title, logFCcutoff, PvalCutoff, DnCol, notDEcol, UpCol) { logFC <- adj.P.Val <- deLab <- NULL ggplot2::ggplot( data = volcano, ggplot2::aes( x = logFC, y = -log10(adj.P.Val), col = de, label = deLab ) ) + ggplot2::geom_point() + ggplot2::theme_bw() + ggplot2::theme( axis.ticks = ggplot2::element_line(colour = "black"), panel.border = ggplot2::element_rect(colour = "black"), text = ggplot2::element_text(size = 16, color = "black"), title = ggplot2::element_text(size = 16, hjust = 0.5), axis.text = ggplot2::element_text(size = 16, color = "black"), axis.title = ggplot2::element_text(size = 16), plot.margin = grid::unit(c(1, 1, 1, 1), "mm"), plot.background = ggplot2::element_rect( fill = "transparent", colour = NA ), panel.background = ggplot2::element_rect( fill = "transparent", colour = NA ), panel.grid = ggplot2::element_blank(), legend.background = ggplot2::element_rect( fill = "transparent", colour = NA ), legend.box.background = ggplot2::element_rect( fill = "transparent", colour = NA ), legend.key = ggplot2::element_rect( fill = "transparent", colour = NA ), legend.position = "none" ) + ggplot2::geom_vline( xintercept = c(-(logFCcutoff), logFCcutoff), col = "black", linetype = 3 ) + ggplot2::geom_hline( yintercept = -log10(PvalCutoff), col = "black", linetype = 3 ) + ggrepel::geom_text_repel( size = delabSize, segment.colour = "black", segment.linetype = 3, data = volcano %>% dplyr::filter(logFC >= logFCcutoff | logFC <= -(logFCcutoff)), ggplot2::aes(label = deLab), min.segment.length = 0, max.overlaps = maxOverlap ) + ggplot2::scale_color_manual( values = c(DnCol, notDEcol, UpCol), breaks = c("DN", "NA", "UP") ) + ggplot2::xlab(expression(log[2] ~ fold ~ change)) + ggplot2::ylab(expression(-log[10] ~ italic(P) ~ value)) + ggplot2::theme(aspect.ratio = 1, rect = ggplot2::element_rect( fill = "transparent" )) + ggplot2::labs(title = title) } ``` ```{r VolcanoFunction, time_it = TRUE} volcanoDF <- lapply(seq_len(ncol(con)), function(i) { limma::topTable(efit, coef = i, number = Inf) %>% tibble::rownames_to_column(var = "Target.name") %>% dplyr::select("Target.name", "logFC", "adj.P.Val") %>% dplyr::mutate(de = ifelse(logFC >= 1 & adj.P.Val < 0.05, "UP", ifelse(logFC <= -(1) & adj.P.Val < 0.05, "DN", "NO" ) )) %>% dplyr::mutate( logFC_threshold = stats::quantile(abs(logFC), 0.99, na.rm = TRUE ), pval_threshold = stats::quantile(adj.P.Val, 0.01, na.rm = TRUE ), deLab = ifelse(abs(logFC) > logFC_threshold & adj.P.Val < pval_threshold & abs(logFC) >= 0.05 & adj.P.Val < 0.05, Target.name, NA) ) }) plots <- lapply(seq_along(volcanoDF), function(i) { .volcanoFunction( volcanoDF[[i]], 12, 5, colnames(con)[i], 1, 0.05, "Blue", "Grey", "Red" ) + ggplot2::xlim( -2, 2 ) + ggplot2::ylim( 0, 20 ) }) plots[1] ``` In contrast to Volcano plots, only the top N genes are shown in a heatmap. Setup is a bit more involved. ```{r heatmapSetup, time_it = TRUE} lcpmSubScaleTopGenes <- lapply(names(topTabDF), function(name) { columns <- stringr::str_split_1(name, "_vs_") %>% lapply(function(.) { which(SummarizedExperiment::colData(speRuv) %>% tibble::as_tibble() %>% dplyr::pull(disease_region) == .) }) %>% unlist() table <- SummarizedExperiment::assay(speRuv, 2)[ topTabDF[[name]] %>% dplyr::slice_head(n = 50) %>% dplyr::select(Gene) %>% unlist() %>% unname(), columns ] %>% data.frame() %>% t() %>% scale() %>% t() return(table) }) names(lcpmSubScaleTopGenes) <- names(topTabDF) columnSplit <- lapply(names(topTabDF), function(name) { columnSplit <- stringr::str_split_1(name, "_vs_") %>% lapply(function(.){ which( SummarizedExperiment::colData(speRuv) %>% tibble::as_tibble() %>% dplyr::select(disease_region) == . ) } ) %>% as.vector() %>% summary() %>% .[, "Length"] }) names(columnSplit) <- names(lcpmSubScaleTopGenes) ``` ```{r heatmap, fig.height = 6, fig.width = 4, time_it = TRUE} colFunc <- circlize::colorRamp2( c( -3, 0, 3 ), hcl_palette = "Inferno" ) heatmap <- lapply(names(lcpmSubScaleTopGenes), function(name) { ComplexHeatmap::Heatmap(lcpmSubScaleTopGenes[[name]], cluster_columns = FALSE, col = colFunc, heatmap_legend_param = list( border = "black", title = "Z score", title_gp = grid::gpar( fontsize = 12, fontface = "plain", fontfamily = "sans" ), labels_gp = grid::gpar( fontsize = 12, fontface = "plain", fontfamily = "sans" ), legend_height = grid::unit( 3 * as.numeric(30), units = "mm" ) ), top_annotation = ComplexHeatmap::HeatmapAnnotation( foo = ComplexHeatmap::anno_block( gp = grid::gpar(lty = 0, fill = "transparent"), labels = columnSplit[[name]] %>% names(), labels_gp = grid::gpar( col = "black", fontsize = 14, fontfamily = "sans", fontface = "bold" ), labels_rot = 0, labels_just = "center", labels_offset = grid::unit(4.5, "mm") ) ), border_gp = grid::gpar(col = "black", lwd = 0.2), row_names_gp = grid::gpar( fontfamily = "sans", fontface = "italic", fontsize = 10 ), show_column_names = FALSE, column_title = NULL, column_split = rep( LETTERS[seq_len(columnSplit[[name]] %>% length())], columnSplit[[name]] %>% unname() %>% as.numeric() ) ) }) names(heatmap) <- names(lcpmSubScaleTopGenes) heatmap[[1]] ```