---
title: "imageTCGAutils"
author: "Ilaria Billato, Eslam Abousamra"
date: "2026-03-02"
output:
  BiocStyle::html_document:
    toc: true
    number_sections: true
    toc_float: true
    toc_depth: 3
package: imageTCGAutils
bibliography: ../inst/references.bib
vignette: >
  %\VignetteIndexEntry{imageTCGAutils}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

`imageTCGAutils` provides utility functions for integrating and analyzing multi-modal whole-slide image (WSI) data from [The Cancer Genome Atlas (TCGA)][1]. It is designed to work alongside `imageFeatureTCGA`, which handles data import of precomputed features derived from histopathology foundation models, including HoVerNet [@graham2019hover] and Prov-GigaPath [@xu2024gigapath].

[1]: https://cancer.gov/ccg/research/genome-sequencing/tcga

In particular, Prov-GigaPath is a vision encoder pretrained on over 1.3 billion pathology image tiles from the Providence Health System, producing high-dimensional tile-level embeddings that capture rich visual and morphological characteristics of tissue architecture.

`imageTCGAutils` facilitates the integration of these tile-level embeddings with nuclei-level segmentation and classification results generated by HoVerNet. Because these data sources operate at different spatial resolutions and use distinct coordinate systems, the package provides functions such as `matchHoverNetToTiles()` to compute scaling factors and assign nuclei-level features to their corresponding tiles. This alignment enables downstream analyses that combine cellular morphological context from nuclei classification with the global representations encoded in tile-level embeddings.

Additionally, the package includes functionality to import user-generated results from CONCH (CONtrastive learning from Captions for Histopathology) [@lu2024visual], a vision–language foundation model pretrained on 1.17 million histopathology image–caption pairs.
CONCH achieves state-of-the-art performance across multiple tasks, including image classification, segmentation, and image–text retrieval, thereby enabling multi-modal analyses that incorporate both visual and textual information.

In this vignette, we demonstrate how to work with tile-level embeddings derived from whole-slide images. Each tile corresponds to a tissue patch, and its embedding is a high-dimensional vector summarizing visual and morphological features extracted by Prov-GigaPath. We begin by importing the tile-level data using `imageFeatureTCGA`. Next, we perform principal component analysis (PCA) to reduce the high-dimensional embeddings to two principal components, which facilitates visualization and preliminary exploration of the data. We then visualize the spatial layout of the tiles on the tissue slide, coloring by the principal components to examine patterns in the embedding space.

# Installation

```{r install, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("waldronlab/imageTCGAutils")
```

# Loading packages

```{r library import, message=FALSE}
library(BiocStyle)
library(imageFeatureTCGA)
library(imageTCGAutils)
library(ggplot2)
library(dplyr)
library(sfdep)
library(spdep)
library(SpatialExperiment)
library(data.table)
```

# Import Prov-GigaPath tile-level embeddings

```{r import tile-level emb, results = 'hide'}
## filter the catalog for TCGA ovarian cancer slides
getCatalog("provgigapath") |>
    dplyr::filter(Project.ID == "TCGA-OV") |>
    dplyr::pull(filename)

# select an ovarian cancer slide as an example
tile_prov_url <- paste0(
    "https://store.cancerdatasci.org/provgigapath/tile_level/",
    "TCGA-23-1021-01Z-00-DX1.F07C221B-D401-47A5-9519-10DE59CA1E9D.csv.gz"
)
example_slide <- ProvGiga(tile_prov_url) |> import()
```

# Embedding PCA

```{r run PCA}
# Extract the numeric embedding columns for PCA
embedding_cols <- grep("^[0-9]+$", names(example_slide), value = TRUE)

# Run PCA
pca_res <- prcomp(example_slide[,
    embedding_cols], scale. = TRUE)

# Keep the first two principal components alongside the tile data
pca_example_slide <- bind_cols(
    example_slide,
    as_tibble(pca_res$x)[, 1:2]
)
```

```{r visualize PC}
ggplot(pca_example_slide, aes(PC1, PC2)) +
    geom_point(alpha = 0.6, size = 1) +
    theme_minimal() +
    labs(title = "Tile-level PCA Ovarian Cancer Embedding: Single Slide")

ggplot(pca_example_slide, aes(tile_x, tile_y, color = PC1)) +
    geom_point(size = 1) +
    scale_color_viridis_c() +
    coord_equal() +
    theme_minimal() +
    labs(title = "Tissue layout colored by PC1")
```

# Spatial Patterns

To investigate spatial patterns in the tissue, we use the PCA-reduced embeddings for each tile. Each tile has a physical location (`tile_x`, `tile_y`) on the slide, which allows us to explore how similar embedding values cluster across space. We construct a k-nearest-neighbor graph to define which tiles are spatially “connected,” and then compute global and local spatial autocorrelation metrics.

```{r extract coords}
# knearneigh() expects a coordinate matrix
coords <- as.matrix(pca_example_slide[, c("tile_x", "tile_y")])
nb <- knn2nb(knearneigh(coords, k = 6))
lw <- nb2listw(nb, style = "W")
```

Next, we calculate global spatial autocorrelation using Moran’s I and Geary’s C, which quantify the overall tendency of similar PC1 values to cluster or disperse on the tissue slide. We also compute Local Moran’s I (LISA) to detect local clusters of similar embedding values.

```{r metrics}
mi <- moran.test(pca_example_slide$PC1, lw)
gc <- geary.test(pca_example_slide$PC1, lw)
lisa <- localmoran(pca_example_slide$PC1, lw)

pca_example_slide$localI <- lisa[, "Ii"]
pca_example_slide$localI_pval <- lisa[, "Pr(z != E(Ii))"]

mi
gc
```

We visualize the spatial patterns. The Moran scatterplot shows the relationship between each tile’s PC1 value and the mean of its neighbors, while the LISA plot highlights local clusters (“hotspots”) of high or low PC1 values across the tissue slide.
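The spatial lag that the Moran scatterplot places on its y-axis can also be computed directly with `spdep::lag.listw()`. The sketch below is illustrative (hence `eval=FALSE`) and assumes the `lw` weights and `pca_example_slide` object created above; `lagged_pc1` is a hypothetical name.

```{r spatial lag sketch, eval=FALSE}
# Spatially lagged PC1: the weighted mean of each tile's k = 6 neighbors
lagged_pc1 <- lag.listw(lw, pca_example_slide$PC1)

# With row-standardized ("W") weights, the slope of the lag regressed
# on PC1 corresponds to global Moran's I
coef(lm(lagged_pc1 ~ pca_example_slide$PC1))[2]
```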
```{r metrics 2}
moran.plot(pca_example_slide$PC1, lw, labels = FALSE,
    main = "Moran scatterplot of PC1")

# LISA visualization
df_lisa <- data.frame(coords, Ii = lisa[, "Ii"])
ggplot(df_lisa, aes(x = tile_x, y = tile_y, color = Ii)) +
    geom_point(size = 0.5) +
    scale_color_viridis_c() +
    coord_equal() +
    theme_minimal() +
    ggtitle("Local Moran's I (LISA) for PC1")
```

# Adding HoVerNet Nuclei Features

You can import HoVerNet segmentation results as a `SpatialExperiment` or `SpatialFeatureExperiment`. In this section we show how to integrate HoVerNet classification and segmentation output with Prov-GigaPath embeddings.

```{r importHover}
# import HoVerNet
hov_file <- paste0(
    "https://store.cancerdatasci.org/hovernet/h5ad/",
    "TCGA-23-1021-01Z-00-DX1.F07C221B-D401-47A5-9519-10DE59CA1E9D.h5ad.gz"
)
hn_spe <- HoverNet(hov_file, outClass = "SpatialExperiment") |> import()

# import Prov-GigaPath
tile_prov_url <- paste0(
    "https://store.cancerdatasci.org/provgigapath/tile_level/",
    "TCGA-23-1021-01Z-00-DX1.F07C221B-D401-47A5-9519-10DE59CA1E9D.csv.gz"
)
pg_spe <- ProvGiga(tile_prov_url) |> import()
```

```{r cell coordinates}
# Extract cell coordinates from HoVerNet
cell_coords <- spatialCoords(hn_spe)

# Extract nuclei metadata and attach the coordinates
cell_meta <- colData(hn_spe)
cell_meta$x <- cell_coords[, 1]
cell_meta$y <- cell_coords[, 2]
```

# Visualizing HoVerNet nuclei vs. tile coordinates

Plotting the HoVerNet nuclei coordinates together with the tile coordinates shows that they do not match perfectly. You can use `matchHoverNetToTiles()` to compute the scaling factor.
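Before matching, a rough per-axis estimate of the scale factor can be read off the coordinate extents alone. This is a minimal sketch (hence `eval=FALSE`) that assumes both coordinate systems share the same origin; it is not necessarily the computation performed by `matchHoverNetToTiles()`, and `scale_x`/`scale_y` are hypothetical names.

```{r manual scale sketch, eval=FALSE}
# The ratio of coordinate ranges approximates the nuclei-to-tile scale
# factor along each axis (assumes a shared origin)
scale_x <- diff(range(cell_meta$x)) / diff(range(pca_example_slide$tile_x))
scale_y <- diff(range(cell_meta$y)) / diff(range(pca_example_slide$tile_y))
c(scale_x = scale_x, scale_y = scale_y)
```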
```{r plotting hovernet vs tiles PCA}
plot(cell_meta$x, cell_meta$y, pch = 16, col = "#0000FF20")
points(pca_example_slide$tile_x, pca_example_slide$tile_y,
    pch = 16, col = "#FF000020")
```

# Scale factor between nuclei coordinates and tile coordinates

```{r compute scaling factor}
match_hv_pg <- matchHoverNetToTiles(hn_spe, pg_spe)
```

```{r visualizing tile level with hovernet}
ggplot(match_hv_pg$tiles_with_nuclei,
       aes(tile_x, tile_y, color = cell_type_label, size = N)) +
    geom_point(alpha = 0.7) +
    coord_equal() +
    theme_minimal() +
    labs(title = "All HoVerNet cell types per tile")

ggplot(match_hv_pg$tiles_with_nuclei,
       aes(tile_x, tile_y, color = dominant_cell_type)) +
    geom_point(size = 2) +
    coord_equal() +
    theme_minimal() +
    labs(title = "Per-tile dominant HoVerNet cell type")
```

# Session Info

```{r sessioninfo}
sessionInfo()
```

# References