---
title: "Import HoverNet and ProvGigaPath features with imageFeatureTCGA"
author: "Ilaria Billato"
date: "`r BiocStyle::doc_date()`"
output:
  BiocStyle::html_document:
    toc: true
    number_sections: true
    toc_float: true
    toc_depth: 3
package: imageFeatureTCGA
bibliography: ../inst/references.bib
vignette: >
  %\VignetteIndexEntry{imageFeatureTCGA}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# imageFeatureTCGA

```{r setup, include=FALSE}
knitr::opts_chunk$set(cache = TRUE, echo = TRUE)
```

```{r load_packages, include=TRUE, results="hide", message=FALSE, warning=FALSE}
library(imageFeatureTCGA)
library(SummarizedExperiment)
library(dplyr)
```

# Overview

`imageFeatureTCGA` provides convenient access to histopathology-derived data
from **TCGA** through two complementary pipelines:

- **HoVerNet** → cell segmentation and classification [@graham2019hover]
- **ProvGigaPath** → slide- and tile-level embeddings [@xu2024gigapath]

These datasets can be imported directly into R as **Bioconductor objects**,
facilitating downstream integration with TCGA omics and clinical data.

# Installation

```{r install, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("waldronlab/imageFeatureTCGA")
```

# Data structure and technical details

The datasets accessible through `imageFeatureTCGA` originate from whole-slide
histopathology images processed by deep learning pipelines. They are
distributed as precomputed features to avoid the computational cost of running
segmentation and embedding models locally.

## HoVerNet outputs

HoVerNet provides nuclei segmentation and classification results at the
single-cell level.
Each detected nucleus is represented by:

- spatial coordinates (`x`, `y`) in pixel units relative to the slide
- predicted cell type labels
- class probabilities
- polygon contours describing nuclear boundaries (when available)

When imported as a `SpatialExperiment` or `SpatialFeatureExperiment`, the data
are structured as follows:

- **columns** represent individual nuclei
- **colData** stores cell-level metadata (coordinates, cell types)
- **assays** may contain quantitative features (e.g., probabilities)
- **metadata** may include segmentation contours and image information

These objects enable spatial analyses and integration with other Bioconductor
workflows for spatial transcriptomics and imaging data.

## ProvGigaPath embeddings

ProvGigaPath is a foundation model trained on large-scale pathology image
tiles that produces high-dimensional embeddings summarizing visual and
morphological features.

Two levels of embeddings are provided:

### Slide-level embeddings

Slide-level embeddings summarize the entire whole-slide image into a single
feature vector.

- one row per slide
- embedding dimension corresponds to the encoder output size
- suitable for slide-level prediction or clustering tasks

### Tile-level embeddings

Tile-level embeddings provide localized representations of tissue regions.
Each tile entry includes:

- spatial coordinates (`tile_x`, `tile_y`) corresponding to the tile position
  on the slide
- a high-dimensional embedding vector
- optional metadata describing tile extraction parameters

These embeddings enable spatial analyses of tissue heterogeneity and can be
integrated with cell-level data from HoVerNet using complementary packages
such as `imageTCGAutils`.

# Available Data

Use the following function to download the catalog of available files:

```{r getcatalog}
getCatalog()
```

## Formats

- **HoVerNet** data is available in `JSON`, `GeoJSON`, `thumb` and `H5AD`
  formats.
- **ProvGigaPath** data is available in CSV format.
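To check which formats are actually listed for each pipeline, the catalog can be tabulated. The following is a minimal sketch rather than part of the package: `summarizeFormats()` is a hypothetical helper, and it assumes the catalog is a data frame with `pipeline` and `format` columns (both are named as catalog metadata fields in the import section of this vignette). No particular values in those columns are assumed.

```r
library(dplyr)

# Hypothetical helper (not part of imageFeatureTCGA): count catalog entries
# per pipeline and file format.
# Assumes `catalog` is a data frame with `pipeline` and `format` columns.
summarizeFormats <- function(catalog) {
    catalog |>
        dplyr::count(pipeline, format, name = "n_files")
}

# Usage with the real catalog (requires the package and network access):
# summarizeFormats(getCatalog())
```

Keeping the tabulation in a small function makes it easy to reuse on a filtered catalog, for example one restricted to a single `Project.ID`.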
Note that the `thumb` format refers to the PNG thumbnails of the whole-slide
images.

### HoVerNet data

```{r listhover}
getCatalog("hovernet")
```

### ProvGigaPath data

```{r listprovgiga}
getCatalog("provgigapath")
```

# Importing HoVerNet data

You can import HoVerNet segmentation results as either a `SpatialExperiment`
or a `SpatialFeatureExperiment`. Here we select a single file by its filename,
but you can also filter on other metadata fields such as `Project.ID`,
`pipeline`, `format`, etc.

```{r importHover}
hspe <- getCatalog("hovernet") |>
    dplyr::filter(
        filename == paste(
            "TCGA-VG-A8LO-01A-01-DX1",
            "B39A4D64-82A1-4A04-8AB6-918F3058B83B",
            "json", "gz", sep = "."
        )
    ) |>
    getFileURLs() |>
    HoverNet(outClass = "SpatialExperiment") |>
    import()
hspe
```

Each cell is represented with:

- `x`, `y` spatial coordinates
- cell type and type probabilities
- optional contours stored in metadata

```{r spcoords}
colData(hspe)
```

# Importing ProvGigaPath embeddings

## Slide-level embeddings

ProvGigaPath embeddings summarize tile- or slide-level image features. In this
example, we import slide-level embeddings for a single file. Each row
corresponds to a slide, with an embedding vector describing the image-derived
features.

```{r importProvGiga}
getCatalog("provgigapath") |>
    dplyr::filter(
        filename == paste(
            "TCGA-VG-A8LO-01A-01-DX1",
            "B39A4D64-82A1-4A04-8AB6-918F3058B83B",
            "csv", "gz", sep = "."
        ) & level == "slide_level"
    ) |>
    getFileURLs() |>
    ProvGiga() |>
    import()
```

## Tile-level embeddings

ProvGigaPath tile-level embeddings provide a more granular representation of
image features. Each row corresponds to a tile, with spatial coordinates
(`tile_x`, `tile_y`) and an embedding vector describing the image-derived
features for that tile. In this example, we filter the catalog to the
tile-level file corresponding to the same slide as above.
```{r}
getCatalog("provgigapath") |>
    dplyr::filter(
        filename == paste(
            "TCGA-VG-A8LO-01A-01-DX1",
            "B39A4D64-82A1-4A04-8AB6-918F3058B83B",
            "csv", "gz", sep = "."
        ) & level == "tile_level"
    ) |>
    getFileURLs() |>
    ProvGiga() |>
    import()
```

# Importing multiple ProvGigaPath files

You can also import multiple files at once. Here we filter the catalog to the
first three slide-level files for the TCGA-GBM project and import them through
a `ProvGigaList`. Each element of the list corresponds to a slide, with the
same structure as described above for slide-level embeddings. Note that the
`ProvGigaList` constructor can also accept a vector of file paths or URLs.

The `import` method for `ProvGigaList` imports each file in the list and
returns either a single `SummarizedExperiment` or a list of
`SummarizedExperiment` objects, depending on whether the input files share a
single data level. In this example, the catalog is filtered to a slide-level
subset, so the output is a single `SummarizedExperiment` object with three
columns corresponding to the three slides.

```{r ProvGigaList}
pgl <- getCatalog("provgigapath") |>
    dplyr::filter(level == "slide_level", Project.ID == "TCGA-GBM") |>
    dplyr::slice(1:3) |>
    getFileURLs() |>
    ProvGigaList() |>
    import()
pgl
```

## Mixed-level imports

The `ProvGigaList` constructor can also accept a mix of slide- and tile-level
files. In this case, the `import` method returns a list of
`SummarizedExperiment` objects, one for each data level. Here we filter the
catalog to include both slide- and tile-level files for the same slide, and
import them together.

```{r ProvGigaListMixed}
pgl_mixed <- getCatalog("provgigapath") |>
    dplyr::filter(
        filename %in% c(
            paste(
                "TCGA-VG-A8LO-01A-01-DX1",
                "B39A4D64-82A1-4A04-8AB6-918F3058B83B",
                "csv", "gz", sep = "."
            )
        ) & level %in% c("slide_level", "tile_level")
    ) |>
    getFileURLs() |>
    ProvGigaList() |>
    import()
pgl_mixed
```

# See also

Further documentation is available through the MOFA and Point Pattern
Analysis vignettes in the `imageTCGA` manuscript [repository][1].

[1]: https://github.com/billila/manuscript_imageTCGA/

Note: more vignettes will be added as new feature types and workflows become
available.

# Shiny App: *imageTCGA*

The [imageTCGA](https://github.com/billila/imageTCGA) Shiny application
provides an interactive interface for exploring TCGA Diagnostic Image Database
metadata.

Click here to explore the Shiny app:
[imageTCGA](https://shiny.sph.cuny.edu/app/imageTCGA/)

# Session Info
```{r sessioninfo}
sessionInfo()
```