---
title: "Import HoVerNet and ProvGigaPath features with imageFeatureTCGA"
author: "Ilaria Billato"
date: "`r BiocStyle::doc_date()`"
output:
  BiocStyle::html_document:
    toc: true
    number_sections: true
    toc_float: true
    toc_depth: 3
package: imageFeatureTCGA
bibliography: ../inst/references.bib
vignette: >
  %\VignetteIndexEntry{imageFeatureTCGA}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# imageFeatureTCGA

```{r setup, include=FALSE}
knitr::opts_chunk$set(cache = TRUE, echo = TRUE)
```

```{r load_packages, include=TRUE, results="hide", message=FALSE, warning=FALSE}
library(imageFeatureTCGA)
library(SummarizedExperiment)
library(dplyr)
```

# Overview

`imageFeatureTCGA` provides convenient access to
histopathology-derived data from **TCGA** through two complementary pipelines:

- **HoVerNet** → cell segmentation and classification [@graham2019hover]
- **ProvGigaPath** → slide- and tile-level embeddings [@xu2024gigapath]

These datasets can be imported directly into R as **Bioconductor objects**,
facilitating downstream integration with TCGA omics and clinical data.

# Installation

```{r install, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("waldronlab/imageFeatureTCGA")
```

# Data structure and technical details

The datasets accessible through `imageFeatureTCGA` originate from
whole-slide histopathology images processed by deep learning pipelines.
They are distributed as precomputed features to avoid the computational
cost of running segmentation and embedding models locally.

## HoVerNet outputs

HoVerNet provides nuclei segmentation and classification results at the
single-cell level. Each detected nucleus is represented by:

- spatial coordinates (`x`, `y`) in pixel units relative to the slide
- predicted cell type labels
- class probabilities
- polygon contours describing nuclear boundaries (when available)

When imported as a `SpatialExperiment` or `SpatialFeatureExperiment`,
the data are structured as follows:

- **columns** represent individual nuclei
- **colData** stores cell-level metadata (coordinates, cell types)
- **assays** may contain quantitative features (e.g., probabilities)
- **metadata** may include segmentation contours and image information

These objects enable spatial analyses and integration with other
Bioconductor workflows for spatial transcriptomics and imaging data.
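
As a rough illustration of the cell-level layout described above, the sketch
below uses a plain `data.frame` in place of the imported object. The column
name `cell_type`, the labels, and all values are invented for illustration;
the actual column names in imported objects may differ.

```{r toyNuclei}
# Toy stand-in for HoVerNet cell-level metadata (all values are invented).
nuclei <- data.frame(
    x = c(120, 340, 355, 900, 910, 1500),
    y = c(80, 400, 410, 220, 230, 1300),
    cell_type = c(
        "neoplastic", "inflammatory", "inflammatory",
        "connective", "neoplastic", "neoplastic"
    )
)

# Number of nuclei per predicted cell type
type_counts <- table(nuclei$cell_type)
type_counts

# Nuclei inside a rectangular region of interest (pixel units)
roi <- subset(nuclei, x >= 300 & x <= 1000 & y >= 200 & y <= 500)
roi
```

The same two operations (tabulating labels and subsetting by coordinates)
should carry over to the cell-level metadata of an imported object once the
real column names are known.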

## ProvGigaPath embeddings

ProvGigaPath is a foundation model trained on large-scale pathology image
tiles that produces high-dimensional embeddings summarizing visual and
morphological features.

Two levels of embeddings are provided:

### Slide-level embeddings

Slide-level embeddings summarize the entire whole-slide image into a
single feature vector.

- one row per slide
- embedding dimension corresponds to the encoder output size
- suitable for slide-level prediction or clustering tasks
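
A minimal sketch of a slide-level clustering task, assuming a small random
matrix in place of real embeddings. The slide names, the 16 dimensions, and
the choice of Euclidean distance with hierarchical clustering are
illustrative assumptions, not part of the package.

```{r toySlideEmbeddings}
set.seed(1)

# Toy slide-level embeddings: 4 slides x 16 dimensions (random values;
# real embeddings have the encoder's output dimension).
slide_emb <- matrix(
    rnorm(4 * 16),
    nrow = 4,
    dimnames = list(paste0("slide", 1:4), paste0("dim", 1:16))
)

# Pairwise Euclidean distances between slides
slide_dist <- dist(slide_emb)

# Hierarchical clustering, cut into two groups
slide_groups <- cutree(hclust(slide_dist), k = 2)
slide_groups
```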

### Tile-level embeddings

Tile-level embeddings provide localized representations of tissue regions.

Each tile entry includes:

- spatial coordinates (`tile_x`, `tile_y`) corresponding to the tile
  position on the slide
- a high-dimensional embedding vector
- optional metadata describing tile extraction parameters

These embeddings enable spatial analyses of tissue heterogeneity and can
be integrated with cell-level data from HoVerNet using complementary
packages such as `imageTCGAutils`.
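
One way to look at tissue heterogeneity is to group tiles with similar
embeddings and map the groups back to tile positions. The sketch below does
this on invented data: the grid spacing, the 8 embedding dimensions, and the
use of k-means are assumptions chosen for illustration only.

```{r toyTileEmbeddings}
set.seed(42)
n_tiles <- 20

# Toy tile table: positions on a 4 x 5 grid (values are invented).
tiles <- data.frame(
    tile_x = rep(seq(0, 768, by = 256), times = 5),
    tile_y = rep(seq(0, 1024, by = 256), each = 4)
)

# Toy embeddings: one 8-dimensional vector per tile
tile_emb <- matrix(rnorm(n_tiles * 8), nrow = n_tiles)

# Group tiles with similar embeddings (k-means with 2 clusters)
tiles$cluster <- kmeans(tile_emb, centers = 2)$cluster

# Tiles per cluster, and the first few tiles with their assignment
table(tiles$cluster)
head(tiles)
```

Because each tile keeps its `tile_x` and `tile_y`, the cluster labels can be
plotted over the slide layout to see whether similar tiles are spatially
contiguous.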


# Available Data

Use the following function to download the catalog of available files:

```{r getcatalog}
getCatalog()
```

## Formats

- **HoVerNet** data are available in `JSON`, `GeoJSON`, `thumb`, and `H5AD`
    formats.
- **ProvGigaPath** data are available in `CSV` format.

The `thumb` format refers to PNG thumbnails of the whole-slide images.

### HoVerNet data

```{r listhover}
getCatalog("hovernet")
```

### ProvGigaPath data

```{r listprovgiga}
getCatalog("provgigapath")
```

# Importing HoVerNet data

You can import HoVerNet segmentation results as either a `SpatialExperiment` or
`SpatialFeatureExperiment`. Here we selectively import a file based on its
filename, but you can also filter by other metadata fields such as `Project.ID`,
`pipeline`, `format`, etc.

```{r importHover}
hspe <- getCatalog("hovernet") |>
    dplyr::filter(
        filename == paste(
            "TCGA-VG-A8LO-01A-01-DX1",
            "B39A4D64-82A1-4A04-8AB6-918F3058B83B",
            "json",
            "gz",
            sep = "."
        )
    ) |>
    getFileURLs() |>
    HoverNet(outClass = "SpatialExperiment") |>
    import()
hspe
```
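
Filtering on other metadata fields follows the same `dplyr::filter()`
pattern. The sketch below runs on a small invented table rather than the
real catalog: the column names come from the text above, but the filenames
and the values in `pipeline` and `format` are hypothetical, so check the
output of `getCatalog()` for the actual values.

```{r toyCatalogFilter}
library(dplyr)

# Toy catalog using the column names mentioned above (rows are invented).
toy_catalog <- data.frame(
    filename = c(
        "slideA.json.gz", "slideA.geojson.gz",
        "slideB.json.gz", "slideB.csv.gz"
    ),
    Project.ID = c("TCGA-GBM", "TCGA-GBM", "TCGA-OV", "TCGA-OV"),
    pipeline = c("hovernet", "hovernet", "hovernet", "provgigapath"),
    format = c("json", "geojson", "json", "csv")
)

# Combine several conditions; comma-separated conditions are ANDed together
selected <- toy_catalog |>
    dplyr::filter(
        pipeline == "hovernet",
        Project.ID == "TCGA-GBM",
        format == "json"
    )
selected
```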

Each cell is represented with:

- `x`, `y` spatial coordinates
- cell type and type probabilities
- optional contours stored in metadata

```{r spcoords}
colData(hspe)
```

# Importing ProvGigaPath embeddings

## Slide-level embeddings

ProvGigaPath embeddings summarize tile- or slide-level image features. In this
example, we import slide-level embeddings for a single file. Each row
corresponds to a slide, with an embedding vector describing the image-derived
features.

```{r importProvGiga}
getCatalog("provgigapath") |>
    dplyr::filter(
        filename == paste(
            "TCGA-VG-A8LO-01A-01-DX1",
            "B39A4D64-82A1-4A04-8AB6-918F3058B83B",
            "csv",
            "gz",
            sep = "."
        ) &
        level == "slide_level"
    ) |>
    getFileURLs() |>
    ProvGiga() |>
    import()
```

## Tile-level embeddings

ProvGigaPath tile-level embeddings give a more granular view of the image
features. Each row corresponds to a tile, with spatial coordinates (`tile_x`,
`tile_y`) and an embedding vector describing the image-derived features for
that tile. In this example, we filter the catalog to the tile-level file
corresponding to the same slide as above.

```{r importProvGigaTile}
getCatalog("provgigapath") |>
    dplyr::filter(
        filename == paste(
            "TCGA-VG-A8LO-01A-01-DX1",
            "B39A4D64-82A1-4A04-8AB6-918F3058B83B",
            "csv",
            "gz",
            sep = "."
        ) &
        level == "tile_level"
    ) |>
    getFileURLs() |>
    ProvGiga() |>
    import()
```

# Importing multiple ProvGigaPath files

You can also import multiple files at once. Here we filter the catalog to the
first three slide-level files for the TCGA-GBM project and import them through
a `ProvGigaList`. Each element of the list corresponds to a slide, with the
same structure as described above for slide-level embeddings. The
`ProvGigaList` constructor also accepts a vector of file paths or URLs. The
`import` method for `ProvGigaList` imports each file in the list and returns a
single `SummarizedExperiment` when all input files share one data level, or a
list of `SummarizedExperiment` objects when they span several levels. In this
example, the catalog is filtered to a slide-level subset, so the output is a
single `SummarizedExperiment` object with three columns corresponding to the
three slides.

```{r ProvGigaList}
pgl <- getCatalog("provgigapath") |>
    dplyr::filter(level == "slide_level", Project.ID == "TCGA-GBM") |>
    dplyr::slice(1:3) |>
    getFileURLs() |>
    ProvGigaList() |>
    import()
pgl
```
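
Once the slides are combined in one `SummarizedExperiment`, standard
accessors such as `dim()` and `assay()` apply. The sketch below builds a toy
object with three columns to show the idea. The assay name `embedding`, the
16 rows, and the random values are assumptions made for illustration, not
the layout the package guarantees.

```{r toySlideSE}
library(SummarizedExperiment)

set.seed(1)

# Toy stand-in: 16 embedding dimensions (rows) x 3 slides (columns)
emb <- matrix(
    rnorm(16 * 3),
    nrow = 16,
    dimnames = list(paste0("dim", 1:16), paste0("slide", 1:3))
)
toy_se <- SummarizedExperiment(assays = list(embedding = emb))

# Rows x columns of the combined object
dim(toy_se)

# Pairwise correlation between slides, computed on the embedding matrix
slide_cor <- cor(assay(toy_se, "embedding"))
round(slide_cor, 2)
```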

## Mixed level imports

The `ProvGigaList` constructor can also accept a mix of slide- and tile-level
files. In this case, the `import` method will return a list of
`SummarizedExperiment` objects, one for each data level. Here we filter the
catalog to include both slide- and tile-level files for the same slide, and
import them together.

```{r ProvGigaListMixed}
pgl_mixed <- getCatalog("provgigapath") |>
    dplyr::filter(
        filename %in% c(
            paste(
                "TCGA-VG-A8LO-01A-01-DX1",
                "B39A4D64-82A1-4A04-8AB6-918F3058B83B",
                "csv",
                "gz",
                sep = "."
            )
        ) &
        level %in% c("slide_level", "tile_level")
    ) |>
    getFileURLs() |>
    ProvGigaList() |>
    import()
pgl_mixed
```
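
The grouping behaviour described above can be pictured with a small base R
sketch. This is not the package's import code: it only mimics, on an
invented file table, the idea of returning one result per data level, and
the names `toy_files`, `by_level`, and `result` are hypothetical.

```{r toySplitByLevel}
# Toy file table: two slide-level entries and one tile-level entry
toy_files <- data.frame(
    filename = c("slideA.csv.gz", "slideB.csv.gz", "slideA.csv.gz"),
    level = c("slide_level", "slide_level", "tile_level")
)

# Group filenames by data level
by_level <- split(toy_files$filename, toy_files$level)

# One level: a single result; several levels: a named list, one per level
result <- if (length(by_level) == 1L) by_level[[1]] else by_level
names(result)
```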


# See also

Further documentation is available in the MOFA and Point Pattern Analysis
vignettes in the `imageTCGA` manuscript [repository][1].

[1]: https://github.com/billila/manuscript_imageTCGA/

Note: More vignettes will be added as new feature types and workflows become
available.

# Shiny App: *imageTCGA*

The [imageTCGA](https://github.com/billila/imageTCGA) Shiny application
provides an interactive interface for exploring TCGA Diagnostic Image Database
metadata.

Explore the Shiny app here:
[imageTCGA](https://shiny.sph.cuny.edu/app/imageTCGA/)

# Session Info

<details>
<summary>Click here for Session Info</summary>

```{r sessioninfo}
sessionInfo()
```
</details>