---
title: "scafari_vignette"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
    
vignette: >
  %\VignetteIndexEntry{scafari_vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    fig.width = 7, # Width of plots in inches
    fig.height = 5 # Height of plots in inches
)
```

# Introduction
Single-cell DNA sequencing (scDNA-seq) enables the analysis of mutation profiles
at single-cell resolution. This cutting-edge technique makes it possible to shed
light on the previously inaccessible complexity of heterogeneous tissues such
as tumors.
scafari is an R package that offers easy-to-use data quality control as well as 
explorative variant analyses and visualization for scDNA-seq data.

Other scDNA-seq analysis packages are:
- mosaic (https://missionbio.github.io/mosaic/)
- optima  `r BiocStyle::Biocpkg('optima')`
- scDNA (https://github.com/bowmanr/scDNA)

# Requirements
To run scafari, you need R (Version 4.4 or higher).

# Running scafari
## Installation

A development version of scafari will be available on GitHub and can be 
installed as follows:

```{r install, eval=F}
BiocManager::install("scafari")
```

##  Getting started
A basic workflow analysis of scDNA-seq data is demonstrated in the following on
an exemplary dataset. 
First we need to load scafari.

```{r loading , eval=T, warning=FALSE}
library(scafari)
```

Once the package is installed and have been loaded, the analysis can start.
scafari is available as R package and shiny GUI. You can run the shiny version
by RStudio executing the function `launchScafariShiny()`.
The scafari workflow can be split into four main parts.

1. Sequencing information and quality control
2. Panel analysis
3. Variant analysis
4. Analyses of variants of special interest

These parts are represented in the four tabs of the shiny app and several 
functions in the R package.

## Sequencing information and quality control
The analysis start with uploading the .h5 file. In the shiny app an interactive
upload is provided on the starting page.
Information from the .h5 file are loaded in an sce class object in this step. 

```{r reading, message=FALSE}
library(SingleCellExperiment)

# Load .h5 file
h5_file_path <- system.file("extdata", "demo.h5", package = "scafari")
h5 <- h5ToSce(h5_file_path)
sce <- h5$sce_amp
se.var <- h5$se_var
```

After successful data upload the metadata including sequencing information can 
be accessed and basic quality information can be inspected.

```{r metadata}
# Get metadata
metadata(sce)
```

Further, reads per barcode can be visualized in a log-log plot.

```{r loglopplot, message=FALSE}
logLogPlot(sce)
```


## Panel analysis
The panel analysis start by normalizing the read counts using 
`normalizeReadCounts()`. The normalized read counts are stored in the 
corresponding assay (`normalized.counts`).

```{r normalize_rc}
sce <- normalizeReadCounts(sce = sce)
```

After read counts are normalized they can be annotated with biological relevant
information such as exon number and canonical transcript ID.

Therefore, known canonicals should be provided. They can be dowloaded like it 
is shown here: 

```{r known_canon, eval=TRUE}
library(R.utils)

temp_dir <- tempdir()
known_canon_path <- file.path(
    temp_dir,
    "UCSC_hg19_knownCanonical_goldenPath.txt"
)

if (!file.exists(known_canon_path)) {
    url <- paste0(
        "http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/",
        "knownCanonical.txt.gz"
    )
    destfile <- file.path(
        temp_dir,
        "UCSC_hg19_knownCanonical_goldenPath.txt.gz"
    )

    # Download the file
    download.file(url, destfile)

    # Decompress the file
    R.utils::gunzip(destfile, remove = FALSE)
}
```


```{r annotate_amp}
try(annotateAmplicons(sce = sce, known.canon = known_canon_path))
```

The exact genomic locations of the amplicons can be inspected in the amplicon
distribution plot.

```{r plot_amp_dist, warning=FALSE}
plotAmpliconDistribution(sce)
```

The amplicon performance plot depicts normalized read counts per amplicon. 
This can help to identify low performing amplicons and get deeper insights into
potential copy number alterations (Sarah). 

```{r plot_norm_rc}
plotNormalizedReadCounts(sce = sce)
```

Insights in the normalized read counts per amplicon on single cell resolution 
is provided in the panel uniformity plot. Each dot represenents one cell and 
each row an amplicon. The amplicons are sorted by their mean reads, from high 
to low. Amplicons meeting at least 20\% of the average depth of coverage are 
represented in blue, others in orange.

```{r plot_panel_uniformity, warning=FALSE}
plotPanelUniformity(sce, interactive = FALSE)
```


## Variant analysis
The variant analysis starts with variant filtering and is performed on criteria
such as sequencing depth, variant allele frequency (VAF), genoytpe data and 
genotype quality data. Further, the minimum percentage of cells that could be
assigned with a genotype other than missing (wild type, heterozygous,
homozygous) and the minimum percentage of cells with mutated genotypes 
(heterozygous, homozygous).


```{r filter_variants, message = FALSE}
library(SummarizedExperiment)

filteres <- filterVariants(
    depth.threshold = 10,
    genotype.quality.threshold = 30,
    vaf.ref = 5,
    vaf.het = 35,
    vaf.hom = 95,
    min.cell = 50,
    min.mut.cell = 1,
    se.var = se.var,
    sce = sce,
    shiny = FALSE
)
se.f <-
    SummarizedExperiment(
        assays =
            list(
                VAF = t(filteres$vaf.matrix.filtered),
                Genotype = t(filteres$genotype.matrix.filtered),
                Genoqual = t(filteres$genoqual.matrix.filtered)
            ),
        rowData = filteres$variant.ids.filtered,
        colData = filteres$cells.keep
    )

# Filter out cells in sce object
# Find the indices of the columns to keep
indices_to_keep <- match(filteres$cells.keep,
    SummarizedExperiment::colData(sce)[[1]],
    nomatch = 0
)

# Subset the SCE using these indices
sce_filtered <- sce[, indices_to_keep]
SingleCellExperiment::altExp(sce_filtered, "variants") <- se.f
```

After filtering the variants are annotated with biological information like gene
names and clinical variant metrics. By default, annotation is performed via the
Mission Bio API, which provides information on the names of the gene and the 
affected protein, the coding effect, the functional classification (e.g. 
intronic, coding), ClinVar information, the dbSNP ID and
the DANN (Deleterious Annotation of genetic variants using Neural Networks) 
score. 
The annotated variants can than be inspected.

```{r annotate_variants}
suppressPackageStartupMessages(library(DT))
suppressPackageStartupMessages(library(dplyr))

# Load sce object with filtered variants
# Check metadata
metadata(altExp(sce_filtered))

# Annotate variants
sce_filtered <- annotateVariants(sce = sce_filtered)

# Check metadata. Now a annotation flag is present
metadata(altExp(sce_filtered))

# The annotations are stored in the rowData
rowData(altExp(sce_filtered)) %>%
    as.data.frame() %>%
    head() %>%
    datatable()
```

To gain deeper insights into the VAF of filtered variants across the cells, a 
VAF heatmap with genotype and chromosome annotations is available.

```{r plot_variant_heatmap, fig.height=8}
plotVariantHeatmap(sce = sce_filtered)
```


More detailed insights into the genotype and their quality can be obtained in a
violin plot. 

```{r plot_gt_quality}
plotGenotypequalityPerGenotype(sce = sce_filtered)
```


## Analysis of variants of interest
To further focus on selected variants of interest, there is a set of function 
to analyze user selected variants. Thereby, variants forming cell 
clusters/clones can be identified and analyzed regarding VAF and genotype. 
First the user need to define a set of variants of interest.

```{r variants_of_interest}
variants.of.interest <- c(
    "KIT:chr4:55599436:T/C",
    "TET2:chr4:106158216:G/A",
    "FLT3:chr13:28610183:A/G",
    "TP53:chr17:7577427:G/A"
)
```


After selecting the variants of interest, the cells can be clustered based on 
their VAF. The clustering returns the k-means results and the clusterplot itself
in a list object.
For stable k-means clustering a seed can be used `set.seed(1)`.


```{r cluster_variants, fig.show='hide'}
plot <- clusterVariantSelection(
    sce = sce_filtered,
    variants.of.interest = variants.of.interest,
    n.clust = 3
)

names(plot)
```

The clustered cells plot is stored in `clusterplot`.

```{r show_clusters}
plot$clusterplot
```

To compare the clusters regarding the variant selection there are various 
visualizations available. The distribution of variant allele frequency (VAF) 
within each cell cluster can be obtained using `plotClusterVAF()`. 

```{r plot_clusters_vaf, warning=FALSE}
plotClusterVAF(
    sce = sce_filtered,
    variants.of.interest = variants.of.interest,
    gg.clust = plot$clusterplot
)
```

```{r plot_clusters_vaf_map, warning=FALSE}
plotClusterVAFMap(
    sce = sce_filtered,
    variants.of.interest = variants.of.interest,
    gg.clust = plot$clusterplot
)
```


Further, the clusters can be analyzed regarding the genotype using 
`plotClusterGenotype()`.

```{r plot_clusters_gt, warning=FALSE}
plotClusterGenotype(
    sce = sce_filtered,
    variants.of.interest = variants.of.interest,
    gg.clust = plot$clusterplot
)
```


## Data
# Session information

```{r session}
sessionInfo()
```