---
title: "ActiveDriverWGS"
author: "Helen Zhu, Juri Reimand"
date: "`R Sys.Date()`"
output: 
  rmarkdown:::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{ActiveDriverWGS}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      warning=FALSE,
                      message=FALSE,
                      width=500)
options(max.print=35)
run_genome_chunks <- requireNamespace("BSgenome.Hsapiens.UCSC.hg19", quietly = TRUE)
```

# Introduction
Cancer is driven by somatic mutations. A few of these mutations confer a survival advantage to the tumour (drivers) while most mutations play a passive role in tumour development (passengers). Most known driver mutations are located in protein-coding genes in whole exome sequencing datasets. As whole genome sequencing datasets are increasingly available, new methods are required to discover driver mutations in the vast noncoding regulatory genome. ActiveDriverWGS is a statistical model to detect candidate driver mutations genome-wide.

### The ActiveDriverWGS Model
ActiveDriverWGS is a recurrence-based method which builds on the idea that driver mutations are subject to positive selection and should appear more frequently than expected by chance alone. This method analyzes the mutational burden of **SNVs and short indels** (less than 50bps) in functionally defined genomic **elements**. These elements can include the protein-coding regions of genes (i.e., exons), as well as noncoding regulatory regions such as promoters, enhancers and untranslated regions. Each element can include one or multiple sub-elements; for example protein-coding genes have multiple exons tested together in ActiveDriverWGS while the introns of the are considered as background.

Optionally, the tested genomic elements may also include **active sites** of interest which reside within the elements themselves. Examples of such active sites include post-translational modification (PTM) sites in protein-coding genes, miRNA binding sites in mRNA sequences and transcription factor binding sites (TFBS) in enhancers. If sites are specified in addition to the elements, enrichment of site-specific mutations are estimated in elements with enriched mutational burden. This additional test asks if the sites in a predicted driver element are enriched in mutations even beyond the mutations seen in the given driver region.

\newline
ActiveDriverWGS uses a Poisson generalized linear regression to compare the mutational burden of elements against the expected mutational burden of a background window (Figure 1). In our work, we have optimized the background to be 50,000 bps upstream and downstream of the elements. For an element comprising a single sub-element the background sequence would total to 100 kbps. 

If an element is segmented as multiple sub-elements, such as the exons of a protein coding gene, the inter-segment sequences are also used to calculate the expected background rate, up to +/- 50 kbps around every segment. ActiveDriverWGS also incorporates the effect of mutational signatures, specifically the probability distribution of SNVs arising across trinucleotide contexts which vary with mutational processes. Users also have the option to exclude hyper-mutated samples which decrease the accuracy of driver discovery.

The default genome build for ActiveDriverWGS is **hg19**. Additional options for human (**hg38**) and mouse (**mm9**, **mm10**) are now supported. Please refer to the user manual and the option `ref_genome` of `ActiveDriverWGS()`. 

### Publication
For a more detailed reference on ActiveDriverWGS, please refer to the [publication](https://doi.org/10.1016/j.molcel.2019.12.027).
Helen Zhu*, Liis Uuskula-Reimand*, Keren Isaev*, Lina Wadi, Azad Alizada, Shimin Shuai, Vincent Huang, Dike Aduluso-Nwaobasi, Marta Paczkowska, Diala Abd-Rabbo, Oliver Ocsenas, Minggao Liang, J. Drew Thompson, Yao Li, Luyao Ruan, Michal Krassowski, Irakli Dzneladze, Jared T. Simpson, Mathieu Lupien, Lincoln D. Stein, Paul C. Boutros, Michael D. Wilson, Jüri Reimand. **Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks**. Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027 .


```{r pressure, echo=FALSE, fig.cap="The ActiveDriverWGS Model", out.width = '75%'}
knitr::include_graphics("ADWGS_diagram.png")
```

# Input Data
ActiveDriverWGS requires a file for somatic `mutations` and a file for genomic `elements`. A third optional file for `sites` can be specified by the user. Elements may contain multiple segments, each represented in a separate row. Segments belonging to the same element must share a common id unique to the element. Sites are only incorporated in the model if they reside within elements but not all elements need to contain sites. Elements may contain multiple sites. Site ids have to match element ids.

### Mutations

The `mutations` data must be in a data frame containing the columns with the correct column names `chr`, `pos1`, `pos2`, `ref`, `alt`, and `patient`. Additional columns may be included but will not be analyzed. Patient ID is required since per element, only one mutation per patient is counted towards element-specific mutation rate and driver prediction. This allows us to avoid inflation of significance in cases where an element is locally hypermutated in a single patient. 

1) `chr`: autosomal chromosomes as chr1 to chr22 and sex chromosomes as chrX and chrY for the human genome, and chr1 to chr19 for mouse

2) `pos1`: the start position of the mutation in base 1 coordinates

3) `pos2`: the end position of the mutation in base 1 coordinates

4) `ref`: the reference allele as a string containing the bases A, T, C or G

5) `alt`: the alternate allele as a string containing the bases A, T, C or G

6) `patient`: the patient identifier as a string

### Elements and Sites

The `elements` and `sites` data must be in data frames containing the columns with the correct column names `chr`, `start`, `end`, and `id`. Additional columns may be included but will not be analyzed.

1) `chr`: autosomal chromosomes as chr1 to chr22 and sex chromosomes as chrX and chrY

2) `start`: the start position of the element or site in base 0 coordinates (BED format)

3) `end`: the end position of the element or site in base 0 coordinates (BED format)

4) `id`: the element identifier - if the element contains multiple segments such as exons, each segment should be a separate row with the segment coordinates and the element identifier as id. Elements can be coding or noncoding such as exons of protein coding genes or active enhancers.

### Example
```{r input}
library(ActiveDriverWGS)

data("cll_mutations")
head(cll_mutations)

data("cancer_genes")
head(cancer_genes)

data("cancer_gene_sites")
head(cancer_gene_sites)

```

## Importing BED12 & BED4 Files as Input Regions
For elements and sites written in BED12 files and BED4 files, the functions `prepare_elements_from_BED12()` and `prepare_elements_from_BED4()` can be used to read the file and adapt it to fulfill the format requirements for the `elements` and `sites` parameters of the ActiveDriverWGS function. For more information on the BED12 or BED4 format, please refer to the [UCSC guidelines](https://genome.ucsc.edu/FAQ/FAQformat.html#format1). In this example, elements are adapted from annotations for protein coding genes from [GENCODE.v19](https://www.gencodegenes.org/human/release_19.html) for chromosome 17 and sites are adapted from post-translational modification (TPM) sites from [ActiveDriverDB](https://activedriverdb.org/).

```{r prepare_elements_from_BED12 and prepare_elements_from_BED4}
elements = prepare_elements_from_BED12(
  system.file(
    "extdata", 
    "chr17.coding_regions.bed", 
    package = "ActiveDriverWGS", 
    mustWork = TRUE))

head(elements)

sites = prepare_elements_from_BED4(
  system.file(
    "extdata", 
    "chr17.PTM_sites.bed", 
    package = "ActiveDriverWGS", 
    mustWork = TRUE))

head(sites)
```

# Basic Use
ActiveDriverWGS can be run with the mutations file, the elements file and an optional sites file. In this example, mutations are adapted from the [Alexandrov et al, 2013](https://www.nature.com/articles/nature12477) dataset for chronic lymphocytic leukemia (CLL) patients. Regions are adapted from the [cancer gene census](https://cancer.sanger.ac.uk/census) and annotations for protein coding genes are adapted from GENCODE.v19.

```{r ActiveDriverWGS, eval = run_genome_chunks}

some_genes = c("ATM", "MYD88", "NOTCH1")

results = ActiveDriverWGS(mutations = cll_mutations,
                          elements = cancer_genes[cancer_genes$id %in% some_genes,],
                          sites = cancer_gene_sites)

```

All three genes have a significant enrichment of somatic mutations as indicated by the `fdr_element` field. Note that NOTCH1 is also highlighted for its site-related mutations since two of two mutations in NOTCH1 affect a PTM site and the FDR value `fdr_site` is also significant. 

## Parameter Interpretation
ActiveDriverWGS has several adjustable parameters:

1) `window_size`: A background window both upstream and downstream of the element in which mutation rates are assumed to remain constant. We have optimized this parameter on the PCAWG dataset to be 50,000 bps for SNVs and indels. For a single non-fragmented element the background sequence is thus 100 kbps, while elements with multiple fragments capture more sequence space and therefore their background sequence is added up to a maximum of 2x50 kbps per sub-element.

2) `filter_hyper_MB`: The threshold for the number of somatic mutations (SNVs and indels combined) per megabase above which a sample is considered hypermutated. We define the default to be 30 mutations/megabase according to published literature. The genome is assumed to be 3000 mbps.

3) `recovery.dir`: The directory for writing recovery files for ActiveDriverWGS. If the directory does not exist, it will be created. If the parameter is unspecified, recovery files will not be saved. As an ActiveDriverWGS query for large datasets may be computationally heavy, specifying a recovery directory will recover previously computed results if a query is interrupted. The results of each element are stored in this folder and can be recovered if the process is terminated while in progress. 

4) `mc.cores`: The number of cores that the user wishes to allocate to running ActiveDriverWGS. For more information, refer to the R package [parallel](https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf).

# Interpreting the Results

```{r results, eval = run_genome_chunks}
results

```

ActiveDriverWGS will return results in a data frame format with the following columns.

1) `id`: The identifier for the element.

2) `pp_element`: The p-value associated with enrichment of mutations in the element. A value of NA indicates that no mutations fall within the element.

3) `element_muts_obs`: Number of patients with mutations in the element. A value of NA indicates that no mutations fall within the element.

4) `element_muts_exp`: Number of expected patients with mutations in the element. A value of NA indicates that no mutations fall within the element.

5) `element_enriched`: Boolean indicating whether an enrichment of mutations in the element is observed. A value of NA indicates that no mutations fall within the element.

6) `pp_site`: The p-value associated with enrichment of mutations in the sites.

7) `site_muts_obs`: Number of patients with mutations in the sites. A value of 0 means that sites exist but are unaffected by mutations. A value of NA indicates that no site resides within the element.

8) `site_muts_exp`: Number of expected patients with mutations in the sites. A value of 0 means that sites exist but are unaffected by mutations. A value of NA indicates that no site resides within the element.

9) `site_enriched`: Boolean indicating whether an enrichment of mutations in the sites is observed. A value of NA indicates that no site resides within the element.

10) `fdr_element`: FDR corrected p-value associated with the element. 

11) `fdr_site`: FDR corrected p-value associated with the site.

12) `has_site_mutations`: "V" indicating the presence of site mutations. An empty string indicates that no site mutations are present.

## Multiple Testing Corrections
In ActiveDriverWGS, multiple testing is performed for all given regions using the Benjamini-Hochberg FDR method. We encourage the users to filter by FDR corrected values rather than p-values to eliminate false positives. FDR correction makes the conservative assumption that genes/regions with 0 mutations have P = 1. FDR correction for sites is conducted over a restricted set of hypotheses, comprising genes/regions that have a significant enrichment of mutations (FDR < 0.05) at the level of genes or regions.


# Adapting ActiveDriverWGS to High Performance Computing Clusters
Compute time increases with the number of samples, mutations and regions. Hence, the two main functions integral to ActiveDriverWGS have also been made available in the package and can be adapted to individual local high performance computing clusters (HPCCs). One approach is to assign a subset of testable elements (and/or cancer types) to individual compute nodes and later use an additional script that collects the data from these nodes. 

### 1. format_muts
This function formats the mutations data frame, removes hyper-mutated samples and removes non-mitochondrial mutations in extrachromosomal regions. It adds an additional column to the mutations data frame that provides the trinucleotide context of the given mutation which will be later used to estimate the mutational distribution across signatures.

### 2. ADWGS_test
This function calculates the enrichment of mutations for a particular region id. It applies a Poisson generalized linear regression model across mutation signatures to identify enriched regions.

## Example
The following example demonstrates how to build an ActiveDriverWGS pipeline which can be adapted to HPCCs. Parallel jobs are executed for a list of element ids whereas mutation data and element coordinates are prepared prior to the process. Note that the creation of GRanges objects is part of the `ActiveDriverWGS()` wrapper function and must be completed manually by users wishing to create personalized pipelines. The function `ADWGS_test()` needs a link to the reference genome object (from the R package `BSgenome`) as an argument, and assumes that the mutations and element coordinates match the reference genome. Also, note that multiple testing corrections will need to be recalculated by the user after the results have been collected from individual jobs.

```{r pipeline, eval = run_genome_chunks}
library(GenomicRanges)

# Loading elements & creating a GRanges object
data(cancer_genes)
gr_element_coords = GRanges(seqnames = cancer_genes$chr,
                            IRanges(start = cancer_genes$start,
                                    end = cancer_genes$end),
                            mcols = cancer_genes$id)

# Loading sites & creating a GRanges object
data(cancer_gene_sites)
gr_site_coords = GRanges(seqnames = cancer_gene_sites$chr,
                         IRanges(start = cancer_gene_sites$start,
                                 end = cancer_gene_sites$end),
                         mocols = cancer_gene_sites$id)

# Loading mutations, format muts & creating a GRanges object
data(cll_mutations)

# load the default reference genome
this_genome = BSgenome.Hsapiens.UCSC.hg19::Hsapiens

# format_muts
cll_mutations = format_muts(cll_mutations, this_genome,
                            filter_hyper_MB = 30)

gr_maf = GRanges(cll_mutations$chr,
                 IRanges(start = cll_mutations$pos1,
                         end = cll_mutations$pos2),
                 mcols = cll_mutations[,c("patient", "tag")])

# Examplifying the ATM Element
id = "ATM"
```
**Note** that when splitting tasks using the `ADWGS_test` function, only the parameter `id` needs to modified for each element while `gr_element_coords`, `gr_sites` and `gr_maf` can be the complete datasets. However, the compute time and memory can be saved in very large analyses by providing more-optimal subsets of these datasets for each test, for example those limited to specific chromosomes.
```{r result, eval = run_genome_chunks}
# Result of 1 input element
result = ADWGS_test(id = id,
                    gr_element_coords = gr_element_coords,
                    gr_site_coords = gr_site_coords,
                    gr_maf = gr_maf,
                    win_size = 50000, this_genome = this_genome)

result
```

## Technical Support
For questions, technical support or to report bugs and errors, please use our [GitHub](https://github.com/reimandlab/ActiveDriverWGSR).