---
title: "Practical uses for provided CpG probe annotations"
author:
- Sean K. Maden
date: "`r format(Sys.time(), '%d %B, %Y')`"
package: recountmethylation
output:
  BiocStyle::html_document:
    code_folding: show
    toc: no
    tocfloat: no
  BiocStyle::pdf_document: 
    toc: no
vignette: > 
  %\VignetteIndexEntry{Practical uses for CpG annotations}
  %\VignetteDepends{RCurl}
  %\usepackage[UTF-8]{inputenc} 
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

This vignette provides several practical demonstrations of how to use 
HM450K and EPIC platform CpG probe annotations, which are provided in the 
platform-specific manifests. These examples are shown for an example 
`RGChannelSet` using functions from the `minfi` package.

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
libv <- c("minfi", "minfiData", "minfiDataEPIC", "ggplot2")
sapply(libv, library, character.only = TRUE)
```

# Probe manifest background

The CpG probe manifest files were provided by Illumina and contain a number of
useful fields pertaining to the genome location of a CpG probe. When we read 
DNAm IDAT files into a `SummarizedExperiment` object using the `minfi` package 
(e.g. using `read.metharray.exp()` or similar), they are automatically
combined with an array platform manifest containing rich probe-level details 
about the probe chemistry, the genome location, and various functional features 
and annotations to be aware of.

## Getting the HM450K manifest annotations from an `RGChannelSet`

We can easily access the manifest of the older HM450K platform. Using an 
example `RGChannelSet` object from the `minfiData` package, we can extract 
and inspect the full HM450K platform manifest with `getAnnotation()` as follows:

```{r}
rg <- get(data("RGsetEx"))
man <- as.data.frame(getAnnotation(rg))
colnames(man)
```

Array platform manifests are made available in the Bioconductor packages
`IlluminaHumanMethylation450kmanifest` and `IlluminaHumanMethylationEPICmanifest`, 
so be sure to have these installed before working with DNAm arrays as 
`SummarizedExperiment` objects. These manifests are unchanged from the original 
manifest files, which can be accessed from Illumina's website. The HM450K manifest 
can be found [here](https://support.illumina.com/array/array_kits/infinium_humanmethylation450_beadchip_kit/downloads.html), and the EPIC manifest can be found [here](https://support.illumina.com/array/array_kits/infinium-methylationepic-beadchip-kit/downloads.html). 
The official documents also include detailed descriptions of the manifest columns.

## Getting the EPIC manifest annotations from an `RGChannelSet`

The newer EPIC platform contains a manifest with updated probe information, 
including several new annotation columns useful for probe filters.

```{r}
rg.epic <- get(data("RGsetEPIC"))
man.epic <- as.data.frame(getAnnotation(rg.epic))
colnames(man.epic)
```

# Annotation-based probe filters

There are many reasons we may want to filter or remove a CpG probe prior to 
performing downstream analyses such as differential methylation tests or 
epigenome-wide association tests. For instance, it is common to filter CpG probes 
which are likely to be impacted by underlying population genetic variation, or
probes which have consistent poor quality. Two common reasons to filter probes 
are , or based on studies showing extensive cross-reactivity that limits
a probe's reliability. We can see how to perform these filters below.

## Filtering on common SNPs

We can filter a probe based on high likelihood of a confounding single nucleotide
polymorphism (SNP), or a variant that occurs near a probe (e.g. within the 50bp annealing
sequence), or if it overlaps the exact CpG location. These probe-proximal SNPs can
be identified either from the manifest (see above), or accessed independently using 
`getSnpInfo()`. For instance:

```{r}
getSnpInfo(rg)
```

It is recommended that CpG probes which contain a common SNP be filtered prior to analysis.
Common SNPs are identified based on their minor allele frequency (MAF, e.g. from 
columns `Probe_maf`, `CpG_maf`, or `SBE_maf`). 

The function `dropLociWithSnps()` allows one to filter a mapped `MethylSet` or 
`RatioSet` based on the SNP MAF frequency. For example:

```{r}
gm <- preprocessRaw(rg) # make MethylSet
gms <- mapToGenome(gm) # make GenomicMethylSet
gmsf <- dropLociWithSnps(gms) # filter probes with SNPs
```

The minimum MAF frequency filter can be changed in `dropLociWithSnps()` by 
setting the `maf` argument. See `?dropLociWithSnps` for details. Also note the
SNP information is a product of its reference, and details about the specific
genetic variant databases mined for this SNP info can be found in the manifest
documentation (see above).

## Filtering on cross-reactive CpG probes

A number of probes on both the EPIC and HM450K platforms were previously found to
possess high cross-reactivity, an undesirable quality that limits probe measurement 
reliability. We can identify and remove probes with evidence of cross-reactivity
from either platform by simply filtering on the identifiers of the probes to be
removed. Several commonly utilized sources of CpG IDs for cross-reactive probes 
include:

* Chen et al 2013 cross-reactive HM450K probes ([source](https://www.tandfonline.com/doi/full/10.4161/epi.23470)). Note, these 
probes have been uploaded [here](https://github.com/metamaden/methyPre/tree/master/data).

* Pidsley et al 2016 cross-reactive EPIC probes ([source](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1066-1)).
Note, these probes have also been uploaded [here](https://github.com/metamaden/methyPre/tree/master/data).

Access lists of cross-reactive probes with the `recountmethylation` function 
using `get_crossreactive_cpgs()` (see `?get_crossreactive_cpgs` for details).

# Querying and quantifying genome regions and functional groups

Many DNAm studies note whether methylation occurs in a gene's promoter or body region.
One reason is that we expect the functional consequences of DNAm to vary given its
location in a gene, or indeed if it occurs in a gene at all. Another common way to 
annotate CpG probes is according to their position relative to protein-coding gene 
regions as well as relative to cytosine- and guanine-dense regions called CG Islands. 
We can quickly group probes according to either their gene or CG Island region using 
information in the manifest, as shown below.

## Find probes mapping to a gene

The gene IDs can be found under the columns `UCSC_RefGene_Name` (for the common
gene name) and `UCSC_RefGene_Accession` (for the official RefGene accession).
Note that entries in these columns correspond to transcript labels, so gene names
and IDs may be repeated multiple times separated by a semi-colon character. This
formatting means we need to use regular expression to isolate all the probes by 
a given gene name. 

We can quantify the number of intergenic (non-gene-mapping) and genic (gene-mapping)
probes with:

```{r}
as.data.frame(table(man$UCSC_RefGene_Name==""))
```

We can further grab all probes mapping to the gene `ARID1A` using:

```{r}
gene.str <- "ARID1A" # common gene name
gene.patt <- paste0("(^|;)", gene.str, "($|;)") # regex pattern
which.gene <- which(grepl(gene.patt, man$UCSC_RefGene_Name)) # gene filter
manf <- man[which.gene,]
dim(manf)
```

## Viewing gene functional region frequencies

We can find the gene relation annotations under the column `UCSC_RefGene_Group` in
the manifest. Values under this column are either blank, if a probe does not map
to a gene (a.k.a. "intergenic probes"), or contains one or more location annotations
corresponding to various overlapping transcript IDs. 

To get all mapped gene regions as a vector, use:

```{r}
gene.regionv <- unlist(strsplit(man$UCSC_RefGene_Group, ";"))
```

To view all possible group values, we can use:

```{r}
unique(gene.regionv)
```

To view the frequency with which each region is mapped by an array probe, use:

```{r}
df <- as.data.frame(table(gene.regionv)) # get group frequencies
ggplot(df, aes(x = gene.regionv, y = Freq, fill = gene.regionv)) + theme_bw() +
  geom_bar( stat = "identity") + xlab("Gene group") + ylab("Num. probes") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))
```

## Identifying promoter-mapping and gene body-mapping CpG probes

We can use this logic of the `UCSC_RefGene_Group` variable to build conditions 
for new functional regions of interest, namely promoter and gene body regions. 

For promoters, we can use a common definition of a probe that maps to either the 
`5'UTR`, `TSS1500`, or `TSS200` annotation labels. We can make the new promoter 
variable using:

```{r}
# make pattern: "(^|;)5'UTR($|;)|(^|;)TSS1500($|;)|(^|;)TSS200($|;)"
prom.catv <- c("5'UTR", "TSS1500", "TSS200")
prom.patt <- paste0("(^|;)", prom.catv, "($|;)", collapse = "|")
# use regex to detect promoter-mapping probes
man$promoter <- ifelse(grepl(prom.patt, man$UCSC_RefGene_Group),
                       TRUE, FALSE)
as.data.frame(table(man$promoter))
```

One way we can define the gene body is as all gene-mapping probes which don't
map to the gene promoter. Using a similar strategy as for the new `promoter` 
variable, we can define a new `gene_body` variable with:

```{r}
# make pattern: "(^|;)Body($|;)|(^|;)3'UTR($|;)"
body.catv <- c("Body", "3'UTR")
body.patt <- paste0("(^|;)", body.catv, "($|;)", collapse = "|")
# use regex to detect body-mapping probes
man$gene_body <- ifelse(grepl(body.patt, man$UCSC_RefGene_Group),
                       TRUE, FALSE)
table(man$gene_body)
```

Since the manifest is transcript-centric, you will find that some probes map to
the promoter of one transcript and the body of another transcript. You can view
these as follows:

```{r, eval = FALSE}
dfp <- as.data.frame(table(man$gene_body, man$promoter))
dfp$region <- c("intergenic", "body-only", "promoter-only", "body-and-promoter")
ggplot(dfp, aes(x = region, y = Freq, fill = region)) + theme_bw() +
  geom_bar( stat = "identity") + xlab("Gene region") + ylab("Num. probes") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))
```

## CG Island annotations

It is comparatively straightforward to view a probe's relation to a CG Island 
from the manifest using the variable `Relation_to_Island`. Unlike with the 
transcript-centric gene annotations, values in this variable occur 1:1 with probe 
IDs, as only one category applies to any given genome coordinate location.

We can see the unique categories for this variable from the manifest using:

```{r}
unique(man$Relation_to_Island)
```

Categories in `Relation_to_Island` are [defined in the official documentation](https://support.illumina.com/content/dam/illumina-support/documents/downloads/productfiles/methylationepic/infinium-methylationepic-manifest-column-headings.pdf) 
as:

* `Island` : CG Island, a region with high cytosine- and guaning content. If a 
probe maps to a CG Island, the island name/coordinates can be found in the column `Islands_Name`.
* `N_Shore`/`S_Shore` : The upstream (5'/North) and downstream (3'/South) shore 
regions. These are up to 2kbp-wide regions flanking CG islands.
* `N_Shelf`/`S_Shelf` : The upstream (5'/North) and downstream (3'/South) shelf 
regions. These are 2-4kbp-wide regions flanking CG Islands and their shores.
* `OpenSea` : Any regions occurring between CG islands.

We can plot the frequency of relative CG Island locations across HM450K array
probes as follows:

```{r, eval = FALSE}
dfp <- as.data.frame(table(man$Relation_to_Island))
ggplot(dfp, aes(x = Var1, y = Freq, fill = Var1)) + theme_bw() +
  geom_bar(stat = "identity") + xlab("Relation_to_Island") + ylab("Num. probes") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))
```

It is also easy to add the "genic" information, or whether a probe maps to a gene,
alongside the CG Island annotations, as follows:

```{r}
dfp <- as.data.frame(table(man$Relation_to_Island, man$UCSC_RefGene_Name==""))
dfp$region <- paste0(dfp$Var1," ; ", c(rep("genic", 6), rep("intergenic", 6)))
ggplot(dfp, aes(x = region, y = Freq, fill = Var1)) + theme_bw() +
  geom_bar(stat = "identity") + xlab("Relation_to_Island;Genic_status") + 
  ylab("Num. probes") +
  theme(legend.position = "none", 
        axis.text.x = element_text(angle = 45, hjust = 1))
```

# Conclusions

This vignette described practical uses for the platform-specific, manifest-based 
annotations for CpG probes on HM450K and EPIC platforms. 

Additional resources, information, and guidance for analyzing compilations of 
DNAm array datasets can be found in the following vignettes:

* [`recountmethylation`](https://bioconductor.org/packages/release/bioc/html/recountmethylation.html) R/Bioconductor package documentation 
* [`minfi`](https://bioconductor.org/packages/release/bioc/html/minfi.html) R/Bioconductor package and documentation
* [Illumina's EPIC array documentation](https://support.illumina.com/array/array_kits/infinium-methylationepic-beadchip-kit/downloads.html) This official resource includes helpful descriptions of manifest
columns in the file called `infinium-methylationepic-manifest-column-headings.pdf`.