---
title: '`methyvim`: Variable Importance Measures for Differential Methylation'
author: |
  | [Nima Hejazi](http://nimahejazi.org)
  | Division of Biostatistics
  | University of California, Berkeley
  | Berkeley, CA, 94720, USA
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document
bibliography: vignette-refs.bib
abstract: |
  We present a general algorithm for the nonparametric estimation of effects of
  DNA methylation at CpG sites scattered across the genome, complete with honest
  statistical inference for such estimates. This approach leverages variable
  importance measures, a class of parameters that arise in the study of causal
  inference. The parameters we persent are defined in such a manner that they
  provided targeted estimates of the relative importance of CpG sites in the
  case of binary exposures/treatments assigned at the level of subjects. Such
  parameters come equipped with rich scientific interpretations, providing an
  avenue to move beyond linear models, applying modern developments in machine
  learning to estimating quantities of scientific interest in computational
  biology.
vignette: >
  %\VignetteIndexEntry{methyvim: Variable Importance for DNA Methylation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Introduction

DNA methylation is a fundamental epigenetic process known to play an important
role in the regulation of gene expression. DNA methylation occurrs as a chemical
modification of CpG sites in which a methyl group ($\text{CH}_3$) is added to a
carbon of the cytosine ring structure; in humans, this process is carried out by
a set of eznymes known as the the DNA methyltransferases. Numerous biological
and medical studies have implicated DNA methylation as playing a role in disease
processes with a molecular signature. Perhaps unsurprisingly then, numerous
biotechnologies have been developed to probe the molecular mechanisms of this
epigenetic process. Modern assays, like the Illumina _Infinium_ arrays, allow
for DNA methylation signatures at nearly $850,000$ CpG sites scattered across
the genome to be measured simultaneously; moreover, much effort has been
invested, by the bioinformatics community, in the development of tools for
properly removing technological effects may contaminate biological signatures
measured by such assays. Despite these leaps and bounds in both biological and
bioninformatical techniques, most statistical methods available for the analysis
of data produced by such assays rely on (generalized) linear models.

Here, we present an alternative to such approaches, in the form of nonparametric
estimation procedures inspired by machine learning and causal inference.
Specifically, we provide a technique for obtaining estimates of nonparametric
_variable importance measures_ (__VIM__), parameters with rich scientific
interpretations under the standard (untestable) assumptions used in statistical
causal inference, defining a limited set of VIMs specifically with respect to
the type of data commonly produced by DNA methylation assays. For VIMs defined
in such a manner, targeted minimum loss-based estimates may be readily computed
based on the data made available by DNA methylation assays. Our contribution,
`methyvim` is an R package that provides facilities for performing differential
methylation analyses within exactly this scope.

As the substantive contribution of our work is an estimation procedure, we focus
on data produced by 450k and 850k (EPIC) arrays made by Illumina and assume that
data has been subjected to proper quality control and normalizaton procedures,
as outlined by others in the computational biology community
[@fortin2014functional, @dedeurwaerder2013comprehensive]. For a general
discussion of the framework of targeted minimum loss-based estimation and the
role this approach plays in statistical causal inference, the interested reader
is invited to consult @vdl2011targeted and @vdl2017targeted. For a more general
introduction to (statistical) causal inference, @pearl2009causality and
@hernan2018causal may be of interest.

---

## Methodology

The core functionality of this package is made available via the eponymous
`methyvim` function, which implements a statistical algorithm designed to
compute targeted estimates of VIMs, defined in such a way that the VIMs
represent parameters of scientific interest in computational biology
experiments; moreover, these VIMs are defined such that they may be estimated in
a manner that is very nearly assumption-free, that is, within a _fully
nonparametric statistical model_. __The statistical algorithm consists in
several major steps:__

1. Pre-screening of genomic sites is used to isolate a subset of sites for
  which there is cursory evidence of differential methylation. For the sake of
  computational feasibility, targeted minimum loss-based estimates of VIMs are
  computed only for this subset of sites. Currently, the available screening
  approach adapts core routines from the
  [`limma`](http://bioconductor.org/packages/release/bioc/html/limma.html) R
  package. Future releases will support functionality from other
  packages (e.g., [`randomForest`](https://cran.r-project.org/web/packages/randomForest/index.html),
  [`tmle.npvi`](https://CRAN.R-project.org/package=tmle.npvi)). Following the
  style of the function for performing screening via `limma`, users may write
  their own screening functions and are invited to contribute such functions to
  the core software package by opening pull requests at the GitHub repository.

2. Nonparametric estimates of VIMs, for the specified target parameter, are
  computed at each of the CpG sites passing the screening step. The VIMs are
  defined in such a way that the estimated effects is of an exposure/treatment
  on the methylation status of a target CpG site, controlling for the observed
  methylation status of the neighbors of that site. Currently, routines are
  adapted from the [`tmle`](https://CRAN.R-project.org/package=tmle) R package.
  Future releases will support doubly-robust estimates of these VIMs (via the
  [`drtmle`](https://cran.r-project.org/web/packages/drtmle/index.html) package)
  and add parameters for continuous treatments/exposures (via the
  [`tmle.npvi`](https://CRAN.R-project.org/package=tmle.npvi) package).

3. Since pre-screening is performed prior to estimating VIMs, we make use of a
  multiple testing correction uniquely suited to such settings. Due to the
  multiple testing nature of the estimation problem, a variant of the Benjamini
  & Hochberg procedure for controlling the False Discovery Rate (FDR) is applied
  [@benjamini1995controlling]. Specifically, we apply the "modified marginal
  Benjamini & Hochberg step-up False Discovery Rate controlling procedure for
  multi-stage analyses" (FDR-MSA), which is guaranteed to control the FDR as if
  all sites were tested (i.e., without screening) [@tuglus2009modified].

---

## Parameters of Interest

For discrete-valued treatments or exposures:

* The _average treatment effect_ (ATE): The effect of a binary exposure or
  treatment on the observed methylation at a target CpG site is estimated,
  controlling for the observed methylation at all other CpG sites in the same
  neighborhood as the target site, based on an additive form. In particular, the
  parameter estimate represents the __additive difference__ in methylation that
  would have been observed at the target site had all observations received the
  treatment versus the counterfactual under which none received the treatment.

* The _relative risk_ (RR): The effect of a binary exposure or treatment on the
  observed methylation at a target CpG site is estimated, controlling for the
  observed methylation at all other CpG sites in the same neighborhood as the
  target site, based on an geometric form. In particular, the parameter estimate
  represents the __multiplicative difference__ in methylation that would have
  been observed at the target site had all observations received the treatment
  versus the counterfactual under which none received the treatment.

Support for continuous-valued treatments or exposures is _planned but not yet
available_, though work is underway to incorporate into our methodology the
following

* A _nonparametric variable importance measure_ (NPVI) [@chambaz2012estimation]:
  The effect of continous-valued exposure or treatment (the observed methylation
  at a target CpG site) on an outcome of interest is estimated, controlling for
  the observed methylation at all other CpG sites in the same neighborhood as
  the target (treatment) site, based on a parameter that compares values of the
  treatment against a reference value taken to be the null. In particular, the
  implementation provided is designed to assess the effect of differential
  methylation at the target CpG site on a (typically) phenotype-level outcome
  of interest (e.g., survival), in effect providing an nonparametric evaluation
  of the impact of methylation at the target site on said outcome.

_As previously noted, in all cases, an estimator of the target parameter is
constructed via targeted minimum loss-based estimation._

Having now discussed the foundational principles of the estimation procedure
employed and the statistical algorithm implemented, it is best to proceed by
examining `methyvim` by example.

---

## Preliminaries: Setting up the Data

```{r reqs, echo=FALSE}
suppressMessages(library(tmle))
suppressMessages(library(minfi))
suppressMessages(library(SummarizedExperiment))
```

First, we'll load the `methyvim` package and the example data contained in the
`methyvimData` package that accompanies it:

```{r prelims}
set.seed(479253)
library(methyvim)
library(methyvimData)
```

Now, let's load the data set and seed the RNG:

```{r get-data}
data(grsExample)
grsExample
var_int <- as.numeric(colData(grsExample)[, 1])
```

The example data object is of class `GenomicRatioSet`, provided by the `minfi`
package. The summary provided by the `print` method gives a wealth of
information on the experiment that generated the data -- since we are working
with a simulated data set, we need not concern ourselves with much of this
information.

We can create an object of class `methytmle` from any `GenomicRatioSet` object
simply invoking the S4 class constructor:

```{r make-methytmle}
mtmle <- .methytmle(grsExample)
```

Since the `methytmle` class builds upon the `GenomicRatioSet` class, it contains
all of the slots of `GenomicRatio`s. The new class introduced in the `mehtyvim`
package includes several new slots:

* `call` - the form of the original call to the `methyvim` function.
* `screen_ind` - indices identifying CpG sites that pass the screening process.
* `clusters` - non-unique IDs corresponding to the manner in wich sites are
  treated as neighbors. These are assigned by genomic distance (bp) and respect
  chromosome boundaries (produced via a call to `bumphunter::clusterMaker`).
* `var_int` - the treatment/exposure status for each subject. Currently, these
  must be binary, due to the definition of the supported targeted parameters.
* `param` - the name of the target parameter from which the estimated VIMs are
  defined.
* `vim` - a table of statistical results obtained from estimating VIMs for each
  of the CpG sites that pass the screening procedure.
* `ic` - the measured array values for each of the CpG sites passing the
  screening, transformed into influence curve space based on the chosen target
  parameter.

The interested analyst might consider consulting the documentation of the
`minfi` package for an in-depth description of all of the other slots that
appear in this class [@aryee2014minfi]. Having examined the core structure of
the package, it is time now to discuss the analytic capabilities implemented.

---

## Differential Methylation Based on a Binary Treatment or Exposure

### The Average Treatment Effect as Variable Importance Measure

The average treatment effect (ATE) is a canonical parameter that arises in
statistical causal inference, often denoted $\psi_0 = \psi_0(1) - \psi_0(0)$,
representing the difference in an outcome between the counterfactuals under
which all subjects received the treatment/exposure and under which none received
such treatment/exposure. Under additional (untestable) assumptions, this
parameter has a richer interpretation as a mean counterfactual outcome, wherein
the counterfactuals used in this definition define causal effects. When causal
assumptions remain unfulfilled or untested, this parameter may still be
estimated in the form of a nonparametric VIM.

Estimating such the VIM corresponding to such a parameter requires two separate
regression steps: one for the treatment mechanism (propensity score) and one for
the outcome regression. Technical details on the nature of these regressions are
discussed in @hernan2018causal, and details for estimating these regressions in
the framework of targeted minimum loss-based estimation are discussed in
@vdl2011targeted.

#### Super Learning for nonparametric parameter estimation

Nonparametric and data-adaptive regressions (i.e., machine learning) may be used
in the two regression steps outlined above. This is implemented using the Super
Learner algorithm, which produces optimal combinations of such regression
functions (i.e., stacked regressions) using cross-validation [@vdl2007super,
@breiman1996stacked].

`methyvim` makes performing such estimation for CpG sites using a given VIM
essentially trivial:

```{r methyvim-ate-sl}
suppressMessages(
  methyvim_ate_sl <- methyvim(data_grs = grsExample, sites_comp = 25,
                              var_int = var_int, vim = "ate", type = "Mval",
                              filter = "limma", filter_cutoff = 0.10,
                              parallel = FALSE, tmle_type = "sl"
                             )
)
methyvim_ate_sl
```

As is clear from examining the object `methyvim_ate_sl`, the output resembles
exactly that returned when examining objects of class `GenomicRatioSet` from the
`minfi` R package. In particular, the returned `methytmle` object is merely a
modified form (in particular, a subclass) of the input `GenomicRatioSet` object
-- thus, it contains all of the original slots, with all experimental data
intact. Several extra pieces of information are contained within the output
object as well_.

We can take a look at the results produced from the estimation procedure by
examining the `"vim"` `slot` of the `methytmle` object:

```{r methyvim-ate-sl-print}
head(slot(methyvim_ate_sl, "vim"))
```

From the table displayed, we note that we have access to point estimates of the
ATE ("est_ATE") as well as lower and upper confidence interval bounds for each
estimate ("lowerCI_ATE" and "upperCI_ATE", respectively). Additional statistical
information we have access to include the variance ("Var_ATE") of the estimate
as well as the p-value ("pval") associated with each estimate (based on
Wald-style testing procedures). Beyond these, key bioinformatical quantities
(with respect to the algorithm outlined above) are also returned; these include
the total number of neighbors of the target site, the number of neighboring
sites controlled for when estimating the effect of exposure on DNA methylation,
and, finally, the maximum correlation between the target site and any given site
in its full set of neighbors.

#### Generalized linear models for parameter estimation

In cases where nonparametric regressions may not be preferred (e.g., where time
constraints are of concern), generalized linear models (GLMs) may be used to fit
the two regression steps required for estimating a VIMs for the ATE.

`methyvim` makes performing such estimation for CpG sites using a given VIM
essentially trivial:

```{r methyvim-ate-glm}
suppressMessages(
  methyvim_ate_glm <- methyvim(data_grs = grsExample, sites_comp = 25,
                               var_int = var_int, vim = "ate", type = "Mval",
                               filter = "limma", filter_cutoff = 0.10,
                               parallel = FALSE, tmle_type = "glm"
                              )
)
methyvim_ate_glm
```

Just as before, we can take a look at the results produced from the estimation
procedure by examining the `"vim"` `slot` of the `methytmle` object:

```{r methyvim-ate-glm-print}
head(slot(methyvim_ate_glm, "vim"))
```

_Remark:_ Here, the estimates are obtained via GLMs, making each of the
regression steps less robust than if nonparametric regressions were used. It is
expected that these estimates differ from those obtained previously.

---

### The Risk Ratio as Variable Importance Measure

The risk ratio (RR) is another popular parameter that arises in statistical
causal inference, denoted $\psi_0 = \frac{\psi_0(1)}{\psi_0(0)}$, representing
the multiplicative contrast of an outcome between the counterfactuals under
which all subjects received the treatment/exposure and under which none received
such treatment/exposure. Under additional (untestable) assumptions, this
parameter has a richer interpretation as a mean counterfactual outcome, wherein
the counterfactuals used in this definition define causal effects. When causal
assumptions remain unfulfilled or untested, this parameter may still be
estimated in the form of a nonparametric VIM.

Just as before (in the case of the ATE), there are two regression steps required
for estimating VIMs based on this parameter. We do so in a manner analogous to
that described previously.

#### Super Learning for nonparametric parameter estimation

Nonparametric and data-adaptive regressions (i.e., machine learning) may be used
in the two regression steps required for estimating a VIM based on the RR. This
is implemented using the Super Learner algorithm.

`methyvim` makes performing such estimation for CpG sites using a given VIM
essentially trivial:

```{r methyvim-rr-sl}
methyvim_rr_sl <- methyvim(data_grs = grsExample, sites_comp = 25,
                            var_int = var_int, vim = "rr", type = "Mval",
                            filter = "limma", filter_cutoff = 0.10,
                            parallel = FALSE, tmle_type = "sl"
                           )
methyvim_rr_sl
```

We can take a look at the results produced from the estimation procedure by
examining the `"vim"` `slot` of the `methytmle` object:

```{r methyvim-rr-sl-print}
head(slot(methyvim_rr_sl, "vim"))
```


#### Generalized linear models for parameter estimation

In cases where nonparametric regressions may not be preferred (e.g., where time
constraints are of concern), generalized linear models (GLMs) may be used to fit
the two regression steps required for estimating a VIMs for the ATE.

`methyvim` makes performing such estimation for CpG sites using a given VIM
essentially trivial:

```{r methyvim-rr-glm}
methyvim_rr_glm <- methyvim(data_grs = grsExample, sites_comp = 25,
                            var_int = var_int, vim = "rr", type = "Mval",
                            filter = "limma", filter_cutoff = 0.10,
                            parallel = FALSE, tmle_type = "glm"
                           )
methyvim_rr_glm
```

Just as before, we can take a look at the results produced from the estimation
procedure by examining the `"vim"` `slot` of the `methytmle` object:

```{r methyvim-rr-glm-print}
head(slot(methyvim_rr_glm, "vim"))
```

_Remark:_ Here, the estimates are obtained via GLMs, making each of the
regression steps less robust than if nonparametric regressions were used. It is
expected that these estimates differ from those obtained previously.

---

## Data Analysis with `methyvim`

In order to explore practical applications of the `methyvim` package, as well as
the full set of utilities it provides, our toy example (of just $10$ CpG sites)
is unfortunately insufficient. To proceed, we will use a publicly available
example data set produced by the Illumina 450K array, from the `minfiData` R
package. Now, let's load the package and data set, and take a look

```{r setup-minfidata}
suppressMessages(library(minfiData))
data(MsetEx)
mset <- mapToGenome(MsetEx)
grs <- ratioConvert(mset)
grs
```

After loading the data, which comes in the form of a raw `MethylSet` object, we
perform some further processing by mapping to the genome (with `mapToGenome`)
and converting the values from the methylated and unmethylated channels to
Beta-values (via `ratioConvert`). These two steps together produce an object of
class `GenomicRatioSet`, like what we had worked with previously.

For this example analysis, we'll treat the condition of the patients as the
exposure/treatment variable of interest. The `methyvim` function requires that
this variable either be `numeric` or easily coercible to `numeric`. To
facilitate this, we'll simply convert the covariate (currently a `character`):

```{r, minfidata-maketx}
var_int <- (as.numeric(as.factor(colData(grs)$status)) - 1)
```
__n.b.__, the re-coding process results in "normal" patients being assigned a
value of 1 and cancer patients a 0.

Now, we are ready to analyze the effects of cancer status on DNA methylation
using this data set. To do this with a targeted minimum loss-based estimate of
the Average Treatment Effect, we may proceed as follows:

```{r, minfidata-methyvim}
suppressMessages(
  methyvim_cancer_ate <- methyvim(data_grs = grs, var_int = var_int,
                                  vim = "ate", type = "Beta", filter = "limma",
                                  filter_cutoff = 0.20, obs_per_covar = 2,
                                  parallel = FALSE, sites_comp = 125,
                                  tmle_type = "glm"
                                 )
)
```
Note that we set the `obs_per_covar` argument to a relatively low value (2,
where the recommended default is 20) for the purposes of this example. We do
this only to exemplify the estimation procedure and would point out here that
such low values will compromise the quality of inference obtained as this
setting directly affects the definition of the target parameter.

Further, note that here we apply the `glm` flavor of the `tmle_type` argument,
which produces faster results by fitting models for the propensity score and
outcome regressions using a limited number of parametric models. By contrast,
the `sl` (for "Super Learning") flavor fits these two regressions using highly
nonparametric and data-adaptive procedures (i.e., via machine learning).

Just as before, we can view a table of results by examining the `vim` slot of
the produced `methytmle` object:

```{r vim-cancer-ate}
head(slot(methyvim_cancer_ate, "vim"))
```

Finally, we may compute FDR-corrected p-values, by applying a modified procedure
for controlling the False Discovery Rate for multi-stage analyses (FDR-MSA)
[@tuglus2009modified]. We do this by simply applying the `fdr_msa` function:

```{r fdr-msa}
fdr_p <- fdr_msa(pvals = slot(methyvim_cancer_ate, "vim")$pval,
                 total_obs = nrow(methyvim_cancer_ate))
```

Having explored the results of our analysis numerically, we now proceed to use
the visualization tools provided with the `methyvim` R package to further
enhance our understanding of the results.

### Visualization of Results

While making allowance for users to explore the full set of results produced by
the estimation procedure (by way of exposing these directly to the user), the
`methyvim` package also provides _three_ (3) visualization utilities that
produce plots commonly used in examining the results of differential methylation
analyses.

A simple call to `plot` produces side-by-side histograms of the raw p-values
computed as part of the estimation process and the corrected p-values obtained
from using the FDR-MSA procedure.

```{r methyvim-pvals-both}
plot(methyvim_cancer_ate)
```
__Remark:__ The plots displayed above may also be generated separately by
explicitly setting the argument "type" to `plot.methytmle`. For a plot of the
raw p-values, specify `type = "raw_pvals"`, and for a plot of the FDR-corrected
p-values, specify `type = "fdr_pvals"`.

While histograms of the p-values may be generally useful in inspecting the
results of the estimation procedure, a more common plot used in examining the
results of differential methylation procedures is the volcano plot, which plots
the parameter estimate along the x-axis and $-\text{log}_{10}(\text{p-value})$
along the y-axis. We implement such a plot in the `methyvolc` function:

```{r methyvim-volcano}
methyvolc(methyvim_cancer_ate)
```

The purpose of such a plot is to ensure that very low (possibly statistically
significant) p-values do not arise from cases of low variance. This appears to
be the case in the plot above (notice that most parameter estimates are _near
zero_, even in cases where the raw p-values are quite low).

Yet another popular plot for visualizing effects in such settings is the
heatmap, which plots estimates of the raw methylation effects (as measured by
the assay) across subjects using a heat gradient. We implement this in the
`methyheat` function:

```{r methyvim-heatmap}
methyheat(methyvim_cancer_ate)
```

Invoking `methyheat` in this manner produces a plot of the top sites ($25$, by
default) based on the raw p-value, using the raw methylation measures in the
plot. This uses the exceptional `superheat` R package [@barter2017superheat].

---

## Session Information

```{r session-info, echo=FALSE}
sessionInfo()
```

---

## References