---
title: "Analyzing Targeted Bisulfite Sequencing data with SOMNiBUS"
author:
- name: Kaiqiong Zhao
  affiliation: Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, Canada
date: "`r BiocStyle::doc_date()`"
output:
   BiocStyle::html_document:
    highlight: pygments
    toc_float: true
    fig_width: 10
    keep_md: true
    fig_caption: yes
bibliography: bibliography.bib
vignette: >
  %\VignetteIndexEntry{Analyzing Targeted Bisulfite Sequencing data with SOMNiBUS}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  fig.width = 10,
  comment = "#>"
)
```


# Installation

Install the package SOMNiBUS from Bioconductor.

```{r installation, echo=TRUE, eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")

BiocManager::install("SOMNiBUS")
```

```{r, eval=FALSE, echo=FALSE}
ROOT_PACKAGE_PATH <- paste(getwd(), "/", sep = "")
devtools::document(ROOT_PACKAGE_PATH)
devtools::load_all(ROOT_PACKAGE_PATH)
```

# Introduction

**SOMNiBUS** aims to analyze count-based methylation data on predefined genomic regions, such as those obtained by targeted sequencing, and thus to identify differentially methylated regions (DMRs) that are associated with phenotypes or traits surch as cell types.

Major advantages of **SOMNiBUS**

- enable complex associations with multiple phenotypes / traits using a Generalized Additive Model approach
- the modeling strategy incorporates count-based methylation error rate arguments (i.e., p0 or false positive rate and p1 or true positive rate)

For a more comprehensive introduction of the **SOMNiBUS** approach, please read our SOMNiBUS paper [@Zhao2020].

## Citation

If you use this package, please cite our SOMNiBUS paper [@Zhao2020].

# Application
## Rheumatoid arthritis study

Throughout this vignette, we illustrate the **SOMNiBUS** approach with analysis of a targeted region from a rheumatoid arthritis (RA) study. See `help(RAdat)` for further details. In this example, the phenotype of major interest is the RA status (coded as `RA`) and the adjusting variable is the cell type status (coded as `T_cell`) which is binary because the experiment used cell-type-separated blood samples, and methylation profiles were characterized for both T-cells and Monocytes. We will refer to both `RA` and `T_cell` as *covariates*.
We are going to use the package `SOMNiBUS` to investigate the methylation patterns in this region and study association with RA status and cell type.
```{r load package, echo=TRUE, message=FALSE, warning=FALSE}
library(SOMNiBUS)
```

# Input data

Currently, we require a matrix-type input of the methylated reads (`Meth_Counts`) and the read depth (`Total_Counts`) for CpG sites of each sample. Inputs in another format, such as *Bismark* or a `BSeq` object from the *bsseq* package are now supported using the formatting functions `formatFromBSseq()`, `formatFromBismark()` and `formatFromBSmooth()` (more details below).

Before using the package, the input data matrix (or data frame) should be formatted  such that:

1. each row represents a CpG site
2. the first 4 columns should contain the information of `Meth_Counts` (methylated counts), `Total_Counts` (read depths), `Position` (Genomic position for the CpG site) and `ID` (sample ID)
3. the covariate(s), such as disease status or cell type composition are listed in column 5 and onwards.

An example of the input data:

```{r}
data("RAdat")
head(RAdat)
```
We implemented 3 functions dedicated to the conversion of outputs generated by standard whole-genome shotgun bisulfite sequencing (WGBS) tools such as *BSseq* R package (Hansen, Langmead, and Irizarry 2012), *Bismark* (Krueger, and Andrews 2011) and *BSmooth* (Hansen, Langmead, and Irizarry 2012) alignment suites.

## **formatFromBSseq**

```{r, eval = FALSE}
formatFromBSseq <- function(bsseq_dat, verbose = TRUE)
```

The function `formatFromBSseq` reads and converts a `BSseq` object (`bsseq_dat`) into a list of `data.frame`s (one per chromosome) to a format compatible with `runSOMNiBUS` and `binomRegMethModel`. Each `data.frame` contains rows as individual CpGs appearing in all samples. The first 4 columns contain the information of `Meth_Counts` (methylated counts), `Total_Counts` (read depths), `Position` (Genomic position for the CpG site) and `ID` (sample ID). 

The additional information (such as disease status, sex, age) extracted from the BSseq object are listed in column 5 and onwards and will be considered as covariate information by *SOMNiBUS* algorithms. 

## **formatFromBSseq** and **formatFromBSmooth**

The functions `formatFromBismark` and `formatFromBSmooth` utilize pre-existing methods implemented in the *bsseq* R package, `read.bismark` and `read.bsmooth` to convert, respectively, *Bismark* and *BSmooth* outputs into `BSseq` objects.

Once this conversion is applied, we call `formatFromBSseq` to generate the final output (described above).

```{r, eval = FALSE}
formatFromBismark <- function(..., verbose = TRUE)

formatFromBSmooth <- function(..., verbose = TRUE)
```

`...` refers to the parameters from `bsseq::read.bismark()` or `bsseq::read.bsmooth()` functions. Use `?bsseq::read.bismark()` or `?bsseq::read.bsmooth()` for more information.

## Filtering CpGs and samples

To better use the information in the methylation dataset, on one hand, **SOMNiBUS** uses a smoothing technique (regression splines) to borrow information from the nearby CpG sites; on the other hand, our approach uses regression-based modelling to take advantage of information contained across samples. Therefore, this algorithm does not require filtering out the CpG sites that have methylation levels measured only in a small part of the samples, or the samples that have overall poor read-depths and many missing values. Our analysis of differentially methylated regions (DMRs) requires filtering only on the following two conditions:

- individual CpGs that have zero reads in a particular sample (no observation available)
- samples that have missing values in any of the covariates of interest (i.e missing values for `T_cell` or `RA` in the data set `RAdat`)

```{r}
RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ])
```

## Adjusting for covariates and adding interactions

- we currently only accept numeric input for the covariates used to fit the model. we recommend that first you transform your categorical variables into appropriate dummy variables
- interaction terms can be added in the analysis model. To do that, the program requires that users add another column of covariate values into the input data set calculated as the product of two existing covariates whose interaction is of interest.

# Analysis

The smooth covariate estimation and the region-wise test steps are wrapped into a function `binomRegMethModel`. See `help(binomRegMethModel)` for more details. We can use the following code to run the analysis with both covariates `T_cell` and `RA`.

If there is a single region to analyze, we can directly call the function binomRegMethModel. We can use the following code to run the analysis with both covariates T_cell and RA.

```{r}
out <- binomRegMethModel(data = RAdat.f, n.k = rep(5, 3), p0 = 0.003, 
                         p1 = 0.9, Quasi = FALSE, RanEff = FALSE,
                         verbose = FALSE)
```

Or, we can use the argument `covs` to specify that we only want the covariate `T_cell` in the model.

```{r, eval=FALSE}
out.ctype <- binomRegMethModel(data = RAdat.f, n.k = rep(5, 2), p0 = 0.003,
                               p1 = 0.9, covs = "T_cell", verbose = FALSE)
```

If the analysis encompasses multiple regions, we use a wrapper function `runSOMNiBUS` (See `help(runSOMNiBUS)` for more details) which encapsulates the function `binomRegMethModel`. This function splits the methylation data into regions (according to different approaches) and, for each region, calls the function `binomRegMethModel` to fit a (dispersion-adjusted) binomial regression model to regional methylation data. It returns a list (one element by independent region) of results generated by the function `binomRegMethModel`. Each result reports the estimated smooth covariate effects and regional p-values for the test of DMRs. Over or under dispersion across loci is accounted for in the model by the combination of a multiplicative dispersion parameter (or scale parameter) and a sample-specific random effect.

The four main approaches are:

- **region** which splits methylation data into regions based on the spacing of CpGs. See `help(splitDataByRegion)` for more details.
- **density** which splits methylation data into regions based on the density of CpGs. See `help(splitDataByDensity)` for more details.
- **gene** which splits methylation data into regions based on the genomic annotations. See `help(splitDataByGene)` for more details.
- **chromatin** which splits methylation data into regions based on the chromatin states. See `help(splitDataByChromatin)` for more details.

Two generic approaches have also been implemented to enable users to use their own annotations for partitioning purposes:

- **bed** which splits methylation data into regions based on the genomic annotations provided under the form of a 1-based BED file (the bases are numerated starting at 1). See `help(splitDataByBed)` for more details.
- **granges** which splits methylation data into regions based on the genomic annotations provided under the form of a `GenomicRanges` object. See `help(splitDataByGRanges)` for more details.

Specifically, the **granges** approach is used internally to align and partition annotation data coming from **bed**, **gene** and **chromatin** approaches. 

Each partitioning function requires a data frame (`dat`) with rows as individual CpGs appearing in all the samples. The first 4 columns contain the information of `Meth_Counts` (methylated counts), `Total_Counts` (read depths), `Position` (Genomic position for the CpG site) and `ID` (sample ID). The covariate information, such as disease status or cell type composition, are listed in column 5 and onwards. These partitioning functions return a named list of data.frame containing the data of each independent region. By default, the partitioning approach (`region`) splits the methylation data into regions based on the spacing between CpGs.

The following command line enables to split my the input data based on the spacing between CpG (CpG islands) and analyze each region:

```{r, cache = TRUE}
outs <- runSOMNiBUS(dat = RAdat.f,
                    split = list(approach = "region", gap = 100),
                    n.k = rep(10,3), p0 = 0.003, p1 = 0.9, 
                    min.cpgs = 10, max.cpgs = 2000, verbose = TRUE)
```

## Error rates p0 and p1

In the example data set, we have cell type separated samples. The error rates for individual samples can be estimated by a E-M algorithm [@lakhal2017smoothed] using the package `SmoothMSC`. The error rate default values,
$p_0=0.003$ and $p_1=0.9$, were estimated as the average incomplete ($p_0$) or over- conversion ($1-p_1$) of the metabisulfite. These two estimated values coincide roughly with the incomplete and over conversion rates related to bisulfite sequencing experiment reported in @prochenka2015cautionary. Both parameters, p0 and p1, correspond to the false positive rate and the true positive rate respectively, where 1-p1 being the false negative rate.

For experiments with samples from a tissue containing a mixture of cell types, the user could consider the following ways to specify the error rates p0 and 1-p1.

- If users have conducted experiments for measuring error/conversion rates, such as adding spike-in sequences of DNA that are known in advance to be methylated or unmethylated into the bisulfite sequencing procedure, they can use the measured error rates for the input of `p0` and `p1`
- One can also use the error rates (incomplete and over conversion rates) that have been previous reported in the literature.
- Another option is to use our default values.

##  Basis dimensions n.k:

Argument `n.k` in the `binomRegMethModel` is the dimension of the basis expansion for smooth covariate effects.
The exact number `n.k` used for each functional parameter is not crucial, because it only sets an upper bound. We recommend choosing a basis dimension approximately equal to the number of unique CpGs in the region divided by 20. Please notice that, this parameter is computed automatically (overwriting the value provided by the user if any), when several regions are generated by the partitioning function within the wrapper function `runSOMNiBUS`.

```{r}
as.integer(length(unique(RAdat.f$Position)) / 20)
```

# Results

## testing the null hypothesis

Under the null hypothesis, we are expecting no effects of the covariates over the region-wide methylation status.


```{r}
out$reg.out
```

## Estimation of the smooth covariate effects

```{r, fig.cap="The estimates (solid lines) and 95% pointwise confidence intervals (dashed lines) of the intercept, the smooth effect of cell type and RA on methylation levels."}
binomRegMethModelPlot(BEM.obj = out)
```


We can also force the covariate effect plots to have the same vertical range, for all covariates, by specifying `same.range = TRUE`.

```{r, fig.cap="The estimates (solid lines) and 95% pointwise confidence intervals (dashed lines) of the intercept, the smooth effect of cell type and RA on methylation levels. (Same ranges of Y axis.)"}
binomRegMethModelPlot(out, same.range = TRUE)
```

The user can select a subset of covariates of interest by indicating the name of those covariates with the `covs` arguments.

```{r, fig.cap="The estimates (solid lines) and 95% pointwise confidence intervals (dashed lines) of the smooth effect of cell type and RA on methylation levels.", echo = TRUE, eval = TRUE}
# creating plot 
binomRegMethModelPlot(BEM.obj = out, same.range = FALSE, verbose = FALSE,
                      covs = c("RA", "T_cell"))
```

The `mfrow` parameter allows you to create a matrix of plots in one plotting space. It takes a vector of length two as an argument, corresponding to the number of rows and columns in the resulting plotting matrix.

```{r, fig.cap="The estimates (solid lines) and 95% pointwise confidence intervals (dashed lines) of the intercept, the smooth effect of cell type and RA on methylation levels.", echo = TRUE, fig.height = 10, eval = TRUE}
# creating a 2x2 matrix 
binomRegMethModelPlot(BEM.obj = out, same.range = FALSE, verbose = FALSE,
                      mfrow = c(2,2))

# creating a 3x1 matrix 
binomRegMethModelPlot(BEM.obj = out, same.range = FALSE, verbose = FALSE,
                      mfrow = c(3,1))
```


## Predicted methylation levels

First, construct a new data set for prediction. Make sure that the Position in the new data set is the same as the original input `data` in `runSOMNiBUS` (or `binomRegMethModel`).

```{r, eval = TRUE}
# simulating new data
pos <- out$uni.pos
my.p <- length(pos)
newdata <- expand.grid(pos, c(0, 1), c(0, 1))
colnames(newdata) <- c("Position", "T_cell", "RA")
```

The predicted methylation levels can be calculated from function `binomRegMethModelPred` using both prediction types (*link.scale* and *proportion*). See `help(binomRegMethModelPred)` for more details.

```{r, eval = TRUE}
# prediction of methylation levels for the new data (logit scale)
my.pred.log <- binomRegMethModelPred(out, newdata, type = "link.scale", 
                                 verbose = FALSE)

# prediction of methylation levels for the new data (proportion)
my.pred.prop <- binomRegMethModelPred(out, newdata, type = "proportion", 
                                  verbose = FALSE)
```

We can visualize the prediction results using the function `binomRegMethPredPlot` (see `help(binomRegMethPredPlot)` for more details). We defined some experimental design in order to identify the different expression patterns based on the different disease and cell type status. In the following example, we created 4 groups of samples:

- controls (`RA == 0`) t-cells (`T_cell == 1`),

- controls (`RA == 0`) monocytes (`T_cell == 0`),

- cases (`RA == 1`) t-cells (`T_cell == 1`),

- cases (`RA == 1`) monocytes (`T_cell == 0`).

And add the design to the input data and prediction results using the following code:

```{r, echo = TRUE, eval = TRUE}
# creating the experimental design
newdata$group <- ""
newdata[(newdata$RA == 0 & newdata$T_cell == 0),]$group <- "CTRL MONO"
newdata[(newdata$RA == 0 & newdata$T_cell == 1),]$group <- "CTRL TCELL"
newdata[(newdata$RA == 1 & newdata$T_cell == 0),]$group <- "RA MONO"
newdata[(newdata$RA == 1 & newdata$T_cell == 1),]$group <- "RA TCELL"

# merge input data and prediction results
pred <- cbind(newdata, Logit = my.pred.log, Prop = my.pred.prop)
head(pred)
```

Once the data are ready, we create a `style` describing the way each group should be displayed in the plot. If `style = NULL`, the default color and line type are picked.

```{r, echo = TRUE, eval = TRUE}
# creating the custom style for each experimental group
style <- list(
  "CTRL MONO" = list(color = "blue", type = "solid"),
  "CTRL TCELL" = list(color = "green", type = "solid"),
  "RA MONO" = list(color = "red", type = "solid"),
  "RA TCELL" = list(color = "black", type = "solid")
) 

```


The following code enables to visualize the two types of prediction results.
```{r, fig.cap="The predicted methylation levels in the logit scale for the 4 groups of samples with different disease and cell type status.", echo = TRUE, eval = TRUE}
# creating plot (logit scale)
binomRegMethPredPlot(pred = pred, pred.type = "link.scale", 
                     pred.col = "Logit", group.col = "group", 
                     title = "Logit scale",
                     style = style, verbose = TRUE)
```

```{r, fig.cap="The predicted methylation levels in the proportion scale for the 4 groups of samples with different disease and cell type status.", echo = TRUE, eval = TRUE}
# creating plot (proportion)
binomRegMethPredPlot(pred = pred, pred.type = "proportion",
                     pred.col = "Prop", group.col = "group",
                     title = "Proportion scale",
                     style = style, verbose = FALSE)

```

By default, no experimental design is required (`group.col = NULL`). In this case, the prediction results are displayed as a scatter plot.

```{r, fig.cap="The predicted methylation levels in the logit scale without any experimental design.", eval = TRUE, echo = TRUE}
binomRegMethPredPlot(pred = pred, pred.type = "link.scale", pred.col = "Logit", 
                     group.col = NULL, title = "Logit scale", verbose = FALSE)
```

By putting the group value to `NA` or `""` (empty character), we can specifically exclude some experimental groups from the plot. In the following example, we excluded monocytes.

```{r, fig.cap="The predicted methylation levels in the logit scale for the T-cell samples.", echo = TRUE, eval = TRUE}
# exclusion of the MONO cells (not T_cell)
pred[(pred$RA == 0 & pred$T_cell == 0),]$group <- NA
pred[(pred$RA == 1 & pred$T_cell == 0),]$group <- ""

# creating plot (logit scale)
binomRegMethPredPlot(pred = pred, pred.type = "link.scale", 
                     pred.col = "Logit", group.col = "group", 
                     title = "Logit scale",
                     style = style, verbose = FALSE)

```


# Session info {.unnumbered}

Here is the output of `sessionInfo()` on the system on which this document was
compiled running pandoc `r rmarkdown::pandoc_version()`:

```{r sessionInfo, echo=FALSE}
sessionInfo()
```

# References {.unnumbered}