Install the package SOMNiBUS from Bioconductor.
SOMNiBUS aims to analyze count-based methylation data on predefined genomic regions, such as those obtained by targeted sequencing, and thus to identify differentially methylated regions (DMRs) that are associated with phenotypes or traits surch as cell types.
Major advantages of SOMNiBUS
For a more comprehensive introduction of the SOMNiBUS approach, please read our SOMNiBUS paper (Kaiqiong Zhao 2020).
If you use this package, please cite our SOMNiBUS paper (Kaiqiong Zhao 2020).
Throughout this vignette, we illustrate the SOMNiBUS approach with analysis of a targeted region from a rheumatoid arthritis (RA) study. See help(RAdat) for further details. In this example, the phenotype of major interest is the RA status (coded as RA) and the adjusting variable is the cell type status (coded as T_cell) which is binary because the experiment used cell-type-separated blood samples, and methylation profiles were characterized for both T-cells and Monocytes. We will refer to both RA and T_cell as covariates.
We are going to use the package SOMNiBUS to investigate the methylation patterns in this region and study association with RA status and cell type.
Currently, we require a matrix-type input of the methylated reads (Meth_Counts) and the read depth (Total_Counts) for CpG sites of each sample. Inputs in another format, such as Bismark or a BSeq object from the bsseq package, will be incorporated in the future.
Before using the package, the input data matrix (or data frame) should be formatted such that:
Meth_Counts (methylated counts), Total_Counts (read depths), Position (Genomic position for the CpG site) and ID (sample ID)An example of the input data:
head(RAdat)
#>   Meth_Counts Total_Counts  Position ID T_cell RA
#> 1           0            8 102711629  1      0  1
#> 2           0            2 102711630  1      0  1
#> 3           0           12 102711702  1      0  1
#> 4           0            4 102711703  1      0  1
#> 5           0           15 102711747  1      0  1
#> 6           0            6 102711748  1      0  1To better use the information in the methylation dataset, on one hand, SOMNiBUS uses a smoothing technique (regression splines) to borrow information from the nearby CpG sites; on the other hand, our approach uses regression-based modelling to take advantage of information contained across samples. Therefore, this algorithm does not require filtering out the CpG sites that have methylation levels measured only in a small part of the samples, or the samples that have overall poor read-depths and many missing values. Our analysis of differentially methylated regions (DMRs) requires filtering only on the following two conditions:
T_cell or RA in the data set RAdat)The smooth covariate estimation and the region-wise test steps are wrapped into a function binomRegMethModel. See help(binomRegMethModel) for more details. We can use the following code to run the analysis with both covariates T_cell and RA.
Or, we can use the argument covs to specify that we only want the covariate T_cell in the model.
In the example data set, we have cell type separated samples. The error rates for individual samples can be estimated by a E-M algorithm (Lakhal-Chaieb et al. 2017) using the package SmoothMSC. The error rate default values,
\(p_0=0.003\) and \(p_1=0.9\), were estimated as the average incomplete (\(p_0\)) or over- conversion (\(1-p_1\)) of the metabisulfite. These two estimated values coincide roughly with the incomplete and over conversion rates related to bisulfite sequencing experiment reported in Prochenka et al. (2015). Both parameters, p0 and p1, correspond to the false positive rate and the true positive rate respectively, where 1-p1 being the false negative rate.
For experiments with samples from a tissue containing a mixture of cell types, the user could consider the following ways to specify the error rates p0 and 1-p1.
p0 and p1Argument n.k in the binomRegMethModel is the dimension of the basis expansion for smooth covariate effects.
The exact number n.k used for each functional parameter is not crucial, because it only sets an upper bound. We recommend choosing a basis dimension approximately equal to the number of unique CpGs in the region divided by 20.
Under the null hypothesis, we are expecting no effects of the covariates over the region-wide methylation status.
Figure 1: The estimates (solid red lines) and 95% pointwise confidence intervals (dashed red lines) of the intercept, the smooth effect of cell type and RA on methylation levels
We can also force the covariate effect plots to have the same vertical range, for all covariates, by specifying same.range= T.
Figure 2: The estimates (solid red lines) and 95% pointwise confidence intervals (dashed red lines) of the intercept, the smooth effect of cell type and RA on methylation levels
(Same ranges of Y axis.)
First, construct a new data set for prediction. Make sure that the Position in the new data set is the same as the original input data in binomRegMethModel.
pos <- out$uni.pos
my.p <- length(pos)
newdata <- expand.grid(pos, c(0, 1), c(0, 1))
colnames(newdata) <- c("Position", "T_cell", "RA")The predicted methylation levels can be calculated from function binomRegMethModelPred
plot(pos[order(pos)], (my.pred[(newdata$RA == 0 & newdata$T_cell == 0)])[order(pos)],
  type = "l", xlab = "Position",
  ylab = "Predicted methylation levels (in logit scale)", col = "blue", main = "Logit scale", ylim = c(min(my.pred), max(my.pred)), lwd = 2
)
lines(pos[order(pos)], (my.pred[(newdata$RA == 0 & newdata$T_cell == 1)])[order(pos)],
  type = "l", xlab = "Position",
  ylab = "predicted", col = "green", lwd = 2
)
lines(pos[order(pos)], (my.pred[(newdata$RA == 1 & newdata$T_cell == 0)])[order(pos)],
  type = "l", xlab = "Position",
  ylab = "predicted", col = "red", lwd = 2
)
lines(pos[order(pos)], (my.pred[(newdata$RA == 1 & newdata$T_cell == 1)])[order(pos)],
  type = "l", xlab = "Position",
  ylab = "predicted", col = "black", lwd = 2
)
legend("top", c("RA MONO", "RA TCELL", "CTRL MONO", "CTRL TCELL"),
  fill = c("red", "black", "blue", "green"),
  title = "Disease and Cell Type", bty = "n", cex = 0.8
)
Figure 3: The predicted methylation levels in the logit scale for the 4 groups of samples with different disease and cell type status
my.pred <- binomRegMethModelPred(out, newdata, type = "proportion")
plot(pos[order(pos)], (my.pred[(newdata$RA == 0 & newdata$T_cell == 0)])[order(pos)],
  type = "l", xlab = "Position",
  ylab = "Predicted methylation levels (in logit scale)", col = "blue", main = "Proportion scale", ylim = c(min(my.pred), max(my.pred)), lwd = 2
)
lines(pos[order(pos)], (my.pred[(newdata$RA == 0 & newdata$T_cell == 1)])[order(pos)],
  type = "l", xlab = "Position",
  ylab = "predicted", col = "green", lwd = 2
)
lines(pos[order(pos)], (my.pred[(newdata$RA == 1 & newdata$T_cell == 0)])[order(pos)],
  type = "l", xlab = "Position",
  ylab = "predicted", col = "red", lwd = 2
)
lines(pos[order(pos)], (my.pred[(newdata$RA == 1 & newdata$T_cell == 1)])[order(pos)],
  type = "l", xlab = "Position",
  ylab = "predicted", col = "black", lwd = 2
)
legend("top", c("RA MONO", "RA TCELL", "CTRL MONO", "CTRL TCELL"),
  fill = c("red", "black", "blue", "green"),
  title = "Disease and Cell Type", bty = "n", cex = 0.8
)
Figure 4: The predicted methylation levels proportion scale (right) for the 4 groups of samples with different disease and cell type status
Here is the output of sessionInfo() on the system on which this document was
compiled running pandoc 2.5:
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] SOMNiBUS_1.0.0   BiocStyle_2.20.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.6          knitr_1.33          magrittr_2.0.1     
#>  [4] splines_4.1.0       lattice_0.20-44     R6_2.5.0           
#>  [7] rlang_0.4.11        highr_0.9           stringr_1.4.0      
#> [10] tools_4.1.0         grid_4.1.0          nlme_3.1-152       
#> [13] mgcv_1.8-35         xfun_0.23           jquerylib_0.1.4    
#> [16] htmltools_0.5.1.1   yaml_2.2.1          digest_0.6.27      
#> [19] bookdown_0.22       Matrix_1.3-3        BiocManager_1.30.15
#> [22] sass_0.4.0          VGAM_1.1-5          evaluate_0.14      
#> [25] rmarkdown_2.8       stringi_1.6.2       compiler_4.1.0     
#> [28] bslib_0.2.5.1       magick_2.7.2        stats4_4.1.0       
#> [31] jsonlite_1.7.2Kaiqiong Zhao, Lajmi Lakhal‐Chaieb, Karim Oualkacha. 2020. “A Novel Statistical Method for Modeling Covariate Effects in Bisulfite Sequencing Derived Measures of Dna Methylation.” Biometrics preprint. https://doi.org/10.1111/biom.13307.
Lakhal-Chaieb, Lajmi, Celia MT Greenwood, Mohamed Ouhourane, Kaiqiong Zhao, Belkacem Abdous, and Karim Oualkacha. 2017. “A Smoothed Em-Algorithm for Dna Methylation Profiles from Sequencing-Based Methods in Cell Lines or for a Single Cell Type.” Statistical Applications in Genetics and Molecular Biology 16 (5-6): 333–47.
Prochenka, Agnieszka, Piotr Pokarowski, Piotr Gasperowicz, Joanna Kosińska, Piotr Stawiński, Renata Zbieć-Piekarska, Magdalena Spólnicka, Wojciech Branicki, and Rafał Płoski. 2015. “A Cautionary Note on Using Binary Calls for Analysis of Dna Methylation.” Bioinformatics 31 (9): 1519–20.