Navigating the complexity of DNA methylation data poses substantial challenges
due to its intricate biological patterns. The regionalpcs package is
conceived to address the substantial need for enhanced summarization and
interpretation at a regional level. Traditional methodologies, while
foundational, may not fully encapsulate the biological nuances of methylation
patterns, thereby potentially yielding interpretations that may be suboptimal
or veer towards inaccuracies. This package introduces and utilizes regional
principal components (rPCs), designed to adeptly capture biologically relevant
signals embedded within DNA methylation data. Through the implementation of
rPCs, researchers can gain new insights into complex interactions and
effects in methylation data that might otherwise be missed.
The regionalpcs package can be easily installed from Bioconductor, providing
you with the latest stable version suitable for general use. Alternatively,
for those interested in exploring or utilizing the latest features and
developments, the GitHub version can be installed directly.
Install regionalpcs from Bioconductor using the command below. Ensure
that your R version is compatible with the package version available on
Bioconductor for smooth installation and functionality.
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("regionalpcs")To access the development version of regionalpcs directly from GitHub,
which might include new features or enhancements not yet available in the
Bioconductor version, use the following commands. Note that the development
version might be less stable than the officially released version.
# install devtools package if not already installed
if (!requireNamespace("devtools", quietly=TRUE))
    install.packages("devtools")
# download and install the regionalpcs package
devtools::install_github("tyeulalio/regionalpcs")regionalpcs R Package TutorialThis tutorial depends on several Bioconductor packages. These packages should be loaded at the beginning of the analysis.
library(regionalpcs)
library(GenomicRanges)
library(tidyr)
library(tibble)
library(dplyr)
library(minfiData)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)Let’s proceed to load the packages, briefly understanding their roles in this tutorial.
regionalpcs: Primary package for summarizing and interpreting DNA
methylation data at a regional level.
GenomicRanges: Facilitates representation and manipulation of genomic
intervals and variables defined along a genome.
tidyr, tibble, dplyr: Assist in data tidying, representation,
and manipulation.
minfiData: Provides example Illumina 450k data, aiding in the
demonstration of regionalpcs functionalities.
TxDb.Hsapiens.UCSC.hg19.knownGene: Accommodates transcriptome data,
useful for annotating results.
Once the packages are loaded, we can start using the functions provided by each package.
In this tutorial, we’ll utilize a sample dataset, MsetEx.sub, which is a
subset derived from Illumina’s Human Methylation 450k dataset, specifically
preprocessed to contain 600 CpGs across 6 samples. The dataset is stored in a
MethylSet object, which is commonly used to represent methylation data.
The methylation beta values, denoting the proportion of methylated cytosines at a particular CpG site, will be extracted from this dataset for our subsequent analyses.
# Load the MethylSet data
data(MsetEx.sub)
# Display the first few rows of the dataset for a preliminary view
head(MsetEx.sub)
#> class: MethylSet 
#> dim: 6 6 
#> metadata(0):
#> assays(2): Meth Unmeth
#> rownames(6): cg00050873 cg00212031 ... cg00455876 cg01707559
#> rowData names(0):
#> colnames(6): 5723646052_R02C02 5723646052_R04C01 ... 5723646053_R05C02
#>   5723646053_R06C02
#> colData names(13): Sample_Name Sample_Well ... Basename filenames
#> Annotation
#>   array: IlluminaHumanMethylation450k
#>   annotation: ilmn12.hg19
#> Preprocessing
#>   Method: Raw (no normalization or bg correction)
#>   minfi version: 1.21.2
#>   Manifest version: 0.4.0
# Extract methylation M-values from the MethylSet
# M-values are logit-transformed beta-values and are often used in differential
# methylation analysis for improved statistical performance.
mvals <- getM(MsetEx.sub)
# Display the extracted methylation beta values
head(mvals)
#>            5723646052_R02C02 5723646052_R04C01 5723646052_R05C02
#> cg00050873          3.502348         0.4414491          4.340695
#> cg00212031         -3.273751         0.9234662         -2.614777
#> cg00213748          2.076816        -0.1309465          1.260995
#> cg00214611         -3.438838         1.7463950         -2.270551
#> cg00455876          1.839010        -0.9927320          1.619479
#> cg01707559         -3.303987        -0.6433201         -3.540887
#>            5723646053_R04C02 5723646053_R05C02 5723646053_R06C02
#> cg00050873        0.24458355        -0.3219281         0.2744392
#> cg00212031       -0.21052257        -0.6861413        -0.1397595
#> cg00213748       -1.10373279        -1.6616553        -0.1270869
#> cg00214611        0.29029649        -0.2103599        -0.6138630
#> cg00455876       -0.09504721        -0.2854655         0.6361273
#> cg01707559       -0.74835377        -0.4678048        -1.1345421Note that MsetEx.sub provides a manageable slice of data that we can
utilize to illustrate the capabilities of regionalpcs without requiring
extensive computational resources.
Now that we have our dataset loaded and methylation values extracted,
let’s proceed with demonstrating the core functionalities of regionalpcs.
In this step, we’ll obtain the genomic coordinates of the CpG sites in our
methylation dataset using the 450k array probe annotations using the minfi
package.
# Map the methylation data to genomic coordinates using the mapToGenome
# function. This creates a GenomicMethylSet (gset) which includes genomic 
# position information for each probe.
gset <- mapToGenome(MsetEx.sub)
# Display the GenomicMethylSet object to observe the structure and initial 
# entries.
gset
#> class: GenomicMethylSet 
#> dim: 600 6 
#> metadata(0):
#> assays(2): Meth Unmeth
#> rownames(600): cg01003813 cg03723195 ... cg03930849 cg08265308
#> rowData names(0):
#> colnames(6): 5723646052_R02C02 5723646052_R04C01 ... 5723646053_R05C02
#>   5723646053_R06C02
#> colData names(13): Sample_Name Sample_Well ... Basename filenames
#> Annotation
#>   array: IlluminaHumanMethylation450k
#>   annotation: ilmn12.hg19
#> Preprocessing
#>   Method: Raw (no normalization or bg correction)
#>   minfi version: 1.21.2
#>   Manifest version: 0.4.0
# Convert the genomic position data into a GRanges object, enabling genomic 
# range operations in subsequent analyses.
# The GRanges object (cpg_gr) provides a versatile structure for handling 
# genomic coordinates in R/Bioconductor.
cpg_gr <- granges(gset)
# Display the GRanges object for a preliminary view of the genomic coordinates.
cpg_gr
#> GRanges object with 600 ranges and 0 metadata columns:
#>              seqnames    ranges strand
#>                 <Rle> <IRanges>  <Rle>
#>   cg01003813     chrX   2715058      *
#>   cg03723195     chrX   2847354      *
#>   cg00074638     chrX   2883307      *
#>   cg01728536     chrX   7070451      *
#>   cg01864404     chrX   8434367      *
#>          ...      ...       ...    ...
#>   cg26983430     chrY  24549675      *
#>   cg01757887     chrY  25114810      *
#>   cg00061679     chrY  25314171      *
#>   cg03930849     chrY  26716289      *
#>   cg08265308     chrY  28555550      *
#>   -------
#>   seqinfo: 2 sequences from hg19 genome; no seqlengthsGene regions, which include functional segments such as promoters, gene bodies,
and intergenic regions, play pivotal roles in gene expression and regulation.
Summarizing methylation patterns across these regions can provide insights
into potential gene regulatory mechanisms and associations with phenotypes
or disease states. Herein, we will delve into how to succinctly summarize
methylation data at these crucial genomic segments using the regionalpcs
package.
First, let’s load the gene region annotations. Make sure to align the genomic builds of your annotations and methylation data.
# Obtain promoter regions
# The TxDb object 'txdb' facilitates the retrieval of transcript-based 
# genomic annotations.
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
# Extracting promoter regions with a defined upstream and downstream window. 
# This GRanges object 'promoters_gr' will be utilized to map and summarize 
# methylation data in promoter regions.
promoters_gr <- suppressMessages(promoters(genes(txdb), 
                                    upstream=1000, 
                                    downstream=0))
# Display the GRanges object containing the genomic coordinates of promoter 
# regions.
promoters_gr
#> GRanges object with 23056 ranges and 1 metadata column:
#>         seqnames              ranges strand |     gene_id
#>            <Rle>           <IRanges>  <Rle> | <character>
#>       1    chr19   58874215-58875214      - |           1
#>      10     chr8   18247755-18248754      + |          10
#>     100    chr20   43280377-43281376      - |         100
#>    1000    chr18   25757446-25758445      - |        1000
#>   10000     chr1 244006887-244007886      - |       10000
#>     ...      ...                 ...    ... .         ...
#>    9991     chr9 115095945-115096944      - |        9991
#>    9992    chr21   35735323-35736322      + |        9992
#>    9993    chr22   19109968-19110967      - |        9993
#>    9994     chr6   90538619-90539618      + |        9994
#>    9997    chr22   50964906-50965905      - |        9997
#>   -------
#>   seqinfo: 93 sequences (1 circular) from hg19 genomeCreating a region map, which systematically assigns CpGs to specific gene
regions, stands as a crucial precursor to gene-region summarization using the
regionalpcs package. This mapping elucidates the physical positioning of
CpGs within particular gene regions, facilitating our upcoming endeavors to
comprehend how methylation varies across distinct genomic segments. We’ll use
the create_region_map function from the regionalpcs package. This function
takes two genomic ranges objects, cpg_gr contains CpG positions and genes_gr
contains gene region positions. Make sure both positions are aligned to the
same genome build (e.g. GrCH37, CrCH38).
# get the region map using the regionalpcs function
region_map <- regionalpcs::create_region_map(cpg_gr=cpg_gr, 
                                                genes_gr=promoters_gr)
# Display the initial few rows of the region map.
head(region_map)
#>     gene_id     cpg_id
#> 1 100131434 cg00466309
#> 2 100133941 cg05230942
#> 3 100133957 cg00636562
#> 4     10084 cg01303569
#> 5      1069 cg01056373
#> 6     10857 cg01333849Note: The second column of region_map must contain values matching the
rownames of your methylation dataframe.
In this final section, we’ll summarize gene regions using Principal
Components (PCs) to capture the maximum variation. We’ll utilize the
compute_regional_pcs function from the regionalpcs package for this.
Let’s calculate the regional PCs using our gene regions for demonstration purposes.
# Display head of region map
head(region_map)
#>     gene_id     cpg_id
#> 1 100131434 cg00466309
#> 2 100133941 cg05230942
#> 3 100133957 cg00636562
#> 4     10084 cg01303569
#> 5      1069 cg01056373
#> 6     10857 cg01333849
dim(region_map)
#> [1] 211   2
# Compute regional PCs
res <- compute_regional_pcs(meth=mvals, region_map=region_map, pc_method="gd")
#> Using Gavish-Donoho methodThe function returns a list containing multiple elements. Let’s first look at what these elements are.
# Inspect the output list elements
names(res)
#> [1] "regional_pcs"     "percent_variance" "loadings"         "info"The first element (res$regional_pcs) is a data frame containing the
calculated regional PCs.
# Extract regional PCs
regional_pcs <- res$regional_pcs
head(regional_pcs)[1:4]
#>               5723646052_R02C02 5723646052_R04C01 5723646052_R05C02
#> 100131434-PC1         -3.887949        1.82168108        -4.2354365
#> 100133941-PC1          1.100370        0.06719947         0.1932361
#> 100133957-PC1         -2.032026        0.19154436        -0.9193417
#> 10084-PC1             -5.011579        2.34574585        -4.2050727
#> 1069-PC1              -2.004785        1.20363622        -2.5436686
#> 10857-PC1             -1.986294        1.31482789        -2.0044911
#>               5723646053_R04C02
#> 100131434-PC1         2.1707679
#> 100133941-PC1        -0.7438658
#> 100133957-PC1         1.1903709
#> 10084-PC1             2.2586300
#> 1069-PC1              1.5851359
#> 10857-PC1             2.1369556The output is a data frame with regional PCs for each region as rows and our samples as columns. This is our new representation of methylation values, now on a gene regional PC scale. We can feed these into downstream analyses as is.
The number of regional PCs representing each gene region was determined by the
Gavish-Donoho method. This method allows us to identify PCs that capture actual
signal in our data and not the noise that is inherent in any dataset. To explore
alternative methods, we can change the pc_method parameter.
# Count the number of unique gene regions and PCs
regions <- data.frame(gene_pc = rownames(regional_pcs)) |>
    separate(gene_pc, into = c("gene", "pc"), sep = "-")
head(regions)
#>        gene  pc
#> 1 100131434 PC1
#> 2 100133941 PC1
#> 3 100133957 PC1
#> 4     10084 PC1
#> 5      1069 PC1
#> 6     10857 PC1
# number of genes that have been summarized
length(unique(regions$gene))
#> [1] 142
# how many of each PC did we get
table(regions$pc)
#> 
#> PC1 
#> 142We have summarized each of our genes using just one PC. The number of PCs depends on three main factors: the number of samples, the number of CpGs in the gene region, and the noise in the methylation data.
By default, the compute_regional_pcs function uses the Gavish-Donoho method.
However, we can also use the Marcenko-Pasteur method by setting the pc_method
parameter:
# Using Marcenko-Pasteur method
mp_res <- compute_regional_pcs(mvals, region_map, pc_method = "mp")
#> Using Marchenko-Pastur method
# select the regional pcs
mp_regional_pcs <- mp_res$regional_pcs
# separate the genes from the pc numbers
mp_regions <- data.frame(gene_pc = rownames(mp_regional_pcs)) |>
    separate(gene_pc, into = c("gene", "pc"), sep = "-")
head(mp_regions)
#>        gene  pc
#> 1 100131434 PC1
#> 2 100133941 PC1
#> 3 100133957 PC1
#> 4     10084 PC1
#> 5      1069 PC1
#> 6     10857 PC1
# number of genes that have been summarized
length(unique(mp_regions$gene))
#> [1] 142
# how many of each PC did we get
table(mp_regions$pc)
#> 
#> PC1 
#> 142The Marcenko-Pasteur and the Gavish-Donoho methods are both based on Random Matrix Theory, and they aim to identify the number of significant PCs that capture the true signal in the data and not just the noise. However, these methods differ in how they select the number of significant PCs. The Marcenko-Pasteur method typically selects more PCs to represent a gene region compared to the Gavish-Donoho method. This may be due to the different ways in which the two methods estimate the noise level in the data.
Ultimately, the choice between the two methods depends on the specific needs and goals of the analysis. The Gavish-Donoho method tends to provide more conservative results, while the Marcenko-Pasteur method may capture more of the underlying signal in the data. Researchers should carefully consider their objectives and the characteristics of their data when selecting a method.
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2           
#>  [2] GenomicFeatures_1.58.0                            
#>  [3] AnnotationDbi_1.68.0                              
#>  [4] minfiData_0.51.0                                  
#>  [5] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
#>  [6] IlluminaHumanMethylation450kmanifest_0.4.0        
#>  [7] minfi_1.52.0                                      
#>  [8] bumphunter_1.48.0                                 
#>  [9] locfit_1.5-9.10                                   
#> [10] iterators_1.0.14                                  
#> [11] foreach_1.5.2                                     
#> [12] Biostrings_2.74.0                                 
#> [13] XVector_0.46.0                                    
#> [14] SummarizedExperiment_1.36.0                       
#> [15] Biobase_2.66.0                                    
#> [16] MatrixGenerics_1.18.0                             
#> [17] matrixStats_1.4.1                                 
#> [18] dplyr_1.1.4                                       
#> [19] tibble_3.2.1                                      
#> [20] tidyr_1.3.1                                       
#> [21] GenomicRanges_1.58.0                              
#> [22] GenomeInfoDb_1.42.0                               
#> [23] IRanges_2.40.0                                    
#> [24] S4Vectors_0.44.0                                  
#> [25] BiocGenerics_0.52.0                               
#> [26] regionalpcs_1.4.0                                 
#> [27] BiocStyle_2.34.0                                  
#> 
#> loaded via a namespace (and not attached):
#>   [1] splines_4.4.1             BiocIO_1.16.0            
#>   [3] bitops_1.0-9              preprocessCore_1.68.0    
#>   [5] XML_3.99-0.17             lifecycle_1.0.4          
#>   [7] lattice_0.22-6            MASS_7.3-61              
#>   [9] base64_2.0.2              scrime_1.3.5             
#>  [11] magrittr_2.0.3            limma_3.62.0             
#>  [13] sass_0.4.9                rmarkdown_2.28           
#>  [15] jquerylib_0.1.4           yaml_2.3.10              
#>  [17] doRNG_1.8.6               askpass_1.2.1            
#>  [19] cowplot_1.1.3             DBI_1.2.3                
#>  [21] RColorBrewer_1.1-3        abind_1.4-8              
#>  [23] zlibbioc_1.52.0           quadprog_1.5-8           
#>  [25] purrr_1.0.2               RCurl_1.98-1.16          
#>  [27] GenomeInfoDbData_1.2.13   ggrepel_0.9.6            
#>  [29] RMTstat_0.3.1             irlba_2.3.5.1            
#>  [31] rentrez_1.2.3             genefilter_1.88.0        
#>  [33] annotate_1.84.0           dqrng_0.4.1              
#>  [35] DelayedMatrixStats_1.28.0 codetools_0.2-20         
#>  [37] DelayedArray_0.32.0       xml2_1.3.6               
#>  [39] tidyselect_1.2.1          UCSC.utils_1.2.0         
#>  [41] ScaledMatrix_1.14.0       beanplot_1.3.1           
#>  [43] illuminaio_0.48.0         GenomicAlignments_1.42.0 
#>  [45] jsonlite_1.8.9            multtest_2.62.0          
#>  [47] survival_3.7-0            tools_4.4.1              
#>  [49] Rcpp_1.0.13               glue_1.8.0               
#>  [51] SparseArray_1.6.0         xfun_0.48                
#>  [53] HDF5Array_1.34.0          withr_3.0.2              
#>  [55] BiocManager_1.30.25       fastmap_1.2.0            
#>  [57] rhdf5filters_1.18.0       fansi_1.0.6              
#>  [59] openssl_2.2.2             digest_0.6.37            
#>  [61] rsvd_1.0.5                R6_2.5.1                 
#>  [63] colorspace_2.1-1          RSQLite_2.3.7            
#>  [65] utf8_1.2.4                generics_0.1.3           
#>  [67] data.table_1.16.2         rtracklayer_1.66.0       
#>  [69] httr_1.4.7                S4Arrays_1.6.0           
#>  [71] pkgconfig_2.0.3           gtable_0.3.6             
#>  [73] blob_1.2.4                siggenes_1.80.0          
#>  [75] htmltools_0.5.8.1         bookdown_0.41            
#>  [77] scales_1.3.0              png_0.1-8                
#>  [79] knitr_1.48                tzdb_0.4.0               
#>  [81] reshape2_1.4.4            rjson_0.2.23             
#>  [83] nlme_3.1-166              curl_5.2.3               
#>  [85] cachem_1.1.0              rhdf5_2.50.0             
#>  [87] stringr_1.5.1             restfulr_0.0.15          
#>  [89] GEOquery_2.74.0           pillar_1.9.0             
#>  [91] grid_4.4.1                reshape_0.8.9            
#>  [93] vctrs_0.6.5               BiocSingular_1.22.0      
#>  [95] beachmat_2.22.0           xtable_1.8-4             
#>  [97] evaluate_1.0.1            readr_2.1.5              
#>  [99] cli_3.6.3                 compiler_4.4.1           
#> [101] Rsamtools_2.22.0          rlang_1.1.4              
#> [103] crayon_1.5.3              rngtools_1.5.2           
#> [105] nor1mix_1.3-3             mclust_6.1.1             
#> [107] plyr_1.8.9                stringi_1.8.4            
#> [109] BiocParallel_1.40.0       munsell_0.5.1            
#> [111] PCAtools_2.18.0           Matrix_1.7-1             
#> [113] hms_1.1.3                 sparseMatrixStats_1.18.0 
#> [115] bit64_4.5.2               ggplot2_3.5.1            
#> [117] Rhdf5lib_1.28.0           KEGGREST_1.46.0          
#> [119] statmod_1.5.0             memoise_2.0.1            
#> [121] bslib_0.8.0               bit_4.5.0