--- title: "recountmethylation User's Guide" author: - Sean K. Maden - Reid F. Thompson - Kasper D. Hansen - Abhinav Nellore date: "`r format(Sys.time(), '%d %B, %Y')`" bibliography: bibliography.bib csl: cell-numeric.csl package: recountmethylation output: BiocStyle::html_document: code_folding: show toc: yes tocfloat: yes BiocStyle::pdf_document: toc: yes toc_depth: 2 vignette: > %\VignetteIndexEntry{recountmethylation User's Guide} %\VignetteDepends{RCurl} %\usepackage[UTF-8]{inputenc} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- ```{r setup, echo = FALSE, warning = FALSE} suppressMessages(library(knitr)) suppressMessages(library(GenomicRanges)) suppressMessages(library(limma)) suppressMessages(library(minfi)) suppressMessages(library(ExperimentHub)) suppressMessages(library(recountmethylation)) knitr::opts_chunk$set(echo = TRUE, eval = TRUE, warning = FALSE, message = FALSE) ``` # Introduction and overview `recountmethylation` is an R/Bioconductor package providing resources to access and analyze compilations of public DNA methylation (DNAm) array data from the Gene Expression Omnibus (GEO). The database compilation files span two array platforms and include mined, mapped, and model-based sample metadata. The DNAm signals can be accessed in a variety of formats and data storage types. This User's Guide shows how to use the `recountmethylation` package, including crucial background about the platforms and datatypes, and runnable examples using 2 small example files. Additional info and more advanced analysis examples are contained in other package vignettes. ## Compilation releases The `recountmethylation` resource now includes three compilation versions, detailed in the table below. The initial versions only included samples run using the HM450K platform, while newer versions also included samples run using the EPIC platform. These compilations currently include 93,306 samples run on the HM450K platform, 38,122 samples run on the EPIC platform, and 131,428 total samples. ```{r} dft <- data.frame(release = c("first", "second", "third", "total"), version.label = c("0.0.1", "0.0.2", "0.0.3", "all"), date = c("11/20/2020", "01/06/2021", "12/21/2022", "12/21/2022"), hm450k.samples = c(35360, 50400, 7546, sum(c(35360, 50400, 7546))), epic.samples = c(0, 12650, 25472, sum(c(0, 12650, 25472)))) dft$combined.samples <- dft$hm450k.samples + dft$epic.samples knitr::kable(dft, align = "c") ``` ## Database files and access Database compilation file download and access is managed by the `get_db` functions, where the DNAm array platform type using the `platform` argument (see `?get_db` for details). Both HM450K and EPIC/HM850K platforms are currently supported (see below for platform details). Note you will need between 50-180 Gb of disk space to store a single database file. Files pair sample metadata and assay data in various formats, including `HDF5-SummarizedExperiment` database directories, and `HDF5` database files with the `.h5` extension. The databases are located at [https://methylation.recount.bio/](https://methylation.recount.bio/), and file details are viewable as follows: ```{r, echo = TRUE, message = TRUE} sm <- as.data.frame(smfilt(get_servermatrix())) if(is(sm, "data.frame")){knitr::kable(sm, align = "c")} ``` ## ExperimentHub integration The DNAm array database files are indexed on `ExperimentHub`, and are viewable as follows. 
## ExperimentHub integration

The DNAm array database files are indexed on `ExperimentHub` and are viewable as follows. Note that the cache needs to be set with `R_user_dir()`, per the instructions [here](https://bioconductor.org/packages/devel/bioc/vignettes/ExperimentHub/inst/doc/ExperimentHub.html#default-caching-location-update).

```{r, eval = FALSE}
cache.path <- tools::R_user_dir("recountmethylation")
setExperimentHubOption("CACHE", cache.path)
hub <- ExperimentHub::ExperimentHub() # connect to the hubs
rmdat <- AnnotationHub::query(hub, "recountmethylation") # query the hubs
```

In addition to using the `getdb` functions, the `HDF5` (`.h5` extension) files may be downloaded from the hubs.

```{r, eval = FALSE}
fpath <- rmdat[["EH3778"]] # download with default caching
rhdf5::h5ls(fpath) # list contents of the h5 file
```

Note that whether downloads use the hubs or the `getdb` functions, caching is implemented to check for previously downloaded database files.

## Disclaimer

Please note the following disclaimer, which also shows when `recountmethylation` is loaded:

```
Databases accessed with `recountmethylation` contain data from GEO
(ncbi.nlm.nih.gov/geo/), a live public database where alterations to
online records can cause discrepancies with stored data over time.
We cannot guarantee the accuracy of stored data, and advise users
cross-check their findings with latest available records.
```

# Background

This section includes essential background about the DNAm array platforms, assays and file types, and sample metadata.

## DNAm arrays

Databases include human samples run on the Illumina Infinium HM450K BeadArray platform. HM450K is a popular 2-channel platform that probes over 480,000 CpG loci genome-wide, with enriched coverage at CG islands, genes, and enhancers [@sandoval_validation_2011]. The more recently released EPIC/HM850K platform contains an expanded probe set targeting over 850,000 CpGs, including more than 90% of the HM450K probes, with greater coverage of potential intergenic regulatory regions [@pidsley_critical_2016].

Array processing generates 2 intensity files (IDATs) per sample, one each for the red and green color channels. These raw files also contain control signals useful for quality evaluations [@noauthor_illumina_2010]. The BeadArray probes use either of 2 bead technologies, known as Type I and Type II, where the majority (72%) of probes use the latter. For Type II probes, a single bead assay informs a single probe, while Type I probes use 2 beads each. Practically, this means the bead-specific matrices found in `RGChannelSet` objects are larger than the probe-specific matrices found in derived object types (e.g. for HM450K samples, 622,399 assays for the red/green signal matrices versus 485,512 assays for the methylated/unmethylated signal and DNAm fraction matrices; see below).

## `SummarizedExperiment` object classes

DNAm array sample IDATs can be read into an R session as an object of class `RGChannelSet`, a type of `SummarizedExperiment`. These objects support analyses of high-throughput genomics datasets, and they include slots for assay matrices, sample metadata, and experiment metadata. During a typical workflow, normalization and preprocessing convert `RGChannelSet` objects into new types like `MethylSet` and `RatioSet`. While not all IDAT information is accessible from every object type (e.g. only `RGChannelSet`s can contain control assays), derived objects like `MethylSet`s and `RatioSet`s may be smaller and/or faster to access.
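As a brief illustration of these conversions, the following sketch uses the small example `RGChannelSet` from the `minfiData` package (an assumption of this example; it is not evaluated here) rather than a `recountmethylation` database.

```{r, eval = FALSE}
# Sketch of the object conversions described above, assuming the minfiData
# example package is installed.
library(minfi)
data("RGsetEx", package = "minfiData") # example RGChannelSet (red/green signals)
mset <- preprocessRaw(RGsetEx)         # RGChannelSet -> MethylSet (meth/unmeth signals)
rset <- ratioConvert(mset)             # MethylSet -> RatioSet (Beta-values)
grset <- mapToGenome(rset)             # RatioSet -> GenomicRatioSet (adds coordinates)
class(grset)                           # "GenomicRatioSet"
```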
Three `SummarizedExperiment` databases are provided as `HDF5-SummarizedExperiment` files: an unnormalized `RGChannelSet` (red/green signals), an unnormalized `MethylSet` (methylated/unmethylated signals), and a normalized `GenomicRatioSet` (DNAm fractions). For the latter, DNAm fractions (logit2 Beta-values, or M-values) were normalized using the out-of-band signal or "noob" method, an effective within-sample normalization that removes signal artifacts [@triche_low-level_2013].

## Database file types

Database files are stored as either `HDF5` or `HDF5-SummarizedExperiment`. For most R users, the latter files will be the most convenient to work with. `HDF5`, or hierarchical data format 5, combines compression and chunking for convenient handling of large datasets. `HDF5-SummarizedExperiment` files combine the benefits of `HDF5` and `SummarizedExperiment` entities using a `DelayedArray`-powered backend. Once an `HDF5-SummarizedExperiment` file is loaded, it can be treated much like a `SummarizedExperiment` object in active memory. That is, summary and subset operations execute rapidly, and realization of large data chunks in active memory is delayed until called for by the script (see examples).

## Sample metadata

Sample metadata are included with the DNAm assays in the database files. Currently, metadata variables include GEO record IDs for samples (GSM) and studies (GSE), sample record titles, learned labels for tissue and disease, sample type predictions from the MetaSRA-pipeline, and DNAm model-based predictions for age, sex, and blood cell types. Access sample metadata from `SummarizedExperiment` objects using the minfi `pData` function (see examples). Examples in the `data_analyses` vignette illustrate some ways to utilize the provided sample metadata.

The provided metadata derives from the GSE-specific SOFT files, which contain experiment, sample, and platform metadata. Considerable effort was made to learn, harmonize, and predict metadata labels. Certain types of information absent from the `recountmethylation` metadata may still be available in the SOFT files, especially information that is not sample-specific (e.g. methods text, PubMed ID) or that is redundant with DNAm-derived metrics (e.g. DNAm summaries, predicted sex). It is good practice to validate the harmonized metadata against the original metadata records, especially where labels are ambiguous or there is insufficient information for a given query. GEO GSM and GSE records can be viewed from a browser, or SOFT files may be downloaded directly. Packages like GEOmetadb and GEOquery are also useful to query and summarize GEO metadata.

# `HDF5-SummarizedExperiment` example

This example shows basic handling for `HDF5-SummarizedExperiment` (a.k.a. "h5se") files. For these files, the `getdb` function returns the loaded file. Thanks to the `DelayedArray` backend, even full-sized `h5se` databases can be treated as if they were fully loaded into active memory.

## Obtain the test database

The test `h5se` dataset includes sample metadata and noob-normalized DNAm fractions (Beta-values) for chromosome 22 probes from 2 samples. Datasets can be downloaded using the `getdb` series of functions (see `?getdb` for details), where the `dfp` argument specifies the download destination. The test `h5se` file is included in the package "inst" directory, and can be loaded as follows.
```{r} dn <- "remethdb-h5se_gr-test_0-0-1_1590090412" path <- system.file("extdata", dn, package = "recountmethylation") h5se.test <- HDF5Array::loadHDF5SummarizedExperiment(path) ``` ## Inspect and summarize the database Common characterization functions can be used on the dataset after it has been loaded. These include functions for `SummarizedExperiment`-like objects, such as the `getBeta`, `pData`, and `getAnnotation` minfi functions. First, inspect the dataset using standard functions like `class`, `dim`, and `summary` as follows. ```{r} class(h5se.test) # inspect object class ``` ```{r} dim(h5se.test) # get object dimensions ``` ```{r} summary(h5se.test) # summarize dataset components ``` Access the sample metadata for the 2 available samples using `pData`. ```{r} h5se.md <- minfi::pData(h5se.test) # get sample metadata dim(h5se.md) # get metadata dimensions ``` ```{r} colnames(h5se.md) # get metadata column names ``` Next get CpG probe-specific DNAm fractions, or "Beta-values", with `getBeta` (rows are probes, columns are samples). ```{r} h5se.bm <- minfi::getBeta(h5se.test) # get dnam fractions dim(h5se.bm) # get dnam fraction dimensions ``` ```{r} colnames(h5se.bm) <- h5se.test$gsm # assign sample ids to dnam fractions knitr::kable(head(h5se.bm), align = "c") # show table of dnam fractions ``` Access manifest information for probes with `getAnnotation`. This includes the bead addresses, probe type, and genome coordinates and regions. For full details about the probe annotations, consult the minfi and Illumina platform documentation. ```{r} an <- minfi::getAnnotation(h5se.test) # get platform annotation dim(an) # get annotation dimensions ``` ```{r} colnames(an) # get annotation column names ``` ```{r} ant <- as.matrix(t(an[c(1:4), c(1:3, 5:6, 9, 19, 24, 26)])) # subset annotation knitr::kable(ant, align = "c") # show annotation table ``` # `HDF5` database and example To provide more workflow options, bead-specific red and green signal data have been provided with sample metadata in an `HDF5`/`h5` file. This example shows how to handle objects of this type with `recountmethylation`. ## Obtain the test database The test `h5` file includes metadata and bead-specific signals from chromosome 22 for the same 2 samples as in the `h5se` test file. Note `getdb` functions for `h5` files simply return the database path. Since the test `h5` file has also been included in the package "inst" folder, get the path to load the file as follows. ```{r} dn <- "remethdb-h5_rg-test_0-0-1_1590090412.h5" # get the h5se directory name h5.test <- system.file("extdata", "h5test", dn, package = "recountmethylation") # get the h5se dir path ``` ## Inspect and summarize the database Use the file path to read data into an `RGChannelSet` with the `getrg` function. Setting `all.gsm = TRUE` obtains data for all samples in the database files, while passing a vector of GSM IDs to `gsmv` argument will query a subset of available samples. Signals from all available probes are retrieved by default, and probe subsets can be obtained by passing a vector of valid bead addresses to the `cgv` argument. ```{r} h5.rg <- getrg(dbn = h5.test, all.gsm = TRUE) # get red/grn signals from an h5 db ``` To avoid exhausting active memory with the full-sized `h5` dataset, provide either `gsmv` or `cgv` to `getrg`, and set either `all.cg` or `all.gsm` to FALSE (see `?getrg` for details). As in the previous example, use `pData` and `getAnnotation` to get sample metadata and array manifest information, respectively. 
Access the green and red signal matrices in the `RGChannelSet` with the minfi functions `getRed` and `getGreen`.

```{r}
h5.red <- minfi::getRed(h5.rg) # get red signal matrix
h5.green <- minfi::getGreen(h5.rg) # get grn signal matrix
dim(h5.red) # get dimensions of red signal matrix
```

```{r}
knitr::kable(head(h5.red), align = "c") # show first rows of red signal matrix
```

```{r}
knitr::kable(head(h5.green), align = "c") # show first rows of grn signal matrix
```

```{r}
identical(rownames(h5.red), rownames(h5.green)) # check cpg probe names identical
```

Rows in these signal matrices map to bead addresses rather than probe IDs. These matrices have more rows than the `h5se` test Beta-value matrix because Type I probes use data from 2 beads each.

# Validate DNAm datasets

This section demonstrates validation using the test databases. Full code to reproduce this section is provided but not evaluated, as it involves a download from the GEO servers. As the disclaimer notes, it is good practice to validate data against the latest available GEO files. This step may be most useful for newer samples published close to the end compilation date (through November 7, 2020 for the current version), which may be more prone to revisions soon after initial publication.

## Download and read IDATs from the GEO database server

Use the `gds_idat2rg` function to download IDATs for the 2 test samples and load these into a new `RGChannelSet` object. Do this by passing a vector of GSM IDs to `gsmv` and the download destination to `dfp`. (Note: chunks in this section are fully executable, but are not evaluated for this vignette.)

```{r, eval = FALSE}
# download from GEO
dlpath <- tempdir() # get a temp dir path
gsmv <- c("GSM1038308", "GSM1038309") # set sample ids to identify
geo.rg <- gds_idat2rg(gsmv, dfp = dlpath) # load sample idats into rgset
colnames(geo.rg) <- gsub("\\_.*", "", colnames(geo.rg)) # assign sample ids to columns
```

## Compare DNAm signals

Extract the red and green signal matrices from `geo.rg`.

```{r, eval = FALSE}
geo.red <- minfi::getRed(geo.rg) # get red signal matrix
geo.green <- minfi::getGreen(geo.rg) # get grn signal matrix
```

Match indices and labels between the GEO and `h5` test signal matrices.

```{r, eval = FALSE}
int.addr <- intersect(rownames(geo.red), rownames(h5.red)) # get shared probe address ids
geo.red <- geo.red[int.addr,] # subset geo rgset red signal
geo.green <- geo.green[int.addr,] # subset geo rgset grn signal
geo.red <- geo.red[order(match(rownames(geo.red), rownames(h5.red))),]
geo.green <- geo.green[order(match(rownames(geo.green), rownames(h5.green))),]
identical(rownames(geo.red), rownames(h5.red)) # check identical addresses, red
identical(rownames(geo.green), rownames(h5.green)) # check identical addresses, grn
class(h5.red) <- "integer"; class(h5.green) <- "integer" # set matrix data classes to integer
```

Finally, compare the signal matrix data.

```{r, eval = FALSE}
identical(geo.red, h5.red) # compare matrix signals, red
```

```{r, eval = FALSE}
identical(geo.green, h5.green) # compare matrix signals, grn
```

## Compare DNAm Beta-values

Before comparing the GEO-downloaded data to data from the `h5se.test` database, normalize the data using the same out-of-band or "noob" normalization technique that was used to generate the data in the `h5se` database.

```{r, eval = FALSE}
geo.gr <- minfi::preprocessNoob(geo.rg) # get noob-normalized data
```

Next, extract the Beta-values.
```{r, eval = FALSE}
geo.bm <- as.matrix(minfi::getBeta(geo.gr)) # get normalized dnam fractions matrix
```

Now match the row and column labels and indices.

```{r, eval = FALSE}
h5se.bm <- as.matrix(h5se.bm) # set dnam fractions to matrix
int.cg <- intersect(rownames(geo.bm), rownames(h5se.bm)) # get shared probe ids
geo.bm <- geo.bm[int.cg,] # subset fractions on shared probe ids
geo.bm <- geo.bm[order(match(rownames(geo.bm), rownames(h5se.bm))),]
```

Finally, compare the two datasets.

```{r, eval = FALSE}
identical(summary(geo.bm), summary(h5se.bm)) # check identical summary values
```

```{r, eval = FALSE}
identical(rownames(geo.bm), rownames(h5se.bm)) # check identical probe ids
```

# Troubleshooting and tips

This section describes how to address potential issues with accessing the database files or working with the `DelayedArray`-based objects locally.

## Issue: large file downloads don't complete

If repeated attempts to download the database compilation files fail, you may try the following:

* First, ensure your internet connection is stable and that there is sufficient space at the download destination for the database file.
* Second, try increasing your timeout duration beyond the default before repeating the download attempt with `getdb`. Check the current timeout for an R session with `getOption('timeout')`, then manually increase the timeout duration with `options(timeout = new.time)`.
* Finally, you may attempt to download a server file using command line calls to your system terminal or console. For instance, on a Mac you might try `wget -r `. If this doesn't work, you can again attempt to increase the timeout duration and repeat the download attempt.

## Issue: unexpected function behaviors for `DelayedArray` inputs

Unexpected function behaviors may arise when using `DelayedArray`-based inputs. These essentially arise from a lack of interoperability between normal matrices and the `DelayedArray`-based matrices. Known examples include:

* `minfi::detectionP()`: throws an error for specific subsets of data, such as for queries of exactly 50 samples.

```{r, eval = FALSE}
detectionP(rg[,1:50]) # get detection pvalues from rgset
# "Error in .local(Red, Green, locusNames, controlIdx, TypeI.Red, TypeI.Green,
#  dim(Red_grid) == dim(detP_sink_grid) are not all TRUE"
```

* `minfi::preprocessFunnorm()`: throws an error when called for an `RGChannelSet` of type `HDF5-SummarizedExperiment`.

```{r, eval = FALSE}
preprocessFunnorm(rg) # get funnorm-normalized data
# "Error: 'preprocessFunnorm()' only supports matrix-backed minfi objects."
```

These and other related errors may be addressed by instantiating the data query, or the data chunk, as a new non-`DelayedArray` object. For example, remake a subset of the full `h5se` dataset, `rg`, as follows.

```{r, eval = FALSE}
rg.h5se <- loadHDF5SummarizedExperiment(rg.path) # full h5se RGChannelSet
rg.sub <- rg.h5se[,c(1:20)] # subset samples of interest
rg.new <- RGChannelSet(Red = getRed(rg.sub), Green = getGreen(rg.sub),
                       annotation = annotation(rg.sub)) # re-make as non-DA object
gr <- preprocessFunnorm(rg.new) # repeat preprocessing
```

Alternatively, non-`DelayedArray` `RGChannelSet` objects can be readily generated from the full `h5` `RGChannelSet` database with the provided `getrg()` function.

# Get more help

Consult the Data Analyses [vignette](link.url) and the main [manuscript](link.url) for analysis examples and details about the data compilations.

# Session info

```{r get_sessioninfo}
sessionInfo()
```

# Works Cited