% \VignetteIndexEntry{MethylAid: Visual and Interactive quality control of Illumina Human DNA Methylation array data} % \VignettePackage{MethylAid} % \VignetteEngine{knitr::knitr} \documentclass[12pt]{article} \usepackage[english]{babel} %%remove tildes e.q. in bibliography <>= BiocStyle::latex() @ \author{Maarten van Iterson, Elmar Tobi, Roderick Slieker, Wouter den Hollander, Rene Luijk, \\Eline Slagboom and Bas Heijmans \\Department of Molecular Epidemiology, \\Leiden University Medical Center, Leiden, The Netherlands} \title{MethylAid: Visual and Interactive quality control of Illumina Human DNA Methylation array data} \begin{document} \maketitle \tableofcontents \section{Introduction} \Biocpkg{MethylAid} is specially designed for quality control of large sets of Illumina Human DNA methylation array data sets e.g., epigenomewide association studies (EWAS) using the 450k or 850k (EPIC) arrays. Extracting intensities from IDAT files can be done in batches and/or in parallel to reduce memory load and/or overcome long run-times. It requires two function calls in going from IDAT files to launch the interactive web application; \Rfunction{summarize} and \Rfunction{visualize}. For more information see van Iterson \emph{et al.} \cite{iterson2014}. To show the utility of \Biocpkg{MethylAid}, we first show a quick example using a small set of idat files taken from the \Biocexptpkg{minfiData} package. A second example uses presummarized data on 500 samples which can be used directly to launch the web application. A third example shows how level 1 data downloaded from The Cancer Genome Atlas can be used. Similar, in the fourth example we show how data downloaded from GEO\cite{Edgar2002} can be used. For example, Liu \emph{et al.} \cite{Liu2013} studied genome-wide DNA methylation levels to determine whether Rheumatoid arthritis patients has methylation differences comparing to normal controls in peripheral blood leukocytes (PBLs) and deposit raw data on GEO. Furthermore, we show how the summarization can be performed in batches or in parallel using the \Biocpkg{BiocParallel} package which provides a uniform idiom for parallel computing resources. \Biocpkg{MethylAid} identifies poorly performing samples that are to be removed prior to processing and further analysis of the data. Other \R{}/\Bioconductor{}-packages such as, \Biocpkg{wateRmelon, minfi, methylumi, lumi, COHCAP, ChAMP}, are available for processing, i.e. background, dye-bias and probe correction or for further analysis such as the detection of differentially methylated regions. Many of these packages include quality control as well. The package \href{https://github.com/Jfortin1/shinyMethyl}{shinyMethyl} allows quality control assessment and interactive exploration of 450k array data in a similar way as \Biocpkg{MethylAid}. We have a demo running at \href{http://shiny.bioexp.nl/MethylAid}{http://shiny.bioexp.nl/MethylAid} using the example-data of the package. \section{Quick start} This example shows how to summarize a small set of idat files e.g. the summarized data fits into the available RAM. The \Biocexptpkg{minfiData}-package provides such a set. The package contains 450k DNA methylation data on 6 samples across 2 groups. Since, \Biocpkg{MethylAid} uses internally the function \Rfunction{read.metharray.exp} from the \Biocpkg{minfi}-package\cite{Aryee2014}, target information should be provided in a similar way as is done when using the \Biocpkg{minfi}-package. <>= library(MethylAid) library(minfiData) baseDir <- system.file("extdata", package = "minfiData") targets <- read.metharray.sheet(baseDir) @ The function \Rfunction{summarize} performs the summarization of the the control probes present on the array. The output produced by \Rfunction{summarize} is an object of class \Robject{summarizedData} which should be passed on to the function \Rfunction{visualize} which in turn will launch the interactive web application. <>= data <- summarize(targets) visualize(data) @ \bioccomment{\Rfunction{visualize} can only be called interactively from within a \R{} session.} \section{Example using presummarized 450k data} For those who directly want to explore the web application we have presummarized 450k data on 500 samples. <>= library(MethylAid) data(exampleData) visualize(exampleData) @ \bioccomment{\Rfunction{visualize} can only be called interactively from within a \R{} session.} When your webbrowser opens you should see something like Figure \ref{screenshot_MethylAid}. In the `About` panel a description is given of what you see and can do. \begin{figure} \centering \includegraphics[width=\textwidth]{screenshot_MethylAid} \caption{Screen shot \Rpackage{MethylAid} interactive web application: The web interface contains three parts; a panel with widgets to control the appearance of the quality control plots, tab-panels for chosing the quality control plot of interest and the interactive plotting area. When the interactive web application is launched the rotated M versus U plot is shown. Here using example data from the package on 500 450k human methylation samples with selected ``scan\_batches'' using the input selector widget ``highlight metadata column''. The different ``scan\_batches'' are indicated with different colors. The dashed-vertical line indicated the filter threshold, samples below the threshold should be removed.} \label{screenshot_MethylAid} \end{figure} \section{Example using 450k data downloaded from The Cancer Genome Atlas} The Cancer Genome Atlas (\url{http://cancergenome.nih.gov/}) provides a valuable source of genomic data of various types of different cancers. Here all Breast invasive carcinoma (BRCA) 450k level 1 DNA methylation data were download from the TCGA Data Portal. A targets file can be constructed from the sdrf file but some preprocessing is necessary. The following code chunk show how to make a minimal targets file from a sdrf file. There is also a check to see if all files are present otherwise these will be removed from the targets file. \fixme{In the future this might be added as an internal check.} <>= sdrfFile <- list.files(pattern="sdrf", full.name=TRUE) targets <- read.table(sdrfFile, header=TRUE, sep="\t") path <- "path_to_idat_files" targets <- targets[file.exists(file.path(path, targets$Array.Data.File)),] targets <- targets[grepl("Red", targets$Array.Data.File),] targets$Basename <- gsub("_Red.*$", "", file.path(path, targets$Array.Data.File)) rownames(targets) <- basename(targets$Basename) head(targets) @ Now the $2\times 137$ idat files can be summarized. This could be too much to read in at once therefore the option \Rcode{batchSize = 15} is used for summarization of the data in batches of size 15. Furthermore, the summarized data is stored as an \Rcode{RData}-object for later use, using the option \Rcode{file = "tcgaBRCA"}. After summarization the data can be loaded into \R{} and passed to \Rfunction{visualize} for interactive exploration of the data. <>= summarize(targets, batchSize = 15, file = "tcgaBRCA") load("tcgaBRCA.RData") visualize(tcgaBRCA) @ \bioccomment{\Rfunction{visualize} can only be called interactive from within a \R{} session.} \section{Example using 450k data downloaded from GEO} Several 450k data sets are available from GEO some include raw idat files e.g. the study of Liu \emph{et al.} \cite{Liu2013} with GEO series number: GSE42861. The idat files of this study are available under GSE42861\_RAW.tar. Target information was extracted using the \Biocpkg{GEOquery} package. \bioccomment{To run the following code chunk a considerable amount of RAM should be available.} <>= library(GEOquery) gse <- getGEO("GSE42861") targets <- pData(phenoData(gse[[1]])) path <- "path_to_idat_files" targets$Basename <- file.path(path, gsub("_Grn.*$", "", basename(targets$supplementary_file))) rownames(targets) <- basename(targets$Basename) @ Again we store the summarized data for later use and summarize the data in batches of size 15. <>= summarize(targets, batchSize = 15, file="RA") load("RA.RData") visualize(RA) @ \bioccomment{\Rfunction{visualize} can only be called interactive from within a \R{} session.} \section{Parallel summarization} \Biocpkg{MethylAid} was specially designed for quality control of large set of DNA methylation data e.g. EWAS studies. Summarization can be performed in batches to overcome memory problems e.g. when too many idat files are read at once. Just provide the option \Rcode{batchSize} to the \Rfunction{summarize} function as we did in the previous example. Too overcome long run-times summarization can be performed in parallel as well using various computing resource facilitated by the \Biocpkg{BiocParallel} package. The \Rfunction{summarize} accepts a \Rcode{bpparam} argument e.g. \Robject{MulticoreParam} (\bioccomment{unfortunately this is not available on Windows}). For example, perform summarization on a multiple core machine with 8 cores requested is easy as this: <>= library(BiocParallel) tcga <- summarize(targets, batchSize = 15, BPPARAM = MulticoreParam(workers = 8)) @ The \Biocpkg{BiocParallel} also allows parallelization on a cluster of computers using different job schedulers. For example, using the appropriate configuration file a \Robject{BatchJobsParam}-object can be constructed and passed on to the \Rfunction{summarize}-function. <>= library(BiocParallel) conffile <- system.file("scripts/config.R", package="MethylAid") BPPARAM <- BatchJobsParam(workers = 10, progressbar = FALSE, conffile = conffile) summarize(targets, batchSize = 50, BPPARAM = BPPARAM) @ The script folder of the package contains example files for parallel summarization on a cluster computer using the Sun Grid Engine job scheduler. For more information on how to setup a configuration file for other job schedulers see the description of \Rpackage{BatchJobs} \url{https://github.com/tudo-r/BatchJobs} (\Biocpkg{BiocParallel} relies on the cran \R{} package \Rpackage{BatchJobs}\cite{Bischl2011}). Probaly, you should at least set your email address in \Rcode{scripts/summarize.sh} and use the proper \R{} executable e.g change this in \Rcode{scripts/summarize.sh} and \Rcode{scripts/sge.tmpl}. \section{Customize \Biocpkg{MethylAid}} \subsection{User-defined thresholds} \Biocpkg{MethylAid} uses pre-defined thresholds in the filter control plots to determine outlying samples. These thresholds are based on experience with several 450k data sets that were analysed in the Department of Molecular Epidemiology (Leiden University Medical Center). The values of these thresholds are further supported by a large reference set available from the \Biocpkg{MethylAidData}-package. However, if one wished to use customised thresholds, e.g. for hydroxymethylation data, these can be given as arguments the the \Rfunction{visualize}-function. <>= visualize(exampleData, thresholds = list(MU = 10.5, OP = 11.75, BS = 12.75, HC = 13.25, DP = 0.95)) @ \subsection{Generating your own reference dataset} The \Biocpkg{MethylAidData}-package provides a reference dataset on 2800 samples. A reference data set can be used to compare with another data set. For example, call \Rfunction{visualize} with the argument \Rcode{background} as shown below. <>= library(MethylAid) data(exampleData) ##500 samples library(MethylAidData) data(exampleDataLarge) ##2800 samples outliers <- visualize(exampleData, background=exampleDataLarge) head(outliers) @ Every dataset that has been summarized by \Biocpkg{MethylAid} can be used as a reference data set. Furthermore, different summarized data sets can be combined to one bigger reference data set using the \Rfunction{combine}. <>= library(MethylAid) data(exampleData) exampleData combine(exampleData, exampleData) @ \section{Session info} Here is the output of \Rfunction{sessionInfo} on the system on which this document was compiled: <>= toLatex(sessionInfo()) @ \bibliography{MethylAid} \end{document}