% \VignetteIndexEntry{BEclear tutorial} % \VignetteKeywords{DNA Methylation, Batch Effects, Matrix Completion} % \VignettePackage{BEclear} \documentclass[12pt]{article} <>= options(width=65) @ \SweaveOpts{eps=FALSE,echo=TRUE} \begin{document} \SweaveOpts{concordance=TRUE} \title{BEclear - Correct for batch effects in DNA methylation data} \author{Markus Merl*\\ *email: \texttt{beclear.package@gmail.com}} \date{Modified: February 18, 2015 Compiled: \today} \maketitle This example guides to the BEclear package to correct for batch effects in DNA methylation data. The package provides some functions to detect and correct such batch effects. The core function BEclear is based on Latent Factor Models [1] and can also be used to predict missing values in any other matrix containing real numbers. \tableofcontents \section{Introduction} The individual chapters guide through the available methods of the BEclear package in a logical order following an example of correcting some batch affected DNA methylation data. This article should only give a small tutorial, more details about the individual methods can always be found in the help sections of the BEclear package, e.g. through typing "?calcMedians" on the R console. To work with the methods contained in the BEclear package, a matrix or data.frame with genes as rownames and samples as column names as well as a samples data.frame with the first column named "sample\_id" and the second column named "batch\_id" is needed as input. To run an example workflow, we first have to load the example data into our workspace. A matrix with 250 genes as rows and 40 samples as columns containing beta values, as well as a data.frame containing the samples and corresponding batch names emerges. To get an intuition of how the inuput data for the methods should look like, we print out the first 10 rows and five columns of the matrix on the screen and also the first 10 rows of the samples data.frame. <>= library(BEclear) data(BEclearData) ex.data[1:10,1:5] ex.samples[1:10,] @ The beta values stored in the ex.data matrix are obtained from level 3 BRCA data from the TCGA portal [2]. Generally, beta values are claculated by dividing the methylated signal by the sum of the unmethylated and methylated signals from a DNA methylation microrarray. In the level 3 TCGA data, this calculation has already been done. The sample data used here contains averaged beta values of probes that belong to promoter regions of single genes. Another possibility would be to use beta values of single probes, whereby the probe names should then be used instead of the gene names as rownames of the matrix. \section{calcPvalues} Next, we want to know if our data is batch affected or not. We therefore calculate false discovery rate adjusted p-values by Kolmogorov-Smirnov (ks) test for every gene in every batch. We could also use any other adjustment method contained in the p.adjust function of the R stats package. If possible, we can run the method in parallel mode which is also the default of this paramter. If the method should be run on a machine with just one core, the parallel parameter should simply be set to FALSE. The resulting p-values tell us if a gene from a certain batch contains a batch effect or not. Thereby, every gene with a p-value below 0.01 is assumed as batch affected. <>= pvals <- calcPvalues(ex.data, ex.samples, parallel=TRUE, cores=4, adjusted=TRUE, method="fdr") @ Returned is a data.frame containing p-values for all 10 batches and for every gene. We print out the p-values for 10 genes and 4 batches: <>= pvals[210:220,5:8] @ We can see that most of the p-values are beyond 0.01, but some of the genes have p-values below our threshold value, e.g. the "SPINK2", "SNX21" or "SMOX" genes in batch b136, which all have a p-values < 0.0001 from which we conclude that the beta values of these genes corresponding to batch b136 does not follow the typical distribution for the beta values in the other batches which suggests a batch effect. \section{calcMedians} To see to which extend the found genes from the ks-test are affected by the batch effect, we calculate the median difference (mdif) values for every gene and every batch in a similar way. We consider all genes with mdif values beyond or equal to 0.05 as batch affected since values beyond this threshold would already make a biological difference in the methylation level. Since beta values are bounded between 0 and 1, this threshold indicates more than 5\% of the overall deviation of this [0;1] interval. Returned is again a data.frame containing mdif values for every gene in every batch. We print out again the mdif values for the same genes and batches as before. <>= mdifs <- calcMedians(ex.data, ex.samples, parallel=TRUE, cores=4) mdifs[210:220,5:8] @ We can see that most of the mdif values are smaller than 0.05, but some of the genes have mdif values beyond this threshold value, especially the "SPINK2" gene in batch b136, which has a mdif value of ca. 0.43 which, together with the small pvalue, strongly indicates a batch effect for this gene. \section{calcSummary} Now, we summarize the results of the p-value and mdif calculations. <>= summary <- calcSummary(medians=mdifs, pvalues=pvals) @ This method simply lists all genes that have p-values smaller than or equal to 0.01 and mdif values greater than 0.05, together with the corresponding batch number in a data.frame. Now we have a list of all genes supposed to be batch affected. Let us print the first 10 rows of this summary: <>= summary[1:10,] @ We can see that all of the affected genes are from batch b136. Although we would define this batch as batch affected, some genes could also be found in other batches which are just slightly affected. We therefore would not talk about a batch effect, but the beta values of these found genes can also be corrected. When looking for such genes, we can find 1 in batch b117 and 2 in batch b61. <>= summary[summary$batch != "b136",] @ Overall, we found 54 genes with adjusted p-values below 0.01 and mdif values beyond or equal to 0.05. 51 of these genes were found in batch b136. Alltogether, this strongly indicates a batch effect. \section{calcScore} Outgoing from the summary, we can calculate a score that tells us for every batch if it contains a batch effect or not: <>= score <- calcScore(ex.data, ex.samples, summary, dir=getwd()) score @ The method lists the number of affected genes in terms of mdif and p-values for every gene and delivers us a so called BEscore that tells us if we should correct the data for batch effects or not. The BEscore for batch b136 is 0.752 and tells us that we should correct the data. Additionally, the scoring table is stored as .RData file in the specified directory. As default value, the current working direcotry is used. Details about the scoring system can be found in the documentation of this method. \section{makeBoxplot} We can also visualize the result of the calculated score through a color-coded boxplot \begin{center} <>= makeBoxplot(ex.data, ex.samples, score, bySamples=TRUE, col="standard", main="Example data", xlab="Batch", ylab="Beta value", scoreCol=TRUE) @ \end{center} The boxplot shows the example data separated by samples. The batch numbers are shown on the x-axis in a color that denotes if the batch is affected (red) or not (green) or if there is a slight batch affect assumed (yellow). We can easily recognize that the boxes of batch b136 behave different compared to the other batches. \section{clearBEgenes} Now we decided to correct the data for the found batch effect. Therefore we have to set all previously found affected beta values to NA: <>= cleared.data <- clearBEgenes(ex.data, ex.samples, summary) @ Returned is the input matrix with NA values as defined by the summary. \section{countValuesToPredict} If we already have a matrix with missing entries (not necessarily beta values), we can use this method to count the NA values within the matix: <>= counted <- countValuesToPredict(cleared.data) @ \section{BEclear} This method performs the batch effect correction by predicting all the missing entries we formerly set to NA in our input matrix and returns a matrix that contains all beta values from the original input matrix, together with predicted beta values for the entries formerly set to NA. The correction is done by performing matrix completion using Latent Factor Models based on matrix factorization [1,3]: <>= corrected.data <- BEclear(cleared.data, parallel=TRUE, cores=4, rowBlockSize=60, colBlockSize=60, epochs=50, outputFormat="RData", dir=getwd()) @ The result of the prediction can easily be seen when we again use the makeBoxplot method: <>= makeBoxplot(corrected.data, ex.samples, score, bySamples=TRUE, col="standard", main="Corrected example data", xlab="Batch", ylab="Beta value", scoreCol=FALSE) @ For more details about the prediction and the further parameters please read the documentation of this method. Note that the corrected data is also stored as .Rdata file in the specified directory. \section{findWrongValues \& replaceWrongValues} Sometimes during the prediction, it can happen that values beyond the boundaries of beta values are returned, that means values smaller than zero or greater than one. findWrongValues simply returns a list of these values, together with the position in the output matrix, replaceWrongValues corrects these by simply setting the wrong values to zero or one, respectively. Since we cannot guarantee that we get wrong values during the prediction with the example data, we forego these methods for now. Note that these methods are especially designed for the prediction of DNA methylation data. \section{correctBatchEffect} This method performs most of the previously introduced methods step by step in a logical order to simply correct the input data on the basis of the calculated score or not. <>= result <- correctBatchEffect(ex.data, ex.samples, parallel=TRUE, cores=4, adjusted=TRUE, method="fdr", rowBlockSize=60, colBlockSize=60, epochs=50, outputFormat="RData", dir=getwd()) @ Returned is a list containing the complete output of the previously introduced methods, e.g. result\$medians returns the mdif values containing data.frame. For details see again the documentation of this method. \section{Conclusions} In this tutorial, we have followed the whole procedure of detecting and correcting a batch effect in DNA methylation data by using the methods from the BEclear package. This document is intended to give a first overview about the functionality of the package. More details and references can be found in the corresponding documentation. \section{References} \bibliographystyle{plainnat} [1] E. Candes, B. Recht, Exact matrix completion via convex optimization, Communications of the ACM, 55(6), p. 111-119, 2012. \newline [2] http://cancergenome.nih.gov/ \newline [3] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, IEEE Computer, 42(8), p. 30-37, 2009. \end{document}