% \VignetteIndexEntry{BEclear tutorial}
% \VignetteKeywords{DNA Methylation, Batch Effects, Matrix Completion}
% \VignettePackage{BEclear}
\documentclass[12pt]{article}
<<echo=FALSE>>=
options(width=65)
@

\SweaveOpts{eps=FALSE,echo=TRUE}

\begin{document} 
\SweaveOpts{concordance=TRUE}
\title{BEclear - Correct for batch effects in DNA methylation data}
\author{Markus Merl*\\
*email: \texttt{beclear.package@gmail.com}}
\date{Modified: February 18, 2015  Compiled: \today}
\maketitle
This example guides to the BEclear package to correct for batch effects in DNA
methylation data. The package provides some functions to detect and correct
such batch effects. The core function BEclear is based on Latent Factor Models
[1] and can also be used to predict missing values in any other matrix
containing real numbers.

\tableofcontents

\section{Introduction}
The individual chapters guide through the available methods of the BEclear
package in a logical order following an example of correcting some batch
affected DNA methylation data. This article should only give a small tutorial,
more details about the individual methods can always be found in the help
sections of the BEclear package, e.g. through typing "?calcMedians" on the R
console.
To work with the methods contained in the BEclear package, a matrix or
data.frame with genes as rownames and samples as column names as well as a
samples data.frame with the first column named "sample\_id" and the second
column named "batch\_id" is needed as input.
To run an example workflow, we first have to load the example data into our
workspace. A matrix with 250 genes as rows and 40 samples as columns containing
beta values, as well as a data.frame containing the samples and corresponding
batch names emerges. To get an intuition of how the inuput data for the methods
should look like, we print out the first 10 rows and five columns of the matrix
on the screen and also the first 10 rows of the samples data.frame.
<<echo=TRUE>>=
library(BEclear)
data(BEclearData)

ex.data[1:10,1:5]

ex.samples[1:10,]
@
The beta values stored in the ex.data matrix are obtained from level 3 BRCA
data from the TCGA portal [2]. Generally, beta values are claculated by
dividing the methylated signal by the sum of the unmethylated and methylated
signals from a DNA methylation microrarray. In the level 3 TCGA data, this
calculation has already been done. The sample data used here contains averaged
beta values of probes that belong to promoter regions of single genes. Another
possibility would be to use beta values of single probes, whereby the probe
names should then be used instead of the gene names as rownames of the matrix.

\section{calcPvalues}
Next, we want to know if our data is batch affected or not. We therefore
calculate false discovery rate adjusted p-values by Kolmogorov-Smirnov (ks)
test for every gene in every batch. We could also use any other adjustment
method contained in the p.adjust function of the R stats package. If possible,
we can run the method in parallel mode which is also the default of this
paramter. If the method should be run on a machine with just one core, the
parallel parameter should simply be set to FALSE. The resulting p-values tell
us if a gene from a certain batch contains a batch effect or not. Thereby,
every gene with a p-value below 0.01 is assumed as batch affected.
<<echo=TRUE>>=
pvals <- calcPvalues(ex.data, ex.samples, parallel=TRUE,
        cores=4, adjusted=TRUE, method="fdr")
@
Returned is a data.frame containing p-values for all 10 batches and for every
gene. We print out the p-values for 10 genes and 4 batches:
<<echo=TRUE>>=
pvals[210:220,5:8]
@
We can see that most of the p-values are beyond 0.01, but some of the
genes have p-values below our threshold value, e.g. the "SPINK2", "SNX21" or
"SMOX" genes in batch b136, which all have a p-values < 0.0001 from which we
conclude that the beta values of these genes corresponding to batch b136 does
not follow the typical distribution for the beta values in the other batches
which suggests a batch effect.

\section{calcMedians}
To see to which extend the found genes from the ks-test are affected by the
batch effect, we calculate the median difference (mdif) values for every gene
and every batch in a similar way. We consider all genes with mdif values beyond
or equal to 0.05 as batch affected since values beyond this threshold would
already make a biological difference in the methylation level. Since beta
values are bounded between 0 and 1, this threshold indicates more than 5\% of
the overall deviation of this [0;1] interval. Returned is again a data.frame
containing mdif values for every gene in every batch. We print out again the
mdif values for the same genes and batches as before.
<<echo=TRUE>>=
mdifs <- calcMedians(ex.data, ex.samples, parallel=TRUE,
        cores=4)

mdifs[210:220,5:8]
@
We can see that most of the mdif values are smaller than 0.05, but some of the
genes have mdif values beyond this threshold value, especially the "SPINK2"
gene in batch b136, which has a mdif value of ca. 0.43 which, together with the
small pvalue, strongly indicates a batch effect for this gene.

\section{calcSummary}
Now, we summarize the results of the p-value and mdif calculations.
<<echo=TRUE>>=
summary <- calcSummary(medians=mdifs, pvalues=pvals)
@
This method simply lists all genes that have p-values smaller than or equal to
0.01 and mdif values greater than 0.05, together with the corresponding batch
number in a data.frame. Now we have a list of all genes supposed to be batch
affected. Let us print the first 10 rows of this summary:
<<echo=TRUE>>=
summary[1:10,]
@
We can see that all of the affected genes are from batch b136. Although we
would define this batch as batch affected, some genes could also be found in
other batches which are just slightly affected. We therefore would not talk
about a batch effect, but the beta values of these found genes can also be
corrected. When looking for such genes, we can find 1 in batch b117 and 2 in
batch b61.
<<echo=TRUE>>=
summary[summary$batch != "b136",]
@
Overall, we found 54 genes with adjusted p-values below 0.01 and mdif values
beyond or equal to 0.05. 51 of these genes were found in batch b136.
Alltogether, this strongly indicates a batch effect.

\section{calcScore}
Outgoing from the summary, we can calculate a score that tells us for every
batch if it contains a batch effect or not:
<<echo=TRUE>>=
score <- calcScore(ex.data, ex.samples, summary, dir=getwd())

score
@
The method lists the number of affected genes in terms of mdif and p-values for
every gene and delivers us a so called BEscore that tells us if we should
correct the data for batch effects or not. The BEscore for batch b136 is
0.752 and tells us that we should correct the data. Additionally, the scoring
table is stored as .RData file in the specified directory. As default value,
the current working direcotry is used. Details about the scoring system can be
found in the documentation of this method.

\section{makeBoxplot}
We can also visualize the result of the calculated score through a color-coded
boxplot
\begin{center}
<<fig=TRUE ,echo=TRUE >>=
makeBoxplot(ex.data, ex.samples, score, bySamples=TRUE,
        col="standard", main="Example data", xlab="Batch",
        ylab="Beta value", scoreCol=TRUE)
@
\end{center}
The boxplot shows the example data separated by samples. The batch numbers are
shown on the x-axis in a color that denotes if the batch is affected (red) or
not (green) or if there is a slight batch affect assumed (yellow). We can
easily recognize that the boxes of batch b136 behave different compared to the
other batches.

\section{clearBEgenes}
Now we decided to correct the data for the found batch effect. Therefore we
have to set all previously found affected beta values to NA:
<<echo=TRUE>>=
cleared.data <- clearBEgenes(ex.data, ex.samples, summary)
@
Returned is the input matrix with NA values as defined by the summary.

\section{countValuesToPredict}
If we already have a matrix with missing entries (not necessarily beta values),
we can use this method to count the NA values within the matix:
<<echo=TRUE>>=
counted <- countValuesToPredict(cleared.data)
@

\section{BEclear}
This method performs the batch effect correction by predicting all the missing
entries we formerly set to NA in our input matrix and returns a matrix that
contains all beta values from the original input matrix, together with
predicted beta values for the entries formerly set to NA. The correction is
done by performing matrix completion using Latent Factor Models based on
matrix factorization [1,3]:
<<echo=TRUE, eval=FALSE>>=
corrected.data <- BEclear(cleared.data, parallel=TRUE,
        cores=4, rowBlockSize=60, colBlockSize=60,
        epochs=50, outputFormat="RData", dir=getwd())
@

The result of the prediction can easily be seen when we again use the
makeBoxplot method:
<<fig=TRUE ,echo=TRUE, eval=FALSE >>=
makeBoxplot(corrected.data, ex.samples, score, bySamples=TRUE,
        col="standard", main="Corrected example data",
        xlab="Batch", ylab="Beta value", scoreCol=FALSE)
@
For more details about the prediction and the further parameters please read
the documentation of this method. Note that the corrected data is also stored
as .Rdata file in the specified directory.

\section{findWrongValues \& replaceWrongValues}
Sometimes during the prediction, it can happen that values beyond the
boundaries of beta values are returned, that means values smaller than zero or
greater than one. findWrongValues simply returns a list of these values,
together with the position in the output matrix, replaceWrongValues corrects
these by simply setting the wrong values to zero or one, respectively. Since
we cannot guarantee that we get wrong values during the prediction with the
example data, we forego these methods for now. Note that these methods are
especially designed for the prediction of DNA methylation data.

\section{correctBatchEffect}
This method performs most of the previously introduced methods step by step in
a logical order to simply correct the input data on the basis of the calculated
score or not.
<<echo=TRUE, eval=FALSE>>=
result <- correctBatchEffect(ex.data, ex.samples,
        parallel=TRUE, cores=4, adjusted=TRUE,
        method="fdr", rowBlockSize=60, colBlockSize=60,
        epochs=50, outputFormat="RData", dir=getwd())
@
Returned is a list containing the complete output of the previously introduced
methods, e.g. result\$medians returns the mdif values containing data.frame.
For details see again the documentation of this method.

\section{Conclusions}
In this tutorial, we have followed the whole procedure of detecting and
correcting a batch effect in DNA methylation data by using the methods from the
BEclear package. This document is intended to give a first overview about the
functionality of the package. More details and references can be found in the
corresponding documentation.

\section{References}
\bibliographystyle{plainnat}
[1] E. Candes, B. Recht, Exact matrix completion via convex optimization,
Communications of the ACM, 55(6), p. 111-119, 2012. \newline
[2] http://cancergenome.nih.gov/ \newline
[3] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for
recommender systems, IEEE Computer, 42(8), p. 30-37, 2009.
\end{document}