%\VignetteIndexEntry{Analysing single-cell BS-Seq data with the "BEAT" package}
%\VignettePackage{BEAT}

% To compile this document
% library('cacheSweave');rm(list=ls());Sweave('BEAT.Rnw',driver=cacheSweaveDriver());system("pdflatex BEAT")

%\begin{figure}[ht]
%\centering
%\includegraphics[width=.6\textwidth]{FILENAME}
%\caption{CAPTION}
%\label{FIGURELABEL}
%\end{figure}

\documentclass[12pt,oneside]{article}

\usepackage[T1]{fontenc}
\usepackage[latin9]{inputenc}
\usepackage{geometry}
\geometry{verbose,tmargin=2cm,bmargin=2cm,lmargin=2cm,rmargin=2cm}
\setlength{\parindent}{0bp}
\usepackage{color}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{esint}

\begin{document}
\newcommand{\thetitle}{Quantification of DNA Methylation and Epimutations from Bisulfite Sequencing Data -- the BEAT package}

\title{\textsf{\textbf{\thetitle}}}
\author{Kemal Akman$^1$, Achim Tresch\\[1em]Max-Planck-Institute for Plant Breeding Research\\ Cologne, Germany\\
\texttt{$^1$akman@mpipz.mpg.de}}

%%BEAT version 0.99.1
\SweaveOpts{keep.source=TRUE,eps=FALSE,pdf=TRUE,png=TRUE,include=FALSE,width=5.4,height=3.7,resolution=180}

\maketitle

\begin{abstract}
The following example illustrates a standard use case of the BEAT package, which is used for processing and modeling methylated and unmethylated counts of CG positions from Bisulfite sequencing (BS-Seq) for determination of epimutation rates, i.e. the rate of change in DNA methylation status at CG sites between a reference multi-cell sample and single-cell samples.

The input for the package BEAT consists of count data in the form of counts for unmethylated and methylated cytosines per genomic position, which are then grouped into genomic regions of sufficient coverage in order to allow for low-coverage samples to be analyzed. Methylation rates of each region are then modeled using a binomial mixture model in order to adjust for experimental bias, which arises mainly out of incomplete bisulfite conversion (resulting in unmethylated CG positions falsely appearing to be methylated) and sequencing errors (resulting in random errors in methylation status, including methylated CG positions falsely appearing as unmethylated).

This vignette explains the use of the package. For a detailed exposition of the statistical method, please see our upcoming paper.
\end{abstract}

\tableofcontents

\section{Preamble} \label{sec:preamble}

\subsection{Biological Background} \label{subsec:biology}
Bisulfite sequencing (BS-Seq) is a sequence-based method to accurately detect DNA methylation at specific loci, which involves treating DNA with sodium bisulfite~\cite{frommer_genomic_1992}. Bisulfite conversion of unmethylated cytosines into uracil is a relatively simple chemical reaction which has now become a standard in DNA methylation profiling. The key advantage of this method is accuracy, as the degree of methylation at each cytosine can be quantified with great precision~\cite{fraga_dna_2002}. 

Most currently available techniques for determination of DNA methylation can only measure average values obtained from cell populations as a whole, requiring at least 30 ng of DNA, i.e. the equivalent of about 6000 cells~\cite{gu_genome-scale_2010}. Since these population approaches cannot account for cell- and position-specific differences in DNA methylation, which are termed epimutations~\cite{jeggo_azacytidine-induced_1986}, they are unsuitable for the characterization of cellular heterogeneity. However, this heterogeneity plays an important role in differentiation and development, stem cell re-programming, in diseases such as cancer and aging~\cite{egger_epigenetics_2004}. Developing single-cell approaches for measuring DNA methylation is not only be vital to fully understand individual cell-specific changes and complexity of tissue micro-environments, but also for the analysis of clinical samples, such as circulating tumor cells or needle biopsies, when the amount of material is often limited. With the BEAT package, we present a pipeline for the computational analysis part of such single-cell BS-Seq experiments.

\subsection{Package Functionality} \label{subsec:package}
This vignette will illustrate the complete work flow of a DNA methylation analysis, which includes model-based estimation of regional methylation rates from observed BS-Seq methylation counts, calling of methylation status, and the comparison of samples to determine regional differences in methylation status, i.e., epimutation calling. These three steps are implemented by the three public functions of BEAT, namely \textit{positions\_to\_regions}, \textit{generate\_results} and \textit{epimutation\_calls}, respectively.
The functionality of BEAT will be demonstrated using sample data consisting of the first chromosome of mouse liver hepatocytes, one reference sample ('reference.positions.csv') and one single-cell sample to compare against ('sample.positions.csv')
The main challenges in DNA methylation analysis are bias correction, modeling of sampling variance (shot noise due to low count numbers) and assessing the significance of changes.
 
\section{Input format} \label{sec:input}
The package BEAT expects as input one csv per sample with counts for unmethylated and methylated cytosines per genomic position. Such data can be obtained from the output of BS-Seq mapping tools such as bismark (see http://www.bioinformatics.babraham.ac.uk/projects/bismark/) followed by simple script-based data processing. BS-Seq data for methylated- and unmethylated counts per genomic position needs to be formatted into a data.frame object with the columns: 'chr', 'pos', 'meth', 'unmeth' (signifying chromosome name, chromosomal position, as well as methylated- and unmethylated cytosine counts at that chromosomal position).

%
<<showInput>>=
library(BEAT)
localpath <- system.file('extdata', package = 'BEAT')
positions <- read.csv(file.path(localpath, "sample.positions.csv"))
head(positions)
@
%

\section{Configuration and parameters} \label{sec:conf}
BEAT can process multiple samples at a time. For each sample specified under \textit{'sampNames'}, the package BEAT reads this data frame from a csv from the working directory, which is specified by the parameter \textit{'localpath'}. The file should be called <samplename>.positions.csv and should contain the aforementioned csv under the name 'positions'. Result files written by BEAT will have the names <samplename>.results.RData. A distinction is made between two samples whose methylation status is compared against that of the reference.
%% NOTE: The following single-cell distinction is only relevant for conversion of counts to binary counts; this has been done for the samples provided. We need to distinguish between samples that have been generated from a single cell or from a mixture of cells, since these cases require different statistical modeling.
For each sample, the assignment of reference or non-reference status to samples is done via the vector \textit{'is.reference'}, which is indexed by the \textit{'sampNames'} vector and contains one TRUE entry for the reference and one to many FALSE entries for samples to be compared against the reference. Our example data set contains two files, one named 'reference' for a reference consisting of a mixture of cells, and one named 'sample' for a single-cell sample to be compared against the reference.

%
<<setFilepath>>=
sampNames <- c('reference','sample')
sampNames
is.reference <- c(TRUE, FALSE)
@
%

For the modeling part of BEAT, Bisulfite conversion rates have to be specified per sample using the vector \textit{'convrates'}, which is indexed by \textit{'sampNames'}. These conversion rates need to be specified manually by the user. Practically, for mammalian somatic cell samples, these rates can be estimated for each sample by looking at the non-CpG methylation rate per sample and using the inverse value as estimated CpG methylation rate, because non-CpG methylation in these types of cells is expected to be near zero. For example, if non-CpG methylation in a sequenced sample was measured to be 0.1 then BS-conversion rate for that sample would be set to 0.9 when near-zero non-CpG methylation can be expected, as is the case in somatic mammalian cells.

%
<<setSampleData>>=
pplus <- c(0.2, 0.5)
convrates <- 1 - pplus
@
%

The aforementioned values are then set in a parameter object, which is used throughout the further work flow in this package. It provides the following additional options:

\begin{itemize}
\item 1-\textit{convrates} represents the fraction of unmethylated counts that are falsely called as methylated due to incomplete BS-conversion. This parameter is also referred to as 'pplus' in the statistical model.
\item \textit{'pminus'} represents the fraction of methylated counts that are falsely called as unmethylated.
\item \textit{'regionSize'} is the size of regions into which genomic positions are grouped. In single cells, the number of positions with sufficient coverage for reliable statistical predictions may be very small. Therefore, our pipeline applies epimutation calling to regions instead of single positions. Single positions are pooled into regions of appropriate size, i.e., regions containing sufficiently many CpG positions that have a positive read count number in both the reference and the single cell sample (.shared. CpG positions). Our method has then sufficient power to reliably detect epimutation events affecting these regions. For a detailed description of pooling positions into regions, see section \ref{sec:regions}.
\item After pooling CG positions to regions, there may still be regions with low count numbers that do not allow for reliable downstream analysis. Regions with less than \textit{'minCounts'} counts will be removed from further processing, potentially saving significant processing time in further analysis steps.
\end{itemize}

The following parameters are of minor importance, for a first analysis, they can be left at their pre-set values.

\begin{itemize}
\item \textit{'verbose'} is an option that prints additional information during computation when set to TRUE.
\textit{'computeRegions'} is an option that will recompute the regions from given positional input if set to TRUE; otherwise, it will depend on existing region files already present in the \textit{'localpath'} directory (file names ending in regions.RData).
\item \textit{'computeMatrices'} is an option that will recompute the model data from given regions if set to TRUE; otherwise, it will depend on existing model output files already present in the \textit{'localpath'} directory (file names ending in convMat.RData and results.RData.
\item \textit{'writeEpicallMatrix'} is an option that can be set to TRUE to generate epimutation calling output in the form of matrices (one row per genomic position, output format is CSV).
\end{itemize}

%
<<setConfig>>=
params <- makeParams(localpath, sampNames, convrates,
is.reference, pminus = 0.2, regionSize = 10000, minCounts = 5)
params
@
%

\section{Pooling of CG counts into regions} \label{sec:regions}
The supplied methylation counts for individual CG positions are grouped into regions by BEAT for modeling according to the specified \textit{'regionSize'} and \textit{'minCount'} parameters. The function positions\_to\_regions takes as input the samples as csv objects located under <samplename>.positions.csv in the working directory, as discussed above. The function saves a list of data.frames of resulting counts per genomic regions in the current working directory under <samplename>.regions.<regionSize>.<minCounts>.RData, with samplename, regionSize and minCounts replaced by the sample name and the respective parameters given. Each data.frame object contains a list of genomic regions covered by the given samples, consisting of the columns: 'chr', 'start', 'stop', 'meth', 'unmeth' (signifying chromosome name, start of region by chromosomal position, end of region by chromosomal position, as well as methylated- and unmethylated cytosine counts at that chromosomal position).

<<doConvert>>=
positions_to_regions(params)
@
%

\section{Statistical modeling} \label{sec:model}
Most of details about the statistical model explained in this section is also explained in our upcoming paper on the computational analysis for single-cell BS-Seq analysis. We repeat the statistical details in this section for the sake of completeness of the description of the computational and statistical details of our pipeline. Taking as input the region-based counts computed in the step above, the underlying model of the BEAT package then computes corrected counts, taking into account especially incomplete conversion rates (taken from the \textit{'convrates'} parameter) and the estimated sequencing error (specified as the \textit{'pminus'} parameter). The model is described briefly as follows. 

%%% BEGIN MODEL DESCRIPTION FROM SUPPLEMENTS
We have derived a Bayesian statistical model that gives detailed information
about the methylation rate in a region of multiple CpG positions which
is described below. Apart from estimating the methylation rate, it
provides measures of confidence for this estimate, it can test regions
for high or low methylation. On the basis of of these tests it is
later possible to give a precise definition of a regional epimutation
event. For multi-cell samples, we assume that all counts at a single
CpG position were obtained from pairwise different bisulfite converted
DNA template strands and represent independent observations. This
certainly holds in good approximation, because the number of available
DNA template strands typically supersedes the read coverage at this
position by far. For single cell samples, we encounter the opposite
situation: There are at most two template DNA strands available, and
for many CpG positions this number is reduced further through DNA
degradation. Multiple reads covering one CpG position are therefore
highly dependent. We combine multiple counts at one position to one
single (non-)methylation call. For different CpG position, these calls
are then independent observations. First, fix one region, i.e. some
set of CpG positions. The number of counts at a given position is
the number of reads mapping to that position. Let $n$ denote the
total number of counts at all CpG positions in the given region, and
let $k$ (respectively $n-k$) of them indicate methylation (respectively
non-methylation). Let $r$ be the (unknown) methylation rate at the
given position. Then, assuming independence of the single counts as
mentioned above, the actual number $j$ of counts originating from
methylated CpGs in this region follows a binomial distribution,
\begin{equation}
P(j\mid n,r)=Bin(j;n,r)\label{eq:basismodell}
\end{equation}


Let the false positive rate $p_{+}$ be the global rate of false methylation
counts, which is identical to the non-conversion rate of non-methylated
cytosines. Conversely, let the false negative rate $p_{-}$ be the
global rate of false non-methylation counts, which is identical to
the inappropriate conversion rate of methylated cytosines. One can
find an upper bound for $p_{+}$ by considering all methylation counts
at non-CpG positions as false positives (resulting from non-conversion
of presumably unmethylated cytosines). This leads to an estimate
of $p_{+}=0.2,$ 0.51, 0.41, 0.44, 0.39 in the experiments Liver,
H1, H2, H3 and H4, respectively. Bear in mind that in single cell
bisulfite experiments, the limited DNA amount requires a particularly
mild bisulfite treatment, which increases the false positive rate
relative to standard bisulfite sequencing procedures.
In the literature, false negative rates were not described, an estimate
of $p_{-}=0.01$ is reported\textsuperscript{9}.
We chose a conservative value of $p_{-}=0.2$, which takes into account
potential errors originating from mapping artifacts or sequencing
errors. Due to failed or inappropriate conversion, the number $k$
of counts indicating methylation differs from the actual number $j$
of counts originating from methylated CpGs. Given the true number
of methylation counts $j$, the the observed methylation counts $k$
are the sum of the number $m$ of correctly identified methylations
and the number $k-m$ if incorrectly identified methylations (false
positives). Hence, the probability distribution of $k$ is a convolution
of two binomial distributions, 
\begin{eqnarray}
P(k\mid j,n;\, p_{+},p_{-}) & = & \sum_{m=0}^{k}P(m\mid j,1-p_{-})\cdot P(k-m\mid n-j,p_{+})\nonumber \\
 & = & \sum_{m=0}^{k}\underset{=:C_{m,j}^{1}}{\underbrace{Bin(m;j,1-p_{-})}}\cdot\underset{=:C_{n-j,k-m}^{2}}{\underbrace{Bin(k-m;n-j,p_{+})}}\label{eq:number of meth calls}
\end{eqnarray}
In (\ref{eq:number of meth calls}), we use the convention that $Bin(m;j,p)=0$
whenever $m>j$. Thus, given $n$ reads, $k$ methylation counts,
the likelihood function for $r$ is a mixture of Bionomial distributions,
\begin{eqnarray}
P(k\mid n,r;\, p_{+},p_{-}) & = & \sum_{j=0}^{n}P(k,j\mid n,r,p_{+},p_{-})\nonumber \\
 & = & \sum_{j=0}^{n}P(k\mid j,n,r,p_{+},p_{-})\cdot P(j\mid n,r,p_{+},p_{-})\nonumber \\
 & = & \sum_{j=0}^{n}P(k\mid j,n,p_{+},p_{-})\cdot P(j\mid n,r)\nonumber \\
 & \stackrel{(\ref{eq:basismodell},\ref{eq:number of meth calls})}{=} & \sum_{j=0}^{n}\sum_{m=0}^{k}C_{m,j}^{1}C_{n-j,k-m}^{2}\cdot Bin(j;n,r)\label{eq:Read model}
\end{eqnarray}


In our Bayesian approach, we furthermore need to specify a prior for
$r$ to calculate the posterior distribution of $r$. Recall the beta
distribution(s), which is a 2-parameter family of continuous probability
distributions defined the unit interval $[0,1]$,
\[
Beta(r;\alpha,\beta)\propto r^{\alpha-1}(1-r)^{\beta-1}\ ,\ \mbox{for }\alpha,\beta>0,\, r\in(0,1)\ ,
\]
We assume that a fraction of $\lambda_{m}$ postions are essentially
methylated, i.e., and that their rate $r$ follows a $Beta(r;\alpha=r_{m}\cdot w_{m},\beta_{m}=(1-r_{m})\cdot w_{m})$
distribution, having an expectation value for $r$ of $\frac{\alpha_{m}}{\alpha_{m}+\beta_{m}}=r_{m}$.
Here, we set $r_{m}=0.7$. The additional parameter
$w_{m}$ weights the strength of the prior relative to the strength
of the likelihood. Since (the confidence into/ the knowledge about)
our prior distribution of methylation rates is rather weak, we want
our procedure to be strongly data-driven, therefore we choose a low
$w_{m}$, $w_{m}=0.5$. A fraction of $\lambda_{u}=1-\lambda_{m}$
is essentially unmethylated, and their rate is assumed to follow a
$Beta(r;\alpha_{u}=r_{u}\cdot w_{u},\beta_{u}=(1-r_{u})\cdot w_{u})$
distribution, having an expectation value for $r$ of $\frac{\alpha_{u}}{\alpha_{u}+\beta_{u}}=r_{u}$,
where we set $r_{u}=0.2$ and $w_{u}=0.5$. Thus,
the prior distribution $\pi(r)$ is a 2-Beta mixture distribution,

\begin{eqnarray}
\pi(r;\alpha_{m},\beta_{m},\alpha_{u},\beta_{u},\lambda_{m}) & = & \sum_{s\in\{m,u\}}\lambda_{s}Beta(r;\alpha_{s},\beta_{s})\label{eq: Beta mixture prior}
\end{eqnarray}


The pragmatic reason for choosing a Beta mixture as a prior distribution
is the fact that the Beta distribution is the conjugate prior of the
Binomial distribution\textsuperscript{15}, such that for some normalizing
constant $D_{j,n}^{\alpha,\beta}$, 

\begin{eqnarray}
Bin(j;n,r)\cdot Beta(r;\alpha,\beta) & = & D_{j,n}^{\alpha,\beta}\cdot Beta(r;;j+\alpha,n-j+\beta)\label{eq:conjugate prior for binomial}
\end{eqnarray}


By virtue of Equation (\ref{eq:conjugate prior for binomial}), we
can write down the posterior distribution of $r$ analytically (Equation
\ref{eq:methylation posterior}). This has the advantage that we can
answer all questions on the posterior distribution of $r$ efficiently
and up to an arbitrary precision. Efficiency is an issue, because
we need to calculate posterior distributions for all regions, which
can easily amount to millions.

\begin{eqnarray}
 &  & P(r\mid k,n;\, p_{+},p_{-};\alpha_{m},\beta_{m},\alpha_{u},\beta_{u},\lambda_{m})\nonumber \\
 &  & \qquad\ =\ N^{-1}\cdot P(k\mid n,r;\, p_{+},p_{-})\cdot\pi(r;\alpha_{m},\beta_{m},\alpha_{u},\beta_{u})\\
 &  & \qquad\stackrel{(\ref{eq:Read model},\ref{eq: Beta mixture prior})}{=}\ N^{-1}\cdot\sum_{j=0}^{n}\sum_{m=0}^{k}C_{m,j}^{1}C_{n-j,k-m}^{2}\cdot Bin(j;n,r)\cdot\sum_{s\in\{m,u\}}\lambda_{s}Beta(r;\alpha_{s},\beta_{s})\nonumber \\
 &  & \qquad\ \stackrel{(\ref{eq:conjugate prior for binomial})}{=}\ N^{-1}\cdot\sum_{j=0}^{n}\sum_{m=0}^{k}C_{m,j}^{1}C_{n-j,k-m}^{2}\cdot\left(\sum_{s\in\{m,u\}}\lambda_{s}D_{j,n}^{\alpha_{s},\beta_{s}}Beta(r;j+\alpha_{s},n-j+\beta_{s})\right)\label{eq:methylation posterior}
\end{eqnarray}
In the above equation, $N$ is a normalization constant,
\begin{equation}
N=\sum_{j=0}^{n}\sum_{m=0}^{k}C_{m,j}^{1}C_{n-j,k-m}^{2}\cdot\sum_{s}\lambda_{s}D_{j,n}^{\alpha_{s},\beta_{s}}\label{eq:Normalization constant}
\end{equation}


The ingredients for the construction of the posterior distribution
are visualized in Figure (\ref{fig:Density Figure}). 

\begin{figure}
\centering{}\includegraphics[scale=0.55]{\string"MethDensities\string".png}\caption{\label{fig:Density Figure}Plot of the likelihood functions for three
different observations $(k,n)$ (left), the Beta mixture prior distribution
(middle) and the corresponding three posteriors (right). The number
$n$ of counts is set to $8$, of which $k=2$ (blue), $k=5$ (grey)
and $k=7$ (red) are methylation counts. The unknown parameter $p_{+}$was
determined empirically from the false non-CpG methylation, which reflects
the incomplete conversion rate, as follows: L1: 0.2, H1: 0.51, H2:
0.41, H3: 0.44, H4: 0.39. $p_{-}$was set to 0.2 as a very robust
choice. The beta mixture prior was set as described in the text.}
\end{figure}


\subsection{Methylation statistics derived from read counts}

For each region under consideration, we obtain an individual posterior
distribution $P(r\mid k,n,p_{+},p_{-})$. With this posterior at hand,
it is an easy task to calculate the expected methylation rate $\hat{r}$
in the corresponding region,
\begin{equation}
\hat{r}=\int_{0}^{1}r\cdot P(r\mid k,n,p_{+},p_{-})\, dr\label{eq: Expectation value}
\end{equation}
It is customary to provide a Bayesian measure of uncertainty of this
estimate, a so-called credible interval. A credible interval is an
interval which contains the estimate ($\hat{r})$ and in which a prescribed
probability mass of the posterior is located. One can construct a
90\% credible interval $[m,M]$ as the shortest interval containing
$\hat{r}$ such that $P(r\in[n,M]\mid k,n,p_{+},p_{-})=0.9$. Moreover,
we call a region \emph{highly methylated} if
\begin{equation}
P(r>0.7\mid k,n,p_{+},p_{-})>c\label{eq: highly methylated}
\end{equation}
for some stringency level $c$ which we set to $0.75$ here. The false
negative methylation calling rates were set to $p_{-}=0.1$ for all
samples, and the false positive calling rates were determined by $p_{+}=1-\mbox{CH methylation rate}$
for each sample separately. A region is said to show \emph{increased
methylation} if

\[
P(r>0.5\mid k,n,p_{+},p_{-})>c
\]


Analogously, a region is called \emph{sparsely methylated} if 
\[
P(r<0.3\mid k,n,p_{+},p_{-})>c
\]


and a region with \emph{decreased methylation} satisfies
\begin{equation}
P(r<0.5\mid k,n,p_{+},p_{-})>c\label{eq: decreased methylation}
\end{equation}


By definition, any highly methylated region has increased methylation,
and every sparsely methylated region shows decreased methylation.
For $c>0.5$, high and sparse methylation calls are mutually exclusive.
Regions that are neither highly nor sparsely methylated are called
\emph{ambiguous}. 

\begin{figure}
\centering{}\includegraphics[scale=0.45]{\string"MethCalling\string".png}\caption{\label{fig:Methylation figure}Illustration of the the results of
our statistical modeling applied to regions of size $d=1000$ in the
Liver sample. In each plot, $n$ (on the x-axis) denotes the total
number of counts mapping to that region, of which $k$ (on the y-axis)
are counts indicating methylation. Left: Using the Liver-specific
estimates of the false positive rate $p_{+}=0.2$ and the false negative
rate $p_{-}=0.1$ and the methylation prior in Equation (\ref{eq: Beta mixture prior}),
we obtain for each admissible pair $(k,n)$ a methylation rate estimates
$\hat{r}$ from Equation (\ref{eq: Expectation value}). Colors correspond
to methylation rate, ranging from deep blue (zero methylation) to
deep red (full methylation). Middle: The red respectively blue area
defines the pairs $(k,n)$ which satisfy our criteria for high respectively
sparse methylation. Right: The red respectively blue area defines
the pairs $(k,n)$ which satisfy our criteria for increased respectively
decreased methylation. Note that strict methylation calls are only
made when at least $n=5$ counts were observed.}
\end{figure}


The time-critical step is the calculation of the region-specific posterior
distribution $P(r\mid k,n,p_{+},p_{-})$, and the quantities related
to it (Equations \ref{eq: Expectation value}-\ref{eq: decreased methylation}).
Since $k$ and $n$ vary for each region, and the number of regions
is large, we save a lot of time by pre-calculating all required quantities
for a set of values $n=1,...,45$, $k=0,...,n$. The statistics for,
on average, 80\% of all regions can then be looked
up and do not need to be re-computed. The running times for $d=250,500,...$
on the mouse genome took less than a minute plus $t=25$ min for the
pre-computing of each of the five samples, which did not vary substantially
with region size.
%%% END MODEL DESCRIPTION FROM SUPPLEMENTS

The model generates posterior probabilities for each pair of methylated and total count values given the conversion rate and sequencing error. These posteriors are saved in the working directory for each sample in an R object under the file name~\linebreak
<samplename>.convMat.<pminus>.<pplus>.RData, while the corrected counts along with corrected methylation rate estimates and other model data are saved as data.frames under the file name <samplename>.results.<pminus>.<pplus>.RData, where pminus and pplus are replaced by the respective parameters. For each sample, the model computes a data.frame consisting of the columns: 'chr', 'start', 'stop', 'meth', 'unmeth', 'methstate' and 'methest', signifying chromosome name, start of region by chromosomal position, end of region by chromosomal position, methylated- and unmethylated cytosine counts at that chromosomal position, methylation state (which is $1$ for regions called methylated, $-1$ for unmethylated regions, and $0$ for regions that are neither) and methylation estimate (i.e., the model-estimated methylation level for that region). The object contains further columns which are used for internal procedures and irrelevant for the package user.

<<doModel>>=
generate_results(params)
@
%

\section{Epimutation calling} \label{sec:epimutations}
Finally, epimutations are called by comparing two results objects from the previous step.

The function 'epimutation\_calls' compares one sample to a reference sample. A methylating epimutation is called for each genomic region common to both samples when it was called as unmethylated in the reference and as methylated in the other sample by the model. Accordingly, the region is called as demethylating epimutation when called as methylated in the reference and as unmethylated in the other sample. Epimutation rates are then computed as the frequency of epimutation events in relation to the total regions shared between each sample and the reference. The resulting epimutations are saved as data.frames, for each sample one object for methylating epimutations and one for demethylating epimutations, in the working directory as<sampleName>.methEpicalls.<regionSize>.<minCounts>.p+=<pplus>.p-=<pminus>.RData and <sampleName>.demethEpicalls.<regionSize>.<minCounts>.p+=<pplus>.p-=<pminus>.RData, respectively.
Each data.frame object contains a list of genomic regions covered by the given samples, consisting of the columns: 'chr', 'pos', 'endpos', 'meth', 'unmeth', 'methstate', signifying chromosome name, start of region by chromosomal position, end of region by chromosomal position, methylated- and unmethylated cytosine counts at that chromosomal position and the methylation state, which is $1$ for methylated positions and $-1$ for unmethylated positions.

<<doEpimutations>>=
epiCalls <- epimutation_calls(params)
head(epiCalls$methSites$singlecell,3)
head(epiCalls$demethSites$singlecell,3)
@
%

\bibliographystyle{unsrt}
\bibliography{BEAT}

\end{document}