%\VignetteIndexEntry{Analysing single-cell BS-Seq data with the "BEAT" package} %\VignettePackage{BEAT} % To compile this document % library('cacheSweave');rm(list=ls());Sweave('BEAT.Rnw',driver=cacheSweaveDriver());system("pdflatex BEAT") %\begin{figure}[ht] %\centering %\includegraphics[width=.6\textwidth]{FILENAME} %\caption{CAPTION} %\label{FIGURELABEL} %\end{figure} \documentclass[12pt,oneside]{article} \usepackage[T1]{fontenc} \usepackage[latin9]{inputenc} \usepackage{geometry} \geometry{verbose,tmargin=2cm,bmargin=2cm,lmargin=2cm,rmargin=2cm} \setlength{\parindent}{0bp} \usepackage{color} \usepackage{amsmath} \usepackage{graphicx} \usepackage{esint} \begin{document} \newcommand{\thetitle}{Quantification of DNA Methylation and Epimutations from Bisulfite Sequencing Data -- the BEAT package} \title{\textsf{\textbf{\thetitle}}} \author{Kemal Akman$^1$, Achim Tresch\\[1em]Max-Planck-Institute for Plant Breeding Research\\ Cologne, Germany\\ \texttt{$^1$akman@mpipz.mpg.de}} %%BEAT version 0.99.1 \SweaveOpts{keep.source=TRUE,eps=FALSE,pdf=TRUE,png=TRUE,include=FALSE,width=5.4,height=3.7,resolution=180} \maketitle \begin{abstract} The following example illustrates a standard use case of the BEAT package, which is used for processing and modeling methylated and unmethylated counts of CG positions from Bisulfite sequencing (BS-Seq) for determination of epimutation rates, i.e. the rate of change in DNA methylation status at CG sites between a reference multi-cell sample and single-cell samples. The input for the package BEAT consists of count data in the form of counts for unmethylated and methylated cytosines per genomic position, which are then grouped into genomic regions of sufficient coverage in order to allow for low-coverage samples to be analyzed. Methylation rates of each region are then modeled using a binomial mixture model in order to adjust for experimental bias, which arises mainly out of incomplete bisulfite conversion (resulting in unmethylated CG positions falsely appearing to be methylated) and sequencing errors (resulting in random errors in methylation status, including methylated CG positions falsely appearing as unmethylated). This vignette explains the use of the package. For a detailed exposition of the statistical method, please see our upcoming paper. \end{abstract} \tableofcontents \section{Preamble} \label{sec:preamble} \subsection{Biological Background} \label{subsec:biology} Bisulfite sequencing (BS-Seq) is a sequence-based method to accurately detect DNA methylation at specific loci, which involves treating DNA with sodium bisulfite~\cite{frommer_genomic_1992}. Bisulfite conversion of unmethylated cytosines into uracil is a relatively simple chemical reaction which has now become a standard in DNA methylation profiling. The key advantage of this method is accuracy, as the degree of methylation at each cytosine can be quantified with great precision~\cite{fraga_dna_2002}. Most currently available techniques for determination of DNA methylation can only measure average values obtained from cell populations as a whole, requiring at least 30 ng of DNA, i.e. the equivalent of about 6000 cells~\cite{gu_genome-scale_2010}. Since these population approaches cannot account for cell- and position-specific differences in DNA methylation, which are termed epimutations~\cite{jeggo_azacytidine-induced_1986}, they are unsuitable for the characterization of cellular heterogeneity. However, this heterogeneity plays an important role in differentiation and development, stem cell re-programming, in diseases such as cancer and aging~\cite{egger_epigenetics_2004}. Developing single-cell approaches for measuring DNA methylation is not only be vital to fully understand individual cell-specific changes and complexity of tissue micro-environments, but also for the analysis of clinical samples, such as circulating tumor cells or needle biopsies, when the amount of material is often limited. With the BEAT package, we present a pipeline for the computational analysis part of such single-cell BS-Seq experiments. \subsection{Package Functionality} \label{subsec:package} This vignette will illustrate the complete work flow of a DNA methylation analysis, which includes model-based estimation of regional methylation rates from observed BS-Seq methylation counts, calling of methylation status, and the comparison of samples to determine regional differences in methylation status, i.e., epimutation calling. These three steps are implemented by the three public functions of BEAT, namely \textit{positions\_to\_regions}, \textit{generate\_results} and \textit{epimutation\_calls}, respectively. The functionality of BEAT will be demonstrated using sample data consisting of the first chromosome of mouse liver hepatocytes, one reference sample ('reference.positions.csv') and one single-cell sample to compare against ('sample.positions.csv') The main challenges in DNA methylation analysis are bias correction, modeling of sampling variance (shot noise due to low count numbers) and assessing the significance of changes. \section{Input format} \label{sec:input} The package BEAT expects as input one csv per sample with counts for unmethylated and methylated cytosines per genomic position. Such data can be obtained from the output of BS-Seq mapping tools such as bismark (see http://www.bioinformatics.babraham.ac.uk/projects/bismark/) followed by simple script-based data processing. BS-Seq data for methylated- and unmethylated counts per genomic position needs to be formatted into a data.frame object with the columns: 'chr', 'pos', 'meth', 'unmeth' (signifying chromosome name, chromosomal position, as well as methylated- and unmethylated cytosine counts at that chromosomal position). % <>= library(BEAT) localpath <- system.file('extdata', package = 'BEAT') positions <- read.csv(file.path(localpath, "sample.positions.csv")) head(positions) @ % \section{Configuration and parameters} \label{sec:conf} BEAT can process multiple samples at a time. For each sample specified under \textit{'sampNames'}, the package BEAT reads this data frame from a csv from the working directory, which is specified by the parameter \textit{'localpath'}. The file should be called .positions.csv and should contain the aforementioned csv under the name 'positions'. Result files written by BEAT will have the names .results.RData. A distinction is made between two samples whose methylation status is compared against that of the reference. %% NOTE: The following single-cell distinction is only relevant for conversion of counts to binary counts; this has been done for the samples provided. We need to distinguish between samples that have been generated from a single cell or from a mixture of cells, since these cases require different statistical modeling. For each sample, the assignment of reference or non-reference status to samples is done via the vector \textit{'is.reference'}, which is indexed by the \textit{'sampNames'} vector and contains one TRUE entry for the reference and one to many FALSE entries for samples to be compared against the reference. Our example data set contains two files, one named 'reference' for a reference consisting of a mixture of cells, and one named 'sample' for a single-cell sample to be compared against the reference. % <>= sampNames <- c('reference','sample') sampNames is.reference <- c(TRUE, FALSE) @ % For the modeling part of BEAT, Bisulfite conversion rates have to be specified per sample using the vector \textit{'convrates'}, which is indexed by \textit{'sampNames'}. These conversion rates need to be specified manually by the user. Practically, for mammalian somatic cell samples, these rates can be estimated for each sample by looking at the non-CpG methylation rate per sample and using the inverse value as estimated CpG methylation rate, because non-CpG methylation in these types of cells is expected to be near zero. For example, if non-CpG methylation in a sequenced sample was measured to be 0.1 then BS-conversion rate for that sample would be set to 0.9 when near-zero non-CpG methylation can be expected, as is the case in somatic mammalian cells. % <>= pplus <- c(0.2, 0.5) convrates <- 1 - pplus @ % The aforementioned values are then set in a parameter object, which is used throughout the further work flow in this package. It provides the following additional options: \begin{itemize} \item 1-\textit{convrates} represents the fraction of unmethylated counts that are falsely called as methylated due to incomplete BS-conversion. This parameter is also referred to as 'pplus' in the statistical model. \item \textit{'pminus'} represents the fraction of methylated counts that are falsely called as unmethylated. \item \textit{'regionSize'} is the size of regions into which genomic positions are grouped. In single cells, the number of positions with sufficient coverage for reliable statistical predictions may be very small. Therefore, our pipeline applies epimutation calling to regions instead of single positions. Single positions are pooled into regions of appropriate size, i.e., regions containing sufficiently many CpG positions that have a positive read count number in both the reference and the single cell sample (.shared. CpG positions). Our method has then sufficient power to reliably detect epimutation events affecting these regions. For a detailed description of pooling positions into regions, see section \ref{sec:regions}. \item After pooling CG positions to regions, there may still be regions with low count numbers that do not allow for reliable downstream analysis. Regions with less than \textit{'minCounts'} counts will be removed from further processing, potentially saving significant processing time in further analysis steps. \end{itemize} The following parameters are of minor importance, for a first analysis, they can be left at their pre-set values. \begin{itemize} \item \textit{'verbose'} is an option that prints additional information during computation when set to TRUE. \textit{'computeRegions'} is an option that will recompute the regions from given positional input if set to TRUE; otherwise, it will depend on existing region files already present in the \textit{'localpath'} directory (file names ending in regions.RData). \item \textit{'computeMatrices'} is an option that will recompute the model data from given regions if set to TRUE; otherwise, it will depend on existing model output files already present in the \textit{'localpath'} directory (file names ending in convMat.RData and results.RData. \item \textit{'writeEpicallMatrix'} is an option that can be set to TRUE to generate epimutation calling output in the form of matrices (one row per genomic position, output format is CSV). \end{itemize} % <>= params <- makeParams(localpath, sampNames, convrates, is.reference, pminus = 0.2, regionSize = 10000, minCounts = 5) params @ % \section{Pooling of CG counts into regions} \label{sec:regions} The supplied methylation counts for individual CG positions are grouped into regions by BEAT for modeling according to the specified \textit{'regionSize'} and \textit{'minCount'} parameters. The function positions\_to\_regions takes as input the samples as csv objects located under .positions.csv in the working directory, as discussed above. The function saves a list of data.frames of resulting counts per genomic regions in the current working directory under .regions...RData, with samplename, regionSize and minCounts replaced by the sample name and the respective parameters given. Each data.frame object contains a list of genomic regions covered by the given samples, consisting of the columns: 'chr', 'start', 'stop', 'meth', 'unmeth' (signifying chromosome name, start of region by chromosomal position, end of region by chromosomal position, as well as methylated- and unmethylated cytosine counts at that chromosomal position). <>= positions_to_regions(params) @ % \section{Statistical modeling} \label{sec:model} Most of details about the statistical model explained in this section is also explained in our upcoming paper on the computational analysis for single-cell BS-Seq analysis. We repeat the statistical details in this section for the sake of completeness of the description of the computational and statistical details of our pipeline. Taking as input the region-based counts computed in the step above, the underlying model of the BEAT package then computes corrected counts, taking into account especially incomplete conversion rates (taken from the \textit{'convrates'} parameter) and the estimated sequencing error (specified as the \textit{'pminus'} parameter). The model is described briefly as follows. %%% BEGIN MODEL DESCRIPTION FROM SUPPLEMENTS We have derived a Bayesian statistical model that gives detailed information about the methylation rate in a region of multiple CpG positions which is described below. Apart from estimating the methylation rate, it provides measures of confidence for this estimate, it can test regions for high or low methylation. On the basis of of these tests it is later possible to give a precise definition of a regional epimutation event. For multi-cell samples, we assume that all counts at a single CpG position were obtained from pairwise different bisulfite converted DNA template strands and represent independent observations. This certainly holds in good approximation, because the number of available DNA template strands typically supersedes the read coverage at this position by far. For single cell samples, we encounter the opposite situation: There are at most two template DNA strands available, and for many CpG positions this number is reduced further through DNA degradation. Multiple reads covering one CpG position are therefore highly dependent. We combine multiple counts at one position to one single (non-)methylation call. For different CpG position, these calls are then independent observations. First, fix one region, i.e. some set of CpG positions. The number of counts at a given position is the number of reads mapping to that position. Let $n$ denote the total number of counts at all CpG positions in the given region, and let $k$ (respectively $n-k$) of them indicate methylation (respectively non-methylation). Let $r$ be the (unknown) methylation rate at the given position. Then, assuming independence of the single counts as mentioned above, the actual number $j$ of counts originating from methylated CpGs in this region follows a binomial distribution, \begin{equation} P(j\mid n,r)=Bin(j;n,r)\label{eq:basismodell} \end{equation} Let the false positive rate $p_{+}$ be the global rate of false methylation counts, which is identical to the non-conversion rate of non-methylated cytosines. Conversely, let the false negative rate $p_{-}$ be the global rate of false non-methylation counts, which is identical to the inappropriate conversion rate of methylated cytosines. One can find an upper bound for $p_{+}$ by considering all methylation counts at non-CpG positions as false positives (resulting from non-conversion of presumably unmethylated cytosines). This leads to an estimate of $p_{+}=0.2,$ 0.51, 0.41, 0.44, 0.39 in the experiments Liver, H1, H2, H3 and H4, respectively. Bear in mind that in single cell bisulfite experiments, the limited DNA amount requires a particularly mild bisulfite treatment, which increases the false positive rate relative to standard bisulfite sequencing procedures. In the literature, false negative rates were not described, an estimate of $p_{-}=0.01$ is reported\textsuperscript{9}. We chose a conservative value of $p_{-}=0.2$, which takes into account potential errors originating from mapping artifacts or sequencing errors. Due to failed or inappropriate conversion, the number $k$ of counts indicating methylation differs from the actual number $j$ of counts originating from methylated CpGs. Given the true number of methylation counts $j$, the the observed methylation counts $k$ are the sum of the number $m$ of correctly identified methylations and the number $k-m$ if incorrectly identified methylations (false positives). Hence, the probability distribution of $k$ is a convolution of two binomial distributions, \begin{eqnarray} P(k\mid j,n;\, p_{+},p_{-}) & = & \sum_{m=0}^{k}P(m\mid j,1-p_{-})\cdot P(k-m\mid n-j,p_{+})\nonumber \\ & = & \sum_{m=0}^{k}\underset{=:C_{m,j}^{1}}{\underbrace{Bin(m;j,1-p_{-})}}\cdot\underset{=:C_{n-j,k-m}^{2}}{\underbrace{Bin(k-m;n-j,p_{+})}}\label{eq:number of meth calls} \end{eqnarray} In (\ref{eq:number of meth calls}), we use the convention that $Bin(m;j,p)=0$ whenever $m>j$. Thus, given $n$ reads, $k$ methylation counts, the likelihood function for $r$ is a mixture of Bionomial distributions, \begin{eqnarray} P(k\mid n,r;\, p_{+},p_{-}) & = & \sum_{j=0}^{n}P(k,j\mid n,r,p_{+},p_{-})\nonumber \\ & = & \sum_{j=0}^{n}P(k\mid j,n,r,p_{+},p_{-})\cdot P(j\mid n,r,p_{+},p_{-})\nonumber \\ & = & \sum_{j=0}^{n}P(k\mid j,n,p_{+},p_{-})\cdot P(j\mid n,r)\nonumber \\ & \stackrel{(\ref{eq:basismodell},\ref{eq:number of meth calls})}{=} & \sum_{j=0}^{n}\sum_{m=0}^{k}C_{m,j}^{1}C_{n-j,k-m}^{2}\cdot Bin(j;n,r)\label{eq:Read model} \end{eqnarray} In our Bayesian approach, we furthermore need to specify a prior for $r$ to calculate the posterior distribution of $r$. Recall the beta distribution(s), which is a 2-parameter family of continuous probability distributions defined the unit interval $[0,1]$, \[ Beta(r;\alpha,\beta)\propto r^{\alpha-1}(1-r)^{\beta-1}\ ,\ \mbox{for }\alpha,\beta>0,\, r\in(0,1)\ , \] We assume that a fraction of $\lambda_{m}$ postions are essentially methylated, i.e., and that their rate $r$ follows a $Beta(r;\alpha=r_{m}\cdot w_{m},\beta_{m}=(1-r_{m})\cdot w_{m})$ distribution, having an expectation value for $r$ of $\frac{\alpha_{m}}{\alpha_{m}+\beta_{m}}=r_{m}$. Here, we set $r_{m}=0.7$. The additional parameter $w_{m}$ weights the strength of the prior relative to the strength of the likelihood. Since (the confidence into/ the knowledge about) our prior distribution of methylation rates is rather weak, we want our procedure to be strongly data-driven, therefore we choose a low $w_{m}$, $w_{m}=0.5$. A fraction of $\lambda_{u}=1-\lambda_{m}$ is essentially unmethylated, and their rate is assumed to follow a $Beta(r;\alpha_{u}=r_{u}\cdot w_{u},\beta_{u}=(1-r_{u})\cdot w_{u})$ distribution, having an expectation value for $r$ of $\frac{\alpha_{u}}{\alpha_{u}+\beta_{u}}=r_{u}$, where we set $r_{u}=0.2$ and $w_{u}=0.5$. Thus, the prior distribution $\pi(r)$ is a 2-Beta mixture distribution, \begin{eqnarray} \pi(r;\alpha_{m},\beta_{m},\alpha_{u},\beta_{u},\lambda_{m}) & = & \sum_{s\in\{m,u\}}\lambda_{s}Beta(r;\alpha_{s},\beta_{s})\label{eq: Beta mixture prior} \end{eqnarray} The pragmatic reason for choosing a Beta mixture as a prior distribution is the fact that the Beta distribution is the conjugate prior of the Binomial distribution\textsuperscript{15}, such that for some normalizing constant $D_{j,n}^{\alpha,\beta}$, \begin{eqnarray} Bin(j;n,r)\cdot Beta(r;\alpha,\beta) & = & D_{j,n}^{\alpha,\beta}\cdot Beta(r;;j+\alpha,n-j+\beta)\label{eq:conjugate prior for binomial} \end{eqnarray} By virtue of Equation (\ref{eq:conjugate prior for binomial}), we can write down the posterior distribution of $r$ analytically (Equation \ref{eq:methylation posterior}). This has the advantage that we can answer all questions on the posterior distribution of $r$ efficiently and up to an arbitrary precision. Efficiency is an issue, because we need to calculate posterior distributions for all regions, which can easily amount to millions. \begin{eqnarray} & & P(r\mid k,n;\, p_{+},p_{-};\alpha_{m},\beta_{m},\alpha_{u},\beta_{u},\lambda_{m})\nonumber \\ & & \qquad\ =\ N^{-1}\cdot P(k\mid n,r;\, p_{+},p_{-})\cdot\pi(r;\alpha_{m},\beta_{m},\alpha_{u},\beta_{u})\\ & & \qquad\stackrel{(\ref{eq:Read model},\ref{eq: Beta mixture prior})}{=}\ N^{-1}\cdot\sum_{j=0}^{n}\sum_{m=0}^{k}C_{m,j}^{1}C_{n-j,k-m}^{2}\cdot Bin(j;n,r)\cdot\sum_{s\in\{m,u\}}\lambda_{s}Beta(r;\alpha_{s},\beta_{s})\nonumber \\ & & \qquad\ \stackrel{(\ref{eq:conjugate prior for binomial})}{=}\ N^{-1}\cdot\sum_{j=0}^{n}\sum_{m=0}^{k}C_{m,j}^{1}C_{n-j,k-m}^{2}\cdot\left(\sum_{s\in\{m,u\}}\lambda_{s}D_{j,n}^{\alpha_{s},\beta_{s}}Beta(r;j+\alpha_{s},n-j+\beta_{s})\right)\label{eq:methylation posterior} \end{eqnarray} In the above equation, $N$ is a normalization constant, \begin{equation} N=\sum_{j=0}^{n}\sum_{m=0}^{k}C_{m,j}^{1}C_{n-j,k-m}^{2}\cdot\sum_{s}\lambda_{s}D_{j,n}^{\alpha_{s},\beta_{s}}\label{eq:Normalization constant} \end{equation} The ingredients for the construction of the posterior distribution are visualized in Figure (\ref{fig:Density Figure}). \begin{figure} \centering{}\includegraphics[scale=0.55]{\string"MethDensities\string".png}\caption{\label{fig:Density Figure}Plot of the likelihood functions for three different observations $(k,n)$ (left), the Beta mixture prior distribution (middle) and the corresponding three posteriors (right). The number $n$ of counts is set to $8$, of which $k=2$ (blue), $k=5$ (grey) and $k=7$ (red) are methylation counts. The unknown parameter $p_{+}$was determined empirically from the false non-CpG methylation, which reflects the incomplete conversion rate, as follows: L1: 0.2, H1: 0.51, H2: 0.41, H3: 0.44, H4: 0.39. $p_{-}$was set to 0.2 as a very robust choice. The beta mixture prior was set as described in the text.} \end{figure} \subsection{Methylation statistics derived from read counts} For each region under consideration, we obtain an individual posterior distribution $P(r\mid k,n,p_{+},p_{-})$. With this posterior at hand, it is an easy task to calculate the expected methylation rate $\hat{r}$ in the corresponding region, \begin{equation} \hat{r}=\int_{0}^{1}r\cdot P(r\mid k,n,p_{+},p_{-})\, dr\label{eq: Expectation value} \end{equation} It is customary to provide a Bayesian measure of uncertainty of this estimate, a so-called credible interval. A credible interval is an interval which contains the estimate ($\hat{r})$ and in which a prescribed probability mass of the posterior is located. One can construct a 90\% credible interval $[m,M]$ as the shortest interval containing $\hat{r}$ such that $P(r\in[n,M]\mid k,n,p_{+},p_{-})=0.9$. Moreover, we call a region \emph{highly methylated} if \begin{equation} P(r>0.7\mid k,n,p_{+},p_{-})>c\label{eq: highly methylated} \end{equation} for some stringency level $c$ which we set to $0.75$ here. The false negative methylation calling rates were set to $p_{-}=0.1$ for all samples, and the false positive calling rates were determined by $p_{+}=1-\mbox{CH methylation rate}$ for each sample separately. A region is said to show \emph{increased methylation} if \[ P(r>0.5\mid k,n,p_{+},p_{-})>c \] Analogously, a region is called \emph{sparsely methylated} if \[ P(r<0.3\mid k,n,p_{+},p_{-})>c \] and a region with \emph{decreased methylation} satisfies \begin{equation} P(r<0.5\mid k,n,p_{+},p_{-})>c\label{eq: decreased methylation} \end{equation} By definition, any highly methylated region has increased methylation, and every sparsely methylated region shows decreased methylation. For $c>0.5$, high and sparse methylation calls are mutually exclusive. Regions that are neither highly nor sparsely methylated are called \emph{ambiguous}. \begin{figure} \centering{}\includegraphics[scale=0.45]{\string"MethCalling\string".png}\caption{\label{fig:Methylation figure}Illustration of the the results of our statistical modeling applied to regions of size $d=1000$ in the Liver sample. In each plot, $n$ (on the x-axis) denotes the total number of counts mapping to that region, of which $k$ (on the y-axis) are counts indicating methylation. Left: Using the Liver-specific estimates of the false positive rate $p_{+}=0.2$ and the false negative rate $p_{-}=0.1$ and the methylation prior in Equation (\ref{eq: Beta mixture prior}), we obtain for each admissible pair $(k,n)$ a methylation rate estimates $\hat{r}$ from Equation (\ref{eq: Expectation value}). Colors correspond to methylation rate, ranging from deep blue (zero methylation) to deep red (full methylation). Middle: The red respectively blue area defines the pairs $(k,n)$ which satisfy our criteria for high respectively sparse methylation. Right: The red respectively blue area defines the pairs $(k,n)$ which satisfy our criteria for increased respectively decreased methylation. Note that strict methylation calls are only made when at least $n=5$ counts were observed.} \end{figure} The time-critical step is the calculation of the region-specific posterior distribution $P(r\mid k,n,p_{+},p_{-})$, and the quantities related to it (Equations \ref{eq: Expectation value}-\ref{eq: decreased methylation}). Since $k$ and $n$ vary for each region, and the number of regions is large, we save a lot of time by pre-calculating all required quantities for a set of values $n=1,...,45$, $k=0,...,n$. The statistics for, on average, 80\% of all regions can then be looked up and do not need to be re-computed. The running times for $d=250,500,...$ on the mouse genome took less than a minute plus $t=25$ min for the pre-computing of each of the five samples, which did not vary substantially with region size. %%% END MODEL DESCRIPTION FROM SUPPLEMENTS The model generates posterior probabilities for each pair of methylated and total count values given the conversion rate and sequencing error. These posteriors are saved in the working directory for each sample in an R object under the file name~\linebreak .convMat...RData, while the corrected counts along with corrected methylation rate estimates and other model data are saved as data.frames under the file name .results...RData, where pminus and pplus are replaced by the respective parameters. For each sample, the model computes a data.frame consisting of the columns: 'chr', 'start', 'stop', 'meth', 'unmeth', 'methstate' and 'methest', signifying chromosome name, start of region by chromosomal position, end of region by chromosomal position, methylated- and unmethylated cytosine counts at that chromosomal position, methylation state (which is $1$ for regions called methylated, $-1$ for unmethylated regions, and $0$ for regions that are neither) and methylation estimate (i.e., the model-estimated methylation level for that region). The object contains further columns which are used for internal procedures and irrelevant for the package user. <>= generate_results(params) @ % \section{Epimutation calling} \label{sec:epimutations} Finally, epimutations are called by comparing two results objects from the previous step. The function 'epimutation\_calls' compares one sample to a reference sample. A methylating epimutation is called for each genomic region common to both samples when it was called as unmethylated in the reference and as methylated in the other sample by the model. Accordingly, the region is called as demethylating epimutation when called as methylated in the reference and as unmethylated in the other sample. Epimutation rates are then computed as the frequency of epimutation events in relation to the total regions shared between each sample and the reference. The resulting epimutations are saved as data.frames, for each sample one object for methylating epimutations and one for demethylating epimutations, in the working directory as.methEpicalls...p+=.p-=.RData and .demethEpicalls...p+=.p-=.RData, respectively. Each data.frame object contains a list of genomic regions covered by the given samples, consisting of the columns: 'chr', 'pos', 'endpos', 'meth', 'unmeth', 'methstate', signifying chromosome name, start of region by chromosomal position, end of region by chromosomal position, methylated- and unmethylated cytosine counts at that chromosomal position and the methylation state, which is $1$ for methylated positions and $-1$ for unmethylated positions. <>= epiCalls <- epimutation_calls(params) head(epiCalls$methSites$singlecell,3) head(epiCalls$demethSites$singlecell,3) @ % \bibliographystyle{unsrt} \bibliography{BEAT} \end{document}