%\VignetteIndexEntry{easyRNASeq}
%\VignetteKeywords{RNA-Seq}
%\VignettePackage{easyRNASeq}
\documentclass[11pt,a4paper]{article}

% Bioc style
<<style, eval=TRUE, echo=FALSE, results=tex>>=
BiocStyle::latex()
@
% for eanbling comments
\usepackage{verbatim}

% new commands
\newcommand{\pref}[1]{\ref{#1} (page \pageref{#1})}
\newcommand{\eg}{\textit{e.g.} }
\newcommand{\ie}{\textit{i.e.} }
\newcommand{\vs}{\textit{vs.} }
\newcommand{\cf}{\textit{c.f.} }
\newcommand{\etc}{\textit{etc.} }

% Sweave options
\SweaveOpts{keep.source=TRUE}

% title
\title{easyRNASeq, an overview}
\author{Nicolas Delhomme}

% Doc
\begin{document}

\maketitle

\tableofcontents

\begin{comment}
For a faster vignette generation, many R section that do not need to
be run are set to eval=FALSE. For testing purpose, change them all to
TRUE.
\end{comment}

\newpage

\section{Before you start}

If you are completely new to the R/Bioconductor
\cite{Gentleman:2004p2013} packages dealing with Next-Generation
Sequencing, you might want to start by doing the tutorial available in
the \Robject{RnaSeqTutorial} vignette in the data package
\Biocexptpkg{RnaSqTutorial}. The same holds true if you're interested in
understanding the details behind the content of the present
vignette. If neither nor, just go ahead.

\subparagraph{}
Moreover, if you find \Biocpkg{easyRNASeq} useful and apply it in the
frame of your research for a publication, please cite it:

\subparagraph{}
easyRNASeq: a bioconductor package for processing RNA-Seq data\\
Nicolas Delhomme; Ismael Padioleau; Eileen E. Furlong; Lars M. Steinmetz\\
Bioinformatics 2012; doi: 10.1093/bioinformatics/bts477\\

\section{Changes to the vignette}

The last changes are reported in red.

\paragraph{1.8.4}
Converted an existing paragraph on annotation warning(s) into a section to add
additional details - see section \pref{p:W}. Extended the first use case on
synthetic transcript generation - see section \pref{synth:trx} - to introduce
how to check for annotation overlaps. Added another section detailing input file
characteristics and their impact on counting - see section \pref{p:Ciri}.
Finally, started a FAQ section - see section \pref{FAQ}.

\paragraph{1.8.2}
Added a use case that describes how to create a set of "gene models"
in a manner more efficient than the current implementation within the
package. As "gene model" can actually have different meanings depending
on the context, I switched to call these \textbf{synthetic transcripts}; \ie what
we create really are synthetic transcripts combining all exons of their
corresponding gene. See the use case in section \pref{synth:trx} for how this
is performed. This example was first introduced at the EMBO 2013
Practical Course: Analysis of High Throughput Sequencing Data, see
\url{http://www.ebi.ac.uk/training/course/embo-practical-course-analysis-high-throughput-sequencing-data-1}
for the lectures (there are links in the time-table) and here for
my materials: \url{https://umu.box.com/s/f37nl4u1p0fj4m58dxxm}.

\paragraph{1.3.14}
A new section: \ref{s:ytc} was added describing foreseen changes to
the package, see page \pageref{s:ytc}. Additional changes to the
vignette use-case where made thanks to Richard Friedman's suggestions,
see section \ref{p:pasohs}, page \pageref{p:pasohs}.

\paragraph{1.3.10 - 1.3.13}
Some vignette discrepancies have been corrected. Thanks to Richard
Friedman for spotting them, see paragraph \ref{p:W}, page \pageref{p:W}.
Some additional information about the 'gtf' and 'gff' format has been
added after fixing the bug reported by Tomasz Kulinski. See
section \ref{ss:usage}, page \pageref{ss:usage}.

\paragraph{1.3.9}
Not much change to the vignette, but for the fact that the
\Biocpkg{easyRNASeq} application note got published in
\textbf{Bioinformatics} \cite{Delhomme:2012p4678}.

\paragraph{1.3.5 - 1.3.8}
The vignette has been updated to report function call changes to the
\Rfunction{easyRNASeq} function:
\begin{itemize}
  \item the format argument defaults to bam
  \item the chr.sizes can be derived from the bam file header
  \item the addition of the knownOrganisms function
  \item support for variable length reads
  \item read files can be processed in parallel
\end{itemize}
Sections affected are:
\pref{ss:usage}, \pref{p:args}, \pref{p:input}, \pref{ss:diffInput},
\pref{perf},\pref{p:pasohs},\pref{p:dwai}

\paragraph{1.3.1 - 1.3.4}
The vignette has been updated to:
\begin{itemize}
  \item first: enhance its readability.
  \item second: introduce a new \textbf{Use Case} section
\end{itemize}
Sections affected are:
\pref{s:eRS}, \pref{ss:usage}, \pref{p:annot}, \pref{ss:diffAnn},
\pref{ss:demultiplex}, \pref{s:useCases}

\newpage

\section{Introduction}

The present document describes the use of the \Biocpkg{easyRNASeq} package to ease
the processing of RNA-seq data in \R{}/\Bioconductor{}. RNA-seq
\cite{Mortazavi:2008p740} was introduced as a new method to perform
Gene Expression Analysis, using the advantages of the high
throughput of \emph{Next-Generation Sequencing} (NGS)
machines.
\newline

The goal of this vignette is, first, to show an example of the
\Rfunction{easyRNASeq} function that generates a count table for the
selected genic features of interest, \emph{i.e.} exons, transcripts,
gene models, \emph{etc.} using the read data and the genic and genomic
annotations. This overall process is described in figure
\ref{fig:fig1}, page \pageref{fig:fig1}. Shortly, the genomic and
genic annotation will be retrieved from the selected/preferred source
and converted into a \Rclass{RangedData} object. In parallel, the
sequenced reads' information (\textit{e.g.} chromosome, position,
strand, \textit{etc.}) will be retrieved from the alignment file and,
similarly, converted to a \Rclass{GRanges} object. Then,
reads contained in the \Robject{reads} \Rclass{GRanges} object are summarized
per genic annotation contained in the \Robject{annotation} \Rclass{RangedData}
object, which give raise to a count table that, finally, can be corrected or
normalized using additional R packages.
\newline

Second, the \Rfunction{easyRNASeq} function arguments are detailed
and additional information given on how to correct data for visualization,
using \emph{RPKM (Reads Per Kilobase of feature per Million
reads in the library)} counts. Normalization can be performed by using
appropriate statistical models implemented \eg in
the \Biocpkg{DESeq} or \Biocpkg{edgeR} packages.
\bioccomment{For the reasons why I call RPKM a correction and not a normalization,
see \cite{Soneson:2013p5778}.}
\newline

In a third part, more advanced pre-processing is described,
\ie \emph{de-multiplexing}. Note that this step has become a standard procedure
in most sequencing facilities and hence is bound to be deprecated as a function
in the \Biocpkg{easyRNASeq} package in a few release cycles.
\newline

Finally, additional post-processing such as exporting the count table as bed
and wig formatted file, to be visualized into the UCSC genome browser
or a stand alone genome browser are described in the vignette of the
companion data package \Biocexptpkg{RnaSqTutorial}. Note that the information in
this tutorial is getting quickly outdated given the pace of evolution of the NGS
field in general. The basic information it contains are perfectly correct
though. A newer version of that tutorial is in preparation.

\begin{figure}[htpb]
\centerline{\includegraphics[width=0.80\textwidth]{easyRNASeq.png}}
\caption{easyRNAseq Overview. The R packages used for the
  different steps are emphasized in bold face.}
\label{fig:fig1}
\end{figure}
\clearpage

\section{easyRNASeq}
\label{s:eRS}

Throughout this vignette, the data present in the \Biocexptpkg{RnaSeqTutorial}
data package will be used to demonstrate functionalities. Refer to its
documentation for more details; but shortly, it contains different \emph{Drosophila melanogaster}
RNA-Seq dataset saved in different format as well as the related
annotation retrieved from  FlyBase \cite{Tweedie:2009p2014}.
\newline

The main function of the \Biocpkg{easyRNASeq} package is
\Rfunction{easyRNASeq}. That should be the only processing method you
need to know about when using the package. It is essentially a wrapper
around other functions performing the different tasks described in Figure~\ref{fig:fig1}.
These functions are also exported, if you feel you need
to have a look at them. The \Biocpkg{easyRNASeq} package defines an
\Rclass{RNAseq} class for holding the necessary information used
during the processing. The \Rfunction{easyRNASeq} function instantiates
an object of that class and although most arguments to this
function are optional, it is advised to set them - especially
the \emph{readLength} and the \emph{organismName} - to document your analysis.
The \Rcode{organismName} argument is
actually mandatory if you want to retrieve genomic annotation using
\Biocpkg{biomaRt}; in which case, you need to provide the name as
specified in the corresponding \Biocpkg{BSgenome} package,
\emph{i.e.} ``Dmelanogaster'' for the
\Biocpkg{BSgenome.Dmelanogaster.UCSC.dm3} package.
\newline

\warning{As there are numerous RNA-Seq protocol,
each with its own specificities, the \Rfunction{easyRNASeq} function has a large
number of arguments to cope with these. This number of arguments, with
only a few defaults values might be intimidating but we are in the
process of consolidating them. Meanwhile, please refer to the documentation and the examples in
this vignette and the \Biocexptpkg{RnaSqTutorial} one to get familiar
with them. Do not hesitate to ask on the Bioconductor mailing list
if the purpose of some argument were still unclear.}

\subsection{usage}
\label{ss:usage}

\paragraph{Initial example}
In the following, the \Rfunction{easyRNASeq} function is called with
an almost minimal set of parameters - a few optional ones too\dots The
sequencing reads' excerpts used here were obtained from $4$ fruit-fly samples
subjected to a 36 cycles sequencing on an Illumina GAIIx
machine. The annotation were retrieved from FlyBase and
converted into an ``Rdata'' object.
\newline
Here, we're interested in retrieving the
count table of these $4$ samples for every exon in the genome.

<<Load the library and create the object>>=
library(easyRNASeq)
library(RnaSeqTutorial)

count.table <- easyRNASeq(filesDirectory=system.file(
                            "extdata",
                            package="RnaSeqTutorial"),
                          pattern="[A,C,T,G]{6}\\.bam$",
                          readLength=30L,
                          organism="Dmelanogaster",
                          annotationMethod="rda",
                          annotationFile=system.file(
                            "data",
                            "gAnnot.rda",
                            package="RnaSeqTutorial"),
                          count="exons"
                          )
@

<<Show the results>>=
head(count.table)
dim(count.table)
@

In this usage of the \Rfunction{easyRNASeq} function, we're reading bam files
(the default \Rcode{format} argument), present
in the directory passed as first argument to the function (the
\Rcode{filesDirectory} argument) using
a regular expression to match the file names (\Rcode{pattern} argument). The
\Rcode{system.file} function retrieves the actual path on your file system
where the \Biocexptpkg{RnaSeqTutorial} package was installed and in that
particular case of its ``extdata'' sub-directory. The pattern used to
retrieve the files is similar to that of the \Rpackage{base} functions
\Rfunction{dir} and \Rfunction{grep}. Here we look up files, which name
contains 6 times one of the ``A'', ``C'', ``G'' or ``T'' letter,
followed by the ``.bam'' extension. \bioccomment{Note the use of the \$ sign at
the end of the regular expression to signify that the filename to be matched has
to end after '.bam'. This way we prevent fetching the BAI (the BAM index) files.}
Additionally, the number of sequencing cycles
(\Rcode{readLength}) and the \Rcode{organism} that sample
originates from, both optional here are detailed. Then the genomic and genic
information is provided: the exons' annotation
(\Rcode{annotationFile} argument), which were already processed into an ``RData''
file (the \Rcode{annotationMethod} argument). Finally, we precise that we want
the sequenced reads to be summarized by ``exons''; \emph{i.e.} we want
to get a read count value per exon and per sample.
\newline
And that is all. In that one command, you got the exon count table for
your 4 samples!

\subsection{warnings}
\label{p:W}
\warning{
As you could see when running the previous example,  warnings were
emitted and quite rightly so.
\begin{enumerate}
\item about the annotation:
The annotation used here is redundant at two levels. First, some exons overlap;
these are alternative exons from splice variants. Second, the
annotation contains every splice variants, hence some exons are
duplicated. Therefore counting by exons or transcripts using this
annotation results in counting several reads many
times and for that reason a warning is emitted, as multiple countingmay be a
very significant source of error. The ideal
solution is to provide an annotation object that contains no
overlapping features, see the use case in section \pref{synth:trx} for an
approach to reduce/control these annotation related issues. Consult as well the
section \pref{p:Ciri} for related counting issues.
\item about potential naming issue in the input file:
It is not infrequent that sequencing facilities use
variable naming conventions for chromosomes in comparison to those generally
globally accepted for the corresponding organism. Therefore, is it not
infrequent that the globally available annotation uses different chromosome
names than those reported in the alignment files.
\end{enumerate}
These warnings are there to inform you about these potential issues, as they
affect the accuracy of associating reads with genic or genomic features. As
often misconceived, read count is not an easy task; see section \pref{p:Ciri} and
\cite{Liao:2013p5892} for more details.
}

\subsection{arguments}
\label{p:args}
Going back to the \Rfunction{easyRNASeq} function, more arguments can
be set to fine tune it, depending on the input at hand, the desired
output, etc. Most will be detailed in the following, but not all, so check
out \Rcode{?easyRNASeq} for more. One such argument is: \Rcode{filenames},
that offers the possibility to give actual filenames rather than using
a pattern matching approach to find the alignment files. For example, if you're
only interested in two samples out of the four previously used ones, you could do:

<< Create the object with only 2 files, eval=FALSE>>=
count.table <- easyRNASeq(system.file(
                                      "extdata",
                                      package="RnaSeqTutorial"),
                          organism="Dmelanogaster",
                          readLength=30L,
                          annotationMethod="rda",
                          annotationFile=system.file(
                            "data",
                            "gAnnot.rda",
                            package="RnaSeqTutorial"),
                          count="exons",
                          filenames=c("ACACTG.bam", "ACTAGC.bam"))
@

Several arguments  influence the
\Rfunction{easyRNASeq} function. These can be grouped in $5$ categories
concerning:
\begin{itemize}
  \item the input
  \item the annotations
  \item the counts summarization
  \item the output
  \item the optional correction/normalization
\end{itemize}

\paragraph{Input}
\label{p:input}
Using the \Rcode{format} argument, reads can be loaded from BAM
files (default) or any format supported by the \Biocpkg{ShortRead}
\cite{Morgan:2009p739} package. The SAM/BAM format
\cite{Li:2009p1662} is becoming a \emph{de-facto} standard for
storing NGS aligned data. Gapped alignment can be read from BAM files,
set the \Rcode{gapped} argument to ``TRUE'' for that.
\bioccomment{Note that as BAM is now the standard for storing read alignments,
\Biocpkg{easyRNASeq} support of other formats - through the use of
\Biocpkg{ShortRead} will be abandoned in a few release cycles.}

\subsection{Critical information regarding the input}
\label{p:Ciri}
\warning{
As previously mentioned, associating reads with features is not straightforward,
and numerous factors have to be taken into account that affect the counting
accuracy. First, it is critical to know what alignments where reported by
the aligner used. Does the alignment file contains multi-mapping reads; \ie are
reads aligning to numerous loci reported?
Second, it is also critical to understand the gene's annotation in detail. Is it
from a well-established model organism? Are there many splicing isoforms reported
per gene? How do these differ; do they have different exons, different UTRs, \etc?
Understanding pitfalls and caveats from both will help you decide what settings
are required within the frame of your experiment. In most common RNA-Seq
experiment, the goal is to look for differential expression between 2 conditions
within one organism. The most stringent approach to that is:
\begin{enumerate}
\item to use only uniquely mapping reads and discarding others - this is the
approach taken by e.g. HTSeq; see \url{http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html}.
\item to ensure that reads can be assigned to a single feature only - \ie when a
read cannot be confidently assign to a feature; \eg if a read locus overlap
two genes either neighbouring or on opposite strand - discard them. This can be
achieved by 2 means, either by assigning reads to features first and then
filtering the results or filtering the annotation prior to assigning the reads
to prevent any such feature overlap.
\end{enumerate}
As of now, it is important to know that \Biocpkg{easyRNASeq} just reports
warnings if these conditions are unmet. Obtaining the correct set of alignments
for your analysis is specific to your aligner of choice and hence cannot be
described here. Obtaining a set of synthetic transcripts that would fully remove
or diminish the occurence of ambiguous read assignment is detailed in the first
use case of this vignette, see section \pref{synth:trx}.
}

\paragraph{Annotation}
\label{p:annot}

The \Rfunction{easyRNASeq} function currently accepts the following \Rcode{annotationMethod}s:
\begin{itemize}
  \item ``biomaRt'' use the \Biocpkg{biomaRt}
    \cite{Durinck:2005p1990} package to retrieve the annotation
  \item ``env'' use a \Rclass{RangedData} or \Rclass{GRangesList} class object present in the
    environment - see page \pageref{env} for a use case.
  \item ``gff'' reads in a gff version 3 file
  \item ``gtf'' reads in a gtf file as provided by Ensembl \cite{Flicek:2011p2042}
  \item ``rda'' load an RData object. The object needs to be named
    \Robject{gAnnot} and of class \Rclass{RangedData} or
    \Rclass{GRangesList}. See the use case page \pageref{grnglist} to prepare
    such an object using the \Biocpkg{GenomicFeatures}. ``.rda'' is
    the filename extension of serialized R data object (\textit{i.e.} object
    written to disk using the \Rfunction{save} function). The filename
    extension for these files used to be and still is
    sometimes``.RData''. The actual filename extension used is of no
    importance to \Rfunction{easyRNASeq} - see page \pageref{rda} for a use case.
\end{itemize}
The structure of the \Rclass{RangedData} object that has to be
provided when using the ``rda'' or ``env'' \Rcode{annotationMethods}
is described in the section \pref{ss:diffAnn}. For using a \Rclass{GRangesList}, see the
use case page \pageref{grnglist}.
More details on the gff3 and gtf format can be obtained from:
\begin{itemize}
\item gff3: http://www.sequenceontology.org/gff3.shtml
\item gtf: http://mblab.wustl.edu/GTF22.html
\end{itemize}
Unlike for the gtf, the gff3 ninth column holding the attributes has
no mandatory tags. Some of them are however ``predefined''. As
exemplified further down (see \ref{p:gI}, \pageref{p:gI}), we expect
these ``predefined'' attributes to be used and we rely on the ``ID'',
``Parent'' and ``Name'' to identify the relation between a gene and
its exons and transcripts.

\paragraph{Summarization}

The reads can be counted/summarized by:
\begin{itemize}
  \item exons
  \item features (any features such as introns, enhancers, \emph{etc.})
  \item transcripts
  \item geneModels: they consists of the set of disjoint ``synthetic''
    exons that fully overlap every known exons and UTRs of a
    gene; \emph{I.e.} every alternate exon will be split into
    separate ``synthetic'' exons and these exons will be grouped into
    a set that correspond to a single gene.
\end{itemize}

When you use the ``geneModels'' summarization, no reads will be
counted twice at the exception of those mapping overlapping exons of
different genes, whether on the same strand or not. The
\Rfunction{easyRNASeq} function does not deal with these situation as it does
not yet support stranded reads, hence as previously a warning will be emitted
if this occurs. The implementation of the geneModel generation is actually
outdated - it is rather slow - and is being re-implemented as described in
section \pref{synth:trx}. Note that the geneModel summarization as it is now,
is deprecated.

\paragraph{Output}

The results can be exported in five different formats:
\begin{itemize}
\item count table (the default, a n (features) x  m (samples) \Robject{matrix}).
\item a \Biocpkg{DESeq} \cite{Anders:2010p1659}
  \Rclass{CountDataSet} class object. Useful to perform further
  analyses using the \Biocpkg{DESeq} package.
\item an \Biocpkg{edgeR} \cite{Robinson:2010p775} \Rclass{DGEList}
  class object. Useful to perform further analyses using the
  \Biocpkg{edgeR} package.
\item an \Rclass{RNAseq} class object. Useful for performing
  additional pre-processing without re-loading the reads and
  annotations.
\item as an instance of the \Rclass{RangedSummarizedExperiment} class.
  See section \pref{s:ytc} for additional details.
\end{itemize}

\paragraph{Correction}

The results can optionally be ``corrected'' as \emph{Reads per Kilobase
  of feature per Million reads in the library} (RPKM,
\cite{Mortazavi:2008p740}), useful for visualization but not for differential
expression, see \cite{}.
\bioccomment{Note that the \Rcode{vst} and \Rcode{voom} approaches from the
\Biocpkg{DESeq2} and \Biocpkg{limma} packages, respectively offer more powerful
approaches for correcting the data prior to visualization. These will be
included in future version of the \Biocpkg{easyRNASeq} package.}

\paragraph{Normalization}
The normalization step is also
available when the results are exported in the \Biocpkg{DESeq} or
\Biocpkg{edgeR} formats. This generates plots to assess
the validity of the normalization and help decide how to proceed with
any downstream analysis.
\newpage

\subsection{different input}
\label{ss:diffInput}

The default input format of the \Rfunction{easyRNASeq} function is the
BAM format, as exemplified before. It does at the moment support older
formats through the \Biocpkg{ShortRead} package, but as BAM is nowadays standard,
this support will eventually be phased out in a few release cycles.

In the first example, we read a BAM file containing gapped
alignments. In that particular case, TopHat \cite{Trapnell:2009p156}
was used to look for splicing events. Note the use of the
\Rcode{gapped} boolean argument.

<<read a gapped  bam file, eval=FALSE>>=
count.table <- easyRNASeq(system.file(
                                      "extdata",
                                      package="RnaSeqTutorial"),
                          organism="Dmelanogaster",
                          readLength=36L,
                          annotationMethod="rda",
                          annotationFile=system.file(
                            "data",
                            "gAnnot.rda",
                            package="RnaSeqTutorial"),
                          gapped=TRUE,
                          count="exons",
                          filenames="gapped.bam")
@

The next example show how to load an Illumina 'export' file, which was
Illumina's default output before the CASAVA v1.8 release in spring
2011. The \Rcode{format} argument is set to ``aln'' and an additional
argument: \Rcode{type} is provided that precise the input file
format. See the \Biocpkg{ShortRead} \Rfunction{readAligned} help page
for all available formats. A final argument: the \Rcode{filter}
argument, essential to filter usable reads is also given. This argument is
crucial as export files contain every read, including those
not aligning and those not passing the original quality filter
performed during the Illumina Base Calling, a.k.a. the
chastity filter. For more details, see the section ``Filtering the
Data'' in the \Biocexptpkg{RnaSqTutorial} vignette. Finally, also note the
\Rcode{chr.sizes} argument; it precises the sizes of the chromosomes using a
named integer vector.Indeed, these cannot be determined programatically
unlike for the BAM format where the chromosome sizes can be extracted from the
header.

<<read an export, eval=FALSE>>=
count.table <- easyRNASeq(system.file(
                                      "extdata",
                                      package="RnaSeqTutorial"),
                          organism="Dmelanogaster",
                          chr.sizes=seqlengths(Dmelanogaster),
                          readLength=36L,
                          annotationMethod="rda",
                          annotationFile=system.file(
                            "data",
                            "gAnnot.rda",
                            package="RnaSeqTutorial"),
                          format="aln",
                          type="SolexaExport",
                          count="exons",
                          pattern="subset_export",
                          filter=compose(
                            chromosomeFilter(regex="chr"),
                            chastityFilter()))
@

\newpage

\subsection{different annotation}
\label{ss:diffAnn}

This sections describes how to use different annotation
sources to retrieve genic information. It also describes
the expected format of these annotations.

\bioccomment{Important Note: there are many derivative of the GFF3 and GTF
file formats, some of which not strictly applying the convention
these files should implement. Using incorrectly formatted file is bound to fail,
but most of the time a relevant error message will be reported. GTF
files as produced by \eg Ensembl and GFF3 files as produced by \eg Flybase are
correct implementation of these formats. If you encounter problems with an
annotation file, please try first to locate and correct them. For the GFF
format, the example file present in the \Biocexptpkg{RnaSqTutorial}
package can be used as a reference. If you have installed this package the
following gives you that gff file path on your machine:}

<<GFF3, eval=FALSE>>=
system.file(
            "extdata",
            "Dmel-mRNA-exon-r5.52.gff3",
            package="RnaSeqTutorial")
@

\bioccomment{If you do not manage to identify the issue, please contact the
Bioconductor mailing list for help.}

\paragraph{biomaRt}
\label{p:biomaRt}
The following is an example of how to use biomaRt to retrieve annotation.
<<load annotation from biomaRt, eval=FALSE>>=
rnaSeq <- easyRNASeq(system.file(
                                      "extdata",
                                      package="RnaSeqTutorial"),
                          organism="Dmelanogaster",
                          readLength=36L,
                          annotationMethod="biomaRt",
                          gapped=TRUE,
                          count="exons",
                          filenames="gapped.bam",
                          outputFormat="RNAseq")
@

Once the annotation is fetched, it can be retrieved provided that the
\Rcode{outputFormat} argument was set to \Rclass{RNAseq}. This implies
a complete S4 object of class \Rclass{RNAseq} is returned and not only
the count table (see \pref{subsec:diffOut} for details on the different output).
The \Rfunction{genomicAnnotation} accessor allows then to access the annotation
stored in the \Rclass{RNAseq} object.

<<access the annotation, eval=FALSE>>=
gAnnot <- genomicAnnotation(rnaSeq)
@

The \Robject{gAnnot} - of class \Rclass{RangedData} - can then be saved
in a \file{.rda} file for a faster re-processing using the \Rfunction{easyRNASeq}
function with the arguments \Rcode{annotationMethod} and \Rcode{annotationFile}
set to "rda" and to the \file{rda} filepath, respectively. \bioccomment{Note, however
that it is necessary to save that object under the name ``gAnnot''.} Finally, it
would be important to pre-process that new annotation in the manner described
in section \pref{synth:trx} for reason described in section \pref{p:Ciri}

\paragraph{genomeIntervals}
\label{p:gI}
The next example shows how to perform the same provided that you have a gtf or
gff3 formatted file at hand. While the gtf format is more constrained, gff
formatted file will likely have varying gff attributes (its ninth last column).
This column contains ``key=value'' pairs separated by semi-colons. We depend on
specific keys to be present and these are \textit{ID}, \textit{Parent}, and
\textit{Name}, all strongly suggested by the gff3 format specification.
When parsing a gff3 file, we're only considering feature of type 'exon' (the
gff3 third column). Then, we expect the \textit{ID} key to have the
format: \textit{geneID:exonNumber}; both the exon and gene annotation
being derived from it. The \textit{Parent} key should contain the transcript ID.
If one exon is part of several transcripts, these can be concatenated
using commas without space. Finally, the \textit{Name} key should
contain the gene name, but its presence is actually facultative for the
processing. As we are using the \Rfunction{readGff3} function from the
\Biocpkg{genomeIntervals} package, it is as well essential that the key=value
pairs are separated by semi-colons \textbf{whitout} space. Have a look at the
\file{annot.gff} file used in the next code chunck for an example of a gff3 file
formatted as the \Rfunction{easyRNASeq} expects it.

<<load the annotation from a gff, eval=FALSE>>=
count.table <- easyRNASeq(system.file(
                                      "extdata",
                                      package="RnaSeqTutorial"),
                          organism="Dmelanogaster",
                          readLength=36L,
                          annotationMethod="gff",
                          annotationFile=system.file(
                            "extdata",
                            "Dmel-mRNA-exon-r5.52.gff3",
                            package="RnaSeqTutorial"),
                          gapped=TRUE,
                          count="exons",
                          filenames="gapped.bam")
@

\paragraph{RangedData}

Internally, the \Rfunction{easyRNAseq} function converts the
retrieved annotation - when executed with the
\Rcode{annotationMethod} argument ``gff'', ``gtf'', or ``biomaRt'' - into an
object of the \Rclass{RangedData} class. Using the \Rcode{annotationMethod}
argument ``env'' or ``rda'' requires that you provide such an object. The
following example shows the expected structure of that object.

<<RangedData structure>>=
library(IRanges)
gAnnot <- RangedData(
                     IRanges(
                             start=c(10,30,100),
                             end=c(21,53,123)),
                          space=c("chr01","chr01","chr02"),
                          strand=c("+","+","-"),
                          transcript=c("trA1","trA2","trB"),
                          gene=c("gA","gA","gB"),
                          exon=c("e1","e2","e3"),
                          universe = "Hs19"
                          )
gAnnot
@

In that object, we describe the genomic location of three exons
(chromosome, position and strand) as well as their transcript and gene
membership. Nothing more is required. Note that the names'
spelling is \textbf{essential} for the \Rfunction{easyRNAseq} function
to work properly. Indeed, the \Rcode{count} argument will be used to
lookup the proper values in the annotation, \eg for summarizing by
``exons'', an ``exon'' column in the \Rclass{RangedData} object is
mandatory;  a ``feature'' one for counting ``features'', a
``transcript'' one for ``transcripts'', \etc.
For a peek at the \Rclass{RangedData} class \Robject{gAnnot}
object used in the \Biocpkg{RnaSeqTutorial} tutorial, do:

<<look at the gAnnot, eval=FALSE>>=
library(RnaSeqTutorial)
data(gAnnot)
gAnnot
@

\paragraph{GRangesList}

The \Biocpkg{easyRNASeq} package support as well annotation provided
as a \Rclass{GRangesList} class object (from the
\Biocpkg{GenomicRanges} package). Converting a \Rclass{RangedData}
class object into a \Rclass{GRangesList} class object is pretty straightforward.

<<GRangesList coercion>>=
grngs <- as(gAnnot,"GRanges")
grngsList<-split(grngs,seqnames(grngs))
grngsList
@

The advantage of doing so is that the \Rclass{RangedData} class might
get deprecated in the future. The next advantage is that the
\Rclass{GRangesList} class is strand-aware.

\paragraph{Remember}

\bioccomment{Generating the proper annotation is probably the most important step
in processing your RNA-Seq sample. Mind the warnings produced by the
\Rfunction{easyRNASeq} function, they might be annoying, but there are
there for a good reason: to help.}

\newpage

\subsection{different output}
\label{subsec:diffOut}

\paragraph{matrix}

This is the default output - a matrix of m features by n samples - that has been
exemplified throughout the beginning of this vignette.

\paragraph{RNAseq}

This has also been introduced previously, see section \pref{p:biomaRt}, but
shortly - as shown below - the only change that need doing is to use the
\Rclass{RNAseq} value of the \Rfunction{outputFormat} argument.

<<easyRNASeq output>>=
rnaSeq <- easyRNASeq(system.file(
                                 "extdata",
                                 package="RnaSeqTutorial"),
                     organism="Dmelanogaster",
                     readLength=30L,
                     annotationMethod="rda",
                     annotationFile=system.file(
                       "data",
                       "gAnnot.rda",
                       package="RnaSeqTutorial"),
                     count="exons",
                     pattern="[A,C,T,G]{6}\\.bam$",
                     outputFormat="RNAseq")
@

\paragraph{CountDataSet, DGEList, RangedSummarizedExperiment}

More details on how to generate \Rclass{CountDataSet}
(\Biocpkg{DESeq}) or \Rclass{DGEList} (\Biocpkg{edgeR}) are given in
section \pref{subsec:optNorm}.
\newline
The integration of the \Rclass{SummarizedExperiment} \Rcode{outputFormat} in
the \Biocpkg{easyRNASeq} package is recent, but at term it will be the new
\Rcode{outputFormat} default. See section \pref{s:ytc} for details.

\newpage

\subsection{different summarization}
\label{subsec:diffNorm}

As introduced previously, there are four possible counting method accesible
through an \Rfunction{easyRNASeq} call: by
``exons'', ``features'', ``transcripts'' and ``genes''. However, to
perform multiple summarization on the same data, \emph{e.g.} by
``exons'' and ``transcripts'', calling the \Rfunction{easyRNASeq}
twice is inefficient as the read files and the annotation twice would have to
be processed twice. To avoid this, specific functions are available, which are
applicable to an object of class \Rclass{RNAseq}, see details about that class in
section \pref{subsec:diffOut}. These functions are:
\begin{enumerate}
  \item \Rfunction{exonCounts}
  \item \Rfunction{featureCounts}
  \item \Rfunction{transcriptCounts}
  \item \Rfunction{geneCounts} This function takes an additional parameter
  defining the kind of gene
    summarization, either \Rcode{bestExons} or
    \Rcode{geneModels}. The \Rcode{bestExons}
    summarization will return per gene, the highest exon count. The
    \Rcode{geneModels} summarization first calculate a gene
    model and then return the read count for it. Rather than using the genes
    / geneModels paradigm, consider using synthetic transcripts as described in
    section \pref{synth:trx}.
  \item \Rfunction{readCounts} a function to access the different count
    tables stored in the \Rclass{RNAseq} object.
\end{enumerate}

For example, to summarize the reads per transcripts, apply the
\Rfunction{transcriptCounts} function on the previously generated
\Robject{rnaSeq} object - see section \pref{subsec:diffOut} - and use
the \Rfunction{readCounts} function to access the transcript count table.

<<Transcript counts>>=
rnaSeq <- transcriptCounts(rnaSeq)
head(readCounts(rnaSeq,'transcripts'))
@

Summarizing by transcript is frequently inherently biased as exons are often
part of different splicing variants. For that reason, it might be better to
summarize the data per genes - or synthetic transcripts- as described in section
\pref{synth:trx}.
\newpage

\subsection{optional correction or normalization}
\label{subsec:optNorm}

In this section, different count \emph{correction} and \emph{normalization} are
described. First the RPKM, a common sense correction that take the genic
feature size (exon, transcript, gene model,...) and the total
number of reads sequenced in the library into account. Its use is not
recommended for doing any kind of differential expression analysis -
see \cite{Soneson:2013p5778} - but
is adequate for visualization; \eg to create tracks to be displayed
in a genome browser.
Actually for state of the art visualization, one can use a \Rfunction{voom}
(from the \Biocpkg{limma} package) or \Rfunction{vst} (from the \Biocpkg{DESeq2}
package) approach.

Here is an example on how to get a RPKM count table:
<<RPKM normalization, eval=FALSE>>=
count.table <- easyRNASeq(system.file(
                                      "extdata",
                                      package="RnaSeqTutorial"),
                          organism="Dmelanogaster",
                          readLength=30L,
                          annotationMethod="rda",
                          annotationFile=system.file(
                            "data",
                            "gAnnot.rda",
                            package="RnaSeqTutorial"),
                          count="exons",
                          filenames=c("ACACTG.bam", "ACTAGC.bam",
                            "ATGGCT.bam", "TTGCGA.bam"),
                          normalize=TRUE
                          )
@

In addition, \Rfunction{easyRNASeq} generated raw count tables can be
post-corrected using the \Rfunction{RPKM} method:

<<RPKM counts>>=
feature.size = width(genomicAnnotation(rnaSeq))
names(feature.size) = genomicAnnotation(rnaSeq)$exon
lib.size=c(
  "ACACTG.bam"=56643,
  "ACTAGC.bam"=42698,
  "ATGGCT.bam"=55414,
  "TTGCGA.bam"=60740)
head(RPKM(readCounts(rnaSeq,summarization="exons")$exons,
          NULL,
          lib.size=lib.size,
          feature.size=feature.size))
@

The same can be directly done on \Rclass{RNAseq} class objects.

<<RPKM on RNAseq>>=
head(RPKM(rnaSeq,from="transcripts"))
@

For normalizing the data, numerous solution are available in \Bioconductor{},
see \cite{Soneson:2013p5778} for details. Only the \Biocpkg{edgeR} and
\Biocpkg{DESeq} packages are currently imported by \Biocpkg{easyRNASeq}.

\paragraph{DESeq}
\label{par:DESeq}
To be able to normalize the data using \Biocpkg{DESeq} (or
\Biocpkg{edgeR} for that matter), one needs to define the
samples' ``conditions'', \textit{e.g.} ``disease'' vs. ``healthy''. To
ensure traceability, the \Biocpkg{easyRNASeq} package require the
conditions to be a \emph{named vector} where the names are the raw
data filenames.

<<define the conditions>>=
conditions=c("A","A","B","B")
names(conditions) <- c("ACACTG.bam", "ACTAGC.bam",
                       "ATGGCT.bam", "TTGCGA.bam")
@

Once the \Robject{conditions} has been defined, pass it to the \Rcode{conditions}
argument of the \Rfunction{easyRNASeq} function and set the \Rcode{outputFormat}
argument to DESeq. An additional \Rcode{normalize} argument is available to
trigger or not the count normalization implemented in the
\Biocpkg{DESeq} package. In either case, a \Robject{CountDataSet} object is
returned. Additional argument to the \Biocpkg{DESeq} function can be provided
through the \dots arguments of the \Rfunction{easyRNASeq} function. \eg in the
following example, we precise the kind of fit (\textit{local}) that needs to be
performed by the \Rfunction{estimateDispersion} function of the \Biocpkg{DESeq}
package. Since we have few data, the default fit fails and reports an error
telling us to change the \Rcode{fitType} argument.

<<DESeq normalization, eval=FALSE>>=
countDataSet <- easyRNASeq(system.file(
                                      "extdata",
                                      package="RnaSeqTutorial"),
                          organism="Dmelanogaster",
                          readLength=30L,
                          annotationMethod="rda",
                          annotationFile=system.file(
                            "data",
                            "gAnnot.rda",
                            package="RnaSeqTutorial"),
                          count="exons",
                          filenames=c("ACACTG.bam", "ACTAGC.bam",
                            "ATGGCT.bam", "TTGCGA.bam"),
                          normalize=TRUE,
                          outputFormat="DESeq",
                          conditions=conditions,
                          fitType="local"
                          )
@

\bioccomment{Note that setting the \Rcode{normalize} argument to \Rcode{TRUE}
generates diagnostics plots as detailed in the \Biocpkg{DESeq} vignette.
The plot produced in the present examples are irrelevant as the dataset is
too small. You can turn the plotting off by setting the \Rcode{plot}
argument to \Rcode{FALSE}. Have a look at the \Biocpkg{DESeq}
vignette (essential if you plan to use \Biocpkg{DESeq} anyway!) for the plot
explanation.}

\paragraph{edgeR}

The same procedure can be done using the \Biocpkg{edgeR} package functionalities.

<<edgeR normalization, eval=FALSE>>=
dgeList <- easyRNASeq(system.file(
                                  "extdata",
                                  package="RnaSeqTutorial"),
                      organism="Dmelanogaster",
                      readLength=30L,
                      annotationMethod="rda",
                      annotationFile=system.file(
                        "data",
                        "gAnnot.rda",
                        package="RnaSeqTutorial"),
                      count="exons",
                      filenames=c("ACACTG.bam", "ACTAGC.bam",
                        "ATGGCT.bam", "TTGCGA.bam"),
                      normalize=TRUE,
                      outputFormat="edgeR",
                      conditions=conditions
                      )
@

As for the former paragraph about \Biocpkg{DESeq} (see section \pref{par:DESeq}),
the diagnostics plots generated are only semi-relevant. Check out the
\Biocpkg{edgeR} manual for more details about these plots. Note that producing
the plots is rather slow.

\paragraph{Next steps}

At this stage you are done with the normalization and what's ahead of
you: calling differential expression, exporting track files for
visualization, etc. is not the scope of the \Biocpkg{easyRNASeq}
package. This one has a few more functionalities, the most important
of which will be described in the next section. To proceed with your
data analysis, check the relevant package vignettes (\Biocpkg{DESeq},
\Biocpkg{edgeR}) for differential expression analysis and the
\Biocexptpkg{RnaSqTutorial} for examples of track files generation using
the \Biocpkg{rtracklayer} package.

\subsection{Performance}
\label{perf}
The geneModel generation, the read counting and
summarization can be parallelized. Use the \Rcode{nbCore}
argument to set the number of CPU cores to
use. There are a number of word of caution:
\begin{enumerate}
\item CPU cores means CPU cores, \textit{i.e.} located on the same
  physical machine. Do not expect it to work across machines.
\item there's no load nor memory management, so watch out yourself for
  it. Memory will scale linearly with the number and size of read libraries
  (e.g. bam files) you process in parallel.
\item It's using the R 'parallel' package, so it's supported on all
  three main OS. However, some cosmetic reporting might get lost.
\end{enumerate}

\newpage

\section{Advanced usage}

\subsection{De-multiplexing samples}
\label{ss:demultiplex}
Nowadays, NGS machines produces huge quantity of``raw'' reads
(40M to 150M reads per Illumina MiSeq or Illumina HiSeq
lanes respectively), that the coverage obtained per
lane for the transcriptome of ``small'' genome-sized organisms is for
a single sample essentially a waste of resource. \eg one
lane of HiSeq results in an approximate 3,000X coverage of the
\textit{Saccharomyces cerevisiae} genome which is 12Mb large.
Therefore, techniques to have several samples running as a
\emph{single} library have been
created \cite{Lefrancois:2009p778,Smith:2010p777}, using 4-6bp
barcodes to uniquely identify samples; this is called
\textbf{multiplexing}. A 30X coverage per sample of 48 yeast samples can hence
be obtained froman average Illumina GenomeAnalyzer GAIIx run (105bp read,
single end). This
approach is very advantageous to researchers, especially in term of
costs, but it adds an additional layer of pre-processing that is not
as trivial as one may think.  Extracting the barcodes
would be fairly straightforward, but for the average 0.1 percent
sequencing error rate that introduces a lot of multiplicity in the
actual barcodes present in the samples. A proper design of the
barcodes, maximizing the Hamming
distance\cite{Hamming:1950p825,Pilcher:2008p824} is an essential step
for a proper de-multiplexing.
\newline
There are two kinds of barcoding, the one described in
\cite{Lefrancois:2009p778} where the barcode is part of the read
sequence and the one developed by Illumina, where the barcode is read
in a separate sequencing reaction after the first mate sequencing.
\newline
The data used in the following example has been sequenced using the
Illumina protocol. We'll look at the specificities that this introduce.

\bioccomment{This pre-processing procedure has to be
applied on the raw reads before any alignment is performed.}
Most
often, one would use the ``fastq'' formatted file as input. In the
particular case of the Illumina protocol, the barcodes can be retrieved
from the fastq ID lines or from the Illumina export format. The export
format is normally the result of aligning the reads with ELAND, the
Illumina aligner, but it can be generated as well without aligning the
reads, \ie the export file may not contain any alignment
information.
\bioccomment{The \Biocpkg{ShortRead} package offers a quick functionality to
access the barcode field of an export file, which we take advantage of in the
next example; the \Rcode{withAll} argument of the \Rfunction{readAligned}
function.}

\begin{comment}
Are you sure the Hamming distance is the best? There is a lot of
literature on error-correcting codes and I think there may be better
and less naive methods, \eg taking into account error rates and correlations.
\end{comment}

<<cleanup, echo=FALSE>>=
file2clean=c(
  "ACACTG.fastq",
  "ACTAGC.fastq",
  "ATGGCT.fastq",
  "TTGCGA.fastq")
silent <- sapply(
                 file2clean,
                 function(f2c){
                   if(file.exists(f2c)){file.remove(f2c)}
                 })
@

<<Demultiplexing sample>>=
alns <- readAligned(
                    system.file(
                                "extdata",
                                package="RnaSeqTutorial"),
                    pattern="multiplex_export",
                    filter=compose(
                      chastityFilter(),
                      nFilter(2),
                      chromosomeFilter(regex="chr")),
                    type="SolexaExport",
                    withAll=TRUE)
@

Note the use of the \Rcode{withAll} argument. It is essential to get
the barcode, since this data was multiplexed using the
Illumina protocol. For the Illumina protocol, the barcode is read in a
separate sequencing reaction and its sequence is reported as a
field of the \emph{export} file. This field is not parsed by
default by the \Rfunction{readAligned} function to save time and
memory. It becomes available when the data is loaded using
the \Rcode{withAll} argument and is afterwards accessible in the
metadata of the returned object. It can be accessed using the
following command:

<<Illumina example, eval=FALSE>>=
alignData(alns)$multiplexIndex
@

where \Robject{alns} is the object of the \Biocpkg{ShortRead} \Rclass{AlignedRead} class.
\newline

In the following, we will look at the other kind of barcoding, where
the barcode is part of the sequenced sequence. First, we will create
some plots to evaluate the barcoding efficiency.

<<barcodes>>=
barcodes=c("ACACTG","ACTAGC","ATGGCT","TTGCGA")
@
<<barcodePlot, fig=TRUE, height=6, eps=FALSE>>=
barcodePlot(alns,
            barcodes=barcodes,
            type="within",
            barcode.length=6,
            show.barcode=20,
            main="All samples",
            xlim=c(0,0.5))
@

% \begin{figure}[htbp]
% \begin{center}
% \includegraphics[width=0.8\textwidth]{easyRNASeq-barcodePlot}
% \end{center}
% \caption{Barcode occurencies in the whole dataset}
% \label{fig:barcodePlot}
% \end{figure}

All the barcodes seem to be almost equally distributed. Every one has
a proportion close to 25\%. Just the ``ACTAGG'' seems to have been
either less amplified during the library preparation or simply had a
lower concentration than the other or generated less
clusters. Overall, only a low percentage of them cannot be surely
assigned. Once this has been verified, the sample can be
\emph{demultiplexed}.

<<demultiplex>>=
dem.alns <- demultiplex(alns,
                        barcodes=barcodes,
                        edition.dist=2,
                        barcodes.qty=4,
                        type="within")
@

<<check the objects, eval=FALSE>>=
dem.alns$reads[[1]]
dem.alns$barcodes[[1]]
@

There remain to write the extracted data to file, or proceed it within
R. Again performing some validation plot is important to ensure that
the \emph{de-multiplexing} process succeeded.

<<demPlot, fig=TRUE, height=8, width=8, eps=FALSE>>=
par(mfrow=c(2,2))
barcode.frequencies <- lapply(
                              names(dem.alns$barcodes),
                              function(barcode,alns){
                                barcodePlot(
                                            alns$barcodes[[barcode]],
                                            barcodes=barcode,
                                            type="within",barcode.length=6,
                                            show.barcode=20,
                                            main=paste(
                                              "Expected barcode:",
                                              barcode))
                              },dem.alns)
@

% \begin{figure}[htbp]
% \begin{center}
% \includegraphics[width=0.8\textwidth]{easyRNASeq-demPlot}
% \end{center}
% \caption{Barcode occurences independantly per sample}
% \label{fig:demPlot}
% \end{figure}

In the previous figure, we can observe the demultiplexed barcodes.
It is important to assess wether
within a given sample there was a barcode bias. Generally, visually
assessing the data is important to make sure that the raw data you are
looking at agrees with the experimental design that was
performed. This is discussed in the \Robject{RnaSeqTutorial} vignette
of the \Biocexptpkg{RnaSqTutorial}.
\newline
Finally, we can save the demultiplexed files to disk.
<<write the file>>=
status <- lapply(
                 names(dem.alns$barcodes),
                 function(barcode,alns){
                   writeFastq(
                              alns$reads[[barcode]],
                              file=paste(
                                barcode,
                                "fastq",sep="."))
                 },dem.alns)
@

<<cleanup, echo=FALSE>>=
file2clean=c(
  "ACACTG.fastq",
  "ACTAGC.fastq",
  "ATGGCT.fastq",
  "TTGCGA.fastq",
  "Rplots.pdf")
silent <- sapply(
                 file2clean,
                 function(f2c){
                   if(file.exists(f2c)){file.remove(f2c)}
                 })
@

The fastq files are now ready to be aligned with your preferred aligner -
alternatively, other R packages
such as \Biocpkg{Rsubread}, \Biocpkg{gmapR}, \Biocpkg{QuasR} can be used
to align them -
and the resulting bam files processed with the \Rfunction{easyRNASeq}
function as described in the first part of the present vignette! Enjoy.

\newpage
\section{Yet to come}
\label{s:ytc}
The \Rfunction{easyRNASeq} can now return a
\Rclass{RangedSummarizedExperiment} object. The aim is to have it be the
only output in the next development version of the
\Biocpkg{easyRNASeq} package (version $1.9.x$). This in order to
consolidate the kind of objects used for Next-Generation Sequencing in
the Bioconductor repository.

<<count function example, eval=FALSE>>=
## creating a RangedSummarizedExperiment from 4 bam files
sumExp <- easyRNASeq(filesDirectory=system.file(
                  "extdata",
                  package="RnaSeqTutorial"),
                pattern="[A,C,T,G]{6}\\.bam$",
                readLength=30L,
                organism="Dmelanogaster",
                annotationMethod="rda",
                annotationFile=system.file(
                  "data",
                  "gAnnot.rda",
                  package="RnaSeqTutorial"),
                count="exons",
                     outputFormat="SummarizedExperiment"
                )
## the counts
assays(sumExp)
## the sample info
colData(sumExp)
## the 'features' info
rowRanges(sumExp)
@

See the \Rclass{RangedSummarizedExperiment} class defined in the
\Biocpkg{SummarizedExperiment} package for more details on last
three accessors used in the example.
\paragraph{}
Support for \Biocpkg{DESeq2} is also planned for version $1.10.x$ as well as
additional visualization taking advantage of the voom/vst transformations.

\newpage

\section{Use Cases}
\label{s:useCases}
The following use cases have been created to answer user request from
the Bioconductor mailing list. We want to thank Francesco Lescai,
Wade Davis and Richard Friedman for asking pertinent questions that
helped us make this vignette better.
\newline
The first use case shows how to create an annotation that contains
synthetic transcripts; \emph{i.e.} a transcript combining every single
exons of a gene into a single abiological splice-variant. The aim is to
mesure expression at the gene level, while avoiding counting biases
that may be introduced when working with multiple splice variants sharing
common exons. If you are interested in such splice variants expression, you
need to look for more appropriate analysis expression pipeline (RSEM, MMSEQ, BitSeq,
\emph{etc.})
\textcolor{red}{}
\newline
The second use case examplifies how to use the \Rfunction{easyRNASeq} to
get \Biocpkg{DESeq} normalized data from two human samples. It
introduces as well how to use the \Biocpkg{GenomicFeatures} package to
retrieve annotations. Note that the data for this example
is not readily available. If you want to reproduce this example, you
will need to get 2 (at least) fastq files from either the SRA or ENA websites and
align them against the human genome using your aligner of choice and
convert the alignment into the BAM format. An example of how to
achieve this in R is described in the beginning of the use case.
\newline
The third use case describe how to combine different annotation
(chromosomic and genic), when for example the chromosome names in the
aligned file(s) are different from the annotation retrieved using
\Biocpkg({biomaRt}. In both use case, we will assume the
\Biocpkg{easyRNASeq} library as already been loaded as in:

<<load the lib, eval=FALSE>>=
library(easyRNASeq)
@

\subsection{Creating a set of synthetic transcripts}
\label{synth:trx}

The gff3 is retrieved from FlyBase \cite{Tweedie:2009p2014}:
\url{ftp://ftp.flybase.net/releases/FB2013_06/dmel_r5.54/gff/dmel-all-no-analysis-r5.54.gff.gz}.
As the file is rather large, only the line containing "gene", "mRNA" and "exon"
are kept using an \emph{awk} script.
\begin{verbatim}
awk '{if($3=="gene"){print}; if($3=="mRNA"){print}; if($3=="exon"){print}}'
                        < dmel-all-no-analysis-r5.54.gff > dmel_r5-54.gff3
\end{verbatim}

Set your working environment to the directory where you created dmel\_r5-54.gff3.
<<synthetic-transcript, eval=FALSE>>=
## lib - loaded by easyRNASeq already
library(genomeIntervals)

## read in the gff
gff <- readGff3("dmel_r5-54.gff3")

## look at the gff
gff
nrow(gff[gff$type=="exon",])
nrow(gff[gff$type=="mRNA",])
nrow(gff[gff$type=="gene",])

## map the mRNA ID to the gene ID
transcriptGeneMapping <- data.frame(
  getGffAttribute(gff[gff$type=="mRNA",],"ID"),
  getGffAttribute(gff[gff$type=="mRNA",],"Parent"))
head(transcriptGeneMapping)

## extract the exons IRangesList
## ie exons are grouped by gene
## and their coordinates are stored as an IRanges object
sel <- gff$type=="exon"
rngList<- split(IRanges(start=gff[sel,1],end=gff[sel,2]),
                transcriptGeneMapping[match(
                  sapply(strsplit(
                    getGffAttribute(gff[sel,],"Parent"),
                    ","),"[",1),
                  transcriptGeneMapping$ID),"Parent"])
rngList

## check what's the gene with the max number of exons
mostExons <- rev(names(table(elementNROWS(rngList))))[1]
mostExons

## work the magic; collapse the genes IRanges
rngList<- reduce(rngList)
rngList

## what's the max now?
rev(names(table(elementNROWS(rngList))))[1]

## create the gff
## get the gff information; here we simply duplicate the
## first exon of every gene by the number of synthetic
## exons per gene. The content will be updated afterwards.
exons <- gff[sel,]
syntheticGeneModel<- exons[rep(
  match(names(rngList),
  transcriptGeneMapping[
    match(sapply(
      strsplit(getGffAttribute(exons,"Parent"),
               ","),"[",1),
    transcriptGeneMapping$ID),"Parent"]),
  elementNROWS(rngList)),]

## update the coordinates
syntheticGeneModel[,1]<- unlist(start(rngList))
syntheticGeneModel[,2]<- unlist(end(rngList))

## change the source
levels(syntheticGeneModel$source)<- "inhouse"

## get the exon number for the minus strand
exonNumber<- lapply(elementNROWS(rngList),":",1)

## reverse them on the plus strand
sel<- strand(syntheticGeneModel)[cumsum(elementNROWS(rngList))] == "+"
exonNumber[sel]<- sapply(exonNumber[sel],rev)

## update the attributes
syntheticGeneModel$gffAttributes<- paste("ID=",
  rep(names(rngList),elementNROWS(rngList)),
  ":",unlist(exonNumber),";Parent=",
  rep(names(rngList),elementNROWS(rngList)),".0",sep="")

## write the file
writeGff3(syntheticGeneModel,file="dmel_synthetic_transcript_r5-54.gff3")
@

%% remember that we could do that as a GRange and add the transcript column
%## convert it into a GRangesList object
%sel <- syntheticGeneModel$type=="exon"
%annot <- split(GRanges(seqnames=seq_name(syntheticGeneModel[sel]),
%                       ranges=IRanges(start=syntheticGeneModel[sel,1],
%                                      end=syntheticGeneModel[sel,2]),
%                       strand=strand(syntheticGeneModel[sel])),
%               getGffAttribute(syntheticGeneModel,"Parent")[sel,1]
%)

%## finally save an rda object
%save(annot,file="dmel_synthetic_transcript_r5-52.rda")

Now we can use the 'gff3' file as our annotation. As we have created synthetic
transcripts, we can directly use the 'transcripts' value for the \Rclass{count}
argument of the \Rfunction{easyRNASeq} function
as follows:

<<easyTrx, eval=FALSE>>=
sumExp <- easyRNASeq(
  filesDirectory=system.file(
    "extdata",
    package="RnaSeqTutorial"),
  pattern="[A,C,T,G]{6}\\.bam$",
  readLength=30L,
  chr.sel="chr2L",
  organism="Dmelanogaster",
  annotationMethod="gff",
  annotationFile="dmel_synthetic_transcript_r5-54.gff3",
  count="transcripts",
  outputFormat="SummarizedExperiment"
)
@

%% NOTE THAT THE chr.sel is necessary as the length of the ranges
%% is larger than those in the BAM file. I.e. the coordinates have
%%  changed in that version of Flybase.

That's it. Done much faster than using the 'geneModels' \Rclass{summarization}
argument. This approach is currently being ported to the development version of
\Biocpkg{easyRNASeq}. There remains a caveat not addressed by this procedure:
genes overlapping on the same or opposite strands will still generate double-
countings. Adequate warnings would be emitted. A refinement of the method above
would be to restrict these genes to their non-overlapping intervals. Commonly
the overlaps encompass UTRs and can be safely ignored. In a few case however,
it might be difficult to decide wheter to keep or drop them. Note that if the
result is returned as an \Rclass{RNAseq} or \Rclass{RangedSummarizedExperiment}
class object, the overlapping genes are flagged by a boolean. To access this
information in an \Rclass{RNAseq} object, do:

<<Access the overlap flag, eval=FALSE>>=
genomicAnnotation(rnaSeq)$overlap
@

and from a \Rclass{RangedSummarizedExperiment} object:

<<Access the overlap flag from an SE, eval=FALSE>>=
rowRanges(sumExp)$overlap
@

\subsection{Processing a set of human samples}
\label{p:pasohs}

\paragraph{Getting the data}
If you already have a set of bam files resulting of short read
alignments against the human genome, you can just proceed to the next
paragraph. If not here is an example on how to retrieve and align data
in R. Note that the \Biocpkg{Rsubread} is only supported on the linux
platform. You'll need to use your aligner of choice and generate BAM
files if you're using another platform. However you should still be
able to download the SRA/ENA files. The files listed in this example
are from the \cite{Jaager:2012p4708} study, but where simply selected for their
small size. Still downloading and creating all the necessary file
requires times and a computer with a sufficient amount of memory to
index the human genome.
<<bam prep, eval=FALSE>>=
library(BSgenome.Hsapiens.UCSC.hg19)
library(GEOquery)
library(SRAdb)
library(Rsamtools)
library(Rsubread)

## create a temp dir
dir.create(file.path(getwd(),"tmp1234"))

## change the working directory
setwd(file.path(getwd(),"tmp1234"))

## init SRA
sqlfile <- getSRAdbFile()

## init a connection
sra_con <- dbConnect(SQLite(),sqlfile)

## list the files
acc <- c("SRR490224","SRR490225")
getFASTQinfo( in_acc = acc, srcType = 'ftp' )

## get the read files
getSRAfile( in_acc=acc, sra_con, destDir = getwd(),
           fileType = 'fastq', srcType = 'ftp' )

## close the connection
dbDisconnect( sra_con )

## write the human genome sequences
writeXStringSet(Reduce(append,
                       lapply(seqnames(Hsapiens),
                              function(nam){
                                dss <- DNAStringSet(unmasked(Hsapiens[[nam]]))
                                names(dss) <- nam
                                dss
                              })),
                file="hg19.fa")

## create the indexes
dir.create("indexes")
buildindex(basename=file.path("indexes","hg19"),
           reference="hg19.fa")

## align the reads
sapply(dir(pattern="*\\.gz$"),function(fil){
  ## decompress the files
  gunzip(fil)

  ## align
  align(index=file.path("indexes","hg19"),
        readfile1=sub("\\.gz$","",fil),
        nsubreads=2,TH1=1,
        output_file=sub("\\.fastq\\.gz$","\\.sam",fil)
        )

  ## create bam files
  asBam(file=sub("\\.fastq\\.gz$","\\.sam",fil),
        destination=sub("\\.fastq\\.gz$","",fil),
        indexDestination=TRUE)
})

@

Note that this has generated a number of files that
  you should clean up afterwards, i.e. delete the ``tmp1234'' folder,
  once you are done with the use case.

\paragraph{Processing the data}
First, we start by retrieveing the size of the chromosomes. This is an
important information for calculating any feature count. Actually,
neither the BSgenome nor any related packages are required for
\Biocpkg{easyRNASeq} to run. As they are the easiest way to access
genomic information such as chromosome lengths within the
R/Bioconductor framework, they are made available to the
\Rfunction{easyRNASeq} for that purpose. However, providing a simple
named vector is sufficient and therefore is easyRNASeq not limited to
existing \Biocpkg{BSgenome} organisms. The chromosome size is
essential for one reason: to provide a complete representation of the
data. When counting reads per features, one get counts for these
features that have at least one read aligning to them, i.e. every
feature having no reads will be missed. One could then either return
only those features having counts or returning a value of 0 counts for
those that do not. We do not find these solutions satisfying and to
ensure that we provide coherent data, we return the counts for every
feature present on the chromosomes. For that purpose, the chromosome
size is essential as it allows us to define those features on a
chromosome that are located between the last feature having a number a
count bigger or equal to one and the end of the chromosome - features,
which would otherwise be ignored. It is as well a mean to monitor that
the provided annotation is pertinent.
As of version $1.3.5$, when using the 'bam' format, it
  is even easier. Setting the 'chr.sizes' argument to \textbf{auto}
  will result in the chromosome sizes to be retrieved from the BAM
  header. However, the purpose of this use case is to, at least partly,
  demonstrate the importance of using appropriate annotations and
  therefore the use of the 'chr.sizes' argument is demonstrated.
\label{chrsize}

<<chromosome size, eval=FALSE>>=
library(BSgenome.Hsapiens.UCSC.hg19)
chr.sizes=seqlengths(Hsapiens)
@

Then, we list the BAM files.

<<bam files name, eval=FALSE>>=
bamfiles=dir(getwd(),pattern="*\\.bam$")
@

We can now run the \Rfunction{easyRNASeq} function, fetching the
annotation using \Biocpkg{biomaRt}. As this is time consuming (about
10 minutes from an average network) and
since we might want to work on these annotation to avoid
double-counting, we first ask the function to return an instance of the
\Rclass{RNAseq} class. Then, using the genomicAnnotation accessor, we
extract the retrieved annotation.

<<Fetching with biomaRt, eval=FALSE>>=
rnaSeq <- easyRNASeq(filesDirectory=getwd(),
                    organism="Hsapiens",
                    readLength=58L,
                    annotationMethod="biomaRt",
                    count="exons",
                    filenames=bamfiles[1],
                    outputFormat="RNAseq"
                    )
gAnnot <- genomicAnnotation(rnaSeq)
@

As one can see by looking at this object, it contains $434$
``chromosomes'', most of which are of no interest to us. For that
reason, we first filter it and then save it to the disk for later
re-use. The ``.rda'' extension is a synonym of the ``.RData'' one and
identifies a serialized R data file.
\label{rda}

<<filtering the annot, eval=FALSE>>=
gAnnot <- gAnnot[space(gAnnot) %in% paste("chr",c(1:22,"X","Y","M"),sep=""),]
save(gAnnot,file="gAnnot.rda")
@

Now using the modified, saved annotation, we can get a count table as follows:

<<get a count table, eval=FALSE>>=
countTable <- easyRNASeq(filesDirectory=getwd(),
                        organism="Hsapiens",
                        readLength=58L,
                        annotationMethod="rda",
                        annotationFile="gAnnot.rda",
                        count="exons",
                        filenames=bamfiles[1]
                        )
@

Another way to retrieve the annotation is to use the
\Biocpkg{GenomicFeatures} library. \Biocpkg{easyRNASeq} does not yet supports
that library automatically, but it can take \Robject{GRangesList}
object as input; objects that are derivatives of the ones returned by the functions of the
\Biocpkg{GenomicFeatures} package. A few changes needs to be done to
the obtained object so that it can be used by the
\Rfunction{easyRNASeq}: first, a metadata element needs to be updated
and then the object needs to be converted into a \Robject{GRangesList}.
\label{grnglist}

<<using different annotation, eval=FALSE>>=
library(GenomicFeatures)
hg19.tx <- makeTxDbFromUCSC(genome="hg19", tablename="refGene")

gAnnot <- exons(hg19.tx)

colnames(elementMetadata(gAnnot)) <- "exon"

gAnnot <- split(gAnnot,seqnames(gAnnot))
@

As previously, this annotation can either be saved to disk or
be used directly, using the \Rcode{annotationMethod} "env" and
\Rcode{annotationObject} arguments. Additionaly, one can select chromosome of
interest using the \Rcode{chr.sel} argument. It accepts a vector of
chromosome names. In the following, we subset for the chromosome ``chr1'' only.
\label{env}
\label{chr.sel}

<<sub setting, eval=FALSE>>=
countTable <- easyRNASeq(filesDirectory=getwd(),
                        organism="Hsapiens",
                        readLength=58L,
                        annotationMethod="env",
                        annotationObject=gAnnot,
                        count="exons",
                        filenames=bamfiles[1],
                        chr.sel="chr1"
                        )
@

Note that in the previous call, we removed the
  'chr.sizes' argument, just to demonstrate that the chr.sizes can be
  retrieved from the bam files. Removing the argument is the same as
  setting it to its default value: 'auto'

Alternatively, one can use the \Rcode{outputFormat} argument
function to get a \Rclass{RangedSummarizedExperiment} object back. Mind the
comments in section \ref{s:ytc}, page \pageref{s:ytc} though.

<<using the count function, eval=FALSE>>=
sumExp <- easyRNASeq(filesDirectory=getwd(),
                    organism="Hsapiens",
                    readLength=58L,
                    annotationMethod="env",
                    annotationObject=gAnnot,
                    count="exons",
                    filenames=bamfiles[1],
                    chr.sel="chr1",
                     outputFormat="SummarizedExperiment"
                    )
@

Finally, instead of returning a count table, we can get a
\Robject{CountDataSet} instance from the \Biocpkg{DESeq}
package. This will perform the normalization of the data and generate
some Quality Assessment plots. In the present example, it would not yield
very sensitive results as we have no replicates (biological). These
are important to \Biocpkg{DESeq} to accurately model the technical and biological variance.
With no replicates for every condition, the dispersion will be based
on a "pooled" estimate making the differential expression call lose
sensitivity. In addition, \Biocpkg{DESeq} is in such cases using a
conservative approach (which is good) so you'd get even less
significant results. Here, as we have no replicates, we need to pass
additional arguments (the last two) that will be tunnelled to the \Biocpkg{DESeq}
\Rfunction{estimateDispersions} function. See the \Biocpkg{DESeq} vignette for
more details on these.

<<DESeq run, eval=FALSE>>=
conditions <- c("A","B")
names(conditions) <- bamfiles

countDataSet <- easyRNASeq(filesDirectory=getwd(),
                          organism="Hsapiens",
                          annotationMethod="env",
                          annotationObject=gAnnot,
                          count="exons",
                          filenames=bamfiles,
                          chr.sel="chr1",
                          outputFormat="DESeq",
                          normalize=TRUE,
                          conditions=conditions,
                          fitType="local",
                          method="blind"
                        )
@

Note that as the read length differs between the two
files: $58$ and $76$, the \Rcode{readLength} argument was removed
from the previous function call. The read size is then identified
automatically.

\subsection{Dealing with annotation inconsistencies}
\label{p:dwai}
This use case shows how to deal with inconsistent annotations,
\textit{e.g.} when the chromosome names present in the aligned file
are different from those that can be retrieved from an annotation
source such as \Biocpkg{biomaRt}.
\newline
First, we have a look at the data, in this case some Illumina export
files. Reading in the data using the \Biocpkg{ShortRead} package is
quite resource demanding as the whole sequences are loaded in memory.
Then we look at the chromosome names. These differs from what we
expect - UCSC standards - as they have an additional ``.fa'' extension.

<<read the data, eval=FALSE>>=
aln <- readAligned("data",type="SolexaExport",pattern="*.txt.gz")
gc()
levels(chromosome(aln))
@

<<fake the data, echo=FALSE>>=
c("chr1.fa","chr10.fa","...","chrY.fa")
@

They are different from what \Biocpkg{biomaRt} will return as well:
\textit{i.e.} 1:19, X, Y and MT plus others. This triple inconsistency will
be a problem for \Biocpkg{easyRNASeq}. If there were only two sets of
names, using the "custom" chromosome name map by-pass (see the Details
section of the ``?easyRNASeq'' help page) would solve the
issue. However, in the present particular case, as we are retrieving
the annotation from \Biocpkg{biomaRt}, we need to precise the name of
the organism, which circumvent the chromosome name mapping
by-pass. The solution is to first fetch the annotation, modify it and
save it as an R data file. Some of the retrieved annotation are ``NT''
contigs. There are only a few of them, so instead of filtering them
out, we just ignore them.

<<fetch the annotation, eval=FALSE>>=
obj <- fetchAnnotation(new('RNAseq',
                          organismName="Mmusculus"
                          ),
                      method="biomaRt")

gAnnot <- genomicAnnotation(obj)

length(grep("NT_",space(gAnnot)))

@

<<fake number 2, echo=FALSE>>=
1181
@

<<modify the annotation, eval=FALSE>>=
names(gAnnot) <- paste("chr",names(gAnnot),".fa",sep="")
@

As described previously, see page \pageref{rda} we can save the
annotation to a file or use it directly, using the
\Rcode{AnnotationMethod} ``env'' argument. In that later case,
since we did not process the annotation, numerous warnings concerning
possible multiple countings of reads will be raised. Double counting
reads is not what ones want, see section \pref{synth:trx} on how to adress that.
\newline
Note that \Biocpkg{easyRNASeq} does support some chromosome names
conversion by default. The list of organism for which this is possible
can be listed using the knownOrganisms function.
\newline
Before going on, we do some cleanup as some of the objects we have
generated take large amounts of memory.

<<clean, eval=FALSE>>=
rm(aln,obj)
gc()
@

As on page \pageref{chrsize} we get the chromosome sizes.
Again, note that the use of the BSgenome is not mandatory. It's just
easy as they are available in Bioconductor. Typing in your own
chromosome sizes named vector is as valid.

<<chrsize, eval=FALSE>>=
library(BSgenome.Mmusculus.UCSC.mm9)
chr.sizes<- seqlengths(Mmusculus)
@

We can now create the chromosome name mapping.

<<create the chr.map, eval=FALSE>>=
chr.map <- data.frame(
                      from=paste("chr",c(1:19,"X","Y"),".fa",sep=""),
                      to=paste("chr",c(1:19,"X","Y"),sep=""))
@

Using this chromosome map, we can now summarize the reads per feature
of interest. Here we want to look for gene models, so we set the
\Rcode{count} and \Rcode{summarization} arguments. Note again that
the \Rcode{summarization} argument will be deprecated in the near
future and its values merged with the \Rcode{count} ones. The approach
described in section \pref{synth:trx} is a faster way to achieve the same.
\newline
Asking to get back an \Rclass{RNAseq} instance allows us to look at the gene
models defined by \Biocpkg{easyRNASeq}. This offers the possibility
to clean them to avoid multiple counting.
\newline
As we have Illumina export data at hand, we need to define a set
of Filter to ensure that the data is read properly. Indeed, the export
file contains all the reads, so the one that do not pass the chastity
filter have to be removed. In addition, some of the other reads are
for internal QC, and they have no position. For that reason, we need
to filter those too out.
\newline
Reading in export data is more resource exhausting that reading in bam
files, as we are loading in the sequence and quality information as
well, whereas we do not need them.
\newline
We will now look through three different set of parameters that will
stepwise reduce the number of warnings emitted. These warnings are
there to help you understand the different pitfalls that the
\Rfunction{easyRNASeq} helps you avoiding when analysing your RNA-Seq
data. The first approach generates a lot of warnings, because of the
differing annotations, since we are using the entire set of annotation
we got. As we do not want to generate all these warnings, these code
lines are not evaluated. To get a feel about these warnings, they
will look as the follows:

\begin{verbatim}
Warning messages:
1: In .convertToUCSC(names(genomicAnnotation(obj)), organismName(obj),  :
Your custom map does not define a mapping for the following
chromosome names: chrMT.fa
2: In easyRNASeq(filesDirectory = headDir, organism = "custom", chr.map = chr.map,  :
There are 6096 synthetic exons as determined from your annotation that overlap!
This implies that some reads will be counted more than once!
Is that really what you want?
\end{verbatim}

Let us start with the full set of annotation.

<<no filter, eval=FALSE>>=
rnaSeq <- easyRNASeq(filesDirectory="data",
                    organism="custom",
                    chr.map=chr.map,
                    chr.sizes=chr.sizes,
                    filter=compose(
                      naPositionFilter(),
                      chastityFilter()),
                    readLength=50L,
                    annotationMethod="env",
                    annotationObject=gAnnot,
                    format="aln",
                    count="genes",
                    summarization= "geneModels",
                    filenames="1-Feb_ATCACG_L003_R1_001_export.txt.gz",
                    outputFormat="RNAseq",
                    nbCore=2
                    )
@

To reduce the number of warnings emitted, we can select for the
chromosome we are interested in as previously done on page \pageref{chr.sel}.

<<selected chr, eval=FALSE>>=
rnaSeq <- easyRNASeq(filesDirectory="data",
                    organism="custom",
                    chr.map=chr.map,
                    chr.sizes=chr.sizes,
                    chr.sel=chr.map$from,
                    filter=compose(
                      naPositionFilter(),
                      chastityFilter()),
                    readLength=50L,
                    annotationMethod="env",
                    annotationObject=gAnnot,
                    format="aln",
                    count="genes",
                    summarization= "geneModels",
                    filenames="1-Feb_ATCACG_L003_R1_001_export.txt.gz",
                    outputFormat="RNAseq",
                    nbCore=2
                    )
@

Finally, to further reduce the warnings, we can manipulate the
\Robject{RangedData} object to remove the unnecessary annotation.

<<remove annotation, eval=FALSE>>=
sel <- grep("NT_",names(gAnnot))
gAnnot <- RangedData(ranges=ranges(gAnnot)[-sel,],values=values(gAnnot)[-sel,])
colnames(gAnnot) <- gsub("values\\.","",colnames(gAnnot))
@

This last call will only generate two warnings, one that could be
easily dealt with (a complaint about the chrMT). The other one is
about double counting and it requires to adapt the annotation.

<<final call, eval=FALSE>>=
rnaSeq <- easyRNASeq(filesDirectory="data",
                    organism="custom",
                    chr.map=chr.map,
                    chr.sizes=chr.sizes,
                    chr.sel=chr.map$from,
                    filter=compose(
                      naPositionFilter(),
                      chastityFilter()),
                    readLength=50L,
                    annotationMethod="env",
                    annotationObject=gAnnot,
                    format="aln",
                    count="genes",
                    summarization= "geneModels",
                    filenames="1-Feb_ATCACG_L003_R1_001_export.txt.gz",
                    outputFormat="RNAseq",
                    nbCore=2
                    )
@

\newpage

\section{FAQ}
\label{FAQ}
\warning{
\subsection{Does \Rfunction{easyRNASeq} support reads of variable length?}
Yes it does. The base-pair coverage extracted from the read mapping is actually
weighted by the read length to determine better read count estimates.
\subsection{Does the \Rfunction{easyRNASeq} support paired-end reads?}
No it does not look at the pair information; hence read counts for PE data will
be on average $2x$ larger than those returned from other methods. In the context
of DE, this is of little significance though.
}
\newpage

\section{Session Information}

The version number of R\cite{ref:R} and Bioconductor
\cite{Gentleman:2004p2013} packages loaded for generating the
vignette were:

<<SessionInfo, echo=FALSE>>=
library(easyRNASeq)
library(BSgenome.Dmelanogaster.UCSC.dm3)
library(RnaSeqTutorial)
sessionInfo()
@

\newpage

\section{Final remarks}
\label{sec:finalRem}
RNA-seq is still maturating and a lot of new developments are to be
expected. If you have any questions, comments, feel free to contact me:
nicolas.delhomme \emph{at} umu \emph{dot} se.

\begin{comment}
  The following code is just to make sure that the functionality of
  strand and reduce are kept, despite the NAMESPACE clashes between
  imported packages.
\end{comment}

<<some test, echo=FALSE>>=
grngs = GRanges(seqnames=c("chr1", "chr2"),
  ranges=IRanges(start=1:2, end=2:3),
  strand=c("+","-"))
silent <- strand(grngs)
silent <- reduce(grngs)
@

\newpage

\bibliography{Bibliography/References}

\end{document}