%\VignetteIndexEntry{OncoSimulR Overview}
%\VignetteDepends{OncoSimulR}
%\VignetteKeywords{OncoSimulR simulation cancer oncogenetic trees}
%\VignettePackage{OncoSimulR}
%\VignetteEngine{knitr::knitr}
\documentclass[a4paper,11pt]{article}
<<echo=FALSE,results='hide',error=FALSE>>=
require(knitr, quietly = TRUE)
opts_knit$set(concordance = TRUE)
options(width = 68)
##opts_knit$set(stop_on_error = 2L)
@ 
\usepackage{amsmath}
%% \usepackage[authoryear,round,sort]{natbib}
\usepackage{threeparttable}
\usepackage{array}
%%\usepackage{hyperref} %% not if using BiocStyle
%%ditto
%\usepackage{geometry}
%\geometry{verbose,a4paper,tmargin=23mm,bmargin=26mm,lmargin=28mm,rmargin=28mm}
\usepackage{url}
\usepackage{xcolor}
%\definecolor{light-gray}{gray}{0.72}
\newcommand{\cyan}[1]{{\textcolor {cyan} {#1}}}
\newcommand{\blu}[1]{{\textcolor {blue} {#1}}}
\newcommand{\Burl}[1]{\blu{\url{#1}}}
\usepackage{gitinfo}


%%\SweaveOpts{echo=TRUE}

%\usepackage{tikz}
%\usetikzlibrary{arrows,shapes,positioning}

\usepackage[latin1]{inputenc}


%Uncomment for BioC
%\usepackage{datetime}
%\newdateformat{mydate}{\THEDAY-\monthname[\THEMONTH]-\THEYEAR}

<<style-knitr, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@


%%\title{Using OncoSimulR: a package for simulating cancer progression data,
%%including drivers and passengers, and allowing for order restrictions.}

%%\author{Ramon Diaz-Uriarte\\
%%Dept. Biochemistry, Universidad Aut\'onoma de Madrid \\ 
%%Instituto de Investigaciones Biom\'edicas ``Alberto Sols'' (UAM-CSIC)\\
%%Madrid, Spain\\
%%{\small \texttt{ramon.diaz@iib.uam.es}} \\
%%{\small \texttt{rdiaz02@gmail.com}} \\
%%{\small \Burl{http://ligarto.org/rdiaz}} \\
%%}


%% FIXME: the homozigos, etc.


\bioctitle[\textit{OncoSimulR: genetic simulation with arbitrary
  epistasis}]{OncoSimulR: forward genetic simulation in asexual
  populations with arbitrary epistatic interactions and a focus on modeling
  tumor progression.}

%% \bioctitle{Using OncoSimulR: a package for simulating cancer progression data,
%%   including drivers and passengers, and allowing for order restrictions.}

\author{Ramon Diaz-Uriarte\\
  Dept. Biochemistry, Universidad Aut\'onoma de Madrid \\ 
  Instituto de Investigaciones Biom\'edicas ``Alberto Sols'' (UAM-CSIC)\\
  Madrid, Spain{\footnote{ramon.diaz@iib.uam.es, rdiaz02@gmail.com}} \\
%% {\footnote{rdiaz02@gmail.com}} \\
{\small \Burl{http://ligarto.org/rdiaz}} \\
 }
%% \date{\the\year-\the\month-\the\day}
%% \date{\mydate\today}

%% \date{\gitAuthorDate\ {\footnotesize (Release\gitRels: Rev:
%%     \gitAbbrevHash)}}


 %% I use x.y.z and gitinfo does not deal with this well in gitVtags et al.
%% \date{\gitAuthorDate\ {\footnotesize (Version:\gitVtags, Rev: \gitAbbrevHash)}}

\date{\gitAuthorDate\ {\footnotesize (Rev: \gitAbbrevHash)}}


\begin{document}
\maketitle

%% Remember to add BiocStyle to Suggests
%%
%% I get lots of problems, so will try later.
%% <<style, eval=TRUE, echo=FALSE, results=tex>>=
%% BiocStyle::latex()
%% @

\tableofcontents


\section{Introduction}\label{intro}

OncoSimulR was originally developed to simulate tumor progression under
several models, with emphasis on allowing users to set
restrictions in the accumulation of mutations as specified, for example,
by Oncogenetic Trees (OT; \cite{Desper1999JCB, Szabo2008}) or Conjunctive
Bayesian Networks (CBN; \cite{Beerenwinkel2007, Gerstung2009,
  Gerstung2011}), with the possibility of adding passenger mutations to
the simulations and several types of sampling.


Since then, OncoSimulR has been vastly extended to allow you to specify
other types of restrictions in the accumulation of mutations, as in
the ``semimonotone'' model of Farahani and Lagergren
\cite{FarahaniLagergren2013} and the XOR models of Korsunsky and
collaborators \cite{Korsunsky2014}. Moreover, different fitness effects
related to the order in which mutations appear can also be incorporated,
involving arbitrary numbers of genes. This is different from
``restrictions in the accumulation of mutations''. With order effects, 
shown empirically in a recent cancer paper by Ortmann and collaborators
\cite{Ortmann2015}, the effect of having both mutations ``A'' and ``B''
differs depending on whether ``A'' appeared before or after ``B''. More
generally, now OncoSimulR also allows you to specify arbitrary epistatic
interactions between arbitrary collections of genes and to model, for
example, synthetic mortality or synthetic viability (again, involving an
arbitrary number of genes, some of which might also depend on other genes,
or show order effects with other genes). Moreover, it is possible to
specify the above interactions in terms of modules, not genes. This idea
is discussed in, for example, \cite{Raphael2014a, Gerstung2011}: the
restrictions encoded in, say, CBNs or OT can be considered to apply not to
genes, but to modules, where each module is a set of genes (and the
intersection between modules is the empty set) that performs a specific
biological function. Modules, then, play the role of a ``union operation''
over the set of genes in a module. In addition, arbitrary numbers of genes
without interactions (and with fitness effects coming from any
distribution you might want) are also possible.


The models so far implemented are all continuous time models, which are
simulated using the BNB algorithm of Mather et al.\ \cite{Mather2012}. The
core of the code is implemented in C++, providing for fast execution.
Finally, to help with simulation studies, code to simulate random graphs
of the kind often seen in CBNs, OTs, etc., is also available.

\subsection{Key features of OncoSimulR}\label{key}

As mentioned above, OncoSimulR is now a very general package for forward
genetic simulation, with applicability well beyond tumor progression. This
is a summary of some of the key features:


\begin{itemize}
  
  \item You can specify arbitrary interactions between genes, with
    arbitrary fitness effects, with  explicit support for:
    \begin{itemize}
    \item Restrictions in the accumulation of mutations, as specified by
      Oncogenetic Trees (OTs), Conjunctive Bayesian Networks (CBNs),
      semimonotone progression networks, and XOR relationships.
      
    \item Epistatic interactions, including, but not limited to, synthetic
      viability and synthetic lethality.
    \item Order effects.
    \end{itemize}
  \item You can add passenger mutations.
    \item More generally, you can add arbitrary numbers of non-interacting
      genes with arbitrary fitness effects.
  
    \item You can allow for deviations from the OT, CBN, semimonotone, and
      XOR models, specifying a penalty for such deviations (the $s_h$
      parameter).
      
    \item You can conduct multiple simulations, and sample from them with
      different temporal schemes and using either whole-tumor or single-cell
      sampling. 
  
    \item Right now, three different models are available, two that lead to
      exponential growth, one of them loosely based on Bozic et al.\
      \cite{Bozic2010}, and another that leads to logistic-like growth, based
      on McFarland et al.\ \cite{McFarland2013}.
      
    \item Code in C++ is available (though not yet callable from R) for
      using several other models, including the one from Beerenwinkel and
      collaborators \cite{Beerenwinkel2007b}.
      
    \item You can use very large numbers of genes (e.g., see an example of
      50000 in section \ref{mcf50070}).
      
    \item Simulations are generally very fast as I use C++ to implement
      the BNB algorithm.
      
    \item You can obtain the true sequence of events and the phylogenetic
      relationships between clones.
      
\end{itemize}


Further details about the motivation for wanting to simulate data this way
in the context of tumor progression can be found in
\cite{Diaz-Uriarte2015}, where additional comments about model parameters
and caveats are discussed. Are there similar programs? The Java program by
\cite{Reiter2013a} offers somewhat similar functionality to the previous
version of OncoSimulR, but it is restricted to at most four drivers
(whereas v.1 of OncoSimulR allowed for up to 64), you cannot use arbitrary
CBNs or OTs (or XORs or semimonotone graphs) to specify restrictions,
there is no allowance for passengers, and a single type of model (a
discrete time Galton-Watson process) is implemented. The current
functionality of OncoSimulR goes well beyond the previous version
(and, thus, also the TPT of \cite{Reiter2013a}), allowing you to specify
all types of fitness effects available in other general forward genetic
simulators such as FFPopSim \cite{Zanini2012}, and some (e.g., order
effects) that, to our knowledge, are not available from any other genetics
simulator.


\subsection{Steps in using OncoSimulR}


Using this package will often involve the following steps:

\begin{enumerate}
\item Specify the fitness effects: sections \ref{specfit} and \ref{litex}.
\item Simulate cancer progression: section \ref{simul}. You can simulate
  for a single subject or for a set of subjects. You will need to:
  \begin{itemize}
  \item Decide on a model. This basically amounts to choosing a model with
    exponential growth (``Exp'' or ``Bozic'') or a model with
    Gompertz-like growth (``McFL''). If exponential growth, you can choose
    whether the effects of mutations operate on the death rate
    (``Bozic'') or the birth rate (``Exp'')\footnote{It is of course
      possible to do this with the Gompertz-like models, but there
      probably is little reason to do it. McFarland et
      al.\ \cite{McFarland2013} discuss how this has little effect on their
      results, for example. In addition, decreasing the death rate will
      more easily lead to numerical problems, as shown in section
      \ref{ex-0-death}.}.
  \item Specify the other parameters of the simulation (when to stop,
    mutation rate, etc).
  \end{itemize}
  Of course, at least for initial playing around, you can use the defaults.
  
\item Sample from the simulated data: section \ref{sample}, and do
  something with those simulated data (e.g., fit an OT model to
  them). What you do with the data, however, is outside the scope of this
  package.   
\end{enumerate}


Before anything else, let us load the package. We also explicitly load
\Biocpkg{graph} and \CRANpkg{igraph} for the vignette to work (you do not
need that for your usual interactive work). And I set the default color
for vertices in igraph.

<<results="hide">>=
library(OncoSimulR)
library(graph)
library(igraph)
igraph_options(vertex.color = "SkyBlue2")
@ 

<<echo=FALSE, results='hide'>>=
options(width = 68)
@ 

To be explicit, what version are we running?
<<>>=
packageVersion("OncoSimulR")
@ 


\subsection{Two quick examples}\label{quickexample}

Following the above we will run two examples. First a model with a few
genes and \textbf{epistasis}:

<<fig.width=6.5, fig.height=10>>=
## 1. Fitness effects: here we specify an 
##    epistatic model with modules.
sa <- 0.1
sb <- -0.2
sab <- 0.25
sac <- -0.1
sbc <- 0.25
sv2 <- allFitnessEffects(epistasis = c("-A : B" = sb,
                                       "A : -B" = sa,
                                       "A : C" = sac,
                                       "A:B" = sab,
                                       "-A:B:C" = sbc),
                         geneToModule = c(
                             "A" = "a1, a2",
                             "B" = "b",
                             "C" = "c"))
evalAllGenotypes(sv2, order = FALSE, addwt = TRUE)

## 2. Simulate the data. Here we use the "McFL" model and set explicitly
##    parameters for mutation rate, final and initial sizes, etc.
RNGkind("Mersenne-Twister")
set.seed(983)
ep1 <- oncoSimulIndiv(sv2, model = "McFL",
                     mu = 5e-6,
                     sampleEvery = 0.02,
                     keepEvery = 0.5,
                     initSize = 2000,
                     finalTime = 3000,
                     onlyCancer = FALSE)
@ 

%% <<fig.width=6.5, fig.height=10>>=
%%  ## 1. Fitness effects: here we specify a 
%%  ##    epistatic model with modules.
%%  sa <- 0.1
%%  sb <- -0.2
%%  sab <- 0.25
%%  sac <- -0.1
%%  sbc <- 0.25
%%  sv2 <- allFitnessEffects(epistasis = c("-A : B" = sb,
%%                                         "A : -B" = sa,
%%                                         "A : C" = sac,
%%                                         "A:B" = sab,
%%                                         "-A:B:C" = sbc),
%%                           geneToModule = c(
%%                               "Root" = "Root",
%%                               "A" = "a1, a2",
%%                               "B" = "b",
%%                               "C" = "c"))
%%  evalAllGenotypes(sv2, order = FALSE, addwt = TRUE)

%%  ## 2. Simulate the data. Here we use the "McFL" model and set explicitly
%%  ##    parameters for mutation rate, final and initial sizes, etc.
%%  RNGkind("Mersenne-Twister")
%%  set.seed(983)
%%  ep1 <- oncoSimulIndiv(sv2, model = "McFL",
%%                       mu = 5e-6,
%%                       sampleEvery = 0.02,
%%                       keepEvery = 0.5,
%%                       initSize = 2000,
%%                       finalTime = 3000,
%%                       onlyCancer = FALSE)
%% @ 


<<iep1x1,fig.width=6.5, fig.height=9.5>>=
## 3. We will not analyze those data any further. We will only plot them.
##    For the sake of a small plot, we thin the data.
par(mfrow = c(2, 1))
plot(ep1, show = "drivers", xlim = c(0, 1500),
     thinData = TRUE, thinData.keep = 0.5)
## Increase ylim and legend.ncols to avoid overlap of 
## legend with rest of figure
plot(ep1, show = "genotypes", ylim = c(0, 4500), legend.ncols = 4,
     xlim = c(0, 1500),
     thinData = TRUE, thinData.keep = 0.5)
@ 


As a second example, we will use a model where we specify
\textbf{restrictions in the order of accumulation of mutations} using the
pancreatic cancer poset in Gerstung et al.\ \cite{Gerstung2011} (see more
details in section \ref{pancreas}):

<<>>=
## 1. Fitness effects: 
pancr <- allFitnessEffects(
    data.frame(parent = c("Root", rep("KRAS", 4), 
                   "SMAD4", "CDKN2A", 
                   "TP53", "TP53", "MLL3"),
               child = c("KRAS","SMAD4", "CDKN2A", 
                   "TP53", "MLL3",
                   rep("PXDN", 3), rep("TGFBR2", 2)),
               s = 0.1,
               sh = -0.9,
               typeDep = "MN"))
## What does it look like?
plot(pancr)
@ 
<<fig.width=6.5, fig.height=10>>=
## 2. Simulate from it. 
set.seed(1) ## Fix the seed, so we can repeat it
ep2 <- oncoSimulIndiv(pancr, model = "McFL",
                     mu = 1e-6,
                     sampleEvery = 0.02,
                     keepEvery = 1,
                     initSize = 1000,
                     finalTime = 10000,
                     onlyCancer = FALSE)
@ 
<<iep2x2,fig.width=6.5, fig.height=9>>=
## 3. What genotypes and drivers do we get? And play with limits
##    to show only parts of the data. We also thin them.
par(mfrow = c(2, 1))
par(cex = 0.7)
plot(ep2, show = "genotypes", xlim = c(2000, 4000), 
     ylim = c(0, 2400),
     thinData = TRUE, thinData.keep = 0.5)
plot(ep2, show = "drivers", addtot = TRUE,
     thinData = TRUE, thinData.keep = 0.5)
@ 


\subsection{Versions}\label{versions}

In this vignette and the documentation I often refer to version 1 (v.1)
and version 2 of OncoSimulR. Version 1 is the version available up to, and
including, Bioconductor v.\ 3.1. Version 2 of OncoSimulR is available
starting from Bioconductor 3.2 (and, of course, available too from
development versions of Bioconductor). %% If you look at the
%% package version, however, it currently shows as 1.99.x (where x should be
%% a number $\ge 4$). That is because of the versioning scheme of
%% BioConductor. This will become 2.0 in the next release of BioConductor.
So, if you are using the current stable or development version of
Bioconductor, or you grab the sources from GitHub
(\Burl{https://github.com/rdiaz02/OncoSimul}), you are using what we call
version 2. %% If you see a version in the package that says ``1.99.x'', where
%% ``x'' is any number, you are too.


\section{Specifying fitness effects}\label{specfit}

\subsection{Introduction to the specification of fitness
  effects}\label{introfit}

With OncoSimulR you can specify different types of effects on fitness:

\begin{itemize}

\item A special type of epistatic effect that is particularly amenable to
  being represented as a graph. In this graph, having, say, ``B'' be a child
  of ``A'' means that B can only accumulate if A is already present.  This
  is what OT \cite{Desper1999JCB, Szabo2008}, CBN \cite{Beerenwinkel2007,
    Gerstung2009, Gerstung2011}, progression networks
  \cite{FarahaniLagergren2013}, and other similar models
  \cite{Korsunsky2014} mean. Details are provided in section
  \ref{posetslong}. Note that this is not an order effect (discussed
  below): the fitness of a genotype from these directed acyclic graphs
  (DAGs) is a function of whether or not the restrictions in the graph
  are satisfied, not of the historical sequence in which they were
  satisfied.

\item Effects where the order in which mutations are acquired matters, as
  illustrated in section \ref{oe}. There is, in fact, empirical evidence
  of these effects \cite{Ortmann2015}. For instance, the fitness of
  genotype ``A, B'' would differ depending on whether A or B was acquired
  first.

  
\item General epistatic effects (e.g., section \ref{epi}), including
  synthetic viability (e.g., section \ref{sv}) and synthetic
  lethality/mortality (e.g., section \ref{sl}).


\item Genes that have independent effects on fitness (section \ref{noint}).
  
\end{itemize}


Modules (see section \ref{modules0}) allow you to specify any of the above
effects (except those for genes without interactions, as it would not make
sense there) in terms of modules (sets of genes), not individual genes. We
will introduce them right after section \ref{posetslong}, and continue
using them thereafter.


\subsubsection{How to specify fitness effects}\label{howfit}

A guiding design principle of OncoSimulR is to try to make the
specification of those effects as simple as possible but also as flexible
as possible. 

Conceptually, the simplest way is to specify the mapping of all genotypes
to fitness explicitly. This can be done with OncoSimulR (e.g., see
sections \ref{e2}, \ref{e3} and \ref{theminus} or the example in
\ref{weis1b}), but this only makes sense for subsets of the genes or for
very small genotypes, as you probably do not want to be explicit about the
mapping of $2^k$ genotypes to fitness when $k$ is larger than, say, four
or five, and definitely not when $k$ is 10.


An alternative general approach followed in many genetic simulators is to
specify how particular combinations of alleles modify the wildtype
genotype or the genotype that contains the individual effects of the
interacting genes (e.g., see equation 1 in the supplementary material for
FFPopSim).  For example, if we specify that ``A'' contributes 0.04, ``B''
contributes 0.03, and ``A:B'' contributes 0.1, that means that the fitness
of the ``A, B'' genotype is that of the wildtype (1, by default), plus
(actually, times ---see section \ref{numfit}) the effects of A, plus
(times) the effects of B, plus (times) the effects of ``A:B''.
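
As a worked instance of the numbers just given (a sketch that assumes the
default wildtype fitness of 1 and the multiplicative combination described
in section \ref{numfit}; the symbol $W_{A,B}$ is notation introduced only
for this illustration):
\begin{equation*}
  W_{A,B} = 1 \times (1 + 0.04)\,(1 + 0.03)\,(1 + 0.1) = 1.17832,
\end{equation*}
whereas a purely additive reading would give $1 + 0.04 + 0.03 + 0.1 =
1.17$. The difference is small here, but it grows with the number and size
of the effects.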


As we will see in the examples (e.g., see sections \ref{e2}, \ref{e3},
\ref{exlong}) OncoSimulR makes it simple to be explicit about the mapping
of specific genotypes, while also using the ``how this specific effect
modifies previous effects'' logic, leading to a flexible
specification. This also means that in many cases the same fitness
effects can be specified in several different ways.


\subsection{Numeric values of fitness effects}\label{numfit}

We evaluate fitness using the usual (e.g., \cite{Zanini2012, Gillespie1993,
  Beerenwinkel2007,Datta2013}) multiplicative model: fitness is
$\prod (1 + s_i)$ where $s_i$ is the fitness effect of gene (or gene
interaction) $i$.  In all models except Bozic, this fitness refers to the
growth rate (the death rate being fixed to 1\footnote{You can change this
  if you really want to.}). The
original model of McFarland \cite{McFarland2013} has a slightly different
parameterization, but you can go easily from one to the other (see section
\ref{mcfl}).

For the Bozic model, however, the birth rate is set to 1, and the death
rate then becomes $\prod (1 - s_i)$.
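
The two products above can be computed directly in base R. This is only a
sketch: the values of $s_i$ are made up for illustration, all names
starting with \texttt{ex\_} are hypothetical, and no OncoSimulR function is
called.

<<>>=
## Illustrative values only; not taken from any example in
## this vignette.
ex_s <- c(0.1, 0.05, -0.02)

## Models where fitness acts on the birth rate:
## birth rate is prod(1 + s_i).
ex_birth <- prod(1 + ex_s)

## Bozic model: birth rate is 1; death rate is prod(1 - s_i).
ex_death_bozic <- prod(1 - ex_s)

c(birth = ex_birth, death_bozic = ex_death_bozic)
@ 

Note that the same positive $s_i$ increases the birth rate in the first
case and decreases the death rate in the second.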


\subsubsection{McFarland parameterization}\label{mcfl}

In the original McFarland model \cite{McFarland2013}, the effects of
drivers contribute to the numerator of the birth rate, and those of the
(deleterious) passengers to the denominator as:
$\frac{(1 + s)^D}{(1 + s_p)^P}$, where $D$ and $P$ are, respectively, the
total number of drivers and passengers in a genotype, and here the fitness
effect of all drivers is the same ($s$) and that of all passengers is the
same too ($s_p$). However, we can map from this ratio to the usual product
of terms by using a different value of $s_p$, that we will call
$s_{pp} = -s_p/(1 + s_p)$ (see \cite{McFarland2014-phd}, his eq.\ 2.1 in
p.\ 9).  This reparameterization applies to v.2. In v.1 we use the same
parameterization as in the original McFarland model \cite{McFarland2013}.
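
The definition of $s_{pp}$ can be checked in one step. Starting from
$s_{pp} = -s_p/(1 + s_p)$ as given above:
\begin{align*}
  1 + s_{pp} &= 1 - \frac{s_p}{1 + s_p}
              = \frac{(1 + s_p) - s_p}{1 + s_p}
              = \frac{1}{1 + s_p},
\end{align*}
so multiplying by $(1 + s_{pp})^P$ is the same as dividing by
$(1 + s_p)^P$. As a purely illustrative value (not one used elsewhere in
this vignette), $s_p = 0.001$ gives $s_{pp} = -0.001/1.001 \approx
-0.000999$.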


\subsubsection{No viability of clones and types of models}\label{noviab}

For all models where fitness directly affects the birth rate (for now, all
except Bozic), if you specify that some event (say, mutating gene A) has
$s_A \le -1$, then when that event happens the birth rate becomes zero.
This is taken to indicate that the clone is not even viable, and thus it
disappears immediately without any chance for mutation\footnote{This is a
  shortcut that we take because we think that it is what you mean. Note,
  however, that technically a clone with birth rate of 0 might have a
  non-zero probability of mutating before becoming extinct because, in the
  continuous time model we use, mutation is not linked to reproduction. In
  the present code, we are not allowing for any mutation when birth rate
  is 0. There are other options, but none that I find clearly better. An
  alternative implementation would make a clone immediately extinct if and
  only if any of the $s_i = -\infty$.  However, we would still need to
  handle the case with $s_i < -1$ as a special case. We would either make
  it identical to the case with any $s_i = -\infty$ or, for any
  $s_i > -\infty$, set $(1 + s_i) = \max(0, 1 + s_i)$ (i.e., if
  $s_i < -1$ then $(1 + s_i) = 0$), to avoid obtaining negative birth
  rates (which make no sense) and the problem of multiplying an even
  number of negative numbers. I think only the second would make sense as
  an alternative.}.
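
The last point in the footnote above (why terms with $s_i < -1$ need
special handling) can be illustrated in base R. This is only a sketch with
made-up values; the variable name is hypothetical and this is not the code
that OncoSimulR uses internally.

<<>>=
## Two hypothetical effects, both below -1.
ex_s_neg <- c(-1.5, -1.5)

## Naive product: two negative terms multiply to a positive
## "birth rate", as if the clone were viable.
prod(1 + ex_s_neg)

## Truncating each term at 0 before multiplying avoids that.
prod(pmax(0, 1 + ex_s_neg))
@ 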

Models based on Bozic, however, have a birth rate of 1\footnote{In the C++
  code there is a different model, not directly callable from R for now,
  called ``bozic2'' that is slightly different. These comments apply to
  the model that is right now callable from R.} and mutations affect the
death rate. In this case, a death rate larger than the birth rate, per se,
does not signal immediate extinction and, moreover, even for death rates
that are a few times larger than birth rates, the clone could mutate
before becoming extinct\footnote{We said ``a few times''. For a clone of
  population size 1 (which is the size at which all clones start from
  mutation), if death rate is, say, 90 but birth rate is 1, the
  probability of mutating before becoming extinct is very, very close to
  zero for all reasonable values of mutation rate.}. How do we signal
immediate extinction or no viability in this case? You can set the value
of $s = -\infty$. %% Setting a value of
%% $s < -90$ has the same effect.

In general, if you want to identify some mutations or some
combinations of mutations as leading to immediate extinction (i.e., no
viability) of the affected clone, set the corresponding $s$ to $-\infty$,
as this would work even if we later change how birth rates of 0 are
handled. Most examples below evaluate fitness by its effects on the birth
rate. You can see one where we do it both ways in Section
\ref{fit-neg-pos}.


\subsection{Genes without interactions}\label{noint}

This is a simple scenario. Each gene, $i$, has a fitness effect $s_i$ if
mutated. The $s_i$ can come from any distribution you want. As an example
let's use three genes. We know there are no order effects, but we will
also see what happens if we examine genotypes as ordered.

<<>>=

ai1 <- evalAllGenotypes(allFitnessEffects(
    noIntGenes = c(0.05, -.2, .1)), order = FALSE)
@ 


We can easily verify the first results:

<<>>=
ai1
@ 

<<>>=

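## all.equal() compares with a numerical tolerance, which is
## safer than == for floating point values
isTRUE(all.equal(ai1[, "Fitness"],
    c( (1 + .05), (1 - .2), (1 + .1),
       (1 + .05) * (1 - .2),
       (1 + .05) * (1 + .1),
       (1 - .2) * (1 + .1),
       (1 + .05) * (1 - .2) * (1 + .1))))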

@ 

And we can see that considering the order of mutations (see section
\ref{oe}) makes no difference:
<<>>=

(ai2 <- evalAllGenotypes(allFitnessEffects(
    noIntGenes = c(0.05, -.2, .1)), order = TRUE,
    addwt = TRUE))

@ 

(The meaning of the notation in the output table is as follows: ``WT''
denotes the wild-type, or non-mutated clone. The notation $x > y$ means
that a mutation in ``x'' happened before a mutation in ``y''. A genotype
$x > y\ \_\ z$ means that a mutation in ``x'' happened before a
mutation in ``y''; there is also a mutation in ``z'', but that is a gene
for which order does not matter).


And what if I want genes without interactions but I want modules (see
section \ref{modules0})? Go to section \ref{mod-no-epi}.


\subsection{Restrictions in the order of mutations as extended posets}\label{posetslong}

\subsubsection{AND, OR, XOR relationships}\label{andorxor}
The literature on oncogenetic trees, CBNs, etc., has used graphs as a way
of showing the restrictions in the order in which mutations can
accumulate. The meaning of ``convergent arrows'' in these graphs, however,
differs. Figure 1 of \cite{Korsunsky2014} shows a simple diagram
that illustrates the three basic meanings of convergent arrows
using two parental nodes. We will illustrate it here with three. Suppose
we focus on node ``g'' in the following figure (we will create this graph
shortly):
<<fig.height=4>>=
data(examplesFitnessEffects)
plot(examplesFitnessEffects[["cbn1"]])
@ 

\begin{itemize}
\item In relationships of the type used in \textbf{Conjunctive} Bayesian
  Networks (CBN) (e.g., \cite{Gerstung2009}), we are modeling an
  \textbf{AND} relationship, also called \textbf{CMPN} by
  \cite{Korsunsky2014} or \textbf{monotone} relationship by
  \cite{FarahaniLagergren2013}. If the relationship in the graph is fully
  respected, then ``g'' will only appear if all of ``c'', ``d'', and ``e''
  are already mutated.
  
\item \textbf{Semimonotone} relationships \textit{sensu}
  \cite{FarahaniLagergren2013} or \textbf{DMPN} \textit{sensu}
  \cite{Korsunsky2014} are \textbf{OR} relationships: ``g'' will appear if
  one or more of ``c'', ``d'', or ``e'' are already mutated.

\item \textbf{XMPN} relationships (\cite{Korsunsky2014}) are \textbf{XOR}
  relationships: ``g'' will be present only if exactly one of ``c'',
  ``d'', or ``e'' is present. 
\end{itemize}
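
The three meanings of convergent arrows listed above reduce to three
logical tests on the parents of a node. The following base R sketch makes
that explicit; the helper functions are hypothetical names introduced only
for this illustration, not OncoSimulR functions.

<<>>=
## Hypothetical helpers (not part of OncoSimulR): is the
## dependency of a child on its parents satisfied?
## 'ex_parents' is a logical vector, TRUE if that parent
## is mutated.
ex_dep_and <- function(ex_parents) all(ex_parents)
ex_dep_or  <- function(ex_parents) any(ex_parents)
ex_dep_xor <- function(ex_parents) sum(ex_parents) == 1

## Parents of "g": only "c" and "d" are mutated.
ex_pg <- c(c = TRUE, d = TRUE, e = FALSE)
c(AND = ex_dep_and(ex_pg),
  OR  = ex_dep_or(ex_pg),
  XOR = ex_dep_xor(ex_pg))
@ 

With two of the three parents mutated, only the OR (semimonotone)
dependency is satisfied.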


Note that oncogenetic trees (\cite{Desper1999JCB, Szabo2008}) need not
deal with the above distinctions, since the DAGs are trees: no node has
more than one incoming connection or more than one parent\footnote{OTs and
CBNs have some other technical differences about the underlying model they
assume, such as the exponential waiting time in CBNs. We will not discuss them
here.}.

To have a flexible way of specifying all of these restrictions, we will
want to be able to say what kind of dependency each child
node has on its parents.

\subsubsection{Fitness effects}\label{fitnessposets}

Those DAGs specify dependencies and, as explained in
\cite{Diaz-Uriarte2015}, it is straightforward to map them to a simple
evolutionary model: any set of mutations that does not conform to the
restrictions encoded in the graph will have a fitness of 0. However, we
might not want to require absolute compliance with the DAG. This means we
might want to allow deviations from the DAG with a corresponding
penalization that is, however, not identical to setting fitness to 0
(again, see \cite{Diaz-Uriarte2015}). This we can do by being explicit
about the fitness effects of these deviations from the restrictions
encoded in the DAG. We will use below a column of \texttt{s} for the
fitness effect when the restrictions are satisfied and a column of
\texttt{sh} when they are not. (See also section \ref{numfit} for the
details about the meaning of the fitness effects.)
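
Combining this with the multiplicative model of section \ref{numfit}, the
fitness of a genotype under a DAG can be sketched as follows. The symbols
$W$, $G$ and $e_i$ are notation introduced only for this illustration, and
$sh_i$ denotes the \texttt{sh} value of gene $i$:
\begin{equation*}
  W(G) = \prod_{i \in G} (1 + e_i), \qquad
  e_i =
  \begin{cases}
    s_i  & \text{if the restrictions of gene $i$ are satisfied in $G$,}\\
    sh_i & \text{otherwise,}
  \end{cases}
\end{equation*}
where $G$ is the set of mutated genes. Strict compliance with the DAG
corresponds to $sh_i = -1$ (or $-\infty$), which makes $W(G) = 0$ for any
genotype that violates a restriction.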


That way of specifying fitness effects makes it also trivial to use the
model in Hjelm et al.\ \cite{Hjelm2006} where all mutations might be allowed to occur,
but the presence of some mutations increases the probability of occurrence
of other mutations. For example, the values of \texttt{sh} could be all
small positive ones (or for mildly deleterious effects, small negative
numbers), while the values of \texttt{s} are much larger positive numbers.

\subsubsection{Extended posets}
In version 1 of this package we used posets in the sense of
\cite{Beerenwinkel2007, Gerstung2009}, as explained in section \ref{poset}
and in the help for \Rfunction{poset}. Here, we continue using two
columns, that specify parents and children, but we add columns for the
specific values of fitness effects (both s and sh ---i.e., fitness effects
for what happens when restrictions are and are not satisfied) and for the
type of dependency as explained in section \ref{andorxor}.


We can now illustrate the specification of different fitness effects.

\subsubsection{A first conjunction (AND) example}\label{cbn1}

<<>>=

cs <-  data.frame(parent = c(rep("Root", 4), "a", "b", "d", "e", "c"),
                 child = c("a", "b", "d", "e", "c", "c", rep("g", 3)),
                 s = 0.1,
                 sh = -0.9,
                 typeDep = "MN")

cbn1 <- allFitnessEffects(cs)

@ 

(We skip one letter, just to show that names need not be consecutive or
have any particular order.)

We can get a graphical representation using the default ``graphNEL''
<<fig.height=3>>=
plot(cbn1)
@ 

or one using ``igraph'':
<<fig.height=5>>=
plot(cbn1, "igraph")
@ 

%% The vignette crashes if I try to use the layout.

This graph is not strictly a tree (``c'' and ``g'' have several parents),
but it is hierarchical, so the reingold.tilford layout is probably the
best here, and you might want to use that:

<<fig.height=5>>=
library(igraph) ## to make the reingold.tilford layout available
plot(cbn1, "igraph", layout = layout.reingold.tilford)
@ 


And what is the fitness of all genotypes?

<<>>=
gfs <- evalAllGenotypes(cbn1, order = FALSE)

gfs[1:15, ]
@

You can verify that, for each genotype, every mutation that is present
without all of its dependencies present contributes a $(1 - 0.9)$
multiplier, and every mutation whose direct parents are all present
contributes a $(1 + 0.1)$ multiplier. For example, genotypes ``a'', or
``b'', or ``d'', or ``e'' have fitness $(1 + 0.1)$, genotype ``a, b, c''
has fitness $(1 + 0.1)^3$, but genotype ``a, c'' has fitness
$(1 + 0.1) (1 - 0.9) = 0.11$.
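
The same arithmetic can be reproduced by hand in base R. The following is
a minimal sketch for AND dependencies only, with a single \texttt{s} and
\texttt{sh} for all genes as in this example; \texttt{ex\_fitness\_and} is
a hypothetical helper written for this illustration, not the code that
OncoSimulR runs.

<<>>=
## Parent-child edges of the example above, under a
## different name so that nothing else is overwritten.
ex_poset <- data.frame(
    parent = c(rep("Root", 4), "a", "b", "d", "e", "c"),
    child = c("a", "b", "d", "e", "c", "c", rep("g", 3)),
    stringsAsFactors = FALSE)

## Hypothetical helper: fitness of a genotype (character
## vector of mutated genes) under AND dependencies.
ex_fitness_and <- function(genotype, poset,
                           s = 0.1, sh = -0.9) {
    present <- c("Root", genotype)
    terms <- vapply(genotype, function(g) {
        parents <- poset$parent[poset$child == g]
        ## AND: all direct parents must be present
        if (all(parents %in% present)) 1 + s else 1 + sh
    }, numeric(1))
    prod(terms)
}

ex_fitness_and(c("a", "b", "c"), ex_poset) ## (1 + 0.1)^3
ex_fitness_and(c("a", "c"), ex_poset) ## (1 + 0.1) * (1 - 0.9)
@ 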


\subsubsection{A second conjunction example}\label{cbn2}


Let's try a first attempt at a somewhat more complex example, where the
fitness consequences of different genes differ.
<<>>=

c1 <- data.frame(parent = c(rep("Root", 4), "a", "b", "d", "e", "c"),
                 child = c("a", "b", "d", "e", "c", "c", rep("g", 3)),
                 s = c(0.01, 0.02, 0.03, 0.04, 0.1, 0.1, rep(0.2, 3)),
                 sh = c(rep(0, 4), c(-.1, -.2), c(-.05, -.06, -.07)),
                 typeDep = "MN")

try(fc1 <- allFitnessEffects(c1))

@ 

That is an error because the ``sh'' varies within a child, and we do not
allow that for a poset-type specification, as it is ambiguous. If you need
arbitrary fitness values for arbitrary combinations of genotypes, you can
specify them using epistatic effects as in section \ref{epi} and order
effects as in section \ref{oe}.

Why do we need to specify as many ``s'' and ``sh'' as there are rows (or a
single one, which gets expanded to that many) when the ``s'' and ``sh''
are properties of the child node, not of the edges? Because, for ease, we
use a data.frame.

%% (By the way, yes, we convert all factors to strings in the parent, child,
%% and typeDep columns, so no need to specify \texttt{stringsAsFactor = TRUE}).


We fix the error in our specification. Notice that the ``sh'' is not set
to $-1$ in these examples. If you want strict compliance with the poset
restrictions, you should set $sh = -1$ or, better yet, $sh = -\infty$ (see
section \ref{noviab}), but having an $sh > -1$ will lead to fitnesses that
are $> 0$ and, thus, is a way of modeling small deviations from the poset
(see discussion in \cite{Diaz-Uriarte2015}).

%% In these examples, the reason to set ``sh'' to values larger than $-1$ and
%% different among the genes is to allow us to easily see the actual,
%% different, terms that enter into the multiplication of the fitness effects
%% (and, also, to make it easier to catch bugs).

Note that for those nodes that depend only on ``Root'' the type of
dependency is irrelevant.

<<>>=

c1 <- data.frame(parent = c(rep("Root", 4), "a", "b", "d", "e", "c"),
                 child = c("a", "b", "d", "e", "c", "c", rep("g", 3)),
                 s = c(0.01, 0.02, 0.03, 0.04, 0.1, 0.1, rep(0.2, 3)),
                 sh = c(rep(0, 4), c(-.9, -.9), rep(-.95, 3)),
                 typeDep = "MN")

cbn2 <- allFitnessEffects(c1)

@ 

%% We can get a graphical representation using the default ``graphNEL''
%% <<fig.height=3>>=
%% plot(cbn2)
%% @ 

%% or one using ``igraph'':
%% <<fig.height=5>>=
%%plot(cbn2, "igraph", layout = layout.reingold.tilford)
%% @ 

%% (since this is a tree, the reingold.tilford layout is probably the best here).

We could get graphical representations but the figures would
be the same as in the example in section \ref{cbn1}, since the structure
has not changed, only the numeric values.

What is the fitness of all possible genotypes? Here, order of events
\textit{per se} does not matter, beyond that considered in the poset. In
other words, the fitness of genotype ``a, b, c'' is the same no matter how
we got to ``a, b, c''. What matters is whether the genes on which
each of ``a'', ``b'', and ``c'' depend are present (I only show the first
10 genotypes):

<<>>=
gcbn2 <- evalAllGenotypes(cbn2, order = FALSE)
gcbn2[1:10, ]
@ 


Of course, if we were to look at genotypes while taking into account the
order of occurrence of mutations, we would see no differences:

<<>>=
gcbn2o <- evalAllGenotypes(cbn2, order = TRUE, max = 1956)
gcbn2o[1:10, ]
@ 

(The \texttt{max = 1956} is there so that we show all the genotypes, even
if they are more than 256, the default.)
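
The value 1956 can be checked by hand (this count is ours, not output from
the package). With $n = 6$ genes, the number of non-empty genotypes when
order matters is the number of ordered selections of $k = 1, \ldots, 6$
distinct genes:
\[
  \sum_{k=1}^{6} \frac{6!}{(6-k)!} = 6 + 30 + 120 + 360 + 720 + 720 = 1956.
\]
When order is ignored there are only $2^6 - 1 = 63$ non-empty genotypes,
which is below the default limit, so no \texttt{max} argument was needed
in the unordered call.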

You can check the output and verify things are as they should be. For instance:

<<>>=
all.equal(
        gcbn2[c(1:21, 22, 28, 41, 44, 56, 63 ) , "Fitness"],
        c(1.01, 1.02, 0.1, 1.03, 1.04, 0.05,
          1.01 * c(1.02, 0.1, 1.03, 1.04, 0.05),
          1.02 * c(0.10, 1.03, 1.04, 0.05),
          0.1 * c(1.03, 1.04, 0.05),
          1.03 * c(1.04, 0.05),
          1.04 * 0.05,
          1.01 * 1.02 * 1.1,
          1.01 * 0.1 * 0.05,
          1.03 * 1.04 * 0.05,
          1.01 * 1.02 * 1.1 * 0.05,
          1.03 * 1.04 * 1.2 * 0.1, ## notice this
          1.01 * 1.02 * 1.03 * 1.04 * 1.1 * 1.2
          ))
@ 

A particular one that is important to understand is

<<>>=
gcbn2[56, ] ## this is d, e, g, c
all.equal(gcbn2[56, "Fitness"], 1.03 * 1.04 * 1.2 * 0.10)
@ 

where ``g'' is taken as if its dependencies are satisfied (as ``c'',
``d'', and ``e'' are present) even when the dependencies of ``c'' are not
satisfied (and that is why the term for ``c'' is $(1 - 0.9) = 0.1$).
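
Spelling out that product term by term may help (a hand computation that
assumes the multiplicative rule used throughout this section). Genes ``d''
and ``e'' depend only on ``Root'', so they contribute $(1 + 0.03)$ and
$(1 + 0.04)$; ``g'' finds its three parents present and contributes
$(1 + 0.2)$; ``c'' lacks both ``a'' and ``b'' and contributes
$(1 + \mathit{sh}) = (1 - 0.9)$:
\[
  1.03 \times 1.04 \times 1.2 \times 0.1 = 0.128544.
\]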


\subsubsection{A semimonotone or ``OR'' example}\label{mn1}

We will reuse the above example, changing the type of relationship:
<<>>=

s1 <- data.frame(parent = c(rep("Root", 4), "a", "b", "d", "e", "c"),
                 child = c("a", "b", "d", "e", "c", "c", rep("g", 3)),
                 s = c(0.01, 0.02, 0.03, 0.04, 0.1, 0.1, rep(0.2, 3)),
                 sh = c(rep(0, 4), c(-.9, -.9), rep(-.95, 3)),
                 typeDep = "SM")

smn1 <- allFitnessEffects(s1)

@ 

It looks like this (where edges are shown in blue to denote the
semimonotone relationship):
<<fig.height=3>>=
plot(smn1)
@ 


<<>>=
gsmn1 <- evalAllGenotypes(smn1, order = FALSE)

@ 

Having just one parental dependency satisfied is now enough, in contrast
to what happened before. For instance:

<<>>=
gcbn2[c(8, 12, 22), ]
gsmn1[c(8, 12, 22), ]

gcbn2[c(20:21, 28), ]
gsmn1[c(20:21, 28), ]
@ 
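
A hand computation for genotype ``a, c'' makes the contrast concrete (our
arithmetic from the two specifications, assuming the same multiplicative
rule as before; only one of the two parents of ``c'' is present):
\begin{align*}
  \text{conjunction (MN):}  &\quad (1 + 0.01) \times (1 - 0.9) = 0.101,\\
  \text{semimonotone (SM):} &\quad (1 + 0.01) \times (1 + 0.1) = 1.111.
\end{align*}
In contrast, genotype ``a, b, c'' has both parents of ``c'' present, so
its fitness is $1.01 \times 1.02 \times 1.1 = 1.13322$ under both types of
relationship.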


\subsubsection{An ``XMPN'' or ``XOR'' example}\label{xor1}

Again, we reuse the example above, changing the type of relationship:

<<>>=

x1 <- data.frame(parent = c(rep("Root", 4), "a", "b", "d", "e", "c"),
                 child = c("a", "b", "d", "e", "c", "c", rep("g", 3)),
                 s = c(0.01, 0.02, 0.03, 0.04, 0.1, 0.1, rep(0.2, 3)),
                 sh = c(rep(0, 4), c(-.9, -.9), rep(-.95, 3)),
                 typeDep = "XMPN")

xor1 <- allFitnessEffects(x1)

@ 


It looks like this (edges in red to denote the ``XOR'' relationship):
<<fig.height=3>>=
plot(xor1)
@ 

<<>>=

gxor1 <- evalAllGenotypes(xor1, order = FALSE)

@ 


Whenever ``c'' is present with both ``a'' and ``b'', the fitness component
for ``c'' will be $(1 - 0.9)$. Similarly for ``g'' (if more than one of
``d'', ``e'', or ``c'' is present, it will show as $(1 - 0.95)$). For example:

<<>>=
gxor1[c(22, 41), ] 
c(1.01 * 1.02 * 0.1, 1.03 * 1.04 * 0.05)
@ 

However, having just both ``a'' and ``b'' is identical to the cases with
CBN and with the semimonotone relationship (see sections \ref{cbn2} and
\ref{mn1}). If you want the joint presence of ``a'' and ``b'' to result in
different fitness than the product of the individual terms, without
considering the presence of ``c'', you can specify that using general
epistatic effects (section
\ref{epi}).%% ; XOR relationships of these kind are, actually,
%% examples of synthetic lethality, which are shown in section \ref{sl}.


We also see a very different pattern compared to CBN (section \ref{cbn2})
here:
<<>>=
gxor1[28, ] 
1.01 * 1.1 * 1.2
@ 

as exactly one of the dependencies of each of ``c'' and ``g'' is satisfied.

But 
<<>>=
gxor1[44, ] 
1.01 * 1.02 * 0.1 * 1.2
@ 
is the result of a $0.1$ for ``c'' (and a $1.2$ for ``g'' that has exactly
one of its dependencies satisfied).


\subsubsection{Posets: the three types of relationships}\label{p3}

<<>>=

p3 <- data.frame(parent = c(rep("Root", 4), "a", "b", "d", "e", "c", "f"),
                  child = c("a", "b", "d", "e", "c", "c", "f", "f", "g", "g"),
                  s = c(0.01, 0.02, 0.03, 0.04, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3),
                  sh = c(rep(0, 4), c(-.9, -.9), c(-.95, -.95), c(-.99, -.99)),
                  typeDep = c(rep("--", 4), 
                      "XMPN", "XMPN", "MN", "MN", "SM", "SM"))
fp3 <- allFitnessEffects(p3)
@ 

This is what it looks like:
<<fig.height=3>>=
plot(fp3)
@ 

We can also use ``igraph'':

<<fig.height=6>>=
plot(fp3, "igraph", layout = layout.reingold.tilford)
@ 


<<>>=

gfp3 <- evalAllGenotypes(fp3, order = FALSE)

@ 


Let's look at a few:
<<>>=
gfp3[c(9, 24, 29, 59, 60, 66, 119, 120, 126, 127), ]

c(1.01 * 1.1, 1.03 * .05, 1.01 * 1.02 * 0.1, 0.1 * 0.05 * 1.3,
  1.03 * 1.04 * 1.2, 1.01 * 1.02 * 0.1 * 0.05,
  0.1 * 1.03 * 1.04 * 1.2 * 1.3,
  1.01 * 1.02 * 0.1 * 1.03 * 1.04 * 1.2,
  1.02 * 1.1 * 1.03 * 1.04 * 1.2 * 1.3,
  1.01 * 1.02 * 1.03 * 1.04 * 0.1 * 1.2 * 1.3)

@ 

As before, looking at the order of mutations makes no difference (look at
the test directory to see a test that verifies this assertion).


\subsection{Modules}\label{modules0}
As already mentioned, we can think of all the fitness effects in terms
not of individual genes but, rather, of modules. This idea is discussed in,
for example, \cite{Raphael2014a, Gerstung2011}: the restrictions encoded
in, say, the DAGs can be considered to apply not to genes, but to
modules, where each module is a set of genes (and the intersection between
modules is the empty set). Modules, then, play the role of a ``union
operation'' over sets of genes. Of course, if we can use modules for the
restrictions in the DAGs we should also be able to use them for epistasis
and order effects, as we will see later (e.g., section \ref{oemod}).


\subsubsection{What does a module provide}\label{module-what-for}

Modules can provide very compact ways of specifying relationships when you
want to, well, model the existence of modules. For simplicity suppose
there is a module, ``A'', made of genes ``a1'' and ``a2'', and a module
``B'', made of a single gene ``b1''. Module ``B'' can mutate if module
``A'' is mutated, but mutating both ``a1'' and ``a2'' provides no
additional fitness advantage compared to mutating only a single one of
them.  We can specify this as:

<<>>=
s <- 0.2
sboth <- (1/(1 + s)) - 1
m0 <- allFitnessEffects(data.frame(
    parent = c("Root", "Root", "a1", "a2"),
    child = c("a1", "a2", "b1", "b1"),
    s = s,
    sh = -1,
    typeDep = "OR"),
                        epistasis = c("a1:a2" = sboth))
evalAllGenotypes(m0, order = FALSE, addwt = TRUE)
@ 

Note that we need to add an epistasis term, with value \texttt{sboth},
to capture the idea that mutating both ``a1'' and ``a2''
provides no additional fitness advantage compared to mutating only a
single one of them; see details in section \ref{epi}.
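
The value of \texttt{sboth} follows from requiring the ``a1, a2'' double
mutant to have the same fitness as either single mutant (a short
derivation in our notation, where $s_{both}$ stands for \texttt{sboth}).
The double mutant accumulates the two individual terms and the epistasis
term, so
\[
  (1 + s)(1 + s)(1 + s_{both}) = (1 + s)
  \quad \Longrightarrow \quad
  s_{both} = \frac{1}{1 + s} - 1 = \frac{1}{1.2} - 1 \approx -0.167 .
\]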


Now, specify it using modules:
<<>>=
s <- 0.2
m1 <- allFitnessEffects(data.frame(
    parent = c("Root", "A"),
    child = c("A", "B"),
    s = s,
    sh = -1,
    typeDep = "OR"),
                        geneToModule = c("Root" = "Root",
                                         "A" = "a1, a2",
                                         "B" = "b1"))
evalAllGenotypes(m1, order = FALSE, addwt = TRUE)
@ 

This captures the ideas directly. The typing savings here are small, but
they can be large with modules with many genes.


%% %% \begin{tabular} {c c c}
%% %%   A & B & Fitness \\
%% %%   \hline
%% %%   wt&wt& 1 \\
%% %%   wt&M& sb \\
%% %%   M&wt& sa\\
%% %%   M&M& sab)\\
%% %%   \hline
%% %% \end{tabular}

%% with A being 1, 2, and B 3, 4.

%% and having in a tree A depends on Root and B depends on A


%% \begin{tabular} {c c c}
%%   model & Fitness satisfied & fitness not satisf\\
%%   \hline
%%   0 , 1 & s \\
%%   0 , 2 & s \\

%%   1 , 3 & s3 & sm \\
%%   2 , 3 & s3 & sm \\

%%   1 , 4 & s3 & sm \\
%%   2 , 4 & s3 & sm \\
  
%%   1 : 2 & s12 \\
%%   3:  4 & s34 \\
  

%%   \hline
%% \end{tabular}

%% just give the specification, the full one.


%% and write equivalendes of s12 as a function of Sa, S34 as a function of
%% Sb, etc.

\subsubsection{Specifying modules}\label{modules}

How do you specify modules? The general procedure is simple: you pass a
vector that makes explicit the mapping from modules to sets of genes. We
just saw an example. There are several additional examples, such as those
in sections \ref{pm3}, \ref{oemod}, and \ref{epimod}.

%% Why do we force you to specify ``Root'' = ``Root''? We could check for it,
%% and add it if it is not present. But we want you to be explicit (and we
%% want to avoid you shooting yourself in the foot having a gene that is not
%% the root of the tree but is called ``Root'', etc).


It is important to note that, once you specify modules, we expect all of
the relationships (except those that involve the non-interacting genes) to
be specified as modules. Thus, all elements of the epistasis, posets (the
DAGs) and order effects components should be specified in terms of
modules. But you can, of course, specify a module as containing a single
gene (and a single gene with the same name as the module).


What about the ``Root'' node? If you use a ``restriction table'', that
restriction table (that DAG) must have a node named ``Root'' and in the
mapping of genes to module there \textbf{must} be a first entry that has a
module and gene named ``Root'', as we saw above with \texttt{geneToModule
  = c("Root" = "Root", ...}. We force you to do this to be explicit about
the ``Root'' node. This is not needed (though it does not hurt) with
other fitness specifications. For instance, if we have a model with two
modules, one of them with two genes (see details in section
\ref{mod-no-epi}) we do not need to pass a ``Root'' as in

<<>>=
fnme <- allFitnessEffects(epistasis = c("A" = 0.1,
                                        "B" = 0.2),
                          geneToModule = c("A" = "a1, a2",
                                           "B" = "b1"))
evalAllGenotypes(fnme, order = FALSE, addwt = TRUE)
@ 

but it is also OK to have a ``Root'' in the \texttt{geneToModule}:

<<>>=
fnme2 <- allFitnessEffects(epistasis = c("A" = 0.1,
                                        "B" = 0.2),
                          geneToModule = c(
                              "Root" = "Root",
                              "A" = "a1, a2",
                              "B" = "b1"))
evalAllGenotypes(fnme2, order = FALSE, addwt = TRUE)
@ 


\subsubsection{Modules and posets again: the three types of relationships
  and modules}\label{pm3}


We use the same specification of poset, but add modules. To keep it
manageable, we only add a few genes for some modules, and have some
modules with a single gene. Beware that the number of genotypes is
starting to grow quite fast, though.  We capitalize to differentiate
modules (capital letters) from genes (lowercase with a number), but this
is not needed.


<<>>=
p4 <- data.frame(parent = c(rep("Root", 4), "A", "B", "D", "E", "C", "F"),
                  child = c("A", "B", "D", "E", "C", "C", "F", "F", "G", "G"),
                  s = c(0.01, 0.02, 0.03, 0.04, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3),
                  sh = c(rep(0, 4), c(-.9, -.9), c(-.95, -.95), c(-.99, -.99)),
                  typeDep = c(rep("--", 4), 
                      "XMPN", "XMPN", "MN", "MN", "SM", "SM"))
fp4m <- allFitnessEffects(p4,
                          geneToModule = c("Root" = "Root", "A" = "a1",
                              "B" = "b1, b2", "C" = "c1",
                              "D" = "d1, d2", "E" = "e1",
                              "F" = "f1, f2", "G" = "g1"))
@ 

By default, plotting shows the modules:
<<fig.height=3>>=
plot(fp4m)
@ 

but we can show the gene names instead of the module names:

<<fig.height=3>>=
plot(fp4m, expandModules = TRUE)
@ 

or

<<fig.height=8>>=
plot(fp4m, "igraph", layout = layout.reingold.tilford, 
     expandModules = TRUE)

@ 

We obtain the fitness of all genotypes in the usual way:
<<>>=
gfp4 <- evalAllGenotypes(fp4m, order = FALSE, max = 1024)
@ 

Let's look at a few of those:
<<>>=
gfp4[c(12, 20, 21, 40, 41, 46, 50, 55, 64, 92, 155, 157, 163, 372, 632, 828), ]

c(1.01 * 1.02, 1.02, 1.02 * 1.1, 0.1 * 1.3, 1.03, 
  1.03 * 1.04, 1.04 * 0.05, 0.05 * 1.3,  
  1.01 * 1.02 * 0.1, 1.02 * 1.1, 0.1 * 0.05 * 1.3,
  1.03 * 0.05, 1.03 * 0.05, 1.03 * 1.04 * 1.2, 1.03 * 1.04 * 1.2, 
  1.02 * 1.1 * 1.03 * 1.04 * 1.2 * 1.3)

@ 


\subsection{Order effects} \label{oe}

As explained in the introduction (section \ref{intro}), by order effects we mean a
phenomenon such as the one shown empirically by \cite{Ortmann2015}: the
fitness of a double mutant ``A'', ``B'' is different depending on whether
``A'' was acquired before ``B'' or ``B'' before ``A''. This, of course,
can be generalized to more than two genes.

Note that these order effects are different from the order restrictions
discussed in section \ref{posetslong}. There we might say that
acquiring ``B'' depends on, or is facilitated by, having ``A'' mutated (and,
unless we allowed for multiple mutations, having ``A'' mutated means
having ``A'' mutated before ``B''). However, once you have the genotype
``A, B'', its fitness does not depend on the order in which ``A'' and
``B'' appeared.


\subsubsection{Order effects: three-gene orders}

Consider this case, where three specific three-gene orders and two
two-gene orders (one of them a subset of one of the three) lead to
different fitness compared to the wild-type. We also add modules, to show
their usage (but we limit ourselves to one gene per module here). 

Order effects are specified using the notation $x > y$, which means that
the order effect is satisfied when module $x$ is mutated before module $y$.

<<>>=

o3 <- allFitnessEffects(orderEffects = c(
                            "F > D > M" = -0.3,
                            "D > F > M" = 0.4,
                            "D > M > F" = 0.2,
                            "D > M"     = 0.1,
                            "M > D"     = 0.5),
                        geneToModule =
                            c("M" = "m",
                              "F" = "f",
                              "D" = "d") )


(ag <- evalAllGenotypes(o3, addwt = TRUE))
@ 

%% <<>>=
%% o3 <- allFitnessEffects(orderEffects = c(
%%                             "F > D > M" = -0.3,
%%                             "D > F > M" = 0.4,
%%                             "D > M > F" = 0.2,
%%                             "D > M"     = 0.1,
%%                             "M > D"     = 0.5),
%%                         geneToModule =
%%                             c("Root" = "Root",
%%                               "M" = "m",
%%                               "F" = "f",
%%                               "D" = "d") )
%% (ag <- evalAllGenotypes(o3, addwt = TRUE))
%% @ 

(The meaning of the notation in the output table is as follows: ``WT''
denotes the wild-type, or non-mutated clone. The notation $x > y$ means
that a mutation in ``x'' happened before a mutation in ``y''. A genotype
$x > y\ \_\ z$ means that a mutation in ``x'' happened before a
mutation in ``y''; there is also a mutation in ``z'', but that is a gene
for which order does not matter).


The values for the first nine genotypes come directly from the fitness
specifications. The 10th genotype matches $D > F > M$ ($= (1 + 0.4)$)
but also $D > M$ ($(1 + 0.1)$). The 11th matches $D > M > F$ and $D >
M$. The 12th matches $F > D > M$ but also $D > M$. Etc.
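
The fitness of those three genotypes can be worked out by hand (our
arithmetic from the specification above, assuming that the effects of all
matching patterns are multiplied, as with the other fitness effects):
\begin{align*}
  d > f > m: &\quad (1 + 0.4) \times (1 + 0.1) = 1.54,\\
  d > m > f: &\quad (1 + 0.2) \times (1 + 0.1) = 1.32,\\
  f > d > m: &\quad (1 - 0.3) \times (1 + 0.1) = 0.77.
\end{align*}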


\subsubsection{Order effects and modules with multiple genes}\label{oemod}

Consider the following case:
<<>>=

ofe1 <- allFitnessEffects(orderEffects = c("F > D" = -0.3, "D > F" = 0.4),
                          geneToModule =
                              c("F" = "f1, f2",
                                "D" = "d1, d2") )

ag <- evalAllGenotypes(ofe1)

@ 

There are four genes, $d1, d2, f1, f2$, where each $d$ belongs to module
$D$ and each $f$ belongs to module $F$.

What to expect for cases such as $d1 > f1$ or $f1 > d1$ is clear, as shown in

<<>>=
ag[5:16,]
@ 

Likewise, cases such as $d1 > d2 > f1$ or $f2 > f1 > d1$ are clear,
because in terms of modules they map to $ D > F$ or $F > D$: the observed
order of mutation $d1 > d2 > f1$ means that module $D$ was mutated first
and module $F$ was mutated second. Similarly for $d1 > f1 > f2$ or
$f1 > d1 > d2$: those map to $D > F$ and $F > D$. We can see the fitness
of those four cases in:

<<>>=
ag[c(17, 39, 19, 29), ]
@ 

and they correspond to the values of those order effects, where $F > D =
(1 - 0.3)$ and $D > F = (1 + 0.4)$:

<<>>=
ag[c(17, 39, 19, 29), "Fitness"] == c(1.4, 0.7, 1.4, 0.7)
@ 

What if we match several patterns? For example, $d1 > f1 > d2 > f2$ and
$d1 > f1 > f2 > d2$? The first maps to $D > F > D > F$ and the second to
$D > F > D$. But since we are concerned with which one happened first and
which happened second we should expect those two to correspond to the same
fitness, that of pattern $D > F$, as is the case:

<<>>=
ag[c(43, 44),]
ag[c(43, 44), "Fitness"] == c(1.4, 1.4)
@ 
More generally, that applies to all the patterns that start with one of
the ``d'' genes:
<<>>=
all(ag[41:52, "Fitness"] == 1.4)
@ 

Similar arguments apply to the opposite pattern, $F > D$, which applies to
all the possible gene mutation orders that start with one of the ``f''
genes. For example:
<<>>=
all(ag[53:64, "Fitness"] == 0.7)
@ 


\subsubsection{Order and modules with 325 genotypes}
We can of course have more than two genes per module. This just repeats
the above, with five genes (there are 325 genotypes, and that is why we
pass the ``max'' argument to \Rfunction{evalAllGenotypes}, to allow for
more than the default 256).

<<>>=

ofe2 <- allFitnessEffects(orderEffects = c("F > D" = -0.3, "D > F" = 0.4),
                          geneToModule =
                              c("F" = "f1, f2, f3",
                                "D" = "d1, d2") )
ag2 <- evalAllGenotypes(ofe2, max = 325)

@ 

We can verify that any combination that starts with a ``d'' gene and then
contains at least one ``f'' gene will have a fitness of $1 + 0.4$, and any
combination that starts with an ``f'' gene and contains at least one ``d''
gene will have a fitness of $1 - 0.3$.  All other genotypes have a
fitness of 1:

<<>>=
all(ag2[grep("^d.*f.*", ag2[, 1]), "Fitness"] == 1.4)
all(ag2[grep("^f.*d.*", ag2[, 1]), "Fitness"] == 0.7)
oe <- c(grep("^f.*d.*", ag2[, 1]), grep("^d.*f.*", ag2[, 1]))
all(ag2[-oe, "Fitness"] == 1)
@ 
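
The sizes of those three groups can also be counted by hand (a
combinatorial check of ours, not output from the package). With five
genes, the number of ordered genotypes that start with one given gene is
$\sum_{j=0}^{4} 4!/(4-j)! = 1 + 4 + 12 + 24 + 24 = 65$. There are
$2 + 2 = 4$ ordered genotypes made only of ``d'' genes and
$3 + 6 + 6 = 15$ made only of ``f'' genes, so
\begin{align*}
  \text{start with a d gene, contain an f gene:} &\quad 2 \times 65 - 4 = 126,\\
  \text{start with an f gene, contain a d gene:} &\quad 3 \times 65 - 15 = 180,\\
  \text{only d genes or only f genes:}           &\quad 4 + 15 = 19,
\end{align*}
and $126 + 180 + 19 = 325$, the total number of genotypes.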


\subsubsection{Order effects and genes without interactions}

We will now look at both order effects and genes without interactions. To
make things more interesting, we name genes so that the ordered names do
not split nicely between those with and those without order effects (this,
thus, also serves as a test of messy orders of names).

<<>>=

foi1 <- allFitnessEffects(
    orderEffects = c("D>B" = -0.2, "B > D" = 0.3),
    noIntGenes = c("A" = 0.05, "C" = -.2, "E" = .1))

@ 

You can get a verbose view of what the gene names and modules are (and
their automatically created numeric codes) by:

<<>>=
foi1[c("geneModule", "long.geneNoInt")]
@ 

We can get the fitness of all genotypes (we set \texttt{max = 325} because
that is the number of possible genotypes):

<<>>=
agoi1 <- evalAllGenotypes(foi1,  max = 325)
head(agoi1)
@ 


Now:
<<>>=
rn <- 1:nrow(agoi1)
names(rn) <- agoi1[, 1]

agoi1[rn[LETTERS[1:5]], "Fitness"] == c(1.05, 1, 0.8, 1, 1.1)

@ 

According to the fitness effects we have specified, we also know that any
genotype with only two mutations, one of which is either ``A'', ``C'' or
``E'' and the other is ``B'' or ``D'' will have the fitness corresponding
to ``A'', ``C'' or ``E'', respectively:

<<>>=
agoi1[grep("^A > [BD]$", names(rn)), "Fitness"] == 1.05
agoi1[grep("^C > [BD]$", names(rn)), "Fitness"] == 0.8
agoi1[grep("^E > [BD]$", names(rn)), "Fitness"] == 1.1
agoi1[grep("^[BD] > A$", names(rn)), "Fitness"] == 1.05
agoi1[grep("^[BD] > C$", names(rn)), "Fitness"] == 0.8
agoi1[grep("^[BD] > E$", names(rn)), "Fitness"] == 1.1
@ 


We will not be playing many additional games with regular expressions, but
let us check those that start with ``D'' and have all the other mutations,
which occupy rows 230 to 253; fitness should be equal (within numerical
error, because of floating point arithmetic) to the order effect of having
``D'' before ``B'' times the other effects,
$(1 - 0.2) * 1.05 * 0.8 * 1.1 = 0.7392$:

<<>>=
all.equal(agoi1[230:253, "Fitness"] , rep((1 - 0.2) * 1.05 * 0.8 * 1.1, 24))
@ 
and that will also be the value of any genotype with the five mutations
where ``D'' comes before ``B'' such as those in rows 260 to 265, 277, or
322 and 323, but it will be equal to $(1 + 0.3) * 1.05 * 0.8 * 1.1 =
1.2012$ in those where ``B'' comes before ``D''. Analogous arguments apply
to four, three, and two mutation genotypes.


\subsection{Synthetic viability}\label{sv}

Synthetic viability and synthetic lethality (e.g., \cite{Ashworth2011,
  Hartman2001}) are just special cases of epistasis (section \ref{epi})
but we deal with them here separately.

\subsubsection{A simple synthetic viability example}
A simple and extreme example of synthetic viability is shown in the
following table, where the joint mutant has fitness larger than the wild
type, but each single mutant is lethal.


\begin{tabular} {c c c}
  A & B & Fitness \\
  \hline
  wt&wt& 1 \\
  wt&M& 0 \\
  M&wt& 0\\
  M&M& (1 + s)\\
  \hline
\end{tabular}

where ``wt'' denotes wild type and ``M'' denotes mutant.


We can specify this (setting $s = 0.2$) as (I play around with spaces, to
show there is a certain flexibility with them):

<<>>=
s <- 0.2
sv <- allFitnessEffects(epistasis = c("-A : B" = -1,
                                      "A : -B" = -1,
                                      "A:B" = s))
@ 

Now, let's look at all the genotypes (we use ``addwt'' to also get the wt,
which by decree has fitness of 1), and disregard order:

<<>>=
(asv <- evalAllGenotypes(sv, order = FALSE, addwt = TRUE))
@ 

Asking the program to consider the order of mutations of course makes no
difference:

<<>>=
evalAllGenotypes(sv, order = TRUE, addwt = TRUE)
@ 

Another example of synthetic viability is shown in section \ref{misra1b}.

Of course, if multiple simultaneous mutations are not possible in the
simulations, it is not possible to go from the wildtype to the double
mutant in this model where the single mutants are not viable.

\subsubsection{Synthetic viability using Bozic model}\label{fit-neg-pos}

If we were to use the above specification with Bozic's models, we might
not get what we think we should get:

<<>>=
evalAllGenotypes(sv, order = FALSE, addwt = TRUE, model = "Bozic")
@

What gives here? The simulation code would alert you to this (see section
\ref{fit-neg-pos}) in this particular case because there are ``-1'',
which might indicate that this is not what you want. The problem is that
you probably want the death rate to be infinity (when we used birth rates,
the birth rate was 0, so the clone was not viable; see section \ref{noviab}).

Let us say so explicitly:

<<>>=
s <- 0.2
svB <- allFitnessEffects(epistasis = c("-A : B" = -Inf,
                                      "A : -B" = -Inf,
                                      "A:B" = s))
evalAllGenotypes(svB, order = FALSE, addwt = TRUE, model = "Bozic")
@


Likewise, values of $s$ larger than one have no effect beyond setting $s =
1$ (a single term of $(1 - 1)$ will drive the product to 0 and, as we
cannot allow negative death rates, negative values are set to 0):


<<>>=

s <- 1
svB1 <- allFitnessEffects(epistasis = c("-A : B" = -Inf,
                                       "A : -B" = -Inf,
                                       "A:B" = s))

evalAllGenotypes(svB1, order = FALSE, addwt = TRUE, model = "Bozic")


s <- 3
svB3 <- allFitnessEffects(epistasis = c("-A : B" = -Inf,
                                       "A : -B" = -Inf,
                                       "A:B" = s))

evalAllGenotypes(svB3, order = FALSE, addwt = TRUE, model = "Bozic")


@
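
The outputs in this section are easier to read with the rule for death
rates written down. As a sketch, assuming (as the remark about the product
suggests) that under the Bozic model each matching effect contributes a
factor $(1 - s_i)$ to the death rate $D$ of genotype $g$, with negative
results set to 0:
\[
  D(g) = \prod_{i} (1 - s_i).
\]
Under that assumption, a term $s_i = -1$ gives a factor $(1 + 1) = 2$,
which raises the death rate but keeps it finite, whereas $s_i = -\infty$
gives an infinite death rate. For the double mutant, $s = 0.2$ gives
$D = 0.8$, $s = 1$ gives $D = 0$, and $s = 3$ gives $(1 - 3) = -2$, which
is set to 0.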

Of course, death rates of 0.0 are likely to lead to trouble down the road,
when we actually conduct simulations (see section \ref{ex-0-death}).


\subsubsection{Synthetic viability, non-zero fitness, and modules}

This is a slightly more elaborate case, where there is one module and the
single mutants have different fitness between themselves, which is
non-zero. Without the modules, this is the same as in Misra et
al. \cite{Misra2014}, Figure 1b, which we go over in section \ref{misra}.


\begin{tabular} {c c c}
  A & B & Fitness \\
  \hline
  wt&wt& 1 \\
  wt&M& $1 + s_b$ \\
  M&wt& $1 + s_a$\\
  M&M& $1 + s_{ab}$\\
  \hline
\end{tabular}

where $s_a, s_b < 0$ but $s_{ab} > 0$. 


<<>>=
sa <- -0.1
sb <- -0.2
sab <- 0.25
sv2 <- allFitnessEffects(epistasis = c("-A : B" = sb,
                             "A : -B" = sa,
                             "A:B" = sab),
                         geneToModule = c(
                             "A" = "a1, a2",
                             "B" = "b"))
evalAllGenotypes(sv2, order = FALSE, addwt = TRUE)
@ 

And if we look at order, of course it makes no difference:

<<>>=
evalAllGenotypes(sv2, order = TRUE, addwt = TRUE)
@ 

%% And it looks like:

%% <<>>=
%% plot(sv2)
%% @ 

%% a fairly simple plot.

\subsection{Synthetic mortality or synthetic lethality}\label{sl}

In contrast to section \ref{sv}, here the joint mutant has decreased viability:

\begin{tabular} {c c c}
  A & B & Fitness \\
  \hline
  wt&wt& 1 \\
  wt&M& $1 + s_b$ \\
  M&wt& $1 + s_a$\\
  M&M& $1 + s_{ab}$\\
  \hline
\end{tabular}

where $s_a, s_b > 0$ but $s_{ab} < 0$. 


<<>>=
sa <- 0.1
sb <- 0.2
sab <- -0.8
sm1 <- allFitnessEffects(epistasis = c("-A : B" = sb,
                             "A : -B" = sa,
                             "A:B" = sab))
evalAllGenotypes(sm1, order = FALSE, addwt = TRUE)

@ 

And if we look at order, of course it makes no difference:

<<>>=
evalAllGenotypes(sm1, order = TRUE, addwt = TRUE)
@ 

\subsection{Epistasis}\label{epi}

\subsubsection{Epistasis: two alternative specifications}\label{e2}

We want the following mapping of genotypes to fitness:

\begin{tabular} {c c c}
  A & B & Fitness \\
  \hline
  wt&wt& 1 \\
  wt&M& $1 + s_b$ \\
  M&wt& $1 + s_a$\\
  M&M& $1 + s_{ab}$\\
  \hline
\end{tabular}
 
Suppose that the actual numerical values are $s_a = 0.2, s_b = 0.3, s_{ab}
= 0.7$.

We specify the above as follows: 
<<>>=
sa <- 0.2
sb <- 0.3
sab <- 0.7

e2 <- allFitnessEffects(epistasis =
                            c("A: -B" = sa,
                              "-A:B" = sb,
                              "A : B" = sab))
evalAllGenotypes(e2, order = FALSE, addwt = TRUE)

@ 

That uses the ``-'' specification, so we explicitly exclude some patterns:
with ``A:-B'' we say ``A when there is no B''.

But we can also use a specification where we do not use the ``-''. That
requires a different numerical value of the interaction, because now, as
we are rewriting the interaction term as genotype ``A is mutant, B is
mutant'' the double mutant will incorporate the effects of ``A mutant'',
``B mutant'' and ``both A and B mutants''. We can define a new $s_2$ that
satisfies $(1 + s_{ab}) = (1 + s_a) (1 + s_b) (1 + s_2)$ so
$(1 + s_2) = (1 + s_{ab})/((1 + s_a) (1 + s_b))$ and therefore specify as:

<<>>=
s2 <- ((1 + sab)/((1 + sa) * (1 + sb))) - 1

e3 <- allFitnessEffects(epistasis =
                            c("A" = sa,
                              "B" = sb,
                              "A : B" = s2))
evalAllGenotypes(e3, order = FALSE, addwt = TRUE)

@ 
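
With the numerical values used here, the interaction coefficient can be
worked out by hand (our arithmetic, following the expression for $s_2$
given above):
\[
  s_2 = \frac{1 + 0.7}{(1 + 0.2)(1 + 0.3)} - 1
      = \frac{1.7}{1.56} - 1 \approx 0.0897,
\]
so that the double mutant has fitness
$1.2 \times 1.3 \times (1 + s_2) = 1.7 = 1 + s_{ab}$, as in the first
specification.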

Note that this is the way you would specify effects with FFPopsim
\cite{Zanini2012}. Whether this specification or the previous one with
``-'' is simpler will depend on the model. For synthetic mortality and
viability, I think the one using ``-'' is simpler to map genotype tables
to fitness effects. See also sections \ref{e3} and \ref{theminus} and the
example in section \ref{weis1b}.


Finally, note that we can also specify some of these effects by combining
the graph and the epistasis, as shown in section \ref{misra1a} or
\ref{weis1b}.

\subsubsection{Epistasis with three genes and two alternative specifications}\label{e3}

Suppose we have 

\begin{tabular} {c c c c}
  A & B & C & Fitness \\
  \hline
  M & wt & wt & $1 + s_a$ \\
  wt& M & wt& $1 + s_b$ \\
  wt & wt & M & $1 + s_c$ \\
  M & M & wt & $1 + s_{ab}$ \\
  wt& M & M& $1 + s_{bc}$ \\
  M & wt & M & $(1 + s_a) (1 + s_c)$ \\
  M & M & M & $1 + s_{abc}$ \\
  \hline
\end{tabular}

where missing rows have a fitness of 1 (they have been deleted for
conciseness). Note that the mutant for exactly A and C has a fitness that
is the product of the individual terms (so there is no epistasis in that case).


<<>>=
sa <- 0.1
sb <- 0.15
sc <- 0.2
sab <- 0.3
sbc <- -0.25
sabc <- 0.4

sac <- (1 + sa) * (1 + sc) - 1

E3A <- allFitnessEffects(epistasis =
                            c("A:-B:-C" = sa,
                              "-A:B:-C" = sb,
                              "-A:-B:C" = sc,
                              "A:B:-C" = sab,
                              "-A:B:C" = sbc,
                              "A:-B:C" = sac,
                              "A : B : C" = sabc)
                                                )

evalAllGenotypes(E3A, order = FALSE, addwt = FALSE)


@ 

We needed to pass the $s_{ac}$ coefficient explicitly, even if that
term was just the product. We can try to avoid using the ``-'', however
(but we will need to do other calculations). For simplicity, I use capital
``S'' in what follows where the letters differ from the previous
specification:


<<>>=

sa <- 0.1
sb <- 0.15
sc <- 0.2
sab <- 0.3
Sab <- ( (1 + sab)/((1 + sa) * (1 + sb))) - 1
Sbc <- ( (1 + sbc)/((1 + sb) * (1 + sc))) - 1
Sabc <- ( (1 + sabc)/( (1 + sa) * (1 + sb) * (1 + sc) * (1 + Sab) * (1 + Sbc) ) ) - 1

E3B <- allFitnessEffects(epistasis =
                             c("A" = sa,
                               "B" = sb,
                               "C" = sc,
                               "A:B" = Sab,
                               "B:C" = Sbc,
                               ## "A:C" = sac, ## not needed now
                               "A : B : C" = Sabc)
                                                )
evalAllGenotypes(E3B, order = FALSE, addwt = FALSE)

@ 

The above two are, of course, identical:

<<>>=
all(evalAllGenotypes(E3A, order = FALSE, addwt = FALSE) == 
    evalAllGenotypes(E3B, order = FALSE, addwt = FALSE))
@ 

We avoid specifying the ``A:C'' term, as it just follows from the
individual ``A'' and ``C'' terms, but given a specified genotype table, we
need to do a little arithmetic (products and divisions) to get the
coefficients.
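
To make that arithmetic explicit, here is a short derivation that uses
only the symbols defined above. Each capital-``S'' coefficient is what
remains of the target fitness after dividing out the lower-order terms
that are already present in the genotype:

\begin{align*}
  1 + S_{ab} &= \frac{1 + s_{ab}}{(1 + s_a)(1 + s_b)}, \\
  1 + S_{bc} &= \frac{1 + s_{bc}}{(1 + s_b)(1 + s_c)}, \\
  1 + S_{abc} &= \frac{1 + s_{abc}}{(1 + s_a)(1 + s_b)(1 + s_c)
                 (1 + S_{ab})(1 + S_{bc})}.
\end{align*}

No $S_{ac}$ term is needed because the fitness of the mutant for exactly A
and C is already $(1 + s_a)(1 + s_c)$. With the values used in the code,
this gives $S_{ab} \approx 0.0277$, $S_{bc} \approx -0.4565$ and
$S_{abc} \approx 0.6513$.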


\subsubsection{Why can we specify some effects with a ``-''?}\label{theminus}
Let's suppose we want to specify the synthetic viability example seen
before:

\begin{tabular} {c c c}
  A & B & Fitness \\
  \hline
  wt & wt & $1$ \\
  wt & M & $0$ \\
  M & wt & $0$ \\
  M & M & $(1 + s)$ \\
  \hline
\end{tabular}


where ``wt'' denotes wild type and ``M'' denotes mutant.

If you want to directly map the above table to the fitness table for the
program, to specify the genotype ``A is wt, B is a mutant'' you can
specify it as \texttt{``-A,B''}, not just as \texttt{``B''}. Why? Because
just the presence of a ``B'' is also compatible with genotype ``A is
mutant and B is mutant''.  If you use ``-'' you are explicitly saying what
should not be there so that \texttt{-A,B} is NOT compatible with
\texttt{A, B}. Otherwise, you need to carefully add coefficients.
Depending on what you are trying to model, different specifications might
be simpler. See the examples in sections \ref{e2} and \ref{e3}. You have
both options.
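
As a minimal sketch of that direct mapping, the table above can be entered
row by row. The value of \texttt{s\_sv} and the object name \texttt{svm}
are arbitrary choices made only for this illustration, and the terms are
separated with \texttt{:} as in the other examples of this section:

<<>>=
library(OncoSimulR)
s_sv <- 0.2 ## arbitrary value, only for illustration
svm <- allFitnessEffects(epistasis =
                             c("-A : B" = -1,    ## A wt, B mutant: fitness 0
                               "A : -B" = -1,    ## A mutant, B wt: fitness 0
                               "A : B" = s_sv))  ## both mutant: 1 + s_sv
evalAllGenotypes(svm, order = FALSE, addwt = TRUE)
@

Each name describes exactly one row of the table, so no coefficient has to
be corrected for the presence of another.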


\subsubsection{Epistasis: modules}\label{epimod}
There is nothing conceptually new, but we will show an example here:

<<>>=

sa <- 0.2
sb <- 0.3
sab <- 0.7

em <- allFitnessEffects(epistasis =
                            c("A: -B" = sa,
                              "-A:B" = sb,
                              "A : B" = sab),
                        geneToModule = c("A" = "a1, a2",
                                         "B" = "b1, b2"))
evalAllGenotypes(em, order = FALSE, addwt = TRUE)
@ 


Of course, we can do the same thing without using the ``-'', as in section \ref{e2}:

<<>>=
s2 <- ((1 + sab)/((1 + sa) * (1 + sb))) - 1

em2 <- allFitnessEffects(epistasis =
                            c("A" = sa,
                              "B" = sb,
                              "A : B" = s2),
                         geneToModule = c("A" = "a1, a2",
                                         "B" = "b1, b2")
                         )
evalAllGenotypes(em2, order = FALSE, addwt = TRUE)

@ 


\subsection{I do not want epistasis, but I want modules!}
\label{mod-no-epi}

Sometimes you might want something like having several modules, say ``A''
and ``B'', each with a number of genes, but with ``A'' and ``B'' showing
no interaction. 

Whether \texttt{noIntGenes} (genes without interactions, explained in
section \ref{noint}) should be allowed to be modules is a terminological
issue. The reasoning for not allowing them is that the situation
depicted above (several genes in module A, for example) actually is one of
interaction: the members of ``A'' are combined using an ``OR'' operator
(i.e., the fitness consequences of having one or more genes of A mutated
are the same), not just simply multiplying their fitness; similarly for
``B''. This is why no interaction genes also mean no modules allowed.

So how do you get what you want in this case?  Enter the names of the
modules in the \texttt{epistasis} component but have no term for
\texttt{:}. Let's see an example:


<<>>=

fnme <- allFitnessEffects(epistasis = c("A" = 0.1,
                                        "B" = 0.2),
                          geneToModule = c("A" = "a1, a2",
                                           "B" = "b1, b2, b3"))

evalAllGenotypes(fnme, order = FALSE, addwt = TRUE)

@ 

In previous versions this was possible only by using the longer, still
accepted, way of specifying a \texttt{:} term with a value of 0, but this
is no longer needed:

<<>>=
fnme <- allFitnessEffects(epistasis = c("A" = 0.1,
                                        "B" = 0.2,
                                        "A : B" = 0.0),
                          geneToModule = c("A" = "a1, a2",
                                           "B" = "b1, b2, b3"))

evalAllGenotypes(fnme, order = FALSE, addwt = TRUE)

@ 

This can, of course, be extended to more modules.
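
The ``OR'' behavior inside each module can be checked by counting the
distinct fitness values over all genotypes. The following is a sketch
that rebuilds the same specification under the hypothetical name
\texttt{fnme\_or}, so that it can be run on its own:

<<>>=
library(OncoSimulR)
fnme_or <- allFitnessEffects(epistasis = c("A" = 0.1,
                                           "B" = 0.2),
                             geneToModule = c("A" = "a1, a2",
                                              "B" = "b1, b2, b3"))
ag <- evalAllGenotypes(fnme_or, order = FALSE, addwt = TRUE)
nrow(ag) ## 2^5 genotypes over the five genes, wild type included
## Fitness depends only on which modules are hit, not on how many
## genes of each module are mutated
sort(unique(round(ag$Fitness, 10)))
@

Only four values appear (wild type, ``A'' only, ``B'' only, and both),
because mutating a second gene of a module adds nothing.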


\subsection{Poset, epistasis, synthetic mortality and viability, order
  effects and genes without interactions, with some modules}\label{exlong}

We will now put together a complex example. We will use the poset from
section \ref{pm3} but will also add:
\begin{itemize}
\item Order effects that involve genes in the poset. In this case, if C
  happens before F, fitness is multiplied by $1 - 0.1$. If it happens the
  other way around, there is no effect on fitness beyond their individual
  contributions. %%  but if it happens the
  %% other way around it increases by $1 + 0.13$.
\item Order effects that involve two new modules, ``H'' and ``I'' (with
  genes ``h1, h2'' and ``i1'', respectively), so that if H happens before
  I fitness is multiplied by $1 + 0.12$.
\item Synthetic mortality between modules ``I'' (already present in the
  order effects above) and ``J'' (with genes ``j1'' and ``j2''): the
  joint presence of these modules leads to cell death (fitness of 0).
\item Synthetic viability between modules ``K'' and ``M'' (with genes
  ``k1'', ``k2'' and ``m1'', respectively), so that their joint presence
  is viable but adds nothing to fitness (i.e., mutation of both has
  fitness $1$), whereas each single mutant has a fitness of $1 - 0.5$.
\item A set of 5 driver genes ($n1, \ldots, n5$) with fitness that comes
  from an exponential distribution with rate of 10.
\end{itemize}


As we are specifying many different things, we will start by writing each
set of effects separately:


<<>>=

p4 <- data.frame(parent = c(rep("Root", 4), "A", "B", "D", "E", "C", "F"),
                 child = c("A", "B", "D", "E", "C", "C", "F", "F", "G", "G"),
                 s = c(0.01, 0.02, 0.03, 0.04, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3),
                 sh = c(rep(0, 4), c(-.9, -.9), c(-.95, -.95), c(-.99, -.99)),
                 typeDep = c(rep("--", 4), 
                     "XMPN", "XMPN", "MN", "MN", "SM", "SM"))

oe <- c("C > F" = -0.1, "H > I" = 0.12)
sm <- c("I:J"  = -1)
sv <- c("-K:M" = -.5, "K:-M" = -.5)
epist <- c(sm, sv)

modules <- c("Root" = "Root", "A" = "a1",
             "B" = "b1, b2", "C" = "c1",
             "D" = "d1, d2", "E" = "e1",
             "F" = "f1, f2", "G" = "g1",
             "H" = "h1, h2", "I" = "i1",
             "J" = "j1, j2", "K" = "k1, k2", "M" = "m1")

set.seed(1) ## for repeatability
noint <- rexp(5, 10)
names(noint) <- paste0("n", 1:5)

fea <- allFitnessEffects(rT = p4, epistasis = epist, orderEffects = oe,
                         noIntGenes = noint, geneToModule = modules)

@ 


How does it look?

<<fig.height=6.5>>=
plot(fea)
@ 

or

<<fig.height=6.5>>=
plot(fea, "igraph")
@ 


We can, if we want, expand the modules using a ``graphNEL'' graph
<<fig.height=6.5>>=
plot(fea, expandModules = TRUE)
@ 

or an ``igraph'' one
<<fig.height=7.>>=
plot(fea, "igraph", expandModules = TRUE)
@ 


We will not evaluate the fitness of all genotypes, since the number of all
ordered genotypes is $> 7 \times 10^{22}$. We will look at some specific
genotypes:

<<>>=
evalGenotype("k1 > i1 > h2", fea) ## 0.5
evalGenotype("k1 > h1 > i1", fea) ## 0.5 * 1.12

evalGenotype("k2 > m1 > h1 > i1", fea) ## 1.12

evalGenotype("k2 > m1 > h1 > i1 > c1 > n3 > f2", fea) 
## 1.12 * 0.1 * (1 + noint[3]) * 0.05 * 0.9

@ 
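
The factors in the comment for the last genotype,
\texttt{k2 > m1 > h1 > i1 > c1 > n3 > f2}, can be traced back to the
specification one by one. This is only a restatement of the effects
defined above:

\begin{tabular}{l p{0.68\textwidth}}
  Factor & Origin \\
  \hline
  $1$ & K and M are both mutated, so neither \texttt{-K:M} nor
        \texttt{K:-M} applies \\
  $1 + 0.12$ & order effect: H is mutated before I \\
  $1 - 0.9$ & C is mutated without A or B (\texttt{sh = -0.9}) \\
  \texttt{1 + noint[3]} & gene n3, which has no interactions \\
  $1 - 0.95$ & F is mutated without D and E (\texttt{sh = -0.95}) \\
  $1 - 0.1$ & order effect: C is mutated before F \\
  \hline
\end{tabular}

With \texttt{noint[3]} $\approx 0.0146$ (the value shown for n3 in the
output of the random genotypes further below), the product is
approximately $0.0051$.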

Finally, let's generate some ordered genotypes randomly:

<<>>=

randomGenotype <- function(fe, ns = NULL) {
    gn <- setdiff(c(fe$geneModule$Gene,
                    fe$long.geneNoInt$Gene), "Root")
    if(is.null(ns)) ns <- sample(length(gn), 1)
    return(paste(sample(gn, ns), collapse = " > "))
}

set.seed(2) ## for reproducibility

evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE)
## Genotype:  k2 > i1 > c1 > n1 > m1
##  Individual s terms are : 0.0755182 -0.9
##  Fitness:  0.107552 
evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE)
## Genotype:  n2 > h1 > h2
##  Individual s terms are : 0.118164
##  Fitness:  1.11816 
evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE)
## Genotype:  d2 > k2 > c1 > f2 > n4 > m1 > n3 > f1 > b1 > g1 > n5 > h1 > j2
##  Individual s terms are : 0.0145707 0.0139795 0.0436069 0.02 0.1 0.03 -0.95 0.3 -0.1
##  Fitness:  0.0725829 
evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE)
## Genotype:  h2 > c1 > f1 > n2 > b2 > a1 > n1 > i1
##  Individual s terms are : 0.0755182 0.118164 0.01 0.02 -0.9 -0.95 -0.1 0.12
##  Fitness:  0.00624418 
evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE)
## Genotype:  h2 > j1 > m1 > d2 > i1 > b2 > k2 > d1 > b1 > n3 > n1 > g1 > h1 > c1 > k1 > e1 > a1 > f1 > n5 > f2
##  Individual s terms are : 0.0755182 0.0145707 0.0436069 0.01 0.02 -0.9 0.03 0.04 0.2 0.3 -1 -0.1 0.12
##  Fitness:  0 
evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE)
## Genotype:  n1 > m1 > n3 > i1 > j1 > n5 > k1
##  Individual s terms are : 0.0755182 0.0145707 0.0436069 -1
##  Fitness:  0 
evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE)
## Genotype:  d2 > n1 > g1 > f1 > f2 > c1 > b1 > d1 > k1 > a1 > b2 > i1 > n4 > h2 > n2
##  Individual s terms are : 0.0755182 0.118164 0.0139795 0.01 0.02 -0.9 0.03 -0.95 0.3 -0.5
##  Fitness:  0.00420528 
evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE)
## Genotype:  j1 > f1 > j2 > a1 > n4 > c1 > n3 > k1 > d1 > h1
##  Individual s terms are : 0.0145707 0.0139795 0.01 0.1 0.03 -0.95 -0.5
##  Fitness:  0.0294308 
evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE)
## Genotype:  n5 > f2 > f1 > h2 > n4 > c1 > n3 > b1
##  Individual s terms are : 0.0145707 0.0139795 0.0436069 0.02 0.1 -0.95
##  Fitness:  0.0602298 
evalGenotype(randomGenotype(fea), fea, echo = TRUE, verbose = TRUE)
## Genotype:  h1 > d1 > f2
##  Individual s terms are : 0.03 -0.95
##  Fitness:  0.0515 


@ 

\subsection{Homozygosity, heterozygosity, oncogenes, tumor suppressors}\label{oncog}

We are using what is conceptually a single linear chromosome. However, you
can use it to model scenarios where the numbers of copies affected matter,
by properly duplicating the genes. 

Suppose we have a tumor suppressor gene, G, with two copies, one from Mom
and one from Dad. We can have a table like:


\begin{tabular} {c c c}
  $G_M$ & $G_D$ & Fitness \\
  \hline
  wt&wt& 1 \\
  wt&M& 1 \\
  M&wt& 1\\
  M&M& $(1 + s)$\\
  \hline
\end{tabular}

where $s > 0$, meaning that you need two hits, one in each copy, to
trigger the clonal expansion.


What about oncogenes? A simple model is that one single hit leads to
clonal expansion and additional hits lead to no additional changes, as in
this table for gene O, where again the M or D subscript denotes the copy
from Mom or from Dad:

\begin{tabular} {c c c}
  $O_M$ & $O_D$ & Fitness \\
  \hline
  wt&wt& 1 \\
  wt&M&  $(1 + s)$\\
  M&wt& $(1 + s)$\\
  M&M& $(1 + s)$\\
  \hline
\end{tabular}

If you have multiple copies you can proceed similarly. As you can see,
these are nothing but special cases of synthetic mortality (\ref{sl}),
synthetic viability (\ref{sv}) and epistasis (\ref{epi}).
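
Both tables can be written down with the ``-'' notation of section
\ref{theminus}, treating each copy as a separate gene. This is only a
sketch: the gene names (\texttt{GM}, \texttt{GD}, \texttt{OM},
\texttt{OD}), the object names and the value of the selection coefficient
are arbitrary choices made for this illustration.

<<>>=
library(OncoSimulR)
s_hit <- 0.1 ## arbitrary
## Tumor suppressor: both copies must be hit
tsg <- allFitnessEffects(epistasis = c("GM : -GD" = 0,
                                       "-GM : GD" = 0,
                                       "GM : GD" = s_hit))
evalAllGenotypes(tsg, order = FALSE, addwt = TRUE)
## Oncogene: one hit is enough, a second hit adds nothing
onc <- allFitnessEffects(epistasis = c("OM : -OD" = s_hit,
                                       "-OM : OD" = s_hit,
                                       "OM : OD" = s_hit))
evalAllGenotypes(onc, order = FALSE, addwt = TRUE)
@

Each epistasis term corresponds to one row of the tables above, so the
fitness column of the output can be compared with them directly.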


\section{Specifying fitness effects: some examples from the literature}\label{litex}
\subsection{Bauer et al., 2014}\label{bauer}

In the model of Bauer and collaborators \cite[p.\ 54]{Bauer2014} ``For
cells without the primary driver mutation, each secondary driver mutation
leads to a change in the cell's fitness by $s_P$. For cells with the
primary driver mutation, the fitness advantage obtained with each
secondary driver mutation is $s_{DP}$.''

The proliferation probability is given as $(1 + s_P)^k$ when there are $k$
secondary drivers mutated and no primary driver. If the primary driver is
mutated, then the expression is $\frac{1+S_D^+}{1+S_D^-} (1 + s_{DP})^k$.
They set apoptosis as $1 - \mathrm{proliferation}$.  So, ignoring
constants such as $1/2$, and setting $ P = \frac{1+S_D^+}{1+S_D^-}$ (coded
below as \texttt{1 + sd}), we can prepare a table as follows (for a
largest $k$ of 5 in this example, but it can be made arbitrarily large):

<<>>=

K <- 5
sd <- 0.1
sdp <- 0.15
sp <- 0.05
bauer <- data.frame(parent = c("Root", rep("p", K)),
                    child = c("p", paste0("s", 1:K)),
                    s = c(sd, rep(sdp, K)),
                    sh = c(0, rep(sp, K)),
                    typeDep = "MN")
fbauer <- allFitnessEffects(bauer)

@ 

Note that what we specify as ``typeDep'' is irrelevant (MN, SM, or XMPN
make no difference), since each node has a single parent.


The fitness effects figure looks like this:
<<fig.height=3>>=
plot(fbauer)
@ 

<<>>=
(b1 <- evalAllGenotypes(fbauer, order = FALSE))[1:10, ]
@ 

Order makes no difference:

<<>>=
(b2 <- evalAllGenotypes(fbauer, order = TRUE, max = 2000))[1:15, ]
@ 

And the number of levels is the right one: 11
<<>>=
length(table(b1$Fitness))
length(table(b2$Fitness))
@ 
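
The figure of 11 can be checked by hand from the expressions above. A
small sketch in base R, assuming (as the data frame above does) that the
primary driver alone multiplies fitness by \texttt{1 + sd}; the names
\texttt{k}, \texttt{no\_p} and \texttt{with\_p} are introduced here only
for illustration:

<<>>=
K <- 5
sd <- 0.1
sdp <- 0.15
sp <- 0.05
k <- 0:K
no_p <- (1 + sp)^k                ## k secondary drivers, no primary driver
with_p <- (1 + sd) * (1 + sdp)^k  ## primary driver plus k secondary drivers
data.frame(k, no_p, with_p)
## Distinct fitness values, leaving out the wild type
## (k = 0 and no primary driver)
length(unique(signif(c(no_p[-1], with_p), 10)))
@

There are five levels without the primary driver and six with it, and
with these coefficients none of them coincide.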


%% \subsubsection{Bauer et al.\ specified only via epistatic interactions}
%% Yes, do it: as -p,s1, and -p,s2, etc. But much more of a mess.

%% \subsubsection{Adding modules to Bauer et al.}

Can we use modules in this model? Sure, as in any other.


\subsection{Misra et al., 2014}\label{misra}

Figure 1 of Misra et al.\ \cite{Misra2014} presents three scenarios which
are different types of epistasis. %% (I show the fitness scenarios without
%% axis, to replicate as close as possible what they show in their paper)

\subsubsection{Example 1.a}\label{misra1a}
<<echo=FALSE, fig.height=4, fig.width=4>>=

df1 <- data.frame(x = c(1, 1.2, 1.4), f = c(1, 1.2, 1.2),
                 names = c("wt", "A", "B"))
plot(df1[, 2] ~ df1[, 1], axes = TRUE, xlab= "", 
     ylab = "Fitness", xaxt = "n", yaxt = "n", ylim = c(1, 1.21))
segments(1, 1, 1.2, 1.2)
segments(1, 1, 1.4, 1.2)
text(1, 1, "wt", pos = 4)
text(1.2, 1.2, "A", pos = 2)
text(1.4, 1.2, "B", pos = 2)
## axis(1,  tick = FALSE, labels = FALSE)
## axis(2,  tick = FALSE, labels = FALSE)
@ 


In that figure it is evident that the fitness effects of ``A'' and ``B''
are the same. There are two different models depending on whether the
fitness of ``AB'' is just the product of both, or there is epistasis. In
the first case probably the simplest is:

<<>>=
s <- 0.1 ## or whatever number
m1a1 <- allFitnessEffects(data.frame(parent = c("Root", "Root"),
                                     child = c("A", "B"),
                                     s = s,
                                     sh = 0,
                                     typeDep = "MN"))
evalAllGenotypes(m1a1, order = FALSE, addwt = TRUE)
@ 


If the double mutant shows epistasis, as we saw before (section \ref{e2})
we have a range of options. For example:

<<>>=
s <- 0.1
sab <- 0.3
m1a2 <- allFitnessEffects(epistasis = c("A:-B" = s,
                                        "-A:B" = s,
                                        "A:B" = sab))
evalAllGenotypes(m1a2, order = FALSE, addwt = TRUE)
@ 

But we could also modify the graph dependency structure, and we have to
change the value of the coefficient, since that is what multiplies each of
the terms for ``A'' and ``B'': $(1 + s_{AB}) = (1 + s)^2(1 + s_{AB3}) $

<<>>=
sab3 <- ((1 + sab)/((1 + s)^2)) - 1
m1a3 <- allFitnessEffects(data.frame(parent = c("Root", "Root"),
                                     child = c("A", "B"),
                                     s = s,
                                     sh = 0,
                                     typeDep = "MN"),
                          epistasis = c("A:B" = sab3))
evalAllGenotypes(m1a3, order = FALSE, addwt = TRUE)
@ 

And, obviously
<<>>=
all.equal(evalAllGenotypes(m1a2, order = FALSE, addwt = TRUE),
          evalAllGenotypes(m1a3, order = FALSE, addwt = TRUE))
@ 


\subsubsection{Example 1.b}\label{misra1b}

This is a specific case of synthetic viability (see also section \ref{sv}):

<<echo=FALSE, fig.width=4, fig.height=4>>=

df1 <- data.frame(x = c(1, 1.2, 1.2, 1.4), f = c(1, 0.4, 0.3, 1.3),
                 names = c("wt", "A", "B", "AB"))
plot(df1[, 2] ~ df1[, 1], axes = TRUE, xlab= "", ylab = "Fitness",
     xaxt = "n", yaxt = "n", ylim = c(0.29, 1.32))
segments(1, 1, 1.2, 0.4)
segments(1, 1, 1.2, 0.3)
segments(1.2, 0.4, 1.4, 1.3)
segments(1.2, 0.3, 1.4, 1.3)
text(x = df1[, 1], y = df1[, 2], labels = df1[, "names"], pos = c(4, 2, 2, 2))
## text(1, 1, "wt", pos = 4)
## text(1.2, 1.2, "A", pos = 2)
## text(1.4, 1.2, "B", pos = 2)
@ 


Here, $S_A < 0$, $S_B < 0$, $S_{AB} > 0$ and $(1 + S_{AB}) (1 + S_A) (1 +
S_B) > 1$.

As before, we can specify this in several different ways. The simplest is
to specify all genotypes:
<<>>=
sa <- -0.6
sb <- -0.7
sab <- 0.3
m1b1 <- allFitnessEffects(epistasis = c("A:-B" = sa,
                                        "-A:B" = sb,
                                        "A:B" = sab))
evalAllGenotypes(m1b1, order = FALSE, addwt = TRUE)
@ 

We could also use a tree and modify the ``sab'' for the epistasis, as
before (\ref{misra1a}).
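
A sketch of that alternative, following the pattern of section
\ref{misra1a}. The object and variable names (\texttt{m1b\_direct},
\texttt{m1b\_tree}, \texttt{sab3\_b}, etc.) are hypothetical, and the
direct specification is rebuilt so that the chunk can be run on its own:

<<>>=
library(OncoSimulR)
sa_b <- -0.6
sb_b <- -0.7
sab_b <- 0.3
## epistatic coefficient that multiplies the two single-gene terms
sab3_b <- ((1 + sab_b)/((1 + sa_b) * (1 + sb_b))) - 1
m1b_direct <- allFitnessEffects(epistasis = c("A:-B" = sa_b,
                                              "-A:B" = sb_b,
                                              "A:B" = sab_b))
m1b_tree <- allFitnessEffects(data.frame(parent = c("Root", "Root"),
                                         child = c("A", "B"),
                                         s = c(sa_b, sb_b),
                                         sh = 0,
                                         typeDep = "MN"),
                              epistasis = c("A:B" = sab3_b))
all.equal(evalAllGenotypes(m1b_direct, order = FALSE, addwt = TRUE),
          evalAllGenotypes(m1b_tree, order = FALSE, addwt = TRUE))
@

Note how large the epistatic coefficient has to be in the tree version,
since it must compensate for two strongly deleterious single-gene terms.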


\subsubsection{Example 1.c}\label{misra1c}

The final case, in figure 1.c of Misra et al., is just epistasis, where a
mutation in one of the genes is deleterious (possibly only mildly), a
mutation in the other is beneficial, and the double mutant has a fitness
larger than either of the single mutants.


<<echo=FALSE, fig.width=4, fig.height=4>>=

df1 <- data.frame(x = c(1, 1.2, 1.2, 1.4), f = c(1, 1.2, 0.7, 1.5),
                 names = c("wt", "A", "B", "AB"))
plot(df1[, 2] ~ df1[, 1], axes = TRUE, xlab = "", ylab = "Fitness",
     xaxt = "n", yaxt = "n", ylim = c(0.69, 1.53))
segments(1, 1, 1.2, 1.2)
segments(1, 1, 1.2, 0.7)
segments(1.2, 1.2, 1.4, 1.5)
segments(1.2, 0.7, 1.4, 1.5)
text(x = df1[, 1], y = df1[, 2], labels = df1[, "names"], pos = c(3, 3, 3, 2))
## text(1, 1, "wt", pos = 4)
## text(1.2, 1.2, "A", pos = 2)
## text(1.4, 1.2, "B", pos = 2)

@ 

Here we have that $s_A > 0$, $s_B < 0$, and $(1 + s_{AB}) (1 + s_A) (1 +
s_B) > (1 + s_A)$ (the double mutant is fitter than the fitter of the
single mutants), so $s_{AB} > \frac{-s_B}{1 + s_B}$.


As before, we can specify this in several different ways. The simplest is
to specify all genotypes:
<<>>=
sa <- 0.2
sb <- -0.3
sab <- 0.5
m1c1 <- allFitnessEffects(epistasis = c("A:-B" = sa,
                                        "-A:B" = sb,
                                        "A:B" = sab))
evalAllGenotypes(m1c1, order = FALSE, addwt = TRUE)
@ 

We could also use a tree and modify the ``sab'' for the epistasis, as
before (\ref{misra1a}).
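
Keep in mind that the inequality above refers to the multiplicative
coefficient $s_{AB}$, whereas \texttt{sab} in the code is the effect of
the double-mutant genotype itself. A quick check in base R with the values
used here (the names \texttt{sab\_mult} and \texttt{bound} are introduced
only for this sketch):

<<>>=
sa <- 0.2
sb <- -0.3
sab <- 0.5
## multiplicative coefficient implied by a double-mutant fitness of 1 + sab
sab_mult <- ((1 + sab)/((1 + sa) * (1 + sb))) - 1
## lower bound for the multiplicative coefficient
bound <- -sb/(1 + sb)
c(sab_mult = sab_mult, bound = bound)
sab_mult > bound
@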


\subsection{Ochs and Desai, 2015}\label{ochsdesai}

In \cite{Ochs2015} the authors present a model shown graphically as (the
actual numerical values are arbitrarily set by me):


<<echo=FALSE, fig.width=4.5, fig.height=3.5>>=

df1 <- data.frame(x = c(1, 2, 3, 4), f = c(1.1, 1, 0.95, 1.2),
                 names = c("u", "wt", "i", "v"))
plot(df1[, 2] ~ df1[, 1], axes = FALSE, xlab = "", ylab = "")
par(las = 1)
axis(2)
axis(1, at = c(1, 2, 3, 4), labels = df1[, "names"], ylab = "")
box()
arrows(c(2, 2, 3), c(1, 1, 0.95),
       c(1, 3, 4), c(1.1, 0.95, 1.2))
## text(1, 1, "wt", pos = 4)
## text(1.2, 1.2, "A", pos = 2)
## text(1.4, 1.2, "B", pos = 2)
@

In their model, $s_u > 0$, $s_v > s_u$, $s_i < 0$, we can only arrive at
$v$ from $i$, and the mutants ``ui'' and ``uv'' can never appear as their
fitness is 0, or $-\infty$, so $s_{ui} = s_{uv} = -1$ (or $-\infty$).

We can specify this combining a graph and epistasis specifications:

<<>>=
su <- 0.1
si <- -0.05
fvi <- 1.2 ## the fitness of the vi mutant
sv <- (fvi/(1 + si)) - 1
sui <- suv <- -1
od <- allFitnessEffects(
    data.frame(parent = c("Root", "Root", "i"),
               child = c("u", "i", "v"),
               s = c(su, si, sv),
               sh = -1,
               typeDep = "MN"),
    epistasis = c(
        "u:i" = sui,
        "u:v" = suv))
@ 

A figure showing that model is
<<fig.width=3, fig.height=3>>=
plot(od)
@ 

And the fitness of all genotypes is
<<>>=
evalAllGenotypes(od, order = FALSE, addwt = TRUE)
@ 


\subsection{Weissman et al., 2009}
In their figure 1a, Weissman et al.\ \cite{Weissman2009} present this model
(actual numeric values are set arbitrarily):

\subsubsection{Figure 1.a}

<<echo=FALSE, fig.width=4, fig.height=3>>=

df1 <- data.frame(x = c(1, 2, 3), f = c(1, 0.95, 1.2),
                 names = c("wt", "1", "2"))
plot(df1[, 2] ~ df1[, 1], axes = FALSE, xlab = "", ylab = "")
par(las = 1)
axis(2)
axis(1, at = c(1, 2, 3), labels = df1[, "names"], ylab = "")
box()
segments(c(1, 2), c(1, 0.95),
       c(2, 3), c(0.95, 1.2))
## text(1, 1, "wt", pos = 4)
## text(1.2, 1.2, "A", pos = 2)
## text(1.4, 1.2, "B", pos = 2)
@

where the ``1'' and ``2'' refer to the total number of mutations in two
different loci. This is, therefore, very similar to the example in section
\ref{misra1b}. Here we have, in their notation, $\delta_1 < 0$, fitness of
single ``A'' or single ``B'' = $1 + \delta_1$, $S_{AB} > 0$,
$(1 + S_{AB})(1 + \delta_1)^2 > 1$.
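
This model can be specified as in section \ref{misra1b}. A minimal sketch,
using the arbitrary values of the figure (single mutants at 0.95, double
mutant at 1.2); the names \texttt{delta1}, \texttt{f\_ab} and
\texttt{w1a} are hypothetical:

<<>>=
library(OncoSimulR)
delta1 <- -0.05 ## single mutant fitness 0.95
f_ab <- 1.2     ## double mutant fitness
w1a <- allFitnessEffects(epistasis = c("A:-B" = delta1,
                                       "-A:B" = delta1,
                                       "A:B" = f_ab - 1))
evalAllGenotypes(w1a, order = FALSE, addwt = TRUE)
## the corresponding multiplicative coefficient S_AB
(f_ab/(1 + delta1)^2) - 1
@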


\subsubsection{Figure 1.b}\label{weis1b}

In their figure 1b they show

<<echo=FALSE, fig.width=4, fig.height=3>>=

df1 <- data.frame(x = c(1, 2, 3, 4), f = c(1, 0.95, 0.92, 1.2),
                 names = c("wt", "1", "2", "3"))
plot(df1[, 2] ~ df1[, 1], axes = FALSE, xlab = "", ylab = "")
par(las = 1)
axis(2)
axis(1, at = c(1, 2, 3, 4), labels = df1[, "names"], ylab = "")
box()
segments(c(1, 2, 3), c(1, 0.95, 0.92),
       c(2, 3, 4), c(0.95, 0.92, 1.2))
## text(1, 1, "wt", pos = 4)
## text(1.2, 1.2, "A", pos = 2)
## text(1.4, 1.2, "B", pos = 2)
@

where, as before, 1, 2, 3 denote the total number of mutations over three
different loci, with $\delta_1 < 0$, $\delta_2 < 0$ and $\delta_3 > 0$.
The fitness of a single mutant is $(1 + \delta_1)$; that of a double
mutant is $(1 + \delta_2)$, so that
$(1 + \delta_2) = (1 + \delta_1)^2 (1 + s_2)$; and that of the triple
mutant is $(1 + \delta_3)$, so that
$(1 + \delta_3) = (1 + \delta_1)^3 (1 + s_2)^3 (1 + s_3)$.


We can specify this combining a graph with epistasis:

<<>>=

d1 <- -0.05 ## single mutant fitness 0.95
d2 <- -0.08 ## double mutant fitness 0.92
d3 <- 0.2   ## triple mutant fitness 1.2

s2 <- ((1 + d2)/(1 + d1)^2) - 1
s3 <- ( (1 + d3)/((1 + d1)^3 * (1 + s2)^3) ) - 1

w <- allFitnessEffects(
    data.frame(parent = c("Root", "Root", "Root"),
               child = c("A", "B", "C"),
               s = d1,
               sh = -1,
               typeDep = "MN"),
    epistasis = c(
        "A:B" = s2,
        "A:C" = s2,
        "B:C" = s2,
        "A:B:C" = s3))
@ 

The model can be shown graphically as:
<<fig.width=4, fig.height=4>>=
plot(w)
@ 

And fitness of all genotypes is:

<<>>=
evalAllGenotypes(w, order = FALSE, addwt = TRUE)
@ 
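
As a quick arithmetic check in base R, the coefficients \texttt{s2} and
\texttt{s3} should reproduce the three target fitness values. The sketch
repeats the definitions above so that it can be run on its own;
\texttt{by\_nmut} is a name introduced only here:

<<>>=
d1 <- -0.05
d2 <- -0.08
d3 <- 0.2
s2 <- ((1 + d2)/(1 + d1)^2) - 1
s3 <- ( (1 + d3)/((1 + d1)^3 * (1 + s2)^3) ) - 1
## fitness by number of mutations; compare with 0.95, 0.92 and 1.2
by_nmut <- c(one = 1 + d1,
             two = (1 + d1)^2 * (1 + s2),
             three = (1 + d1)^3 * (1 + s2)^3 * (1 + s3))
by_nmut
@

The exponent 3 on $(1 + s_2)$ reflects that the triple mutant contains all
three pairwise terms, ``A:B'', ``A:C'' and ``B:C''.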


Alternatively, we can directly specify what each term adds to the
fitness, on top of the lower-order terms it includes. This basically
replaces the graph by giving the effects of ``A'', ``B'', and ``C''
directly:

<<>>=
wb <- allFitnessEffects(
    epistasis = c(
        "A" = d1,
        "B" = d1,
        "C" = d1,
        "A:B" = s2,
        "A:C" = s2,
        "B:C" = s2,
        "A:B:C" = s3))

evalAllGenotypes(wb, order = FALSE, addwt = TRUE)
@ 

The plot, of course, is not very revealing and we cannot show that there
is a three-way interaction (only all three two-way interactions):

<<, fig.width=3, fig.height=3>>=
plot(wb)
@ 

As we have seen several times already (sections \ref{e2}, \ref{e3},
\ref{theminus}) we can also give the genotypes directly and, consequently,
the fitness of each genotype (not the added contribution):

<<>>=
wc <- allFitnessEffects(
    epistasis = c(
        "A:-B:-C" = d1,
        "B:-C:-A" = d1,
        "C:-A:-B" = d1,
        "A:B:-C" = d2,
        "A:C:-B" = d2,
        "B:C:-A" = d2,
        "A:B:C" = d3))
evalAllGenotypes(wc, order = FALSE, addwt = TRUE)
@ 


\subsection{Gerstung et al., pancreatic cancer poset}\label{pancreas}
Similar to what we did in v.1 (see section \ref{poset}) we can specify the
pancreatic cancer poset in Gerstung et al.\ \cite{Gerstung2011} (their
figure 2B, left). We use directly the names of the genes, since that is
immediately supported by the new version.

<<fig.width=4>>=

pancr <- allFitnessEffects(
    data.frame(parent = c("Root", rep("KRAS", 4), 
                   "SMAD4", "CDKN2A", 
                   "TP53", "TP53", "MLL3"),
               child = c("KRAS","SMAD4", "CDKN2A", 
                   "TP53", "MLL3",
                   rep("PXDN", 3), rep("TGFBR2", 2)),
               s = 0.1,
               sh = -0.9,
               typeDep = "MN"))

plot(pancr)
@ 

Of course the ``s'' and ``sh'' are set arbitrarily here.

\clearpage
\subsection{Raphael and Vandin's modules}\label{raphael-ex}

In \cite{Raphael2014a}, Raphael and Vandin show several progression models
in terms of modules. We can code the extended poset for the colorectal
cancer model in their Figure 4.a as follows (s and sh are arbitrary):


<<fig.height = 4>>=

rv1 <- allFitnessEffects(data.frame(parent = c("Root", "A", "KRAS"),
                                    child = c("A", "KRAS", "FBXW7"),
                                    s = 0.1,
                                    sh = -0.01,
                                    typeDep = "MN"),
                         geneToModule = c("Root" = "Root",
                             "A" = "EVC2, PIK3CA, TP53",
                             "KRAS" = "KRAS",
                             "FBXW7" = "FBXW7"))

plot(rv1, expandModules = TRUE, autofit = TRUE)

@ 

We have used the (experimental) \Rcode{autofit} option to fit the labels to the
edges. Note how we can use the same name for genes and modules, but we
need to specify all the modules. 

\clearpage
Their Figure 5b is

<<fig.height=6>>=

rv2 <- allFitnessEffects(data.frame(parent = c("Root", "1", "2", "3", "4"),
                                    child = c("1", "2", "3", "4", "ELF3"),
                                    s = 0.1,
                                    sh = -0.01,
                                    typeDep = "MN"),
                         geneToModule = c("Root" = "Root",
                             "1" = "APC, FBXW7",
                             "2" = "ATM, FAM123B, PIK3CA, TP53",
                             "3" = "BRAF, KRAS, NRAS",
                             "4" = "SMAD2, SMAD4, SOX9",
                             "ELF3" = "ELF3"))

plot(rv2, expandModules = TRUE,   autofit = TRUE)
@ 

%%very poor rendering in the PDF, in separate page, et c.
%% plot(rv2, "igraph", expandModules = TRUE, 
%%       layout = layout.reingold.tilford,
%%       autofit = TRUE,
%%       scale_char = 8)


\clearpage
\section{Running and plotting the simulations}\label{simul}


\subsection{Bauer's example again}\label{bauer2}
We will use the model of Bauer et al.\ \cite{Bauer2014} that we saw in
section \ref{bauer}.

<<>>=
K <- 5
sd <- 0.1
sdp <- 0.15
sp <- 0.05
bauer <- data.frame(parent = c("Root", rep("p", K)),
                    child = c("p", paste0("s", 1:K)),
                    s = c(sd, rep(sdp, K)),
                    sh = c(0, rep(sp, K)),
                    typeDep = "MN")
fbauer <- allFitnessEffects(bauer)
set.seed(1)
## Use fairly large mutation rate
b1 <- oncoSimulIndiv(fbauer, mu = 5e-5, initSize = 1000) 
@ 


We will now use a variety of plots:
<<baux1,fig.width=6.5, fig.height=10>>=
par(mfrow = c(3, 1))
## First, drivers
plot(b1, type = "line", addtot = TRUE)
plot(b1, type = "stacked")
plot(b1, type = "stream")
@ 

<<baux2,fig.width=6.5, fig.height=10>>=
par(mfrow = c(3, 1))
## Next, genotypes
plot(b1, show = "genotypes", type = "line")
plot(b1, show = "genotypes", type = "stacked")
plot(b1, show = "genotypes", type = "stream")
@ 

In this case, probably the stream plots are most helpful. Note, however,
that (in contrast to some figures in the literature showing models of
clonal expansion) the stream plot (or the stacked plot) does not try to
explicitly show parent-descendant relationships, which would hardly be
realistically possible in these plots (although the plots of phylogenies
in section \ref{phylog} could be of help).

\subsection{McFarland model with 5000 passengers and 70
  drivers}\label{mcf5070}

<<fig.width=6>>=

set.seed(456)
nd <- 70  
np <- 5000 
s <- 0.1  
sp <- 1e-3 
spp <- -sp/(1 + sp)
mcf1 <- allFitnessEffects(noIntGenes = c(rep(s, nd), rep(spp, np)),
                          drv = seq.int(nd))
mcf1s <-  oncoSimulIndiv(mcf1,
                         model = "McFL", 
                         mu = 1e-7,
                         detectionSize = 1e8, 
                         detectionDrivers = 100,
                         sampleEvery = 0.02,
                         keepEvery = 2,
                         initSize = 2000,
                         finalTime = 1000,
                         onlyCancer = FALSE)
summary(mcf1s)
@ 

<<mcf1sx1,fig.width=6.5, fig.height=10>>=
par(mfrow  = c(2, 1))
## I use thinData to make figures smaller and faster
plot(mcf1s, addtot = TRUE, lwdClone = 0.9, log = "", 
     thinData = TRUE, thinData.keep = 0.5)
## I also use here xlim, to focus only on a part of the
## data (and make it plot faster)
plot(mcf1s, show = "drivers", type = "stacked",
     thinData = TRUE, thinData.keep = 0.5,
     xlim = c(600, 1000), legend.ncols = 2)
@ 


With the above output (where we see there are over 500 different
genotypes) trying to represent the genotypes makes no sense. 


\subsection{McFarland model with 50000 passengers and 70
  drivers: clonal competition}\label{mcf50070}

The next example is too slow (it takes a couple of minutes on an i5
laptop) and too big to run in a vignette, because we keep track of over
4000 different clones (which leads to a result object of over 800 MB):

<<eval=FALSE>>=

set.seed(123)
nd <- 70  
np <- 50000 
s <- 0.1  
sp <- 1e-4 ## as we have many more passengers
spp <- -sp/(1 + sp)
mcfL <- allFitnessEffects(noIntGenes = c(rep(s, nd), rep(spp, np)),
                          drv = seq.int(nd))
mcfLs <-  oncoSimulIndiv(mcfL,
                         model = "McFL", 
                         mu = 1e-7,
                         detectionSize = 1e8, 
                         detectionDrivers = 100,
                         sampleEvery = 0.02,
                         keepEvery = 2,
                         initSize = 1000,
                         finalTime = 2000,
                         onlyCancer = FALSE)
@ 

But you can access the pre-stored results and plot them (beware: this
object has been trimmed by removing empty passenger rows in the Genotype matrix)

<<mcflsx2,fig.width=6>>=
data(mcfLs)
plot(mcfLs, addtot = TRUE, lwdClone = 0.9, log = "", plotDiversity = TRUE)
@ 


The argument \Rcode{plotDiversity = TRUE} asks to show a small plot on top
with Shannon's diversity index.


<<>>=
summary(mcfLs)
## number of passengers per clone
summary(colSums(mcfLs$Genotypes[-(1:70), ]))
@ 


Note that we see clonal competition between clones with the same number of
drivers (and with different drivers, of course). We will return to this
(section \ref{clonalint}).

A stacked plot might be better to show the extent of clonal competition
(plotting takes some time ---a stream plot reveals similar patterns and is
also slower than the line plot). I will thin the data for this plot so it
is faster and smaller (but we miss some of the fine grain, of course):


<<mcflsx3>>=
plot(mcfLs, type = "stacked", thinData = TRUE, 
     thinData.keep = 0.5,
     plotDiversity = TRUE,
     xlim = c(0, 1000))
@ 


%% %% The problem is the Genotype matrix. We remove empty passenger rows.
%% <<>>=
%% g1 <- mcfLs$Genotypes[1:nd, ]
%% g2 <- mcfLs$Genotypes[(nd+1):(nd+np), ]
%% rs <- rowSums(g2)
%% g3 <- g2[which(rs == 0), ]
%% g4 <- rbind(g1, g3)
%% @ 


\subsection{Loading fitnessEffects data for simulation
  examples}\label{fedata}
We will use several of the previous examples. Most of them are in the data
set \Robject{examplesFitnessEffects}, where they are stored inside a list
with named components (the names are the same as in the examples above):

<<>>=
data(examplesFitnessEffects)
names(examplesFitnessEffects)
@ 


\subsection{Simulation with a conjunction example}\label{s-cbn1}

We will simulate using the simple CBN-like restrictions of
section \ref{cbn1} with two different models:

<<>>=
data(examplesFitnessEffects)
evalAllGenotypes(examplesFitnessEffects$cbn1, order = FALSE)[1:10, ]
sm <-  oncoSimulIndiv(examplesFitnessEffects$cbn1,
                       model = "McFL", 
                       mu = 5e-7,
                       detectionSize = 1e8, 
                       detectionDrivers = 2,
                       sampleEvery = 0.025,
                       keepEvery = 5,
                       initSize = 2000,
                       onlyCancer = TRUE)
summary(sm)
@ 

%% We will use several plots here.

%% <<>>=
%% ## Show drivers, line plot
%% plot(sm, show = "drivers", type = "line", addtot = TRUE)
%% ## drivers, stacked
%% plot(sm, show = "drivers", type = "stacked")

%% ## Genotypes, line plot
%% plot(sm, show = "genotypes", type = "line")
%% ## genotypes, stacked
%% plot(sm, show = "genotypes", type = "stacked")
%% @ 


<<>>=
set.seed(1234)
evalAllGenotypes(examplesFitnessEffects$cbn1, order = FALSE, 
                 model = "Bozic")[1:10, ]
sb <-  oncoSimulIndiv(examplesFitnessEffects$cbn1,
                       model = "Bozic", 
                       mu = 5e-6,
                       detectionSize = 1e8, 
                       detectionDrivers = 4,
                       sampleEvery = 2,
                       initSize = 2000,
                       onlyCancer = TRUE)
summary(sb)
@ 

As usual, we will use several plots here.
\clearpage

<<sbx1,fig.width=6.5, fig.height=3.3>>=
## Show drivers, line plot
par(cex = 0.75, las = 1)
plot(sb, show = "drivers", type = "line", addtot = TRUE,
     plotDiversity = TRUE)
@ 
<<sbx2,fig.width=6.5, fig.height=3.3>>=
## Drivers, stacked
par(cex = 0.75, las = 1)
plot(sb, show = "drivers", type = "stacked", plotDiversity = TRUE)
@ 
<<sbx3,fig.width=6.5, fig.height=3.3>>=
## Drivers, stream
par(cex = 0.75, las = 1)
plot(sb, show = "drivers", type = "stream", plotDiversity = TRUE)
@ 
\clearpage
<<sbx4,fig.width=6.5, fig.height=3.3>>=
## Genotypes, line plot
par(cex = 0.75, las = 1)
plot(sb, show = "genotypes", type = "line", plotDiversity = TRUE)
@ 
<<sbx5,fig.width=6.5, fig.height=3.3>>=
## Genotypes, stacked
par(cex = 0.75, las = 1)
plot(sb, show = "genotypes", type = "stacked", plotDiversity = TRUE)
@ 
<<sbx6,fig.width=6.5, fig.height=3.3>>=
## Genotypes, stream
par(cex = 0.75, las = 1)
plot(sb, show = "genotypes", type = "stream", plotDiversity = TRUE)
@ 

The above illustrates again that different types of plots can be useful to
reveal different patterns in the data. For instance, here one of the
clones/genotypes has a huge relative frequency and, since we cannot use a
log-transformed y-axis with the stacked and stream plots, those plots do
not reveal the other clones/genotypes, even though they are present.


\subsection{Simulation with order effects and McFL model}\label{clonalint}
%% Interesting to show effects of order: o3

%% Increase mutation rate, so does not take forever
%% <<>>=

%% tmp <-  oncoSimulIndiv(examplesFitnessEffects[["o3"]],
%%                        model = "McFL", 
%%                        mu = 5e-5,
%%                        detectionSize = 1e8, 
%%                        detectionDrivers = 3,
%%                        sampleEvery = 0.025,
%%                        max.num.tries = 10,
%%                        keepEvery = -9,
%%                        initSize = 2000,
%%                        finalTime = 8000,
%%                        onlyCancer = TRUE); 

%% tmp

%% tmp <-  oncoSimulIndiv(examplesFitnessEffects[["o3"]],
%%                        model = "Bozic", 
%%                        mu = 5e-5,
%%                        detectionSize = 1e6, 
%%                        detectionDrivers = 4,
%%                        sampleEvery = 2,
%%                        max.num.tries = 100,
%%                        keepEvery = -9,
%%                        initSize = 2000,
%%                        onlyCancer = TRUE)
%% tmp

%% @ 

(We use a somewhat larger mutation rate than usual, so that the simulation
runs quickly.)


<<fig.width=6>>=

set.seed(4321)
tmp <-  oncoSimulIndiv(examplesFitnessEffects[["o3"]],
                       model = "McFL", 
                       mu = 5e-5,
                       detectionSize = 1e8, 
                       detectionDrivers = 3,
                       sampleEvery = 0.025,
                       max.num.tries = 10,
                       keepEvery = 5,
                       initSize = 2000,
                       finalTime = 6000,
                       onlyCancer = FALSE) 
@ 

We show a line plot and a stacked plot of the drivers:

\clearpage

<<tmpmx1,fig.width=6.5, fig.height=4.1>>=
par(las = 1, cex = 0.85)
plot(tmp, addtot = TRUE, log = "", plotDiversity = TRUE)
@ 
<<tmpmx2,fig.width=6.5, fig.height=4.1>>=
par(las = 1, cex = 0.85)
plot(tmp, type = "stacked", plotDiversity = TRUE, 
     ylim = c(0, 5500), legend.ncols = 4)
@ 

In this example (and at least under Linux, with both GCC and clang), we
can see that the mutants with three drivers do not get established when we
stop the simulation at time 6000. This is one case where the summary
statistics about the number of drivers say little of value, as fitness is
very different for genotypes with the same number of mutations, and does
not increase in a simple way with the number of drivers:

<<>>=
evalAllGenotypes(examplesFitnessEffects[["o3"]], addwt = TRUE)
@ 

A few figures could help:

<<tmpmx3,fig.width=6.5, fig.height=10>>=
par(mfrow = c(2, 1))
plot(tmp, show = "genotypes", ylim = c(0, 5500), legend.ncols = 3)
plot(tmp, show = "genotypes", type = "line", ylim = c(1, 6000))
@ 

(When reading the figure legends, recall that genotype  $x > y\ \_\ z$ is
one where a mutation in ``x'' happened before a mutation in ``y'', and
there is also a mutation in ``z'' for which order does not matter. Here,
there are no genes for which order does not matter and thus there is
nothing after the ``\_'').


In the following run, in contrast, the clones with three drivers end up
displacing those with two by the time we stop; moreover, notice how those
with one driver never really grow to a large population size, so we
basically go from a population made of clones with zero drivers to a
population made of clones with two or three drivers:

%%<<fig.width=6>>=
<<>>=
set.seed(15)
tmp <-  oncoSimulIndiv(examplesFitnessEffects[["o3"]],
                       model = "McFL", 
                       mu = 5e-5,
                       detectionSize = 1e8, 
                       detectionDrivers = 3,
                       sampleEvery = 0.015,
                       max.num.tries = 10,
                       keepEvery = 5,
                       initSize = 2000,
                       finalTime = 20000,
                       onlyCancer = FALSE,
                       extraTime = 1500)
tmp
@ 

\clearpage

We first use a drivers plot:
<<tmpmdx5,fig.width=6.5, fig.height=4>>=
par(las = 1, cex = 0.85)
plot(tmp, addtot = TRUE, log = "", plotDiversity = TRUE)
@ 
<<tmpmdx6,fig.width=6.5, fig.height=4>>=
par(las = 1, cex = 0.85)
plot(tmp, type = "stacked", plotDiversity = TRUE,
     legend.ncols = 4, ylim = c(0, 5200))
@ 

\clearpage

Now show the genotypes explicitly:
<<tmpmdx7,fig.width=6.5, fig.height=5.3>>=
## Make it easier to tell apart the most abundant
## genotypes by sorting colors differently
## via breakSortColors.
## Modify the number of legend columns (legend.ncols) so the
## legend is legible and does not overlap the plot.
par(las = 1, cex = 0.85)
plot(tmp, show = "genotypes", breakSortColors = "distave",
     plotDiversity = TRUE, legend.ncols = 4,
     ylim = c(0, 5300))
@


As before, the argument \Rcode{plotDiversity = TRUE} asks to show a small
plot on top with Shannon's diversity index. Here, as before, the quick
clonal expansion of the clone with two drivers leads to a sudden drop in
diversity (for a while, the population is made virtually of a single
clone). Note, however, that compared to section \ref{mcf50070}, we are
modeling here a scenario with very few genes, and correspondingly very few
possible genotypes, and thus it is not strange that we observe very little
diversity.

%% These patterns, however, are not always present

%% <<fig.width=6>>=

%% set.seed(7654) 
%% tmp <-  oncoSimulIndiv(examplesFitnessEffects[["o3"]],
%%                        model = "McFL", 
%%                        mu = 5e-5,
%%                        detectionSize = 1e8, 
%%                        detectionDrivers = 3,
%%                        sampleEvery = 0.015,
%%                        max.num.tries = 10,
%%                        keepEvery = 5,
%%                        initSize = 2000,
%%                        finalTime = 10000,
%%                        onlyCancer = FALSE,
%%                        extraTime = 10)
%% tmp
%% plot(tmp, addtot = TRUE, log = "")

%% @ 


%% Although in other runs we do not reach the three gene mutant and continue
%% with clone competition for a long time:


(We have used \Rcode{extraTime} to continue the simulation well past the
point of detection, here specified as three drivers. Instead of specifying
\Rcode{extraTime} we can set the \Rcode{detectionDrivers} value to a
number larger than the number of existing possible drivers, and the
simulation will run until \Rcode{finalTime} if \Rcode{onlyCancer =
  FALSE}.)


\clearpage

\subsection{Numerical issues with Bozic}\label{ex-0-death}

As we mentioned above (section \ref{fit-neg-pos}), death rates of 0 can
lead to trouble when using Bozic's model:

%% <<>>=

%% set.seed(987)
%% ie3 <- allFitnessEffects(noIntGenes = rexp(3))
%% evalAllGenotypes(ie3, order = FALSE, addwt = TRUE, 
%%                  model = "Bozic")

%% ie3_b <- oncoSimulIndiv(ie3, model = "Bozic")

%% evalAllGenotypes(ie3, order = FALSE, addwt = TRUE)

%% ie3_e <- oncoSimulIndiv(ie3, model = "Exp")

%% @ 

%% Even simpler

%% ## set.seed(987)
<<>>=
i1 <- allFitnessEffects(noIntGenes = c(1))
evalAllGenotypes(i1, order = FALSE, addwt = TRUE, 
                 model = "Bozic")
i1_b <- oncoSimulIndiv(i1, model = "Bozic")

@ 


Of course, there is no problem in using the above with other models:

<<>>=
evalAllGenotypes(i1, order = FALSE, addwt = TRUE, 
                 model = "Exp")
i1_e <- oncoSimulIndiv(i1, model = "Exp")
summary(i1_e)
@ 
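
A minimal sketch of why \Rcode{i1} is troublesome under Bozic's model,
assuming that in that model the death rate of a genotype is the product of
the terms $1 - s_i$ over its mutated genes (with the birth rate fixed at
1). The helper \Rcode{bozicDeathSketch} is hypothetical and is not part of
the package:

<<eval=FALSE>>=
## Hypothetical helper: death rate under the assumed Bozic
## parameterization, given the fitness effects s of the mutated genes.
bozicDeathSketch <- function(s) prod(1 - s)

bozicDeathSketch(numeric(0))   ## wild type, no mutations: 1
bozicDeathSketch(c(0.1, 0.2))  ## two mild drivers: 0.72
bozicDeathSketch(1)            ## the single gene in i1, s = 1: 0
@ 

Under this assumption any gene with $s = 1$ drives the death rate to
exactly 0, which is the situation that the simulation code complains
about.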


\subsection{Interactive graphics}\label{interactive}

It is possible to create interactive stacked area and stream plots using
the \Rpackage{streamgraph} package, available from
\Burl{https://github.com/hrbrmstr/streamgraph}.  However, that package is
not available as a CRAN or Bioconductor package, and thus we cannot depend
on it for this vignette (or this package). You can, however, paste the
code below and run it locally.

Before calling the \Rfunction{streamgraph} function, though, we need to
convert the data from the original format in which it is stored into
``long format''. A simple convenience function is provided as
\Rfunction{OncoSimulWide2Long} in \Biocpkg{OncoSimulR}.


As an example, we will use the data we generated above for section
\ref{bauer2}.


<<eval=FALSE>>=
## Convert the data
lb1 <- OncoSimulWide2Long(b1)

## Install the streamgraph package from github and load
library(devtools)
devtools::install_github("hrbrmstr/streamgraph")
library(streamgraph)

## Stream plot for Genotypes
sg_legend(streamgraph(lb1, Genotype, Y, Time, scale = "continuous"),
              show=TRUE, label="Genotype: ")

## Stacked area plot, this time using the pipe
streamgraph(lb1, Genotype, Y, Time, scale = "continuous", 
          offset = "zero") %>% sg_legend(show=TRUE, label="Genotype: ")
@ 
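
To make the ``long format'' concrete, here is a minimal base R sketch of
the kind of reshaping involved. It is not the code of
\Rfunction{OncoSimulWide2Long}: the toy data frame \Rcode{wide} is made
up, and the column names \Rcode{Time}, \Rcode{Y} and \Rcode{Genotype} are
simply those used in the \Rfunction{streamgraph} calls above (the real
output might carry additional columns):

<<eval=FALSE>>=
## Toy "wide" data: one row per time point, one column per clone
wide <- data.frame(Time = c(0, 1, 2),
                   WT = c(100, 90, 80),
                   A = c(0, 10, 15),
                   "A, B" = c(0, 0, 5),
                   check.names = FALSE)

## "Long" data: one row per (time point, clone) combination
cloneCols <- setdiff(names(wide), "Time")
long <- data.frame(
    Time = rep(wide$Time, times = length(cloneCols)),
    Y = unlist(wide[cloneCols], use.names = FALSE),
    Genotype = rep(cloneCols, each = nrow(wide)),
    stringsAsFactors = FALSE)
long
@ 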


%% (Note: the idiomatic way of doing the above with \CRANpkg{tidyr} is using 
%% \verb= %>% =, the pipe operator. Something like 
%% \begin{verbatim}
%% streamgraph(lb1, Genotype, Y, Time, scale = ``continuous'',  
%%            offset = ``zero'') \%>\%                                                                                                                                sg_legend(show=TRUE, label=``Genotype: '') 
%% \end{verbatim}

%% but it gives me problems with knitr, etc).


\section{Sampling multiple simulations}\label{sample}

Often, you will want to simulate multiple runs of the same scenario, and
do something with them. Conceptually, the first step is to run multiple
simulations and the second is to sample them.

We will use the ``pancreas'' example from section \ref{pancreas} above.
<<>>=


pancrPop <- oncoSimulPop(10, pancr,
                         detectionSize = 1e7,
                         keepEvery = 10,
                         mc.cores = 2)

summary(pancrPop)

@ 

The above runs the simulation process 10 times, and stores the
results. We can then sample from them:

<<>>=
pancrSPop <- samplePop(pancrPop)
pancrSPop
@ 


But if we are only interested in the final matrix of populations by
mutations, the above is wasteful, because we fully store all of the
simulations (in the call to \Rfunction{oncoSimulPop}) and then sample (in
the call to \Rfunction{samplePop}). In particular, data from every
sampling time (as given by \Rcode{sampleEvery}) is preserved. It is only
in the call to \Rfunction{samplePop} that we actually sample the data.


An alternative approach is to use the function
\Rfunction{oncoSimulSample}. The output is directly the matrix (and a
little bit of summary from each run), and during the simulation it only
stores one time point. 


<<>>=

pancrSamp <- oncoSimulSample(10, pancr)
pancrSamp

@ 


\subsection{Differences between \Rfunction{samplePop} and
  \Rfunction{oncoSimulSample}}\label{diffsample}

\Rfunction{samplePop} provides two sampling times: ``last'' and
``uniform''. ``last'' means sampling each individual in the very last time
period of its simulation. ``uniform'' means sampling each individual at a
time chosen uniformly from all the times recorded in the simulation
between the time when the first driver appeared and the final time
period. With ``uniform'', it is almost sure that different individuals
will be sampled at different times. ``last'' does not guarantee that
different individuals will be sampled at the same time unit, only that all
will be sampled in the last time unit of their simulation.
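
The two sampling times can be sketched in base R as follows. This is only
an illustration of the description above, not the code of
\Rfunction{samplePop}; the recorded times and the time of appearance of
the first driver are made-up values:

<<eval=FALSE>>=
set.seed(2)
## Hypothetical times stored for one simulated individual
recordedTimes <- seq(0, 500, by = 10)
## Hypothetical time at which the first driver appeared
firstDriverTime <- 120

## "last": the very last recorded time of this simulation
tLast <- max(recordedTimes)

## "uniform": one recorded time, chosen uniformly among those
## between the first driver and the final time
candidates <- recordedTimes[recordedTimes >= firstDriverTime]
tUnif <- candidates[sample.int(length(candidates), 1)]

c(last = tLast, uniform = tUnif)
@ 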


With \Rfunction{oncoSimulSample} we obtain samples that correspond to
\Rcode{timeSample = ``last''} in \Rfunction{samplePop} by specifying a
single value for \Rcode{detectionSize} and
\Rcode{detectionDrivers}. The data from each simulation will
correspond to the time point at which those are reached (analogous to
\Rcode{timeSample = ``last''}). How about uniform sampling? We pass a
vector of \Rcode{detectionSize} and \Rcode{detectionDrivers},
where each value of the vector comes from a uniform distribution. This is
not identical to the ``uniform'' sampling of \Rfunction{samplePop},
as we are not sampling uniformly over all time periods, but are stopping
at uniformly distributed values over the stopping conditions. Arguably,
however, the procedure in \Rfunction{oncoSimulSample} might be closer to
what we mean by ``uniformly sampled over the course of the disease'' if
that course is measured in terms of drivers or size of tumor.
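
A toy calculation shows why sampling uniformly over time and stopping at
uniformly distributed sizes are not the same thing. The deterministic
exponential growth used here, and all of its parameter values, are
assumptions made only for illustration; nothing below calls
\Biocpkg{OncoSimulR}:

<<eval=FALSE>>=
## Toy model (an assumption): deterministic exponential growth
## from n0 cells at rate r, followed until maxSize cells.
set.seed(1)
n0 <- 1e3
r <- 0.05
maxSize <- 1e8
tEnd <- log(maxSize / n0) / r  ## time at which maxSize is reached

## (a) Sample uniformly over time
timeA <- runif(1000, 0, tEnd)

## (b) Stop at uniformly distributed sizes, and find the time
##     at which each of those sizes is reached
sizeB <- runif(1000, n0, maxSize)
timeB <- log(sizeB / n0) / r

## Under (b) the sampling times pile up near the end of the process
summary(timeA)
summary(timeB)
@ 

In this toy model most of the growth in size happens late, so stopping
conditions that are uniform over size translate into sampling times that
are far from uniform.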


As an example, if you look at the output above, the object ``pancrSamp''
contains some simulations that have only a few drivers because those
simulations were set to run only until they had just a small number of
cells.


An additional advantage of \Rfunction{oncoSimulSample} is that we can
specify arbitrary sampling schemes, just by passing the appropriate
vectors of \Rcode{detectionSize} and \Rcode{detectionDrivers}. A
disadvantage is that if we change the stopping conditions we cannot just
resample the data, but we need to run the simulations again.


There is no difference between \Rfunction{oncoSimulSample} and
\Rfunction{oncoSimulPop} + \Rfunction{samplePop} in terms of the
\Rcode{typeSample} argument (whole tumor or single cell).


Finally, there are some additional differences between the two
functions. \Rfunction{oncoSimulPop} can run parallelized (it uses
\Rfunction{mclapply}). This is not done with \Rfunction{oncoSimulSample}
because this function is designed for simulation experiments where you
want to examine many different scenarios simultaneously. Thus, we provide
additional stopping criteria (\Rcode{max.wall.time.total} and
\Rcode{max.num.tries.total}) to determine whether to continue running the
simulations; these bound the total running time of all the simulations in
a call to \Rfunction{oncoSimulSample}. And, if you are running multiple
different scenarios, you might want to make multiple, separate,
independent calls (e.g., from different R processes) to
\Rfunction{oncoSimulSample}, instead of relying on \Rfunction{mclapply},
since this is likely to lead to better usage of multiple cores/CPUs if you
are examining a large number of different scenarios.


\subsection{Dealing with errors in
  \Rfunction{oncoSimulPop}}\label{errorosp}

When running OncoSimulR under Windows, \Rfunction{mclapply} does not use
multiple cores, and errors from \Rfunction{oncoSimulPop} are reported
directly. For example:

<<>>=
## This code will only be evaluated under Windows
if(.Platform$OS.type == "windows")
    try(pancrError <- oncoSimulPop(10, pancr,
                               initSize = 1e-5,
                               detectionSize = 1e7,
                               keepEvery = 10,
                               mc.cores = 2))
@ 


Under POSIX operating systems (e.g., GNU/Linux or Mac OS X)
\Rfunction{oncoSimulPop} can run parallelized by calling
\Rfunction{mclapply}. Now, suppose you did something like

<<>>=
## Do not run under Windows
if(.Platform$OS.type != "windows")
    pancrError <- oncoSimulPop(10, pancr,
                               initSize = 1e-5,
                               detectionSize = 1e7,
                               keepEvery = 10,
                               mc.cores = 2)
@ 

The warning you are seeing tells you there was an error in the functions
called by \Rfunction{mclapply}. If you check the help for
\Rfunction{mclapply} you'll see that it returns a try-error object, so we
can inspect it. For instance, we could do:

<<eval=FALSE>>=
pancrError[[1]]
@ 

But the output of this call might be easier to read:

<<eval=FALSE>>=
pancrError[[1]][1]
@ 
 
And from here you could see the error that was returned by
\Rfunction{oncoSimulIndiv}: \texttt{initSize < 1} (which is indeed true:
we pass \texttt{initSize = 1e-5}).
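
If only some of the runs fail, it can be handy to locate them. The
following is a minimal base R sketch of that idea; to keep it portable it
uses \Rfunction{lapply} with \Rfunction{try} in place of
\Rfunction{mclapply}, and \Rcode{runOne} is a hypothetical stand-in for a
simulation that fails for some runs:

<<eval=FALSE>>=
## Hypothetical stand-in for a simulation that fails for some runs
runOne <- function(i) {
    if (i %% 2 == 0) stop("initSize < 1")
    i
}

## try() lets the remaining runs go on after an error and
## returns an object of class "try-error" for the failed ones
res <- lapply(1:4, function(i) try(runOne(i), silent = TRUE))

## Which runs failed?
failed <- vapply(res, inherits, logical(1), what = "try-error")
failed

## Error message of the first failed run
conditionMessage(attr(res[[which(failed)[1]]], "condition"))
@ 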


\subsection{What can you do with the simulations?}

This is up to you. Below (section \ref{sample-1}) we show an example where
we infer an oncogenetic tree from simulated data.


%% As an example, we can try to infer an oncogenetic tree
%% (and plot it) using the \CRANpkg{Oncotree} package \cite{Oncotree} after
%% getting a quick look at the marginal frequencies of events:

%% <<fig.width=4, fig.height=4>>=
%% colSums(pancrSamp)/nrow(pancrSamp)

%% require(Oncotree)
%% otp <- oncotree.fit(pancrSamp)
%% plot(otp)
%% @ 


\subsection{Whole tumor sampling and genotypes}\label{wtsampl}

You are obtaining genotypes, regardless of order.  When we use ``whole
tumor sampling'', it is the frequency of the mutations in each gene that
counts, not the order. So, for instance, ``c, d'' and ``d, c'' both
contribute to the counts of ``c'' and ``d''. Similarly, when we use single
cell sampling, we obtain a genotype defined in terms of mutations, but
there might be multiple orders that give this genotype. For example, $d >
c$ and $c > d$ both  give you a genotype with ``c'' and ``d'' mutated, and
thus in the output you can have two columns with both genes mutated.
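
The idea behind whole tumor sampling can be sketched in base R. The
population below is made up, the detection threshold of 0.5 is an
assumption, and the code is an illustration of the reasoning, not the
implementation used by \Rfunction{samplePop}:

<<eval=FALSE>>=
## Hypothetical final population: ordered genotypes and cell counts
clones <- c("c > d" = 300, "d > c" = 300, "c" = 400)
genes <- c("c", "d")

## Does each clone (rows) carry each gene (columns)?
## The order of the mutations is ignored.
carries <- sapply(genes, function(g)
    vapply(strsplit(names(clones), ">", fixed = TRUE),
           function(x) g %in% trimws(x), logical(1)))

## Frequency of each mutated gene in the whole tumor
freq <- colSums(carries * clones) / sum(clones)
freq

## Genes called as mutated with the assumed threshold of 0.5
as.integer(freq > 0.5)
@ 

Here ``d'' is called as mutated because the two orderings that contain it
jointly make up 60\% of the cells, even though neither of them reaches the
threshold on its own.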


%% in an attempt to explain it, this just makes it too confusing. The
%% above is enough.
%% \subsection{What if there is order?}\label{sim-order}

%% Consider the following example (I fix the seed and use a single core, so
%% no parallelization, to make sure we can reproduce the results)

%% <<>>=

%% oe8 <- allFitnessEffects(orderEffects = c(
%%                              "M > F > M" = 0,
%%                              "D > F > M" = 0.1,
%%                              "F > D > M" = 0.2
%% ),
%%                       epistasis = c("D" = 0.02, "M" = 0.02, "F" = 0.02),
%%                         geneToModule =
%%                             c("Root" = "Root",
%%                               "M" = "m",
%%                               "F" = "f",
%%                               "D" = "d") )

%% evalAllGenotypes(oe8)

%% set.seed(678) 
%% oe8P1 <- oncoSimulPop(8, oe8,
%%                      model = "Exp", 
%%                       detectionSize = 1e8, keepEvery = 10, mc.cores = 1)
%% lapply(oe8P1, print)


%% set.seed(678) 
%% oe8P1 <- oncoSimulPop(1, oe8,
%%                      model = "McFL", 
%%                       detectionDrivers = 2, 
%%                       keepEvery = 10, mc.cores = 1)
%% lapply(oe8P1, print)

%% @ 


%% <<>>=

%% o8 <- allFitnessEffects(orderEffects = c(
%%                             "F > D" = 0,
%%                             "D > F" = 0.14,
%%                             "D > M" = 0.13,
%%                             "F > M" = 0.12,
%%                             "M > D" = 0.15),
%%                       epistasis = c("D" = 0.01, "M" = 0.01, "F" = 0.02),
%%                         geneToModule =
%%                             c("Root" = "Root",
%%                               "M" = "m",
%%                               "F" = "f",
%%                               "D" = "d") )

%% evalAllGenotypes(o8)

%% set.seed(678) 
%% o8P1 <- oncoSimulPop(8, o8,
%%                      model = "Exp", keepEvery = 10, mc.cores = 1)
%% ## lapply(o8P1, print)
%% @ 

%% Now, if we look at the sixth population we see

%% <<>>=
%% o8P1[[6]]
%% @ 

%% Obviously, in terms of the genes that are mutated, both ``d, f, m'' and
%% ``d, m, f'' have the same genes mutated so if we sample, for instance doing

%% <<>>=

%% @ 


%% o9 <- allFitnessEffects(orderEffects = c(
%%                             "F > D > M" = 0,
%%                             "D > F > M" = 0.14,
%%                             "D > M > F" = 0.13,
%%                             "D > M"     = 0.12,
%%                             "M > D"     = 0.15),
%%                       epistasis = c("D:-M" = 0.05, "M:-D" = 0.04),
%%                         geneToModule =
%%                             c("Root" = "Root",
%%                               "M" = "m",
%%                               "F" = "f",
%%                               "D" = "d") )


%% set.seed(11)
%% o9P1 <- oncoSimulPop(8, o9,
%%                      model = "Exp", keepEvery = 10, mc.cores = 1)
%% lapply(o9P1, print)


%% @ 


%% \subsection{Testing of mappings}

%% The mapping of restriction tables, epistasis, and order effects to
%% fitness, especially when there are modules, is a delicate part of the
%% code: reasonable cases are straightforward to deal with, but there are
%% many ways to shoot oneself in the foot. That is why we have placed lots of
%% pre- and post-condition checks in the code (both R and C++), and we have a
%% comprehensive set of tests in file zz. You are welcome to suggest more
%% tricky scenarios (and tests for them).


%% \section{Introduction}

%% This vignette presents the OncoSimulR package. OncoSimulR allows you to
%% simulate tumor progression using several models of tumor progression. In
%% these simulations you can restrict the order in which mutations can
%% accumulate. For instance, you can restrict the allowed order as specified,
%% for instance, in Oncogenetic Tree (OT; \cite{Desper1999JCB, Szabo2008}) or
%% Conjunctive Bayesian Network (CBN; \cite{Beerenwinkel2007, Gerstung2009,
%%   Gerstung2011}) models. Moreover, you can add passenger mutations to the
%% simulations. The models so far implemented are all continuous time models,
%% which are simulated using the BNB algorithm of Mather et
%% al.\ \cite{Mather2012}. This is a summary of some of the key features:


%% \begin{itemize}
%% \item You can pass arbitrary restrictions as specified by OTs or CBNs.
  
%% \item You can add passenger mutations.
  
%% \item You can allow for deviations from the OT and CBN models, specifying
%%   a penalty for such deviations (the $s_h$ parameter).
  
%% \item Right now, three different models are available, two that lead to
%%   exponential growth, one of them loosely based on Bozic et al.\
%%   \cite{Bozic2010}, and another that leads to logistic-like growth, based
%%   on McFarland et al.\ \cite{McFarland2013}.
%% \item Simulations are generally very fast as I use the BNB algorithm
%%   implemented in C++.
%% \end{itemize}


%% Further details about the motivation for wanting to
%% simulate data this way can be found in \cite{ot-biorxiv}, where additional
%% comments about model parameters and caveats are discussed. The Java
%% program by \cite{Reiter2013a} offers somewhat similar functionality, but
%% they are restricted to at most four drivers, you cannot use arbitrary CBNs
%% or OTs to specify order restrictions, there is no allowance for
%% passengers, and a single type of model (a discrete time Galton-Watson
%% process) is implemented.


%% Using this package will often involve the following steps:

%% \begin{enumerate}
%% \item Specify the restrictions in the order of mutations: section \ref{poset}.
%% \item Simulate cancer progression: section \ref{simul}. You can simulate
%%   for a single subject or for a set of subjects. You will need to
%%   \begin{itemize}
%%   \item Decide on a model (e.g., Bozic or McFarland).
%%   \item Specify the parameters of the model.
%%   \end{itemize}
%%   Of course, at least for initial playing around, you can use the defaults.
  
%% \item Sample from the simulated data: section \ref{sample}, and do
%%   something with those simulated data (e.g., fit an OT model to
%%   them). What you do with the data, however, is outside the scope of this
%%   package.   
%% \end{enumerate}


%% Before anything else, let us load the package. We also explicitly load
%% \Biocpkg{graph} for the vignette to work (you do not need that for your
%% usual interactive work).

%% <<>>=
%% library(OncoSimulR)
%% library(graph)
%% @ 


%% \section{Specifying restrictions: posets}\label{poset}

%% How to specify the restrictions is shown in the help for
%% \Rfunction{poset}. It is often useful, to make sure you did not make any
%% mistakes, to plot the poset. This is from the examples (we use an ``L''
%% after a number so that the numbers are integers, not doubles; we could
%% alternatively have modified \texttt{storage.mode}).

%% <<fig.height=3>>=
%% ## Node 2 and 3 depend on 1, and 4 depends on no one
%% p1 <- cbind(c(1L, 1L, 0L), c(2L, 3L, 4L))
%% plotPoset(p1, addroot = TRUE)
%% @ 

%% <<fig.height=3>>=
%% ## A simple way to create a poset where no gene (in a set of 15) depends
%% ## on any other.
%% p4 <- cbind(0L, 15L)
%% plotPoset(p4, addroot = TRUE)
%% @ 


%% Specifying posets is actually straightforward. For instance, we can
%% specify the pancreatic cancer poset in Gerstung et al.\
%% \cite{Gerstung2011} (their figure 2B, left). We specify the poset using
%% numbers, but for nicer plotting we will use names (KRAS is 1, SMAD4 is 2,
%% etc). This example is also in the help for \Rfunction{poset}:

%% <<fig.height=3>>=
%% pancreaticCancerPoset <- cbind(c(1, 1, 1, 1, 2, 3, 4, 4, 5),
%%                                c(2, 3, 4, 5, 6, 6, 6, 7, 7))
%% storage.mode(pancreaticCancerPoset) <- "integer"
%% plotPoset(pancreaticCancerPoset,
%%           names = c("KRAS", "SMAD4", "CDNK2A", "TP53",
%%                     "MLL3","PXDN", "TGFBR2"))
%% @
%% \section{Simulating cancer progression}\label{simul}


%% We can simulate the progression in a single subject. Using an example
%% very similar to the one in the help:


%% <<echo=FALSE,results='hide',error=FALSE>>=
%% options(width=60)
%% @ 

%% <<>>=
%% ## use poset p1101
%% data(examplePosets)
%% p1101 <- examplePosets[["p1101"]]

%% ## Bozic Model
%% b1 <- oncoSimulIndiv(p1101, keepEvery = 15)
%% summary(b1)
%% @ 


%% The first thing we do is make it simpler (for future examples) to use a
%% set of restrictions. In this case, those encoded in poset p1101. Then, we
%% run the simulations and look at a simple summary and a plot. %% We explicitly
%% %% set \texttt{silent = TRUE} to prevent the vignette from filling up with
%% %% intermediate output.

%% If you want to plot the trajectories, it is better to keep more frequent
%% samples,  so you can see when clones appear:

%% <<fig.height=5, fig.width=5>>=
%% b2 <- oncoSimulIndiv(p1101, keepEvery = 1)
                    
%% summary(b2)
%% plot(b2)
%% @ 


%% The following is an example where we do not care about passengers, but we
%% want to use a different graph, and we want a few more drivers before
%% considering cancer has been reached. And we allow it to run for longer.
%% Note that in the McF model \texttt{detectionSize} really plays no
%% role. Note also how we pass the poset: it is the same as before, but now
%% we directly access the poset in the list of posets.

%% <<>>=

%% m2 <- oncoSimulIndiv(examplePosets[["p1101"]], model = "McFL", 
%%                      numPassengers = 0, detectionDrivers = 8, 
%%                      mu = 5e-7, initSize = 4000, 
%%                      sampleEvery = 0.025,
%%                      finalTime = 25000, keepEvery = 5, 
%%                      detectionSize = 1e6) 
%% plot(m2, addtot = TRUE, log = "")

%% @ 




\subsection{Can I start the simulation from a specific mutant?}\label{initmut}

Yes. In v.1 the initial mutant can carry only a single mutated gene. In
v.2, however, you can specify the genotype of the initial mutant with the
same flexibility as in \Rfunction{evalGenotype}. Here we show a few
examples (we plot the phylogeny of the clones, discussed in section
\ref{phylog}, so that you can see which clones appear, and from which).

<<fig.height=6>>=

o3init <- allFitnessEffects(orderEffects = c(
                            "M > D > F" = 0.99,
                            "D > M > F" = 0.2,
                            "D > M"     = 0.1,
                            "M > D"     = 0.9),
                        noIntGenes = c("u" = 0.01, 
                                       "v" = 0.01,
                                       "w" = 0.001,
                                       "x" = 0.0001,
                                       "y" = -0.0001,
                                       "z" = -0.001),
                        geneToModule =
                            c("M" = "m",
                              "F" = "f",
                              "D" = "d") )

oneI <- oncoSimulIndiv(o3init, model = "McFL",
                       mu = 5e-5, finalTime = 500,
                       detectionDrivers = 3,
                       onlyCancer = FALSE,
                       initSize = 1000,
                       keepPhylog = TRUE,
                       initMutant = c("m > u > d")
                       )
plotClonePhylog(oneI, N = 0)


## 
ospI <- oncoSimulPop(4, 
                    o3init, model = "Exp",
                    mu = 5e-5, finalTime = 500,
                    detectionDrivers = 3,
                    onlyCancer = TRUE,
                    initSize = 10,
                    keepPhylog = TRUE,
                    initMutant = c("d > m > z"),
                    mc.cores = 2
                    )

op <- par(mar = rep(0, 4), mfrow = c(2, 2))
plotClonePhylog(ospI[[1]])
plotClonePhylog(ospI[[2]])
plotClonePhylog(ospI[[3]])
plotClonePhylog(ospI[[4]])
par(op)


ossI <- oncoSimulSample(4, 
                        o3init, model = "Exp",
                        mu = 5e-5, finalTime = 500,
                        detectionDrivers = 2,
                        onlyCancer = TRUE,
                        initSize = 10,
                        initMutant = c("z > d"),
                        thresholdWhole = 1 ## check presence of initMutant
                    )

## No phylogeny is kept with oncoSimulSample, but look at the 
## OccurringDrivers and the sample

ossI$popSample
ossI$popSummary[, "OccurringDrivers", drop = FALSE]


@ 


\section{Showing the true phylogenetic relationships of clones}\label{phylog}

If you run simulations with the \texttt{keepPhylog = TRUE} argument, the
simulations keep track of when every clone is generated, and that will
allow us to see the true phylogenetic relationships of clones. (This is
disabled by default: the code runs a little bit slower and the result is
larger.)


Let us re-run a previous example:

<<>>=

set.seed(15)
tmp <-  oncoSimulIndiv(examplesFitnessEffects[["o3"]],
                       model = "McFL", 
                       mu = 5e-5,
                       detectionSize = 1e8, 
                       detectionDrivers = 3,
                       sampleEvery = 0.015,
                       max.num.tries = 10,
                       keepEvery = 5,
                       initSize = 2000,
                       finalTime = 20000,
                       onlyCancer = FALSE,
                       extraTime = 1500,
                       keepPhylog = TRUE)
tmp
@ 

We can plot the phylogenetic relationships\footnote{There are several
  packages in R devoted to phylogenetic inference and related issues, for
  instance \CRANpkg{ape}. I have not used that infrastructure because of
  our very specific needs and circumstances; for instance, internal nodes
  are observed, we can have networks instead of trees, and we have no
  uncertainty about when events occurred.} of every clone ever created
with fitness larger than 0 (clones without viability are never shown):

<<>>=
plotClonePhylog(tmp, N = 0)
@ 

However, we often only want to show clones that exist (have number of
cells $>0$) at a certain time, while of course showing all of their
ancestors, even if those are now extinct (i.e., regardless of their
current numbers).

<<>>=
plotClonePhylog(tmp, N = 1)
@ 

If we set \texttt{keepEvents = TRUE} the arrows show how many times each
clone appeared (the next chunk can take a while to run):

<<pcpkeepx1>>=
plotClonePhylog(tmp, N = 1, keepEvents = TRUE)
@ 

And we can plot the phylogeny so the vertical axis is proportional to time
(though you might see overlap of nodes if a child node appeared shortly
after the parent):

<<>>=
plotClonePhylog(tmp, N = 1, timeEvents = TRUE)
@ 

We can obtain the adjacency matrix with:

<<fig.keep="none">>=
get.adjacency(plotClonePhylog(tmp, N = 1, returnGraph = TRUE))

@ 


We can see another example here:

<<>>=

set.seed(456)
mcf1s <-  oncoSimulIndiv(mcf1,
                         model = "McFL", 
                         mu = 1e-7,
                         detectionSize = 1e8, 
                         detectionDrivers = 100,
                         sampleEvery = 0.02,
                         keepEvery = 2,
                         initSize = 2000,
                         finalTime = 1000,
                         onlyCancer = FALSE,
                         keepPhylog = TRUE)

@ 

Showing only clones that exist at the end of the simulation (and all their
parents):

<<>>=
plotClonePhylog(mcf1s, N = 1)
@ 

Notice that the labels here do not have a ``\_'', since there were no order
effects in fitness. However, the labels show the genes that are
mutated, just as before.

Similar, but with vertical axis proportional to time:


<<>>=
plotClonePhylog(mcf1s, N = 1, timeEvents = TRUE)
@ 

What about those that existed in the last 200 time units?
<<>>=
plotClonePhylog(mcf1s, N = 1, t = c(800, 1000))
@ 

And try now to show also when the clones appeared (we restrict the time
to between 900 and 1000, to avoid too much clutter):
<<>>=
plotClonePhylog(mcf1s, N = 1, t = c(900, 1000), timeEvents = TRUE)
@ 

(By playing with \texttt{t}, it should be possible to obtain animations of
the phylogeny. We will not pursue it here.)


If the previous graph seems cluttered, we can represent it in a different
way by calling \CRANpkg{igraph} directly after storing the graph and using
the default layout:

<<fig.keep="none">>=
g1 <- plotClonePhylog(mcf1s, N = 1, t = c(900, 1000), returnGraph = TRUE)
@ 

<<>>=
plot(g1)
@ 

which might make it easier to show complex relationships or to identify
central or key clones.


%% There is support in R for phylog, blablabal. But does not work for our
%% specific problem. So use igraph blablabla


%% <<>>=
%% plotClonePhylog(mcf1s, TRUE, FALSE, FALSE)
%% plotClonePhylog(tmp, FALSE, FALSE, FALSE)
%% @ 


It is of course quite possible that, especially if we consider few genes,
our phylogeny will be a network, not a tree, as the same child node can
have multiple parents. You can play with this example, modified from one
we saw before (section \ref{mn1}):

<<eval=FALSE>>=
op <- par(ask = TRUE)
while(TRUE) {
    tmp <- oncoSimulIndiv(smn1, model = "McFL",
                          mu = 5e-5, finalTime = 500,
                          detectionDrivers = 3,
                          onlyCancer = FALSE,
                          initSize = 1000, keepPhylog = TRUE)
    plotClonePhylog(tmp, N = 0)
}
par(op)
@ 
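
Whether a phylogeny is a network rather than a tree can be read off the
adjacency matrix: a clone with more than one parent has a column sum
larger than one. The following is a minimal sketch in base R that uses a
small, invented adjacency matrix (rows are parents, columns are children;
the clone labels are made up for illustration) instead of the output of a
simulation. With a real run one would presumably start from the matrix
obtained with \Rfunction{get.adjacency}, as in the chunk shown earlier in
this section, converted with \Rfunction{as.matrix}; that step is an
assumption and is not run here.

<<>>=
## Toy adjacency matrix: rows are parents, columns are children.
## Clone labels are invented for illustration only.
clones <- c("WT", "A", "B", "A, B")
toyAdj <- matrix(0, nrow = 4, ncol = 4,
                 dimnames = list(clones, clones))
toyAdj["WT", "A"] <- 1
toyAdj["WT", "B"] <- 1
toyAdj["A", "A, B"] <- 1
toyAdj["B", "A, B"] <- 1

## Number of parents of each clone
nParents <- colSums(toyAdj > 0)
nParents

## Clones with more than one parent: the graph is not a tree
names(nParents)[nParents > 1]
@ 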


\subsection{Phylogenies from multiple runs}\label{phylogmult}

If you use \Rfunction{oncoSimulPop} you can store and plot the phylogenies
of the different runs:

<<>>=

oi <- allFitnessEffects(orderEffects =
               c("F > D" = -0.3, "D > F" = 0.4),
               noIntGenes = rexp(5, 10),
                          geneToModule =
                              c("F" = "f1, f2, f3",
                                "D" = "d1, d2") )
oiI1 <- oncoSimulIndiv(oi, model = "Exp")
oiP1 <- oncoSimulPop(4, oi,
                     keepEvery = 10,
                     mc.cores = 2,
                     keepPhylog = TRUE)

@ 

We will plot the first two:
<<fig.height=9>>=

op <- par(mar = rep(0, 4), mfrow = c(2, 1))
plotClonePhylog(oiP1[[1]])
plotClonePhylog(oiP1[[2]])
par(op)

@ 


This is so far disabled in function \Rfunction{oncoSimulSample}, since
that function is optimized for other uses. This might change in the future.


\section{Using v.1 posets and simulations}\label{v1}

It is strongly recommended that you use the new (v.2) procedures for
specifying fitness effects. However, the former v.1 procedures are still
available, with only very minor changes to function calls. What follows
is the former vignette. You might want to use v.1 because, for certain
models (e.g., a small number of genes, with restrictions as specified by
a simple poset), simulations might be faster with v.1 (fitness evaluation
is much simpler; we are working on further improving speed).

\subsection{Specifying restrictions: posets}\label{poset}

How to specify the restrictions is shown in the help for
\Rfunction{poset}. To make sure you did not make any mistakes, it is
often useful to plot the poset. This is from the examples (we use an
``L'' after a number so that the numbers are integers, not doubles; we
could alternatively have modified \texttt{storage.mode}).

<<fig.height=3>>=
## Nodes 2 and 3 depend on 1, and 4 depends on no other node
p1 <- cbind(c(1L, 1L, 0L), c(2L, 3L, 4L))
plotPoset(p1, addroot = TRUE)
@ 

<<fig.height=3>>=
## A simple way to create a poset where no gene (in a set of 15) depends
## on any other.
p4 <- cbind(0L, 15L)
plotPoset(p4, addroot = TRUE)
@ 
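
To make the two-column encoding explicit: each row of a poset is a
(parent, child) pair, and a 0 in the first column stands for the root.
The sketch below, in base R, turns such a matrix into an adjacency
matrix. The helper \texttt{posetAsAdjacency} is hypothetical (it is not
part of OncoSimulR), and it assumes that nodes that never appear as
children depend only on the root. It does not handle the shorthand used
for \texttt{p4}.

<<>>=
## Hypothetical helper, for illustration only
posetAsAdjacency <- function(poset) {
    stopifnot(is.matrix(poset), ncol(poset) == 2)
    nodes <- c("Root",
               as.character(sort(unique(poset[poset > 0]))))
    adj <- matrix(0L, nrow = length(nodes),
                  ncol = length(nodes),
                  dimnames = list(nodes, nodes))
    for (i in seq_len(nrow(poset))) {
        ## A 0 in the first column means "depends on the root"
        parent <- if (poset[i, 1] == 0) {
            "Root"
        } else {
            as.character(poset[i, 1])
        }
        adj[parent, as.character(poset[i, 2])] <- 1L
    }
    ## Assumption: nodes with no explicit parent hang from the root
    orphans <- setdiff(nodes[colSums(adj) == 0], "Root")
    adj["Root", orphans] <- 1L
    adj
}

## Same restrictions as in the first example above
pDemo <- cbind(c(1L, 1L, 0L), c(2L, 3L, 4L))
posetAsAdjacency(pDemo)
@ 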


Specifying posets is actually straightforward. For instance, we can
specify the pancreatic cancer poset in Gerstung et al.\
\cite{Gerstung2011} (their figure 2B, left). We specify the poset using
numbers, but for nicer plotting we will use names (KRAS is 1, SMAD4 is 2,
etc). This example is also in the help for \Rfunction{poset}:

<<fig.height=3>>=
pancreaticCancerPoset <- cbind(c(1, 1, 1, 1, 2, 3, 4, 4, 5),
                               c(2, 3, 4, 5, 6, 6, 6, 7, 7))
storage.mode(pancreaticCancerPoset) <- "integer"
plotPoset(pancreaticCancerPoset,
          names = c("KRAS", "SMAD4", "CDKN2A", "TP53",
                    "MLL3","PXDN", "TGFBR2"))

@
\subsection{Simulating cancer progression}\label{simul1}


We can simulate the progression in a single subject, using an example
very similar to the one in the help:


<<echo=FALSE,results='hide',error=FALSE>>=
options(width=60)
@ 

<<>>=
## use poset p1101
data(examplePosets)
p1101 <- examplePosets[["p1101"]]

## Bozic Model
b1 <- oncoSimulIndiv(p1101, keepEvery = 15)
summary(b1)
@ 


The first thing we do is make it simpler (for future examples) to use a
set of restrictions, in this case those encoded in poset p1101. Then we
run the simulation and look at a simple summary. %% We explicitly
%% set \texttt{silent = TRUE} to prevent the vignette from filling up with
%% intermediate output.

If you want to plot the trajectories, it is better to keep more frequent
samples,  so you can see when clones appear:

<<pb2bothx1,fig.height=5.5, fig.width=5.5>>=
b2 <- oncoSimulIndiv(p1101, keepEvery = 1)
summary(b2)
plot(b2)
@ 

As we have seen before, the stacked plot here is less useful and that is
why I do not evaluate that code for this vignette.

<<pbssttx1,eval=FALSE>>=
plot(b2, type = "stacked")
@ 


The following is an example where we do not care about passengers, but we
want to use a different graph, and we want a few more drivers before
considering cancer has been reached. We also allow it to run for longer.
Note that in the McFL model \texttt{detectionSize} really plays no
role. Note also how we pass the poset: it is the same as before, but now
we directly access the poset in the list of posets.

<<echo=FALSE,eval=TRUE>>=
set.seed(1) ## for repeatability. Once I saw it not reach cancer.
@ 
<<>>=

m2 <- oncoSimulIndiv(examplePosets[["p1101"]], model = "McFL", 
                     numPassengers = 0, detectionDrivers = 8, 
                     mu = 5e-7, initSize = 4000, 
                     sampleEvery = 0.025,
                     finalTime = 25000, keepEvery = 5, 
                     detectionSize = 1e6) 
@ 

(Very rarely the above run will fail to reach cancer. If that
happens, execute it again.)


As usual, we will plot using both a line and a stacked plot:

<<m2x1,fig.width=6.5, fig.height=10>>=
par(mfrow = c(2, 1))
plot(m2, addtot = TRUE, log = "",
     thinData = TRUE, thinData.keep = 0.5)
plot(m2, type = "stacked",
     thinData = TRUE, thinData.keep = 0.5)
@ 

The default is to simulate progression until a simulation reaches cancer
(i.e., only simulations that satisfy the \texttt{detectionDrivers} or the
\texttt{detectionSize} condition will be returned). If you use the McFL
model with a large enough \texttt{initSize} this will often be the case,
but not if you use a very small \texttt{initSize}. Likewise, most of the
Bozic runs do not reach cancer. Let's try a few:

<<>>=
b3 <- oncoSimulIndiv(p1101, onlyCancer = FALSE)
summary(b3)

b4 <- oncoSimulIndiv(p1101, onlyCancer = FALSE)
summary(b4)
@ 

Plot those runs:

<<b3b4x1ch1, fig.width=8, fig.height=4>>=
par(mfrow = c(1, 2))
par(cex = 0.8) ## smaller font
plot(b3)
plot(b4)
@ 


\subsubsection{Simulating progression in several subjects}

To simulate the progression in a bunch of subjects (we will use only
four, so as not to fill the vignette with plots) we can do, with the same
settings as above:

<<ch2>>=
p1 <- oncoSimulPop(4, p1101, mc.cores = 2)
par(mfrow = c(2, 2))
plot(p1, ask = FALSE)
@ 

We can also use stream and stacked plots, though they might not be as
useful in this case. To keep the vignette small, the following chunks are
not evaluated.
<<p1multx1,eval=FALSE>>=
par(mfrow = c(2, 2))
plot(p1, type = "stream", ask = FALSE)
@

<<p1multstx1,eval=FALSE>>=
par(mfrow = c(2, 2))
plot(p1, type = "stacked", ask = FALSE)
@


\subsection{Sampling from a set of simulated subjects}\label{sample-1}
\label{sec:sampling-from-set}

You will often want to do something with the simulated data, for
instance, sample it. Here we will obtain the trajectories for 100
subjects in a scenario without passengers. Then we will sample with the
default options and store that as a vector of genotypes (or a matrix of
subjects by genes):


<<>>=

m1 <- oncoSimulPop(100, examplePosets[["p1101"]], 
                   numPassengers = 0, mc.cores = 2)

@ 

The function \Rfunction{samplePop} samples that object, and also gives you
some information about the output:

<<>>=
genotypes <- samplePop(m1)
@ 


What can you do with it? That is up to you. As an example, let us try to
infer an oncogenetic tree (and plot it) using the \CRANpkg{Oncotree}
package \cite{Oncotree} after getting a quick look at the marginal
frequencies of events:

<<fxot1,fig.width=4, fig.height=4>>=
colSums(genotypes)/nrow(genotypes)

require(Oncotree)
ot1 <- oncotree.fit(genotypes)
plot(ot1)
@ 

Your run will likely differ from mine, but with the defaults (detection
size of $10^8$) it is likely that events down the tree will never
appear. You can set \texttt{detectionSize = 1e9} and you will see that
events down the tree are now found in the cross-sectional sample.
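
A sketch of that change is shown next. It repeats the call used for
\texttt{m1} with \texttt{detectionSize = 1e9}; the object names
\texttt{m1b} and \texttt{genotypesB} are introduced only for this
illustration, and the chunk is not evaluated to keep the vignette small.

<<eval=FALSE>>=
## Not evaluated: same call as for m1, with a larger detectionSize
m1b <- oncoSimulPop(100, examplePosets[["p1101"]],
                    numPassengers = 0, detectionSize = 1e9,
                    mc.cores = 2)
genotypesB <- samplePop(m1b)
## Marginal frequencies of events, as before
colSums(genotypesB)/nrow(genotypesB)
@ 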


Alternatively, you can use single cell sampling and that, sometimes,
recovers one or a couple more events.

<<fxot2,fig.width=4, fig.height=4>>=
genotypesSC <- samplePop(m1, typeSample = "single")
colSums(genotypesSC)/nrow(genotypesSC)

ot2 <- oncotree.fit(genotypesSC)
plot(ot2)
@ 

You can of course rename the columns of the output matrix, so that the
names of the nodes in the plots are more meaningful.
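
As an illustration of such renaming, here is a minimal sketch in base R.
It uses a small, invented 0/1 matrix in place of the output of
\Rfunction{samplePop}; the column labels and the gene names are
hypothetical and chosen only for the example.

<<>>=
## Toy subjects-by-genes matrix (invented data)
toyGenotypes <- matrix(c(1, 0, 1,
                         1, 1, 0,
                         0, 0, 1,
                         1, 1, 1),
                       ncol = 3, byrow = TRUE)
colnames(toyGenotypes) <- c("1", "2", "3")

## Hypothetical mapping from column labels to gene names
geneNames <- c("1" = "KRAS", "2" = "TP53", "3" = "SMAD4")
colnames(toyGenotypes) <-
    unname(geneNames[colnames(toyGenotypes)])

## Marginal frequencies, now labelled with the new names
colMeans(toyGenotypes)
@ 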


\section{Generating random DAGs for restrictions}\label{simo}

You might want to randomly generate DAGs like those often found in the
literature on oncogenetic trees and related models. Function
\Rfunction{simOGraph} might help here. 

<<>>=
## No seed fixed, so reruns will give different DAGs.
(a1 <- simOGraph(10))
library(graph) ## for simple plotting
plot(as(a1, "graphNEL"))
@ 

Once you obtain the adjacency matrices, it is for now up to you to convert
them into appropriate posets or fitnessEffects objects.
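
As a starting point for that conversion, the following sketch lists the
edges of an adjacency matrix as a data frame with columns
\texttt{parent}, \texttt{child}, \texttt{s}, \texttt{sh} and
\texttt{typeDep}, the layout used for the \texttt{rT} argument of
\Rfunction{allFitnessEffects}. The helper \texttt{adjmatToEdges} is
hypothetical, the toy matrix (with a \texttt{Root} row and column) is an
assumption about the shape of the input, and the values of \texttt{s},
\texttt{sh} and \texttt{typeDep} are placeholders that you would need to
choose yourself.

<<>>=
## Hypothetical helper, for illustration only
adjmatToEdges <- function(adj, s = 0.1, sh = -0.9,
                          typeDep = "MN") {
    ## Positions of nonzero entries: one row per edge
    idx <- which(adj != 0, arr.ind = TRUE)
    n <- nrow(idx)
    data.frame(parent = rownames(adj)[idx[, "row"]],
               child = colnames(adj)[idx[, "col"]],
               s = rep(s, n), sh = rep(sh, n),
               typeDep = rep(typeDep, n),
               stringsAsFactors = FALSE)
}

## Toy DAG: rows are parents, columns are children
dagNodes <- c("Root", "1", "2", "3")
toyDag <- matrix(0L, nrow = 4, ncol = 4,
                 dimnames = list(dagNodes, dagNodes))
toyDag["Root", c("1", "2")] <- 1L
toyDag[c("1", "2"), "3"] <- 1L
toyEdges <- adjmatToEdges(toyDag)
toyEdges
@ 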


Why this function? I searched for, and could not find any that did what I
wanted, in particular bounding the number of parents, being able to
specify the approximate depth\footnote{Where depth is defined in the usual
  way to mean smallest number of nodes ---or edges--- to traverse to get
  from the bottom to the top of the DAG.} of the graph, and optionally
being able to have DAGs where no node is connected to another both
directly (an edge between the two) and indirectly (there is a path between
the two through other nodes). So I wrote my own code. The code is fairly
simple to understand (all in file \texttt{generate-random-trees.R}). I
would not be surprised if this way of generating random graphs has been
proposed and named before; please let me know, ideally with a reference.


Should we remove direct connections if there are indirect ones? That is,
should we set \texttt{removeDirectIndirect = TRUE}? Except for
\cite{FarahaniLagergren2013}, none of the DAGs I've seen in the context
of CBNs, oncogenetic trees, etc., include both direct and indirect
connections between nodes. If these exist, reasoning about the model can
be harder. For example, with CBN (AND or CMPN or monotone relationships)
adding a direct connection makes no difference iff we assume that the
relationships encoded in the DAG are fully respected (e.g., all
$s_h = -\infty$). But it can make a difference if we allow for deviations
from monotonicity, especially if we check only whether the immediate
ancestors are present. And
things get even trickier if we combine XOR with AND. The code for
computing fitness, however, should deal with all of this just fine.
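
To check whether a given DAG contains such redundant direct connections,
one option is to compare the adjacency matrix with the set of node pairs
joined by longer paths. The sketch below does this in base R with boolean
matrix products; \texttt{directAndIndirect} is a hypothetical helper
written for this illustration, and it is not meant to describe how
\Rfunction{simOGraph} implements \texttt{removeDirectIndirect}.

<<>>=
## Hypothetical helper: edges i -> j for which a longer path
## from i to j also exists. Assumes adj is the adjacency matrix
## of a DAG (rows = parents, columns = children).
directAndIndirect <- function(adj) {
    adj <- (adj != 0) * 1
    n <- nrow(adj)
    longer <- matrix(0, n, n)  # pairs joined by >= 2 edges
    walk <- adj
    for (k in seq_len(max(n - 2, 0))) {
        ## walk now marks pairs joined by exactly k + 1 edges
        walk <- (walk %*% adj > 0) * 1
        longer <- longer + walk
    }
    idx <- which(adj == 1 & longer > 0, arr.ind = TRUE)
    data.frame(parent = rownames(adj)[idx[, "row"]],
               child = colnames(adj)[idx[, "col"]],
               stringsAsFactors = FALSE)
}

## Toy DAG: Root -> 1 -> 2, plus a direct edge Root -> 2
nodes3 <- c("Root", "1", "2")
toy3 <- matrix(0, 3, 3, dimnames = list(nodes3, nodes3))
toy3["Root", "1"] <- 1
toy3["1", "2"] <- 1
toy3["Root", "2"] <- 1
redundant <- directAndIndirect(toy3)
redundant
@ 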


\section{Session info and packages used}

This is the information about the version of R and packages used:
<<>>=
sessionInfo()
@ 

%\newpage
%%\bibliographystyle{apalike} %% or agsm or natbib? or apalike; maybe agsm
%% does the URL without turning into note?

%\bibliographystyle{apalike} %% or agsm or natbib? or apalike; maybe agsm
\bibliography{OncoSimulR}

\end{document}


%% remember to use bibexport to keep just the minimal bib needed
%% bibexport -o extracted.bib OncoSimulR.aux
%% rm OncoSimulR.bib
%% mv extracted.bib OncoSimulR.bib
%% and then turn URL of packages into notes

%%% Local Variables:
%%% TeX-master: t
%%% ispell-local-dictionary: "en_US"
%%% coding: iso-8859-15
%%% End: