---
title: Treefit - Getting started
vignette: >
  %\VignetteIndexEntry{Treefit - Getting started}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output: rmarkdown::html_vignette
---
This document is a user-friendly manual on how to use the Treefit
package, which is the first toolkit for quantitative trajectory
inference using single-cell gene expression data. In this tutorial, we
demonstrate how to generate and analyze two kinds of toy data with the
aim to help get familiar with the practical workflow of Treefit. After
learning some basics, we will use Treefit to perform more biologically
interesting analysis in the next tutorial
(`vignette("working-with-seurat")`).
## 1. Generating toy data
While the Treefit package has been developed to help biologists who
wish to perform trajectory inference from single-cell gene expression
data, Treefit can also be viewed as a toolkit to generate and analyze
a point cloud in d-dimensional Euclidean space (i.e.,
simulated gene expression data).
Treefit provides some useful functions to generate artificial
datasets. For example, as we will now demonstrate, the function
`treefit::generate_2d_n_arms_star_data()` creates data that
approximately fit a star tree with the number of arms or branches; the
term star means a tree that looks like the star symbol "\*"; for
example, the alphabet letters Y and X can be viewed as star trees that
have three and four arms, respectively.
The rows and columns of the generated data correspond to data points
(i.e., n single cells) and their features (i.e.,
expression values of d different genes), respectively. The
Treefit package can be used to analyze both raw count data and
normalized expression data, but regarding the production of toy data,
it is meant to be used to generate continuous data like normalized
expression values.
Importantly, we can generate data with a desired level of noise by
changing the value of the `fatness` parameter of this function. For
example, if you set the `fatness` parameter to `0.0` then you will get
precisely tree-like data without noise. By contrast, setting the
`fatness` to `1.0` gives very noisy data that are no longer
tree-like. In this tutorial, we deal with two types of toy data whose
`fatness` values are `0.1` and `0.8`, respectively. We note that
Treefit can be used to generate and analyze high dimensional datasets
but we focus on generating 2-dimensional data to make things as simple
as possible in this introductory tutorial.
### 1.1. Tree-like toy data
Let us first generate 2-dimensional tree-like data that contain 500
data points and reasonably fit a star tree with three arms. We can
create such data and draw a scatter plot of them (Figure 1) simply by
executing the following two lines of code:
```{r generate-tree-like-data, fig.asp=1, fig.cap="Figure 1. A scatter plot of tree-like toy data generated by Treefit"}
star.tree_like <- treefit::generate_2d_n_arms_star_data(500, 3, 0.1)
plot(star.tree_like)
```
### 1.2. Less tree-like, noisy toy data
Similarly, we can generate noisy data that do not look so tree-like as
the previous one. In this tutorial, we simply change the value of the
`fatness` parameter from `0.1` to `0.8` in order to obtain
non-tree-like data. The following two lines of code yields a scatter
plot shown in Figure 2:
```{r generate-less-tree-like-data, fig.asp=1, fig.cap="Figure 2. A scatter plot of noisy toy data generated by Treefit"}
star.less_tree_like <- treefit::generate_2d_n_arms_star_data(500, 3, 0.8)
plot(star.less_tree_like)
```
## 2. Analyzing the data
Having generated two datasets whose tree-likeness are different, we
can analyze each of them. The Treefit package allows us to estimate
how well the data can be explained by tree models and to predict how
many "principal paths" there are in the best-fit tree. As shown in
Figure 1, the points in the first data clearly form a shape like the
letter "Y" and so the data are considered to fit a star tree with
three arms very well, whereas Figure 2 indicates that the second data
are no longer very tree-like because of the high level of added
noise. Our goal in this tutorial is to reproduce these conclusions by
using Treefit.
### 2.1. Tree-like toy data
Let us estimate the goodness-of-fit between tree models and the first
toy data. This can be done by using `treefit::treefit()` as
follows. The `name` parameter is optional but we should specify
it, if possible. Because it's useful to identify the estimation.
```{r estimate-tree-like-data}
fit.tree_like <- treefit::treefit(list(expression=star.tree_like),
                                  name="tree-like")
# Save the analytsis result to use other tutorials.
saveRDS(fit.tree_like, "fit.tree_like.rds")
fit.tree_like
```
`treefit::treefit()` returns a `treefit` object that summarizes the
analysis of Treefit. We will explain how to interpret the results in
the next section. For now, we may focus on learning how to use
Treefit.
As we will see later, it is helpful to visualize the results using
`plot()`. By executing `plot(fit.tree_like)`, we can obtain the
following two user-friendly visual plots, which makes it easier to
interpret the results of the Treefit analysis.
```{r plot-tree-like-data-estimation-result, fig.cap="Figure 3. The output for the tree-like data shown in Figure 1"}
plot(fit.tree_like)
```
### 2.2. Less tree-like, noisy toy data
We can analyze the second toy data in the same manner.
```{r estimate-less-tree-like-data}
fit.less_tree_like <- treefit::treefit(list(expression=star.less_tree_like),
                                       name="less-tree-like")
# Save the analytsis result to use other tutorials.
saveRDS(fit.less_tree_like, "fit.less_tree_like.rds")
fit.less_tree_like
```
```{r plot-less-tree-like-data-estimation-result, fig.cap="Figure 4. The output for the noisy data shown in Figure 2"}
plot(fit.less_tree_like)
```
### 2.3. Comparing two results
We can compare different results by passing all results to `plot()` as
follows:
```{r plot-estimation-results, fig.cap="Figure 5. Comparison of the plots shown in Figure 3 and Figure 4."}
plot(fit.tree_like, fit.less_tree_like)
```
## 3. Interpreting the results
Before interpreting the previous results, we briefly summarize the
process of the Treefit analysis that consists of the following three
steps.
  1. First, Treefit repeatedly "perturbs" the input data (i.e.,
     adds some small noise to the original row count data or
     normalized expression data) in order to produce many slightly
     different datasets that may have been acquired in the biological
     experiment.
  2. Second, for each dataset, Treefit calculates a distance matrix
     that represents the dissimilarities between sample cells and then
     constructs a tree from each distance matrix. The current version
     of Treefit computes a minimum spanning tree (MST) that has been
     widely used for trajectory inference.
  3. Finally, Treefit evaluates the goodness-of-fit between the data
     and tree models. The underlying idea of this method is that the
     structure of trees inferred from tree-like data tends to have
     high robustness to noise, compared to non-tree-like
     data. Therefore, Treefit measures the mutual similarity between
     estimated trees in order to check the stability of the tree
     structures. To this end, Treefit constructs a
     p-dimensional subspace that extracts the main features of
     each tree structure and then measuring mutual similarities
     between the subspaces by using a special type of metrics called
     the Grassmann distance. In principle, when the estimated trees
     are mutually similar in their structure, the mean and standard
     deviation (SD) of the Grassmann distance are small.
Although the word "Grassmann distance" may sound so unfamiliar to some
readers, the concept appears in different disguises in various
practical contexts. For example, the Grassmann distance has a close
connection to canonical correlation analysis (CCA). Treefit provides
two Grassmann distances `$max_cca_distance` and `$rms_cca_distance`
that can be used for different purposes as we now explain.
### 3.1. Estimating the tree-likeness of data
The Treefit analysis using the first Grassmann distance
`$max_cca_distance` (shown in the left panel of Figure 5) tells us the
goodness-of-fit between data and tree models. In principle, as
mentioned earlier, if the mean and SD of `$max_cca_distance` are
small, then this means that the estimated trees are mutually similar
in their structure. As can be observed, the distance changes according
to the dimensionality p of the feature space, but
`$max_cca_distance` has the property that the value decreases
monotonically as p increases for any datasets.
Comparing the Treefit results for the two datasets, we see that the
mean Grassmann distance for the first data does not fall below the
second one regardless of the value of p and that the SD of the
Grassmann distance for the first data is very small compared to the
second data. These results imply that the estimated tree structures
are very robust to noise in the first case but not in the second
case. Thus, Treefit has verified that the first data are highly
tree-like while the second data are not.
### 3.2. Predicting the number of principal paths in the underlying tree
The Treefit analysis using the other Grassmann distance
`$rms_cca_distance` (shown in the right panel of Figure 5) is useful to
infer the number of "principal paths" in the best-fit tree. From a
biological perspective, this analysis can be used to discover a novel
or unexpected cell type from single-cell gene expression (for details,
see `vignette("working-with-seurat")`).
Unlike the previous Grassmann distance, the mean value of
`$rms_cca_distance` can fluctuate depending on the value of
p. Interestingly, we can predict the number of principal paths
in the best-fit tree by exploring for which p the distance
value reaches "the bottom of a valley" (i.e., attains a local
minimum). More precisely, when `$rms_cca_distance` attains a local
minimum at a certain p, the value p+1 indicates the
number of principal paths in the best-fit
tree. `$n_principal_paths_candidates` has these p+1 values. We
don't need to calculate them manually. When Treefit produces a plot
having more than one valleys, the smallest p is usually most
informative for the prediction. The smallest p+1 value can be
obtained by `$n_principal_paths_candidates[1]`.
Comparing the Treefit results for the two datasets, we first see that
both plots attains a local minimum at p=2. This means that for
both datasets the best-fit tree has p+1=3 principle paths,
which is correct because both were generated from the same star tree
with three arms. Another important point to be made is that the SD of
the Grassmann distance for the first data is very small at p=2
compared to that for the second data; in other words, Treefit made
this prediction more confidently for the first dataset than for the
second one. This result is reasonable because the first dataset is
much less noisy than the second one. Thus, Treefit has correctly
determined the number of principal paths in the underlying tree
together with the goodness-of-fit for each dataset.