--- title: "treestructure applied to structured coalescent simulation" author: "Erik Volz" date: "`r Sys.Date()`" output: bookdown::html_vignette2: #rmarkdown::html_vignette #bookdown::pdf_book: toc: TRUE pkgdown: as_is: true fontsize: 12pt vignette: > %\VignetteIndexEntry{treestructure applied to structured coalescent simulation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 11, warning = FALSE, message = FALSE ) ``` # Structured coalescent simulation This example shows the function `trestruct` applied to a simulated structured coalescent tree that includes samples from a large constant size population and samples from three small 'outbreaks' which are growing exponentially. These simulations were generated with the [phydynR package](https://emvolz-phylodynamics.github.io/phydynR/). ```{r message=FALSE} library(ape) library(treestructure) library(ggtree) ``` Load the tree: ```{r} ( tree <- ape::read.tree( system.file('sim.nwk', package = 'treestructure') ) ) ``` Note that the tip labels corresponds to the deme of each sample. '1' is the constant size reservoir, and '0' is the exponentially growing deme. This will run the `treestructure` algorithm under default setting: ```{r message=FALSE} s <- trestruct( tree ) ``` You can print the results: ```{r} print(s) ``` ## Plotting results The default plotting behavior uses the `ggtree` package if available. ```{r message=FALSE} plot(s) + ggtree::geom_tiplab() ``` If not, or if desired, `ape` plots are available ```{r} plot( s, use_ggtree = FALSE ) ``` For subsequent analysis, you may want to turn the `treestructure` result into a dataframe: ```{r} structureData <- as.data.frame( s ) head( structureData ) ``` Each cluster and partition assignment is stored as a factor. You could use `split` to get a data frame for each partition. Suppose we want a tree corresponding to partition 1: ```{r} with ( structureData, ape::keep.tip(s$tree, taxon[ partition==1 ] ) ) -> partition1 partition1 plot(partition1) ``` ## Parameter choice and number of clusters Two parameters will have large influence on results: 1. `level` is the significance level for subdividing a clade into a new cluster. To detect more clusters, increase `p`, but note that this will also increase the false positive rate. 2. `minCladeSize` controls the smallest allowed cluster size in terms of the number of tips. With a smaller value, smaller clusters may be detected, but computation time will increase. Example: ```{r message=FALSE} trestruct( tree, level = .05, minCladeSize = 5 ) ``` In practice, clustering thresholds are always subjective and the best value of the `level` parameter will depend on your application. One way to choose an appropriate `level` would be to use additional data associated with each sample. You can select the `level` which gives a set of clusters that explains the most variance in the data of interest (e.g. use the cluster as a factor in an ANOVA). ### CH index Alternatively, in the absence of any additional data, the `treestructure` package supports using the [CH index](https://en.wikipedia.org/wiki/Calinski–Harabasz_index) to compare different `level`s. This statistic is based on the ratio of the between-cluster and within-cluster variance of the time of each node (distance from the root) and returns the `level` such that this ratio is maximized. If you wish to use the CH index, pass `level = NULL` to `trestruct`, and read documentation for the `levellb`, `levelub`, and `res` parameters.