---
title: "Runtime Comparison of Critical Point Pairing Algorithms"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Runtime Comparison of Critical Pairing Algorithms}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{dplyr,tidyr,scales,ggplot2,bench}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette compares the runtimes and memory allocations of the multi-pass and single-pass algorithms for pairing critical points. Several tidyverse packages are used to post-process the benchmark results:

```{r setup, message=FALSE}
library(rgph)
library(dplyr)
library(tidyr)
library(ggplot2)
```

## The running example

To illustrate, compare the results of the two algorithms on the running example from Tu et al. (2018):

```{r example calculation}
ex_file <- system.file("extdata", "running_example.txt", package = "rgph")
ex_reeb <- read_reeb_graph(ex_file)
(ex_multi <- reeb_graph_pairs(ex_reeb, method = "multi_pass"))
(ex_single <- reeb_graph_pairs(ex_reeb, method = "single_pass"))
all.equal(ex_multi, ex_single)
```

We expect the resulting data frames to be equivalent, but the outputs carry `method` and `elapsed_time` attributes that necessarily differ. If we ignore attributes, then the results agree:

```{r example validation}
all.equal(ex_multi, ex_single, check.attributes = FALSE)
```

For this reason, we omit the default check in the benchmarking run:

```{r example benchmark}
bench::mark(
  multi = reeb_graph_pairs(ex_reeb, method = "multi_pass"),
  single = reeb_graph_pairs(ex_reeb, method = "single_pass"),
  check = FALSE
)
```

The R bindings impose equivalent overhead on both methods: while total runtimes will be higher than when comparing the Java programs directly, the differences between them should be the same. Indeed, the allocated memory is exactly the same.
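The attribute-insensitive comparison above can also be reproduced by hand. The helper below is an illustration only, not part of rgph: it strips the two attributes that record how, and how long, each result was computed, after which a strict comparison can be made.

```{r example attribute check}
# illustrative helper (not part of rgph): drop the attributes that
# record the pairing method and its elapsed time
strip_pairing_attrs <- function(x) {
  attr(x, "method") <- NULL
  attr(x, "elapsed_time") <- NULL
  x
}
identical(strip_pairing_attrs(ex_multi), strip_pairing_attrs(ex_single))
```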
However, the single-pass algorithm---Propagate and Pair---is only a marginal improvement over the multi-pass merge pairing algorithm.

## A more complex case

The exported object `flower` is a `reeb_graph` constructed from the flower mesh, one of 12 from the `AIM@SHAPE` collection used by Tu et al. to benchmark algorithms. Among them, it contains the most critical points, so it will best illustrate the difference between the algorithms in computationally intensive settings.

```{r flower data}
# print only a few edges for illustration
print(flower, n = 4)
```

However, this Reeb graph contains isolated vertices, which contribute no positive-persistence features to the output but must be dropped before calculation, lest they seize up the algorithms:

```{r flower isolates}
# print only a few pairs for illustration
print(reeb_graph_pairs(flower, method = "multi_pass"), n = 4)
```

So that the benchmark results better reflect the algorithms' requirements, we manually drop these points first:

```{r flower benchmark}
flower <- rgph:::drop_reeb_graph_points(flower)
bench::mark(
  multi = reeb_graph_pairs(flower, method = "multi_pass"),
  single = reeb_graph_pairs(flower, method = "single_pass"),
  check = FALSE
)
```

For this Reeb graph, Propagate and Pair outperforms the multi-pass algorithm by roughly a factor of 3.

## How performance and improvement scale

Finally, we compare how both algorithms scale. Vertex and edge data are provided for three random merge tree Reeb graphs on an exponential size scale. These are converted to split trees by negating the vertex values and benchmarked separately before the results are stacked. (Note that the single-pass algorithm was found superior to the multi-pass algorithm on split trees but inferior on merge trees.)
```{r random (split) trees benchmark}
# collect split tree Reeb graphs
tree_files <- vapply(
  c(
    `10` = "10_tree_iterations.txt",
    `100` = "100_tree_iterations.txt",
    `1000` = "1000_tree_iterations.txt"
  ),
  function(f) system.file("extdata", f, package = "rgph"),
  ""
)
tree_reebs <- lapply(tree_files, read_reeb_graph)
tree_reebs <- lapply(tree_reebs, function(rg) {
  rg$values <- -rg$values
  rg
})
# aggregate benchmark comparisons
tree_bench <- tibble()
for (i in seq_along(tree_reebs)) {
  bm <- bench::mark(
    multi = reeb_graph_pairs(tree_reebs[[i]], method = "multi_pass"),
    single = reeb_graph_pairs(tree_reebs[[i]], method = "single_pass"),
    check = FALSE
  )
  bm <- transmute(
    bm,
    method = as.character(expression),
    n_itr, time, memory
  )
  bm <- relocate(mutate(bm, size = as.integer(names(tree_files)[[i]])), size)
  tree_bench <- bind_rows(tree_bench, bm)
}
```

By plotting all of the run times from each of the benchmarks, we can visualize both the differences in runtimes between the algorithms and the variability in those runtimes.

```{r random (split) trees plot, fig.width=8, fig.height=4, out.width="100%"}
# plot runtime results
tree_bench %>%
  select(size, method, time) %>%
  unnest(time) %>%
  ggplot(aes(x = as.factor(size), y = time * 1e3)) +
  geom_boxplot(aes(color = method, shape = method)) +
  scale_y_continuous(
    transform = "log1p",
    labels = scales::label_number(suffix = "ms")
  ) +
  labs(x = "tree size", y = "run time")
```

The single-pass advantage appears to grow with size. The ratio of medians quantifies this:

```{r random (split) trees medians}
tree_bench %>%
  transmute(size, method, median = vapply(time, median, 0.)) %>%
  pivot_wider(id_cols = size, names_from = method, values_from = median) %>%
  transmute(size, ratio = single / multi)
```
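To put a rough number on how each algorithm scales, one could also fit a log--log trend of median runtime against tree size. This is a sketch, not part of the benchmark protocol above, and with only three sizes the slope is at best a crude estimate of the empirical complexity exponent:

```{r random (split) trees slopes}
# crude scaling estimate: the slope of log(median runtime) against
# log(size) approximates each method's empirical complexity exponent;
# the gap between slopes indicates how the single-pass advantage grows
tree_bench %>%
  transmute(size, method, median = vapply(time, median, 0.)) %>%
  group_by(method) %>%
  summarise(slope = coef(lm(log(median) ~ log(size)))[[2]])
```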