---
title: "HVT: Collection of functions used to build hierarchical topology preserving maps"
author: "Zubin Dowlaty, Shubhra Prakash, Sangeet Moy Das, Shantanu Vaidya, Praditi Shah, Srinivasan Sudarsanam, Somya Shambhawi, Chepuri Gopi Krishna, Siddharth Shorya, PonAnureka Seenivasan, Vishwavani Ravichandran, Bidesh Ghosh, Alimpan Dey"
date: "Created Date: 2018-05-17
  Modified Date: `r Sys.Date()`"
fig.height: 4
fig.width: 15
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{HVT: Collection of functions used to build hierarchical topology preserving maps}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
tangle: false
---

```{css, echo=FALSE}
/* CSS for floating TOC on the left side */
#TOC {
  /* float: left; */
  position: fixed;
  margin-left: -22vw;
  width: 18vw;
  height: fit-content;
  overflow-y: auto;
  padding-top: 20px;
  padding-bottom: 20px;
  background-color: #f9f9f9;
  border-right: 1px solid #ddd;
  margin-top: -14.5em;
}

.main-container {
  margin-left: 220px; /* Adjust this value to match the width of the TOC + some margin */
}

body {
  max-width: 1200px;
  width: 53%;
}

p {
  text-align: justify;
}

li {
  padding-bottom: 5px;
}

.caption {
  text-align: center;
}
```

```{r setup, warning = FALSE, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  out.width = "auto",
  out.height = "480px",
  fig.width = 7,
  fig.height = 5,
  fig.align = "center",
  fig.retina = 1,
  dpi = 150,
  eval = TRUE,
  tangle = FALSE
)
```

# 1. Abstract

The HVT package offers a suite of R functions designed to construct topology preserving maps for in-depth analysis of multivariate data. It is particularly well-suited for datasets with numerous records. The package organizes the typical workflow into several key stages:

1. **Data Compression**: Long datasets are compressed using Hierarchical Vector Quantization (HVQ) to achieve the desired level of data reduction.

2. **Data Projection**: Compressed cells are projected into one and two dimensions using dimensionality reduction algorithms, producing embeddings that preserve the original topology. This allows for intuitive visualization of complex data structures.

3. **Tessellation**: Voronoi tessellation partitions the projected space into distinct cells, supporting hierarchical visualizations.
Heatmaps and interactive plots facilitate exploration and insights into the underlying data patterns.

4. **Scoring**: Test datasets are evaluated against previously generated maps, enabling their placement within the existing structure. Sequential application across multiple maps is supported if required.

5. **Temporal Analysis and Visualization**: Functions in this stage examine time-series data to identify patterns, estimate transition probabilities, and visualize data flow over time.

6. **Dynamic Forecasting**: Monte Carlo simulations of Markov chains provide forecasting capabilities for both ex-post and ex-ante scenarios, handling problematic states when they are found.

The HVT package allows creation of visually stunning tessellations, showcasing the power of topology preserving maps. Below is an image depicting a tessellation of a torus; see the **vignette** for more details.

```{r predictlayer_flow,echo=FALSE,warning=FALSE,fig.show='hold',message=FALSE,fig.cap='Figure 1: The Voronoi tessellation for layer 1 and number of cells 500 with the heat map overlaid for variable z.'}
knitr::include_graphics('./pngs/torus2.png')
```

# 2. Vignettes

Following are the vignettes for the HVT package:

```{r,eval=TRUE, echo=FALSE}
library(knitr)
library(kableExtra)

# Create a data frame with one row per vignette
vignette_data <- data.frame(
  Version_Number = c(
    "v18.05.17", "v18.05.17", "v23.05.16", "v23.10.26",
    "v24.05.16", "v24.08.14", "v25.03.01", "v25.08.25"
  ),
  Vignette_Title = c(
    "HVT Vignette",
    "HVT Model Diagnostics Vignette",
    "HVT Scoring Cells with Layers using scoreLayeredHVT",
    "Temporal Analysis and Visualization: Leveraging Time Series Capabilities in HVT",
    "Visualizing LLM Embeddings using HVT",
    "Implementation of t-SNE and UMAP in trainHVT function",
    "Dynamic Forecasting of Macroeconomic Time Series Dataset using HVT",
    "Hyperparameter Experimentation for Champion Model Selection in MSM Dynamic Forecasting"
  ),
  Description = c(
    "Contains the workflow of the functions used for vector quantization and construction of Hierarchical Voronoi Tessellations for data analysis.",
    "Contains demonstrations of functions used to perform model diagnostics and validation for the trained HVT model.",
    "Contains explanations of the functions used for scoring cells with layers based on a sequence of maps using scoreLayeredHVT.",
    "Contains implementations of the functions used for analyzing time series data and creating its state transition flow maps.",
    "Contains implementation and analysis of hierarchical clustering using functions to evaluate and visualize token embeddings generated by OpenAI in 2D Space.",
    "Contains enhancements to the `trainHVT` function with advanced dimensionality reduction techniques such as t-SNE and UMAP, and includes a table of evaluation metrics to improve interpretability.",
    "Contains enhancements to the HVT package for dynamic forecasting using Monte Carlo Simulations of Markov Chain (MSM) on macroeconomic time series dataset.",
    "Contains enhancements to enable strategic selection of the champion model based on the lowest Mean Absolute Error by hyperparameter tuning in msm - dynamic forecasting."
  ),
  stringsAsFactors = FALSE
)

# Generate the table
kable(vignette_data,
      format = "html",
      col.names = c("Version", "Vignette Title", "Description"),
      align = "l",
      escape = FALSE) %>%
  kable_styling(full_width = TRUE) %>%
  column_spec(1, width = "15%")
```

# 3. Version History

## HVT (v25.2.6) - What's New

14th October, 2025

In this version of the HVT package, the following new feature and vignette have been introduced:

**Feature**

1. **Experimentation of hyperparameters in `msm`**: This update introduces a new function called `HVTMSMoptimization` that runs grid search experiments across different hyperparameters (number of cells, clusters (k), nearest neighbors (nn)) by training and scoring HVT models and running MSM simulations for each combination. It returns tabulated results and plotly visualizations that highlight the champion model (i.e., the combination with the lowest MAE).

**Vignette**

1. **Hyperparameter Experimentation for Champion Model Selection in MSM Dynamic Forecasting**: This vignette provides a comprehensive demonstration of using `HVTMSMoptimization`, covering the complete workflow: initial dataset handling, train and test selection, hyperparameter tuning, identifying and implementing the champion model, and comparing MAE results.

*The issue with time-series animation plots from the previous release has now been resolved with the latest gganimate update.*

## HVT (v25.2.5)

04th July, 2025

*The time-series animation plots have been dropped from the package because the latest version of gganimate doesn't support them. A patched release will follow once the issue is resolved.*

## HVT (v25.2.4)

04th June, 2025

In this version of the HVT package, the following new features and vignette have been introduced:

**Features**

1. **Dynamic Forecasting of a Time Series Dataset**: This update introduces a new function called `msm` (Monte Carlo Simulations of Markov Chain) for dynamic forecasting of states in a time series dataset.
It supports both ex-post and ex-ante forecasting, offering insights into future trends while resolving state transition challenges through clustering and nearest-neighbor methods to enhance simulation accuracy.

2. **Z score Plots**: This update introduces a new function called `plotZscore` that generates Z-score plots corresponding to the HVT cells for the given data, offering a visual representation of data distribution and highlighting potential outliers.

**Vignette**

1. **Dynamic Forecasting of Macroeconomic Time Series Dataset using HVT**: This vignette illustrates the practical use of the new `msm` function on a macroeconomic dataset with 10 variables. It covers all steps, including data preparation, model training, scoring, and forecasting, while addressing challenges related to state transitions and evaluating performance using Mean Absolute Error (MAE).

## HVT (v24.9.1)

4th September, 2024

In this version of the HVT package, the following new features and vignettes have been introduced:

**Features**

1. **Implementation of t-SNE and UMAP in `trainHVT`**: This update incorporates dimensionality reduction methods like t-SNE and UMAP in the `trainHVT` function, complementing the existing Sammon's projection. It also enables the visualization of these techniques across all hierarchical levels within the HVT framework.

2. **Implementation of dimensionality reduction evaluation metrics**: This update adds dimensionality reduction evaluation metrics to the output list of the `trainHVT` function. These metrics are organized into two levels: Level 1 (L1) and Level 2 (L2). The L1 metrics cover the key areas of dimensionality reduction listed below:

    - Structure Preservation Metrics
    - Distance Preservation Metrics
    - Human Centered Metrics
    - Interpretive Quality Metrics
    - Computational Efficiency Metrics

3. **Introduction of `clustHVT` function**: This update introduces a new function called `clustHVT`, designed for hierarchical clustering analysis. The function clusters cells only when the hierarchy level is set to 1 and is not applicable when the hierarchy level is greater than 1. It determines the optimal number of clusters by evaluating various indices and, based on user input, conducts hierarchical clustering using AGNES with the default ward.D2 method. The output includes a dendrogram and an interactive 2D clustered HVT map that reveals cell context upon hovering.

**Vignettes**

1. **Implementation of t-SNE and UMAP in `trainHVT` function**: This vignette showcases the integration of t-SNE and UMAP in the `trainHVT` function, offering a comprehensive guide on how to apply and visualize these dimensionality reduction techniques. It also covers the dimensionality reduction evaluation metrics and provides insights into their interpretation.

2. **Visualizing LLM Embeddings using HVT (Hierarchical Voronoi Tessellation)**: This vignette outlines the process of analyzing OpenAI-generated token embeddings using the HVT package, covering data compression, visualization, and hierarchical clustering, as well as comparing domain name assignments for clusters. It examines HVT's effectiveness in preserving contextual relationships between embeddings. Additionally, it provides a brief overview of the newly added `clustHVT` function and its parameters.

## HVT (v24.5.2)

2nd May, 2024

In this version of the HVT package, the following new features have been introduced:

1. **Updated Nomenclature:** To make the function names more consistent and intuitive, we have renamed the functions throughout the package. Given below are a few instances.

    * `HVT` to `trainHVT`
    * `predictHVT` to `scoreHVT`
    * `predictLayerHVT` to `scoreLayeredHVT`

2. **Restructured Functions:** The functions have been rearranged and grouped into new sections, which are highlighted on the index page of the package's PDF documentation. Given below are a few instances.

    * `trainHVT` function now resides within the `Training_or_Compression` section.
    * `plotHVT` function now resides within the `Tessellation_and_Heatmap` section.
    * `scoreHVT` function now resides within the `Scoring` section.

3. **Enhancements:** The pre-existing functions, `hvtHmap` and `exploded_hmap`, have been combined and incorporated into the `plotHVT` function. Additionally, `plotHVT` now includes the ability to perform 1D plotting.

4. **Temporal Analysis**

    - This update integrates time series capabilities into the HVT package by extending its foundational operations to time series data, as demonstrated in the temporal analysis vignette.
    - New functionality analyzes underlying patterns and trends within the data, providing insights into its evolution over time. It also tracks the movement of the data by calculating its transition probabilities, and creates plots and GIFs.

    Below are the new functions and their brief descriptions:

    - `plotStateTransition`: Provides the time series flowmap plot.
    - `getTransitionProbability`: Provides a list of transition probabilities.
    - `reconcileTransitionProbability`: Provides plots and tables for comparing transition probabilities calculated manually and from the markovchain function.
    - `plotAnimatedFlowmap`: Creates flowmaps and animations for both self state and without self state scenarios.

## HVT (v23.11.02) {-}

17th November, 2023

This version of the HVT package offers functionality to score cells with layers, based on a sequence of maps, using `scoreLayeredHVT`. Given below are the steps to create the successive set of maps.

1. **Map A** - The output of the `trainHVT` function, trained on the parent data.

2. **Map B** - The output of the `trainHVT` function, trained on the 'data with novelty' created by the `removeNovelty` function.

3. **Map C** - The output of the `trainHVT` function, trained on the 'data without novelty' created by the `removeNovelty` function.

The `scoreLayeredHVT` function uses these three maps to score the test datapoints.

The diagram below illustrates these steps:

```{r mlayer_flow,echo=FALSE,warning=FALSE,fig.show='hold',message=FALSE,fig.cap='Figure 2: Data Segregation for scoring based on a sequence of maps using scoreLayeredHVT()'}
knitr::include_graphics('./pngs/scoreLayeredHVT_function.png')
```

## HVT (v22.12.06) {-}

06th December, 2022

This version of the HVT package offers features for both training an HVT model and eliminating outlier cells from the trained model.

1. **Training or Compression:** The initial step entails training the parent data using the `trainHVT` function, specifying the desired compression percentage and quantization error.

2. **Remove novelty cells:** Following the training process, outlier cells can be identified manually from the 2D HVT plot. These outlier cells can then be passed to the `removeNovelty` function, which produces two datasets in its output: one containing 'data with novelty' and the other containing 'data without novelty'.

# 4. Installation of HVT (v25.2.6)

**CRAN Installation**

``` r
install.packages("HVT")
```

**GitHub Installation**

``` r
library(devtools)
devtools::install_github(repo = "Mu-Sigma/HVT")
```
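
**Verifying the Installation**

After either installation route, a quick check can confirm that the package loads and that the functions named in the version history are available. This is a minimal sketch, not part of the package documentation: it assumes the installation above succeeded, uses only base R and `utils` calls, and the three function names checked are taken from the text above.

``` r
# Load the package; this stops with an error if HVT is not installed
library(HVT)

# Report the installed version (this document describes v25.2.6)
packageVersion("HVT")

# Check that a few functions named in the version history are exported
# (assumption: these names are unchanged in the installed version)
c("trainHVT", "scoreHVT", "plotHVT") %in% ls("package:HVT")

# List the vignettes shipped with the installed package
vignette(package = "HVT")
```

If the version reported differs from v25.2.6, the features described in the latest release notes may not be present in the installed copy.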