--- title: "Introduction to sshist" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to sshist} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 3.5 ) ``` The `sshist` package provides a fast and robust method for selecting the optimal number of bins for histograms. It implements the algorithm developed by Shimazaki and Shinomoto (2007), which minimizes the expected L2 loss function between the histogram and the unknown underlying density function. This method is particularly superior to standard rules (like Sturges or Freedman-Diaconis) when the data exhibits: - Multimodal structures. - Time-dependent rate changes (e.g., in neuroscience). - Complex 2D distributions. ```{r setup} library(sshist) ``` ## 1. One-Dimensional Optimization Let's look at the classic "Old Faithful" geyser data. It has two distinct modes (short and long eruptions). ### Standard Approach vs. sshist Standard histograms often hide the structure if the bin width is not chosen carefully. ```{r comparision} data(faithful) x_data <- faithful$eruptions # 1. Standard R histogram (Sturges rule by default) oldpar <- par(mfrow = c(1, 2)) hist(x_data, main = "Standard hist()", xlab = "Duration of Eruptions", col = "gray90") # 2. Shimazaki-Shinomoto Optimization # The function returns an S3 object with optimal parameters res <- sshist(x_data) hist(x_data, breaks=res$edges, main=paste("Optimal Hist (N=", res$opt_n, ")"), xlab = "Duration of Eruptions", col = "gray90") par(oldpar) ``` As you can see, `sshist` automatically detects the bimodal nature and sharp peaks much better than the conservative default. ### Accessing Parameters You can extract the optimal number of bins (opt_n) or bin width (opt_d) to use in other plotting libraries like ggplot2. ```{r optimal params} print(res) # The plot method shows the Cost Function Graph and the Optimal Histogram plot(res) ``` The left plot shows the cost function graph. The red dot indicates the global minimum (optimal binning). The right plot shows the resulting histogram. ## 2. Two-Dimensional Optimization `sshist` also supports 2D histograms, solving the problem of choosing bin sizes for X and Y axes simultaneously. ```{r sshist_2d} # Run 2D optimization # This calculates the cost function for a grid of (Nx, Ny) combinations res_2d <- sshist_2d(iris$Petal.Length, iris$Petal.Width) # optimal number of bins print(res_2d) # The plot method shows the Cost Function Landscape and the Optimal Histogram plot(res_2d) ``` The left plot shows the cost function landscape. The red dot indicates the global minimum (optimal binning). The right plot shows the resulting histogram. ## Performance The core cost function calculation is implemented in C++ (via Rcpp) and parallelized with OpenMP where possible, making it efficient even for large datasets or dense grid searches in 2D. ## References - Shimazaki, H. and Shinomoto, S., 2007. A method for selecting the bin size of a time histogram. Neural Computation, 19(6), pp.1503-1527. [doi:10.1162/neco.2007.19.6.1503](https://doi.org/10.1162/neco.2007.19.6.1503) - https://www.neuralengine.org/res/histogram.html