\documentclass[nojss]{jss}
%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{sparseHessianFD}
%\VignetteKeyword{sparseHessianFD}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amsthm}
\usepackage{subcaption}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{tabularx}
\usepackage{dcolumn}
\usepackage{booktabs}
\usepackage{etoolbox}
\usepackage{placeins}
\newtoggle{tikz}
\toggletrue{tikz}
\iftoggle{tikz}{
\usepackage{tikz}
}{}
\newcolumntype{d}[1]{D{.}{.}{#1}}
%%\newcommand{\pkg}[1]{\emph{#1}}
%%\newcommand{\proglang}[1]{\textsf{#1}}
%%\newcommand{\filename}[1]{\textit{#1}}
%%\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\func}[1]{\code{#1}}
\newcommand{\class}[1]{\textsl{#1}}
\newcommand{\funcarg}[1]{\code{#1}}
\newcommand{\variable}[1]{\code{#1}}
\newcommand{\method}[1]{\func{#1}}
\newcommand{\df}[3]{\mathsf{d}^{#1}f(#2;#3)}
\newcommand{\parD}[3]{\mathsf{D}^{#1}_{#2}#3}
\newcommand{\hess}[2]{\mathsf{H}_{#1}#2}
\newcommand{\hessLT}[2]{\mathsf{L}_{#1}#2}
\newcommand{\Mat}[1]{\mathbf{#1}}
\newcommand{\Real}[1]{\mathbb{R}^{#1}}
\newcommand{\Complex}[1]{\mathbb{C}^{#1}}
\DeclareMathOperator{\Imag}{Im}
\author{Michael Braun\\Edwin L. Cox School of Business\\Southern Methodist University}
\Plainauthor{Michael Braun}
\title{\pkg{sparseHessianFD}: An R Package for Estimating Sparse Hessian Matrices}
\Plaintitle{sparseHessianFD: An R Package for Estimating Sparse Hessian Matrices}
\Shorttitle{sparseHessianFD: Sparse Hessians}
\Keywords{sparse Hessians, sparsity, computation of Hessians, graph coloring, finite differences, differentiation, complex step}
\Address{
  Michael Braun\\
  Edwin L. Cox School of Business\\
  Southern Methodist University\\
  6212 Bishop Blvd.\\
  Dallas, TX 75275\\
  E-mail: \email{braunm@smu.edu}\\
  URL: \url{https://braunm.github.io}
}
\Abstract{
Sparse Hessian matrices occur often in statistics, and their fast and accurate estimation can improve efficiency of numerical optimization and sampling algorithms. By exploiting the known sparsity pattern of a Hessian, methods in the sparseHessianFD package require many fewer function or gradient evaluations than would be required if the Hessian were treated as dense. The package implements established graph coloring and linear substitution algorithms that were previously unavailable to R users, and is most useful when other numerical, symbolic or algorithmic methods are impractical, inefficient or unavailable.}
\date{January 25, 2017}
%% need no \usepackage{Sweave.sty}
%%\newenvironment{knitrout}{\begin{figure}}{\end{figure}}

\begin{document}

<<echo=FALSE>>=
suppressPackageStartupMessages(require(Matrix))
knitr::render_sweave()
knitr::opts_chunk$set(prompt=TRUE, cache=FALSE,error=FALSE) #$
options(replace.assign=TRUE, width=77, prompt="R> ")
@

The Hessian matrix of a log likelihood function or log posterior density function plays an important role in statistics. From a frequentist point of view, the inverse of the negative Hessian is the asymptotic covariance of the sampling distribution of a maximum likelihood estimator. In Bayesian analysis, when evaluated at the posterior mode, it is the covariance of a Gaussian approximation to the posterior distribution. More broadly, many numerical optimization algorithms require repeated computation, estimation or approximation of the Hessian or its inverse; see \citet{NocedalWright2006}. The Hessian of an objective function with $M$ variables has $M^2$ elements, of which $M(M+1)/2$ are unique.
Thus, the storage requirements of the Hessian, and the computational cost of many linear algebra operations on it, grow quadratically with the number of decision variables. For applications with hundreds of thousands of variables, computing the Hessian even once might not be practical under time, storage or processor constraints. Hierarchical models, in which each additional heterogeneous unit is associated with its own subset of variables, are particularly vulnerable to this curse of dimensionality. However, for many problems, the Hessian is \emph{sparse}, meaning that the proportion of ``structural'' zeros (matrix elements that are always zero, regardless of the value at which the function is evaluated) is high. Consider a log posterior density in a Bayesian hierarchical model. If the outcomes across units are conditionally independent, the cross-partial derivatives of heterogeneous variables across units are zero. As the number of units increases, the size of the Hessian still grows quadratically, but the number of \emph{non-zero} elements grows only linearly; the Hessian becomes increasingly sparse. The row and column indices of the non-zero elements comprise the \emph{sparsity pattern} of the Hessian, and are typically known in advance, before computing the values of those elements. \proglang{R} packages such as \pkg{trustOptim} \citep{R_trustOptim}, \pkg{sparseMVN} \citep{R_sparseMVN} and \pkg{ipoptr} \citep{R_ipoptr} have the capability to accept Hessians in a compressed sparse format.

The \pkg{sparseHessianFD} package is a tool for estimating sparse Hessians numerically, using either finite differences or complex perturbations of gradients. Section~\ref{sec:numdiff} will cover the specifics, but the basic idea is as follows. Consider a real-valued function $f(x)$, its gradient $\nabla{f(x)}$, and its Hessian $\hess{}{f(x)}$, for $x\in\mathbb{R}^M$. Define the derivative vector as the transpose of the gradient, a vector of partial derivatives, so ${\parD{}{}{f(x)}=\nabla f(x)^\top=\left(\parD{}{1}{},\mathellipsis,\parD{}{M}{}\right)}$. (Throughout the paper, we will try to reduce notational clutter by referring to the derivative and Hessian as $\parD{}{}{}$ and $\hess{}{}$, respectively, without the $f(x)$ symbol). Let $e_m$ be a vector of zeros, except with a 1 in the $m$th element, and let $\delta$ be a sufficiently small scalar constant. A ``finite difference'' linear approximation to the $m$th column of the Hessian is ${\hess{m}{}\approx\left(\nabla f(x+\delta e_m)- \nabla f(x)\right)/\delta}$. Estimating a dense Hessian in this way involves at least $M+1$ calculations of the gradient: one for the gradient at $x$, and one after perturbing each of the $M$ elements of $x$, one at a time. Under certain conditions, a more accurate approximation is the ``complex step'' method: $\hess{m}{}\approx\Imag\left(\nabla f(x+i\delta e_m)\right)/\delta$, where $i=\sqrt{-1}$ and $\Imag$ returns the imaginary part of a complex number \citep{SquireTrapp1998}. Regardless of the approximation method used, if the Hessian is sparse, most of the elements are constrained to zero. Depending on the sparsity pattern of the Hessian, those constraints may let us recover the Hessian with fewer gradient evaluations by perturbing multiple elements of $x$ together. For some sparsity patterns, estimating a Hessian in this way can be profoundly efficient.
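To fix ideas, a naive dense implementation of the finite difference approach might look like the following sketch. It is for illustration only (the function name is arbitrary, and this is not how \pkg{sparseHessianFD} estimates the Hessian); it requires $M+1$ gradient evaluations, one at $x$ itself plus one for each perturbed column.

<<eval=FALSE>>=
## Illustrative sketch only: a dense finite difference Hessian estimate
## built from an exact gradient function gr(), one column at a time.
fd.dense.hessian <- function(gr, x, delta=sqrt(.Machine$double.eps)) {
  M <- length(x)
  g0 <- gr(x)                       ## gradient at x
  H <- matrix(0, M, M)
  for (m in seq_len(M)) {
    xm <- x
    xm[m] <- xm[m] + delta          ## perturb one variable at a time
    H[, m] <- (gr(xm) - g0)/delta   ## approximate the mth column
  }
  (H + t(H))/2                      ## average out asymmetric noise
}
@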
In fact, for the hierarchical models that we consider in this paper, the number of gradient evaluations does not increase with the number of additional heterogeneous units. The package defines the \class{sparseHessianFD} class, whose initializer requires the user to provide functions that compute an objective function and its gradient (as accurately as possible, to machine precision), along with the sparsity pattern of its Hessian matrix. The sparsity pattern (e.g., location of structural zeros) must be known in advance, and cannot vary across the domain of the objective function. The only functions and methods of the class that the end user should need to use are the initializer, methods that return the Hessian in a sparse compressed format, and perhaps some utility functions that simplify the construction of the sparsity pattern. The class also defines methods that partition the variables into groups that can be perturbed together in a finite differencing step, and recover the elements of the Hessian via linear substitution. Those methods perform most of the work, but should be invisible to the user.

As with any computing method or algorithm, there are boundaries around the space of applications for which \pkg{sparseHessianFD} is the right tool for the job. In general, numerical approximations are not ``first choice'' methods because the result is not exact, so \pkg{sparseHessianFD} should not be used when the application cannot tolerate any error, no matter how small. Also, we admit that some users might balk at having to provide an exact gradient, even though the Hessian will be estimated numerically.\footnote{It is possible that a gradient that is approximated using a complex step method would be precise enough to be useful. A further investigation is beyond the scope of this paper.} However, deriving a vector of first derivatives, and writing \proglang{R} functions to compute them, is a lot easier than doing the same for a matrix of second derivatives, and more accurate than computing second-order approximations from the objective function. Even when we have derived the Hessian symbolically, in practice it may still be faster to estimate the Hessian using \pkg{sparseHessianFD} than to code it directly. These are the situations in which \pkg{sparseHessianFD} adds the most value to the statistician's toolbox.

This article proceeds as follows. First, we present some background information about numerical differentiation, and sparse matrices in \proglang{R}, in Section~\ref{sec:background}. In Section~\ref{sec:using}, we explain how to use the package. Section~\ref{sec:algorithms} explains the underlying algorithms, and Section~\ref{sec:timing} demonstrates the scalability of those algorithms.

\section[Background]{Background}\label{sec:background}

Before describing how to use the package, we present two short background notes. The first note is an informal mathematical explanation of numerical estimation of the Hessian matrix, with an illustration of how the number of gradient estimates can be reduced by exploiting the sparsity pattern and symmetric structure. This note borrows heavily from, and uses the notation in, \citet[Chapter 6]{MagnusNeudecker2007}. The second note is a summary of some of the sparse matrix classes that are defined in the \pkg{Matrix} package \citep{R_Matrix}, which are used extensively in \pkg{sparseHessianFD}.
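As a brief illustration of the kind of object we have in mind (this chunk is not evaluated, and the matrix \variable{S} is purely hypothetical), a symmetric sparse matrix can be stored in compressed column format with only its non-zero elements:

<<eval=FALSE>>=
## Illustration: a small symmetric matrix stored as a compressed sparse
## column object from the Matrix package.
library("Matrix")
S <- sparseMatrix(i=c(1, 2, 2, 3), j=c(1, 1, 2, 3), x=c(2, 1, 2, 2),
                  symmetric=TRUE)
class(S)   ## "dsCMatrix": numeric, symmetric, compressed sparse column
str(S)     ## only the non-zero values and their indices are stored
@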
\subsection[Numerical differentiation of sparse Hessians]{Numerical differentiation of sparse Hessians}\label{sec:numdiff}

The partial derivative of a real scalar-valued function $f(x)$ with respect to $x_j$ (the $j$th component of $x\in\Real{M}$) is defined as
\begin{align}
\label{eq:defParD}
\parD{}{j}{f(x)}=\lim\limits_{\delta\to 0}\frac{f(x+\delta e_j)-f(x)}{\delta}
\end{align}
For a sufficiently small $\delta$, this definition allows for a linear approximation to $\parD{}{j}{f(x)}$. The derivative of $f(x)$ is the vector of all $M$ partial derivatives.
\begin{align}
\parD{}{}{f(x)}=\left(\parD{}{1}{f(x)},\mathellipsis,\parD{}{M}{f(x)}\right)
\end{align}
The \emph{gradient} is defined as $\nabla f(x)=\parD{}{}{f(x)}^\top$. We define the second-order partial derivative as
\begin{align}
\label{eq:14}
\parD{2}{jk}{}=\lim\limits_{\delta\to 0}\frac{\parD{}{j}{f(x+\delta e_k)}-\parD{}{j}{f(x)}}{\delta}
\end{align}
and the Hessian as
\begin{align}
\label{eq:15}
\hess{}{f(x)}=
\begin{pmatrix}
\parD{2}{11}{}& \parD{2}{12}{}& \mathellipsis & \parD{2}{1M}{}\\
\parD{2}{21}{}& \parD{2}{22}{}& \mathellipsis & \parD{2}{2M}{}\\
\vdots&\vdots&&\vdots\\
\parD{2}{M1}{}& \parD{2}{M2}{}& \mathellipsis & \parD{2}{MM}{}
\end{pmatrix}
\end{align}
The Hessian is symmetric, so $\parD{2}{ij}{}=\parD{2}{ji}{}$.

\subsubsection[Approximation using finite differences]{Approximation using finite differences}

To estimate the $m$th column of $\hess{}{}$ using finite differences, we choose a sufficiently small $\delta$, and compute
\begin{align}
\label{eq:1}
\hess{m}{f(x)}&\approx\frac{\parD{}{}{f(x+\delta e_m)}-\parD{}{}{f(x)}}{\delta}
\end{align}
For $M=2$, our estimate of a general $\hess{}{f(x)}$ would be
\begin{align}
\label{eq:FDhess2}
\hess{}{f(x)}&\approx
\begin{pmatrix}
\parD{}{1}{f(x_1+\delta, x_2)}-\parD{}{1}{f(x_1,x_2)}& \parD{}{1}{f(x_1,x_2+\delta)}-\parD{}{1}{f(x_1,x_2)}\\
\parD{}{2}{f(x_1+\delta, x_2)}-\parD{}{2}{f(x_1,x_2)}& \parD{}{2}{f(x_1,x_2+\delta)}-\parD{}{2}{f(x_1,x_2)}
\end{pmatrix}/\delta
\end{align}
This estimate requires three evaluations of the gradient to get ${\parD{}{}{f(x_1,x_2)}}$, ${\parD{}{}{f(x_1+\delta,x_2)}}$, and ${\parD{}{}{f(x_1,x_2+\delta)}}$.

Now suppose that the Hessian is sparse, and that the off-diagonal elements are zero. That means that
\begin{align}
\parD{}{1}{f(x_1,x_2+\delta)}-\parD{}{1}{f(x_1,x_2)}&=0\label{eq:I1}\\
\parD{}{2}{f(x_1+\delta,x_2)}-\parD{}{2}{f(x_1,x_2)}&=0\label{eq:I2}
\end{align}
If the identity in Equation~\ref{eq:I1} holds for $x_1$, it must also hold for $x_1+\delta$, and if Equation~\ref{eq:I2} holds for $x_2$, it must also hold for $x_2+\delta$. Therefore,
\begin{align}
\label{eq:FDhess2sp}
\hess{}{f(x)}&\approx
\begin{pmatrix}
\parD{}{1}{f(x_1+\delta, x_2+\delta)}-\parD{}{1}{f(x_1,x_2)}&0\\
0& \parD{}{2}{f(x_1+\delta,x_2+\delta)}-\parD{}{2}{f(x_1,x_2)}
\end{pmatrix}/\delta
\end{align}
Only two gradients, ${\parD{}{}{f(x_1,x_2)}}$ and ${\parD{}{}{f(x_1+\delta,x_2+\delta)}}$, are needed. Being able to reduce the number of gradient evaluations from 3 to 2 depends on knowing that the cross-partial derivatives are zero.

\subsubsection[Approximation using complex steps]{Approximation using complex steps}

If $f(x)$ is defined over a complex domain and is holomorphic, then we can approximate $\parD{}{}{f(x)}$ and $\hess{}{f(x)}$ at real values of $x$ using the complex step method. This method comes from a Taylor series expansion of $f(x)$ in the imaginary direction of the complex plane \citep{SquireTrapp1998}.
Writing out this expansion, and then rearranging terms and taking the imaginary parts of both sides, we have
\begin{align}
\label{eq:complexTaylor}
f(x+i\delta)&=f(x) + i\delta\parD{}{}{f(x)}+\mathcal{O}(\delta^2)\\
\parD{}{}{f(x)}&\approx\frac{\Imag(f(x+i\delta))}{\delta}
\end{align}
Estimating a first derivative using the complex step method does not require a differencing operation, so there is no subtraction operation that might generate roundoff errors. Thus, the approximation can be made arbitrarily precise as $\delta\rightarrow 0$ \citep{LaiCrassidis2008}. This is not the case for second-order approximations of the Hessian \citep{AbreuStich2013}. However, when the gradient can be computed exactly, we can compute a first-order approximation to the Hessian by treating it as the Jacobian of a vector-valued function \citep{LaiCrassidis2008}.
\begin{align}
\label{eq:CShess2}
\hess{}{f(x)}&\approx
\begin{pmatrix}
\Imag(\parD{}{1}{f(x_1+i\delta, x_2)})& \Imag(\parD{}{1}{f(x_1,x_2+i\delta)})\\
\Imag(\parD{}{2}{f(x_1+i\delta, x_2)})&\Imag(\parD{}{2}{f(x_1,x_2+i\delta)})
\end{pmatrix}/\delta
\end{align}
If this matrix were dense, we would need two evaluations of $\parD{}{}{f(x)}$ to estimate it. If the matrix were sparse, with the same sparsity pattern as the Hessian in Equation~\ref{eq:FDhess2sp}, and we assume that structural zeros remain zero for all complex $x\in\Complex{M}$, then we need only one evaluation. Suppose we were to subtract $\Imag(\parD{}{}{f(x_1,x_2)})$ from each column of $\hess{}{f(x)}$. When $x$ is real, the imaginary part of the gradient is zero, so this operation has no effect on the value of the Hessian. But the sparsity constraints ensure that the following identities hold for all complex $x$.
\begin{align}
\Imag(\parD{}{1}{f(x_1,x_2+i\delta)}) -\Imag(\parD{}{1}{f(x_1,x_2)})&=0\label{eq:I1c}\\
\Imag(\parD{}{2}{f(x_1+i\delta,x_2)})-\Imag(\parD{}{2}{f(x_1,x_2)})&=0\label{eq:I2c}
\end{align}
As with the finite difference method, because Equation~\ref{eq:I1c} holds for $x_1$, it must also hold for $x_1+i\delta$, and because Equation~\ref{eq:I2c} holds for $x_2$, it must also hold for $x_2+i\delta$. Thus,
\begin{align}
\label{eq:CShess2sp}
\hess{}{f(x)}&\approx
\begin{pmatrix}
\Imag(\parD{}{1}{f(x_1+i\delta, x_2+i\delta)})&0\\
0&\Imag(\parD{}{2}{f(x_1+i\delta,x_2+i\delta)})
\end{pmatrix}/\delta
\end{align}
for real $x$. Only one evaluation of the gradient is required.

\subsubsection[Perturbing groups of variables]{Perturbing groups of variables}

\Citet{CurtisPowellReid1974} describe a method of estimating sparse Jacobian matrices by perturbing groups of variables together. \Citet{PowellToint1979} extend this idea to the general case of sparse Hessians. This method partitions the decision variables into $C$ mutually exclusive groups so that the number of gradient evaluations is reduced. Define $\Mat{G}\in\Real{M\times C}$ where $\Mat{G}_{mc}=\delta$ if variable $m$ belongs to group $c$, and zero otherwise. Define $\Mat{G}_c\in\Real{M}$ as the $c$th column of $\Mat{G}$. Next, define $\Mat{Y}\in\Real{M\times C}$ such that each column is either a difference in gradients, or the imaginary part of a complex-valued gradient, depending on the chosen method.
\begin{align}
\label{eq:Yg}
\Mat{Y}_c&=
\begin{cases}
\nabla f(x+G_c)-\nabla f(x)&\text{finite difference method}\\
\Imag(\nabla f(x+iG_c))&\text{complex step method}
\end{cases}
\end{align}
If $C=M$, then $\Mat{G}$ is a diagonal matrix with $\delta$ in each diagonal element.
The matrix equation $\hess{}{}\Mat{G}=\Mat{Y}$ represents the linear approximation $\hess{im}{}\delta\approx y_{im}$, and we can solve for all elements of $\hess{}{}$ just by computing $\Mat{Y}$. But if $C<M$, a single column of $\Mat{Y}$ may combine contributions from several columns of $\hess{}{}$, and the individual elements must then be recovered through the substitution method described in Section~\ref{sec:algorithms}.

<<echo=FALSE>>=
N <- 5
k <- 2
nv1 <- (N + 1) * k
nels1 <- nv1^2
nnz1 <- (N + 1) * k^2 + 2 * N * k^2
nnz1LT <- (N + 1) * k * (k+1)/2 + N * k^2
Q <- 1000
nv2 <- (Q+1) * k
nels2 <- nv2^2
nnz2 <- (Q+1) * k^2 + 2 * Q * k^2
nnz2LT <- (Q+1) * k * (k+1)/2 + Q * k^2
options(scipen=999)
@

Let $x_1$ and $x_2$ be two subsets of elements of $x$. Define $\parD{2}{x_1,x_2}{}$ as the product set of cross-partial derivatives between all elements in $x_1$ and all elements in $x_2$. From the log posterior density in Equation~\ref{eq:LPD}, we can see that $\parD{2}{\beta_i, \beta_i}{}\neq 0$ (one element of $\beta_i$ could be correlated with another element of $\beta_i$), and that, for all $i$, $\parD{2}{\beta_i, \mu}{}\neq 0$ (because $\mu$ is the prior mean of each $\beta_i$). However, since the $\beta_i$ and $\beta_j$ are independently distributed, and the $y_i$ are conditionally independent, the cross-partial derivatives $\parD{2}{\beta_i, \beta_j}{}=0$ for all $i\neq j$. When $N$ is much greater than $k$, there will be many more zero cross-partial derivatives than non-zero. Each $\parD{2}{}{}$ is mapped to a submatrix of $\hess{}{}$, most of which will be zero. The resulting Hessian of the log posterior density will be sparse.

The sparsity pattern depends on the indexing function; that is, on how the variables are ordered in $x$. One such ordering is to group all of the coefficients in the $\beta_i$ for each unit together.
\begin{align}
\beta_{11},\dotsc,\beta_{1k},\beta_{21},\dotsc,\beta_{2k},~\dotsc~,\beta_{N1},\dotsc,\beta_{Nk},\mu_1,\dotsc,\mu_k
\end{align}
In this case, the Hessian has a ``block-arrow'' structure. For example, if $N=\Sexpr{N}$ and $k=\Sexpr{k}$, then there are \Sexpr{nv1} total variables, and the Hessian will have the pattern in Figure~\ref{fig:blockarrow}.

Another possibility is to group coefficients for each covariate together.
\begin{align}
\beta_{11},\dotsc,\beta_{N1},\beta_{12},\dotsc,\beta_{N2},~\dotsc~,\beta_{1k},\dotsc,\beta_{Nk},\mu_1,\dotsc,\mu_k
\end{align}
Now the Hessian has a ``banded'' sparsity pattern, as in Figure~\ref{fig:banded}.

\begin{figure}[tbp]
\begin{subfigure}[b]{.5\textwidth}
<<echo=FALSE>>=
Mat <- as(kronecker(Diagonal(N), matrix(1, k, k)),"sparseMatrix")
Mat <- rbind(Mat, Matrix(1, k, N*k))
Mat <- cbind(Mat, Matrix(1, k * (N+1), k))
printSpMatrix(as(Mat,"nMatrix"))
@
\caption{A ``block-arrow'' sparsity pattern.}\label{fig:blockarrow}
\end{subfigure}
\begin{subfigure}[b]{.5\textwidth}
<<echo=FALSE>>=
Mat <- kronecker(Matrix(1, k, k), Diagonal(N))
Mat <- rbind(Mat, Matrix(1, k, N * k))
Mat <- cbind(Mat, Matrix(1, k * (N + 1), k))
printSpMatrix(as(Mat,"nMatrix"))
@
\caption{A ``banded'' sparsity pattern.}\label{fig:banded}
\end{subfigure}
\caption{Two examples of sparsity patterns for a hierarchical model.}\label{fig:patterns}
\end{figure}

In both cases, the number of non-zeros is the same. There are \Sexpr{nels1} elements in this symmetric matrix, but only \Sexpr{nnz1} are non-zero, and only \Sexpr{nnz1LT} values are unique. Although the reduction in RAM from using a sparse matrix structure for the Hessian may be modest, consider what would happen if $N=\Sexpr{scales::comma(Q)}$ instead. In that case, there are \Sexpr{scales::comma(nv2)} variables in the problem, and more than $\Sexpr{scales::comma(floor(nels2/10^6))}$ million elements in the Hessian.
However, only $\Sexpr{scales::comma(nnz2)}$ of those elements are non-zero. If we work with only the lower triangle of the Hessian, then we need to work with only \Sexpr{scales::comma(nnz2LT)} values.

The sparsity pattern required by \pkg{sparseHessianFD} consists of the row and column indices of the non-zero elements in the \emph{lower triangle} of the Hessian, and it is the responsibility of the user to ensure that the pattern is correct. In practice, rather than trying to keep track of the row and column indices directly, it might be easier to construct a pattern matrix first, check visually that the matrix has the right pattern, and then extract the indices. The package defines utility functions (Table~\ref{tab:helpers}) to convert between sparse matrices and the vectors of row and column indices required by the \class{sparseHessianFD} initializer.

\begin{table}[ptb]
\centering
\setlength\extrarowheight{.25em}
\begin{tabularx}{1.0\linewidth}{lX}
\toprule
\func{Matrix.to.Coord} & Returns a list of vectors containing row and column indices of the non-zero elements of a matrix.\\
\func{Matrix.to.Pointers}&Returns indices and pointers from a sparse matrix.\\
\func{Coord.to.Pointers}&Converts list of row and column indices (triplet format) to a list of indices and pointers (compressed format).\\
\bottomrule
\end{tabularx}
\caption{\pkg{sparseHessianFD} matrix conversion functions.}
\label{tab:helpers}
\end{table}

The \func{Matrix.to.Coord} function extracts row and column indices from a sparse matrix. The following code constructs a logical block diagonal matrix, converts it to a sparse matrix, and prints the sparsity pattern of its lower triangle.

\begin{minipage}{\columnwidth}
<<>>=
library("sparseHessianFD")
bd <- kronecker(diag(3), matrix(TRUE, 2, 2))
Mat <- as(bd, "nMatrix")
printSpMatrix(tril(Mat))
mc <- Matrix.to.Coord(tril(Mat))
mc
@
\end{minipage}
\vspace{1em}

To check visually that a proposed sparsity pattern represents the intended matrix, use the \pkg{Matrix} \func{sparseMatrix} constructor.
<<>>=
pattern <- sparseMatrix(i=mc$rows, j=mc$cols)
printSpMatrix(pattern)
@

If there is uncertainty about whether an element is a structural zero or not, one should err on the side of it being non-zero, and include that element in the sparsity pattern. There might be a loss of efficiency if the element really is a structural zero, but the result will still be correct. All that would happen is that the numerical estimate for that element would be zero (within machine precision). On the other hand, excluding a non-zero element from the sparsity pattern will likely lead to an incorrect estimate of the Hessian.

\subsection[The sparseHessianFD class]{The \class{sparseHessianFD} class}

The function \func{sparseHessianFD} is an initializer that returns a reference to a \class{sparseHessianFD} object. The initializer determines an appropriate permutation and partitioning of the variables, and performs some additional validation tests. The arguments to the initializer are described in Table~\ref{tab:init}.

{\setlength{\extrarowheight}{.25em}
\begin{table}[ptb]
\begin{tabularx}{\linewidth}{>{\bfseries}r X}
\toprule
x& A numeric vector of length $M$ at which the object will be initialized and tested.\\
fn,gr& \proglang{R} functions that return the value of the objective function, and its gradient. The first argument is the numeric variable vector.
Other named arguments can be passed to \func{fn} and \func{gr} as well (see the \funcarg{...} argument below).\\
rows, cols& Sparsity pattern: integer vectors of the row and column indices of the non-zero elements in the \emph{lower triangle} of the Hessian.\\
delta& The perturbation amount for finite differencing of the gradient to compute the Hessian (the $\delta$ in Section~\ref{sec:numdiff}). Defaults to \code{sqrt(.Machine\$double.eps)}.\\
index1& If \variable{TRUE} (the default), \funcarg{rows} and \funcarg{cols} use one-based indexing. If \variable{FALSE}, zero-based indexing is used.\\
complex&If \variable{TRUE}, the complex step method is used. If \variable{FALSE} (the default), a simple finite differencing of gradients is used.\\
...& Additional arguments to be passed to \func{fn} and \func{gr}.\\
\bottomrule
\end{tabularx}
\caption{Arguments to the sparseHessianFD initializer.}\label{tab:init}
\end{table}
}

To create a \class{sparseHessianFD} object, just call \func{sparseHessianFD}. Applying the default values for the optional arguments, the usage syntax to create a \class{sparseHessianFD} object is
<<eval=FALSE>>=
obj <- sparseHessianFD(x, fn, gr, rows, cols, ...)
@
where \funcarg{...} represents all other arguments that are passed to \funcarg{fn} and \funcarg{gr}.

The \func{fn}, \func{gr} and \func{hessian} methods respectively evaluate the function, gradient and Hessian at a variable vector $x$. The \func{fngr} method returns the function and gradient as a list. The \func{fngrhs} method includes the Hessian as well.
<<eval=FALSE>>=
f <- obj$fn(x)
df <- obj$gr(x)
hess <- obj$hessian(x)
fngr <- obj$fngr(x)
fngrhs <- obj$fngrhs(x)
@

\subsection[An example]{An example}\label{sec:exHess}

Now we can estimate the Hessian for the log posterior density of the model from Section~\ref{sec:example}. For demonstration purposes, \pkg{sparseHessianFD} includes functions that compute the value (\func{binary.f}), the gradient (\func{binary.grad}) and the Hessian (\func{binary.hess}) of this model. We will treat the result from \func{binary.hess} as a ``true'' value against which we will compare the numerical estimates.

To start, we load the data, set some dimension parameters, set prior values for $\Mat{\Sigma}^{-1}$ and $\Mat{\Omega}^{-1}$, and simulate a vector of variables at which to evaluate the function. The \func{binary.f} and \func{binary.grad} functions take the data and priors as lists. The \func{data(binary)} call adds the appropriate data list to the environment, but we need to construct the prior list ourselves.
<<>>=
set.seed(123)
data("binary")
str(binary)
N <- length(binary[["Y"]])
k <- NROW(binary[["X"]])
T <- binary[["T"]]
nvars <- as.integer(N * k + k)
priors <- list(inv.Sigma = rWishart(1, k + 5, diag(k))[,,1],
               inv.Omega = diag(k))
@

This dataset represents the simulated choices for $N=\Sexpr{N}$ customers over $T=\Sexpr{T}$ purchase opportunities, where the probability of purchase is influenced by $k=\Sexpr{k}$ covariates.

The next code chunk evaluates the ``true'' value, gradient and Hessian. The \funcarg{order.row} argument tells the function whether the variables are ordered by household (\variable{TRUE}) or by covariate (\variable{FALSE}). If \funcarg{order.row} is \variable{TRUE}, then the Hessian will have a banded pattern. If \funcarg{order.row} is \variable{FALSE}, then the Hessian will have a block-arrow pattern.
<<>>=
P <- rnorm(nvars)
order.row <- FALSE
true.f <- binary.f(P, binary, priors, order.row=order.row)
true.grad <- binary.grad(P, binary, priors, order.row=order.row)
true.hess <- binary.hess(P, binary, priors, order.row=order.row)
@

The sparsity pattern of the Hessian is specified by two integer vectors: one each for the row and column indices of the non-zero elements of the lower triangle of the Hessian. For this example, we happen to have a matrix with the same sparsity pattern as the Hessian we are trying to compute, so we can use the \func{Matrix.to.Coord} function to extract the appropriate index vectors. In practice, it is more likely that we would need to determine the row and column indices directly, through our knowledge of the structure of the problem. For a hierarchical model, we can create a block-arrow pattern matrix using either the \func{Matrix::bdiag} or \func{kronecker} functions to create a block diagonal matrix, and concatenate dense rows and columns to the margins.
<<>>=
pattern <- Matrix.to.Coord(tril(true.hess))
str(pattern)
@

Finally, we create an instance of a \class{sparseHessianFD} object. Evaluations of the function and gradient using the \method{fn} and \method{gr} methods will always give the same results as the true values because they are computed using the same functions.
<<>>=
obj <- sparseHessianFD(P, fn=binary.f, gr=binary.grad,
                       rows=pattern[["rows"]], cols=pattern[["cols"]],
                       data=binary, priors=priors, order.row=order.row)
f <- obj$fn(P)
identical(f, true.f)
gr <- obj$gr(P)
identical(gr, true.grad)
@

The default choice of method is \funcarg{complex=FALSE}, so the evaluation of the Hessian is a finite-differenced approximation; it is very close to, but not identical to, the true value, in terms of mean relative difference.
<<echo=FALSE>>=
options(scipen=0)
@
<<>>=
hs <- obj$hessian(P)
mean(abs(hs-true.hess))/mean(abs(hs))
@

If \funcarg{complex=TRUE} in the initializer, the call to the \method{hessian} method will apply the complex step method. To use this method, the functions passed as \funcarg{fn} and \funcarg{gr} must both accept a complex argument, and return a complex result, even though we are differentiating a real-valued function. Although \pkg{base} \proglang{R} supports complex arguments for most basic mathematical functions, many common functions (e.g., \func{gamma}, \func{log1p}, \func{expm1}, and the probability distribution functions) do not have complex implementations. Furthermore, the complex step method is valid only if the function is holomorphic. The methods in \class{sparseHessianFD} do \emph{not} check that this is the case for the function at hand. We convey the following warning from the documentation of the \pkg{numDeriv} package \citep{R_numDeriv}, which also implements the complex step method: ``avoid this method if you do not know that your function is suitable. Your mistake may not be caught and the results will be spurious.''

Fortunately, for demonstration purposes, the log posterior density in Equation~\ref{eq:LPD} is holomorphic, so we can estimate its Hessian using the complex step method, and compute the mean relative difference from the true Hessian.
<<>>=
obj2 <- sparseHessianFD(P, fn=binary.f, gr=binary.grad,
                        rows=pattern[["rows"]], cols=pattern[["cols"]],
                        complex=TRUE, data=binary, priors=priors,
                        order.row=order.row)
hs2 <- obj2$hessian(P)
mean(abs(hs2-true.hess))/mean(abs(hs2))
@
<<echo=FALSE>>=
options(scipen=999)
@

In short, the complex step method can be more accurate than finite differencing, but it comes with theoretical and implementation restrictions that may limit its applicability.

\section[Algorithms]{Algorithms}\label{sec:algorithms}

In this section, we explain how \pkg{sparseHessianFD} works. The algorithms are adapted from \citet{ColemanGarbow1985}, who provided \proglang{Fortran} implementations as \citet{ColemanGarbow1985b}. Earlier versions of \pkg{sparseHessianFD} included licensed copies of the \citet{ColemanGarbow1985b} code, on which the current version no longer depends. Although newer partitioning algorithms have been proposed (e.g., \citealp{GebremedhinManne2005,GebremedhinTarafdar2009}), mainly in the context of automatic differentiation, we have chosen to implement established algorithms that are known to work well, and are likely optimal for the hierarchical models that many statisticians will encounter.

\subsection[Partitioning the variables]{Partitioning the variables}\label{sec:coloring}

Finding consistent, efficient partitions can be characterized as a vertex coloring problem from graph theory \citep{ColemanMore1984}. In this sense, each variable is a vertex in an undirected graph, and an edge connects two vertices $i$ and $j$ if and only if ${\hess{ij}{f(x)}\neq 0}$. The sparsity pattern of the Hessian is the adjacency matrix of the graph. By ``color,'' we mean nothing more than group assignment; if a variable is in a group, then its vertex has the color associated with that group. A ``proper'' coloring of a graph is one in which two vertices with a common edge do not have the same color. \Citet{ColemanMore1984} define a ``triangular coloring'' as a proper coloring with the additional condition that common neighbors of a vertex do not have the same color. A triangular coloring is a special case of a ``cyclic coloring,'' in which any cycle in the graph uses at least three colors \citep{GebremedhinTarafdar2007}.

An ``intersection set'' contains characteristics that are common to two vertices, and an ``intersection graph'' connects vertices whose intersection set is not empty. In our context, the set in question is the row indices of the non-zero elements in each column of $\hessLT{}{}$. In the intersection graph, two vertices are connected if the corresponding columns in $\hessLT{}{}$ have at least one non-zero element in a common row. \Citet{PowellToint1979} write that a partitioning is consistent with a substitution method if and only if no columns of the lower triangle of the Hessian that are in the same group have a non-zero element in the same row. An equivalent statement is that no two adjacent vertices in the intersection graph can have the same color. Thus, we can partition the variables by creating a proper coloring of the intersection graph of $\hessLT{}{}$.

This intersection graph, and the number of colors needed to color it, are not invariant to permutation of the rows and columns of $\hess{}{}$. Let $\pi$ represent such a permutation, and let $\hessLT{\pi}{}$ be the lower triangle of $\pi\hess{}{}{\pi^\top}$. \Citet[Theorem 6.1]{ColemanMore1984} show that a coloring is triangular if and only if it is also a proper coloring of the intersection graph of $\hessLT{\pi}{}$.
Furthermore, \citet{ColemanCai1986} prove that a partitioning is consistent with a substitution method if and only if it is an acyclic coloring of the graph of the sparsity pattern of the Hessian. Therefore, finding an optimal partitioning of the variables involves finding an optimal combination of a permutation $\pi$ and a coloring algorithm for the intersection graph of $\hessLT{\pi}{}$.

These ideas are illustrated in Figures~\ref{fig:graph1} and \ref{fig:graph2}. Figure~\ref{fig:graph1adj} shows the sparsity pattern of the lower triangle of a Hessian as an adjacency matrix, and Figure~\ref{fig:graph1pic} is the associated graph with a proper vertex coloring. Every column (and thus, every pair of columns) in Figure~\ref{fig:graph1adj} has a non-zero element in row 7, so no pair of columns has an empty intersection set. All vertices are connected to each other in the intersection graph (Figure~\ref{fig:graph1int}), which requires seven colors for a proper coloring. Estimating a sparse Hessian with this partitioning scheme would be no more efficient than treating the Hessian as if it were dense.

\begin{figure}[ptb]
\begin{subfigure}[b]{.32\textwidth}\centering
\begin{tabular}{r|ccccccc}
 & 1&2&3&4&5&6&7\\
\hline
1& 1&&&&&&\\
2&1&1&&&&&\\
3&0&0&1&&&&\\
4&0&0&1&1&&&\\
5&0&0&0&0&1&&\\
6&0&0&0&0&1&1&\\
7&1&1&1&1&1&1&1
\end{tabular}
\caption{Adjacency matrix.}\label{fig:graph1adj}
\end{subfigure}
\begin{subfigure}[b]{.32\textwidth}\centering
\iftoggle{tikz}{
\begin{tikzpicture}
\tikzstyle{every node}=[draw, shape=circle]
\node at (0,0) [fill=green!20] (v7) {7};
\node at (0:1) [fill=blue!20] (v1) {1};
\node at (60:1) [fill=red!20] (v2) {2};
\node at (120:1) [fill=blue!20] (v3) {3};
\node at (180:1) [fill=red!20] (v4) {4};
\node at (240:1) [fill=blue!20] (v5) {5};
\node at (300:1) [fill=red!20] (v6) {6};
\draw (v1) -- (v7) (v2) -- (v7) (v3) -- (v7) (v4) -- (v7) (v5) -- (v7) (v6) -- (v7)
      (v1) -- (v2) (v3) -- (v4) (v5) -- (v6);
\end{tikzpicture}
}{} %% end toggle
\caption{Sparsity graph.}\label{fig:graph1pic}
\end{subfigure}
\begin{subfigure}[b]{.32\textwidth}\centering
\iftoggle{tikz}{
\begin{tikzpicture}
\tikzstyle{every node}=[draw, shape=circle]
\node at (0,0) [fill=green!20] (v7) {7};
\node at (0:1) [fill=blue!20] (v1) {1};
\node at (60:1) [fill=red!20] (v2) {2};
\node at (120:1) [fill=yellow!20] (v3) {3};
\node at (180:1) [fill=purple!20] (v4) {4};
\node at (240:1) [fill=brown!20] (v5) {5};
\node at (300:1) [fill=white!20] (v6) {6};
\draw (v1) -- (v7) (v2) -- (v7) (v3) -- (v7) (v4) -- (v7) (v5) -- (v7) (v6) -- (v7)
      (v1) -- (v6) (v2) -- (v6) (v3) -- (v6) (v4) -- (v6) (v5) -- (v6)
      (v1) -- (v5) (v2) -- (v5) (v3) -- (v5) (v4) -- (v5)
      (v1) -- (v4) (v2) -- (v4) (v3) -- (v4)
      (v1) -- (v3) (v2) -- (v3)
      (v1) -- (v2);
\end{tikzpicture}
}{} %% end toggle
\caption{Intersection graph.}\label{fig:graph1int}
\end{subfigure}
\caption{Unpermuted matrix.}\label{fig:graph1}
\end{figure}

Now suppose we were to rearrange $\hess{}{}$ so that the last row and column were moved to the front. In Figure~\ref{fig:graph2adj}, all columns share at least one non-zero row with the column for variable 7, but within each of the variable groups $\{2,4,6\}$ and $\{1,3,5\}$, the pairwise intersection sets are empty. The intersection graph in Figure~\ref{fig:graph2int} has fewer edges than the one in Figure~\ref{fig:graph1int}, and can be colored with only three colors.
\begin{figure}[ptb]
\begin{subfigure}[b]{.32\textwidth}\centering
\begin{tabular}{r|ccccccc}
 & 7&1&2&3&4&5&6\\
\hline
7& 1&&&&&&\\
1&1&1&&&&&\\
2&1&1&1&&&&\\
3&1&0&0&1&&&\\
4&1&0&0&1&1&&\\
5&1&0&0&0&0&1&\\
6&1&0&0&0&0&1&1
\end{tabular}
\caption{Adjacency matrix.}\label{fig:graph2adj}
\end{subfigure}
\begin{subfigure}[b]{.32\textwidth}\centering
\iftoggle{tikz}{
\begin{tikzpicture}
\tikzstyle{every node}=[draw, shape=circle]
\node at (0,0) [fill=green!20] (v7) {7};
\node at (0:1) [fill=blue!20] (v1) {1};
\node at (60:1) [fill=red!20] (v2) {2};
\node at (120:1) [fill=blue!20] (v3) {3};
\node at (180:1) [fill=red!20] (v4) {4};
\node at (240:1) [fill=blue!20] (v5) {5};
\node at (300:1) [fill=red!20] (v6) {6};
\draw (v1) -- (v7) (v2) -- (v7) (v3) -- (v7) (v4) -- (v7) (v5) -- (v7) (v6) -- (v7)
      (v1) -- (v2) (v3) -- (v4) (v5) -- (v6);
\end{tikzpicture}
}{} %% end toggle
\caption{Sparsity graph.}\label{fig:graph2pic}
\end{subfigure}
\begin{subfigure}[b]{.32\textwidth}\centering
\iftoggle{tikz}{
\begin{tikzpicture}
\tikzstyle{every node}=[draw, shape=circle]
\node at (0,0) [fill=green!20] (v7) {7};
\node at (0:1) [fill=blue!20] (v1) {1};
\node at (60:1) [fill=red!20] (v2) {2};
\node at (120:1) [fill=blue!20] (v3) {3};
\node at (180:1) [fill=red!20] (v4) {4};
\node at (240:1) [fill=blue!20] (v5) {5};
\node at (300:1) [fill=red!20] (v6) {6};
\draw (v1) -- (v7) (v2) -- (v7) (v3) -- (v7) (v4) -- (v7) (v5) -- (v7) (v6) -- (v7)
      (v1) -- (v2) (v3) -- (v4) (v5) -- (v6);
\end{tikzpicture}
}{} %% end toggle
\caption{Intersection graph.}\label{fig:graph2int}
\end{subfigure}
\caption{Permuted matrix.}\label{fig:graph2}
\end{figure}

The practical implication of all of this is that by permuting the rows and columns of the Hessian, we may be able to reduce the number of colors needed for a cyclic coloring of the graph of the sparsity pattern. Fewer colors means fewer partitions of the variables, and that means fewer gradient evaluations to estimate the Hessian. The \class{sparseHessianFD} class finds a permutation, and partitions the variables, when it is initialized. The problem of finding a cyclic coloring of the graph of the sparsity pattern is NP-complete \citep{ColemanCai1986}, so the partitioning may not be truly optimal. Fortunately, we just need the partitioning to be reasonably good, to make the effort worth our while. A plethora of vertex coloring heuristics have been proposed, and we make no claims that any of the algorithms in \pkg{sparseHessianFD} are even ``best available'' for all situations.

The first step is to permute the rows and columns of the Hessian. A reasonable choice is the ``smallest-last'' ordering that sorts the rows and columns in decreasing order of the number of non-zero elements \citep[Theorem 6.2]{ColemanMore1984}. To justify this permutation, suppose non-zeros within a row are randomly distributed across columns. If the row is near the top of the matrix, there is a higher probability that any non-zero element is in the upper triangle, not in the lower. By putting sparser rows near the bottom, we do not change the number of non-zeros in the lower triangle, but we should come close to minimizing the number of non-zeros in each row. Thus, we would expect the number of columns with non-zero elements in common rows to be smaller, and the intersection graph to be sparser \citep{GebremedhinTarafdar2007}. The adjacency matrix of the intersection graph of the permuted matrix is the Boolean crossproduct, $\hessLT{\pi}{}^\top \hessLT{\pi}{}$.
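As a small illustration (this chunk is not evaluated, and is not part of the package), the intersection graphs for the two orderings in Figures~\ref{fig:graph1} and~\ref{fig:graph2} can be computed directly as Boolean crossproducts with functions from the \pkg{Matrix} package.

<<eval=FALSE>>=
## Sketch: build the 7 x 7 pattern from the figures above, then compare
## the intersection graphs of the original and permuted orderings.
library("Matrix")
A <- matrix(FALSE, 7, 7)
diag(A) <- TRUE
A[7, ] <- A[, 7] <- TRUE                 ## variable 7 interacts with everything
A[2, 1] <- A[4, 3] <- A[6, 5] <- TRUE    ## pairs (1,2), (3,4) and (5,6)
A <- A | t(A)
perm <- c(7, 1:6)                        ## move the dense variable to the front
int1 <- crossprod(tril(Matrix(A, sparse=TRUE)))              ## original ordering
int2 <- crossprod(tril(Matrix(A[perm, perm], sparse=TRUE)))  ## permuted ordering
printSpMatrix(as(int1, "nMatrix"))       ## complete graph: seven colors needed
printSpMatrix(as(int2, "nMatrix"))       ## much sparser: three colors suffice
@

The crossproduct for the permuted ordering has non-zero off-diagonal elements only for the pairs of columns that actually share a row, which is why three colors suffice.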
Algorithm \ref{alg:coloring} is a ``greedy'' vertex coloring algorithm, in which vertices are colored sequentially. The result is a cyclic coloring on the sparsity graph, which in turn is a consistent partitioning of the variables.

\begin{algorithm}
\begin{algorithmic}
\REQUIRE{$P[i],i=1,\mathellipsis,M$: sets of column indices of non-zero elements in row $i$.}
\REQUIRE{$F[i],i=1,\mathellipsis,M$: sets of ``forbidden'' colors for vertex $i$ (initially empty)}
\REQUIRE{$U$: set of used colors (initially empty)}
\REQUIRE{$C[i], i=1,\mathellipsis,M$: vector to store output of assigned colors (initially all zero).}
\STATE{$k\leftarrow 0$} \COMMENT{Largest color index used}
\STATE{Insert $0$ in $U$}
\FOR{$i=1$ \TO $M$}
\IF{$F[i]$ is empty (no forbidden colors)}
\STATE{$C[i]\leftarrow 0$}
\ELSE
\STATE{$V\leftarrow U - F[i]$}\COMMENT{Used colors that are not forbidden}
\IF{$V$ is empty}
\STATE{$k\leftarrow k+1$}
\STATE{Insert $k$ into $U$}
\STATE{$C[i]\leftarrow k$}
\ELSE
\STATE{$C[i]\leftarrow\min(V)$} \COMMENT{Assign smallest existing non-forbidden color to $i$}
\ENDIF
\ENDIF
\FOR{$j$ in $P[i]$}
\STATE{Insert $C[i]$ into $F[j]$} \COMMENT{Make $i$'s color forbidden to all of its uncolored neighbors}
\ENDFOR
\ENDFOR
\RETURN{$C$}
\end{algorithmic}
\caption{Consistent partitioning of variables for a triangular substitution method.}\label{alg:coloring}
\end{algorithm}

\subsection[Computing the Hessian by substitution]{Computing the Hessian by substitution}

The cyclic coloring of the sparsity graph defines the $\Mat{G}$ matrix from Section~\ref{sec:numdiff}. We then estimate $\Mat{Y}$ using Equation~\ref{eq:Yg}. Let $C_m$ be the color assigned to variable $m$. The substitution method is defined in \citet[Equation 6.1]{ColemanMore1984}.
\begin{align}
\label{eq:subst}
\hess{ij}{f(x)}&=\Mat{Y}_{i,C_j}/\delta - \sum_{l>i,l\in C_j}\hess{li}{f(x)}
\end{align}
We implement the substitution method using Algorithm \ref{alg:subst}. This algorithm completes the bottom row of the lower triangle, copies values to the corresponding column in the upper triangle, and advances upwards.

\begin{algorithm}
\begin{algorithmic}
\REQUIRE{$P[i],i=1,\mathellipsis,M$: sets of column indices of non-zero elements in row $i$.}
\REQUIRE{$C[i], i=1,\mathellipsis,M$: vector of assigned colors}
\REQUIRE{$\Mat{H}$, an $M\times M$ Hessian (initialized to zero)}
\REQUIRE{$\Mat{B}$, a $\max(C)\times M$ matrix (initialized to zero)}
\REQUIRE{$\Mat{Y}$, a matrix of finite differences}
\REQUIRE{$\delta$, the small constant used to estimate $\Mat{Y}$}
\FOR{$i$ = $M$ \TO $1$}
\FOR{All $j$ in $P[i]$}
\STATE{$z\leftarrow Y[i,C[j]]/\delta - B[C[j], i]$}
\STATE{$\Mat{B}[C[i], j]\leftarrow \Mat{B}[C[i], j] + z$}
\STATE{$H[i,j]\leftarrow z$}
\STATE{$H[j,i]\leftarrow H[i,j]$}
\ENDFOR
\ENDFOR
\end{algorithmic}
\caption{Triangular substitution method.}\label{alg:subst}
\end{algorithm}

\subsection[Software libraries]{Software libraries}

The coloring and substitution algorithms use the \pkg{Eigen} numerical library \citep{Eigen}, and the \pkg{Rcpp} \citep{R_Rcpp} and \pkg{RcppEigen} \citep{R_RcppEigen} \proglang{R} packages. The \pkg{testthat} \citep{R_testthat}, \pkg{scales} \citep{R_scales} and \pkg{knitr} \citep{R_knitr} packages were used for testing, and to prepare this article.

\newpage

\section[Speed and scalability]{Speed and scalability}\label{sec:timing}

As far as we know, \pkg{numDeriv} \citep{R_numDeriv} is the only other \proglang{R} package that computes numerical approximations to derivatives.
Like \pkg{sparseHessianFD}, it includes functions to compute Hessians from user-supplied gradients (through the \func{jacobian} function), and implements both the finite differencing and complex step methods. Its most important distinction from \pkg{sparseHessianFD} is that it treats all Hessians as dense. Thus, we will use \pkg{numDeriv} as the baseline against which we can compare the performance of \pkg{sparseHessianFD}.

To prepare Table~\ref{tab:numDeriv}, we estimated Hessians of the log posterior density in Equation~\ref{eq:LPD} with different numbers of heterogeneous units ($N$) and within-unit parameters ($k$). The total number of variables is $M=(N+1)k$. Table~\ref{tab:numDeriv} shows the means and standard deviations (across 500 replications) of the time (in milliseconds) to compute a Hessian using the finite difference and complex step methods from each package. Times were generated on a compute node running Scientific Linux 6 (64-bit) with an 8-core Intel Xeon X5560 processor (2.80 GHz) with 24 GB of RAM, and collected using the \pkg{microbenchmark} package \citep{R_microbenchmark}. Code to replicate Table~\ref{tab:numDeriv} is available in the \func{doc/} directory of the installed package, and the \func{vignettes/} directory of the source package. In Table~\ref{tab:numDeriv} we see that computation times using \pkg{sparseHessianFD} are considerably shorter than those using \pkg{numDeriv}.

<<echo=FALSE>>=
suppressMessages(library(xtable))
suppressMessages(library(dplyr))
load("vignette_tab4.Rdata")
@

\begin{table}[ptb]\centering
\begin{tabularx}{1\linewidth}{@{\extracolsep{\fill} }rrr|rrrr|rrrr}
\toprule
&&&\multicolumn{4}{c|}{\bfseries{Finite differencing}}&\multicolumn{4}{c}{\bfseries{Complex step}}\\
&&&\multicolumn{2}{c}{\pkg{numDeriv}}&\multicolumn{2}{c|}{\pkg{sparseHessianFD}}&\multicolumn{2}{c}{\pkg{numDeriv}}&\multicolumn{2}{c}{\pkg{sparseHessianFD}}\\
$N$&$k$&$M$&mean&sd&mean&sd&mean&sd&mean&sd\\
%%\midrule
<<echo=FALSE, results="asis">>=
arrange(tab4, M) %>%
  xtable(digits=c(rep(0,4),rep(1,8))) %>%
  print(only.contents=TRUE, comment=FALSE, booktabs=TRUE,
        include.rownames=FALSE, include.colnames=FALSE)
@
\end{tabularx}
\caption{Computation times (milliseconds) for computing Hessians using the \pkg{numDeriv} and \pkg{sparseHessianFD} packages, and the finite difference and complex step methods, across 500 replications. Rows are ordered by the number of variables.}\label{tab:numDeriv}
\end{table}

To help us understand just how scalable \pkg{sparseHessianFD} is, we ran another set of simulations, for the same hierarchical model, for different values of $N$ and $k$. We then recorded the run times for different steps in the sparse Hessian estimation, across 200 replications. The steps are summarized in Table~\ref{tab:timeMeasures}. The times were generated on an Apple Mac Pro with a 12-core Intel Xeon E5-2697 processor (2.7 GHz) with 64 GB of RAM.
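For a single combination of $N$ and $k$, timings of this kind can be collected with the \pkg{microbenchmark} package. The following sketch (not evaluated here) reuses the \variable{obj}, \variable{P} and \variable{pattern} objects from Section~\ref{sec:exHess}; because the partitioning step is timed inside the initializer, the sketch reports only the total initialization time.

<<eval=FALSE>>=
## Sketch: per-step timings for one problem size, using the objects
## created in the example section.
library("microbenchmark")
microbenchmark(
  fn = obj$fn(P),
  gr = obj$gr(P),
  hessian = obj$hessian(P),
  init = sparseHessianFD(P, fn=binary.f, gr=binary.grad,
                         rows=pattern[["rows"]], cols=pattern[["cols"]],
                         data=binary, priors=priors, order.row=order.row),
  times = 200)
@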
\begin{table}[ptb]\centering
\begin{tabularx}{1.0\linewidth}{>{\bfseries}l X}
\toprule
Measure&Description\\
\midrule
Function&evaluating the objective function\\
Gradient&evaluating the gradient\\
Hessian&computing the Hessian (not including initialization or partitioning time)\\
Partitioning&finding a consistent partitioning of the variables (the vertex coloring problem)\\
Initialization&total setup time (including the partitioning time)\\
\bottomrule
\end{tabularx}
\caption{Summary of timing tests (see Figure~\ref{fig:timing}).}\label{tab:timeMeasures}
\end{table}

In the plots in Figure~\ref{fig:timing}, the number of heterogeneous units ($N$) is on the x-axis, and median run time, in milliseconds, is on the y-axis. Each panel shows the relationship between $N$ and run time for a different step in the algorithm, and each curve in a panel represents a different number of within-unit parameters ($k$).

\begin{figure}[tbp]
\centering
\includegraphics{fig_timings}
\caption{Run times for sparse Hessian computation.}
\label{fig:timing}
\end{figure}

Computation times for the function and gradient, as well as the setup and partitioning times for the \class{sparseHessianFD} object, grow linearly with the number of heterogeneous units. The time for the Hessian grows linearly as well, and that might be somewhat surprising. We saw in Section~\ref{sec:coloring} that adding heterogeneous units in a hierarchical model does not increase the number of required gradient evaluations. So we might think that the time to compute a Hessian should not increase with $N$ \emph{at all}. The reason it does is that each gradient evaluation takes longer. Nevertheless, we can conclude that the \pkg{sparseHessianFD} algorithms are quite efficient and scalable for hierarchical models.

\FloatBarrier

\bibliography{sparseHessianFD}

\end{document}