| Title: | Estimators and Algorithms for Heavy-Tailed Distributions |
| Version: | 0.1.1 |
| Description: | Implements the estimators and algorithms described in Chapters 8 and 9 of the book "The Fundamentals of Heavy Tails: Properties, Emergence, and Estimation" by Nair et al. (2022, ISBN:9781009053730). These include the Hill estimator, Moments estimator, Pickands estimator, Peaks-over-Threshold (POT) method, Power-law fit, and the Double Bootstrap algorithm. |
| License: | MIT + file LICENSE |
| URL: | https://github.com/0diraf/heavytails |
| Depends: | R (≥ 3.5.0) |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| Imports: | stats |
| Suggests: | testthat (≥ 3.0.0) |
| Config/testthat/edition: | 3 |
| NeedsCompilation: | no |
| Packaged: | 2025-11-25 17:54:07 UTC; diraF |
| Author: | Farid Rohan [aut, cre] |
| Maintainer: | Farid Rohan <frohan@ur.rochester.edu> |
| Repository: | CRAN |
| Date/Publication: | 2025-12-01 14:30:02 UTC |
Double Bootstrap algorithm
Description
This function implements the Double Bootstrap algorithm described in Chapter 9 of Nair et al. It applies bootstrapping to two samples of different sizes to choose the value of k that minimizes the mean squared error.
Usage
doublebootstrap(
data,
n1 = -1,
n2 = -1,
r = 50,
k_max_prop = 0.5,
kvalues = 20,
na.rm = FALSE
)
Arguments
data |
A numeric vector of i.i.d. observations. |
n1 |
A numeric scalar specifying the first bootstrap sample size, denoted n_1 by Nair et al. |
n2 |
A numeric scalar specifying the second bootstrap sample size, denoted n_2 by Nair et al. |
r |
A numeric scalar specifying the number of bootstrap replications. |
k_max_prop |
A numeric scalar giving the maximum k as a proportion of the sample size. Considering every possible k can be computationally expensive; furthermore, low k values can be noisy, while high values can be biased. Hence k is limited to a proportion (by default 50%) of the data. |
kvalues |
An integer specifying the length of the sequence of candidate k values. |
na.rm |
Logical. If TRUE, NA values are removed before computation; defaults to FALSE. |
Details
Chapter 9 of Nair et al. specifically describes the Double Bootstrap algorithm for the Hill estimator.
The Hill Double Bootstrap method uses the Hill estimator as the first estimator:
\hat{\xi}_{n,k}^{(1)} := \frac{1}{k}\sum_{i=1}^{k}\log\left(\frac{X_{(i)}}{X_{(k+1)}}\right)
and a second moments-based estimator:
\hat{\xi}_{n,k}^{(2)} := \frac{M_{n,k}}{2\hat{\xi}_{n,k}^{(1)}}
where
M_{n,k} := \frac{1}{k}\sum_{i=1}^{k}\left(\log\left(\frac{X_{(i)}}{X_{(k+1)}}\right)\right)^2
The absolute difference between the two estimators is:
|\hat{\xi}_{n,k}^{(1)} - \hat{\xi}_{n,k}^{(2)}| = \frac{|M_{n,k} - 2(\hat{\xi}_{n,k}^{(1)})^{2}|}{2|\hat{\xi}_{n,k}^{(1)}|}
The method selects \hat{\kappa} so as to minimize the mean squared error of the numerator over r bootstrap samples, first of size n_1 and then of size n_2:
\hat{\kappa}_{1}^{*} := \text{arg min}_{k} \frac{1}{r} \sum_{j=1}^{r} (M_{n_1,k}(j) - 2(\hat{\xi}_{n_1,k}^{(1)}(j))^2)^2
This process is repeated with bootstrap samples of size n_2 to determine \hat{\kappa}_{2}^{*}. The final \hat{\kappa} is given by:
\hat{\kappa}^{*} = \frac{(\hat{\kappa}_{1}^{*})^2}{\hat{\kappa}_{2}^{*}} (\frac{\log \hat{\kappa}_{1}^{*}}{2\log n_1 - \log \hat{\kappa}_{1}^{*}})^{\frac{2(\log n_1 - \log \hat{\kappa}_{1}^{*})}{\log n_1}}
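As a rough illustration of the first stage, the following sketch computes \hat{\kappa}_{1}^{*} for a single bootstrap size n1. The helper name amse_argmin_sketch is hypothetical and only mirrors the roles of the r, k_max_prop, and kvalues arguments; it is not the package's internal code.
# Sketch: first-stage selection hat{kappa}_1^* for one bootstrap size n1
amse_argmin_sketch <- function(data, n1, r = 50, k_max_prop = 0.5, kvalues = 20) {
  ks <- unique(round(seq(2, floor(k_max_prop * n1), length.out = kvalues)))
  crit <- sapply(ks, function(k) {
    mean(replicate(r, {
      x <- sort(sample(data, n1, replace = TRUE), decreasing = TRUE)
      logs <- log(x[1:k] / x[k + 1])  # log-spacings of the top k order statistics
      m <- mean(logs^2)               # M_{n1,k}(j)
      xi <- mean(logs)                # Hill estimate hat{xi}^{(1)}_{n1,k}(j)
      (m - 2 * xi^2)^2                # squared numerator of the difference
    }))
  })
  ks[which.min(crit)]                 # hat{kappa}_1^*
}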
Value
A named list containing the final results of the Double Bootstrap algorithm:
k: The optimal number of top-order statistics \hat{k} selected by minimizing the MSE.
alpha: The estimated tail index \hat{\alpha} (Hill estimator) corresponding to \hat{k}.
References
Danielsson, J., de Haan, L., Peng, L., & de Vries, C. G. (2001). Using a bootstrap method to choose the sample fraction in tail index estimation. Journal of Multivariate Analysis, 76(2), 226–248. doi:10.1006/jmva.2000.1903
Nair, J., Wierman, A., & Zwart, B. (2022). The Fundamentals of Heavy Tails: Properties, Emergence, and Estimation. Cambridge University Press. (pp. 229-233) doi:10.1017/9781009053730
Examples
# Simulate Pareto(xmin = 1, alpha = 2) data via inverse-transform sampling
xmin <- 1
alpha <- 2
r <- runif(800, 0, 1)
x <- xmin * r^(-1 / alpha)
db_kalpha <- doublebootstrap(data = x, n1 = -1, n2 = -1, r = 5, k_max_prop = 0.5, kvalues = 20)
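The components documented under Value can be read off the returned list:
db_kalpha$k      # optimal number of top-order statistics
db_kalpha$alpha  # Hill tail-index estimate at that k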
Negative Log-Likelihood of the Generalized Pareto Distribution
Description
Helper function for pot_estimator(). Computes the negative log-likelihood of the Generalized Pareto Distribution (GPD) at the supplied \xi and \beta.
Usage
gpd_lg_likelihood(params, data)
Arguments
params |
Vector containing initial values of \xi and \beta, in that order. |
data |
A numeric vector of excesses over the threshold. |
Details
l(\xi,\beta)=-n\log(\beta) - (\frac{1}{\xi} + 1)\sum_{i=1}^{n} \log(1 + \xi \frac{x_i}{\beta})
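A direct transcription of this formula, returning its negative as the helper does, might look like the sketch below. The name neg_gpd_loglik, the parameter guards, and the tolerance for the \xi = 0 (exponential) branch are illustrative assumptions, not the package internals.
# Sketch: negative GPD log-likelihood, assuming params = c(xi, beta)
neg_gpd_loglik <- function(params, data) {
  xi <- params[1]
  beta <- params[2]
  if (beta <= 0) return(Inf)              # invalid scale parameter
  if (abs(xi) < 1e-12) {                  # xi ~ 0: exponential limit
    return(length(data) * log(beta) + sum(data) / beta)
  }
  z <- 1 + xi * data / beta
  if (any(z <= 0)) return(Inf)            # outside the GPD support
  length(data) * log(beta) + (1 / xi + 1) * sum(log(z))
}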
Value
Negative log-likelihood of the GPD.
Examples
# Heavy-tailed Weibull sample (shape < 1)
x <- rweibull(n = 2000, shape = 0.8, scale = 1)
u <- 2  # threshold
y <- x[x > u] - u  # excesses over the threshold
log_lik <- gpd_lg_likelihood(params = c(xi = 0.1, beta = 2), data = y)
Hill Estimator
Description
Hill estimator used to calculate the tail index \alpha of the input data.
Usage
hill_estimator(data, k, na.rm = FALSE)
Arguments
data |
A numeric vector of i.i.d. observations. |
k |
An integer specifying the number of top order statistics to use (the size of the tail). Must be strictly less than the sample size. |
na.rm |
Logical. If TRUE, NA values are removed before computation; defaults to FALSE. |
Details
\hat{\alpha}_H = \frac{1}{\frac{1}{k} \sum_{i=1}^{k} \log\left(\frac{X_{(i)}}{X_{(k)}}\right)}
where X_{(1)} \ge X_{(2)} \ge \dots \ge X_{(n)} are the order statistics
of the data (descending order).
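Read literally, the formula admits a very short R sketch (illustrative only, not the package internals):
# Sketch of the Hill estimator following the formula above
hill_sketch <- function(data, k) {
  x <- sort(data, decreasing = TRUE)  # descending order statistics
  1 / mean(log(x[1:k] / x[k]))        # reciprocal mean log-spacing
}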
Value
A single numeric scalar: Hill estimator calculation of the tail index \alpha.
References
Hill, B. M. (1975). A Simple General Approach to Inference About the Tail of a Distribution. The Annals of Statistics, 3(5), 1163–1174. http://www.jstor.org/stable/2958370
Nair, J., Wierman, A., & Zwart, B. (2022). The Fundamentals of Heavy Tails: Properties, Emergence, and Estimation. Cambridge University Press. (pp. 203-205) doi:10.1017/9781009053730
Examples
# Simulate Pareto(xmin = 1, alpha = 2) data via inverse-transform sampling
xmin <- 1
alpha <- 2
r <- runif(800, 0, 1)
x <- xmin * r^(-1 / alpha)
hill <- hill_estimator(data = x, k = 5)
Moments Estimator
Description
Moments estimator to calculate \xi for the input data.
Usage
moments_estimator(data, k, na.rm = FALSE, eps = 1e-12)
Arguments
data |
A numeric vector of i.i.d. observations. |
k |
An integer specifying the number of top order statistics to use (the size of the tail). Must be strictly less than the sample size. |
na.rm |
Logical. If TRUE, NA values are removed before computation; defaults to FALSE. |
eps |
A numeric scalar added to the denominator to avoid division by zero. Defaults to 1e-12. |
Details
\hat{\xi}_{ME} = \underbrace{\hat{\xi}_{k,n}^{H,1}}_{T_1} + \underbrace{1 - \frac{1}{2}\left(1-\frac{(\hat{\xi}_{k,n}^{H,1})^2}{\hat{\xi}_{k,n}^{H,2}}\right)^{-1}}_{T_2}
where \hat{\xi}_{k,n}^{H,1} is the Hill estimate of \xi and \hat{\xi}_{k,n}^{H,2} is its second-moment analogue, the mean of the squared log-spacings of the top k order statistics.
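Under that reading, a minimal sketch is given below; the exact point at which eps enters the denominator is an assumption, and the helper name is hypothetical.
# Sketch of the moments estimator; eps guards the division
moments_sketch <- function(data, k, eps = 1e-12) {
  x <- sort(data, decreasing = TRUE)
  logs <- log(x[1:k] / x[k])
  m1 <- mean(logs)    # first Hill moment (T_1)
  m2 <- mean(logs^2)  # second Hill moment
  m1 + 1 - 0.5 / (1 - m1^2 / (m2 + eps))
}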
Value
A single numeric scalar: Moments estimator calculation of the shape parameter \xi.
References
Dekkers, A. L. M., Einmahl, J. H. J., & De Haan, L. (1989). A Moment Estimator for the Index of an Extreme-Value Distribution. The Annals of Statistics, 17(4), 1833–1855. http://www.jstor.org/stable/2241667
Nair, J., Wierman, A., & Zwart, B. (2022). The Fundamentals of Heavy Tails: Properties, Emergence, and Estimation. Cambridge University Press. (pp. 216-219) doi:10.1017/9781009053730
Examples
# Simulate Pareto(xmin = 1, alpha = 2) data via inverse-transform sampling
xmin <- 1
alpha <- 2
r <- runif(800, 0, 1)
x <- xmin * r^(-1 / alpha)
moments <- moments_estimator(data = x, k = 5)
Pickands Estimator
Description
Pickands estimator to calculate \xi for the input data.
Usage
pickands_estimator(data, k, na.rm = FALSE)
Arguments
data |
A numeric vector of i.i.d. observations. |
k |
An integer specifying the number of top order statistics to use (the size of the tail). Must be strictly less than the sample size. |
na.rm |
Logical. If TRUE, NA values are removed before computation; defaults to FALSE. |
Details
\hat{\xi}_{P} = \frac{1}{\log 2} \log ( \frac{X_{(k)} - X_{(2k)}}{X_{(2k)} - X_{(4k)}})
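The formula translates directly into R (illustrative sketch, not the package internals; it requires 4k \le n):
# Sketch of the Pickands estimator following the formula above
pickands_sketch <- function(data, k) {
  x <- sort(data, decreasing = TRUE)  # descending order statistics
  log((x[k] - x[2 * k]) / (x[2 * k] - x[4 * k])) / log(2)
}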
Value
A single numeric scalar: Pickands estimator calculation of the shape parameter \xi.
References
Pickands, J. (1975). Statistical Inference Using Extreme Order Statistics. The Annals of Statistics, 3(1), 119–131. http://www.jstor.org/stable/2958083
Nair, J., Wierman, A., & Zwart, B. (2022). The Fundamentals of Heavy Tails: Properties, Emergence, and Estimation. Cambridge University Press. (pp. 219-221) doi:10.1017/9781009053730
Examples
# Simulate Pareto(xmin = 1, alpha = 2) data via inverse-transform sampling
xmin <- 1
alpha <- 2
r <- runif(800, 0, 1)
x <- xmin * r^(-1 / alpha)
pickands <- pickands_estimator(data = x, k = 5)
Power-law fit (PLFIT) Algorithm
Description
This function implements the PLFIT algorithm described by Clauset et al. to determine the value of \hat{k}. It minimizes the Kolmogorov-Smirnov (KS) distance between the empirical cumulative distribution function and the fitted power law.
Usage
plfit(data, kmax = -1, kmin = 2, na.rm = FALSE)
Arguments
data |
A numeric vector of i.i.d. observations. |
kmax |
Maximum number of top-order statistics. If kmax = -1, then kmax = n - 1, where n is the length of the dataset. |
kmin |
Minimum number of top-order statistics to start with. |
na.rm |
Logical. If TRUE, NA values are removed before computation; defaults to FALSE. |
Details
D_{n,k} := \sup_{y \ge 1} |\frac{1}{k-1} \sum_{i=1}^{k-1} I (\frac{X_{(i)}}{X_{(k)}} > y) - y^{-\hat{\alpha}_{n,k}^H}|
The above equation, as described by Nair et al., is implemented in this function with an empirical CDF instead of the empirical survival function; the two are mathematically equivalent, since each is the complement of the other.
D_{n,k} := \sup_{y \ge 1} \left| \underbrace{\frac{1}{k-1} \sum_{i=1}^{k-1} I\left(\frac{X_{(i)}}{X_{(k)}} \le y\right)}_{\text{Empirical CDF}} - \underbrace{\left(1 - y^{-\hat{\alpha}_{n,k}}\right)}_{\text{Theoretical CDF}} \right|
\hat{k} = \text{argmin}_{k}\, D_{n,k}
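The scan over k can be sketched as follows. This is illustrative only: it reuses the Hill formula from this manual, uses stats::ecdf for the empirical CDF, and evaluates the supremum only at the sample points.
# Sketch of the PLFIT scan: minimize the KS distance over k
plfit_sketch <- function(data, kmin = 2, kmax = length(data) - 1) {
  x <- sort(data, decreasing = TRUE)
  ks <- kmin:kmax
  d <- sapply(ks, function(k) {
    y <- x[1:(k - 1)] / x[k]                 # rescaled top k - 1 observations
    alpha <- 1 / mean(log(x[1:k] / x[k]))    # Hill estimate at this k
    max(abs(ecdf(y)(y) - (1 - y^(-alpha))))  # KS distance at the sample points
  })
  list(k_hat = ks[which.min(d)], ks_distance = min(d))
}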
Value
A named list containing the results of the PLFIT algorithm:
k_hat: The optimal number of top-order statistics \hat{k}.
alpha_hat: The estimated power-law exponent \hat{\alpha} corresponding to \hat{k}.
xmin_hat: The minimum value x_{\min} = X_{(\hat{k})} above which the power law is fitted.
ks_distance: The minimum Kolmogorov-Smirnov distance D_{n,k} found.
References
Clauset, A., Shalizi, C. R., & Newman, M. E. (2009). Power-law distributions in empirical data. SIAM Review, 51(4), 661-703. doi:10.1137/070710111
Nair, J., Wierman, A., & Zwart, B. (2022). The Fundamentals of Heavy Tails: Properties, Emergence, and Estimation. Cambridge University Press. (pp. 227-229) doi:10.1017/9781009053730
Examples
# Simulate Pareto(xmin = 1, alpha = 2) data via inverse-transform sampling
xmin <- 1
alpha <- 2
r <- runif(800, 0, 1)
x <- xmin * r^(-1 / alpha)
plfit_values <- plfit(data = x, kmax = -1, kmin = 2)
Peaks-over-threshold (POT) Estimator
Description
This function chooses the \hat{\xi} and \hat{\beta} that minimize the negative log-likelihood of the Generalized Pareto Distribution (GPD).
Usage
pot_estimator(data, u, start_xi = 0.1, start_beta = NULL, na.rm = FALSE)
Arguments
data |
A numeric vector of i.i.d. observations. |
u |
A numeric scalar specifying the threshold above which excesses are computed. |
start_xi |
Initial value of \xi passed to the optimizer. |
start_beta |
Initial value of \beta passed to the optimizer; if NULL, a starting value is chosen internally. |
na.rm |
Logical. If TRUE, NA values are removed before computation; defaults to FALSE. |
Details
The PDF of an excess data point x_i is given by:
f(x_i;\xi, \beta) = \frac{1}{\beta} \left(1 + \xi \frac{x_i}{\beta}\right)^{-\left(\frac{1}{\xi} + 1\right)}
Taking the logarithm of the above equation gives:
l(x_i;\xi, \beta)=-\log(\beta) - (\frac{1}{\xi} + 1) \log(1 + \xi \frac{x_i}{\beta})
Summing over all n excess data points:
l(\xi,\beta)=\sum_{i=1}^{n} (-\log(\beta) - (\frac{1}{\xi} + 1) \log(1 + \xi \frac{x_i}{\beta}))
l(\xi,\beta)=-n\log(\beta) - (\frac{1}{\xi} + 1)\sum_{i=1}^{n} \log(1 + \xi \frac{x_i}{\beta})
We can thus minimize -l(\xi,\beta): the parameters \xi and \beta that minimize the negative log-likelihood are exactly those that maximize the log-likelihood. Hence, by fitting to the excesses, we determine the \xi and \beta that best describe the tail of the data.
There is also the case \xi = 0 to consider, which reduces the GPD to an exponential distribution. The total log-likelihood in that case is:
l(0, \beta) = -n \log(\beta) - \frac{1}{\beta} \sum_{i=1}^{n} x_i
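In practice the minimization can be carried out with stats::optim() on the negative log-likelihood, as in the sketch below; using mean(y) as the starting scale is an assumption, not the package default.
# Sketch: fit GPD parameters to threshold excesses via optim()
x <- rweibull(n = 800, shape = 0.8, scale = 1)
y <- x[x > 2] - 2                                # excesses over u = 2
fit <- optim(par = c(xi = 0.1, beta = mean(y)),  # starting values (mean(y) assumed)
             fn = gpd_lg_likelihood, data = y)
fit$par                                          # fitted (xi, beta)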
Value
An unnamed numeric vector of length 2 containing the estimated
Generalized Pareto Distribution (GPD) parameters that minimize the negative log likelihood: \xi (shape/tail index) and \beta (scale parameter).
References
Davison, A. C., & Smith, R. L. (1990). Models for exceedances over high thresholds. Journal of the Royal Statistical Society: Series B (Methodological), 52(3), 393-425. doi:10.1111/j.2517-6161.1990.tb01796.x
Balkema, A. A., & de Haan, L. (1974). Residual life time at great age. The Annals of Probability, 2(5), 792-804. doi:10.1214/aop/1176996548
Pickands, J. (1975). Statistical Inference Using Extreme Order Statistics. The Annals of Statistics, 3(1), 119–131. http://www.jstor.org/stable/2958083
Nair, J., Wierman, A., & Zwart, B. (2022). The Fundamentals of Heavy Tails: Properties, Emergence, and Estimation. Cambridge University Press. (pp. 221-226) doi:10.1017/9781009053730
Examples
# Heavy-tailed Weibull sample (shape < 1)
x <- rweibull(n = 800, shape = 0.8, scale = 1)
values <- pot_estimator(data = x, u = 2, start_xi = 0.1, start_beta = NULL)