 WData
WData is a set of tools for analyzing and modeling data that may be subject to sampling bias. It offers functions to estimate the density function and the cumulative distribution function from a biased sample of a continuous distribution. For density estimation, the package includes the Bhattacharyya et al. (1988) and Jones (1991) density estimators, together with several bandwidth selectors for the latter, so that the estimate can be adapted to different types of samples and biases. For cumulative distribution function estimation, the package includes the empirical estimator proposed by Cox (2005) and the kernel-type estimator by Bose and Dutta (2022), along with several bandwidth selectors for the latter. Finally, the package includes the real length-biased dataset on shrub width from Muttlak (1988) as an example dataset.
You can install the development version of WData from GitHub with:
# install.packages("devtools")
devtools::install_github("noeliasanchmrt/WData")
library(WData)
If you encounter a clear bug, please file an issue with a minimal reproducible example on GitHub. For questions and other discussion, please use forum.posit.co.
The species Cercocarpus montanus, commonly known as mountain mahogany, is a deciduous shrub native to the western United States and northern Mexico. It is typically found on slopes, canyons, and rocky, arid formations with calcareous or other alkaline soils. This small shrub has white flowers and oval-shaped leaves with serrated edges. It is highly drought-tolerant and can survive even in nutrient-poor soils. Its intricate root system prevents landslides on sloped terrains, while its branches provide food and habitat for various animal species. These characteristics make Cercocarpus montanus an ideal species for studying wildlife recovery in a given geographic region.
During the fall semester of 1986, graduate students in a biological sampling techniques course taught by Lyman L. McDonald at the University of Wyoming conducted a study on the size of Cercocarpus montanus in an old limestone quarry located just east of Laramie, Wyoming (United States).
The sampling was conducted using line transect methods, which are widely used in ecological studies to measure species abundance in a given area and other relevant parameters. These methods involve randomly placing parallel sampling lines (transects) across the study area. Researchers traverse these lines, recording variables of interest.
To establish the sampling lines, a baseline was set across the study region. Random positions were generated along this baseline following a uniform distribution, and each sampling line was drawn perpendicularly from these points. One limitation of this approach is the need for a large number of transects to cover the area adequately. An alternative method involves selecting a single random point along the baseline and setting a fixed distance between transects. More details on variations of this method can be found in Buckland et al. (2001).
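The two placement schemes described above can be sketched in base R. The baseline length and the number of transects are illustrative assumptions, not values taken from the study.

```r
set.seed(42)

baseline_length <- 200  # metres; hypothetical value for illustration
n_transects <- 4        # hypothetical number of transects

# Scheme 1: independent uniform positions along the baseline.
random_pos <- sort(runif(n_transects, min = 0, max = baseline_length))

# Scheme 2: a single uniform random start, then a fixed spacing between transects.
spacing <- baseline_length / n_transects
start <- runif(1, min = 0, max = spacing)
systematic_pos <- start + spacing * (seq_len(n_transects) - 1)

print(round(random_pos, 1))
print(round(systematic_pos, 1))
```

With the second scheme the transects are evenly spread by construction, whereas independent uniform positions can cluster and leave parts of the area uncovered unless many lines are drawn.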
The Laramie quarry was covered with north-south-oriented rock fissures. Since moisture levels and vegetation density were higher in these fissures, a baseline parallel to them was established. The transects were drawn perpendicular to this baseline, crossing the terrain fissures instead of running parallel to them. A distance of 41.6 meters was set between transects, and two independent replicates (I and II) were obtained, each with three equidistant parallel transects. In total, six transects were surveyed from the baseline to the eastern boundary of the quarry. Students traversed the transects, identifying Cercocarpus montanus shrubs intersected by the line. For each shrub, its maximum height, the number of main branches, and its width (the maximum distance between two parallel tangent lines to the shrub’s contour along the transect) were measured.
Since Cercocarpus montanus is a rhizomatous species and adjacent shrubs may be interconnected via their root system, a shrub was defined as an individual if it had a distinct cluster of stems at the base and was at least 15 centimeters away from its nearest neighbor. For shrub clusters, the length of their intersection with the transect was recorded. More details on the sampling procedure and additional measurements can be found in Muttlak (1988).
Due to the sampling method, wider shrubs had a higher probability of being intersected by the transects. Consequently, the recorded shrub widths form a length-biased sample: the probability of recording a shrub is proportional to its width, so the bias function is \(w(x) = x\). The height and branch-count measurements are also biased, but their bias function \(w\) is more complex because it depends on the relationship between shrub width and each of these variables.
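The effect of the bias function \(w(x) = x\) can be illustrated with a small simulation. This is a minimal sketch on a hypothetical Gamma population, not on the shrub data: units are drawn with probability proportional to their size, the naive sample mean overshoots, and the harmonic mean of the biased sample corrects it.

```r
set.seed(123)

# Hypothetical population of widths (the Gamma choice is only illustrative).
population <- rgamma(1e5, shape = 3, rate = 3)

# Length-biased draw: selection probability proportional to width, w(x) = x.
biased <- sample(population, size = 2000, replace = TRUE, prob = population)

# The naive mean of the biased sample overestimates the population mean.
naive_mean <- mean(biased)

# The harmonic mean undoes the bias w(x) = x.
corrected_mean <- length(biased) / sum(1 / biased)

print(c(population = mean(population), naive = naive_mean, corrected = corrected_mean))
```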
summary(shrub.data)
summary(shrub.data$Width)
df.bhatta(): Bhattacharyya et al. (1988) density estimator
library(WData)
par(mfrow = c(1, 3))
bhatta <- df.bhatta(shrub.data$Width, bw = "nrd0", kernel = "gaussian", from = -0.4, to = 3)
bw_ucv <- bw.ucv(shrub.data$Width, lower = 0.15, upper = 0.3)
bhatta <- df.bhatta(shrub.data$Width, bw = bw_ucv, kernel = "gaussian", from = -0.4, to = 3)
bhatta <- df.bhatta(shrub.data$Width, bw = "SJ-ste", kernel = "gaussian", from = -0.3, to = 3) 
Bhattacharyya et al. (1988) density estimator for shrub width.
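For intuition, the Bhattacharyya et al. (1988) estimator is usually described as an ordinary kernel density estimate \(\hat g\) of the biased sample rescaled by \(\hat\mu / y\), where \(\hat\mu = n / \sum_i Y_i^{-1}\). The following is a minimal base-R sketch under that assumption, using simulated data and `stats::density()`. It is not the code behind `df.bhatta()`.

```r
set.seed(1)

# Simulated length-biased sample: if X ~ Gamma(2, 2), the length-biased
# version of X is Gamma(3, 2). (Illustrative data, not shrub.data.)
y <- rgamma(500, shape = 3, rate = 2)

# Harmonic-mean estimate of the mean of the unbiased distribution.
mu_hat <- length(y) / sum(1 / y)

# Ordinary kernel density estimate of the biased density, on a grid away from 0.
g_hat <- density(y, bw = "nrd0", kernel = "gaussian", from = 0.05, to = 5, n = 256)

# Rescale by mu_hat / y to undo the bias w(y) = y.
f_hat <- mu_hat * g_hat$y / g_hat$x

print(mu_hat)
print(summary(f_hat))
```

Because of the division by \(y\), the rescaled estimate is unstable close to zero, which is why the evaluation grid in this sketch starts slightly above zero.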
df.jones(): Jones (1991) density estimator
The function allows different bandwidth selection methods:
- "bw.f.BGM.rt": Normal reference rule-of-thumb selector.
- "bw.f.BGM.cv": Cross-validation-based selector.
- "bw.f.BGM.boot1": Bootstrap-based selector (method 1).
- "bw.f.BGM.boot2": Bootstrap-based selector (method 2).
par(mfrow = c(2, 3))
jones <- df.jones(shrub.data$Width, kernel = "gaussian", bw = "bw.f.BGM.rt", from = -0.4, to = 3)
#> Interval for Estimation: [-0.400000, 3.000000]
jones <- df.jones(shrub.data$Width, kernel = "gaussian", bw = "bw.f.BGM.cv", lower = 0.01, upper = 0.5, nh = 100L, from = -0.4, to = 3)
#> Interval for Estimation: [-0.400000, 3.000000]
#> Interval where bandwidth is searched: [0.010000, 0.500000]
jones <- df.jones(shrub.data$Width, kernel = "gaussian", bw = "bw.f.BGM.boot1", from = -0.4, to = 3)
#> Interval for Estimation: [-0.400000, 3.000000]
#> Pilot Bandwidth for Bootstrap: 0.293207
jones <- df.jones(shrub.data$Width, kernel = "gaussian", bw = "bw.f.BGM.boot1", bw0 = "PI", from = -0.4, to = 3)
#> Interval for Estimation: [-0.400000, 3.000000]
#> Pilot Bandwidth for Bootstrap: 0.267953
bw_boot2 <- bw.f.BGM.boot2(y = shrub.data$Width, from = 0.001, to = 3, nh = 100L, plot = FALSE)
#> Interval where bandwidth is searched: [0.000161, 217.341159]
#> Interval where density is evaluated: [0.001000, 3.000000]
#> Pilot Bandwidth for Bootstrap: 0.075912
jones <- df.jones(shrub.data$Width, kernel = "gaussian", bw = bw_boot2, from = -0.4, to = 3)
#> Interval for Estimation: [-0.400000, 3.000000] 
Jones (1991) density estimator for shrub width.
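For intuition, the Jones (1991) estimator is usually described as a kernel estimate in which observation \(Y_i\) receives weight \(\hat\mu / (n Y_i)\), with \(\hat\mu = n / \sum_i Y_i^{-1}\). Under that assumption it can be sketched with the `weights` argument of `stats::density()`. The data are simulated and the bandwidth is a plain rule of thumb on the raw sample, not one of the `bw.f.BGM.*` selectors, so this is an illustration rather than the code behind `df.jones()`.

```r
set.seed(2)

# Simulated length-biased sample (illustrative data, not shrub.data).
y <- rgamma(500, shape = 3, rate = 2)

# Weights proportional to 1 / Y_i; after normalising they equal mu_hat / (n * Y_i).
w <- (1 / y) / sum(1 / y)

# Fixed numeric bandwidth, chosen here only for illustration.
h <- bw.nrd0(y)

# Weighted kernel density estimate: sum_i w_i * K_h(y - Y_i).
f_hat <- density(y, bw = h, kernel = "gaussian", weights = w)

# Riemann-sum check that the estimate integrates to roughly one.
area <- sum(f_hat$y) * diff(f_hat$x[1:2])
print(area)
```

Unlike dividing a kernel estimate by \(y\), this weighted form is a proper density by construction, since the weights are non-negative and sum to one.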
cdf.cox(): Cox (2005) distribution estimator
par(mfrow = c(1, 1))
plot(cdf.cox(shrub.data$Width), xlab = "", ylab = "", main = "", col = "blue", xlim = c(0, 3))
rug(shrub.data$Width) 
Cox (2005) distribution estimator for shrub width.
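For length-biased data, the empirical estimator is usually described as a weighted empirical distribution function in which each observation has weight proportional to \(1 / Y_i\). A minimal base-R sketch under that assumption follows. The function name `cox_cdf_sketch` is hypothetical, the data are simulated, and this is not the code behind `cdf.cox()`.

```r
set.seed(3)

# Simulated length-biased sample: the unbiased distribution is Gamma(2, 2).
y <- rgamma(500, shape = 3, rate = 2)

# Normalised weights proportional to 1 / Y_i.
w <- (1 / y) / sum(1 / y)

# Weighted empirical CDF: total weight of the observations not exceeding t.
cox_cdf_sketch <- function(t) {
  vapply(t, function(s) sum(w[y <= s]), numeric(1))
}

grid <- seq(0, 5, by = 0.5)
F_hat <- cox_cdf_sketch(grid)
F_true <- pgamma(grid, shape = 2, rate = 2)
print(round(cbind(t = grid, F_hat = F_hat, F_true = F_true), 3))
```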
bw.F.BD(): Bose and Dutta (2022) local bandwidth selector
par(mfrow = c(2, 2))
bd <- cdf.bd(shrub.data$Width, correction = "left", from = 0, to = 3, bw = "bw.F.BD", cy.seq = rep(0.25, 512))
#> Interval for Estimation: [0.000000, 3.000000]
bd <- cdf.bd(shrub.data$Width, correction = "left", from = 0, to = 3, bw = "bw.F.BD", cy.seq = rep(0.5, 512))
#> Interval for Estimation: [0.000000, 3.000000]
bd <- cdf.bd(shrub.data$Width, correction = "left", from = 0, to = 3, bw = "bw.F.BD", cy.seq = rep(1.3, 512))
#> Interval for Estimation: [0.000000, 3.000000]
grid <- seq(from = 0, to = 3, length.out = 512)
cy.seq <- ifelse(grid <= quantile(shrub.data$Width, 0.05) |
  grid >= quantile(shrub.data$Width, 0.95), 0.5, 1.3)
bd <- cdf.bd(shrub.data$Width, correction = "left", from = 0, to = 3, bw = "bw.F.BD", cy.seq = cy.seq)
#> Interval for Estimation: [0.000000, 3.000000] 
Bose and Dutta (2022) distribution estimator for shrub width using local bandwidth selector.
bw.F.SBC.rt(), bw.F.SBC.cv() and bw.F.SBC.pi(): Global bandwidth selectors
par(mfrow = c(1, 3))
bd <- cdf.bd(shrub.data$Width, from = 0, to = 3, correction = "left")
#> Interval for Estimation: [0.000000, 3.000000]
bw_cv <- bw.F.SBC.cv(shrub.data$Width, lower = 0.05, upper = 0.2, nh = 100, plot = FALSE)
#> Interval where bandwidth is searched: [0.050000, 0.200000]
bd <- cdf.bd(shrub.data$Width, correction = "left", bw = bw_cv)
#> Interval for Estimation: [0.030000, 3.130000]
bd <- cdf.bd(shrub.data$Width, from = 0, to = 3, correction = "left", bw = "bw.F.SBC.pi")
#> Interval for Estimation: [0.000000, 3.000000]
#> Pilot Bandwidth: 0.214138 
Bose and Dutta (2022) distribution estimator for shrub width using global bandwidths.
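To see what a global bandwidth does in a kernel-type distribution estimator, one can smooth the weighted empirical distribution function by replacing each jump with an integrated kernel. The sketch below assumes that general form with a Gaussian kernel and a hand-picked bandwidth. The function name `kernel_cdf_sketch` is hypothetical, the data are simulated, and the exact estimator in Bose and Dutta (2022), including the boundary handling behind `correction = "left"` in `cdf.bd()`, is not reproduced here.

```r
set.seed(4)

# Simulated length-biased sample (illustrative data, not shrub.data).
y <- rgamma(500, shape = 3, rate = 2)

# Normalised weights proportional to 1 / Y_i.
w <- (1 / y) / sum(1 / y)

# Illustrative global bandwidth, not the output of any selector.
h <- 0.2

# Smoothed weighted CDF: each observation contributes w_i * pnorm((t - Y_i) / h).
kernel_cdf_sketch <- function(t) {
  vapply(t, function(s) sum(w * pnorm((s - y) / h)), numeric(1))
}

grid <- seq(0, 5, by = 0.5)
F_smooth <- kernel_cdf_sketch(grid)
print(round(cbind(t = grid, F_smooth = F_smooth), 3))
```

A larger `h` gives a smoother curve but moves more probability mass below zero, which is one reason a boundary correction matters for positive data such as shrub widths.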