--- title: "Knot Selection for Smooth Profiles of Visual Meteor Showers" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true fig_width: 6 fig_height: 4 vignette: > %\VignetteIndexEntry{Knot Selection for Smooth Profiles of Visual Meteor Showers} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Setup Load the package before running any of the code below: ```{r setup} library(vismeteor) ``` ## Introduction The activity of a visual meteor shower is a smooth, a priori unknown function of solar longitude. The classical evaluation summarises this by computing the Zenithal Hourly Rate (ZHR) per observation interval and plotting it as a noisy stepped curve. Single-interval ZHRs have high variance — Poisson noise plus observer-to-observer scatter — and the underlying activity profile is barely visible. A spline-based fit across solar longitude collapses this noise into a smooth, bias-controlled curve, provided that the spline's complexity (number and location of interior knots) is chosen wisely. `select_knots()` automates that choice: given a candidate grid of possible interior knots and a user-supplied score function (BIC, AIC, or a cross-validation criterion), it returns a small subset that minimises the score. Conceptually, the supplied `knot_candidates` form a *grid of intervals* over the solar longitude: between two adjacent candidates lies an interval in which the spline can bend independently. `select_knots()` decides which interval boundaries are actually required — i.e. which positions carry enough curvature to justify a knot under the score. The natural resolution of the input grid in visual meteor work is the observation granularity itself: typical VMDB observation intervals are 10–15 minutes long, corresponding to roughly 0.007–0.010 degrees of solar longitude. A fine candidate grid at that resolution is the practice recommendation; `select_knots()` will then reduce it to a handful of statistically significant knots. ## When to use `select_knots()` The point of `select_knots()` is to keep the knot positions *explicit and few*: a handful of values you can plot, label, and reason about, instead of a dense anonymous knot grid hidden inside a penalised smoother. The function is model-agnostic — your `score_fun` picks the model family, offset, covariates, and the criterion (BIC, AIC, or a cross-validation score) you optimise. This function is the right tool when - you want a smooth curve through noisy count data, - knot positions should remain interpretable, - the model (family, offset, covariates) is too specific for a generic GAM smoother. Prefer alternatives when - you want a continuous roughness penalty over a dense knot grid: `mgcv::s(bs = "ps")`; - you want adaptive smoothing (smoothing parameter varying over the predictor): `mgcv::s(bs = "ad")`; - you want a hinge-function basis with its own GCV-based pruning: `earth::earth`; - you want a guaranteed global optimum: `select_knots()` is greedy and only finds a local optimum (see `?select_knots`). **Important clarification.** `select_knots()` chooses knots for a *good fit*, not for detecting extrema. Knots typically land *next to* maxima or minima — where the curvature is high — and not *on* them. Local extrema of the activity profile become visible from the *shape* of the fitted curve, not from the list of selected knots. The implementation is generic, but its motivation comes directly from the visual-meteor use case: activity profiles are the natural setting in which knot positions can be *physically interpreted* — main maximum, sub-maxima, shower onset and end. The same interpretive value carries over to other meteor-related smooth fits — for instance the apparent magnitude distribution of a shower. Alternative packages (`mgcv`, `earth`) replace explicit knots with continuous penalty mechanisms, which gives up that interpretation. ## Data preparation We use the package's example data set `PER_2015_rates` (visual rate observations of the Perseids 2015; see `?PER_2015_rates` and `?load_vmdb`). The raw filtering follows standard practice: ```{r, echo=TRUE} data(PER_2015_rates) obs <- PER_2015_rates$observations |> subset( rad_alt > 15 & moon_alt < 0 & sun_alt < -5 & sl_start >= 139.2 & sl_end <= 140.6 ) deg2rad <- \(x) x * pi / 180 obs$sl <- (obs$sl_start + obs$sl_end) / 2 obs$t <- with(obs, t_eff * sin(deg2rad(rad_alt)) / f) rates <- data.frame( sl = obs$sl, t = obs$t, limmag = obs$lim_magn, freq = obs$freq ) rates <- rates[order(rates$sl), ] nrow(rates) ``` What each step does: - **`rad_alt > 15`** — when the radiant is low above the horizon the geometric correction `sin(rad_alt)` becomes numerically fragile and atmospheric extinction dominates; observations below 15 degrees are discarded by convention. - **`moon_alt < 0`** — the Moon above the horizon brightens the sky and reduces the effective limiting magnitude in a way the simple `limmag` correction does not capture; these observations are excluded rather than corrected. - **`sun_alt < -5`** — nautical or astronomical twilight; before that the sky is too bright. - **`sl_start >= 139.2 & sl_end <= 140.6`** — *only for this vignette*: a tight window centred on the Perseid main peak, chosen narrow enough to keep the vignette build fast while still containing the peak (≈ 140.0 deg) and visible curvature on both sides. In a real analysis you would use the full activity range (roughly 107–155 deg for the Perseids). - **`sl = (sl_start + sl_end) / 2`** — VMDB observations are intervals; we summarise each by its midpoint, the standard choice for the regressor. - **`t = t_eff * sin(rad_alt) / f`** is the central correction. It rescales the raw observing time to what the observer would have spent under ideal sky conditions *at their own limiting magnitude*: radiant in the zenith (`sin(rad_alt)`) and a clear sky (`f`). `t_eff` is the effective observing time after deducting breaks; `f` is the cloud-cover correction factor (`>= 1`, larger when more sky is obstructed). The correction from each observer's own limiting magnitude to the ZHR reference 6.5 is **not** built into `t`; the GLM below picks it up via the `limmag` covariate. We will call `t` simply the **observing time** from now on — not to be confused with `t_eff`, the raw effective observing time after breaks; `t` is `t_eff` after the zenith (`sin(rad_alt)`) and cloud-cover (`/ f`) corrections. ## Model: Poisson GLM with a spline of solar longitude ```{r, echo=TRUE} fit_glm <- \(rates, knots) { f <- if (length(knots) == 0L) { freq ~ splines::ns(sl) + limmag + offset(log(t)) } else { freq ~ splines::ns(sl, knots = knots) + limmag + offset(log(t)) } stats::glm(f, data = rates, family = stats::poisson()) } score_bic <- \(rates, knots) stats::BIC(fit_glm(rates, knots)) ``` `offset(log(t))` makes the linear predictor a log-rate per observing hour at the observer's own limiting magnitude. The `limmag` term then rescales that rate later to the ZHR reference (`limmag = 6.5`): `exp(coef(limmag))` is the data-driven population index `r` of the shower under this model. The spline term `splines::ns()` is a natural cubic spline, well-behaved at the boundaries. A practical note on the offset: `offset()` inside a formula is a formula-special and must stay unqualified — both `stats::offset(...)` in the formula and passing the term via `glm()`'s `offset =` argument break `predict()` on `newdata`. ## Knot candidates A regular grid is the simplest choice but ignores the actual observation density. In practice the candidate grid is built from the data themselves via `freq_quantile()` (see `?freq_quantile`): partition the observations, ordered by solar longitude, into bins each containing at least a fixed minimum number of meteors, then use the meteor-weighted solar-longitude midpoint of each bin as a candidate. This produces a fine grid in observation-rich regions and a coarse grid where data is sparse — exactly the resolution `select_knots()` benefits from: ```{r, echo=TRUE} rates$q <- vismeteor::freq_quantile(rates$freq, 12) # at least 12 meteors per bin rates$qsl <- with( rates, ave(freq * sl, q, FUN = sum) / ave(freq, q, FUN = sum) ) cand <- sort(unique(round(rates$qsl, 2))) length(cand) ``` The `round(..., 2)` step collapses bins whose centres are practically indistinguishable (resolution 0.01 deg). The minimum bin count (`12`) controls how dense the candidate grid is; a smaller value gives a finer grid (slower fit), a larger value a coarser one. ## Fixing knots The `fixed_knots` argument pins knot positions that must be present in every fitted model. In forward mode they are set from the start and the search adds further knots on top; in backward mode they are never proposed for removal, neither singly nor by bulk removal. `fixed_knots` need not be a subset of `knot_candidates` — values supplied there are forced into the working set regardless. A fixed knot is a *constraint* on the search, not a finding from it: the criterion no longer has the freedom to reject it. The earlier warning that knots typically land *next to* extrema rather than on them still applies — forcing a knot exactly on, say, a presumed maximum overrides that pattern at the chosen position. Use it when there is a non-data reason a position must be in the model (a documented event time, an external anchor, a stable reference across analyses); see `?select_knots` for the full semantics. ## Forward selection The vignette runs serially (`n_cores = 1L`) because the search window is narrow. For a full Perseids activity range (≈ 107–155 deg) the per-round candidate scoring benefits from `n_cores = parallel::detectCores() - 1L`; see `?select_knots` for the reproducibility recipe with `mclapply()`. ```{r, echo=TRUE} res_fwd <- vismeteor::select_knots(rates, cand, score_bic) sort(c(res_fwd$knots, res_fwd$fixed_knots)) res_fwd$score ``` ```{r, echo=TRUE, fig.alt="BIC trajectory of the forward search over the number of interior knots, with the score optimum at the endpoint"} plot(res_fwd$history$n_knots, res_fwd$history$score, type = "b", pch = 19, col = "blue", xlab = "number of interior knots", ylab = "BIC", main = "Forward selection: BIC trajectory" ) points(tail(res_fwd$history$n_knots, 1L), res_fwd$score, pch = 19, col = "red", cex = 1.5 ) abline(h = res_fwd$score, lty = 2, col = "grey50") ``` The red point marks the score minimum — and with the default `n_steps = NULL` it is also the *endpoint* of the trajectory, because the search stops as soon as no further addition improves the score. To see what one more step would do, re-run the algorithm starting from the optimum with a positive `n_steps`: ```{r, echo=TRUE} res_fwd_plus1 <- vismeteor::select_knots(rates, cand, score_bic, fixed_knots = c( res_fwd$knots, res_fwd$fixed_knots ), n_steps = 1L ) res_fwd_plus1$score - res_fwd$score # always >= 0: how much worse one step makes it ``` `n_steps` is an *exploration* knob, not a model-selection knob: it forces the algorithm to take a fixed number of further iterations regardless of whether they improve the score, so that the immediate score landscape past the optimum becomes visible through `history`. A shallow rise tells you the BIC optimum is soft and that a slightly richer model would not cost much; a steep rise tells you the optimum is sharp and the parsimony is justified. The same recipe — re-call with `fixed_knots = c(prev$knots, prev$fixed_knots), n_steps = k` — extends to as many follow-up steps as you like; see `?select_knots`. ## Backward elimination ```{r, echo=TRUE} res_bwd <- vismeteor::select_knots(rates, cand, score_bic, backward = TRUE) sort(c(res_bwd$knots, res_bwd$fixed_knots)) res_bwd$score ``` ```{r, echo=TRUE, fig.alt="BIC trajectory of the backward search over the number of interior knots, with the score optimum at the endpoint"} plot(res_bwd$history$n_knots, res_bwd$history$score, type = "b", pch = 19, col = "blue", xlab = "number of interior knots", ylab = "BIC", main = "Backward elimination: BIC trajectory" ) points(tail(res_bwd$history$n_knots, 1L), res_bwd$score, pch = 19, col = "red", cex = 1.5 ) abline(h = res_bwd$score, lty = 2, col = "grey50") ``` The trajectory starts at the full candidate pool on the right and moves left as knots are removed; gaps along the x-axis correspond to bulk removal rounds (see `bulk_gap` in `?select_knots`). The red point again marks the score minimum. ## Forward vs. backward: when and why they differ Both searches walk the lattice of knot subsets greedily — each round commits to the *locally* best add or drop — but they start at opposite corners: - **Forward** begins with no interior knots, an underfit model, and adds the knot that reduces BIC the most. It stops as soon as no single addition improves the score. Tends to be parsimonious; can miss combinations of knots that are only useful *jointly* because the first knot of the pair, on its own, did not look improving. - **Backward** begins with *all* candidates as interior knots — an overfit, but flexible, model that already captures every local curvature — and removes the knot whose removal worsens BIC least. Tends to keep more knots; the danger of dropping a globally important knot just because its neighbours mask its individual contribution is real but less severe than forward's miss-by-combination problem. That is *why* the two paths can disagree: at any intermediate state the score of a candidate move depends on what is currently selected. Adding `k_i` to the empty set is a genuinely different operation from adding `k_i` to a set that already contains its neighbours, and the two searches encounter `k_i` in incompatible contexts. **Practical recommendation.** For activity-profile fitting we recommend `backward = TRUE` for the final analysis, even though the function's API default is `FALSE` (the function is generic and defaults to the cheaper-per-fit direction). The reason is qualitative, not performance: starting from the overfit end ensures that all real curvature is already present in the initial model, so the search only has to decide what is redundant — which is the safer mode against the forward "miss-by-combination" failure. Use `backward = FALSE` (forward) for a quick, sparse first sketch or when the full-candidate initial fit is itself too expensive (very large candidate pool, complex model family). In practice neither direction is uniformly faster: forward iterates many small fits one knot at a time, while backward starts with the single most expensive fit and then collapses fast via the bulk-removal mechanism (`bulk_gap = 4L` by default; see `?select_knots`). In any case, running both directions and keeping the lower-BIC result is a cheap robustness check; large disagreements (in either knot count or score) are a warning that the chosen criterion is on the flat part of its landscape and that the data alone may not justify a unique parsimonious model. The `n_steps` argument lets either search continue for a fixed number of iterations past its first local minimum, which is useful for inspecting how flat or sharp that minimum is. ## Final fit and ZHR curve ```{r, echo=TRUE} best <- if (res_fwd$score <= res_bwd$score) { c(res_fwd$knots, res_fwd$fixed_knots) } else { c(res_bwd$knots, res_bwd$fixed_knots) } fit <- fit_glm(rates, sort(best)) # Data-driven population index of the shower, with 95% Wald CI # (log-scale CI on the limmag coefficient, exponentiated). limmag_row <- summary(fit)$coefficients["limmag", ] r_hat <- exp(limmag_row["Estimate"]) r_ci <- exp(limmag_row["Estimate"] + c(-1, 1) * 1.96 * limmag_row["Std. Error"]) c( estimate = unname(r_hat), lower = unname(r_ci[1]), upper = unname(r_ci[2]) ) ``` The point estimate is the population index `r` implied by the rate-vs-limmag slope of this dataset; the CI width tells you how sharply that slope is pinned down by the observations (a tight band means a well-determined `r`, a wide band means the data leave it essentially free). It enters the ZHR prediction below. ```{r, echo=TRUE, fig.alt="Fitted ZHR curve over solar longitude with a 95% confidence band and selected knots marked on the axis"} sl_grid <- seq(139.2, 140.6, length.out = 400) pred_df <- data.frame(sl = sl_grid, limmag = 6.5, t = 1) link <- predict(fit, newdata = pred_df, type = "link", se.fit = TRUE) zhr <- exp(link$fit) zhr_lo <- exp(link$fit - 1.96 * link$se.fit) zhr_hi <- exp(link$fit + 1.96 * link$se.fit) plot(sl_grid, zhr, type = "n", ylim = range(c(zhr_lo, zhr_hi)), xlab = "solar longitude (deg)", ylab = "ZHR", main = "PER 2015 - fitted activity profile (95% CI)" ) polygon(c(sl_grid, rev(sl_grid)), c(zhr_lo, rev(zhr_hi)), col = adjustcolor("blue", alpha.f = 0.2), border = NA ) lines(sl_grid, zhr, col = "blue", lwd = 2) rug(best, col = "red") ``` The prediction is evaluated with `t = 1` (one ZHR-hour) and `limmag = 6.5` (the reference limiting magnitude), so the response is the ZHR per hour on the chosen reference scale. The red rug marks on the x-axis show the selected interior knots; note that they sit *next to* the activity peak rather than on top of it, as expected — that is where the curvature is highest. The shaded band is the pointwise 95 % confidence interval of the fit. We obtain it on the **link** (log) scale — where the GLM's standard errors are approximately Gaussian — and then transform back with `exp()`. The band reflects parameter uncertainty of the spline + linear predictor; it is *not* a prediction interval for individual observations, which would also have to account for Poisson dispersion of the counts themselves. ## Residual analysis with quantile bins Raw residuals of individual Poisson observations with small expectations are visually uninformative. A cleaner diagnostic groups the observations into bins of fixed minimum meteor count using `freq_quantile()` and compares the observed and expected totals per bin. ```{r, echo=TRUE, fig.alt="Binned observed-over-expected ratios with one-sigma error bars over solar longitude"} # Use a coarser binning here than for the candidate grid above # (>= 500 meteors per bin) so that bin uncertainties are small enough # to read systematic deviations from the diagnostic plot. rates$qres <- vismeteor::freq_quantile(rates$freq, 500) rates$pred <- predict(fit, type = "response") bins <- with(rates, { obs_n <- tapply(freq, qres, sum) pred_n <- tapply(pred, qres, sum) data.frame( sl = tapply(sl * freq, qres, sum) / obs_n, ratio = obs_n / pred_n, se = sqrt(obs_n) / pred_n ) }) nrow(bins) plot(bins$sl, bins$ratio, pch = 19, col = "blue", ylim = range(c( bins$ratio - 2 * bins$se, bins$ratio + 2 * bins$se )), xlab = "solar longitude (deg)", ylab = "observed / expected", main = "PER 2015 - binned residual ratios (+/- 1 sigma)" ) arrows(bins$sl, bins$ratio - bins$se, bins$sl, bins$ratio + bins$se, angle = 90, code = 3, length = 0.03, col = "blue" ) abline(h = 1, lty = 2, col = "grey50") ``` A well-fitting model scatters the bin ratios randomly around 1, with about two thirds of the one-sigma error bars crossing the dashed line. Systematic arches over solar longitude would indicate too few knots in a region; visible scatter with no trend is Poisson over-dispersion (not pursued here). The same `freq_quantile()` helper drives both the candidate grid above and the diagnostic here, but with different minimum bin sizes: a small minimum (12) for a fine candidate grid that `select_knots()` can prune from, and a larger minimum (500) for a small number of stable bins on which residual deviations are visually detectable. In the narrow vignette window the larger minimum yields only a handful of bins (see the `nrow(bins)` output); a full activity-range analysis would lower this minimum to keep enough bins for a visible trend. ## See also - `?select_knots` — the function reference. - `?freq_quantile` — quantile bins used here for the residual diagnostic. - `?PER_2015_rates` — the example data set. - `?load_vmdb` — importing rate and magnitude observations from the VMDB API.