knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Load the package before running any of the code below:
library(vismeteor)
The activity of a visual meteor shower is a smooth, a priori unknown function of solar longitude. The classical evaluation summarises this by computing the Zenithal Hourly Rate (ZHR) per observation interval and plotting it as a noisy stepped curve. Single-interval ZHRs have high variance — Poisson noise plus observer-to-observer scatter — and the underlying activity profile is barely visible.
A spline-based fit across solar longitude collapses this noise into a
smooth, bias-controlled curve, provided that the spline's complexity
(number and location of interior knots) is chosen wisely.
select_knots() automates that choice: given a candidate grid of
possible interior knots and a user-supplied score function (BIC, AIC, or
a cross-validation criterion), it returns a small subset that minimises
the score.
Conceptually, the supplied knot_candidates form a grid of intervals
over the solar longitude: between two adjacent candidates lies an
interval in which the spline can bend independently. select_knots()
decides which interval boundaries are actually required — i.e. which
positions carry enough curvature to justify a knot under the score. The
natural resolution of the input grid in visual meteor work is the
observation granularity itself: typical VMDB observation intervals are
10–15 minutes long, corresponding to roughly 0.007–0.010 degrees of
solar longitude. A fine candidate grid at that resolution is the
practice recommendation; select_knots() will then reduce it to a
handful of statistically significant knots.
select_knots()The point of select_knots() is to keep the knot positions explicit
and few: a handful of values you can plot, label, and reason about,
instead of a dense anonymous knot grid hidden inside a penalised
smoother. The function is model-agnostic — your score_fun picks the
model family, offset, covariates, and the criterion (BIC, AIC, or a
cross-validation score) you optimise.
This function is the right tool when
Prefer alternatives when
mgcv::s(bs = "ps");mgcv::s(bs = "ad");earth::earth;select_knots() is greedy and
only finds a local optimum (see ?select_knots).Important clarification. select_knots() chooses knots for a good
fit, not for detecting extrema. Knots typically land next to maxima
or minima — where the curvature is high — and not on them. Local
extrema of the activity profile become visible from the shape of the
fitted curve, not from the list of selected knots.
The implementation is generic, but its motivation comes directly from
the visual-meteor use case: activity profiles are the natural setting
in which knot positions can be physically interpreted — main maximum,
sub-maxima, shower onset and end. The same interpretive value carries
over to other meteor-related smooth fits — for instance the apparent
magnitude distribution of a shower. Alternative packages (mgcv,
earth) replace explicit knots with continuous penalty mechanisms,
which gives up that interpretation.
We use the package's example data set PER_2015_rates (visual rate
observations of the Perseids 2015; see ?PER_2015_rates and
?load_vmdb). The raw filtering follows standard practice:
data(PER_2015_rates) obs <- PER_2015_rates$observations |> subset( rad_alt > 15 & moon_alt < 0 & sun_alt < -5 & sl_start >= 139.2 & sl_end <= 140.6 ) deg2rad <- \(x) x * pi / 180 obs$sl <- (obs$sl_start + obs$sl_end) / 2 obs$t <- with(obs, t_eff * sin(deg2rad(rad_alt)) / f) rates <- data.frame( sl = obs$sl, t = obs$t, limmag = obs$lim_magn, freq = obs$freq ) rates <- rates[order(rates$sl), ] nrow(rates)
What each step does:
rad_alt > 15 — when the radiant is low above the horizon
the geometric correction sin(rad_alt) becomes numerically
fragile and atmospheric extinction dominates; observations below
15 degrees are discarded by convention.moon_alt < 0 — the Moon above the horizon brightens the sky and
reduces the effective limiting magnitude in a way the simple limmag
correction does not capture; these observations are excluded rather
than corrected.sun_alt < -5 — nautical or astronomical twilight; before that
the sky is too bright.sl_start >= 139.2 & sl_end <= 140.6 — only for this vignette:
a tight window centred on the Perseid main peak, chosen narrow enough
to keep the vignette build fast while still containing the peak
(≈ 140.0 deg) and visible curvature on both sides. In a real analysis
you would use the full activity range (roughly 107–155 deg for the
Perseids).sl = (sl_start + sl_end) / 2 — VMDB observations are intervals;
we summarise each by its midpoint, the standard choice for the
regressor.t = t_eff * sin(rad_alt) / f is the central correction. It
rescales the raw observing time to what the observer would have
spent under ideal sky conditions at their own limiting magnitude:
radiant in the zenith (sin(rad_alt)) and a clear sky (f).
t_eff is the effective observing time after deducting breaks; f
is the cloud-cover correction factor (>= 1, larger when more sky
is obstructed). The correction from each observer's own limiting
magnitude to the ZHR reference 6.5 is not built into t; the
GLM below picks it up via the limmag covariate.We will call t simply the observing time from now on — not to be
confused with t_eff, the raw effective observing time after breaks;
t is t_eff after the zenith (sin(rad_alt)) and cloud-cover
(/ f) corrections.
fit_glm <- \(rates, knots) { f <- if (length(knots) == 0L) { freq ~ splines::ns(sl) + limmag + offset(log(t)) } else { freq ~ splines::ns(sl, knots = knots) + limmag + offset(log(t)) } stats::glm(f, data = rates, family = stats::poisson()) } score_bic <- \(rates, knots) stats::BIC(fit_glm(rates, knots))
offset(log(t)) makes the linear predictor a log-rate per observing
hour at the observer's own limiting magnitude. The limmag term then
rescales that rate later to the ZHR reference (limmag = 6.5):
exp(coef(limmag)) is the data-driven population index r of the
shower under this model. The spline term splines::ns() is a natural
cubic spline, well-behaved at the boundaries.
A practical note on the offset: offset() inside a formula is a
formula-special and must stay unqualified — both stats::offset(...)
in the formula and passing the term via glm()'s offset = argument
break predict() on newdata.
A regular grid is the simplest choice but ignores the actual observation
density. In practice the candidate grid is built from the data themselves
via freq_quantile() (see ?freq_quantile): partition the observations,
ordered by solar longitude, into bins each containing at least a fixed
minimum number of meteors, then use the meteor-weighted solar-longitude
midpoint of each bin as a candidate. This produces a fine grid in
observation-rich regions and a coarse grid where data is sparse — exactly
the resolution select_knots() benefits from:
rates$q <- vismeteor::freq_quantile(rates$freq, 12) # at least 12 meteors per bin rates$qsl <- with( rates, ave(freq * sl, q, FUN = sum) / ave(freq, q, FUN = sum) ) cand <- sort(unique(round(rates$qsl, 2))) length(cand)
The round(..., 2) step collapses bins whose centres are practically
indistinguishable (resolution 0.01 deg). The minimum bin count (12)
controls how dense the candidate grid is; a smaller value gives a finer
grid (slower fit), a larger value a coarser one.
The fixed_knots argument pins knot positions that must be present in
every fitted model. In forward mode they are set from the start and the
search adds further knots on top; in backward mode they are never
proposed for removal, neither singly nor by bulk removal. fixed_knots
need not be a subset of knot_candidates — values supplied there are
forced into the working set regardless.
A fixed knot is a constraint on the search, not a finding from it: the
criterion no longer has the freedom to reject it. The earlier warning
that knots typically land next to extrema rather than on them still
applies — forcing a knot exactly on, say, a presumed maximum overrides
that pattern at the chosen position. Use it when there is a non-data
reason a position must be in the model (a documented event time, an
external anchor, a stable reference across analyses); see
?select_knots for the full semantics.
The vignette runs serially (n_cores = 1L) because the search window
is narrow. For a full Perseids activity range (≈ 107–155 deg) the
per-round candidate scoring benefits from
n_cores = parallel::detectCores() - 1L; see ?select_knots for the
reproducibility recipe with mclapply().
res_fwd <- vismeteor::select_knots(rates, cand, score_bic) sort(c(res_fwd$knots, res_fwd$fixed_knots)) res_fwd$score
plot(res_fwd$history$n_knots, res_fwd$history$score, type = "b", pch = 19, col = "blue", xlab = "number of interior knots", ylab = "BIC", main = "Forward selection: BIC trajectory" ) points(tail(res_fwd$history$n_knots, 1L), res_fwd$score, pch = 19, col = "red", cex = 1.5 ) abline(h = res_fwd$score, lty = 2, col = "grey50")
The red point marks the score minimum — and with the default
n_steps = NULL it is also the endpoint of the trajectory, because
the search stops as soon as no further addition improves the score. To
see what one more step would do, re-run the algorithm starting from the
optimum with a positive n_steps:
res_fwd_plus1 <- vismeteor::select_knots(rates, cand, score_bic, fixed_knots = c( res_fwd$knots, res_fwd$fixed_knots ), n_steps = 1L ) res_fwd_plus1$score - res_fwd$score # always >= 0: how much worse one step makes it
n_steps is an exploration knob, not a model-selection knob: it
forces the algorithm to take a fixed number of further iterations
regardless of whether they improve the score, so that the immediate
score landscape past the optimum becomes visible through history. A
shallow rise tells you the BIC optimum is soft and that a slightly
richer model would not cost much; a steep rise tells you the optimum
is sharp and the parsimony is justified. The same recipe — re-call
with fixed_knots = c(prev$knots, prev$fixed_knots), n_steps = k —
extends to as many follow-up steps as you like; see ?select_knots.
res_bwd <- vismeteor::select_knots(rates, cand, score_bic, backward = TRUE) sort(c(res_bwd$knots, res_bwd$fixed_knots)) res_bwd$score
plot(res_bwd$history$n_knots, res_bwd$history$score, type = "b", pch = 19, col = "blue", xlab = "number of interior knots", ylab = "BIC", main = "Backward elimination: BIC trajectory" ) points(tail(res_bwd$history$n_knots, 1L), res_bwd$score, pch = 19, col = "red", cex = 1.5 ) abline(h = res_bwd$score, lty = 2, col = "grey50")
The trajectory starts at the full candidate pool on the right and moves
left as knots are removed; gaps along the x-axis correspond to bulk
removal rounds (see bulk_gap in ?select_knots). The red point again
marks the score minimum.
Both searches walk the lattice of knot subsets greedily — each round commits to the locally best add or drop — but they start at opposite corners:
That is why the two paths can disagree: at any intermediate state
the score of a candidate move depends on what is currently selected.
Adding k_i to the empty set is a genuinely different operation from
adding k_i to a set that already contains its neighbours, and the
two searches encounter k_i in incompatible contexts.
Practical recommendation. For activity-profile fitting we
recommend backward = TRUE for the final analysis, even though the
function's API default is FALSE (the function is generic and
defaults to the cheaper-per-fit direction). The reason is qualitative,
not performance: starting from the overfit end ensures that all real
curvature is already present in the initial model, so the search only
has to decide what is redundant — which is the safer mode against the
forward "miss-by-combination" failure. Use backward = FALSE (forward)
for a quick, sparse first sketch or when the full-candidate initial
fit is itself too expensive (very large candidate pool, complex model
family).
In practice neither direction is uniformly faster: forward iterates
many small fits one knot at a time, while backward starts with the
single most expensive fit and then collapses fast via the bulk-removal
mechanism (bulk_gap = 4L by default; see ?select_knots).
In any case, running both directions and keeping the lower-BIC result
is a cheap robustness check; large disagreements (in either knot
count or score) are a warning that the chosen criterion is on the
flat part of its landscape and that the data alone may not justify a
unique parsimonious model. The n_steps argument lets either search
continue for a fixed number of iterations past its first local
minimum, which is useful for inspecting how flat or sharp that minimum
is.
best <- if (res_fwd$score <= res_bwd$score) { c(res_fwd$knots, res_fwd$fixed_knots) } else { c(res_bwd$knots, res_bwd$fixed_knots) } fit <- fit_glm(rates, sort(best)) # Data-driven population index of the shower, with 95% Wald CI # (log-scale CI on the limmag coefficient, exponentiated). limmag_row <- summary(fit)$coefficients["limmag", ] r_hat <- exp(limmag_row["Estimate"]) r_ci <- exp(limmag_row["Estimate"] + c(-1, 1) * 1.96 * limmag_row["Std. Error"]) c( estimate = unname(r_hat), lower = unname(r_ci[1]), upper = unname(r_ci[2]) )
The point estimate is the population index r implied by the
rate-vs-limmag slope of this dataset; the CI width tells you how
sharply that slope is pinned down by the observations (a tight band
means a well-determined r, a wide band means the data leave it
essentially free). It enters the ZHR prediction below.
sl_grid <- seq(139.2, 140.6, length.out = 400) pred_df <- data.frame(sl = sl_grid, limmag = 6.5, t = 1) link <- predict(fit, newdata = pred_df, type = "link", se.fit = TRUE) zhr <- exp(link$fit) zhr_lo <- exp(link$fit - 1.96 * link$se.fit) zhr_hi <- exp(link$fit + 1.96 * link$se.fit) plot(sl_grid, zhr, type = "n", ylim = range(c(zhr_lo, zhr_hi)), xlab = "solar longitude (deg)", ylab = "ZHR", main = "PER 2015 - fitted activity profile (95% CI)" ) polygon(c(sl_grid, rev(sl_grid)), c(zhr_lo, rev(zhr_hi)), col = adjustcolor("blue", alpha.f = 0.2), border = NA ) lines(sl_grid, zhr, col = "blue", lwd = 2) rug(best, col = "red")
The prediction is evaluated with t = 1 (one ZHR-hour) and
limmag = 6.5 (the reference limiting magnitude), so the response is
the ZHR per hour on the chosen reference scale. The red rug marks on
the x-axis show the selected interior knots; note that they sit next
to the activity peak rather than on top of it, as expected — that is
where the curvature is highest.
The shaded band is the pointwise 95 % confidence interval of the fit.
We obtain it on the link (log) scale — where the GLM's standard
errors are approximately Gaussian — and then transform back with
exp(). The band reflects parameter uncertainty of the spline + linear
predictor; it is not a prediction interval for individual
observations, which would also have to account for Poisson dispersion
of the counts themselves.
Raw residuals of individual Poisson observations with small expectations
are visually uninformative. A cleaner diagnostic groups the observations
into bins of fixed minimum meteor count using freq_quantile() and
compares the observed and expected totals per bin.
# Use a coarser binning here than for the candidate grid above # (>= 500 meteors per bin) so that bin uncertainties are small enough # to read systematic deviations from the diagnostic plot. rates$qres <- vismeteor::freq_quantile(rates$freq, 500) rates$pred <- predict(fit, type = "response") bins <- with(rates, { obs_n <- tapply(freq, qres, sum) pred_n <- tapply(pred, qres, sum) data.frame( sl = tapply(sl * freq, qres, sum) / obs_n, ratio = obs_n / pred_n, se = sqrt(obs_n) / pred_n ) }) nrow(bins) plot(bins$sl, bins$ratio, pch = 19, col = "blue", ylim = range(c( bins$ratio - 2 * bins$se, bins$ratio + 2 * bins$se )), xlab = "solar longitude (deg)", ylab = "observed / expected", main = "PER 2015 - binned residual ratios (+/- 1 sigma)" ) arrows(bins$sl, bins$ratio - bins$se, bins$sl, bins$ratio + bins$se, angle = 90, code = 3, length = 0.03, col = "blue" ) abline(h = 1, lty = 2, col = "grey50")
A well-fitting model scatters the bin ratios randomly around 1, with about two thirds of the one-sigma error bars crossing the dashed line. Systematic arches over solar longitude would indicate too few knots in a region; visible scatter with no trend is Poisson over-dispersion (not pursued here).
The same freq_quantile() helper drives both the candidate grid above
and the diagnostic here, but with different minimum bin sizes: a small
minimum (12) for a fine candidate grid that select_knots() can prune
from, and a larger minimum (500) for a small number of stable bins on
which residual deviations are visually detectable. In the narrow vignette
window the larger minimum yields only a handful of bins (see the
nrow(bins) output); a full activity-range analysis would lower this
minimum to keep enough bins for a visible trend.
?select_knots — the function reference.?freq_quantile — quantile bins used here for the residual diagnostic.?PER_2015_rates — the example data set.?load_vmdb — importing rate and magnitude observations from the
VMDB API.Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.