| select_knots | R Documentation |
Selects a parsimonious subset of knot_candidates as interior knots
for a regression spline by greedy forward addition or backward elimination,
scored by a user-supplied criterion (BIC, AIC, or any cross-validation
function). The function is model-agnostic: it only chooses which
knot positions to include and passes that vector to your score_fun;
your score_fun decides which basis (splines::ns,
splines::bs of any degree, …) and which model family to fit.
Typical use: fitting a smooth, parsimonious trend through noisy data — for
example an activity profile of a meteor shower across solar longitude —
where the number and position of knots should remain small and inspectable.
select_knots(
data,
knot_candidates,
score_fun,
backward = FALSE,
n_steps = NULL,
bulk_gap = 4L,
fixed_knots = numeric(0),
verbose = FALSE,
n_cores = 1L
)
data |
Data passed unchanged to |
knot_candidates |
Numeric vector of candidate interior-knot positions. Duplicates and unsorted input are tolerated (sorted/uniq'd internally). |
score_fun |
|
backward |
Logical. |
n_steps |
|
bulk_gap |
Integer (>= 0). Minimum index gap between knots removed in
the same round of backward elimination. Ignored when
|
fixed_knots |
Numeric vector of knot positions that must be present
in every fitted model during the search. In forward mode they are
set from the start, and further knots may be added on top; in
backward mode they are never proposed for removal, neither singly nor
in bulk. Need not be a subset of |
verbose |
Logical. If |
n_cores |
Integer (>= 1). Performance-only knob: the per-round
candidate scoring runs in parallel across
The function errors out with a clear message if Reproducibility deserves a note when |
The typical use case — meteor shower rates over the solar longitude or
time, for example — is a smooth, slowly varying signal whose shape is
not known in advance and whose curvature changes locally: too rigid for
a low-order polynomial, too noisy for a histogram. A regression spline
(typically cubic, e.g. splines::ns or
splines::bs(degree = 3)) with a modest number of well-placed
interior knots gives a smooth, locally flexible fit with interpretable
degrees of freedom. Picking that number is a bias/variance trade-off:
too many knots overfit (ringing, unstable derivatives, slow fits), too
few introduce bias. select_knots() automates the trade-off by
greedy forward addition or backward elimination, scored by a
user-defined criterion; you control which knots are even allowed
(knot_candidates) and what counts as "better" (score_fun).
Your score_fun must take (data, knots) and return a single
numeric value; lower is better. Typically it fits a model
with these interior knots (length(knots) == 0L means "no interior
knots") and reports an information criterion (stats::BIC,
stats::AIC) or a held-out score. knots arrives sorted, so
the fit code does not need to sort it again. A divergent or failed fit
should ideally return Inf; the function additionally wraps each
call in tryCatch() and treats errors as Inf so the search
continues robustly. See Examples for a runnable template.
Backward elimination can drop multiple knots per round in "bulk" mode.
When backward = TRUE and bulk_gap >= 1L, each round removes
several well-separated knots at once (minimum index gap bulk_gap
in the current sorted knot list). For a B-spline basis of degree
d, each basis function has support over d + 1 consecutive
knot intervals, so removing knots that are at least d + 1
positions apart has nearly additive effect on the score — giving
roughly a bulk_gap-fold speed-up at the cost of a small
approximation (mitigated by a verify-fit after each bulk round). The
default 4L matches d + 1 for cubic bases such as
splines::ns or splines::bs(degree = 3); for other bases
pick bulk_gap = degree + 1, or set bulk_gap = 0L for
strict one-knot-per-round behaviour.
The result splits the final knot set into two disjoint vectors:
knots (positions the algorithm itself selected) and
fixed_knots (the user-supplied anchors, echoed back, deduplicated
and sorted). The full vector to fit a model on is
c(knots, fixed_knots) (or its sorted form). With the default
n_steps = NULL the loop stops as soon as no further move improves
the score, so the end state is the score-optimum; with
n_steps > 0L the loop runs that many iterations regardless of
improvement, and the score-best point along the trajectory is recoverable
from history via the row at which.min(history$score).
If the starting state already is (locally) optimal — typical when
fixed_knots alone overfits in forward mode, or when
knot_candidates is so small that the full pool already minimises
the score in backward mode — the n_steps = NULL run terminates
immediately with history containing only the initial row and
score equal to the starting score; an explicit n_steps = N
run will instead take N worsening steps so the post-optimum
landscape is still observable.
Knots sit next to extrema, not on them. select_knots() chooses
knots for a good fit, not for detecting extrema of the response.
Knots typically land next to peaks or troughs — where the curvature is
highest — and not on them. Read local extrema off the shape of
the fitted curve, not from knots. A knot supplied
through fixed_knots is the exception: it sits wherever the user
puts it, since it is a constraint imposed on the search rather than a
finding produced by it.
A list with elements (in this order):
backwardLogical — the direction used.
knotsSorted numeric vector — the interior knots the
algorithm itself selected, with fixed_knots excluded.
Disjoint from fixed_knots by construction. The full knot
vector to fit a model on is c(knots, fixed_knots) (or its
sorted form). With the default n_steps = NULL this is the
score-optimal selection; with n_steps > 0L it can be a state
past the optimum.
fixed_knotsSorted numeric vector — the user-supplied fixed knots echoed back (deduplicated and sorted). Empty vector if the caller did not supply any.
scoreNumeric — the score at the end state, i.e. at
c(knots, fixed_knots).
n_stepsThe n_steps value used (NULL or a
positive integer).
historydata.frame of per-round records:
step, n_knots (total interior knot count including
fixed_knots), changed_knot (the knot added or removed
in that round; NA for the initial state and for bulk-removal
rounds), score, extra (boolean: TRUE when that
step worsened the score). For n_steps = NULL runs no
worsening step is taken, so extra is always FALSE.
select_knots() is for situations where knot positions are
meaningful and you want a small, inspectable set of interior knots scored
under a criterion you choose. If you instead want a continuous roughness
penalty over a dense knot grid, the mgcv package's penalised
B-splines (bs = "ps") or adaptive smoothers (bs = "ad") are
usually a better fit; for greedy stepwise selection on a hinge-function
basis, see the earth package.
Like every greedy stepwise procedure, select_knots() performs a
local search: in each round it commits to the locally best
add/drop. It therefore returns a local optimum of the score, which is
usually but not provably the global optimum — a knot combination
that beats every individually-best move can be unreachable once an
earlier, locally-attractive knot has been picked. The only way to
guarantee globality would be exhaustive enumeration over all
2^{|knot\_candidates|} subsets, which is exponential and
impractical for any realistic candidate pool. In practice the greedy
optimum is close; repeating the search in the opposite direction
(backward = TRUE) is a cheap robustness check.
ns, bs,
BIC, AIC
## Not run:
# Greedy knot selection on a simple synthetic signal.
set.seed(1)
n <- 200
x <- seq(0, 10, length.out = n)
y <- stats::rpois(n, lambda = 50 + 30 * sin(x))
dat <- data.frame(x = x, y = y)
# score_fun: fit a Poisson GLM with a natural cubic spline, return BIC.
# Lower is better.
fit <- \(d, knots) {
f <- if (length(knots) == 0L) {
y ~ splines::ns(x)
} else {
y ~ splines::ns(x, knots = knots)
}
stats::glm(f, family = stats::poisson(), data = d)
}
score_bic <- \(d, knots) stats::BIC(fit(d, knots))
cand <- seq(1, 9, by = 0.5)
res <- select_knots(dat, cand, score_bic, verbose = TRUE)
# Full knot vector to fit the final model on:
final_knots <- sort(c(res$knots, res$fixed_knots))
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.