select_knots: Greedy stepwise knot selection for a regression spline

select_knotsR Documentation

Greedy stepwise knot selection for a regression spline

Description

Selects a parsimonious subset of knot_candidates as interior knots for a regression spline by greedy forward addition or backward elimination, scored by a user-supplied criterion (BIC, AIC, or any cross-validation function). The function is model-agnostic: it only chooses which knot positions to include and passes that vector to your score_fun; your score_fun decides which basis (splines::ns, splines::bs of any degree, …) and which model family to fit. Typical use: fitting a smooth, parsimonious trend through noisy data — for example an activity profile of a meteor shower across solar longitude — where the number and position of knots should remain small and inspectable.

Usage

select_knots(
  data,
  knot_candidates,
  score_fun,
  backward = FALSE,
  n_steps = NULL,
  bulk_gap = 4L,
  fixed_knots = numeric(0),
  verbose = FALSE,
  n_cores = 1L
)

Arguments

data

Data passed unchanged to score_fun; typically a data.frame. The function itself does not inspect or mutate it.

knot_candidates

Numeric vector of candidate interior-knot positions. Duplicates and unsorted input are tolerated (sorted/uniq'd internally).

score_fun

\(data, knots) -> numeric scalar. Lower is better. The caller is responsible for fitting whatever model they want and returning a single criterion value (BIC, AIC, cross-validation error, ...).

backward

Logical. FALSE (default) = forward selection, TRUE = backward elimination.

n_steps

NULL (default) or a positive integer. NULL runs the greedy search until the next move would not improve the score and stops there — so the end state of the search IS the score optimum (under the current direction and constraints). A positive integer N runs exactly N iterations regardless of whether each one improves the score; this is a deliberate exploration mode, intended for inspecting the score landscape just past a previously-found optimum. The canonical recipe is to first call with n_steps = NULL to find the optimum, then re-call with fixed_knots = prev$knots and n_steps = 1L (or 2L, 3L) to take one or more controlled steps onward and see how rapidly the score deteriorates. The value 0L is not allowed (use NULL for "no steps past the optimum"); non-integer or negative values error out.

bulk_gap

Integer (>= 0). Minimum index gap between knots removed in the same round of backward elimination. Ignored when backward = FALSE. 0L disables bulk removal. The default 4L matches the support of a cubic spline basis (degree + 1 for cubic B-splines; the same value also fits splines::ns, which is cubic by construction); for higher-degree B-splines pick degree + 1 accordingly. Note that bulk_gap = 1L imposes no real separation constraint and therefore removes every improving knot in one round – a very aggressive setting that defeats the per-round verify-fit's role as a safety net.

fixed_knots

Numeric vector of knot positions that must be present in every fitted model during the search. In forward mode they are set from the start, and further knots may be added on top; in backward mode they are never proposed for removal, neither singly nor in bulk. Need not be a subset of knot_candidates – included regardless. Duplicates and unsorted input are tolerated. Default numeric(0) (no fixed knots).

verbose

Logical. If TRUE, prints per-round progress (cat() to stdout). Default FALSE.

n_cores

Integer (>= 1). Performance-only knob: the per-round candidate scoring runs in parallel across n_cores workers. Default 1L (serial, no extra dependency). When n_cores > 1L the base-R package parallel is loaded and parallel::mclapply() is used (fork-based on macOS/Linux; falls back to serial on Windows). Quick recipes:

  • n_cores = 1L – safe default, no extra package loaded.

  • n_cores = max(1L, parallel::detectCores() - 1L) – use all cores except one; good default for interactive multi-tasking.

  • n_cores = parallel::detectCores() – use every core; fastest but resource-hungry (the whole machine is busy).

The function errors out with a clear message if n_cores > 1L but parallel is not installed.

Reproducibility deserves a note when n_cores > 1L: mclapply() is fork-based and inherits the parent RNG state. If your score_fun uses randomness (e.g. cross-validation splits), set RNGkind("L'Ecuyer-CMRG") and seed via set.seed() before calling select_knots() to get reproducible per-worker streams; otherwise results can differ run to run and across n_cores.

Details

The typical use case — meteor shower rates over the solar longitude or time, for example — is a smooth, slowly varying signal whose shape is not known in advance and whose curvature changes locally: too rigid for a low-order polynomial, too noisy for a histogram. A regression spline (typically cubic, e.g. splines::ns or splines::bs(degree = 3)) with a modest number of well-placed interior knots gives a smooth, locally flexible fit with interpretable degrees of freedom. Picking that number is a bias/variance trade-off: too many knots overfit (ringing, unstable derivatives, slow fits), too few introduce bias. select_knots() automates the trade-off by greedy forward addition or backward elimination, scored by a user-defined criterion; you control which knots are even allowed (knot_candidates) and what counts as "better" (score_fun).

Your score_fun must take (data, knots) and return a single numeric value; lower is better. Typically it fits a model with these interior knots (length(knots) == 0L means "no interior knots") and reports an information criterion (stats::BIC, stats::AIC) or a held-out score. knots arrives sorted, so the fit code does not need to sort it again. A divergent or failed fit should ideally return Inf; the function additionally wraps each call in tryCatch() and treats errors as Inf so the search continues robustly. See Examples for a runnable template.

Backward elimination can drop multiple knots per round in "bulk" mode. When backward = TRUE and bulk_gap >= 1L, each round removes several well-separated knots at once (minimum index gap bulk_gap in the current sorted knot list). For a B-spline basis of degree d, each basis function has support over d + 1 consecutive knot intervals, so removing knots that are at least d + 1 positions apart has nearly additive effect on the score — giving roughly a bulk_gap-fold speed-up at the cost of a small approximation (mitigated by a verify-fit after each bulk round). The default 4L matches d + 1 for cubic bases such as splines::ns or splines::bs(degree = 3); for other bases pick bulk_gap = degree + 1, or set bulk_gap = 0L for strict one-knot-per-round behaviour.

The result splits the final knot set into two disjoint vectors: knots (positions the algorithm itself selected) and fixed_knots (the user-supplied anchors, echoed back, deduplicated and sorted). The full vector to fit a model on is c(knots, fixed_knots) (or its sorted form). With the default n_steps = NULL the loop stops as soon as no further move improves the score, so the end state is the score-optimum; with n_steps > 0L the loop runs that many iterations regardless of improvement, and the score-best point along the trajectory is recoverable from history via the row at which.min(history$score).

If the starting state already is (locally) optimal — typical when fixed_knots alone overfits in forward mode, or when knot_candidates is so small that the full pool already minimises the score in backward mode — the n_steps = NULL run terminates immediately with history containing only the initial row and score equal to the starting score; an explicit n_steps = N run will instead take N worsening steps so the post-optimum landscape is still observable.

Knots sit next to extrema, not on them. select_knots() chooses knots for a good fit, not for detecting extrema of the response. Knots typically land next to peaks or troughs — where the curvature is highest — and not on them. Read local extrema off the shape of the fitted curve, not from knots. A knot supplied through fixed_knots is the exception: it sits wherever the user puts it, since it is a constraint imposed on the search rather than a finding produced by it.

Value

A list with elements (in this order):

backward

Logical — the direction used.

knots

Sorted numeric vector — the interior knots the algorithm itself selected, with fixed_knots excluded. Disjoint from fixed_knots by construction. The full knot vector to fit a model on is c(knots, fixed_knots) (or its sorted form). With the default n_steps = NULL this is the score-optimal selection; with n_steps > 0L it can be a state past the optimum.

fixed_knots

Sorted numeric vector — the user-supplied fixed knots echoed back (deduplicated and sorted). Empty vector if the caller did not supply any.

score

Numeric — the score at the end state, i.e. at c(knots, fixed_knots).

n_steps

The n_steps value used (NULL or a positive integer).

history

data.frame of per-round records: step, n_knots (total interior knot count including fixed_knots), changed_knot (the knot added or removed in that round; NA for the initial state and for bulk-removal rounds), score, extra (boolean: TRUE when that step worsened the score). For n_steps = NULL runs no worsening step is taken, so extra is always FALSE.

When to use this

select_knots() is for situations where knot positions are meaningful and you want a small, inspectable set of interior knots scored under a criterion you choose. If you instead want a continuous roughness penalty over a dense knot grid, the mgcv package's penalised B-splines (bs = "ps") or adaptive smoothers (bs = "ad") are usually a better fit; for greedy stepwise selection on a hinge-function basis, see the earth package.

Like every greedy stepwise procedure, select_knots() performs a local search: in each round it commits to the locally best add/drop. It therefore returns a local optimum of the score, which is usually but not provably the global optimum — a knot combination that beats every individually-best move can be unreachable once an earlier, locally-attractive knot has been picked. The only way to guarantee globality would be exhaustive enumeration over all 2^{|knot\_candidates|} subsets, which is exponential and impractical for any realistic candidate pool. In practice the greedy optimum is close; repeating the search in the opposite direction (backward = TRUE) is a cheap robustness check.

See Also

ns, bs, BIC, AIC

Examples

## Not run: 
# Greedy knot selection on a simple synthetic signal.
set.seed(1)
n <- 200
x <- seq(0, 10, length.out = n)
y <- stats::rpois(n, lambda = 50 + 30 * sin(x))
dat <- data.frame(x = x, y = y)

# score_fun: fit a Poisson GLM with a natural cubic spline, return BIC.
# Lower is better.
fit <- \(d, knots) {
    f <- if (length(knots) == 0L) {
        y ~ splines::ns(x)
    } else {
        y ~ splines::ns(x, knots = knots)
    }
    stats::glm(f, family = stats::poisson(), data = d)
}
score_bic <- \(d, knots) stats::BIC(fit(d, knots))

cand <- seq(1, 9, by = 0.5)
res <- select_knots(dat, cand, score_bic, verbose = TRUE)
# Full knot vector to fit the final model on:
final_knots <- sort(c(res$knots, res$fixed_knots))

## End(Not run)


vismeteor documentation built on May 20, 2026, 1:07 a.m.