| Details | R Documentation |
This document gives mathematical and implementation details for Lagrangian Multiplier Smoothing Splines (LMSS) as implemented in lgspline.
LMSS fits smoothing splines directly in a monomial basis. Smoothness is imposed by external equality constraints rather than by embedding knots in a specialized spline basis. This keeps the fitted model interpretable at the coefficient level, but moves some complexity into constraint construction and constrained estimation.
B-spline fits can be converted to monomial form in principle, but multivariate tensor-product conversions can be high-dimensional, numerically unstable, and unavailable in standard software.
Consider an N \times q matrix of predictors
\mathbf{T} = (\mathbf{t}_1, \dots, \mathbf{t}_N)^{\top} and an
N \times 1 response vector \mathbf{y} = (y_1, \dots, y_N)^{\top}.
We assume the relationship follows a generalized linear model with unknown
smooth function f:
y_i \sim \mathcal{D}(g^{-1}(f(\mathbf{t}_i)),\, \sigma^2)
where \mathcal{D} is a distribution (e.g. exponential family or related) with mean
\mu_i = g^{-1}(f(\mathbf{t}_i)), link function g(\cdot), and
dispersion parameter \sigma^2. For Gaussian response with identity
link, observations are independently distributed as
y_i \mid \mathbf{t}_i, \sigma^2 \sim \mathcal{N}(f(\mathbf{t}_i), \sigma^2).
The objective is to estimate f by:
Partitioning the predictor space into K+1 mutually exclusive regions.
Fitting local polynomial models within each partition.
Enforcing smoothness at partition boundaries via Lagrangian multipliers.
Penalizing the integrated squared second derivative to discourage roughness.
Unlike other smoothing spline formulations, no post-fitting algebraic rearrangement or disentanglement of a spline basis is needed to obtain interpretable models. The polynomial expansions are homogeneous across partitions, and the relationship between predictor and response is explicit at the coefficient level.
To anchor the notation, in the single-predictor cubic case one would write
\hat{f}(t_i) = \hat{\beta}_{(0)} + \hat{\beta}_{(1)}t_i +
\hat{\beta}_{(2)}t_i^2 + \hat{\beta}_{(3)}t_i^3 =
\mathbf{x}_i^{\top}\hat{\boldsymbol{\beta}},
where \mathbf{x}_i = (1, t_i, t_i^2, t_i^3)^{\top}. The LMSS
formulation preserves exactly this kind of polynomial representation, but now
does so within each partition and then forces neighboring pieces to agree in
the smoothness conditions described below.
Core notation used throughout:
\mathbf{y}_{(N \times 1)}: Response vector.
\mathbf{T}_{(N \times q)}: Matrix of predictors.
\mathbf{X}_{(N \times P)}: Block-diagonal matrix of polynomial
expansions, with diagonal blocks \mathbf{X}_k of dimension
n_k \times p.
\boldsymbol{\Lambda}_{(P \times P)}: Block-diagonal penalty matrix,
with blocks \boldsymbol{\Lambda}_k of dimension p \times p.
\hat{\boldsymbol{\beta}}_{(P \times 1)}: Unconstrained penalized
estimate.
\tilde{\boldsymbol{\beta}}_{(P \times 1)}: Constrained coefficient
estimates.
\mathbf{G}_{(P \times P)}: Block-diagonal matrix with blocks
\mathbf{G}_k = (\mathbf{X}_k^{\top}\mathbf{W}_k\mathbf{D}_k\mathbf{X}_k + \boldsymbol{\Lambda}_k)^{-1},
where \mathbf{W}_k and \mathbf{D}_k are defined below.
\mathbf{A}_{(P \times r)}: Constraint matrix encoding smoothness
conditions. Reduced to linearly independent columns via pivoted QR
decomposition by default, or by the parallel_qr reduced-system
route when that option is active.
\mathbf{U}_{(P \times P)}:
\mathbf{I} - \mathbf{G}\mathbf{A}(\mathbf{A}^{\top}\mathbf{G}\mathbf{A})^{-1}\mathbf{A}^{\top}.
\mathbf{D}_{(N \times N)}: Diagonal matrix of user-supplied
observation weights (observation_weights or weights).
Defaults to the identity. These play the role of prior precision on
individual observations: a weight of 2 is equivalent to seeing that
observation twice.
\mathbf{W}_{(N \times N)}: Diagonal matrix of GLM working
weights. In the implementation these diagonal entries are whatever is
returned by glm_weight_function; by default this is the GLM
working weight
family$mu.eta(eta)^2 / family$variance(mu), optionally
multiplied by user-supplied observation weights. For Gaussian response with identity link,
\mathbf{W} = \mathbf{I}. For other families,
\mathbf{W} depends on the current fitted values and is updated at
each Newton-Raphson iteration. For the common canonical families used
by default, this matches the familiar Fisher-scoring weighting role.
\mathbf{V}_{(N \times N)}: Correlation matrix of errors.
When no correlation structure is specified, \mathbf{V} = \mathbf{I}.
Otherwise supplied via VhalfInv or estimated through
VhalfInv_fxn.
In the Gaussian identity case with unit weights and no correlation,
\mathbf{G}_k = (\mathbf{X}_k^{\top}\mathbf{X}_k + \boldsymbol{\Lambda}_k)^{-1}
and most formulas simplify accordingly. When \mathbf{D} or
\mathbf{W} appear in a formula, the product \mathbf{W}\mathbf{D}
means “GLM working weights times observation weights”; whenever one of
them is the identity it drops out.
Before these quantities reach the main fitting stage, the user-facing inputs
are parsed, standardized, and organized by process_input. When
the formula interface is used and auto_encode_factors = TRUE, that
preprocessing step also relies on helpers such as create_onehot
to encode factor levels before the design reaches lgspline.fit().
The notation in the remainder of this document therefore refers to the internal
objects that actually enter lgspline.fit(), not necessarily the raw
objects originally supplied by the user.
Most user-facing controls can also be supplied through grouped lists
(penalty_args, tuning_args, expansion_args,
constraint_args, qp_args, parallel_args,
covariance_args, return_args, and glm_args); these are
unpacked before entering the same fitting pipeline. Setting
dummy_fit = TRUE runs preprocessing, partitioning, expansion, and
penalty construction without estimating nonzero coefficients, which is useful
for inspecting X, A, partitions, and penalties before fitting.
For K knots (one predictor) or K+1 partitions (multiple
predictors) there are K+1 mutually exclusive partitions
\mathcal{P}_0, \dots, \mathcal{P}_{K}. Each observation i
belongs to exactly one partition. Within partition k, the function is
represented as a polynomial of degree p-1 in each predictor:
\hat{f}_k(\mathbf{t}) = \mathbf{x}^{\top}\tilde{\boldsymbol{\beta}}_k
where \mathbf{x} collects the polynomial basis terms (intercept,
linear, quadratic, cubic, and optionally quartic and interaction terms) and
\tilde{\boldsymbol{\beta}}_k are the corresponding coefficients. In
one predictor, the same idea can be written more explicitly as
\hat{f}(t_i) = \sum_{k=0}^{K}
\mathbf{x}_{ik}^{\top}\hat{\boldsymbol{\beta}}_k
\mathbf{1}(t_i \in \mathcal{P}_k),
which highlights that the unconstrained problem is just a collection of
local polynomial regressions. The expansions are homogeneous across
partitions, so coefficients are directly comparable. This is implemented via
get_polynomial_expansions.
The exact contents of \mathbf{x} are controlled by the basis-expansion
arguments documented in lgspline: include_quadratic_terms,
include_cubic_terms, include_quartic_terms,
include_2way_interactions, include_3way_interactions,
include_quadratic_interactions, exclude_interactions_for,
exclude_these_expansions, and custom_basis_fxn. Likewise,
just_linear_with_interactions and
just_linear_without_interactions determine which predictors remain
structurally linear even though they still participate in the same
partition-wise polynomial bookkeeping described here.
Letting p denote the number of basis terms per partition,
P = p(K+1) is the total number of coefficients. The full design matrix
\mathbf{X} and penalty matrix \boldsymbol{\Lambda} are
block-diagonal with K+1 blocks, so unconstrained estimation reduces to
K+1 independent penalized regressions, which appears as follows for the identity link case:
\hat{\boldsymbol{\beta}}_k = \mathbf{G}_k \mathbf{X}_k^{\top} \mathbf{W}_k\mathbf{D}_k\mathbf{y}_k, \quad
\mathbf{G}_k = (\mathbf{X}_k^{\top}\mathbf{W}_k\mathbf{D}_k\mathbf{X}_k + \boldsymbol{\Lambda}_k)^{-1}.
For Gaussian identity with unit weights this reduces to the familiar
\mathbf{G}_k = (\mathbf{X}_k^{\top}\mathbf{X}_k + \boldsymbol{\Lambda}_k)^{-1}.
The block-diagonal structure means these can be computed in parallel across
partitions. In the user-facing interface this is realized by supplying a
cluster through cl, controlling work splitting with chunk_size,
and enabling stages such as parallel_eigen,
parallel_unconstrained, parallel_penalty,
parallel_make_constraint, and parallel_qr. The
parallel_qr option does not change the estimator; it replaces one QR
decomposition of a tall transformed system with parallel cross-product
accumulation and a smaller dense solve, with fallback to base linear algebra
when needed. The eigendecomposition and matrix square roots of each
\mathbf{G}_k are computed by compute_G_eigen and can be
returned as G and Ghalf.
Fitted values for the canonical Gaussian case appear as
\tilde{\mathbf{y}} = \mathbf{X}\tilde{\boldsymbol{\beta}} = \mathbf{H}\mathbf{y}
for \mathbf{H} = \mathbf{X}\mathbf{U}\mathbf{G}\mathbf{X}^{\top}.
Without further intervention the piecewise polynomial will be discontinuous.
The central idea of LMSS is that smoothness is not hidden inside a special
basis, but instead imposed directly where neighboring partitions meet.
At each knot t_{k,k+1} between neighboring partitions k and
k+1, up to three smoothing constraints are imposed:
Continuity: \mathbf{x}_{k,k+1}^{\top}\boldsymbol{\beta}_k = \mathbf{x}_{k,k+1}^{\top}\boldsymbol{\beta}_{k+1}.
First-derivative continuity: \mathbf{x}_{k,k+1}^{\prime\top}\boldsymbol{\beta}_k = \mathbf{x}_{k,k+1}^{\prime\top}\boldsymbol{\beta}_{k+1}.
Second-derivative continuity: \mathbf{x}_{k,k+1}^{\prime\prime\top}\boldsymbol{\beta}_k = \mathbf{x}_{k,k+1}^{\prime\prime\top}\boldsymbol{\beta}_{k+1}.
where \mathbf{x}^{\prime} and \mathbf{x}^{\prime\prime} are
elementwise first and second derivatives of the basis with respect to
\mathbf{t}. For the familiar cubic single-predictor basis
\mathbf{x} = (1, t, t^2, t^3)^{\top}, these derivative vectors are
\mathbf{x}' = (0, 1, 2t, 3t^2)^{\top}, \qquad
\mathbf{x}'' = (0, 0, 2, 6t)^{\top}.
With K knots this yields up to 3K scalar constraints for one
predictor. With multiple spline-expanded predictors, the default combines the
corresponding first- and second-derivative rows before adding them to the
equality matrix. With one spline-expanded predictor and additional
non-spline effects, derivative constraints are kept separate by default.
The flag add_first_and_second_derivative_constraints can force either
construction. The constraints are collected as linear equations
\mathbf{A}^{\top}\boldsymbol{\beta} = \mathbf{0}
in a P \times r matrix \mathbf{A}. The constraint matrix is
built by make_constraint_matrix and returned in the fitted
object as A.
In higher dimensions or with many partitions, separate derivative
constraints can over-specify the fit and push it toward a single global
polynomial. Combining first- and second-derivative rows reduces redundant
equality constraints while retaining a smooth join condition. The constraint
level can be controlled via include_constrain_fitted,
include_constrain_first_deriv, include_constrain_second_deriv,
and add_first_and_second_derivative_constraints.
The companion flag include_constrain_interactions determines whether
the analogous mixed-partial constraints are imposed for interaction terms,
and no_intercept adds the special homogeneous equality constraint that
fixes the intercept at zero (the same behavior triggered by using
0 + in the formula interface).
Before computing the projection \mathbf{U}, the constraint matrix is
optionally reduced to a linearly independent subset of columns. When
qr_pivot_smoothing_constraints = TRUE (the default), this uses
pivoted QR decomposition; when parallel_qr = TRUE and a cluster is
supplied, the code instead partitions the rows of \mathbf{A} into
blocks \mathbf{A}_{(1)}, \dots, \mathbf{A}_{(B)} and computes
\mathbf{A}^{\top}\mathbf{A}
= \sum_{b = 1}^{B}\mathbf{A}_{(b)}^{\top}\mathbf{A}_{(b)}
in parallel. The linearly independent constraint basis is then recovered
from that reduced system, so the package avoids running one full serial QR on
a tall redundant matrix whenever the parallel route is well-conditioned.
If the reduced system appears unstable, the code falls back to pivoted QR.
This avoids numerical instability from redundant constraints and ensures
\mathbf{A}^{\top}\mathbf{G}\mathbf{A} is invertible. When
qr_pivot_smoothing_constraints = FALSE, the original equality columns
are carried forward unchanged.
The constrained estimate is derived via Lagrangian multipliers. Define the
P \times P projection matrix:
\mathbf{U} = \mathbf{I} - \mathbf{G}\mathbf{A}(\mathbf{A}^{\top}\mathbf{G}\mathbf{A})^{-1}\mathbf{A}^{\top}.
Then the constrained estimate is:
\tilde{\boldsymbol{\beta}} = \mathbf{U}\hat{\boldsymbol{\beta}}.
The matrix \mathbf{U} has the property that
\mathbf{U}\mathbf{G}\mathbf{U}^{\top} = \mathbf{U}\mathbf{G}, which is
used extensively in variance estimation and posterior draws. In words, the
unconstrained penalized estimate is projected back into the coefficient space
that satisfies the smoothness restrictions, and all subsequent uncertainty
calculations inherit that same projected geometry.
The projection is computed via get_U and, when requested,
returned in the fitted object as U through return_U = TRUE.
When the constraints are inhomogeneous
(\mathbf{A}^{\top}\boldsymbol{\beta} = \mathbf{c} with
\mathbf{c} \neq \mathbf{0}), a particular solution
\boldsymbol{\beta}_0 satisfying
\mathbf{A}^{\top}\boldsymbol{\beta}_0 = \mathbf{c} is added back
after projection, yielding the full Lagrangian solution
\mathbf{U}\hat{\boldsymbol{\beta}} + (\mathbf{I} - \mathbf{U})\boldsymbol{\beta}_0.
In lgspline and lgspline.fit, users realize
this by supplying extra equality columns in constraint_vectors
together with matching right-hand sides in constraint_values;
null_constraint provides the alternate shorthand documented in
lgspline when constraint_vectors is supplied and
constraint_values is left empty.
In practice \mathbf{U} is never explicitly formed during fitting.
The constrained estimate is obtained from a transformed OLS residual problem (the
\mathbf{G}^{1/2}\mathbf{r}^{*} trick) in four steps:
Obtain the unconstrained partition-wise
unconstrained estimate \hat{\boldsymbol{\beta}}.
Set \mathbf{y}^{*} = \mathbf{G}^{-1/2}\hat{\boldsymbol{\beta}} and
\mathbf{X}^{*} = \mathbf{G}^{1/2}\mathbf{A}.
Fit the linear model \mathbb{E}[\mathbf{y}^{*}] = \mathbf{X}^{*}\boldsymbol{\gamma}
by OLS using QR decomposition, or with the parallel_qr
cross-product route when that option is enabled.
Compute the residuals \mathbf{r}^{*} = \mathbf{y}^{*} - \mathbf{X}^{*}(\mathbf{X}^{*\top}\mathbf{X}^{*})^{-1}\mathbf{X}^{*\top}\mathbf{y}^{*} from that transformed OLS
fit and recover the constrained estimate by
\tilde{\boldsymbol{\beta}} = \mathbf{G}^{1/2}\mathbf{r}^{*}.
A scaling factor 1/\sqrt{K+1} is applied to both
\mathbf{X}^{*} and \mathbf{y}^{*} prior to the OLS call
and divided out afterward, keeping the absolute scale moderate when
the constraint matrix has many rows.
This is where parallel_qr enters most directly. Write the rows of the
transformed system as \mathbf{X}^{*}_{(1)}, \dots, \mathbf{X}^{*}_{(B)}
and \mathbf{y}^{*}_{(1)}, \dots, \mathbf{y}^{*}_{(B)}. The default
route fits Step 3 with a standard QR decomposition of the full tall matrix
\mathbf{X}^{*}. When parallel_qr = TRUE, the package instead
forms
\mathbf{M}
= \mathbf{X}^{*\top}\mathbf{X}^{*}
= \sum_{b = 1}^{B}\mathbf{X}^{*\top}_{(b)}\mathbf{X}^{*}_{(b)},
\qquad
\mathbf{u}
= \mathbf{X}^{*\top}\mathbf{y}^{*}
= \sum_{b = 1}^{B}\mathbf{X}^{*\top}_{(b)}\mathbf{y}^{*}_{(b)}
in parallel, solves the reduced dense system
\mathbf{M}\hat{\boldsymbol{\gamma}} = \mathbf{u}, and then computes
\mathbf{r}^{*} = \mathbf{y}^{*} - \mathbf{X}^{*}\hat{\boldsymbol{\gamma}}.
So the costly operation being replaced is not the Lagrangian projection
itself, but the single tall QR decomposition inside the transformed OLS
solve. Algebraically the fit is the same projection; computationally it is a
parallel reduction followed by a much smaller dense solve. If that reduced
system is judged too unstable, the code falls back to the serial
.lm.fit() or qr() route.
The same idea reappears in other equality-only solves used during tuning, active-set refinement, and constraint reduction. In each case the package replaces one tall QR or least-squares call with a chunked accumulation of cross-products, while keeping the surrounding notation and estimator unchanged.
The most expensive object in this approach remains the
P \times r transformed matrix \mathbf{X}^{*} = \mathbf{G}^{1/2}\mathbf{A},
but either route is far cheaper than working with the full
P \times P system directly.
Without correlation or SQP constraints, \mathbf{G} is stored and operated upon as a
list of K+1 small p \times p matrices rather than the full
P \times P block-diagonal, saving substantial memory when K
is large and allowing for parallelism.
When correlation is present, \mathbf{X}^{\top}\mathbf{V}^{-1}\mathbf{X}
is no longer block-diagonal, so the equality-only solve must use either
the full correlated system or, when available, the Woodbury reduction
(see .woodbury_decompose_V).
When additional inequality constraints are present, the same active-set
controller is still called first. Block structure now determines only how
each equality-only re-solve is carried out: partition-wise when available,
Woodbury on the low-rank correlated path, or full-system otherwise. Dense
solve.QP and damped SQP are retained as fallback
paths only when the active-set route is unavailable or does not converge.
For GLM responses with mean \mu_i = g^{-1}(\eta_i) and working
weights w_i = [V(\mu_i)\{g'(\mu_i)\}^2]^{-1}, the penalized
information becomes
\mathbf{G}_k^{-1} = \mathbf{X}_k^{\top}\mathbf{W}_k\mathbf{X}_k
+ \boldsymbol{\Lambda}_k with
\mathbf{W}_k = \mathrm{diag}(w_i : i \in \mathcal{P}_k).
Because \mathbf{W} is diagonal,
\mathbf{X}^{\top}\mathbf{W}\mathbf{X} remains block-diagonal and
the four-step procedure carries through with
\mathbf{G}_k = (\mathbf{X}_k^{\top}\mathbf{W}_k\mathbf{X}_k +
\boldsymbol{\Lambda}_k)^{-1} in place of
(\mathbf{X}_k^{\top}\mathbf{X}_k + \boldsymbol{\Lambda}_k)^{-1}.
Under the default glm_weight_function, the diagonal entries are
family$mu.eta(eta)^2 / family$variance(mu), optionally scaled by
observation weights.
The partition-wise unconstrained estimates are obtained by
unconstrained_fit_fxn, by default
unconstrained_fit_default, which initializes via an augmented
ridge trick (appending \boldsymbol{\Lambda}^{1/2} as
pseudo-observations to glm.fit) then refines using
damped_newton_r with nr_iterate. For
non-canonical log-link gamma regression, the keep_weighted_Lambda = TRUE
option correctly returns the augmented ridge estimate directly.
Path 1: Correlation structure present.
When a working correlation \mathbf{V} is present, such as for
marginal means models or generalized estimating equation (GEE)-like models,
\mathbf{V}^{-1} couples observations across partitions, so
partition-wise fitting is not directly available. Two sub-paths handle
this.
Path 1a (Gaussian identity + GEE) solves the whitened system
directly via
\tilde{\mathbf{G}} =
(\tilde{\mathbf{X}}^{\top}\tilde{\mathbf{X}} +
\boldsymbol{\Lambda}_{\mathrm{block}})^{-1} where
\tilde{\mathbf{X}} = \mathbf{V}^{-1/2}\mathbf{X}, then applies
the Lagrangian projection in the full P-space.
See .get_B_gee_gaussian.
Path 1b (non-Gaussian GEE) uses a damped SQP iteration on the
whitened system. Each quadratic subproblem first attempts the same
active-set refinement used elsewhere in the package; the dense
solve.QP bridge is used only as fallback.
See .get_B_gee_glm.
Both sub-paths have Woodbury-accelerated variants described below.
Path 2: Gaussian identity, no correlation.
The constrained estimate is obtained by a single Lagrangian projection
from the per-partition penalized least-squares cross-products (the four
steps in closed form). No outer iteration is needed. When K = 0 and
there are no additional constraints, this reduces to the ordinary
penalized closed form
\hat{\boldsymbol{\beta}} = \mathbf{G}\mathbf{X}^{\top}\mathbf{y}.
Implemented in .get_B_gaussian_nocorr.
Path 3: Non-Gaussian GLM, no correlation.
Unconstrained estimates are obtained separately within each partition,
then projected onto the constraint space. When the link is non-identity,
the information \mathbf{G} depends on the current fitted values
through the working weights, so the projection must be iterated. The
unconstrained anchor \hat{\boldsymbol{\beta}} is held fixed while
\mathbf{G} is recomputed at each iterate's constrained estimate:
\tilde{\boldsymbol{\beta}}^{(s+1)} =
\mathbf{U}^{(s)}\hat{\boldsymbol{\beta}}, \quad
\mathbf{U}^{(s)} = \mathbf{I} - \mathbf{G}^{(s)}\mathbf{A}
(\mathbf{A}^{\top}\mathbf{G}^{(s)}\mathbf{A})^{-1}\mathbf{A}^{\top}.
Iteration stops when the mean absolute coefficient change falls below
tol, or when the update begins increasing (in which case the
previous iterate is restored). The recomputation of weighted Gram
matrices, Schur corrections, and square-root information factors at each
step is handled by .solver_recompute_G_at_estimate. Implemented
in .get_B_glm_nocorr.
For structured correlation, write the penalized information as
\mathbf{G}^{-1}
= \mathbf{G}_{\mathrm{on}}^{-1} + \mathbf{G}_{\mathrm{off}}^{-1},
where \mathbf{G}_{\mathrm{on}}^{-1} contains the K+1
block-diagonal p \times p pieces of
\mathbf{X}^{\top}\mathbf{V}^{-1}\mathbf{X} + \boldsymbol{\Lambda},
and \mathbf{G}_{\mathrm{off}}^{-1} contains the cross-partition part.
Without correlation, \mathbf{G}_{\mathrm{off}}^{-1} = \mathbf{0}.
When the cross-partition part is low rank, the code factors it as
\mathbf{G}_{\mathrm{off}}^{-1} = \mathbf{E}\mathbf{J}\mathbf{E}^{\top},
with \mathbf{E} of size P \times r and \mathbf{J}
diagonal with entries \pm 1. Then
\mathbf{G}
= \mathbf{G}_{\mathrm{on}}
- \mathbf{G}_{\mathrm{on}}\mathbf{E}
(\mathbf{J}^{-1} + \mathbf{E}^{\top}\mathbf{G}_{\mathrm{on}}\mathbf{E})^{-1}
\mathbf{E}^{\top}\mathbf{G}_{\mathrm{on}}.
The only dense inverse is the small r \times r matrix
(\mathbf{J}^{-1} + \mathbf{E}^{\top}\mathbf{G}_{\mathrm{on}}\mathbf{E})^{-1}.
Define \mathbf{N} = \mathbf{G}_{\mathrm{on}}^{1/2}\mathbf{E} and a
thin QR decomposition \mathbf{N} = \mathbf{Q}\mathbf{R}. With
\mathbf{P} = \mathbf{Q}\mathbf{Q}^{\top} and
\mathbf{S} = \mathbf{I}_r + \mathbf{R}\mathbf{J}\mathbf{R}^{\top},
\mathbf{F}
= (\mathbf{I}_P + \mathbf{N}\mathbf{J}\mathbf{N}^{\top})^{-1}
= \mathbf{I}_P - \mathbf{P} + \mathbf{Q}\mathbf{S}^{-1}\mathbf{Q}^{\top},
so that
\mathbf{G}^{1/2}
= \mathbf{G}_{\mathrm{on}}^{1/2}\mathbf{F}^{1/2}, \qquad
\mathbf{G}^{-1/2}
= \mathbf{F}^{-1/2}\mathbf{G}_{\mathrm{on}}^{-1/2},
with
\mathbf{F}^{1/2}
= (\mathbf{I}_P - \mathbf{P}) +
\mathbf{Q}\mathbf{S}^{-1/2}\mathbf{Q}^{\top}, \qquad
\mathbf{F}^{-1/2}
= (\mathbf{I}_P - \mathbf{P}) +
\mathbf{Q}\mathbf{S}^{1/2}\mathbf{Q}^{\top}.
This preserves the same transformed OLS projection used in the independent case:
\mathbf{y}^{*} = \mathbf{G}^{1/2}\mathbf{w}, \qquad
\mathbf{X}^{*} = \mathbf{G}^{1/2}\mathbf{A}, \qquad
\tilde{\boldsymbol{\beta}} = \mathbf{G}^{1/2}\mathbf{r}^{*},
where \mathbf{w} = \mathbf{X}^{\top}\mathbf{V}^{-1}\mathbf{y}.
The full-space operations remain the QR on \mathbf{X}^{*} and a
rank-r correction through the manuscript basis \mathbf{Q}.
Implementation map. The code uses the same algebra but stores a numerically convenient representation. The correspondence is:
manuscript \mathbf{E} \leftrightarrow helper output
E;
manuscript diagonal sign matrix \mathbf{J}
\leftrightarrow helper output J;
manuscript inverse
(\mathbf{J}^{-1} + \mathbf{E}^{\top}\mathbf{G}_{\mathrm{on}}\mathbf{E})^{-1}
\leftrightarrow helper output inner_inv;
manuscript projector basis \mathbf{Q}
\leftrightarrow helper output Q;
manuscript projector \mathbf{P} = \mathbf{Q}\mathbf{Q}^{\top}
is not stored explicitly, but is represented through rank-r
multiplies involving Q;
manuscript factors \mathbf{F}^{1/2} and
\mathbf{F}^{-1/2} are stored internally as
\mathbf{I} - \mathbf{Q}\mathbf{C}\mathbf{Q}^{\top} and
\mathbf{I} + \mathbf{Q}\mathbf{C}_{\mathrm{inv}}\mathbf{Q}^{\top},
where C and C_inv are diagonal.
For the non-Gaussian Woodbury path
(.get_B_gee_glm_woodbury), the fixed correlation
perturbation \mathbf{V}^{-1} - \mathbf{I} and its product with
\mathbf{X} are precomputed once and held fixed
across Newton iterations. At each step, the weighted perturbation
\mathbf{X}^{\top}\mathbf{W}(\mathbf{V}^{-1} -
\mathbf{I})\mathbf{X} is formed using the
precomputed product, then split and re-factored via
.woodbury_redecompose_weighted.
The Woodbury path is used when r < 2P/3; otherwise the code falls
back to the dense whitened solve. If \mathbf{F} is not positive
definite, the dense path is also used.
See .get_B_gee_woodbury (Gaussian) and
.get_B_gee_glm_woodbury (non-Gaussian).
Different paths use different step-control strategies.
In the GEE paths (Paths 1a and 1b), the coefficient update is damped:
\boldsymbol{\beta}^{(s+1)} = (1 - \alpha_s)\boldsymbol{\beta}^{(s)}
+ \alpha_s\boldsymbol{\beta}_{\mathrm{cand}}^{(s)},
with \alpha_s = 2^{-d_s} and d_s increased whenever the
candidate gives non-finite or larger deviance. The loop terminates after
10 consecutive rejections or 100 total iterations, and exits early when
both coefficient change and deviance improvement fall below tol
after a burn-in period.
In Path 3 (non-Gaussian, no correlation), the outer projection loop has
no line search. The algorithm recomputes \mathbf{G} at the current
constrained estimate and applies a fresh projection. If the mean absolute
coefficient change begins increasing, the previous iterate is restored
and the loop stops. The partition-wise unconstrained estimates themselves
are obtained by damped Newton-Raphson inside
unconstrained_fit_default, where
damped_newton_r computes a Newton direction once per
iteration and halves the step size until the penalized log-likelihood
improves.
Suppose \mathrm{Cov}(\mathbf{y}) = \sigma^2\,\mathbf{V}(\boldsymbol{\theta})
for a known parametric family indexed by \boldsymbol{\theta} (e.g.,
AR(1) with \theta = \rho, Matern with
\boldsymbol{\theta} = (\ell, \nu), exchangeable with
\theta = \rho). The penalized generalized least-squares problem
becomes
\min_{\boldsymbol{\beta}}\;
(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\top}\mathbf{V}^{-1}
(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})
+ \boldsymbol{\beta}^{\top}\boldsymbol{\Lambda}\boldsymbol{\beta}
\quad \text{s.t.}\; \mathbf{A}^{\top}\boldsymbol{\beta} = \mathbf{0}.
The correlation matrix \mathbf{V} is supplied through the
fitted-object components Vhalf and VhalfInv, either
directly or via user functions Vhalf_fxn and VhalfInv_fxn.
When both are non-NULL, get_B dispatches to the GEE
paths (Path 1a or 1b). For built-in correlation structures, the required
square-root matrices are assembled numerically using
matsqrt, matinvsqrt, and invert.
Because the data are stored in partition ordering (all observations from
partition 0, then partition 1, etc.) while \mathbf{V} is in the
original observation ordering, a permutation is applied internally:
\mathbf{V}_{\mathrm{perm}}^{-1/2} =
\mathbf{V}^{-1/2}[\boldsymbol{\pi}, \boldsymbol{\pi}], where
\boldsymbol{\pi} = \texttt{unlist(order\_list)} maps original
indices to partition-ordered indices. The whitened design and response are
\tilde{\mathbf{X}} =
\mathbf{V}_{\mathrm{perm}}^{-1/2}\mathbf{X}_{\mathrm{block}} and
\tilde{\mathbf{y}} =
\mathbf{V}_{\mathrm{perm}}^{-1/2}\mathbf{y}.
The \mathbf{X} and \mathbf{y} inputs to
lgspline.fit are preserved in their unwhitened form.
Whitening is applied inside get_B and
blockfit_solve where the full N \times P
block-diagonal design is available, since applying
\mathbf{V}^{-1/2} to only the diagonal blocks of the partitioned
design would silently discard cross-partition contributions and corrupt
the Gram matrix.
Unlike the independent-errors case,
\mathbf{X}^{\top}\mathbf{V}^{-1}\mathbf{X} is not block-diagonal
because \mathbf{V}^{-1} introduces cross-partition coupling. The
unconstrained estimator
\hat{\boldsymbol{\beta}} =
(\mathbf{X}^{\top}\mathbf{V}^{-1}\mathbf{X} +
\boldsymbol{\Lambda})^{-1}
\mathbf{X}^{\top}\mathbf{V}^{-1}\mathbf{y}
requires a full P \times P solve, and the constraint projection
proceeds with
\mathbf{G} = (\mathbf{X}^{\top}\mathbf{V}^{-1}\mathbf{X} +
\boldsymbol{\Lambda})^{-1}.
For structured correlation matrices (AR(1), exchangeable, banded), the
perturbation \mathbf{V}^{-1} - \mathbf{I} is sparse and the
cross-partition coupling has low effective rank. In these cases the
Woodbury-accelerated paths
(.get_B_gee_woodbury,
.get_B_gee_glm_woodbury) recover the partition-wise
computational structure by decomposing the coupling into a
block-diagonal correction plus a low-rank remainder, as described in the
GLM Extension section. For dense or high-rank
\mathbf{V}^{-1}, the code falls back to the full whitened system
(.get_B_gee_gaussian, .get_B_gee_glm).
For non-Gaussian models with correlation, the deviance used for
convergence monitoring is computed in the whitened space by
.bf_gee_deviance. When the family supplies
custom_dev.resids, the raw deviance residuals
r_i = \mathrm{sign}(d_i)\sqrt{|d_i|} are divided by
\sqrt{w_i} and pre=dmultiplied by
\mathbf{V}_{\mathrm{perm}}^{-1/2} before squaring and averaging:
D_{\mathrm{GEE}} = \frac{1}{N}\left\|
\mathbf{V}_{\mathrm{perm}}^{-1/2}\,
\mathbf{w}^{-1/2}\mathbf{r}\right\|^{2},
where \mathbf{w} is the vector of working weights at the current
iterate, clamped below at \sqrt{\varepsilon_{\mathrm{mach}}}. When
only dev.resids is available, the function falls back to the
standard mean deviance; otherwise it uses mean squared error.
Correlation parameters \boldsymbol{\theta} are estimated by minimizing
a negative restricted log-likelihood (REML) objective. The criterion
implemented in lgspline is a central-limit-theorem-based working
approximation to a Laplace-style marginal likelihood criterion,
applied here solely to correlation structure estimation rather than
penalty parameter selection.
Let \mathbf{D} = \mathrm{diag}(d_i) be the observation weight matrix,
\tilde{\mathbf{W}} = \mathrm{diag}(\tilde{w}_i) the GLM working weight
matrix at the current fitted values, \mathbf{V} the correlation matrix
parameterized by \boldsymbol{\rho} (a vector on the unconstrained
real line), and \tilde{\sigma}^{2} the dispersion profiled at its
restricted maximum likelihood estimate. The negative REML objective
implemented in lgspline, scaled by 1/N, is
-\ell_{R}(\boldsymbol{\rho}) = \frac{1}{N}\!\left[
-\log|\mathbf{V}^{-1/2}|
+ \frac{N}{2}\log\tilde{\sigma}^{2}
+ \frac{1}{2\tilde{\sigma}^{2}}
(\mathbf{y} - \boldsymbol{\mu})^{\top}
\mathbf{D}\tilde{\mathbf{W}}^{-1}\mathbf{V}^{-1}
(\mathbf{y} - \boldsymbol{\mu})
+ \frac{1}{2}\log|(\tilde{\sigma}^{2}\mathbf{U}\mathbf{G})^{-1}|^{+}
\right],
where |\cdot|^{+} denotes the generalized determinant (product of
nonzero eigenvalues), and
\boldsymbol{\mu} = g^{-1}(\mathbf{X}\tilde{\boldsymbol{\beta}}) are
the fitted values on the response scale.
Gradients with respect to correlation parameters are available in
closed form for all built-in structures except Matern, which uses
finite-difference approximation due to the complexity of differentiating
the modified Bessel function K_{\nu} with respect to \nu. See
reml_grad_from_dV for the full gradient derivation and
notation. Custom analytic gradients can be supplied through REML_grad,
and a fully custom criterion can replace REML through
custom_VhalfInv_loss. The Toeplitz example in lgspline
demonstrates how to supply custom correlation structures with user-defined
gradient functions. Optimization over these working correlation parameters is
then carried out by the same quasi-Newton engine used elsewhere in the
package, with a finite-difference fallback when needed. In the user-facing interface, this
machinery is activated through correlation_structure,
correlation_id, spacetime, VhalfInv, Vhalf,
VhalfInv_fxn, Vhalf_fxn, VhalfInv_par_init,
REML_grad, custom_VhalfInv_loss, and
VhalfInv_logdet.
The gradient of the negative REML has three terms per parameter:
\frac{1}{2}\mathrm{tr}(\mathbf{V}^{-1}\partial\mathbf{V}/\partial\theta_j):
the log-determinant contribution.
-\frac{1}{2\tilde{\sigma}^{2}}\mathbf{r}^{\top}
(\partial\mathbf{V}/\partial\theta_j)\mathbf{r}:
the residual quadratic form contribution, where
\mathbf{r} = \mathrm{diag}(\sqrt{d_i}/\sqrt{\tilde{w}_i})
\mathbf{V}^{-1/2}(\mathbf{y} - \boldsymbol{\mu}).
-\frac{1}{2}\mathrm{tr}\!\left(\mathbf{M}^{+}\mathbf{X}_{*}^{\top}
\mathbf{V}^{-1}(\partial\mathbf{V}/\partial\theta_j)\mathbf{V}^{-1}
\tilde{\mathbf{W}}\mathbf{D}\mathbf{X}_{*}\right):
the REML correction, where
\mathbf{X}_{*} = \mathbf{X}\mathbf{U} is the constrained design and
\mathbf{M} = \mathbf{X}_{*}^{\top}\mathbf{V}^{-1}
\tilde{\mathbf{W}}\mathbf{D}\mathbf{X}_{*}
+ \mathbf{U}^{\top}\boldsymbol{\Lambda}\mathbf{U} is the projected
penalized information.
For each supported correlation family, the derivatives
\partial\mathbf{V}/\partial\theta_j are available in closed form,
enabling analytic gradient computation for use with the quasi-Newton
optimizer.
The quadratic penalty \boldsymbol{\Lambda} acts as the inverse prior
covariance of a Gaussian random effect on the spline coefficients, with
the smoothing parameter satisfying \lambda = \tau^{2}/\sigma^{2};
this is the same mixed model representation used by mgcv, and
Monte Carlo draws under the resulting Laplace-style approximation are
available via generate_posterior. For Gaussian responses
with identity link and no penalization, the implemented criterion
coincides exactly with classical REML. For non-Gaussian responses, the
criterion substitutes the Fisher information for the full penalized
log-likelihood Hessian, exploiting the CLT approximation that
\tilde{\mathbf{W}}^{-1/2}(\mathbf{y} - \boldsymbol{\mu}) is
approximately Gaussian; this yields a method-of-moments style estimator
for the correlation and variance parameters that depends only on
mean-variance relationships through \mathbf{W}, and therefore
generalizes naturally to quasi-likelihood and other settings where a
fully specified log-likelihood is unavailable. The REML correction term
\log|(\tilde{\sigma}^{2}\mathbf{U}\mathbf{G})^{-1}|^{+} uses a
generalized log-determinant so that only nonzero eigenvalues contribute
when rank deficiency arises from smoothness constraints or
identifiability conditions in \mathbf{A}. When
\boldsymbol{\Lambda} is full rank, the criterion coincides with
the exact marginal likelihood from integrating out
\boldsymbol{\beta} under its Gaussian prior; when
\boldsymbol{\Lambda} is rank-deficient, unpenalized coefficients
are projected out in the REML sense while penalized coefficients are
integrated through their prior. During penalty tuning, the block-diagonal
approximation is retained for GCV criteria and gradients; since GCV is
rotation-invariant the practical effect on automatic selection is expected
to be negligible, though this is not confirmed and the tuned penalties can
always be overridden.
The package provides several built-in correlation structures for modeling
spatial and temporal dependence. These are specified via
correlation_structure with group membership in correlation_id
and spatial or temporal coordinates in spacetime (an N-row
matrix). Exchangeable correlation does not require spacetime.
All positive scale parameters are estimated on the log scale,
with back-transform \exp(\cdot). Parameters constrained to
(0, 1) use a double-exponential back-transform of the form
\exp(-\exp(\eta)), so optimization still occurs on the
unconstrained real line while the correlation remains bounded.
Aliases: 'exchangeable', 'cs', 'CS',
'compoundsymmetric', 'compound-symmetric'.
A constant correlation \nu between any two observations within
the same cluster. Parameterization: \nu = \exp(-\exp(\rho)),
so \nu \in (0, 1). Only positive within-cluster correlation is
supported under this parameterization.
Aliases: 'spatial-exponential', 'spatialexponential',
'exp', 'exponential'.
Correlation decays exponentially with distance:
\exp(-\omega d) where d is Euclidean distance and
\omega > 0. Parameterization: \omega = \exp(\rho).
Mathematically equivalent to the power correlation \theta^{d}
with \theta = e^{-\omega}, but with better numerical properties
during optimization.
Aliases: 'ar1', 'ar(1)', 'AR(1)', 'AR1'.
Correlation depends on rank difference between observations:
\nu^{r} where r is the rank difference within cluster.
Parameterization: \nu = \exp(-\exp(\rho)),
so \nu \in (0, 1). Only positive autocorrelation is supported.
Aliases: 'gaussian', 'rbf', 'squared-exponential'.
Smooth decay with squared distance:
\exp(-d^{2}/(2\ell^{2})) where \ell is the length scale.
Parameterization: \ell = \exp(\rho).
Aliases: 'spherical', 'Spherical', 'cubic',
'sphere'.
Polynomial decay with a hard cutoff at range r:
1 - 1.5(d/r) + 0.5(d/r)^{3} for d \le r, and 0
otherwise. Parameterization: r = \exp(\rho).
Aliases: 'matern', 'Matern'.
Flexible correlation with adjustable smoothness:
(2^{1-\nu}/\Gamma(\nu))(\sqrt{2\nu}\,d/\ell)^{\nu}K_{\nu}(\sqrt{2\nu}\,d/\ell).
Two parameters: length scale \ell = \exp(\rho_1) and smoothness
\nu = \exp(\rho_2). No analytical gradient is available for
\nu due to the difficulty of differentiating the modified Bessel
function K_{\nu} with respect to its order, so finite differences
are used; this makes Matern slower and potentially less stable than
other structures.
Aliases: 'gamma-cosine', 'gammacosine',
'GammaCosine'.
Oscillatory dependence:
(d^{\alpha-1}e^{-\gamma d})/(\Gamma(\alpha)/\gamma^{\alpha})\cdot\cos(\omega d).
Three parameters: shape \alpha = \exp(\rho_1), rate
\gamma = \exp(\rho_2), frequency \omega = \exp(\rho_3).
Reduces to exponential when \alpha = 1 and \omega \approx 0.
Aliases: 'gaussian-cosine', 'gaussiancosine',
'GaussianCosine'.
Smooth oscillatory correlation:
\exp(-d^{2}/(2\ell^{2}))\cdot\cos(\omega d).
Two parameters: length scale \ell = \exp(\rho_1) and frequency
\omega = \exp(\rho_2). Reduces to Gaussian when
\omega \approx 0.
Correlation parameters are estimated on transformed scales; they must be
back-transformed for interpretation. When confint.lgspline is called and the
inverse Hessian from BFGS is available, confidence intervals are returned
on the untransformed (working) scale and should be back-transformed as
described in the examples for lgspline.
Custom correlation structures can be specified through:
VhalfInv_fxn: Creates \mathbf{V}^{-1/2}.
Vhalf_fxn: Creates \mathbf{V}^{1/2}. When omitted,
the code computes it by explicit inversion of VhalfInv.
REML_grad: Provides the analytical gradient of the REML
objective.
VhalfInv_logdet: Efficient log-determinant computation.
custom_VhalfInv_loss: Replaces the REML objective entirely.
These functions enter lgspline through
correlation_structure, VhalfInv_fxn, Vhalf_fxn,
REML_grad, and custom_VhalfInv_loss, and the fitted object
retains the resulting correlation machinery in components such as
VhalfInv_fxn, Vhalf_fxn, and
VhalfInv_params_estimates. When VhalfInv is supplied but
Vhalf is not, Vhalf is computed unconditionally as the inverse
of VhalfInv for all family/link combinations, since both
get_B and blockfit_solve require it for GEE
estimation.
Once the constrained estimate has been obtained, the next questions are how much flexibility the fitted model effectively used and how uncertainty should be propagated through the same constrained geometry. The quantities in this section are therefore all built on the projected information matrices from the previous sections.
The effective degrees of freedom is the trace of the hat matrix. In the
Gaussian identity case with observation weights and correlation, the fitted
linear operator is built from the dense GLS analogue
\mathbf{G}_{\mathrm{correct}} and can be written schematically as
\mathbf{H} =
\mathbf{V}^{-1/2}(\mathbf{W}\mathbf{D})^{1/2}
\mathbf{X}\mathbf{U}\mathbf{G}_{\mathrm{correct}}\mathbf{X}^{\top}
(\mathbf{W}\mathbf{D})^{1/2}\mathbf{V}^{-1/2},
where for Gaussian identity \mathbf{W} = \mathbf{I}. In the
no-correlation Gaussian case this reduces to the familiar
\mathbf{H} = \mathbf{X}\mathbf{U}\mathbf{G}\mathbf{X}^{\top}\mathbf{D}.
For Gaussian identity fits, the dispersion estimate is computed as a
weighted mean squared residual, optionally scaled by
N/(N - \mathrm{tr}(\mathbf{H})) when
unbias_dispersion = TRUE:
\tilde{\sigma}^{2} =
\frac{1}{N - \mathrm{tr}(\mathbf{H})}\|\mathbf{y} - \mathbf{\tilde{y}}_i \|^{2}.
More generally with weights, a correlation structure and non-linear link function:
\tilde{\sigma}^{2} =
\frac{1}{N - \mathrm{tr}(\mathbf{H})}\| \mathbf{V}^{-1/2}\mathbf{W}^{-1/2}\mathbf{D}^{1/2}(\mathbf{y} - \tilde{\mathbf{y}})\|^{2}.
This estimated dispersion is returned as sigmasq_tilde, and the
corresponding effective degrees of freedom trace is returned as
trace_XUGX. For non-Gaussian families, the fitting code delegates
dispersion estimation to dispersion_function; thus the package does
not assume a single closed-form Pearson-style formula outside the Gaussian
identity setting. The hat-matrix trace itself is assembled by
compute_trace_H in the dense correlation-aware case and by the
same blockwise products summarized by trace_XUGX in the simpler
no-correlation paths.
A concrete built-in non-Gaussian example is the Weibull AFT path, which pairs
weibull_family with weibull_dispersion_function,
weibull_glm_weight_function, and
weibull_schur_correction.
Users who want these quantities available for downstream inference should
keep estimate_dispersion = TRUE and return_varcovmat = TRUE
(the defaults), since wald_univariate,
confint.lgspline, and the prediction-standard-error path in
predict.lgspline all rely on the post-fit dispersion and
covariance components documented here.
The variance-covariance matrix of \tilde{\boldsymbol{\beta}} is
estimated as:
\mathrm{Var}(\tilde{\boldsymbol{\beta}}) = \tilde{\sigma}^{2}(\mathbf{U}\mathbf{G}^{1/2})(\mathbf{U}\mathbf{G}^{1/2})^{\top}
using the outer-product form for numerical stability. The result is returned
as varcovmat when return_varcovmat = TRUE. The algebraically
equivalent expression \tilde{\sigma}^{2}\mathbf{U}\mathbf{G} is not
used because \mathbf{G} is only positive semi-definite when the
penalty matrix \boldsymbol{\Lambda} has zero eigenvalues (e.g., the
intercept and linear terms under the smoothing spline penalty when
flat_ridge_penalty = 0), which can introduce negative diagonal
entries in finite precision arithmetic. The outer-product form also
guarantees symmetry.
This is the Bayesian posterior covariance, treating the penalty as a
Gaussian prior on the coefficients. When exact_varcovmat = TRUE,
a frequentist correction is additionally computed:
\mathrm{Var}_{\mathrm{exact}}(\tilde{\boldsymbol{\beta}}) =
\tilde{\sigma^2}\mathbf{U}\mathbf{G}^{1/2}(\mathbf{X}^{\top}\mathbf{W}\mathbf{D}\mathbf{V}^{-1}\mathbf{X})\mathbf{G}^{1/2}\mathbf{U}^{\top} =
\tilde{\sigma}^{2}\mathbf{U}\mathbf{G}\mathbf{U}^{\top}
- \tilde{\sigma}^{2}\mathbf{U}\mathbf{G}\boldsymbol{\Lambda}\mathbf{G}\mathbf{U}^{\top}.
The first term is the Bayesian posterior covariance; the second is a frequentist correction such that for Gaussian identity link (with or without correlation), this is the exact variance-covariance matrix of the constrained estimator.
When a correlation structure is present (VhalfInv non-NULL),
the block-diagonal \mathbf{G} is replaced by the full weighted GLS
analogue
\mathbf{G}_{\mathrm{correct}} =
\left(\mathbf{X}^{\top}\mathbf{W}\mathbf{D}\mathbf{V}^{-1}\mathbf{X}
+ \boldsymbol{\Lambda}\right)^{-1},
where \mathbf{W} = \mathbf{I} in the Gaussian identity case.
This dense matrix is what enters the correlation-aware \mathbf{U},
\mathrm{Var}(\tilde{\boldsymbol{\beta}}), and
\mathrm{Var}_{\mathrm{exact}}(\tilde{\boldsymbol{\beta}})
computations.
In user-facing terms, return_varcovmat controls whether this matrix is
stored at all, while exact_varcovmat switches between the default
posterior/Laplace approximation and the exact frequentist correction in the
Gaussian-identity setting. The stored covariance is what powers
wald_univariate, confint.lgspline, and
se.fit = TRUE in predict.lgspline; the
critical_value argument supplied at fit time is carried forward as the
default cutoff for those interval-producing helpers.
At the final iterate, \mathbf{G} is recomputed to reflect the
converged working weights and Schur corrections. The implementation
computes the weighted design
\mathbf{X}_{w}^{(k)} = \mathbf{X}_k \cdot \mathrm{diag}(\sqrt{\mathbf{w}_k}),
forms the weighted Gram matrix
\mathbf{X}_{w}^{(k)\top}\mathbf{X}_{w}^{(k)}, adds the Schur
correction, and performs eigendecomposition via
compute_G_eigen to obtain \mathbf{G}_k,
\mathbf{G}_k^{1/2}, and \mathbf{G}_k^{-1/2}. The relationship
\mathbf{G}_k = \mathbf{G}_k^{1/2}(\mathbf{G}_k^{1/2})^{\top} is
enforced exactly by construction, and the fitted object can retain these as
G and Ghalf when return_G = TRUE and
return_Ghalf = TRUE. The square-root factors are numerically
stabilized through the helper routines matsqrt and
matinvsqrt, which are also used elsewhere in the package when
dense analogues of \mathbf{G}^{1/2} or \mathbf{G}^{-1/2} are
required.
The penalty has a natural Gaussian-prior interpretation, so once the constrained estimator and its covariance are available, Bayesian-style posterior simulation follows almost immediately. This section records the interpretation that is already implicit in the fitted object and in the package's posterior simulation helpers.
A Bayesian interpretation follows from viewing the penalty as a Gaussian prior on the coefficients. Conditional on the fitted smoothing parameters, the code samples on the coefficient scale from
\boldsymbol{\beta}^{(m)} =
\tilde{\boldsymbol{\beta}} +
\sqrt{\tilde{\sigma}^{2}},\mathbf{U}\mathbf{G}^{1/2}\mathbf{z}^{(m)},
\qquad \mathbf{z}^{(m)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).
The coefficients are then back-transformed to the original response and predictor scales.
When inequality constraints are absent, these draws are i.i.d. Gaussian posterior draws around the fitted mode. The underlying coefficient-draw closure also contains an elliptical slice sampling route for active inequality constraints, using the same covariance factor to keep retained draws in the feasible region.
At the implementation level, standard Gaussian posterior draws may place positive mass on the infeasible region, so the constrained-draw closure instead targets the truncated posterior
\pi(\boldsymbol{\beta} \mid \mathbf{y}) \propto
\exp\!\left(-\frac{1}{2}(\boldsymbol{\beta} - \tilde{\boldsymbol{\beta}})^{\top}
\mathbf{G}^{-1}(\boldsymbol{\beta} - \tilde{\boldsymbol{\beta}})\right)
\mathbf{1}(\mathbf{C}^{\top}\boldsymbol{\beta} \succeq \mathbf{c}),
yielding credible intervals that respect the constraint boundaries.
The public generate_posterior wrapper forwards
enforce_qp_constraints to the stored constrained-draw closure, so
constrained draws can be requested directly from the user-facing interface.
When a working correlation structure is present, the companion helper
generate_posterior_correlation extends this idea by propagating
uncertainty in the fitted correlation parameters through the same
VhalfInv_fxn/Vhalf_fxn machinery described in the correlation
section, rather than conditioning only on fixed covariance parameters. Correlation
parameters are drawn from a multivariate normal distribution centered about their estimates
with the inverse approxiamte BFGS Hessian of the REML optimization problem
VhalfInv_params_vcov used by default (or a custom alternative as
supplied to the argument correlation_param_vcov_sc).
Inequality constraints of the form
\mathbf{C}^{\top}\boldsymbol{\beta} \succeq \mathbf{c} handle
shape restrictions such as monotonicity, convexity, or boundedness. In
the monomial basis, these are linear in \boldsymbol{\beta}.
Monotonicity at a grid of points requires the first derivative polynomial
to be non-negative there; convexity requires the second derivative to be
non-negative; range constraints bound the fitted values directly. All of
these translate to linear inequality constraints on the coefficient
vector.
The inequality pieces are assembled by process_qp, which
returns the qp_Amat, qp_bvec, and qp_meq objects
passed to solve.QP, together with a
quadprog flag indicating whether any inequality constraints are
active. For derivative-sign constraints, process_qp calls
.build_deriv_qp, which uses
make_derivative_matrix on the expansion-standardized
design and maps the derivative rows into the full P-dimensional
coefficient space partition by partition.
The sparsity pattern of \mathbf{C} is inspected automatically by
.detect_qp_global (equivalently
.solver_detect_qp_global). When every column of
\mathbf{C} has nonzero entries in only a single partition block,
the constraint system is block-separable and an active-set method can
replace the dense QP.
At each iteration, the active set \mathcal{A} (constraints
satisfied at equality) is appended to \mathbf{A} as additional
equality constraints:
\mathbf{A}_{\mathrm{aug}} = [\mathbf{A} \mid
\mathbf{C}_{\mathcal{A}}].
The constrained estimate is obtained by the same OLS projection with
\mathbf{A}_{\mathrm{aug}} in place of \mathbf{A}, and
since \mathbf{A}_{\mathrm{aug}} retains block-diagonal
compatibility, all operations remain partition-wise.
The active set is updated by checking primal feasibility (adding the
most-violated inactive constraint) and dual feasibility (dropping the
active constraint with the most negative Lagrange multiplier) until the
KKT conditions are satisfied. Lagrange multipliers for active
inequalities are recovered from the OLS fit used in the Lagrangian
projection: the fitted coefficients on
\mathbf{X}^* = \mathbf{G}^{1/2}\mathbf{A}_{\mathrm{aug}} give
the multipliers up to sign. Implemented in
.active_set_refine and
.check_kkt_partitionwise.
During penalty tuning, the active inequality set from the last accepted tuning evaluation is used as the next initial working set. This only changes where the add/drop loop starts; the KKT check still determines the final active set for each penalty value.
The method adds or drops one constraint at a time. It is usually fast when the final active set is small and well separated, but can slow down when many derivative constraints are nearly binding, redundant, or nearly collinear. The implementation tracks visited active sets, filters rank deficient augmented systems, and falls back to dense SQP if the active-set route does not converge within its iteration budget.
When any column of \mathbf{C} spans multiple partition blocks
(for example, cross-knot monotonicity constraints), the equality re-solve
switches to the full-system bridge rather than the partition-wise bridge.
The selection is automatic; dense SQP is retained as the fallback.
The dense SQP approach, implemented in .qp_refine,
solves a sequence of quadratic subproblems approximating the penalized
log-likelihood. At each iteration s:
Compute the information matrix
\mathbf{M}^{(s)} =
\mathbf{X}^{\top}\mathbf{W}^{(s)}\mathbf{X} +
\boldsymbol{\Lambda}_{\mathrm{block}} + \mathbf{S}^{(s)},
where \mathbf{S}^{(s)} is the Schur complement correction.
Compute the score vector via qp_score_function.
Solve the QP with solve.QP, where the
combined constraint matrix is
[\mathbf{A} \mid \mathbf{C}] with the first R columns
treated as equalities.
Apply a damped update
\boldsymbol{\beta}^{(s+1)} =
(1 - \alpha)\boldsymbol{\beta}^{(s)} +
\alpha\boldsymbol{\beta}_{\mathrm{QP}}, with
\alpha = 2^{-d} and d incremented upon deviance
increase.
A rescaling factor
\mathrm{sc} = \sqrt{\mathrm{mean}(|\mathbf{M}^{(s)}|)} is applied
to the Hessian and linear term before calling the QP solver. The
equality-constrained estimate from the Lagrangian projection serves as
a warm start. The qp_score_function defaults to the GLM score using
family$mu.eta(eta) / family$variance(mu); for custom models a
different score can be supplied. With dense VhalfInv, the same
score weight is applied after whitening.
The active set at the solution identifies binding inequality constraints.
The Lagrange multipliers quantify the cost of each: a multiplier of zero
means the constraint is not binding. The implementation stores active
constraint indices, the corresponding submatrix of the constraint matrix,
and the multiplier vector in the qp_info list returned alongside
coefficient estimates, with components lagrangian, iact,
and Amat_active. When
return_lagrange_multipliers = TRUE, the fitted object also stores
the final multiplier vector directly as lagrange_multipliers.
The qp_info$method string records which solver path produced the
final inequality-constrained estimate:
"active_set" = partition-wise active-set on the standard
block-diagonal path; "active_set_full" = active-set with
full-system equality re-solves; "active_set_woodbury" =
active-set with Woodbury equality re-solves on the correlated low-rank
path; "dense_qp_gee_gaussian" = dense QP fallback for
correlated Gaussian fits; "dense_qp_gee_glm" = dense SQP /
dense QP-subproblem fallback for correlated non-Gaussian fits.
The original assembled inequality data are retained in the fitted
object's quadprog_list component (containing qp_Amat,
qp_bvec, and qp_meq from process_qp) so
the final active set can be interpreted relative to the full
specification. The final \mathbf{U} used in constructing the
posterior variance-covariance matrix is built from both the equality
constraints and the active inequality constraints at the solution.
Built-in inequality constraints include:
Monotonicity: qp_monotonic_increase,
qp_monotonic_decrease. Enforced by requiring consecutive
fitted values to be non-decreasing (or non-increasing):
(\mathbf{x}_i - \mathbf{x}_{i-1})^{\top}\boldsymbol{\beta}
\geq 0. These are constructed by process_qp from
the partition-stacked block design reordered to observation order.
Derivative sign: qp_positive_derivative,
qp_negative_derivative. Enforced through the
first-derivative design matrix from
make_derivative_matrix. May be TRUE/FALSE
or a character/integer vector selecting specific predictors.
Second-derivative sign: qp_positive_2ndderivative,
qp_negative_2ndderivative. Same construction using the
second-derivative design matrix.
Response range: qp_range_lower, qp_range_upper.
Bounds on the linear predictor; for non-identity links, the bounds
are transformed to the link scale inside process_qp.
Custom constraints via qp_Amat_fxn, qp_bvec_fxn,
and qp_meq_fxn, which receive the design matrix structure
and return the constraint matrix, bound vector, and number of
equalities. These are commonly paired with a custom
qp_score_function when the quadratic approximation uses a
non-default likelihood. The low-level objects qp_Amat,
qp_bvec, and qp_meq remain documented in
lgspline but in the current implementation serve as
activation markers rather than being merged into the constraint set
assembled by process_qp.
Built-in constraints can be thinned via qp_observations. A single
numeric vector applies the same subset to every active built-in
constraint, while a named list keyed by "var:qp_<type>" or bare
"qp_<type>" lets different constraint types use different row
subsets. For range and monotonicity, matching keyed entries are unioned;
derivative entries dispatch per variable. Unknown keys are ignored with a
warning when include_warnings = TRUE.
After assembly, process_qp can also stabilize the partition-local
inequality blocks by QR pivoting them before the final QP solve. When
qr_pivot_inequality_constraints = TRUE, each partition-local block
of \mathbf{C} is scanned for linearly dependent columns and only the
pivot columns are carried forward, while the leading equality columns are
left untouched. If parallel_qr_qp = TRUE, those partition-wise QR
pivots are farmed out across workers in the same style as the package's
other parallel QR reductions. Built-in range and monotonicity constraints
are intentionally excluded from this reduction. For range bounds, keeping
the full block preserves the direct user-facing lower/upper-value meaning;
for monotonicity, the consecutive differences are defined in observation
order and can bridge neighboring partitions even when some individual
difference columns happen to sit inside one partition block.
When a model contains both spline terms (receiving K+1
partition-specific coefficient vectors constrained to smoothness) and
non-interactive linear terms (“flat” terms, specified via
just_linear_without_interactions, receiving a single shared
coefficient vector \mathbf{v} across all partitions), the standard
solver carries K+1 copies of \mathbf{v} linked by equality
constraints. Backfitting avoids this inflation by solving a
lower-dimensional problem at each step. Write the partition-k
design as
\mathbf{X}_k = [\mathbf{Z}_k \mid
\mathbf{X}_{\mathrm{flat}}^{(k)}], where \mathbf{Z}_k contains
the spline columns and \mathbf{X}_{\mathrm{flat}}^{(k)} the flat
columns. This is invoked when blockfit = TRUE, flat columns are
non-empty, and K > 0.
The design, penalty, and constraint matrices are split into spline and
flat components by .bf_split_components, which extracts
spline rows from \mathbf{A}, drops null columns, rank-reduces via
QR, and detects mixed constraints (columns of \mathbf{A} with
nonzero entries on both spline and flat rows).
Spline step. Holding \mathbf{v} fixed, the code forms
the adjusted response
\mathbf{y}_k - \mathbf{X}_{\mathrm{flat}}^{(k)}\mathbf{v} and
applies the spline-only Lagrangian projection via
.bf_lagrangian_project:
\tilde{\boldsymbol{\beta}}_{\mathrm{spline}}^{(k)} =
\mathbf{U}_{\mathrm{spline}}\mathbf{G}_{\mathrm{spline}}
\mathbf{Z}_k^{\top}(\mathbf{y}_k -
\mathbf{X}_{\mathrm{flat}}^{(k)}\mathbf{v}).
Flat step. Holding spline coefficients fixed, the shared flat vector is updated by pooled penalized regression on residuals:
\mathbf{v} = \left(\sum_{k=0}^{K}
\mathbf{X}_{\mathrm{flat}}^{(k)\top}
\mathbf{X}_{\mathrm{flat}}^{(k)} +
\boldsymbol{\Lambda}_{\mathrm{flat}}\right)^{-1}
\sum_{k=0}^{K}
\mathbf{X}_{\mathrm{flat}}^{(k)\top}
(\mathbf{y}_k - \mathbf{Z}_k
\tilde{\boldsymbol{\beta}}_{\mathrm{spline}}^{(k)}).
When the constraint matrix \mathbf{A} has mixed columns (nonzero
entries on both spline and flat rows), the flat update instead solves a
KKT system enforcing the residual equality constraint
\mathbf{A}_{\mathrm{flat}}^{\top}\mathbf{v} = \mathbf{c} -
\mathbf{A}_{\mathrm{spline}}^{\top}
\tilde{\boldsymbol{\beta}}_{\mathrm{spline}}
via .bf_constrained_flat_update.
Convergence is checked using the maximum absolute change across both spline and flat coefficients. In the weighted inner loop used by the GLM solvers, the flat-block change alone determines stopping.
blockfit_solve dispatches to one of four paths.
Case (a): Gaussian identity + GEE
(.bf_case_gauss_gee).
Whitening destroys block-diagonal structure, so the code skips
backfitting and performs the same full-system Gaussian GEE projection
used by get_B Path 1a. The result is split back into
spline and flat components for downstream assembly.
Case (b): Gaussian identity, no correlation
(.bf_case_gauss_no_corr).
Standard block-coordinate descent as described above. The spline-only
\mathbf{G}_{\mathrm{spline}} factors are precomputed once, and the
pooled flat penalized inverse is reused across iterations.
Case (c): GLM + GEE
(.bf_case_glm_gee).
Two stages. Stage 1 forms a warm start by running damped Newton-Raphson
on the unwhitened working response: each outer iteration computes
working responses and weights at the current linear predictor, then
calls .bf_inner_weighted for the inner backfitting loop.
Stage 2 refines the warm start on the full whitened system using the
damped SQP loop (.bf_sqp_loop), replicating the approach
used by .get_B_gee_glm.
Case (d): GLM without GEE
(.bf_case_glm_no_corr).
A damped Newton-Raphson outer loop updates working responses and weights,
while each inner iteration alternates between weighted spline and
weighted flat updates via .bf_inner_weighted. Deviance is
monitored across outer iterations for convergence and damping.
Because flat coefficients are shared by construction, the corresponding
equality constraints are satisfied exactly. Smoothness constraints on the
spline block are enforced by the spline-only Lagrangian projection.
After backfitting convergence, inequality constraints use the same
active-set-first refinement used by get_B. Block-separable
constraints use partition-wise equality re-solves through
.active_set_refine; cross-partition constraints use the
full-system bridge; and GEE paths use the whitened system, with
Woodbury equality re-solves when the low-rank gate succeeds. Dense SQP
through .bf_sqp_loop remains the fallback when active-set
refinement does not converge.
After convergence, the shared flat vector \mathbf{v} is copied
into each partition's coefficient vector, yielding
\boldsymbol{\beta}_k =
[\tilde{\boldsymbol{\beta}}_{\mathrm{spline}}^{(k)\top},
\mathbf{v}^{\top}]^{\top} for compatibility with downstream inference.
If blockfit_solve throws an error, a warning is issued
and the code falls back to get_B.
The topic of knot selection is not the main focus of the package, but the partition structure is central because every later design matrix, penalty, and smoothness constraint depends on it. The defaults in lgspline are therefore meant to be practical and transparent rather than theoretically final.
For a single predictor, the default partitioning is now handled by
make_partitions in the same k-means framework used more
generally: K+1 centers are fit on an internally standardized copy
of the predictor, controlled by standardize_predictors_for_knots, and
then returned on the raw scale. Custom knots can still be supplied via
custom_knots, in which case partition assignment is built directly
from those raw-scale breakpoints. The default number of knots K is
chosen adaptively based on N, p, q, and the GLM family.
For multivariate fits, the resulting partition metadata are returned as
make_partition_list and can be re-used in later calls to
lgspline. This is particularly useful when one wants to hold
the partition geometry fixed across repeated fits, for example while varying
penalties, families, or correlation structures.
For multiple predictors, K+1 cluster centers are identified by
k-means on an internally standardized predictor matrix via
make_partitions. This is the partitioning mechanism used to
determine the multivariate spline regions; see MacQueen (1967) for the
classical clustering formulation and Kisi et al. (2025) for a recent applied
example of k-means-driven partitioning in a nonlinear prediction
setting. Midpoints between neighboring centers
(those whose midpoint does not fall into a third cluster) serve as knot
locations. Observations are assigned to the nearest cluster center using
get.knnx, and the returned centers and knots are on
the original predictor scale. The resulting partition structure is a type
of Voronoi diagram and is stored in the fitted object as
make_partition_list. The do_not_cluster_on_these argument can
exclude certain predictors from clustering (e.g., a treatment indicator that
should not drive partitioning). The lower-level clustering behavior can be
further controlled by cluster_args and neighbor_tolerance,
while cluster_on_indicators determines whether binary predictors are
allowed to influence the partition geometry at all.
Higher-order polynomial terms can dramatically inflate or deflate the
magnitude of basis expansions, introducing numerical instability. All
polynomial basis expansions are scaled by
q_{0.69} - q_{0.31}, where q_{\zeta} is the \zeta-th
quantile of the expansion. For a standard normal distribution this quantity
is approximately 1, so the scaling is close to one standard deviation for
symmetric distributions. This fitting-stage rescaling is controlled by
standardize_expansions_for_fitting, while knot construction is
controlled separately by standardize_predictors_for_knots. The same
scaling is applied to the constraint matrix to maintain smoothness, and
coefficients are back-transformed to the original scale after fitting.
The penalty matrix \boldsymbol{\Lambda}_s penalizes the integrated
squared total curvature of the fitted function over the observed predictor
ranges. This is the step that makes the piecewise polynomial fit genuinely
behave like a smoothing spline rather than merely a constrained regression
spline. The package computes this penalty directly from the monomial
structure of the basis rather than by appealing to a pre-tabulated spline
basis. For a single partition k with basis expansion
\mathbf{x}_k = (\phi_1(\mathbf{t}), \ldots, \phi_p(\mathbf{t}))^{\top}
where each \phi_i(\mathbf{t}) = \prod_{j=1}^{q} t_j^{\alpha_{ij}}
is a multivariate monomial:
\boldsymbol{\beta}_k^{\top}\boldsymbol{\Lambda}_s\boldsymbol{\beta}_k
= \int_{\mathbf{a}}^{\mathbf{b}}
\|\tilde{f}_k''(\mathbf{t})\|^{2}\,d\mathbf{t},
where \mathbf{a} and \mathbf{b} are the observed predictor
minimums and maximums (computed globally from the data, not
partition-specific), and
\tilde{f}_k(\mathbf{t}) = \mathbf{x}_k^{\top}\boldsymbol{\beta}_k
is the fitted function for partition \mathcal{P}_k.
Total curvature operator.
The integrated squared second derivative decomposes into q
curvature operators, one per predictor. For predictor v, the
curvature operator D_v is defined as
D_v = \frac{\partial^{2}}{\partial t_v^{2}}
+ \sum_{s \neq v}\frac{\partial^{2}}{\partial t_v\,\partial t_s}.
That is, D_v captures both the pure second derivative with respect
to t_v and all mixed second partial derivatives involving t_v.
The penalty matrix entries are then
[\boldsymbol{\Lambda}_s]_{ij}
= \sum_{v=1}^{q}\int_{\mathbf{a}}^{\mathbf{b}}
D_v(\phi_i)\,D_v(\phi_j)\,d\mathbf{t}.
Monomial derivative rule.
For a monomial \phi(\mathbf{t}) = \prod_j t_j^{\alpha_j}, the
derivatives entering D_v have closed forms. The pure second
derivative is
\frac{\partial^{2}}{\partial t_v^{2}}\prod_j t_j^{\alpha_j}
= \alpha_v(\alpha_v - 1)\,t_v^{\alpha_v - 2}\prod_{j \neq v} t_j^{\alpha_j},
which is zero when \alpha_v < 2. The mixed second derivative is
\frac{\partial^{2}}{\partial t_v\,\partial t_s}\prod_j t_j^{\alpha_j}
= \alpha_v\alpha_s\,t_v^{\alpha_v - 1}t_s^{\alpha_s - 1}
\prod_{j \neq v,s} t_j^{\alpha_j},
which is zero when \alpha_v < 1 or \alpha_s < 1. Applying
D_v to a monomial \phi_i produces a sum of monomials with
known coefficients and exponent vectors.
Factorized integration.
Because every D_v(\phi_i) is polynomial, the product
D_v(\phi_i)\,D_v(\phi_j) is also polynomial and the multivariate
integral factorizes over predictors:
\int_{\mathbf{a}}^{\mathbf{b}}\prod_{j=1}^{q} t_j^{e_j}\,d\mathbf{t}
= \prod_{j=1}^{q}\frac{b_j^{e_j+1} - a_j^{e_j+1}}{e_j + 1}.
The integral runs over all q predictor ranges, including predictors
absent from the integrand, for which e_j = 0 and the factor is
b_j - a_j.
Single-predictor verification.
For q = 1 with expansion
\mathbf{x} = (1, t, t^{2}, t^{3})^{\top} on [a, b], the
curvature operator reduces to D_1 = \partial^{2}/\partial t^{2} (no
mixed partials exist), and the penalty matrix reduces to
\boldsymbol{\Lambda}_s
= \int_a^b \mathbf{x}''\mathbf{x}''^{\top}\,dt
= \begin{pmatrix}
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 4(b - a) & 6(b^{2} - a^{2}) \\
0 & 0 & 6(b^{2} - a^{2}) & 12(b^{3} - a^{3})
\end{pmatrix},
equivalent to the classical cubic smoothing spline penalty originally proposed by Reinsch.
Handling of non-spline predictors.
Predictors specified via just_linear_without_interactions or
just_linear_with_interactions do not receive higher-order
polynomial expansions in the design matrix. To ensure their curvature
contributions are still correctly computed (particularly through
interaction terms), the implementation temporarily appends phantom
higher-order columns (with zero data) for these predictors, computes
the full curvature penalty on the augmented basis, and then subsets the
result back to the original p \times p dimensions. This ensures
that interaction terms involving non-spline predictors receive appropriate
penalty contributions without affecting the rest of the estimation
pipeline.
Parallel computation.
Because the total penalty is an additive sum over predictors
(\boldsymbol{\Lambda}_s = \sum_{v=1}^{q}\boldsymbol{\Lambda}_{s,v}),
the computation can be parallelized by distributing the per-predictor
curvature matrices across workers via parallel::parLapply and
summing the results. This is controlled by the parallel_penalty
argument and is beneficial when q is large.
The penalty is computed by get_2ndDerivPenalty (single
predictor or subset) and get_2ndDerivPenalty_wrapper
(full assembly with optional parallelism and non-spline handling).
Because the smoothing penalty has zero eigenvalues for the intercept and linear terms (whose second derivatives vanish), an optional ridge penalty on lower-order terms is added for computational stability.
The full penalty block for partition k is:
\boldsymbol{\Lambda}_k
= \lambda_w\Bigl(\boldsymbol{\Lambda}_s +
\lambda_r\boldsymbol{\Lambda}_r +
\sum_{l=1}^{L}\lambda_{l,k}\boldsymbol{\Lambda}_{l,k}\Bigr),
and the full penalty matrix is
\boldsymbol{\Lambda}
= \mathrm{blockdiag}(\boldsymbol{\Lambda}_0, \ldots,
\boldsymbol{\Lambda}_K).
Here \boldsymbol{\Lambda}_s is the integrated curvature penalty,
\boldsymbol{\Lambda}_r is the ridge penalty on the null space, and
\boldsymbol{\Lambda}_{l,k} denotes any additional penalty matrix
attached to partition k. The tuning scalars are
\lambda_w (wiggle_penalty), \lambda_r
(flat_ridge_penalty), and \lambda_{l,k}
(represented through predictor_penalties and
partition_penalties in the current interface). Internally, the
package stores these components with implementation labels such as
L1, L2, L_predictor_list, and
L_partition_list. This assembly is handled by compute_Lambda.
The penalty matrix \boldsymbol{\Lambda} is stored as a list
of K+1 p \times p square, symmetric, positive semi-definite
matrices.
After the structural pieces of the model are fixed, the main remaining
question is how much smoothing to apply. In lgspline, that tuning is
performed with exact leave-one-out by default, or with generalized
cross-validation when tuning_criterion = "gcv". In either case, the
criterion is evaluated using the same constrained estimator that will be
used in the final model fit.
Penalty parameters are estimated on the log scale via exponential
parameterization (\lambda = \exp(\theta),
\theta \in \mathbb{R}), ensuring positivity. The chain rule factor
\partial\exp(\theta)/\partial\theta = \exp(\theta) = \lambda is
applied throughout. User-facing arguments (initial_wiggle,
initial_flat, predictor_penalties,
partition_penalties) accept values on the raw, natural scale;
conversion to log scale is handled internally. The final tuned values and
assembled penalty pieces are returned in the fitted object's
penalties component.
With tuning_criterion = "loo", the criterion is
\mathrm{LOO}
= \frac{1}{N}\sum_{i=1}^{N}\left(\frac{r_i}{1-h_{ii}}\right)^2,
where h_{ii} is the diagonal of the tuning hat matrix. For Gaussian
identity-link tuning with fixed transformed design, this is exact. With
correlation structure and/or non-identity links, lgspline applies the
same transformed or linearized tuning problem already used by GCV, so the
leave-one-out calculation is carried out on that working scale. In empirical
diagnostics, the criterion itself behaves stably, while the fully analytic
derivative of the observation-wise leverage term can be numerically delicate
in some problems; when that sensitivity matters in practice, the finite-
difference optimizer path (use_custom_bfgs = FALSE) remains
available. This exact observation-wise calculation is also the main reason
LOO tuning becomes more expensive than GCV as N grows; in routine use,
tuning_criterion = "gcv" is often the more practical choice once the
sample size is above about 250,000, although this varies by data set.
With tuning_criterion = "gcv", the package instead uses
\mathrm{GCV}_{u,\gamma}
= \frac{\sum_{i=1}^{N}D_{ii}\,r_i^{2}}
{N(1 - \gamma\bar{W})^{2}}
where \bar{W} = \mathrm{tr}(\mathbf{H})/N is the mean of the
hat-matrix diagonal. The symbol \gamma is set by
gcv_gamma and controls optional inflation of the
effective-degrees-of-freedom term.
For identity link, r_i = y_i - \hat{\eta}_i. For non-identity links,
the residuals are
r_i = g((y_i + \delta)/(1+2\delta)) - (\hat{\eta}_i + \delta)/(1+2\delta),
where \delta \geq 0 is a pseudocount that stabilizes the link
transformation, automatically tuned within tune_Lambda if not supplied
to delta.
Non-Gaussian families with observation weights
\omega_i have their residuals scaled by \omega_i. When the
family provides a custom deviance residual function, that function is used
in place of the link-scale residuals.
Pseudocount selection. The pseudocount \delta is chosen to
make the transformed response distribution most closely approximate a
t-distribution with N-1 degrees of freedom, in the sense of
minimizing the (optionally weighted) mean absolute deviation between the
sorted standardized transformed responses and the corresponding
t-quantiles. This is solved via Brent's method over
[10^{-64}, 1]. When the link is identity, or when the response
naturally lies in the domain of the link function, \delta = 0.
This behavior is exposed through the delta argument in
lgspline: supplying a fixed numeric value bypasses the internal
search, while leaving it NULL allows the tuning code to choose the
stabilizing pseudocount automatically when needed.
Meta-penalty regularization. A regularization term pulls the predictor- and partition-specific penalty parameters toward 1 on the raw scale:
P_{\mathrm{meta}}(\lambda_w, \lambda_{l,k})
= \frac{1}{2}c_{\mathrm{meta}}\sum_k\sum_l(\lambda_{l,k} - 1)^{2}
+ \frac{1}{2}\cdot 10^{-32}(\lambda_w - 1)^{2}
where c_{\mathrm{meta}} is a user-specified coefficient
(meta_penalty). The gradient of P_{\mathrm{meta}} on the log
scale, incorporating the exp chain rule, is
\partial P_{\mathrm{meta}}/\partial\theta_{l,k} =
c_{\mathrm{meta}}(\lambda_{l,k} - 1)\lambda_{l,k}
and
\partial P_{\mathrm{meta}}/\partial\theta_1 = 10^{-32}(\lambda_w - 1)\lambda_w.
The total objective is the selected tuning criterion plus
P_{\mathrm{meta}}.
Let \ell denote the selected tuning criterion and
\boldsymbol{\Lambda}_k =
\lambda_w(\boldsymbol{\Lambda}_s +
\lambda_r\boldsymbol{\Lambda}_r +
\sum_l\lambda_{l,k}\boldsymbol{\Lambda}_{l,k}).
The analytic gradient differentiates \ell through
\mathbf{G}, \mathbf{U}, and \mathbf{H} once, giving the
partition-level derivative
\mathbf{M}_k = \partial\ell/\partial\boldsymbol{\Lambda}_k. The
penalty derivatives then follow directly:
\frac{\partial\ell}{\partial\lambda_w}
= \frac{1}{\lambda_w}\sum_{k=0}^{K}
\mathrm{tr}(\mathbf{M}_k\boldsymbol{\Lambda}_k),\qquad
\frac{\partial\ell}{\partial\lambda_r}
= \lambda_w\sum_{k=0}^{K}
\mathrm{tr}(\mathbf{M}_k\boldsymbol{\Lambda}_r),
and
\frac{\partial\ell}{\partial\lambda_{l,k}}
= \lambda_w\,\mathrm{tr}
(\mathbf{M}_k\boldsymbol{\Lambda}_{l,k}).
Shared predictor penalties use the corresponding sum over partitions. Since
optimization is on \theta_j = \log\lambda_j, each derivative is
multiplied by \lambda_j. Thus, after
\mathbf{M}_k is available, each additional penalty gradient costs
only a blockwise trace product.
For exact leave-one-out, \ell is the mean squared PRESS residual on
the transformed problem, using blockwise constrained leverages rather than
forming the full hat matrix. For GCV, \ell uses the usual
trace-based denominator, optionally inflated by gcv_gamma.
Grid search initialization. The selected tuning criterion is
evaluated over a grid of candidate values for
(\lambda_w, \lambda_r) on the log scale. All combinations of
user-supplied candidate vectors (initial_wiggle and
initial_flat) are formed, and the combination yielding
the smallest finite criterion value is selected as the starting
point for BFGS optimization. Grid points producing non-finite
objective values are discarded. If all grid points fail, an error
is raised advising the user to check the data or adjust the grid.
When parallel_grideval = TRUE and a cluster is supplied,
these candidate fits are spread across workers. If more than six
workers are available, additional raw-scale penalty pairs are drawn
from [\min/10,\max\times 10] in each direction before the
starting point is chosen.
Damped BFGS optimizer. A custom damped BFGS quasi-Newton optimizer,
implemented in tune_Lambda through
.damped_bfgs(), minimizes the selected tuning criterion plus
P_{\mathrm{meta}}. When analytic gradients are not preferred, or when
the exact LOO leverage derivative appears too noisy for a given problem, a
base-R optim call with method "BFGS" provides
the finite-difference fallback. When parallel_bfgs = TRUE and a
cluster is supplied, the damping trial values are evaluated as a batch
across workers, and the best improving candidate is retained.
Iterations 1-2: steepest descent. The first two iterations use
steepest descent with a damping factor \alpha:
\boldsymbol{\phi}^{(t+1)} = \boldsymbol{\phi}^{(t)} - \alpha\nabla_{\boldsymbol{\phi}}.
Iterations 3+: BFGS. From iteration 3, an inverse Hessian
approximation \mathbf{K}^{(t)} is maintained via the standard secant
update. Let
\mathbf{s}^{(t)} = \boldsymbol{\phi}^{(t)} - \boldsymbol{\phi}^{(t-1)}
and
\mathbf{v}^{(t)} = \nabla^{(t)} - \nabla^{(t-1)}. The BFGS update
is:
\mathbf{K}^{(t+1)}
= (\mathbf{I} - \mathbf{u}\mathbf{s}\mathbf{v}^{\top})
\mathbf{K}^{(t)}
(\mathbf{I} - \mathbf{u}\mathbf{v}\mathbf{s}^{\top})
+ \mathbf{u}\mathbf{s}\mathbf{s}^{\top},
\qquad \mathbf{u} = (\mathbf{v}^{\top}\mathbf{s})^{-1}.
When |\mathbf{v}^{\top}\mathbf{s}| < 10^{-64}, the approximation is
reset to \mathbf{I} and the iteration is flagged for restart. The
search direction is
\mathbf{d}^{(t)} = -\mathbf{K}^{(t)}\nabla^{(t)}.
Step acceptance. A step is accepted if the new tuning criterion value
is no larger than the old one.
On rejection, \alpha is halved. If \alpha < 2^{-10} (early
iterations) or \alpha < 2^{-12} (later iterations), the optimizer
terminates with the best solution found.
With inequality constraints, the active set from the last accepted tuning
evaluation initializes the next coefficient solve, avoiding repeated
reconstruction of the same working set for nearby penalty values.
Convergence. After at least 10 iterations, the optimizer terminates
when the absolute change in the tuning criterion is less than
\epsilon, when
\|\boldsymbol{\phi}^{(t)} - \boldsymbol{\phi}^{(t-1)}\|_{\infty} < \epsilon,
or when accepted criterion changes remain below a small scale-aware
plateau threshold for several iterations. The plateau threshold is
\max\{\epsilon,\min(10^{-5},10^{-5}|\ell_{\mathrm{best}}|)\}.
Alternative. A base-R
optim call with method "BFGS" and
finite-difference gradients can be used instead via
use_custom_bfgs = FALSE.
Post-optimization sample-size adjustment. After optimization, the
tuned penalty parameters are divided by (N+1)/(N-1)
(equivalently multiplied by (N-1)/(N+1)) for both GCV- and
LOO-based tuning. This decreases the final penalties at small sample sizes.
The tuning loop is implemented in tune_Lambda.
Fixed effects and spline effects share the same partition-wise polynomial
bookkeeping. Predictors included only linearly via
just_linear_without_interactions or
just_linear_with_interactions remain structurally linear. The
first-derivative constraint then keeps their slope constant across
partitions, while spline-expanded predictors remain free to vary
nonlinearly.
When blockfit = TRUE is specified alongside
just_linear_without_interactions, the flat-block path provides an
alternative enforcement mechanism: flat coefficients are pooled across
partitions during backfitting. The point estimate agrees with constraint
projection, but uncertainty quantification differs; see the Blockfit section.
Because each partition retains an explicit polynomial representation, numerical integration can be carried out directly.
The user-facing method integrate.lgspline applies
Gauss-Legendre quadrature to predictions from predict.lgspline.
For a rectangular domain, integrate.lgspline constructs a
tensor-product grid of Gauss-Legendre nodes, evaluates the fitted model, and
forms the weighted sum. The fitted partition structure is handled internally.
The vars argument selects which predictors are integrated over.
Predictors not listed in vars are held fixed at
initial_values when supplied, or otherwise at the midpoint of their
observed training range. The optional B_predict argument makes it
possible to integrate posterior draws or other alternate coefficient sets,
and n_quad controls the number of Gauss-Legendre nodes used per
integrated dimension.
Integration is performed on the response scale by default. Setting
link_scale = TRUE instead integrates the linear predictor
\eta = f(\mathbf{t}). For identity-link Gaussian models the two scales
coincide.
When return_lagrange_multipliers = TRUE, the multiplier vector
\boldsymbol{\lambda} = (\mathbf{A}^{\top}\mathbf{G}\mathbf{A})^{-1}\mathbf{A}^{\top}\hat{\boldsymbol{\beta}}
is returned. These quantify the sensitivity of the penalized objective to
relaxing each smoothness or user-supplied equality constraint. When
constraint target values are nonzero
(\mathbf{A}^{\top}\boldsymbol{\beta}_0 \neq \mathbf{0}), the modified
formulation is used:
\boldsymbol{\lambda} = (\mathbf{A}^{\top}\mathbf{G}\mathbf{A})^{-1}\mathbf{A}^{\top}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0)
where \mathbf{A}^{\top}\boldsymbol{\beta}_0 is the vector of
constraint target values. Multipliers are NULL when no constraints
are active (\mathbf{A} is NULL or K = 0).
For inequality constraints, multipliers are returned as computed by
solve.QP. The Lagrange multipliers for active
inequality constraints can be used diagnostically to identify which shape
constraints are most costly in terms of goodness of fit.
Standard S3 methods are provided for objects of class lgspline:
print.lgspline and summary.lgspline:
Provide concise model summaries, with
print.summary.lgspline formatting coefficient tables in a
familiar regression-style layout.
logLik.lgspline: Returns a standard logLik
object. For Gaussian responses with identity link, the exact
log-likelihood is computed. When a correlation structure is present via
VhalfInv, the log-likelihood includes the
\log|\mathbf{V}^{-1/2}| adjustment and the corresponding
whitened quadratic form. For other families, the method falls back to
family$aic or a deviance-based approximation. An
include_prior argument (default TRUE) optionally adds
the Gaussian prior penalty interpretation of the smoothing spline
penalty
-\frac{1}{2\sigma^{2}}\tilde{\boldsymbol{\beta}}^{\top}\boldsymbol{\Lambda}\tilde{\boldsymbol{\beta}}
to obtain a penalized MAP log-likelihood. Alternate coefficient lists
and fixed dispersions can be supplied through B_predict and
sigmasq_predict.
predict.lgspline: Produces fitted values and related
quantities (e.g., derivatives and standard errors through
se.fit = TRUE), lets new_predictors override
newdata, accepts alternate coefficient lists through
B_predict, and supports prediction on new predictor matrices
consistent with the original spline expansions.
coef.lgspline: Extracts partition-specific
coefficient vectors.
confint.lgspline: Extracts confidence intervals.
When the inverse Hessian from BFGS optimization is available for
correlation parameters, intervals for those correlation parameters are
returned on the working (transformed) scale and should be
back-transformed as described in the correlation section.
plot.lgspline: For one-dimensional fits, produces
base R graphics showing the fitted function (with optional partition-wise
formulas) and supports overlay via add = TRUE. For two or more
predictors, an interactive plotly-based visualization is
returned. Specific predictors may be selected via vars.
integrate.lgspline: Computes definite integrals of
the fitted surface over rectangular domains by Gauss-Legendre
quadrature.
Additional user-facing helpers include wald_univariate for
coefficient-wise Wald inference, generate_posterior for
posterior and posterior-predictive sampling,
generate_posterior_correlation for correlation-aware posterior
simulation, equation for closed-form display of the fitted
partition formulas, and
find_extremum for optimizing the fitted surface or a custom
acquisition function built from it.
Buse, A. and Lim, L. (1977). Cubic Splines as a Special Case of Restricted Least Squares. Journal of the American Statistical Association, 72, 64-68.
Eilers, P. H. and Marx, B. D. (1996). Flexible Smoothing with B-splines and Penalties. Statistical Science, 11(2), 89-121.
Ezhov, N., Neitzel, F. and Petrovic, S. (2018). Spline Approximation, Part 1: Basic Methodology. Journal of Applied Geodesy, 12(2), 139-155.
Goldfarb, D. and Idnani, A. (1983). A Numerically Stable Dual Method for Solving Strictly Convex Quadratic Programs. Mathematical Programming, 27(1), 1-33.
Harville, D. A. (1977). Maximum Likelihood Approaches to Variance Component Estimation and to Related Problems. Journal of the American Statistical Association, 72(358), 320-338.
Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Chapman & Hall/CRC.
Kisi, O., Heddam, S., Parmar, K. S., Petroselli, A., Kulls, C. and Zounemat-Kermani, M. (2025). Integration of Gaussian Process Regression and K Means Clustering for Enhanced Short Term Rainfall Runoff Modeling. Scientific Reports, 15, 7444.
MacQueen, J. B. (1967). Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, 281-297. University of California Press.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman & Hall, 2nd edition.
Murray, I., Adams, R. P. and MacKay, D. J. C. (2010). Elliptical Slice Sampling. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 9, 541-548.
Nocedal, J. and Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer.
Kim, Y.-J. and Gu, C. (2004). Smoothing Spline Gaussian Regression: More Scalable Computation via Efficient Approximation. Journal of the Royal Statistical Society: Series B, 66(2), 337-356.
Patterson, H. D. and Thompson, R. (1971). Recovery of Inter-Block Information When Block Sizes Are Unequal. Biometrika, 58, 545-554.
Pya, N. and Wood, S. N. (2015). Shape Constrained Additive Models. Statistics and Computing, 25(3), 543-559.
Reinsch, C. H. (1967). Smoothing by Spline Functions. Numerische Mathematik, 10, 177-183.
Ruppert, D., Wand, M. P. and Carroll, R. J. (2003). Semiparametric Regression. Cambridge University Press.
Searle, S. R., Casella, G. and McCulloch, C. E. (2006). Variance Components. Wiley.
Wahba, G. (1990). Spline Models for Observational Data. SIAM.
Wood, S. N. (2006). On Confidence Intervals for Generalized Additive Models Based on Penalized Regression Splines. Australian & New Zealand Journal of Statistics, 48(4), 445-464.
Wood, S. N. (2011). Fast Stable Restricted Maximum Likelihood and Marginal Likelihood Estimation of Semiparametric Generalized Linear Models. Journal of the Royal Statistical Society: Series B, 73(1), 3-36.
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R. CRC Press, 2nd edition.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.