# bigspline: Fits Smoothing Spline In taylerablake/thin-plate-splines: Smoothing Splines for Large Samples

## Description

Given a real-valued response vector \mathbf{y}=\{y_{i}\}_{n\times1} and a real-valued predictor vector \mathbf{x}=\{x_{i}\}_{n\times 1} with a ≤q x_{i} ≤q b \ \forall i, a smoothing spline model has the form

y_{i}=η(x_{i})+e_{i}

where y_{i} is the i-th observation's respone, x_{i} is the i-th observation's predictor, η is an unknown smooth function relating the response and predictor, and e_{i}\sim\mathrm{N}(0,σ^{2}) is iid Gaussian error.

## Usage

 1 2 3 bigspline(x,y,type="cub",nknots=30,rparm=0.01,xmin=min(x), xmax=max(x),alpha=1,lambdas=NULL,se.fit=FALSE, rseed=1234,knotcheck=TRUE) 

## Arguments

 x Predictor vector. y Response vector. Must be same length as x. type Type of spline for x. Options include type="lin" for linear, type="cub" for cubic, type="cub0" for different cubic, and type="per" for cubic periodic. See Spline Types section. nknots Scalar giving maximum number of knots to bin-sample. Use more knots for more jagged functions. rparm Rounding parameter for x. Use rparm=NA to fit unrounded solution. Rounding parameter must be in interval (0,1]. xmin Minimum x value (i.e., a). Used to transform data to interval [0,1]. xmax Maximum x value (i.e., b). Used to transform data to interval [0,1]. alpha Manual tuning parameter for GCV score. Using alpha=1 gives unbaised esitmate. Using a larger alpha enforces a smoother estimate. lambdas Vector of global smoothing parameters to try. Default estimates smoothing parameter that minimizes GCV score. se.fit Logical indicating if the standard errors of fitted values should be estimated. rseed Random seed. Input to set.seed to reproduce same knots when refitting same model. Use rseed=NULL to generate a different sample of knots each time. knotcheck If TRUE, only unique knots are used (for stability).

## Details

To estimate η I minimize the penalized least-squares functional

\frac{1}{n}∑_{i=1}^{n}(y_{i}-η(x_{i}))^{2}+λ \int [\ddot{η}(x)]^2 dx

where \ddot{η} denotes the second derivative of η and λ≥q0 is a smoothing parameter that controls the trade-off between fitting and smoothing the data.

Default use of the function estimates λ by minimizing the GCV score:

\mbox{GCV}(λ) = \frac{n\|(\mathbf{I}_{n}-\mathbf{S}_{λ})\mathbf{y}\|^{2}}{[n-\mathrm{tr}(\mathbf{S}_{λ})]^2}

where \mathbf{I}_{n} is the identity matrix and \mathbf{S}_{λ} is the smoothing matrix (see Computational Details).

Using the rounding parameter input rparm can greatly speed-up and stabilize the fitting for large samples. When rparm is used, the spline is fit to a set of unique data points after rounding; the unique points are determined using the efficient algorithm described in Helwig (2013). For typical cases, I recommend using rparm=0.01, but smaller rounding parameters (e,g., rparm=0.001) may be needed for particularly jagged functions (or when x has outliers).

## Value

 fitted.values Vector of fitted values corresponding to the original data points in x (if rparm=NA) or the rounded data points in xunique (if rparm is used). se.fit Vector of standard errors of fitted.values (if input se.fit=TRUE). x Predictor vector (same as input). y Response vector (same as input). type Type of spline that was used. xunique Unique elements of x after rounding (if rparm is used). yunique Mean of y for unique elements of x after rounding (if rparm is used). funique Vector giving frequency of each element of xunique (if rparm is used). sigma Estimated error standard deviation, i.e., \hat{σ}. ndf Data frame with two elements: n is total sample size, and df is effective degrees of freedom of fit model (trace of smoothing matrix). info Model fit information: vector containing the GCV, multiple R-squared, AIC, and BIC of fit model (assuming Gaussian error). xrng Predictor range: xrng=c(xmin,xmax). myknots Bin-sampled spline knots used for fit. rparm Rounding parameter for x (same as input). lambda Optimal smoothing parameter. coef Spline basis function coefficients. coef.csqrt Matrix square-root of covariace matrix of coef. Use tcrossprod(coef.csqrt) to get covariance matrix of coef.

## Warnings

Cubic and cubic periodic splines transform the predictor to the interval [0,1] before fitting. So input xmin must be less than or equal to min(x), and input xmax must be greater than or equal to max(x).

When using rounding parameters, output fitted.values corresponds to unique rounded predictor scores in output xunique. Use predict.bigspline function to get fitted values for full y vector.

## Computational Details

According to smoothing spline theory, the function η can be approximated as

η(x) = d_{0} + d_{1}φ_{1}(x) + ∑_{h=1}^{q}c_{h}ρ(x,x_{h}^{*})

where the φ_{1} is a linear function, ρ is the reproducing kernel of the contrast (nonlinear) space, and \{x_{h}^{*}\}_{h=1}^{q} are the selected spline knots.

This implies that the penalized least-squares functional can be rewritten as

\|\mathbf{y} - \mathbf{K}\mathbf{d} - \mathbf{J}\mathbf{c}\|^{2} + nλ\mathbf{c}'\mathbf{Q}\mathbf{c}

where \mathbf{K}=\{φ(x_{i})\}_{n \times 2} is the null space basis function matrix, \mathbf{J}=\{ρ(x_{i},x_{h}^{*})\}_{n \times q} is the contrast space basis funciton matrix, \mathbf{Q}=\{ρ(x_{g}^{*},x_{h}^{*})\}_{q \times q} is the penalty matrix, and \mathbf{d}=(d_{0},d_{1})' and \mathbf{c}=(c_{1},…,c_{q})' are the unknown basis function coefficients.

Given the smoothing parameter λ, the optimal basis function coefficients have the form

≤ft(\begin{array}{cc} \hat{\mathbf{d}} \\ \hat{\mathbf{c}} \end{array}\right) = ≤ft(\begin{array}{cc} \mathbf{K'K} & \mathbf{K}'\mathbf{J} \\ \mathbf{J}'\mathbf{K} & \mathbf{J}'\mathbf{J} + nλ\mathbf{Q} \end{array}\right)^{\dagger} ≤ft(\begin{array}{c} \mathbf{K}' \\ \mathbf{J}' \end{array}\right)\mathbf{y}

where (\cdot)^{\dagger} denotes the pseudoinverse of the input matrix.

Given the optimal coefficients, the fitted values are given by \hat{\mathbf{y}} = \mathbf{K}\hat{\mathbf{d}}+\mathbf{J}\hat{\mathbf{c}} = \mathbf{S}_{λ}\mathbf{y}, where

\mathbf{S}_{λ} = ≤ft(\begin{array}{cc} \mathbf{K} & \mathbf{J} \end{array}\right) ≤ft(\begin{array}{cc} \mathbf{K'K} & \mathbf{K}'\mathbf{J} \\ \mathbf{J}'\mathbf{K} & \mathbf{J}'\mathbf{J} + nλ\mathbf{Q} \end{array}\right)^{\dagger} ≤ft(\begin{array}{c} \mathbf{K}' \\ \mathbf{J}' \end{array}\right)

is the smoothing matrix, which depends on λ.

## Spline Types

For a linear spline (type="lin") with x \in [0,1], the needed functions are

φ_{1}(x) = 0 \qquad \mbox{and} \qquad ρ(x,z) = k_{1}(x)k_{1}(z)+k_{2}(|x-z|)

where k_{1}(x)=x-0.5, k_{2}(x)=\frac{1}{2}≤ft(k_{1}^{2}(x) - \frac{1}{12} \right); in this case \mathbf{K}=\mathbf{1}_{n} and \mathbf{d}=d_{0}.

For a cubic spline (type="cub") with x \in [0,1], the needed functions are

φ_{1}(x) = k_{1}(x) \qquad \mbox{and} \qquad ρ(x,z) = k_{2}(x)k_{2}(z)-k_{4}(|x-z|)

where k_{1} and k_{2} are defined above, and k_{4}(x)=\frac{1}{24}≤ft(k_{1}^{4}(x) - \frac{k_{1}^{2}(x)}{2} + \frac{7}{240} \right).

For a different cubic spline (type="cub0") with x \in [0,1], the needed functions are

φ_{1}(x) = x \qquad \mbox{and} \qquad ρ(x,z) = (x \wedge z)^2[3(x \vee z) - (x \wedge z)]/6

where (x \wedge z) = \min(x,z) and (x \vee z) = \max(x,z).

Note that type="cub" and type="cub0" use different definitions of the averaging operator in the null space. The overall spline estimates should be the same (up to approximation accuracy), but the null and constrast space effect functions will differ (see predict.bigspline). See Helwig (2013) and Gu (2013) for a further discussion of polynomial splines.

For a periodic cubic spline (type="per") with x \in [0,1], the needed functions are

φ_{1}(x) = 0 \qquad \mbox{and} \qquad ρ(x,z) = -k_{4}(|x-z|)

where k_{4}(x) is defined as it was for type="cub"; in this case \mathbf{K}=\mathbf{1}_{n} and \mathbf{d}=d_{0}.

## Note

The spline is estimated using penalized least-squares, which does not require the Gaussian error assumption. However, the spline inference information (e.g., standard errors and fit information) requires the Gaussian error assumption.

## Author(s)

Nathaniel E. Helwig <[email protected]>

## References

Gu, C. (2013). Smoothing spline ANOVA models, 2nd edition. New York: Springer.

Helwig, N. E. (2013). Fast and stable smoothing spline analysis of variance models for large samples with applications to electroencephalography data analysis. Unpublished doctoral dissertation. University of Illinois at Urbana-Champaign.

Helwig, N. E. and Ma, P. (2015). Fast and stable multiple smoothing parameter selection in smoothing spline analysis of variance models with large samples. Journal of Computational and Graphical Statistics, 24, 715-732.

Helwig, N. E. and Ma, P. (2016). Smoothing spline ANOVA for super-large samples: Scalable computation via rounding parameters. Statistics and Its Interface, 9, 433-444.

## Examples

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 ########## EXAMPLE 1 ########## # define relatively smooth function set.seed(773) myfun <- function(x){ sin(2*pi*x) } x <- runif(10^6) y <- myfun(x) + rnorm(10^6) # linear, cubic, different cubic, and periodic splines linmod <- bigspline(x,y,type="lin") linmod cubmod <- bigspline(x,y) cubmod cub0mod <- bigspline(x,y,type="cub0") cub0mod permod <- bigspline(x,y,type="per") permod ########## EXAMPLE 2 ########## # define more jagged function set.seed(773) myfun <- function(x){ 2*x + cos(4*pi*x) } x <- runif(10^6)*4 y <- myfun(x) + rnorm(10^6) # try different numbers of knots r1mod <- bigspline(x,y,nknots=20) crossprod( myfun(r1mod$xunique) - r1mod$fitted )/length(r1mod$fitted) r2mod <- bigspline(x,y,nknots=30) crossprod( myfun(r2mod$xunique) - r2mod$fitted )/length(r2mod$fitted) r3mod <- bigspline(x,y,nknots=40) crossprod( myfun(r3mod$xunique) - r3mod$fitted )/length(r3mod$fitted) ########## EXAMPLE 3 ########## # define more jagged function set.seed(773) myfun <- function(x){ 2*x + cos(4*pi*x) } x <- runif(10^6)*4 y <- myfun(x) + rnorm(10^6) # try different rounding parameters r1mod <- bigspline(x,y,rparm=0.05) crossprod( myfun(r1mod$xunique) - r1mod$fitted )/length(r1mod$fitted) r2mod <- bigspline(x,y,rparm=0.02) crossprod( myfun(r2mod$xunique) - r2mod$fitted )/length(r2mod$fitted) r3mod <- bigspline(x,y,rparm=0.01) crossprod( myfun(r3mod$xunique) - r3mod$fitted )/length(r3mod$fitted) 

taylerablake/thin-plate-splines documentation built on Sept. 19, 2017, 9:45 a.m.