bigspline: Fits Smoothing Spline


Description

Given a real-valued response vector \mathbf{y}=\{y_{i}\}_{n\times 1} and a real-valued predictor vector \mathbf{x}=\{x_{i}\}_{n\times 1} with a ≤ x_{i} ≤ b \ \forall i, a smoothing spline model has the form

y_{i}=η(x_{i})+e_{i}

where y_{i} is the i-th observation's response, x_{i} is the i-th observation's predictor, η is an unknown smooth function relating the response and predictor, and e_{i}\sim\mathrm{N}(0,σ^{2}) is iid Gaussian error.

Usage

bigspline(x,y,type="cub",nknots=30,rparm=0.01,xmin=min(x),
          xmax=max(x),alpha=1,lambdas=NULL,se.fit=FALSE,
          rseed=1234,knotcheck=TRUE)

Arguments

x

Predictor vector.

y

Response vector. Must be same length as x.

type

Type of spline for x. Options include type="lin" for linear, type="cub" for cubic, type="cub0" for a different cubic parameterization, and type="per" for cubic periodic. See the Spline Types section.

nknots

Scalar giving maximum number of knots to bin-sample. Use more knots for more jagged functions.

rparm

Rounding parameter for x. Use rparm=NA to fit unrounded solution. Rounding parameter must be in interval (0,1].

xmin

Minimum x value (i.e., a). Used to transform data to interval [0,1].

xmax

Maximum x value (i.e., b). Used to transform data to interval [0,1].

alpha

Manual tuning parameter for the GCV score. Using alpha=1 gives an unbiased estimate. Using a larger alpha enforces a smoother estimate.

lambdas

Vector of global smoothing parameters to try. Default estimates smoothing parameter that minimizes GCV score.

se.fit

Logical indicating if the standard errors of fitted values should be estimated.

rseed

Random seed. Input to set.seed to reproduce same knots when refitting same model. Use rseed=NULL to generate a different sample of knots each time.

knotcheck

If TRUE, only unique knots are used (for stability).

Details

To estimate η I minimize the penalized least-squares functional

\frac{1}{n}∑_{i=1}^{n}(y_{i}-η(x_{i}))^{2}+λ \int [\ddot{η}(x)]^2 dx

where \ddot{η} denotes the second derivative of η and λ ≥ 0 is a smoothing parameter that controls the trade-off between fitting and smoothing the data.

Default use of the function estimates λ by minimizing the GCV score:

\mbox{GCV}(λ) = \frac{n\|(\mathbf{I}_{n}-\mathbf{S}_{λ})\mathbf{y}\|^{2}}{[n-\mathrm{tr}(\mathbf{S}_{λ})]^2}

where \mathbf{I}_{n} is the identity matrix and \mathbf{S}_{λ} is the smoothing matrix (see Computational Details).
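For intuition, the GCV score can be written as a small R function. This is a minimal sketch, assuming a candidate smoothing matrix S.lam has already been formed; bigspline itself evaluates the criterion more efficiently internally.

# sketch: GCV score for a candidate smoothing matrix S.lam (n x n) and response y
gcv <- function(S.lam, y) {
  n <- length(y)
  resid <- y - S.lam %*% y                        # (I - S)y
  n * sum(resid^2) / (n - sum(diag(S.lam)))^2     # n ||(I - S)y||^2 / [n - tr(S)]^2
}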

Using the rounding parameter input rparm can greatly speed up and stabilize the fitting for large samples. When rparm is used, the spline is fit to a set of unique data points after rounding; the unique points are determined using the efficient algorithm described in Helwig (2013). For typical cases, I recommend using rparm=0.01, but smaller rounding parameters (e.g., rparm=0.001) may be needed for particularly jagged functions (or when x has outliers).
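As a rough illustration (a sketch only; speed and accuracy differences depend on the data), the rounded and unrounded fits can be compared directly:

# compare a rounded fit to an unrounded fit on the same data
set.seed(1)
x <- runif(10^5)
y <- sin(2*pi*x) + rnorm(10^5)
rmod <- bigspline(x, y, rparm=0.01)   # fit to rounded unique x values (fast)
umod <- bigspline(x, y, rparm=NA)     # fit to all original x values (slower)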

Value

fitted.values

Vector of fitted values corresponding to the original data points in x (if rparm=NA) or the rounded data points in xunique (if rparm is used).

se.fit

Vector of standard errors of fitted.values (if input se.fit=TRUE).

x

Predictor vector (same as input).

y

Response vector (same as input).

type

Type of spline that was used.

xunique

Unique elements of x after rounding (if rparm is used).

yunique

Mean of y for unique elements of x after rounding (if rparm is used).

funique

Vector giving frequency of each element of xunique (if rparm is used).

sigma

Estimated error standard deviation, i.e., \hat{σ}.

ndf

Data frame with two elements: n is the total sample size, and df is the effective degrees of freedom of the fitted model (trace of the smoothing matrix).

info

Model fit information: vector containing the GCV, multiple R-squared, AIC, and BIC of the fitted model (assuming Gaussian error).

xrng

Predictor range: xrng=c(xmin,xmax).

myknots

Bin-sampled spline knots used for fit.

rparm

Rounding parameter for x (same as input).

lambda

Optimal smoothing parameter.

coef

Spline basis function coefficients.

coef.csqrt

Matrix square-root of the covariance matrix of coef. Use tcrossprod(coef.csqrt) to get the covariance matrix of coef.
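For example, assuming fit is a fitted bigspline object:

coef.cov <- tcrossprod(fit$coef.csqrt)   # covariance matrix of fit$coef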

Warnings

Cubic and cubic periodic splines transform the predictor to the interval [0,1] before fitting. So input xmin must be less than or equal to min(x), and input xmax must be greater than or equal to max(x).

When using rounding parameters, the output fitted.values corresponds to the unique rounded predictor scores in the output xunique. Use the predict.bigspline function to get fitted values for the full y vector.
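A minimal sketch (assuming the predict.bigspline method accepts new predictor values through a newdata argument):

fit <- bigspline(x, y, rparm=0.01)
yhat <- predict(fit, newdata=x)   # fitted values for every element of the original x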

Computational Details

According to smoothing spline theory, the function η can be approximated as

η(x) = d_{0} + d_{1}φ_{1}(x) + ∑_{h=1}^{q}c_{h}ρ(x,x_{h}^{*})

where φ_{1} is a linear function, ρ is the reproducing kernel of the contrast (nonlinear) space, and \{x_{h}^{*}\}_{h=1}^{q} are the selected spline knots.

This implies that the penalized least-squares functional can be rewritten as

\|\mathbf{y} - \mathbf{K}\mathbf{d} - \mathbf{J}\mathbf{c}\|^{2} + nλ\mathbf{c}'\mathbf{Q}\mathbf{c}

where \mathbf{K}=\{φ(x_{i})\}_{n \times 2} is the null space basis function matrix, \mathbf{J}=\{ρ(x_{i},x_{h}^{*})\}_{n \times q} is the contrast space basis function matrix, \mathbf{Q}=\{ρ(x_{g}^{*},x_{h}^{*})\}_{q \times q} is the penalty matrix, and \mathbf{d}=(d_{0},d_{1})' and \mathbf{c}=(c_{1},…,c_{q})' are the unknown basis function coefficients.

Given the smoothing parameter λ, the optimal basis function coefficients have the form

\left(\begin{array}{c} \hat{\mathbf{d}} \\ \hat{\mathbf{c}} \end{array}\right) = \left(\begin{array}{cc} \mathbf{K}'\mathbf{K} & \mathbf{K}'\mathbf{J} \\ \mathbf{J}'\mathbf{K} & \mathbf{J}'\mathbf{J} + nλ\mathbf{Q} \end{array}\right)^{\dagger} \left(\begin{array}{c} \mathbf{K}' \\ \mathbf{J}' \end{array}\right)\mathbf{y}

where (\cdot)^{\dagger} denotes the pseudoinverse of the input matrix.

Given the optimal coefficients, the fitted values are given by \hat{\mathbf{y}} = \mathbf{K}\hat{\mathbf{d}}+\mathbf{J}\hat{\mathbf{c}} = \mathbf{S}_{λ}\mathbf{y}, where

\mathbf{S}_{λ} = \left(\begin{array}{cc} \mathbf{K} & \mathbf{J} \end{array}\right) \left(\begin{array}{cc} \mathbf{K}'\mathbf{K} & \mathbf{K}'\mathbf{J} \\ \mathbf{J}'\mathbf{K} & \mathbf{J}'\mathbf{J} + nλ\mathbf{Q} \end{array}\right)^{\dagger} \left(\begin{array}{c} \mathbf{K}' \\ \mathbf{J}' \end{array}\right)

is the smoothing matrix, which depends on λ.
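To make the algebra concrete, here is a minimal dense-matrix sketch of the coefficient solution and smoothing matrix. The matrices K, J, Q, the smoothing parameter lam, and the response y are assumed to be already formed; bigspline itself uses more scalable computations.

library(MASS)   # for ginv (Moore-Penrose pseudoinverse)

# K: n x 2 null space basis, J: n x q contrast basis, Q: q x q penalty,
# lam: smoothing parameter, y: response vector
X <- cbind(K, J)
n <- length(y)
p <- ncol(K)
q <- ncol(J)
pen <- matrix(0, p + q, p + q)
pen[(p + 1):(p + q), (p + 1):(p + q)] <- n * lam * Q
XtXi <- ginv(crossprod(X) + pen)      # pseudoinverse of the coefficient matrix
dc <- XtXi %*% crossprod(X, y)        # stacked coefficients (d, c)
S.lam <- X %*% XtXi %*% t(X)          # smoothing matrix
yhat <- S.lam %*% y                   # fitted values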

Spline Types

For a linear spline (type="lin") with x \in [0,1], the needed functions are

φ_{1}(x) = 0 \qquad \mbox{and} \qquad ρ(x,z) = k_{1}(x)k_{1}(z)+k_{2}(|x-z|)

where k_{1}(x)=x-0.5 and k_{2}(x)=\frac{1}{2}\left(k_{1}^{2}(x) - \frac{1}{12}\right); in this case \mathbf{K}=\mathbf{1}_{n} and \mathbf{d}=d_{0}.

For a cubic spline (type="cub") with x \in [0,1], the needed functions are

φ_{1}(x) = k_{1}(x) \qquad \mbox{and} \qquad ρ(x,z) = k_{2}(x)k_{2}(z)-k_{4}(|x-z|)

where k_{1} and k_{2} are defined above, and k_{4}(x)=\frac{1}{24}\left(k_{1}^{4}(x) - \frac{k_{1}^{2}(x)}{2} + \frac{7}{240}\right).

For a different cubic spline (type="cub0") with x \in [0,1], the needed functions are

φ_{1}(x) = x \qquad \mbox{and} \qquad ρ(x,z) = (x \wedge z)^2[3(x \vee z) - (x \wedge z)]/6

where (x \wedge z) = \min(x,z) and (x \vee z) = \max(x,z).

Note that type="cub" and type="cub0" use different definitions of the averaging operator in the null space. The overall spline estimates should be the same (up to approximation accuracy), but the null and contrast space effect functions will differ (see predict.bigspline). See Helwig (2013) and Gu (2013) for further discussion of polynomial splines.

For a periodic cubic spline (type="per") with x \in [0,1], the needed functions are

φ_{1}(x) = 0 \qquad \mbox{and} \qquad ρ(x,z) = -k_{4}(|x-z|)

where k_{4}(x) is defined as it was for type="cub"; in this case \mathbf{K}=\mathbf{1}_{n} and \mathbf{d}=d_{0}.
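As an illustration (a sketch of the formulas above, not the package's internal code), the kernel functions can be written directly in R:

# scaled Bernoulli polynomial pieces used by the spline kernels
k1 <- function(x) x - 0.5
k2 <- function(x) 0.5 * (k1(x)^2 - 1/12)
k4 <- function(x) (1/24) * (k1(x)^4 - k1(x)^2 / 2 + 7/240)

rho.lin <- function(x, z) k1(x) * k1(z) + k2(abs(x - z))   # type = "lin"
rho.cub <- function(x, z) k2(x) * k2(z) - k4(abs(x - z))   # type = "cub"
rho.per <- function(x, z) -k4(abs(x - z))                  # type = "per"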

Note

The spline is estimated using penalized least-squares, which does not require the Gaussian error assumption. However, the spline inference information (e.g., standard errors and fit information) requires the Gaussian error assumption.

Author(s)

Nathaniel E. Helwig <helwig@umn.edu>

References

Gu, C. (2013). Smoothing spline ANOVA models, 2nd edition. New York: Springer.

Helwig, N. E. (2013). Fast and stable smoothing spline analysis of variance models for large samples with applications to electroencephalography data analysis. Unpublished doctoral dissertation. University of Illinois at Urbana-Champaign.

Helwig, N. E. and Ma, P. (2015). Fast and stable multiple smoothing parameter selection in smoothing spline analysis of variance models with large samples. Journal of Computational and Graphical Statistics, 24, 715-732.

Helwig, N. E. and Ma, P. (2016). Smoothing spline ANOVA for super-large samples: Scalable computation via rounding parameters. Statistics and Its Interface, 9, 433-444.

Examples

##########   EXAMPLE 1   ##########

# define relatively smooth function
set.seed(773)
myfun <- function(x){ sin(2*pi*x) }
x <- runif(10^6)
y <- myfun(x) + rnorm(10^6)

# linear, cubic, different cubic, and periodic splines
linmod <- bigspline(x,y,type="lin")
linmod
cubmod <- bigspline(x,y)
cubmod
cub0mod <- bigspline(x,y,type="cub0")
cub0mod
permod <- bigspline(x,y,type="per")
permod


##########   EXAMPLE 2   ##########

# define more jagged function
set.seed(773)
myfun <- function(x){ 2*x + cos(4*pi*x) }
x <- runif(10^6)*4
y <- myfun(x) + rnorm(10^6)

# try different numbers of knots
r1mod <- bigspline(x,y,nknots=20)
crossprod( myfun(r1mod$xunique) - r1mod$fitted )/length(r1mod$fitted)
r2mod <- bigspline(x,y,nknots=30)
crossprod( myfun(r2mod$xunique) - r2mod$fitted )/length(r2mod$fitted)
r3mod <- bigspline(x,y,nknots=40)
crossprod( myfun(r3mod$xunique) - r3mod$fitted )/length(r3mod$fitted)


##########   EXAMPLE 3   ##########

# define more jagged function
set.seed(773)
myfun <- function(x){ 2*x + cos(4*pi*x) }
x <- runif(10^6)*4
y <- myfun(x) + rnorm(10^6)

# try different rounding parameters
r1mod <- bigspline(x,y,rparm=0.05)
crossprod( myfun(r1mod$xunique) - r1mod$fitted )/length(r1mod$fitted)
r2mod <- bigspline(x,y,rparm=0.02)
crossprod( myfun(r2mod$xunique) - r2mod$fitted )/length(r2mod$fitted)
r3mod <- bigspline(x,y,rparm=0.01)
crossprod( myfun(r3mod$xunique) - r3mod$fitted )/length(r3mod$fitted)
