
Description

Given a real-valued response vector *\mathbf{y}=\{y_{i}\}_{n\times1}*, a Smoothing Spline ANOVA (SSA) model has the form

*y_{i}= η(\mathbf{x}_{i}) + e_{i}*

where *y_{i}* is the *i*-th observation's response, *\mathbf{x}_{i}=(x_{i1},…,x_{ip})* is the *i*-th observation's nonparametric predictor vector, *η* is an unknown smooth function relating the response and nonparametric predictors, and *e_{i}\sim\mathrm{N}(0,σ^{2})* is iid Gaussian error. The function can fit additive models, and also allows for 2-way and 3-way interactions between any number of predictors (see Details and Examples).


Arguments

`formula`
An object of class "formula": a symbolic description of the model to be fitted (see Details and Examples).

`data`
Optional data frame, list, or environment containing the variables in `formula`.

`type`
List of smoothing spline types for the predictors in `formula` (see Details).

`nknots`
Two possible options: (a) scalar giving the total number of random knots to sample, or (b) vector indexing which rows of the data are the knots.

`rparm`
List of rounding parameters for each predictor. See Details.

`lambdas`
Vector of global smoothing parameters to try. Default is `lambdas=10^-c(9:0)`.

`skip.iter`
Logical indicating whether to skip the iterative smoothing parameter update. Using `skip.iter=FALSE` should provide a more optimal solution, but can be more time consuming (see Skip Iteration section).

`se.fit`
Logical indicating if the standard errors of the fitted values should be estimated.

`rseed`
Random seed for knot sampling. Input is ignored if `nknots` is a vector of knot indices.

`gcvopts`
Control parameters for optimization. List with 3 elements: the maximum number of iterations, the convergence tolerance for the iterative update, and the GCV tuning parameter.

`knotcheck`
If `TRUE`, only unique knots are used (for stability).

`gammas`
List of initial smoothing parameters for each predictor. See Details.

`weights`
Vector of positive weights for fitting (default is a vector of ones).

`random`
Adds random effects to the model (see Random Effects section).

`remlalg`
REML algorithm for estimating the variance components (see Random Effects section). Input is ignored if `random` is `NULL`.

`remliter`
Maximum number of iterations for REML estimation of the variance components. Input is ignored if `random` is `NULL`.

`remltol`
Convergence tolerance for REML estimation of the variance components. Input is ignored if `random` is `NULL`.

`remltau`
Initial estimate of the variance parameters for REML estimation of the variance components. Input is ignored if `random` is `NULL`.

Details

The `formula` syntax is similar to that used in `lm` and many other R regression functions. Use `y~x` to predict the response `y` from the predictor `x`. Use `y~x1+x2` to fit an additive model of the predictors `x1` and `x2`, and use `y~x1*x2` to fit an interaction model. The syntax `y~x1*x2` includes the interaction and main effects, whereas the syntax `y~x1:x2` is not supported. See Computational Details for specifics about how nonparametric effects are estimated.
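As a quick base-R check of how the `*` shorthand expands (this uses base R's `terms()` and is independent of the bigsplines package):

```r
# additive and interaction formulas as described above
f_add <- y ~ x1 + x2   # additive effects only
f_int <- y ~ x1 * x2   # shorthand for main effects plus interaction
# terms() expands '*' into the main effects and the x1:x2 interaction
attr(terms(f_int), "term.labels")
```

So `y~x1*x2` requests `x1`, `x2`, and their interaction in a single model.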

See `bigspline` for definitions of `type="cub"`, `type="cub0"`, and `type="per"` splines, which can handle one-dimensional predictors. See the Appendix of Helwig and Ma (2015) for information about `type="tps"` and `type="nom"` splines. Note that `type="tps"` can handle one-, two-, or three-dimensional predictors. I recommend using `type="cub"` if the predictor scores have no extreme outliers; when outliers are present, `type="tps"` may produce a better result.

Using the rounding parameter input `rparm` can greatly speed up and stabilize the fitting for large samples. For typical cases, I recommend using `rparm=0.01` for cubic and periodic splines, but smaller rounding parameters may be needed for particularly jagged functions. For thin-plate splines, the data are NOT transformed to the interval [0,1] before fitting, so the rounding parameter should be on the raw data scale. Also, for `type="tps"` you can enter one rounding parameter for each predictor dimension. Use `rparm=1` for ordinal and nominal splines.
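The speed-up from rounding comes from collapsing near-duplicate predictor scores before fitting; a minimal base-R sketch of the idea (illustration only, not the package's internal code):

```r
# rounding a [0,1] predictor with rparm=0.01 snaps it to a grid of
# at most 101 distinct values, no matter how large the sample is
set.seed(1)
x  <- runif(100000)   # large sample of predictor scores
xr <- round(x, 2)     # analogue of rparm=0.01 on the [0,1] scale
length(unique(xr))    # far fewer unique values than observations
```

The spline then only needs to be evaluated at the unique rounded values, with replicate responses averaged (compare the `yunique` and `xunique` outputs).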

Value

`fitted.values`
Vector of fitted values corresponding to the original data points in `xvars` (or, when rounding parameters are used, to the unique rounded predictor scores in `xunique`; see Warnings).

`se.fit`
Vector of standard errors of `fitted.values` (if input `se.fit=TRUE`).

`yvar`
Response vector.

`xvars`
List of predictors.

`type`
Type of smoothing spline that was used for each predictor.

`yunique`
Mean of `yvar` for each unique row of `xunique` (when rounding parameters are used).

`xunique`
Unique rows of the (rounded) predictor variables (when rounding parameters are used).

`sigma`
Estimated error standard deviation, i.e., the estimate of *σ*.

`ndf`
Data frame with two elements: the total sample size and the effective degrees of freedom of the fit model.

`info`
Model fit information: vector containing the GCV, multiple R-squared, AIC, and BIC of the fit model (assuming Gaussian error).

`modelspec`
List containing specifics of the fit model (needed for prediction).

`converged`
Convergence status: `TRUE` if the iterative smoothing parameter update converged, `FALSE` if it failed to converge, and `NA` if `skip.iter=TRUE`.

`tnames`
Names of the terms in the model.

`random`
Random effects formula (same as input).

`tau`
Variance parameters such that *σ^{2}τ* gives the variances of the random effects (if `random` is used).

`blup`
Best linear unbiased predictors of the random effects (if `random` is used).

`call`
Called model in input `formula`.

Warnings

Cubic and cubic periodic splines transform the predictor to the interval [0,1] before fitting.

When using rounding parameters, the output `fitted.values` corresponds to the unique rounded predictor scores in the output `xunique`. Use the `predict.bigssa` function to get fitted values for the full `yvar` vector.

Computational Details

To estimate *η* I minimize the penalized least-squares functional

*\frac{1}{n}∑_{i=1}^{n}\left(y_{i} - η(\mathbf{x}_{i})\right)^{2} + λ J(η)*

where *J(\cdot)* is a nonnegative penalty functional quantifying the roughness of *η* and *λ>0* is a smoothing parameter controlling the trade-off between fitting and smoothing the data. Note that for *p>1* nonparametric predictors, there are additional *θ_{k}* smoothing parameters embedded in *J*.

The penalized least-squares functional can be rewritten as

* \|\mathbf{y} - \mathbf{K}\mathbf{d} - \mathbf{J}_{θ}\mathbf{c}\|^{2} + nλ\mathbf{c}'\mathbf{Q}_{θ}\mathbf{c} *

where *\mathbf{K}=\{φ(x_{i})\}_{n \times m}* is the null (parametric) space basis function matrix, *\mathbf{J}_{θ}=∑_{k=1}^{s}θ_{k}\mathbf{J}_{k}* with *\mathbf{J}_{k}=\{ρ_{k}(\mathbf{x}_{i},\mathbf{x}_{h}^{*})\}_{n \times q}* denoting the *k*-th contrast space basis function matrix, *\mathbf{Q}_{θ}=∑_{k=1}^{s}θ_{k}\mathbf{Q}_{k}* with *\mathbf{Q}_{k}=\{ρ_{k}(\mathbf{x}_{g}^{*},\mathbf{x}_{h}^{*})\}_{q \times q}* denoting the *k*-th penalty matrix, and *\mathbf{d}=(d_{0},…,d_{m})'* and *\mathbf{c}=(c_{1},…,c_{q})'* are the unknown basis function coefficients. The optimal smoothing parameters are chosen by minimizing the GCV score (see `bigspline`).
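For fixed *λ* and *θ_{k}*, the minimizer of the quadratic above follows from standard penalized least-squares algebra; as a sketch (not quoted from the package documentation), the coefficients solve the normal equations

```latex
\begin{pmatrix}
\mathbf{K}'\mathbf{K} & \mathbf{K}'\mathbf{J}_{\theta} \\
\mathbf{J}_{\theta}'\mathbf{K} & \mathbf{J}_{\theta}'\mathbf{J}_{\theta} + n\lambda\mathbf{Q}_{\theta}
\end{pmatrix}
\begin{pmatrix} \mathbf{d} \\ \mathbf{c} \end{pmatrix}
=
\begin{pmatrix} \mathbf{K}'\mathbf{y} \\ \mathbf{J}_{\theta}'\mathbf{y} \end{pmatrix}
```

so that, for given smoothing parameters, the fit is a single penalized linear solve; the GCV search is only over *λ* and the *θ_{k}*.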

Note that this function uses the efficient SSA reparameterization described in Helwig (2013) and Helwig and Ma (2015); using this parameterization, there is one unique smoothing parameter per predictor (*γ_{j}*), and these *γ_{j}* parameters determine the structure of the *θ_{k}* parameters in the tensor product space. To evaluate the GCV score, this function uses the improved (scalable) SSA algorithm discussed in Helwig (2013) and Helwig and Ma (2015).

Skip Iteration

For *p>1* predictors, initial values for the *γ_{j}* parameters (that determine the structure of the *θ_{k}* parameters) are estimated using the smart starting algorithm described in Helwig (2013) and Helwig and Ma (2015).

Default use of this function (`skip.iter=TRUE`) fixes the *γ_{j}* parameters after the smart start, and then finds the global smoothing parameter *λ* (among the input `lambdas`) that minimizes the GCV score. This approach typically produces a solution very similar to the more optimal solution using `skip.iter=FALSE`.

Setting `skip.iter=FALSE` uses the same smart starting algorithm as setting `skip.iter=TRUE`. However, instead of fixing the *γ_{j}* parameters after the smart start, using `skip.iter=FALSE` iterates between estimating the optimal *λ* and the optimal *γ_{j}* parameters. The R function `nlm` is used to minimize the GCV score with respect to the *γ_{j}* parameters, which can be time consuming for models with many predictors and/or a large number of knots.
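To illustrate the role `nlm` plays in the `skip.iter=FALSE` case, here is a toy call with a hypothetical stand-in for the GCV score (the real objective depends on the data and knots):

```r
# nlm() performs Newton-type minimization of a smooth objective;
# a simple convex surrogate stands in for the GCV score here
gcv_surrogate <- function(gamma) (gamma - 1.5)^2 + 2  # hypothetical score
opt <- nlm(gcv_surrogate, p = 0)                      # start the search at 0
opt$estimate                                          # minimizer, near 1.5
```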

Random Effects

The input `random` adds random effects to the model assuming a variance components structure. Both nested and crossed random effects are supported. In all cases, the random effects are assumed to be independent zero-mean Gaussian variables with the variance depending on group membership.

Random effects are distinguished by vertical bars ("|"), which separate expressions for design matrices (left) from group factors (right). For example, the syntax `~1|group` includes a random intercept for each level of `group`, whereas the syntax `~1+x|group` includes both a random intercept and a random slope for each level of `group`. For crossed random effects, parentheses are needed to distinguish different terms, e.g., `~(1|group1)+(1|group2)` includes a random intercept for each level of `group1` and a random intercept for each level of `group2`, where both `group1` and `group2` are factors. For nested random effects, the syntax `~group|subject` can be used, where both `group` and `subject` are factors such that the levels of `subject` are nested within those of `group`.
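The random-effects specifications above are ordinary one-sided R formulas; a hedged sketch follows (the `bigssa` call is commented out and assumes the bigsplines package plus suitable data `y`, `x`, and `group`):

```r
# one-sided formulas describing the random-effects structures
r_int   <- ~ 1 | group                    # random intercept per level of group
r_slope <- ~ 1 + x | group                # random intercept and random slope
r_cross <- ~ (1 | group1) + (1 | group2)  # crossed random intercepts
# illustrative model call (requires bigsplines and data y, x, group):
# mod <- bigssa(y ~ x, type = "cub", nknots = 20, random = r_int)
sapply(list(r_int, r_slope, r_cross), inherits, what = "formula")
```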

The input `remlalg` determines the REML algorithm used to estimate the variance components. Setting `remlalg="FS"` uses a Fisher scoring algorithm (default). Setting `remlalg="NR"` uses a Newton-Raphson algorithm. Setting `remlalg="EM"` uses an Expectation-Maximization algorithm. Use `remlalg="none"` to fit a model with known variance components (entered through `remltau`).

The input `remliter` sets the maximum number of iterations for the REML estimation. The input `remltol` sets the convergence tolerance for the REML estimation, which is determined via the relative change in the REML log-likelihood. The input `remltau` sets the initial estimates of the variance parameters; the default is `remltau = rep(1,ntau)`, where `ntau` is the number of variance components.
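The relative-change convergence rule described for `remltol` can be sketched in plain R (the log-likelihood values are made up for illustration):

```r
# stop REML iterations once the relative change in log-likelihood is small
remltol <- 1e-4            # hypothetical convergence tolerance
ll_old  <- -1050.00        # REML log-likelihood, previous iteration (made up)
ll_new  <- -1049.98        # REML log-likelihood, current iteration (made up)
rel_change <- abs(ll_new - ll_old) / abs(ll_old)
rel_change < remltol       # small relative change: iterations would stop here
```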

Note

The spline is estimated using penalized least-squares, which does not require the Gaussian error assumption. However, the spline inference information (e.g., standard errors and fit information) requires the Gaussian error assumption.

Author(s)

Nathaniel E. Helwig <[email protected]>

References

Gu, C. (2013). *Smoothing spline ANOVA models, 2nd edition*. New York: Springer.

Helwig, N. E. (2013). *Fast and stable smoothing spline analysis of variance models for large samples with applications to electroencephalography data analysis*. Unpublished doctoral dissertation. University of Illinois at Urbana-Champaign.

Helwig, N. E. (2016). Efficient estimation of variance components in nonparametric mixed-effects models with large samples. *Statistics and Computing, 26*, 1319-1336.

Helwig, N. E. and Ma, P. (2015). Fast and stable multiple smoothing parameter selection in smoothing spline analysis of variance models with large samples. *Journal of Computational and Graphical Statistics, 24*, 715-732.

Helwig, N. E. and Ma, P. (2016). Smoothing spline ANOVA for super-large samples: Scalable computation via rounding parameters. *Statistics and Its Interface, 9*, 433-444.

Examples

```
########## EXAMPLE 1 ##########
# define univariate function and data
set.seed(773)
myfun <- function(x){ sin(2*pi*x) }
x <- runif(500)
y <- myfun(x) + rnorm(500)
# cubic, periodic, and thin-plate spline models with 20 knots
cubmod <- bigssa(y~x,type="cub",nknots=20,se.fit=TRUE)
cubmod
permod <- bigssa(y~x,type="per",nknots=20,se.fit=TRUE)
permod
tpsmod <- bigssa(y~x,type="tps",nknots=20,se.fit=TRUE)
tpsmod
########## EXAMPLE 2 ##########
# function with two continuous predictors
set.seed(773)
myfun <- function(x1v,x2v){sin(2*pi*x1v)+log(x2v+.1)+cos(pi*(x1v-x2v))}
x1v <- runif(500)
x2v <- runif(500)
y <- myfun(x1v,x2v) + rnorm(500)
# cubic splines with 50 randomly selected knots
intmod <- bigssa(y~x1v*x2v,type=list(x1v="cub",x2v="cub"),nknots=50)
intmod
crossprod( myfun(x1v,x2v) - intmod$fitted.values )/500
# fit additive model (with same knots)
addmod <- bigssa(y~x1v+x2v,type=list(x1v="cub",x2v="cub"),nknots=50)
addmod
crossprod( myfun(x1v,x2v) - addmod$fitted.values )/500
########## EXAMPLE 3 ##########
# function with two continuous and one nominal predictor (3 levels)
set.seed(773)
myfun <- function(x1v,x2v,x3v){
  fval <- rep(0,length(x1v))
  xmeans <- c(-1,0,1)
  for(j in 1:3){
    idx <- which(x3v==letters[j])
    fval[idx] <- xmeans[j]
  }
  # after the loop, idx still indexes the last group ("c"),
  # which receives an additional x1v effect
  fval[idx] <- fval[idx] + cos(4*pi*(x1v[idx]))
  fval <- (fval + sin(3*pi*x1v*x2v+pi)) / sqrt(2)
  return(fval)
}
x1v <- runif(500)
x2v <- runif(500)
x3v <- sample(letters[1:3],500,replace=TRUE)
y <- myfun(x1v,x2v,x3v) + rnorm(500)
# 3-way interaction with 50 knots
cuimod <- bigssa(y~x1v*x2v*x3v,type=list(x1v="cub",x2v="cub",x3v="nom"),nknots=50)
crossprod( myfun(x1v,x2v,x3v) - cuimod$fitted.values )/500
# fit correct interaction model with 50 knots
cubmod <- bigssa(y~x1v*x2v+x1v*x3v,type=list(x1v="cub",x2v="cub",x3v="nom"),nknots=50)
crossprod( myfun(x1v,x2v,x3v) - cubmod$fitted.values )/500
# fit model using 2-dimensional thin-plate and nominal
x1new <- cbind(x1v,x2v)
x2new <- x3v
tpsmod <- bigssa(y~x1new*x2new,type=list(x1new="tps",x2new="nom"),nknots=50)
crossprod( myfun(x1v,x2v,x3v) - tpsmod$fitted.values )/500
########## EXAMPLE 4 ##########
# function with four continuous predictors
set.seed(773)
myfun <- function(x1v,x2v,x3v,x4v){
sin(2*pi*x1v) + log(x2v+.1) + x3v*cos(pi*(x4v))
}
x1v <- runif(500)
x2v <- runif(500)
x3v <- runif(500)
x4v <- runif(500)
y <- myfun(x1v,x2v,x3v,x4v) + rnorm(500)
# fit cubic spline model with x3v*x4v interaction
cubmod <- bigssa(y~x1v+x2v+x3v*x4v,type=list(x1v="cub",x2v="cub",x3v="cub",x4v="cub"),nknots=50)
crossprod( myfun(x1v,x2v,x3v,x4v) - cubmod$fitted.values )/500
```

taylerablake/thin-plate-splines documentation built on Sept. 19, 2017, 9:45 a.m.
