Description Usage Arguments Details Value Warnings Computational Details Skip Iteration Random Effects Note Author(s) References Examples

Given a real-valued response vector *\mathbf{y}=\{y_{i}\}_{n\times1}*, a semiparametric regression model has the form

*y_{i}= η(\mathbf{x}_{i}) + ∑_{j=1}^{t}b_{j}z_{ij} + e_{i}*

where *y_{i}* is the *i*-th observation's respone, *\mathbf{x}_{i}=(x_{i1},…,x_{ip})* is the *i*-th observation's nonparametric predictor vector, *η* is an unknown smooth function relating the response and nonparametric predictors, *\mathbf{z}_{i}=(z_{i1},…,z_{it})* is the *i*-th observation's parametric predictor vector, and *e_{i}\sim\mathrm{N}(0,σ^{2})* is iid Gaussian error. Function can fit both additive and interactive non/parametric effects, and allows for 2-way and 3-way interactions between nonparametric and parametric effects (see Details and Examples).

1 2 3 4 5 |

`formula` |
An object of class " |

`data` |
Optional data frame, list, or environment containing the variables in |

`type` |
List of smoothing spline types for predictors in |

`nknots` |
Two possible options: (a) scalar giving total number of random knots to sample, or (b) vector indexing which rows of |

`rparm` |
List of rounding parameters for each predictor. See Details. |

`lambdas` |
Vector of global smoothing parameters to try. Default |

`skip.iter` |
Logical indicating whether to skip the iterative smoothing parameter update. Using |

`se.fit` |
Logical indicating if the standard errors of the fitted values should be estimated. |

`rseed` |
Random seed for knot sampling. Input is ignored if |

`gcvopts` |
Control parameters for optimization. List with 3 elements: (a) |

`knotcheck` |
If |

`thetas` |
List of initial smoothing parameters for each predictor subspace. See Details. |

`weights` |
Vector of positive weights for fitting (default is vector of ones). |

`random` |
Adds random effects to model (see Random Effects section). |

`remlalg` |
REML algorithm for estimating variance components (see Random Effects section). Input is ignored if |

`remliter` |
Maximum number of iterations for REML estimation of variance components. Input is ignored if |

`remltol` |
Convergence tolerance for REML estimation of variance components. Input is ignored if |

`remltau` |
Initial estimate of variance parameters for REML estimation of variance components. Input is ignored if |

The `formula`

syntax is similar to that used in `lm`

and many other R regression functions. Use `y~x`

to predict the response `y`

from the predictor `x`

. Use `y~x1+x2`

to fit an additive model of the predictors `x1`

and `x2`

, and use `y~x1*x2`

to fit an interaction model. The syntax `y~x1*x2`

includes the interaction and main effects, whereas the syntax `y~x1:x2`

only includes the interaction. See Computational Details for specifics about how non/parametric effects are estimated.

See `bigspline`

for definitions of `type="cub"`

, `type="cub0"`

, and `type="per"`

splines, which can handle one-dimensional predictors. See Appendix of Helwig and Ma (2015) for information about `type="tps"`

and `type="nom"`

splines. Note that `type="tps"`

can handle one-, two-, or three-dimensional predictors. I recommend using `type="cub"`

if the predictor scores have no extreme outliers; when outliers are present, `type="tps"`

may produce a better result.

Using the rounding parameter input `rparm`

can greatly speed-up and stabilize the fitting for large samples. For typical cases, I recommend using `rparm=0.01`

for cubic and periodic splines, but smaller rounding parameters may be needed for particularly jagged functions. For thin-plate splines, the data are NOT transformed to the interval [0,1] before fitting, so the rounding parameter should be on the raw data scale. Also, for `type="tps"`

you can enter one rounding parameter for each predictor dimension. Use `rparm=1`

for ordinal and nominal splines.

`fitted.values` |
Vector of fitted values corresponding to the original data points in |

`se.fit` |
Vector of standard errors of |

`yvar` |
Response vector. |

`xvars` |
List of predictors. |

`type` |
Type of smoothing spline that was used for each predictor. |

`yunique` |
Mean of |

`xunique` |
Unique rows of |

`sigma` |
Estimated error standard deviation, i.e., |

`ndf` |
Data frame with two elements: |

`info` |
Model fit information: vector containing the GCV, multiple R-squared, AIC, and BIC of fit model (assuming Gaussian error). |

`modelspec` |
List containing specifics of fit model (needed for prediction). |

`converged` |
Convergence status: |

`tnames` |
Names of the terms in model. |

`random` |
Random effects formula (same as input). |

`tau` |
Variance parameters such that |

`blup` |
Best linear unbiased predictors (if |

`call` |
Called model in input |

Cubic and cubic periodic splines transform the predictor to the interval [0,1] before fitting.

When using rounding parameters, output `fitted.values`

corresponds to unique rounded predictor scores in output `xunique`

. Use `predict.bigssp`

function to get fitted values for full `yvar`

vector.

To estimate *η* I minimize the penalized least-squares functional

*\frac{1}{n}∑_{i=1}^{n}≤ft(y_{i} - η(\mathbf{x}_{i}) - \textstyle∑_{j=1}^{t}b_{j}z_{ij} \right)^{2} + λ J(η)*

where *J(\cdot)* is a nonnegative penalty functional quantifying the roughness of *η* and *λ>0* is a smoothing parameter controlling the trade-off between fitting and smoothing the data. Note that for *p>1* nonparametric predictors, there are additional *θ_{k}* smoothing parameters embedded in *J*.

The penalized least squares functioncal can be rewritten as

* \|\mathbf{y} - \mathbf{K}\mathbf{d} - \mathbf{J}_{θ}\mathbf{c}\|^{2} + nλ\mathbf{c}'\mathbf{Q}_{θ}\mathbf{c} *

where *\mathbf{K}=\{φ(x_{i}),\mathbf{z}_{i}\}_{n \times m}* is the parametric space basis function matrix, *\mathbf{J}_{θ}=∑_{k=1}^{s}θ_{k}\mathbf{J}_{k}* with *\mathbf{J}_{k}=\{ρ_{k}(\mathbf{x}_{i},\mathbf{x}_{h}^{*})\}_{n \times q}* denoting the *k*-th contrast space basis funciton matrix, *\mathbf{Q}_{θ}=∑_{k=1}^{s}θ_{k}\mathbf{Q}_{k}* with *\mathbf{Q}_{k}=\{ρ_{k}(\mathbf{x}_{g}^{*},\mathbf{x}_{h}^{*})\}_{q \times q}* denoting the *k*-th penalty matrix, and *\mathbf{d}=(d_{0},…,d_{m})'* and *\mathbf{c}=(c_{1},…,c_{q})'* are the unknown basis function coefficients. The optimal smoothing parameters are chosen by minimizing the GCV score (see `bigspline`

).

Note that this function uses the classic smoothing spline parameterization (see Gu, 2013), so there is more than one smoothing parameter per predictor (if interactions are included in the model). To evaluate the GCV score, this function uses the improved (scalable) SSA algorithm discussed in Helwig (2013) and Helwig and Ma (2015).

For *p>1* predictors, initial values for the *θ_{k}* parameters are estimated using Algorithm 3.2 described in Gu and Wahba (1991).

Default use of this function (`skip.iter=TRUE`

) fixes the *θ_{k}* parameters afer the smart start, and then finds the global smoothing parameter *λ* (among the input `lambdas`

) that minimizes the GCV score. This approach typically produces a solution very similar to the more optimal solution using `skip.iter=FALSE`

.

Setting `skip.iter=FALSE`

uses the same smart starting algorithm as setting `skip.iter=TRUE`

. However, instead of fixing the *θ_{k}* parameters afer the smart start, using `skip.iter=FALSE`

iterates between estimating the optimal *λ* and the optimal *θ_{k}* parameters. The R function `nlm`

is used to minimize the GCV score with respect to the *θ_{k}* parameters, which can be time consuming for models with many predictors.

The input `random`

adds random effects to the model assuming a variance components structure. Both nested and crossed random effects are supported. In all cases, the random effects are assumed to be indepedent zero-mean Gaussian variables with the variance depending on group membership.

Random effects are distinguished by vertical bars ("|"), which separate expressions for design matrices (left) from group factors (right). For example, the syntax `~1|group`

includes a random intercept for each level of `group`

, whereas the syntax `~1+x|group`

includes both a random intercept and a random slope for each level of `group`

. For crossed random effects, parentheses are needed to distinguish different terms, e.g., `~(1|group1)+(1|group2)`

includes a random intercept for each level of `group1`

and a random intercept for each level of `group2`

, where both `group1`

and `group2`

are factors. For nested random effects, the syntax `~group|subject`

can be used, where both `group`

and `subject`

are factors such that the levels of `subject`

are nested within those of `group`

.

The input `remlalg`

determines the REML algorithm used to estimate the variance components. Setting `remlalg="FS"`

uses a Fisher Scoring algorithm (default). Setting `remlalg="NR"`

uses a Newton-Raphson algorithm. Setting `remlalg="EM"`

uses an Expectation Maximization algorithm. Use `remlalg="none"`

to fit a model with known variance components (entered through `remltau`

).

The input `remliter`

sets the maximum number of iterations for the REML estimation. The input `remltol`

sets the convergence tolerance for the REML estimation, which is determined via relative change in the REML log-likelihood. The input `remltau`

sets the initial estimates of variance parameters; default is `remltau = rep(1,ntau)`

where `ntau`

is the number of variance components.

The spline is estimated using penalized least-squares, which does not require the Gaussian error assumption. However, the spline inference information (e.g., standard errors and fit information) requires the Gaussian error assumption.

Nathaniel E. Helwig <[email protected]>

Gu, C. (2013). *Smoothing spline ANOVA models, 2nd edition*. New York: Springer.

Gu, C. and Wahba, G. (1991). Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method. *SIAM Journal on Scientific and Statistical Computing, 12*, 383-398.

Helwig, N. E. (2013). *Fast and stable smoothing spline analysis of variance models for large samples with applications to electroencephalography data analysis*. Unpublished doctoral dissertation. University of Illinois at Urbana-Champaign.

Helwig, N. E. (2016). Efficient estimation of variance components in nonparametric mixed-effects models with large samples. *Statistics and Computing, 26*, 1319-1336.

Helwig, N. E. and Ma, P. (2015). Fast and stable multiple smoothing parameter selection in smoothing spline analysis of variance models with large samples. *Journal of Computational and Graphical Statistics, 24*, 715-732.

Helwig, N. E. and Ma, P. (2016). Smoothing spline ANOVA for super-large samples: Scalable computation via rounding parameters. *Statistics and Its Interface, 9*, 433-444.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 | ```
########## EXAMPLE ##########
# function with four continuous predictors
set.seed(773)
myfun <- function(x1v,x2v,x3v,x4v){
sin(2*pi*x1v) + log(x2v+.1) + x3v*cos(pi*(x4v))
}
x1v <- runif(500)
x2v <- runif(500)
x3v <- runif(500)
x4v <- runif(500)
y <- myfun(x1v,x2v,x3v,x4v) + rnorm(500)
# fit cubic spline model with x3v*x4v interaction and x3v as "cub"
# (includes x3v and x4v main effects)
cubmod <- bigssp(y~x1v+x2v+x3v*x4v,type=list(x1v="cub",x2v="cub",x3v="cub",x4v="cub"),nknots=50)
crossprod( myfun(x1v,x2v,x3v,x4v) - cubmod$fitted.values )/500
# fit cubic spline model with x3v*x4v interaction and x3v as "cub0"
# (includes x3v and x4v main effects)
cubmod <- bigssp(y~x1v+x2v+x3v*x4v,type=list(x1v="cub",x2v="cub",x3v="cub0",x4v="cub"),nknots=50)
crossprod( myfun(x1v,x2v,x3v,x4v) - cubmod$fitted.values )/500
# fit model with x3v*x4v interaction treating x3v as parametric effect
# (includes x3v and x4v main effects)
cubmod <- bigssp(y~x1v+x2v+x3v*x4v,type=list(x1v="cub",x2v="cub",x3v="prm",x4v="cub"),nknots=50)
crossprod( myfun(x1v,x2v,x3v,x4v) - cubmod$fitted.values )/500
# fit cubic spline model with x3v:x4v interaction and x3v as "cub"
# (excludes x3v and x4v main effects)
cubmod <- bigssp(y~x1v+x2v+x3v:x4v,type=list(x1v="cub",x2v="cub",x3v="cub",x4v="cub"),nknots=50)
crossprod( myfun(x1v,x2v,x3v,x4v) - cubmod$fitted.values )/500
# fit cubic spline model with x3v:x4v interaction and x3v as "cub0"
# (excludes x3v and x4v main effects)
cubmod <- bigssp(y~x1v+x2v+x3v:x4v,type=list(x1v="cub",x2v="cub",x3v="cub0",x4v="cub"),nknots=50)
crossprod( myfun(x1v,x2v,x3v,x4v) - cubmod$fitted.values )/500
# fit model with x3v:x4v interaction treating x3v as parametric effect
# (excludes x3v and x4v main effects)
cubmod <- bigssp(y~x1v+x2v+x3v:x4v,type=list(x1v="cub",x2v="cub",x3v="prm",x4v="cub"),nknots=50)
crossprod( myfun(x1v,x2v,x3v,x4v) - cubmod$fitted.values )/500
``` |

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.