# krscvNOMAD: Categorical Kernel Regression Spline Cross-Validation In crs: Categorical Regression Splines

## Description

`krscvNOMAD` computes NOMAD-based (Nonsmooth Optimization by Mesh Adaptive Direct Search, Abramson, Audet, Couture and Le Digabel (2011)) cross-validation directed search for a regression spline estimate of a one (1) dimensional dependent variable on an `r`-dimensional vector of continuous and nominal/ordinal (`factor`/`ordered`) predictors.

## Usage

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29``` ```krscvNOMAD(xz, y, degree.max = 10, segments.max = 10, degree.min = 0, segments.min = 1, cv.df.min = 1, complexity = c("degree-knots","degree","knots"), knots = c("quantiles","uniform","auto"), basis = c("additive","tensor","glp","auto"), cv.func = c("cv.ls","cv.gcv","cv.aic"), degree = degree, segments = segments, lambda = lambda, lambda.discrete = FALSE, lambda.discrete.num = 100, random.seed = 42, max.bb.eval = 10000, initial.mesh.size.real = "r0.1", initial.mesh.size.integer = "1", min.mesh.size.real = paste("r",sqrt(.Machine\$double.eps),sep=""), min.mesh.size.integer = "1", min.poll.size.real = "1", min.poll.size.integer = "1", opts=list(), nmulti = 0, tau = NULL, weights = NULL, singular.ok = FALSE) ```

## Arguments

 `y` continuous univariate vector `xz` continuous and/or nominal/ordinal (`factor`/`ordered`) predictors `degree.max` the maximum degree of the B-spline basis for each of the continuous predictors (default `degree.max=10`) `segments.max` the maximum segments of the B-spline basis for each of the continuous predictors (default `segments.max=10`) `degree.min` the minimum degree of the B-spline basis for each of the continuous predictors (default `degree.min=0`) `segments.min` the minimum segments of the B-spline basis for each of the continuous predictors (default `segments.min=1`) `cv.df.min` the minimum degrees of freedom to allow when conducting cross-validation (default `cv.df.min=1`) `complexity` a character string (default `complexity="degree-knots"`) indicating whether model ‘complexity’ is determined by the degree of the spline or by the number of segments (‘knots’). This option allows the user to use cross-validation to select either the spline degree (number of knots held fixed) or the number of knots (spline degree held fixed) or both the spline degree and number of knots `knots` a character string (default `knots="quantiles"`) specifying where knots are to be placed. ‘quantiles’ specifies knots placed at equally spaced quantiles (equal number of observations lie in each segment) and ‘uniform’ specifies knots placed at equally spaced intervals. If `knots="auto"`, the knot type will be automatically determined by cross-validation `basis` a character string (default `basis="additive"`) indicating whether the additive or tensor product B-spline basis matrix for a multivariate polynomial spline or generalized B-spline polynomial basis should be used. Note this can be automatically determined by cross-validation if `cv=TRUE` and `basis="auto"`, and is an ‘all or none’ proposition (i.e. interaction terms for all predictors or for no predictors given the nature of ‘tensor products’). Note also that if there is only one predictor this defaults to `basis="additive"` to avoid unnecessary computation as the spline bases are equivalent in this case `cv.func` a character string (default `cv.func="cv.ls"`) indicating which method to use to select smoothing parameters. `cv.gcv` specifies generalized cross-validation (Craven and Wahba (1979)), `cv.aic` specifies expected Kullback-Leibler cross-validation (Hurvich, Simonoff, and Tsai (1998)), and `cv.ls` specifies least-squares cross-validation `degree` integer/vector specifying the degree of the B-spline basis for each dimension of the continuous `x` `segments` integer/vector specifying the number of segments of the B-spline basis for each dimension of the continuous `x` (i.e. number of knots minus one) `lambda` real/vector for the categorical predictors. If it is not NULL, it will be the starting value(s) for lambda `lambda.discrete` if `lambda.discrete=TRUE`, the bandwidth will be discretized into `lambda.discrete.num+1` points and `lambda` will be chosen from these points `lambda.discrete.num` a positive integer indicating the number of discrete values that lambda can assume - this parameter will only be used when `lambda.discrete=TRUE` `random.seed` when it is not missing and not equal to 0, the initial points will be generated using this seed when `nmulti > 0` `max.bb.eval` argument passed to the NOMAD solver (see `snomadr` for further details) `initial.mesh.size.real` argument passed to the NOMAD solver (see `snomadr` for further details) `initial.mesh.size.integer` argument passed to the NOMAD solver (see `snomadr` for further details) `min.mesh.size.real` argument passed to the NOMAD solver (see `snomadr` for further details) `min.mesh.size.integer` arguments passed to the NOMAD solver (see `snomadr` for further details) `min.poll.size.real` arguments passed to the NOMAD solver (see `snomadr` for further details) `min.poll.size.integer` arguments passed to the NOMAD solver (see `snomadr` for further details) `opts` list of optional arguments to be passed to `snomadr` `nmulti` integer number of times to restart the process of finding extrema of the cross-validation function from different (random) initial points (default `nmulti=0`) `tau` if non-null a number in (0,1) denoting the quantile for which a quantile regression spline is to be estimated rather than estimating the conditional mean (default `tau=NULL`) `weights` an optional vector of weights to be used in the fitting process. Should be ‘NULL’ or a numeric vector. If non-NULL, weighted least squares is used with weights ‘weights’ (that is, minimizing ‘sum(w*e^2)’); otherwise ordinary least squares is used. `singular.ok` a logical value (default `singular.ok=FALSE`) that, when `FALSE`, discards singular bases during cross-validation (a check for ill-conditioned bases is performed).

## Details

`krscvNOMAD` computes NOMAD-based cross-validation for a regression spline estimate of a one (1) dimensional dependent variable on an `r`-dimensional vector of continuous and nominal/ordinal (`factor`/`ordered`) predictors. Numerical search for the optimal `degree`/`segments`/`lambda` is undertaken using `snomadr`.

The optimal `K`/`lambda` combination is returned along with other results (see below for return values). The method uses kernel functions appropriate for categorical (ordinal/nominal) predictors which avoids the loss in efficiency associated with sample-splitting procedures that are typically used when faced with a mix of continuous and nominal/ordinal (`factor`/`ordered`) predictors.

For the continuous predictors the regression spline model employs either the additive or tensor product B-spline basis matrix for a multivariate polynomial spline via the B-spline routines in the GNU Scientific Library (https://www.gnu.org/software/gsl/) and the `tensor.prod.model.matrix` function.

For the discrete predictors the product kernel function is of the ‘Li-Racine’ type (see Li and Racine (2007) for details).

## Value

`krscvNOMAD` returns a `crscv` object. Furthermore, the function `summary` supports objects of this type. The returned objects have the following components:

 `K` scalar/vector containing optimal degree(s) of spline or number of segments `K.mat` vector/matrix of values of `K` evaluated during search `degree.max` the maximum degree of the B-spline basis for each of the continuous predictors (default `degree.max=10`) `segments.max` the maximum segments of the B-spline basis for each of the continuous predictors (default `segments.max=10`) `degree.min` the minimum degree of the B-spline basis for each of the continuous predictors (default `degree.min=0`) `segments.min` the minimum segments of the B-spline basis for each of the continuous predictors (default `segments.min=1`) `restarts` number of restarts during search, if any `lambda` optimal bandwidths for categorical predictors `lambda.mat` vector/matrix of optimal bandwidths for each degree of spline `cv.func` objective function value at optimum `cv.func.vec` vector of objective function values at each degree of spline or number of segments in `K.mat`

## Author(s)

Jeffrey S. Racine racinej@mcmaster.ca and Zhenghua Nie niez@mcmaster.ca

## References

Abramson, M.A. and C. Audet and G. Couture and J.E. Dennis Jr. and S. Le Digabel (2011), “The NOMAD project”. Software available at https://www.gerad.ca/nomad.

Craven, P. and G. Wahba (1979), “Smoothing Noisy Data With Spline Functions,” Numerische Mathematik, 13, 377-403.

Hurvich, C.M. and J.S. Simonoff and C.L. Tsai (1998), “Smoothing Parameter Selection in Nonparametric Regression Using an Improved Akaike Information Criterion,” Journal of the Royal Statistical Society B, 60, 271-293.

Le Digabel, S. (2011), “Algorithm 909: NOMAD: Nonlinear Optimization With The MADS Algorithm”. ACM Transactions on Mathematical Software, 37(4):44:1-44:15.

Li, Q. and J.S. Racine (2007), Nonparametric Econometrics: Theory and Practice, Princeton University Press.

Ma, S. and J.S. Racine and L. Yang (2015), “Spline Regression in the Presence of Categorical Predictors,” Journal of Applied Econometrics, Volume 30, 705-717.

Ma, S. and J.S. Racine (2013), “Additive Regression Splines with Irrelevant Categorical and Continuous Regressors,” Statistica Sinica, Volume 23, 515-541.

`loess`, `npregbw`,
 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23``` ```set.seed(42) ## Simulated data n <- 1000 x <- runif(n) z <- round(runif(n,min=-0.5,max=1.5)) z.unique <- uniquecombs(as.matrix(z)) ind <- attr(z.unique,"index") ind.vals <- sort(unique(ind)) dgp <- numeric(length=n) for(i in 1:nrow(z.unique)) { zz <- ind == ind.vals[i] dgp[zz] <- z[zz]+cos(2*pi*x[zz]) } y <- dgp + rnorm(n,sd=.1) xdata <- data.frame(x,z=factor(z)) ## Compute the optimal K and lambda, determine optimal number of knots, set ## spline degree for x to 3 cv <- krscvNOMAD(x=xdata,y=y,complexity="knots",degree=c(3),segments=c(5)) summary(cv) ```