# np.cv: Cross-validation bandwidth selection in nonparametric...

## Description

From a sample {(Y_i, t_i): i=1,...,n}, this routine computes, for each l_n considered, an optimal bandwidth for estimating m in the regression model

Y_i= m(t_i) + ε_i.

The regression function, m, is a smooth but unknown function, and the random errors, {ε_i}, are allowed to be time series. The optimal bandwidth is selected by means of the leave-(2l_n + 1)-out cross-validation procedure. Kernel smoothing is used.
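
The procedure described above can be sketched in base R. This is a minimal illustration, not the package's implementation: it assumes a Nadaraya-Watson estimator with the Epanechnikov kernel, takes "around each point" to mean index distance in the ordered sample, and normalises the score by n. The helper names `epan`, `nw_leave_out` and `cv_ln` are hypothetical.

```r
# Epanechnikov ("quadratic") kernel
epan <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)

# Nadaraya-Watson estimate of m(t[i]) that ignores the 2*l + 1 observations
# whose index is within l of i (hypothetical helper, for illustration only)
nw_leave_out <- function(i, y, t, h, l) {
  keep <- abs(seq_along(t) - i) > l
  k <- epan((t[keep] - t[i]) / h)
  sum(k * y[keep]) / sum(k)
}

# Leave-(2l + 1)-out CV score for one bandwidth h; only points inside
# w = c(lower, upper) contribute, which reduces boundary effects
cv_ln <- function(h, y, t, l, w) {
  inside <- which(t >= w[1] & t <= w[2])
  fit <- vapply(inside, nw_leave_out, numeric(1), y = y, t = t, h = h, l = l)
  sum((y[inside] - fit)^2) / length(y)
}

# Usage on simulated data (assumed set-up, for illustration)
set.seed(1234)
n <- 100
t <- ((1:n) - 0.5) / n
y <- 0.25 * t * (1 - t) + rnorm(n, 0, 0.01)

h_grid <- seq(0.05, 0.25, length.out = 20)
w <- unname(quantile(t, c(0.1, 0.9)))
cv_values <- vapply(h_grid, cv_ln, numeric(1), y = y, t = t, l = 1, w = w)
h_best <- h_grid[which.min(cv_values)]  # bandwidth minimising the sketch's CV score
```

The grid starts at 0.05 rather than near zero because, with `l = 1` and a spacing of 0.01 between design points, very small bandwidths would leave no observations inside the kernel window.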

## Usage

```r
np.cv(data = data, h.seq = NULL, num.h = 50, w = NULL, num.ln = 1,
      ln.0 = 0, step.ln = 2, estimator = "NW", kernel = "quadratic")
```

## Arguments

| Argument | Description |
|---|---|
| `data` | `data[, 1]` contains the values of the response variable, Y; `data[, 2]` contains the values of the explanatory variable, t. |
| `h.seq` | Sequence of bandwidths considered in the CV function. If `NULL` (the default), `num.h` equidistant values between zero and a quarter of the range of t_i are considered. |
| `num.h` | Number of values used to build the sequence of considered bandwidths. If `h.seq` is not `NULL`, `num.h = length(h.seq)`. Otherwise, the default is 50. |
| `w` | Support interval of the weight function in the CV function. If `NULL` (the default), (q_{0.1}, q_{0.9}) is considered, where q_p denotes the quantile of order p of {t_i}. |
| `num.ln` | Number of values for l_n: 2l_n + 1 observations around each point t_i are eliminated to estimate m(t_i) in the CV function. The default is 1. |
| `ln.0` | Minimum value for l_n. The default is 0. |
| `step.ln` | Distance between two consecutive values of l_n. The default is 2. |
| `estimator` | Estimator to use: `"NW"` (Nadaraya-Watson) or `"LLP"` (Local Linear Polynomial). The default is `"NW"`. |
| `kernel` | Kernel to use: `"gaussian"`, `"quadratic"` (Epanechnikov kernel), `"triweight"` or `"uniform"`. The default is `"quadratic"`. |
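
To make the defaults concrete, the following base-R sketch builds the grids that the argument descriptions imply. It is an illustration under assumptions, not the package's code: the descriptions do not say whether the bandwidth grid includes its endpoints (zero is dropped here because a zero bandwidth is not usable), and the l_n sequence is one reading of `ln.0`, `step.ln` and `num.ln`. The variable `h_max` is hypothetical.

```r
t <- ((1:100) - 0.5) / 100           # example design points

num.h <- 50
h_max <- diff(range(t)) / 4          # a quarter of the range of t
# Assumption: num.h equidistant values in (0, h_max], dropping zero
h.seq <- seq(0, h_max, length.out = num.h + 1)[-1]

# Default support of the weight function: (q_0.1, q_0.9) of t
w <- unname(quantile(t, probs = c(0.1, 0.9)))

# Values of l_n implied by ln.0, step.ln and num.ln (assumed reading)
ln.0 <- 0
step.ln <- 2
num.ln <- 3
ln <- ln.0 + step.ln * (0:(num.ln - 1))   # 0, 2, 4
```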

## Details

A weight function (specifically, the indicator function 1_{[w[1] , w[2]]}) is introduced in the CV function to allow elimination (or at least significant reduction) of boundary effects from the estimate of m(t_i).
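
In the notation of this page, a criterion of this kind can be written as follows. This is a sketch: the normalisation by n and the symbol for the leave-(2l_n + 1)-out estimate are assumptions, not taken from the package source.

$$
CV_{l_n}(h) = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{m}_{h,\,l_n}(t_i) \right)^2 \, 1_{[w[1],\, w[2]]}(t_i)
$$

Here \hat{m}_{h, l_n}(t_i) denotes the kernel estimate of m(t_i) with bandwidth h, computed without the 2l_n + 1 observations around t_i. The selected bandwidth for a given l_n is the value in `h.seq` that minimizes this function.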

For more details, see Chu and Marron (1991).

## Value

| Component | Description |
|---|---|
| `h.opt` | Data frame containing, for each `ln` considered, the selected value for the bandwidth. |
| `CV.opt` | `CV.opt[k]` is the minimum value of the CV function when the k-th value of `ln` is considered. |
| `CV` | Matrix containing the values of the CV function for each bandwidth and `ln` considered. |
| `w` | Support interval of the weight function in the CV function. |
| `h.seq` | Sequence of bandwidths considered in the CV function. |
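
The relation between `CV`, `h.seq`, `h.opt` and `CV.opt` can be illustrated with a toy matrix. This sketch assumes that rows of `CV` index bandwidths and columns index values of `ln`, which is consistent with the way the examples on this page plot `CV[, 1]` against `h.seq`. The numbers are made up for illustration, and the real `h.opt` is a data frame rather than a plain vector.

```r
h.seq <- c(0.05, 0.10, 0.15, 0.20)
# Toy CV matrix: one row per bandwidth, one column per value of ln
CV <- cbind(c(0.9, 0.4, 0.6, 0.8),
            c(0.7, 0.5, 0.3, 0.6))

idx <- apply(CV, 2, which.min)   # row index of the minimum in each column
h.opt <- h.seq[idx]              # bandwidth at the minimum, per value of ln
CV.opt <- apply(CV, 2, min)      # minimum CV value, per value of ln
```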

## Author(s)

German Aneiros Perez

Ana Lopez Cheda

## References

Chu, C-K and Marron, J.S. (1991) Comparison of two bandwidth selectors with dependent errors. The Annals of Statistics 19, 1906-1918.

## See Also

Other related functions are: `np.est`, `np.gcv`, `plrm.est`, `plrm.gcv` and `plrm.cv`.

## Examples

```r
library(PLRModels)  # provides np.cv and the barnacles1 data set

# EXAMPLE 1: REAL DATA
data <- matrix(10, 120, 2)
data(barnacles1)
barnacles1 <- as.matrix(barnacles1)
data[, 1] <- barnacles1[, 1]
data <- diff(data, 12)
data[, 2] <- 1:nrow(data)

aux <- np.cv(data, ln.0 = 1, step.ln = 1, num.ln = 2)
aux$h.opt
plot.ts(aux$CV)

par(mfrow = c(2, 1))
plot(aux$h.seq, aux$CV[, 1], xlab = "h", ylab = "CV", type = "l", main = "ln=1")
plot(aux$h.seq, aux$CV[, 2], xlab = "h", ylab = "CV", type = "l", main = "ln=2")


# EXAMPLE 2: SIMULATED DATA
## Example 2a: independent data

set.seed(1234)
# We generate the data
n <- 100
t <- ((1:n) - 0.5) / n
m <- function(t) {0.25 * t * (1 - t)}
f <- m(t)

epsilon <- rnorm(n, 0, 0.01)
y <- f + epsilon
data_ind <- matrix(c(y, t), nrow = 100)

# We apply the function
a <- np.cv(data_ind)
a$CV.opt

CV <- a$CV
h <- a$h.seq
plot(h, CV, type = "l")
```