Cross-validation bandwidth selection in nonparametric regression models

Description

From a sample {(Y_i, t_i): i=1,...,n}, this routine computes, for each l_n considered, an optimal bandwidth for estimating m in the regression model

Y_i= m(t_i) + ε_i.

The regression function, m, is smooth but unknown, and the random errors, {ε_i}, may be dependent (that is, they may form a time series). The optimal bandwidth is selected by means of the leave-(2l_n + 1)-out cross-validation procedure. Kernel smoothing is used.
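As a rough illustration of the leave-(2l_n + 1)-out idea, the following base R sketch computes a cross-validation score for a Nadaraya-Watson fit. It is not the package's implementation: the helper names epan and cv_block, the bandwidth grid, and the averaging of the squared residuals are assumptions made for this example.

```r
# Illustrative sketch only: not the PLRModels source code.
# Epanechnikov ("quadratic") kernel, supported on [-1, 1].
epan <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)

# Leave-(2l+1)-out CV score for a Nadaraya-Watson fit with bandwidth h.
# y, t: response and design points, assumed ordered in time; l: the l_n value.
cv_block <- function(y, t, h, l = 0) {
  n <- length(y)
  res2 <- rep(NA_real_, n)
  for (i in seq_len(n)) {
    keep <- abs(seq_len(n) - i) > l        # drop i and its l neighbours on each side
    k <- epan((t[keep] - t[i]) / h)
    if (sum(k) > 0) {
      m_hat <- sum(k * y[keep]) / sum(k)   # NW estimate of m(t_i) without the block
      res2[i] <- (y[i] - m_hat)^2
    }
  }
  mean(res2, na.rm = TRUE)
}

# Simulated data in the style of Example 2a
set.seed(1)
n <- 100
t <- ((1:n) - 0.5) / n
y <- 0.25 * t * (1 - t) + rnorm(n, 0, 0.01)

h_grid <- seq(0.05, 0.25, length.out = 21)  # hypothetical bandwidth grid
cv_vals <- sapply(h_grid, function(h) cv_block(y, t, h, l = 1))
h_grid[which.min(cv_vals)]                  # bandwidth minimising the sketch's CV score
```

Setting l = 0 gives ordinary leave-one-out cross-validation; larger values of l remove the neighbours most correlated with Y_i when the errors are dependent.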

Usage

np.cv(data = data, h.seq = NULL, num.h = 50, w = NULL, num.ln = 1, 
ln.0 = 0, step.ln = 2, estimator = "NW", kernel = "quadratic")

Arguments

data

data[, 1] contains the values of the response variable, Y;

data[, 2] contains the values of the explanatory variable, t.

h.seq

sequence of considered bandwidths in the CV function. If NULL (the default), num.h equidistant values between zero and a quarter of the range of t_i are considered.

num.h

number of values used to build the sequence of considered bandwidths. If h.seq is not NULL, num.h=length(h.seq). Otherwise, the default is 50.

w

support interval of the weight function in the CV function. If NULL (the default), (q_{0.1}, q_{0.9}) is considered, where q_p denotes the quantile of order p of {t_i}.

num.ln

number of values for l_n: 2l_{n} + 1 observations around each point t_i are eliminated to estimate m(t_i) in the CV function. The default is 1.

ln.0

minimum value for l_n. The default is 0.

step.ln

distance between two consecutive values of l_n. The default is 2.

estimator

estimator to use: either “NW” (Nadaraya-Watson) or “LLP” (Local Linear Polynomial). The default is “NW”.

kernel

kernel to use: “gaussian”, “quadratic” (Epanechnikov kernel), “triweight” or “uniform”. The default is “quadratic”.
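The following sketch shows one reading of how ln.0, step.ln and num.ln define the l_n values, and how the default w can be obtained from the quantiles of t. The variable names ln_values and block_sizes are illustrative, and the code reflects the argument descriptions above rather than the package source.

```r
# Assumed reading of the documented arguments; not taken from the package source.
ln.0 <- 1; step.ln <- 1; num.ln <- 2             # values used in Example 1
ln_values <- ln.0 + step.ln * (0:(num.ln - 1))   # l_n values considered
block_sizes <- 2 * ln_values + 1                 # observations removed around each t_i

t <- ((1:100) - 0.5) / 100                       # design points as in Example 2a
w <- unname(quantile(t, c(0.1, 0.9)))            # default support (q_0.1, q_0.9)

ln_values
block_sizes
w
```

With these settings the l_n values are 1 and 2, which matches the panel titles used in Example 1.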

Details

A weight function (specifically, the indicator function 1_{[w[1] , w[2]]}) is introduced in the CV function to allow elimination (or at least significant reduction) of boundary effects from the estimate of m(t_i).

For more details, see Chu and Marron (1991).
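A minimal sketch of how such an indicator weight could enter a CV score is given below. The residuals are placeholder values and the division by n is an assumption; the package may normalise differently.

```r
# Sketch of the indicator weight 1_{[w[1], w[2]]}; variable names are illustrative.
t <- ((1:100) - 0.5) / 100
w <- quantile(t, c(0.1, 0.9))
weight <- as.numeric(t >= w[1] & t <= w[2])   # 1 inside the support, 0 outside

# Squared leave-out residuals (placeholder values here) contribute to the score
# only where weight is 1, so points near the boundary are ignored.
res2 <- rep(1, length(t))
weighted_score <- sum(weight * res2) / length(t)   # normalisation by n is assumed
sum(weight)                                         # number of points that contribute
```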

Value

h.opt

data frame containing, for each ln considered, the selected value for the bandwidth.

CV.opt

CV.opt[k] is the minimum value of the CV function when the k-th value of ln is considered.

CV

matrix containing the values of the CV function for each bandwidth and ln considered.

w

support interval of the weight function in the CV function.

h.seq

sequence of considered bandwidths in the CV function.
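The returned components appear to be related as in the sketch below, which uses a small made-up CV matrix (rows for bandwidths, columns for ln values, the layout used in Example 1) instead of real np.cv output. The names idx, h_sel and cv_min are hypothetical.

```r
# Hypothetical stand-in for np.cv output: one row per bandwidth,
# one column per l_n value (made-up numbers, for illustration only).
h.seq <- seq(0.05, 0.25, length.out = 5)
CV <- cbind(c(5, 3, 2, 4, 6),    # CV values for the first l_n
            c(7, 6, 5, 3, 4))    # CV values for the second l_n

idx <- apply(CV, 2, which.min)               # row of the minimum in each column
h_sel <- h.seq[idx]                          # bandwidth minimising CV for each l_n
cv_min <- CV[cbind(idx, seq_len(ncol(CV)))]  # the corresponding minimum CV values

h_sel
cv_min
```

Under this reading, h_sel plays the role of h.opt and cv_min the role of CV.opt.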

Author(s)

German Aneiros Perez ganeiros@udc.es

Ana Lopez Cheda ana.lopez.cheda@udc.es

References

Chu, C-K and Marron, J.S. (1991) Comparison of two bandwidth selectors with dependent errors. The Annals of Statistics 19, 1906-1918.

See Also

Other related functions are: np.est, np.gcv, plrm.est, plrm.gcv and plrm.cv.

Examples

# EXAMPLE 1: REAL DATA
data <- matrix(10,120,2)
data(barnacles1)
barnacles1 <- as.matrix(barnacles1)
data[,1] <- barnacles1[,1]
data <- diff(data, 12)
data[,2] <- 1:nrow(data)

aux <- np.cv(data, ln.0=1,step.ln=1, num.ln=2)
aux$h.opt
plot.ts(aux$CV)

par(mfrow=c(2,1))
plot(aux$h.seq,aux$CV[,1], xlab="h", ylab="CV", type="l", main="ln=1")
plot(aux$h.seq,aux$CV[,2], xlab="h", ylab="CV", type="l", main="ln=2")



# EXAMPLE 2: SIMULATED DATA
## Example 2a: independent data

set.seed(1234)
# We generate the data
n <- 100
t <- ((1:n)-0.5)/n
m <- function(t) {0.25*t*(1-t)}
f <- m(t)

epsilon <- rnorm(n, 0, 0.01)
y <-  f + epsilon
data_ind <- matrix(c(y,t),nrow=n)

# We apply the function
a <-np.cv(data_ind)
a$CV.opt

CV <- a$CV
h <- a$h.seq
plot(h,CV,type="l")
