LSCV | R Documentation
Least-squares cross-validation function for the Nadaraya–Watson estimator
Usage

LSCV(
  x,
  y,
  bw,
  weights = NULL,
  same = FALSE,
  degree = 0,
  kernel = "gaussian",
  order = 2,
  PIT = FALSE,
  chunks = 0,
  robust.iterations = 0,
  cores = 1
)
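For reference, the quantity computed for each candidate bandwidth is the usual leave-one-out least-squares criterion. The sketch below uses LaTeX notation; the exact weighting convention is an assumption inferred from the weights argument, not a statement of the package's internals.

\mathrm{CV}(b) = \frac{1}{\sum_i w_i} \sum_{i=1}^{n} w_i \left( y_i - \hat{m}_{-i}(x_i; b) \right)^2,
\qquad
\hat{m}_{-i}(x_i; b) = \frac{\sum_{j \ne i} w_j \, K\!\left( (x_j - x_i)/b \right) y_j}{\sum_{j \ne i} w_j \, K\!\left( (x_j - x_i)/b \right)},

where \hat{m}_{-i} is the Nadaraya–Watson estimator computed without observation i. Dropping the own observation corresponds to zeroing the diagonal of the kernel weight matrix mentioned under cores below.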
Arguments

x: A numeric vector, matrix, or data frame containing observations. For density estimation, the points used to compute the density; for kernel regression, the points corresponding to the explanatory variables.

y: A numeric vector of dependent variable values.

bw: Candidate bandwidth values: a scalar, a vector, or a matrix (with columns corresponding to the columns of x).

weights: A numeric vector of observation weights (typically counts) to perform weighted operations. If NULL, equal weights are used.

same: Logical: use the same bandwidth for all columns of x?

degree: Integer: 0 for the locally constant estimator (Nadaraya–Watson), 1 for locally linear (Cleveland's LOESS), 2 for locally quadratic (use with care: it is less stable and requires larger bandwidths).

kernel: Character string describing the desired kernel type. NB: owing to limited machine precision, even the Gaussian kernel effectively has finite support.

order: An integer: 2, 4, or 6. Order-2 kernels are the standard kernels that are positive everywhere. Orders 4 and 6 take some negative values, which reduces bias but may hamper density estimation.

PIT: If TRUE, the Probability Integral Transform (PIT) is applied to all columns of x.

chunks: Integer: the number of chunks to split the task into (limits RAM usage but increases overhead).

robust.iterations: The number of robustifying iterations (due to Cleveland, 1979). If greater than 0, observations with large residuals are iteratively down-weighted, as in LOWESS.

cores: Integer: the number of CPU cores to use. A higher core count implies higher RAM usage. Note: since LSCV requires zeroing out the diagonal of the weight matrix, repeated observations are combined first; de-duplication is therefore forced in cross-validation (see the sketch below). The only situation in which de-duplication can be skipped is when already de-duplicated data sets are passed from outside (e.g. inside optimisers).
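To illustrate the leave-one-out computation and why duplicates matter, here is a minimal conceptual sketch for scalar x with a Gaussian kernel. The function name lscvNaive is hypothetical and not part of the package; the real implementation combines repeated observations and parallelises the work.

# Hypothetical naive O(n^2) version of the criterion, for illustration only
lscvNaive <- function(x, y, b, w = rep(1, length(x))) {
  K <- dnorm(outer(x, x, "-") / b)  # n x n matrix of kernel weights
  diag(K) <- 0                      # leave-one-out: drop the own observation
  num <- K %*% (w * y)              # weighted kernel-smoothed responses
  den <- K %*% w
  yhat <- as.vector(num / den)      # NW prediction at x[i] without obs. i
  weighted.mean((y - yhat)^2, w)    # weighted mean squared LOO error
}

With many duplicated rows, the n x n matrix above wastes memory and computation, which is why LSCV collapses repeated observations into unique points with counts before zeroing the diagonal.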
Value

A numeric vector of the same length as bw (or, if bw is a matrix, of length nrow(bw)).
Examples

set.seed(1) # Creating a data set with many duplicates
n.uniq <- 1000
n <- 4000
inds <- sort(ceiling(runif(n, 0, n.uniq)))
x.uniq <- sort(rnorm(n.uniq))
y.uniq <- 1 + 0.2*x.uniq + 0.3*sin(x.uniq) + rnorm(n.uniq)
x <- x.uniq[inds]
y <- y.uniq[inds]
w <- 1 + runif(n, 0, 2) # Relative importance
data.table::setDTthreads(1) # For measuring pure gains and overhead
RcppParallel::setThreadOptions(numThreads = 1)
bw.grid <- seq(0.1, 1.2, 0.05)
ncores <- if (.Platform$OS.type == "windows") 1 else 2
CV <- LSCV(x, y, bw.grid, weights = w, cores = ncores) # Parallel capabilities
bw.opt <- bw.grid[which.min(CV)]
g <- seq(-3.5, 3.5, 0.05)
yhat <- kernelSmooth(x, y, xout = g, weights = w,
bw = bw.opt, deduplicate.xout = FALSE)
oldpar <- par(mfrow = c(2, 1), mar = c(2, 2, 2, 0) + 0.1)
plot(bw.grid, CV, bty = "n", xlab = "", ylab = "", main = "Cross-validation")
plot(x.uniq, y.uniq, bty = "n", xlab = "", ylab = "", main = "Optimal fit")
points(g, yhat, pch = 16, col = 2, cex = 0.5)
par(oldpar)
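Instead of a grid search, the same criterion can be handed to a numerical optimiser. The following is a minimal sketch; the search interval c(0.1, 1.2) simply mirrors the grid above and is an assumption, not a recommendation.

# Minimise the LSCV criterion directly over the bandwidth
bw.auto <- optimize(function(b) LSCV(x, y, b, weights = w),
                    interval = c(0.1, 1.2))$minimum

As the note under cores suggests, each such call repeats the internal de-duplication; passing already de-duplicated data from outside avoids that overhead inside tight optimisation loops.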