LSCV: Least-squares cross-validation function for the Nadaraya-Watson estimator

Description

Computes the least-squares leave-one-out cross-validation criterion for local kernel regression: the Nadaraya-Watson (locally constant) estimator by default, or local polynomial estimators chosen via the degree argument.

Usage

LSCV(
  x,
  y,
  bw,
  weights = NULL,
  same = FALSE,
  degree = 0,
  kernel = "gaussian",
  order = 2,
  PIT = FALSE,
  chunks = 0,
  robust.iterations = 0,
  cores = 1
)

Arguments

x

A numeric vector, matrix, or data frame containing observations. For density, the points used to compute the density. For kernel regression, the points corresponding to explanatory variables.

y

A numeric vector of dependent variable values.

bw

Candidate bandwidth values: scalar, vector, or a matrix (with columns corresponding to columns of x).

weights

A numeric vector of observation weights (typically counts) used to perform weighted operations. If NULL, rep(1, NROW(x)) is used. In all calculations, the total number of observations is taken to be the sum of the weights.
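
For example (a minimal sketch in base R; the variable names are illustrative, not the package's), counts of repeated observations can serve as weights:

xd <- c(1, 1, 2, 3, 3, 3)      # Data with duplicates
tab <- table(xd)
x.u <- as.numeric(names(tab))  # Unique points: 1 2 3
w.u <- as.numeric(tab)         # Counts: 2 1 3; sum(w.u) == length(xd)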

same

Logical: use the same bandwidth for all columns of x?

degree

Integer: 0 for the locally constant estimator (Nadaraya-Watson), 1 for locally linear (Cleveland's LOESS), or 2 for locally quadratic (use with care: it is less stable and requires larger bandwidths).
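
A minimal sketch comparing these settings on a toy data set (it assumes the smoothemplik package is attached; kernel and order are left at their defaults):

set.seed(2)  # Toy data for comparing polynomial degrees
x0 <- rnorm(200)
y0 <- sin(x0) + rnorm(200, sd = 0.3)
bws <- seq(0.2, 1, 0.1)
CV0 <- LSCV(x0, y0, bws)              # Locally constant (Nadaraya-Watson)
CV1 <- LSCV(x0, y0, bws, degree = 1)  # Locally linear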

kernel

A character string selecting the desired kernel type. NB: owing to limited machine precision, even the Gaussian kernel effectively has finite support.

order

An integer: 2, 4, or 6. Order-2 kernels are the standard kernels that are non-negative everywhere. Orders 4 and 6 take some negative values, which reduces bias but may hamper density estimation.

PIT

If TRUE, the probability integral transform (PIT) is applied to all columns of x via ecdf to map all values into the [0, 1] range. It may also be an integer vector of indices of the columns to which the PIT should be applied.
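
Since the transform is the empirical CDF, its effect on a single column can be previewed directly (a small base-R sketch):

z <- rnorm(100, mean = 5, sd = 10)  # Arbitrary location and scale
u <- ecdf(z)(z)                     # Ranks divided by n, mapped into (0, 1]
range(u)                            # c(0.01, 1) for 100 distinct values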

chunks

Integer: the number of chunks to split the task into (this limits RAM usage at the cost of extra overhead). 0 = auto-select, ensuring that no matrix has more than 2^27 elements.
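
A back-of-the-envelope illustration of the auto-selection rule as stated (my reading of the 2^27-element limit, not the package's exact internals):

n <- 20000           # Number of observations
n^2                  # An n-by-n weight matrix has 4e8 elements
ceiling(n^2 / 2^27)  # So roughly 3 chunks keep each piece under 2^27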

robust.iterations

The number of robustifying iterations (due to Cleveland, 1979). If greater than 0, xout is ignored.

cores

Integer: the number of CPU cores to use. High core count = high RAM usage.

Note: since LSCV requires zeroing out the diagonal of the weight matrix, repeated observations are combined first; de-duplication is therefore forced during cross-validation. It can be skipped only when an already de-duplicated data set is passed from outside (e.g. inside optimisers).
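
A minimal sketch of such outside de-duplication (hypothetical helper code in base R, reusing x, y, and w as constructed in the Examples below; the package may offer its own utilities):

key  <- paste(x, y)                                 # Identify repeated (x, y) pairs
keep <- !duplicated(key)
w.u  <- as.numeric(tapply(w, key, sum)[key[keep]])  # Summed weights per unique pair
CVfun <- function(b) LSCV(x[keep], y[keep], b, weights = w.u)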

Value

A numeric vector of criterion values, one per candidate bandwidth: its length equals length(bw) if bw is a vector, or nrow(bw) if bw is a matrix.

Examples

set.seed(1)  # Creating a data set with many duplicates
n.uniq <- 1000
n <- 4000
inds <- sort(ceiling(runif(n, 0, n.uniq)))
x.uniq <- sort(rnorm(n.uniq))
y.uniq <- 1 + 0.2*x.uniq + 0.3*sin(x.uniq) + rnorm(n.uniq)
x <- x.uniq[inds]
y <- y.uniq[inds]
w <- 1 + runif(n, 0, 2) # Relative importance
data.table::setDTthreads(1) # For measuring pure gains and overhead
RcppParallel::setThreadOptions(numThreads = 1)
bw.grid <- seq(0.1, 1.2, 0.05)
ncores <- if (.Platform$OS.type == "windows") 1 else 2
CV <- LSCV(x, y, bw.grid, weights = w, cores = ncores)  # Parallel capabilities
bw.opt <- bw.grid[which.min(CV)]
g <- seq(-3.5, 3.5, 0.05)
yhat <- kernelSmooth(x, y, xout = g, weights = w,
                     bw = bw.opt, deduplicate.xout = FALSE)
oldpar <- par(mfrow = c(2, 1), mar = c(2, 2, 2, 0) + 0.1)
plot(bw.grid, CV, bty = "n", xlab = "", ylab = "", main = "Cross-validation")
plot(x.uniq, y.uniq, bty = "n", xlab = "", ylab = "", main = "Optimal fit")
points(g, yhat, pch = 16, col = 2, cex = 0.5)
par(oldpar)
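
Beyond the grid search above, the minimum can be refined with a one-dimensional optimiser, since bw may be a scalar (a sketch, assuming the criterion is unimodal near the grid minimum):

bw.fine <- optimise(function(b) LSCV(x, y, b, weights = w),
                    interval = range(bw.grid))$minimum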
