bw.CV | R Documentation |
Finds the optimal bandwidth by minimising the density cross-valication or least-squares criteria.
Remember that since usually, the CV function is highly non-linear, the return value should be taken with a grain of salt.
With non-smooth kernels (such as uniform), it will oftern return the local minimum after starting from a reasonable value.
The user might want to standardise the input matrix x
by column (divide by some estimator of scale, like sd
or IQR
) and examine the behaviour of the CV criterion as a function of unique bandwidth (same
argument).
If it seems that the optimum is unique, then they may proceed by multiplying the bandwidth by the scale measure,
and start the search for the optimal bandwidth in multiple dimensions.
bw.CV(
x,
y = NULL,
weights = NULL,
kernel = "gaussian",
order = 2,
PIT = FALSE,
chunks = 0,
robust.iterations = 0,
degree = 0,
start.bw = NULL,
same = FALSE,
tol = 1e-04,
try.grid = TRUE,
ndeps = 1e-05,
verbose = FALSE,
attach.attributes = FALSE,
control = list()
)
x |
A numeric vector, matrix, or data frame containing observations. For density, the points used to compute the density. For kernel regression, the points corresponding to explanatory variables. |
y |
A numeric vector of responses (dependent variable) if the user wants least-squares cross-validation. |
weights |
A numeric vector of observation weights (typically counts) to
perform weighted operations. If null, |
kernel |
Character describing the desired kernel type. NB: due to limited machine precision, even Gaussian has finite support. |
order |
An integer: 2, 4, or 6. Order-2 kernels are the standard kernels that are positive everywhere. Orders 4 and 6 produce some negative values, which reduces bias but may hamper density estimation. |
PIT |
If TRUE, the Probability Integral Transform (PIT) is applied to all columns
of |
chunks |
Integer: the number of chunks to split the task into (limits
RAM usage but increases overhead). |
robust.iterations |
Passed to |
degree |
Passed to |
start.bw |
Numeric vector: initial value for bandwidth search. |
same |
Logical: use the same bandwidth for all columns of |
tol |
Relative tolerance used by the optimiser as the stopping criterion. |
try.grid |
Logical: if true, 10 different bandwidths around the rule-of-thumb
one are tried with multiplier |
ndeps |
Numerical-difference epsilon. Puts a lower bound on the result: the estimated optimal bw cannot be less than this value. |
verbose |
Logical: print out the optimiser return code for diagnostics? |
attach.attributes |
Logical: if TRUE, returns the output of 'optim()' for diagnostics. |
control |
List: extra arguments to pass to the control-argument list of 'optim'. |
If y
is NULL and only x
is supplied, returns the density-cross-validated bandwidth (DCV).
If y
is supplied, then, returns the least-squares-cross-validated bandwidth (LSCV).
Numeric vector or scalar of the optimal bandwidth.
set.seed(1) # Creating a data set with many duplicates
n.uniq <- 200
n <- 500
inds <- sort(ceiling(runif(n, 0, n.uniq)))
x.uniq <- sort(rnorm(n.uniq))
y.uniq <- 1 + 0.1*x.uniq + sin(x.uniq) + rnorm(n.uniq)
x <- x.uniq[inds]
y <- y.uniq[inds]
w <- 1 + runif(n, 0, 2) # Relative importance
data.table::setDTthreads(1) # For measuring the pure gains and overhead
RcppParallel::setThreadOptions(numThreads = 1)
bw.grid <- seq(0.1, 1.3, 0.2)
CV <- LSCV(x, y, bw.grid, weights = w)
bw.init <- bw.grid[which.min(CV)]
bw.opt <- bw.CV(x, y, w) # 0.49, very close
g <- seq(-3.5, 3.5, 0.05)
yhat <- kernelSmooth(x, y, g, w, bw.opt, deduplicate.xout = FALSE)
oldpar <- par(mfrow = c(2, 1), mar = c(2, 2, 2, 0)+.1)
plot(bw.grid, CV, bty = "n", xlab = "", ylab = "", main = "Cross-validation")
points(bw.opt, LSCV(x, y, bw.opt, w), col = 2, pch = 15)
plot(x.uniq, y.uniq, bty = "n", xlab = "", ylab = "", main = "Optimal fit")
points(g, yhat, pch = 16, col = 2, cex = 0.5)
par(oldpar)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.