prepareKernel: Check the data for kernel estimation

prepareKernel {smoothemplik}    R Documentation

Check the data for kernel estimation

Description

Checks whether the order is 2, 4, or 6, transforms the input objects into matrices, checks the dimensions, provides a bandwidth if none is supplied, creates default arguments to pass to the C++ functions, and carries out de-duplication for speed-up.

Usage

prepareKernel(
  x,
  y = NULL,
  xout = NULL,
  weights = NULL,
  bw = NULL,
  kernel = c("gaussian", "uniform", "triangular", "epanechnikov", "quartic"),
  order = 2,
  convolution = FALSE,
  sparse = FALSE,
  deduplicate.x = TRUE,
  deduplicate.xout = TRUE,
  no.dedup = FALSE,
  PIT = FALSE
)

Arguments

x

A numeric vector, matrix, or data frame containing observations. For density estimation, these are the points used to compute the density; for kernel regression, the values of the explanatory variables.

y

Optional: a vector of dependent variable values.

xout

A vector or a matrix of data points with ncol(xout) = ncol(x) at which the user desires to compute the weights, density, or predictions. In other words, this is the requested evaluation grid. If NULL, then x itself is used as the grid.

weights

A numeric vector of observation weights (typically counts) to perform weighted operations. If NULL, rep(1, NROW(x)) is used. In all calculations, the total number of observations is assumed to be the sum of weights.
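
In the simplest case, the weights are the multiplicities of repeated observations. A minimal, package-independent sketch of this interpretation:

z <- c(1.2, 1.2, 3.4, 3.4, 3.4, 5.6)          # 6 observations, 3 unique values
z.uniq <- unique(z)
w.cnt <- as.numeric(table(match(z, z.uniq)))  # Counts: 2, 3, 1
sum(w.cnt) == length(z)                       # TRUE: total number of observations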

bw

Bandwidth for the kernel: a scalar or a vector of the same length as ncol(x). Since the bandwidth is a crucial parameter in many applications, a warning is thrown if it is not supplied; in that case, Silverman's rule of thumb (via bw.rot()) is applied to every dimension of x.
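
For reference, the textbook version of Silverman's rule of thumb applied column by column can be sketched as follows (an illustration only, not necessarily the exact formula used internally):

silverman <- function(z) 0.9 * min(sd(z), IQR(z) / 1.34) * length(z)^(-1/5)
x.demo <- matrix(rnorm(200), ncol = 2)
apply(x.demo, 2, silverman)   # One bandwidth per dimension of x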

kernel

Character string describing the desired kernel type. NB: owing to limited machine precision, even the Gaussian kernel effectively has finite support.

order

An integer: 2, 4, or 6. Order-2 kernels are the standard kernels that are positive everywhere. Orders 4 and 6 take negative values in places, which reduces the bias but may hamper density estimation because densities must be non-negative.
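
To see why higher-order kernels take negative values, consider one common construction of a fourth-order Gaussian kernel (an illustration; the kernels used internally may differ):

k4 <- function(u) (3 - u^2) / 2 * dnorm(u)   # A standard 4th-order Gaussian kernel
k4(c(0, 1, 2, 3))                 # Negative in the tails (|u| > sqrt(3))
integrate(k4, -Inf, Inf)$value    # Still integrates to 1 like a regular kernel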

convolution

Logical: if FALSE, returns the usual kernel. If TRUE, returns the convolution kernel that is used in density cross-validation.

sparse

Logical: currently ignored (not yet implemented).

deduplicate.x

Logical: if TRUE, full duplicates in the input x and y are counted and transformed into weights; subsetting indices to reconstruct the duplicated data set from the unique one are also returned.

deduplicate.xout

Logical: if TRUE, full duplicates in the input xout are counted; subsetting indices to reconstruct the duplicated data set from the unique one are returned.
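
The idea behind the de-duplication (a generic sketch, not the package internals) is to keep only the unique rows, convert their multiplicities into weights, and store an index that maps every original row back to its unique representative:

xy <- cbind(x = c(1, 1, 2, 2, 2, 3), y = c(0, 0, 1, 1, 1, 0))
keys <- apply(xy, 1, paste, collapse = "\r")
idx  <- match(keys, unique(keys))        # Original row -> unique row
w.dup <- as.numeric(table(idx))          # Multiplicities become weights
xy.uniq <- xy[!duplicated(keys), , drop = FALSE]
all(xy.uniq[idx, ] == xy)                # TRUE: the original data are recoverable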

no.dedup

Logical: if TRUE, sets deduplicate.x and deduplicate.xout to FALSE (shorthand).

PIT

If TRUE, the Probability Integral Transform (PIT) is applied to all columns of x via ecdf in order to map all values into the [0, 1] range. May be an integer vector of indices of columns to which the PIT should be applied.
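
The transform itself is simply the empirical CDF applied to each column; a minimal sketch of the documented behaviour:

x.demo <- matrix(rexp(300), ncol = 3)
x.pit  <- apply(x.demo, 2, function(z) ecdf(z)(z))
range(x.pit)   # All values now lie in (0, 1]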

Value

A list of arguments that are accepted by kernelDensity() and kernelSmooth().

Examples

# De-duplication facilities
set.seed(1)  # Creating a data set with many duplicates
n.uniq <- 10000
n <- 60000
inds <- ceiling(runif(n, 0, n.uniq))
x.uniq <- matrix(rnorm(n.uniq*10), ncol = 10)
x <- x.uniq[inds, ]
y <- runif(n.uniq)[inds]
xout <- x.uniq[ceiling(runif(n.uniq*3, 0, n.uniq)), ]
w <- runif(n)
print(system.time(a1 <- prepareKernel(x, y, xout, w, bw = 0.5)))
print(system.time(a2 <- prepareKernel(x, y, xout, w, bw = 0.5,
                  deduplicate.x = FALSE, deduplicate.xout = FALSE)))
print(c(object.size(a1), object.size(a2)) / 1024) # Kilobytes used
# Speed-memory trade-off: the de-duplicated object is roughly 4 times
# smaller and takes about 0.2 s to prepare, but it reduces the number of
# pairwise kernel operations by the following fraction:
1 - prod(1 - a1$duplicate.stats[1:2])    # Approx. 95% fewer operations
sum(a1$weights) - sum(a2$weights)  # Should be 0 or near machine epsilon
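
# Further sketches based on the argument descriptions above
a3 <- prepareKernel(x, y, xout, w, bw = 0.5, PIT = TRUE)  # PIT-transformed copy
# Omitting bw should produce a warning and a rule-of-thumb bandwidth
# for every dimension of x
a4 <- prepareKernel(x, y, xout, w)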
