kernelMixedSmooth: Smoothing with conditioning on discrete and continuous variables

kernelMixedSmooth R Documentation

Smoothing with conditioning on discrete and continuous variables

Description

Smoothing with conditioning on discrete and continuous variables

Usage

kernelMixedSmooth(
  x,
  y,
  by,
  xout = NULL,
  byout = NULL,
  weights = NULL,
  parallel = FALSE,
  cores = 1,
  preschedule = TRUE,
  ...
)

Arguments

x

A numeric vector, matrix, or data frame containing observations. For density, the points used to compute the density. For kernel regression, the points corresponding to explanatory variables.

y

A numeric vector of dependent variable values.

by

A variable containing unique identifiers of discrete categories.

xout

A vector or a matrix of data points, with ncol(xout) = ncol(x), at which to compute the weights, density, or predictions, i.e. the requested evaluation grid. If NULL, x itself is used as the grid.

byout

A variable containing unique identifiers of discrete categories for the output grid (one identifier per row or element of xout).
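A minimal sketch of the xout/byout pairing (assuming the smoothemplik package providing kernelMixedSmooth is attached): each element of byout labels the category of the corresponding evaluation point in xout, so the two must have the same length. A convenient way to build such a paired grid is expand.grid:

x  <- c(rnorm(50), rnorm(50) + 2)
y  <- x^2 + rnorm(100)
by <- rep(1:2, each = 50)
# One category label per evaluation point
grid <- expand.grid(x = seq(-2, 4, 0.5), by = 1:2)
yhat <- kernelMixedSmooth(x = x, y = y, by = by, bw = 1,
                          xout = grid$x, byout = grid$by)
length(yhat) == nrow(grid)  # one prediction per evaluation point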

weights

A numeric vector of observation weights (typically counts) to perform weighted operations. If NULL, rep(1, NROW(x)) is used. In all calculations, the total number of observations is assumed to be the sum of the weights.
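Because the weights act as implicit observation counts, smoothing a sample with duplicated rows should agree with smoothing the unique rows weighted by their multiplicities. A hypothetical sketch of this equivalence (assuming the smoothemplik package is attached; the variable names are illustrative):

# Fully duplicated sample: x = 0 twice, x = 1 three times, x = 2 once
x  <- c(0, 0, 1, 1, 1, 2)
y  <- c(1, 1, 2, 2, 2, 5)
by <- rep(1, 6)
fit.dup <- kernelMixedSmooth(x = x, y = y, by = by, bw = 1)
# The same data compressed to unique values with count weights
xu <- c(0, 1, 2); yu <- c(1, 2, 5); wu <- c(2, 3, 1)
fit.wt <- kernelMixedSmooth(x = xu, y = yu, by = rep(1, 3), weights = wu,
                            xout = x, byout = by, bw = 1)
all.equal(fit.dup, fit.wt)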

parallel

Logical: if TRUE, parallelises the computation over the unique values of by. Currently supports only parallel::mclapply and therefore does not work on Windows.

cores

Integer: the number of CPU cores to use. Note that a high core count implies high RAM usage. If the number of unique values of by is less than the number of cores requested, only length(unique(by)) cores are used.

preschedule

Logical: passed as mc.preschedule to mclapply.

...

Passed to kernelSmooth (usually bw and the kernel type for both density and regression; degree and robust.iterations for smoothing).

Value

A numeric vector of the kernel estimate with length equal to the number of evaluation points, NROW(xout).

Examples

set.seed(1)
n <- 1000
z1 <- rbinom(n, 1, 0.5)
z2 <- rbinom(n, 1, 0.5)
x <- rnorm(n)
u <- rnorm(n)
y <- 1 + x^2 + z1 + 2*z2 + z1*z2 + u
by <- as.integer(interaction(list(z1, z2)))
out <- expand.grid(x = seq(-4, 4, 0.25), by = 1:4)
yhat <- kernelMixedSmooth(x = x, y = y, by = by, bw = 1, degree = 1,
                          xout = out$x, byout = out$by)
plot(x, y)
for (i in 1:4) lines(out$x[out$by == i], yhat[out$by == i], col = i+1, lwd = 2)
legend("top", c("00", "10", "01", "11"), col = 2:5, lwd = 2)

# The function works faster if there are duplicated values of the
# conditioning variables in the prediction grid and there are many
# observations; this is illustrated by the following example
# without a custom grid
# In this example, ignore the fact that the conditioning variable is rounded
# and therefore contains measurement error (ruining consistency)
x  <- rnorm(10000)
xout <- rnorm(5000)
xr <- round(x)
xrout <- round(xout)
w <- runif(10000, 1, 3)
y  <- 1 + x^2 + rnorm(10000)
by <- rep(1:4, each = 2500)
byout <- rep(1:4, each = 1250)
system.time(kernelMixedSmooth(x = x, y = y, by = by, weights = w,
                              xout = xout, byout = byout, bw = 1))
#  user  system elapsed
# 0.144   0.000   0.144
system.time(km1 <- kernelMixedSmooth(x = xr, y = y, by = by, weights = w,
                                     xout = xrout, byout = byout, bw = 1))
#  user  system elapsed
# 0.021   0.000   0.022
system.time(km2 <- kernelMixedSmooth(x = xr, y = y, by = by, weights = w,
                     xout = xrout, byout = byout, bw = 1, no.dedup = TRUE))
#  user  system elapsed
# 0.138   0.001   0.137
all.equal(km1, km2)

# Parallel capabilities shine in large data sets
if (.Platform$OS.type != "windows") {
# A function to carry out the same estimation in multiple cores
pFun <- function(n) kernelMixedSmooth(x = rep(x, 2), y = rep(y, 2),
         weights = rep(w, 2), by = rep(by, 2),
         bw = 1, degree = 0, parallel = TRUE, cores = n)
system.time(pFun(1))  # 0.6--0.7 s
system.time(pFun(2))  # 0.4--0.5 s
}

smoothemplik documentation built on Aug. 22, 2025, 1:11 a.m.