| npplregbw | R Documentation |
npplregbw computes a bandwidth object for a partially linear
kernel regression estimate of a one (1) dimensional dependent variable
on p+q-variate explanatory data, using the model
Y = X\beta + \Theta(Z) + \epsilon. It takes a set of
estimation points, training points (consisting of explanatory data and
dependent data), and a bandwidth specification, which can be an
rbandwidth object, or a bandwidth vector, bandwidth type, and
kernel type.
npplregbw(...)
## S3 method for class 'formula'
npplregbw(formula, data, subset, na.action, call, ...)
## Default S3 method:
npplregbw(xdat = stop("invoked without data `xdat'"),
ydat = stop("invoked without data `ydat'"),
zdat = stop("invoked without data `zdat'"),
bandwidth.compute = TRUE,
bws,
degree = NULL,
degree.select = c("manual", "coordinate", "exhaustive"),
search.engine = c("nomad+powell", "cell", "nomad"),
nomad = FALSE,
nomad.nmulti = 0L,
degree.min = NULL,
degree.max = NULL,
degree.start = NULL,
degree.restarts = 0L,
degree.max.cycles = 20L,
degree.verify = FALSE,
scale.factor.search.lower = NULL,
ftol,
itmax,
nmulti,
remin,
small,
tol,
...)
## S3 method for class 'plbandwidth'
npplregbw(xdat = stop("invoked without data `xdat'"),
ydat = stop("invoked without data `ydat'"),
zdat = stop("invoked without data `zdat'"),
bws,
nmulti,
...)
These arguments identify the linear, nonparametric, formula, and bandwidth inputs.
bandwidth.compute |
a logical value which specifies whether to do a numerical search for
bandwidths or not. If set to |
bws |
a bandwidth specification. This can be set as a If left unspecified, |
call |
the original function call. This is passed internally by
|
data |
an optional data frame, list or environment (or object
coercible to a data frame by |
formula |
a symbolic description of variables on which bandwidth selection is to be performed. The details of constructing a formula are described below. |
na.action |
a function which indicates what should happen when the data contain
|
subset |
an optional vector specifying a subset of observations to be used in the fitting process. |
xdat |
a |
ydat |
a one (1) dimensional numeric or integer vector of dependent data, each
element |
zdat |
a |
These arguments control automatic local-polynomial degree search.
degree.max |
optional scalar or integer vector giving upper bounds for automatic
degree search over continuous |
degree.max.cycles |
positive integer giving the maximum number of coordinate-search
sweeps over the degree vector. Ignored for |
degree.min |
optional scalar or integer vector giving lower bounds for automatic
degree search over continuous |
degree.restarts |
non-negative integer giving the number of additional deterministic
coordinate-search restarts. Ignored for |
degree.select |
character string controlling local-polynomial degree handling for
the nonparametric |
degree.start |
optional starting degree vector for automatic coordinate search. If
omitted, the search starts from the degree-zero local-constant
baseline on the continuous |
degree.verify |
logical value indicating whether a coordinate-search solution should
be exhaustively verified over the admissible degree grid after the
heuristic phase completes. Available only for
|
These controls define lower admissibility bounds for continuous fixed-bandwidth search.
scale.factor.search.lower |
optional nonnegative scalar giving the hard lower admissibility
bound for continuous fixed-bandwidth search candidates. Defaults to
|
These arguments control fixed local-polynomial specification for the nonparametric component.
degree |
for local-polynomial partially linear fits, polynomial degree
specification for each continuous nonparametric regressor in
|
These arguments control the optional NOMAD direct-search route for local-polynomial degree and bandwidth search.
nomad |
logical shortcut for the recommended automatic local-polynomial
NOMAD route for the nonparametric |
nomad.nmulti |
non-negative integer controlling the inner
|
search.engine |
character string controlling the automatic local-polynomial search
backend for the nonparametric |
These controls set optimizer tolerances and restart behavior.
ftol |
tolerance on the value of the cross-validation function
evaluated at located minima. Defaults to |
itmax |
integer number of iterations before failure in the numerical
optimization routine. Defaults to |
nmulti |
integer number of times to restart the process of finding extrema of
the cross-validation function from different (random) initial
points. Defaults to |
remin |
a logical value which when set as |
small |
a small number, at about the precision of the data type
used. Defaults to |
tol |
tolerance on the position of located minima of the
cross-validation function. Defaults to |
These arguments collect remaining controls passed through S3 methods.
... |
additional arguments supplied to specify the regression type,
bandwidth type, kernel types, selection methods, and so on. To do
this, you may specify any of |
The scale.factor.* controls are dimensionless search
controls. The package converts scale factors to bandwidths using the
estimator-specific scaling encoded in the bandwidth object, including
kernel order and the number of continuous variables relevant for the
estimator. Users should not pre-multiply these controls by sample-size
or standard-deviation factors.
scale.factor.init controls the deterministic first search
start when that control is exposed. scale.factor.init.lower
and scale.factor.init.upper define the random multistart
interval when exposed. scale.factor.search.lower is the lower
admissibility bound for continuous fixed-bandwidth search candidates.
The effective first start is max(scale.factor.init,
scale.factor.search.lower) when both controls are present, and the
effective random-start lower endpoint is
max(scale.factor.init.lower, scale.factor.search.lower).
scale.factor.init.upper must be at least that effective lower
endpoint; the package errors rather than silently expanding the user's
interval.
When scale.factor.search.lower is NULL, an existing
bandwidth object's stored floor is inherited when available;
otherwise the package default 0.1 is used. Explicit bandwidths
supplied for storage with bandwidth.compute = FALSE are not
rewritten by the search floor.
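The start and floor rules described above reduce to a few max() comparisons. The base-R sketch below restates them under stated assumptions: the helpers resolve_floor and effective_starts are hypothetical names that are not part of np, the numeric inputs are made up for illustration, and the code is not the package's implementation.

```r
# Hypothetical helpers (not np functions) restating the documented rules.

# Resolve the search floor: explicit value, else an inherited stored floor,
# else the package default of 0.1 stated in the text.
resolve_floor <- function(search.lower = NULL, stored.floor = NULL) {
  if (!is.null(search.lower)) return(search.lower)
  if (!is.null(stored.floor)) return(stored.floor)
  0.1
}

# Combine the start controls with the resolved floor.
effective_starts <- function(init, init.lower, init.upper, search.floor) {
  first <- max(init, search.floor)        # effective deterministic first start
  lower <- max(init.lower, search.floor)  # effective random-start lower endpoint
  if (init.upper < lower)
    stop("scale.factor.init.upper is below the effective lower endpoint")
  list(first = first, random.interval = c(lower, init.upper))
}

# Illustrative values (assumptions): no explicit or stored floor is supplied.
floor.val <- resolve_floor()
starts <- effective_starts(init = 0.05, init.lower = 0.02,
                           init.upper = 2, search.floor = floor.val)
```

With these inputs both the first start and the lower endpoint of the random-start interval are lifted to the floor, and an upper endpoint below that floor raises an error instead of being widened.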
Categorical search-start controls such as dfac.init,
lbd.init, and hbd.init have separate semantics and are
not affected by scale.factor.search.lower.
Documentation guide: see npregbw for component
nonparametric regression bandwidth controls, np.kernels
for kernels, np.options for global options, and
plot for plotting options.
The partially linear bandwidth-selection argument surface is easiest
to read by decision group: linear xdat inputs,
nonparametric zdat inputs, and existing bandwidth inputs;
local-polynomial/NOMAD controls for the nonparametric component;
numerical search and feasibility controls; formula-interface
controls; and additional bandwidth, kernel, and support controls that
are passed to the component npregbw searches.
For S3 plotting help, use methods("plot") and query
class-specific help topics such as ?plot.npregression and
?plot.rbandwidth. You can inspect implementations with
getS3method("plot","npregression").
npplregbw implements a variety of methods for nonparametric
regression on multivariate (q-variate) explanatory data defined
over a set of possibly continuous and/or discrete (unordered, ordered)
data. The approach is based on Li and Racine (2003), who employ
‘generalized product kernels’ that admit a mix of continuous and
discrete data types.
Three classes of kernel estimators for the continuous data types are
available: fixed, adaptive nearest-neighbor, and generalized
nearest-neighbor. Adaptive nearest-neighbor bandwidths change with
each sample realization in the set, x_i, when estimating the
density at the point x. Generalized nearest-neighbor bandwidths change
with the point at which the density is estimated, x. Fixed bandwidths
are constant over the support of x.
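The difference between the three bandwidth types can be pictured in one dimension. The base-R sketch below is illustrative only: the sample, the neighbor count k, and the evaluation point x0 are assumptions, and np applies its own scaling conventions that are not reproduced here.

```r
# Illustrative only: a one-dimensional picture of the three bandwidth types.
set.seed(1)
z  <- runif(50)  # simulated sample (assumption)
k  <- 5          # number of nearest neighbors (assumption)
x0 <- 0.5        # evaluation point (assumption)

# Fixed: one constant bandwidth over the whole support.
h.fixed <- 0.1

# Generalized nearest-neighbor: distance from the evaluation point x0 to
# its k-th nearest sample point, so the bandwidth varies with x0.
h.generalized <- sort(abs(z - x0))[k]

# Adaptive nearest-neighbor: one bandwidth per sample point z_i, taken here
# as the distance from z_i to its k-th nearest neighbor among the others.
h.adaptive <- vapply(seq_along(z),
                     function(i) sort(abs(z[-i] - z[i]))[k],
                     numeric(1))
```

Here h.fixed is a single number, h.generalized changes whenever x0 changes, and h.adaptive holds one value for each observation.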
npplregbw may be invoked either with a formula-like symbolic
description of variables on which bandwidth selection is to be
performed or through a simpler interface whereby data is passed
directly to the function via the xdat, ydat, and zdat
parameters. Use of these two interfaces is mutually exclusive.
Data contained in the data frame zdat may be a mix of continuous
(default), unordered discrete (to be specified in the data frame
zdat using factor), and ordered discrete (to be
specified in the data frame zdat using
ordered). Data can be entered in an arbitrary order and
data types will be detected automatically by the routine (see
np for details).
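As a minimal sketch of how the three data types are declared, the following base-R code builds a small zdat-style data frame. The column names and simulated values are assumptions made for illustration.

```r
# Simulated columns (assumption): one continuous, one unordered, one ordered.
set.seed(1)
n <- 20
zdat <- data.frame(
  income = rnorm(n),                                      # continuous (default)
  region = factor(sample(c("north", "south"), n, TRUE)),  # unordered discrete
  grade  = ordered(sample(1:3, n, TRUE))                  # ordered discrete
)

# The column classes are what distinguish the three data types.
sapply(zdat, function(col) class(col)[1])
```

Because the types are carried by the columns themselves, the column order in the data frame does not matter.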
Data for which bandwidths are to be estimated may be specified
symbolically. A typical description has the form
dependent data ~ parametric explanatory data | nonparametric explanatory data,
where dependent data is a univariate response, and
parametric explanatory data and nonparametric explanatory data
are both series of variables specified by name, separated by
the separation character '+'. For example, y1 ~ x1 + x2 | z1
specifies that the bandwidth object for the partially linear model with
response y1, linear parametric regressors x1 and x2, and
nonparametric regressor z1 is to be estimated. See below for
further examples.
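The three parts of such a formula can be inspected with base R alone, since R parses the bar as an ordinary operator. This is only a sketch of the formula's structure; how np itself processes the formula internally is not shown and may differ.

```r
# Base R parses `|` with higher precedence than `~`, so the right-hand side
# of the formula is a single call to `|` with two arguments.
f   <- y1 ~ x1 + x2 | z1
rhs <- f[[3]]                         # the call  x1 + x2 | z1

response      <- all.vars(f[[2]])     # variable left of the tilde
linear.vars   <- all.vars(rhs[[2]])   # variables left of the bar
nonparam.vars <- all.vars(rhs[[3]])   # variables right of the bar
```

In this example response holds the dependent variable, linear.vars the parametric regressors, and nonparam.vars the nonparametric regressor.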
A variety of kernels may be specified by the user. Kernels implemented for continuous data types include the second, fourth, sixth, and eighth order Gaussian and Epanechnikov kernels, and the uniform kernel. Unordered discrete data types use a variation on Aitchison and Aitken's (1976) kernel, while ordered data types use a variation of the Wang and van Ryzin (1981) kernel.
When the nonparametric component is estimated with
regtype="lp" and degree.select != "manual",
npplregbw can jointly determine the zdat-side degree
vector and the associated bandwidth coordinates. With
search.engine="cell", the criterion is profiled over the degree
grid using cached coordinate-wise or exhaustive search together with
repeated fixed-degree bandwidth solves. With
search.engine="nomad" or "nomad+powell", the criterion
is optimized directly over the joint degree/bandwidth space using
crs::snomadr(); "nomad+powell" then performs one Powell
hot start and keeps the better of the direct NOMAD and polished
solutions. For the nonparametric regression component, this
polynomial-adaptive joint-search route follows Hall and Racine (2015).
Setting nomad=TRUE is a convenience preset for this automatic
LP route, not a generic optimizer alias. For partially linear
regression it expands any missing values to the equivalent long-form
call
npplregbw(...,
regtype = "lp",
search.engine = "nomad+powell",
degree.select = "coordinate",
bernstein.basis = TRUE,
degree.min = 0L,
degree.max = 10L,
degree.verify = FALSE,
bwtype = "fixed")
Compatible explicit tuning arguments are respected. Incompatible explicit settings fail fast so the shortcut never silently changes user-selected semantics.
If bwtype is set to fixed, an object containing bandwidths
(or scale factors if bwscaling = TRUE) is returned. If it is set to
generalized_nn or adaptive_nn, then instead the kth nearest
neighbors are returned for the continuous variables while the discrete
kernel bandwidths are returned for the discrete variables. Bandwidths
are stored in a list under the component name bw. Each element
is an rbandwidth object. The first
element of the list corresponds to the regression of Y on Z.
Each subsequent element is the bandwidth object corresponding to the
regression of the ith column of X on Z. See examples
for more information.
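One bandwidth object for Y on Z plus one for each column of X on Z is the layout needed by a double-residual estimator in the spirit of Robinson (1988), listed in the references. The base-R sketch below illustrates that idea under stated assumptions: it uses a hand-picked fixed bandwidth and a simple Gaussian local-constant smoother written for this example (the helper nw is hypothetical), and it is not np's implementation.

```r
# Sketch only: fixed-bandwidth local-constant smoother, not np's estimator.
set.seed(42)
n <- 300
z <- runif(n, -2, 2)
x <- z + rnorm(n)                         # linear regressor, correlated with z
y <- 2 * x + sin(z) + rnorm(n, sd = 0.5)  # simulated data with beta = 2

# Gaussian-kernel local-constant fit of v on z, evaluated at the sample points.
nw <- function(v, z, h) {
  w <- dnorm(outer(z, z, "-") / h)
  as.vector(w %*% v / rowSums(w))
}

h  <- 0.3                  # hand-picked bandwidth (assumption)
ey <- y - nw(y, z, h)      # residual from the regression of Y on Z
ex <- x - nw(x, z, h)      # residual from the regression of X on Z

# Least squares of the Y residuals on the X residuals estimates beta.
beta.hat <- sum(ex * ey) / sum(ex^2)
```

Each smoothing step has its own bandwidth in general, which is why the bw list holds a separate element per regression. The sketch reuses one value of h only to stay short.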
If you are using data of mixed types, then it is advisable to use the
data.frame function to construct your input data and not
cbind, since cbind will typically not work as
intended on mixed data types and will coerce the data to the same
type.
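The coercion can be seen directly in base R. The variable names and values below are made up for illustration.

```r
z1 <- factor(c("a", "b", "a"))  # unordered discrete variable
z2 <- c(0.5, 1.2, -0.3)         # continuous variable

m  <- cbind(z1, z2)        # matrix: the factor is replaced by its integer codes
df <- data.frame(z1, z2)   # data frame: each column keeps its own type

is.numeric(m)      # TRUE: the factor information is gone
is.factor(df$z1)   # TRUE: the factor survives in the data frame
```

After cbind, z1 is indistinguishable from a continuous variable taking the values 1 and 2, so automatic type detection has nothing to go on.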
Caution: multivariate data-driven bandwidth selection methods are, by
their nature, computationally intensive. Virtually all methods
require dropping the ith observation from the data set, computing an
object, repeating this for all observations in the sample, then
averaging each of these leave-one-out estimates for a given
value of the bandwidth vector, and only then repeating this a large
number of times in order to conduct multivariate numerical
minimization/maximization. Furthermore, due to the potential for local
minima/maxima, restarting this procedure a large number of times may
often be necessary. This can be frustrating for users possessing
large datasets. For exploratory purposes, you may wish to override the
default search tolerances, say, setting ftol=.01 and tol=.01 and
conduct multistarting (the default is to restart min(2, ncol(zdat))
times) as is done for a number of examples. Once the procedure
terminates, you can restart search with default tolerances using those
bandwidths obtained from the less rigorous search (i.e., set
bws=bw on subsequent calls to this routine where bw is
the initial bandwidth object). A version of this package using the
Rmpi wrapper is under development that allows one to deploy
this software in a clustered computing environment to facilitate
computation involving large datasets.
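The coarse-then-refine workflow described above might look like the following sketch. It assumes the np package is installed (the block is skipped otherwise), uses a small simulated data set, and sets nmulti = 1 in the first stage purely to keep the run short. That setting is an assumption of this example rather than a recommendation from the text.

```r
# Assumes the np package is installed; the block is skipped otherwise.
if (requireNamespace("np", quietly = TRUE)) {
  library(np)

  # Small simulated data set (assumption) so the search finishes quickly.
  set.seed(42)
  n <- 100
  X <- data.frame(x1 = rnorm(n))
  Z <- data.frame(z1 = rnorm(n))
  y <- 1 + X$x1 + sin(Z$z1) + rnorm(n)

  # Stage 1: loose tolerances and a single start for a quick first pass.
  bw.rough <- npplregbw(xdat = X, zdat = Z, ydat = y,
                        ftol = 0.01, tol = 0.01, nmulti = 1)

  # Stage 2: restart from the rough bandwidths with the default tolerances.
  bw <- npplregbw(xdat = X, zdat = Z, ydat = y, bws = bw.rough)
  summary(bw)
}
```

The second call starts from the bandwidths stored in bw.rough, so the stricter search usually has less ground to cover than a search from scratch.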
Tristen Hayfield tristen.hayfield@gmail.com, Jeffrey S. Racine racinej@mcmaster.ca
Aitchison, J. and C.G.G. Aitken (1976), “Multivariate binary discrimination by the kernel method,” Biometrika, 63, 413-420.
Gao, Q. and L. Liu and J.S. Racine (2015), “A partially linear kernel estimator for categorical data,” Econometric Reviews, 34 (6-10), 958-977.
Hall, P. and J.S. Racine (2015), “Infinite Order Cross-Validated Local Polynomial Regression,” Journal of Econometrics, 185, 510-525.
Li, Q. and J.S. Racine (2007), Nonparametric Econometrics: Theory and Practice, Princeton University Press.
Li, Q. and J.S. Racine (2004), “Cross-validated local linear nonparametric regression,” Statistica Sinica, 14, 485-512.
Pagan, A. and A. Ullah (1999), Nonparametric Econometrics, Cambridge University Press.
Racine, J.S. and Q. Li (2004), “Nonparametric estimation of regression functions with both categorical and continuous data,” Journal of Econometrics, 119, 99-130.
Robinson, P.M. (1988), “Root-n-consistent semiparametric regression,” Econometrica, 56, 931-954.
Wang, M.C. and J. van Ryzin (1981), “A class of smooth estimators for discrete distributions,” Biometrika, 68, 301-309.
np.kernels, np.options, plot
npregbw, npreg
## Not run:
# EXAMPLE 1 (INTERFACE=FORMULA): For this example, we simulate an
# example for a partially linear model and perform bandwidth selection
set.seed(42)
n <- 250
x1 <- rnorm(n)
x2 <- rbinom(n, 1, .5)
z1 <- rbinom(n, 1, .5)
z2 <- rnorm(n)
y <- 1 + x1 + x2 + z1 + sin(z2) + rnorm(n)
X <- data.frame(x1, factor(x2))
Z <- data.frame(factor(z1), z2)
# Compute data-driven bandwidths... this may take a minute or two
# depending on the speed of your computer...
bw <- npplregbw(formula=y~x1+factor(x2)|factor(z1)+z2)
summary(bw)
# Note - the default is to use the local constant estimator. If you wish
# to use instead a local linear estimator, this is accomplished via
# npplregbw(formula=y~x1+factor(x2)|factor(z1)+z2, regtype="ll")
# Note - see the example for npudensbw() for multiple illustrations
# of how to change the kernel function, kernel order, and so forth.
# You may want to manually specify your bandwidths
bw.mat <- matrix(data = c(0.19, 0.34,  # y on Z
                          0.00, 0.74,  # X[,1] on Z
                          0.29, 0.23), # X[,2] on Z
                 ncol = ncol(Z), byrow = TRUE)
bw <- npplregbw(formula=y~x1+factor(x2)|factor(z1)+z2,
bws=bw.mat, bandwidth.compute=FALSE)
summary(bw)
# Sleep for 5 seconds so that we can examine the output...
if (interactive()) Sys.sleep(5)
# You may want to tweak some of the bandwidths
bw$bw[[1]] # y on Z, alternatively bw$bw$yzbw
bw$bw[[1]]$bw <- c(0.17, 0.30)
bw$bw[[2]] # X[,1] on Z
bw$bw[[2]]$bw[1] <- 0.00054
summary(bw)
# EXAMPLE 1 (INTERFACE=DATA FRAME): For this example, we simulate an
# example for a partially linear model and perform bandwidth selection
set.seed(42)
n <- 250
x1 <- rnorm(n)
x2 <- rbinom(n, 1, .5)
z1 <- rbinom(n, 1, .5)
z2 <- rnorm(n)
y <- 1 + x1 + x2 + z1 + sin(z2) + rnorm(n)
X <- data.frame(x1, factor(x2))
Z <- data.frame(factor(z1), z2)
# Compute data-driven bandwidths... this may take a minute or two
# depending on the speed of your computer...
bw <- npplregbw(xdat=X, zdat=Z, ydat=y)
summary(bw)
# Note - the default is to use the local constant estimator. If you wish
# to use instead a local linear estimator, this is accomplished via
# npplregbw(xdat=X, zdat=Z, ydat=y, regtype="ll")
# Note - see the example for npudensbw() for multiple illustrations
# of how to change the kernel function, kernel order, and so forth.
# You may want to manually specify your bandwidths
bw.mat <- matrix(data = c(0.19, 0.34,  # y on Z
                          0.00, 0.74,  # X[,1] on Z
                          0.29, 0.23), # X[,2] on Z
                 ncol = ncol(Z), byrow = TRUE)
bw <- npplregbw(xdat=X, zdat=Z, ydat=y,
bws=bw.mat, bandwidth.compute=FALSE)
summary(bw)
# Sleep for 5 seconds so that we can examine the output...
if (interactive()) Sys.sleep(5)
# You may want to tweak some of the bandwidths
bw$bw[[1]] # y on Z, alternatively bw$bw$yzbw
bw$bw[[1]]$bw <- c(0.17, 0.30)
bw$bw[[2]] # X[,1] on Z
bw$bw[[2]]$bw[1] <- 0.00054
summary(bw)
## End(Not run)