npreg {np}    R Documentation
npreg computes a kernel regression estimate of a one
(1) dimensional dependent variable on p-variate explanatory
data, given a set of evaluation points, training points (consisting of
explanatory data and dependent data), and a bandwidth specification,
using the methods of Racine and Li (2004) and Li and Racine (2004). A
bandwidth specification can be an rbandwidth object, or a
bandwidth vector together with a bandwidth type and kernel type.
npreg(bws, ...)
## S3 method for class 'formula'
npreg(bws,
data = NULL,
newdata = NULL,
y.eval = FALSE,
...)
## Default S3 method:
npreg(bws,
txdat,
tydat,
nomad = FALSE,
...)
## S3 method for class 'rbandwidth'
npreg(bws,
txdat = stop("training data 'txdat' missing"),
tydat = stop("training data 'tydat' missing"),
exdat,
eydat,
gradient.order = 1L,
gradients = FALSE,
residuals = FALSE,
...)
These arguments identify the bandwidth specification, formula/data interface, and training data.
bws
a bandwidth specification. This can be set as an rbandwidth
object returned from an invocation of npregbw, or as a vector of
bandwidths, with each element i corresponding to the bandwidth for
column i in txdat. If specified as a vector, additional arguments
will need to be supplied as necessary to specify the bandwidth type,
kernel types, and so on.

data
an optional data frame, list or environment (or object
coercible to a data frame by as.data.frame) containing the variables
in the model. If not found in data, the variables are taken from
environment(formula), typically the environment from which the
function is called.

txdat
a p-variate data frame of explanatory data (training data) used to
calculate the regression estimators. Defaults to the training data
used to compute the bandwidth object.

tydat
a one (1) dimensional numeric or integer vector of dependent data, each
element i corresponding to each observation (row) i of txdat.
Defaults to the training data used to compute the bandwidth object.
nomad
logical shortcut passed through to npregbw when bandwidths are
computed inside npreg; nomad=TRUE applies the recommended
automatic local-polynomial NOMAD preset to that call. Defaults to
FALSE.
These arguments control where the regression is evaluated and which fitted quantities are returned.
exdat
a p-variate data frame of points on which the regression will be
estimated (evaluation data). By default, evaluation takes place on
the data provided by txdat.

eydat
a one (1) dimensional numeric or integer vector of the true values
of the dependent variable. Optional, and used only to calculate the
true errors.

gradient.order
for gradients = TRUE, an integer selecting the order of the
gradients to be computed; defaults to 1L (first derivatives).

gradients
a logical value indicating that you want gradients computed and
returned in the resulting npregression object. Defaults to
FALSE.

newdata
an optional data frame in which to look for evaluation data. If
omitted, the training data are used.

residuals
a logical value indicating that you want residuals computed and
returned in the resulting npregression object. Defaults to
FALSE.

y.eval
if newdata contains dependent data and y.eval = TRUE, np
will compute goodness of fit statistics on these data and return
them in the resulting npregression object. Defaults to
FALSE.
Further arguments are passed to npregbw when bandwidths are computed internally, or used to interpret a numeric bws vector.
...
additional arguments supplied to specify the regression type,
bandwidth type, kernel types, selection methods, and so on; see
npregbw for details.
Documentation guide: see npregbw for bandwidth
selection and search controls, np.kernels for kernels,
np.options for global options, and plot,
plot.np for plotting options.
When bws is omitted, the formula and default methods call
npregbw first and pass bandwidth-selection arguments
from ... to that call. When bws is already an
rbandwidth object, npreg estimates with the stored
bandwidth metadata in that object.
Argument groups for bandwidth selection are documented on
npregbw. The most common workflow is to choose data and
bandwidth inputs first, then bandwidth criterion and representation,
then kernel/support controls, and finally local-polynomial/NOMAD
controls when using polynomial-adaptive fits.
For S3 plotting help, use methods("plot") and query
class-specific help topics such as ?plot.npregression and
?plot.rbandwidth. You can inspect implementations with
getS3method("plot","npregression").
Typical usages are (see below for a complete list of options and also the examples at the end of this help file)
Usage 1: first compute the bandwidth object via npregbw and then
compute the conditional mean:
bw <- npregbw(y~x)
ghat <- npreg(bw)
Usage 2: alternatively, compute the bandwidth object indirectly:
ghat <- npreg(y~x)
Usage 3: modify the default kernel and order:
ghat <- npreg(y~x, ckertype="epanechnikov", ckerorder=4)
Usage 4: use the data frame interface rather than the formula
interface:
ghat <- npreg(tydat=y, txdat=x, ckertype="epanechnikov", ckerorder=4)
npreg implements a variety of methods for regression on
multivariate (p-variate) data, the types of which are possibly
continuous and/or discrete (unordered, ordered). The approach is
based on Li and Racine (2003) who employ ‘generalized product kernels’
that admit a mix of continuous and discrete data types.
Three classes of kernel estimators for the continuous data types are
available: fixed, adaptive nearest-neighbor, and generalized
nearest-neighbor. Adaptive nearest-neighbor bandwidths change with
each sample realization in the set, x_i, when estimating the
density at the point x. Generalized nearest-neighbor bandwidths change
with the point at which the density is estimated, x. Fixed bandwidths
are constant over the support of x.
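The three bandwidth classes can be sketched in a few lines. The following is an illustrative Python sketch using simple k-nearest-neighbour distances (the function names are mine, and this is not the np implementation):

```python
def generalized_nn_bw(x_train, x_eval, k):
    # Generalized nearest-neighbor: the bandwidth varies with the
    # evaluation point x, here the distance from x to its k-th
    # nearest training observation.
    return sorted(abs(xi - x_eval) for xi in x_train)[k - 1]

def adaptive_nn_bw(x_train, i, k):
    # Adaptive nearest-neighbor: one bandwidth per training
    # observation x_i, here the distance from x_i to its k-th
    # nearest neighbour among the training points (the sorted
    # distance at index 0 is x_i itself).
    return sorted(abs(xj - x_train[i]) for xj in x_train)[k]

x = [0.0, 1.0, 2.0, 5.0]
# A fixed bandwidth, by contrast, is simply one constant h for every x.
print(generalized_nn_bw(x, 0.4, k=2))   # 0.6
print(adaptive_nn_bw(x, 0, k=2))        # 2.0
```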
Data contained in the data frame txdat may be a mix of
continuous (default), unordered discrete (to be specified in the data
frame txdat using factor), and ordered discrete
(to be specified in the data frame txdat using
ordered). Data can be entered in an arbitrary order and
data types will be detected automatically by the routine (see
np for details).
A variety of kernels may be specified by the user. Kernels implemented for continuous data types include the second, fourth, sixth, and eighth order Gaussian and Epanechnikov kernels, and the uniform kernel. Unordered discrete data types use a variation on Aitchison and Aitken's (1976) kernel, while ordered data types use a variation of the Wang and van Ryzin (1981) kernel.
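To make the generalized-product-kernel idea concrete, here is a minimal local-constant (Nadaraya-Watson) sketch in Python combining one continuous, one unordered, and one ordered predictor. It uses common textbook forms of the second-order Gaussian, Aitchison and Aitken, and Wang and van Ryzin kernels; it is a simplification for illustration, not what np computes internally:

```python
import math

def gauss(u):
    # Second-order Gaussian kernel.
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def aitchison_aitken(x, xi, lam, c):
    # Unordered-discrete kernel: mass 1 - lam on a category match,
    # lam/(c-1) spread over the remaining c - 1 categories.
    return 1.0 - lam if x == xi else lam / (c - 1)

def wang_van_ryzin(x, xi, lam):
    # Ordered-discrete kernel: geometric decay in |x - xi|.
    return 1.0 - lam if x == xi else 0.5 * (1.0 - lam) * lam ** abs(x - xi)

def nw_mixed(x_c, x_u, x_o, train, h, lam_u, lam_o, c_u):
    # Local-constant fit with a product kernel over a continuous,
    # an unordered, and an ordered predictor.
    num = den = 0.0
    for (xc_i, xu_i, xo_i, y_i) in train:
        k = (gauss((x_c - xc_i) / h) / h
             * aitchison_aitken(x_u, xu_i, lam_u, c_u)
             * wang_van_ryzin(x_o, xo_i, lam_o))
        num += k * y_i
        den += k
    return num / den

print(nw_mixed(0.0, "a", 1,
               [(0.0, "a", 1, 2.0), (0.0, "b", 1, 4.0)],
               h=1.0, lam_u=0.0, lam_o=0.0, c_u=2))  # 2.0: zero lambdas weight exact matches only
```

With discrete bandwidths (lambdas) of zero, only exact category matches receive weight; as they grow, the discrete cells are smoothed toward one another.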
When bandwidths are obtained with regtype="lp", C-level
npreg supports heterogeneous continuous polynomial degrees via
degree. The basis selector currently supports
basis="glp", "additive", and "tensor".
For continuous predictors with degree vector
d, additive basis size is
1+\sum_j d_j, tensor basis size is
\prod_j (d_j+1), and GLP uses admissible
multi-indices \alpha with
\alpha_j \le d_j and
0<\sum_j \alpha_j \le \max_j d_j plus
an intercept. The optional flag bernstein.basis
controls basis construction: FALSE (default) uses raw
local-polynomial powers, while TRUE uses a Bernstein/B-spline basis. The
homogeneous degree-0 and degree-1 cases remain
equivalent to lc and ll, respectively. Current GLP
derivative output is first-order for continuous predictors; higher
order and cross-partial extraction are reserved for future extension.
In mixed-data GLP settings, derivative entries for unordered/ordered
predictors are returned as NA.
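The three basis sizes above can be checked directly by enumerating multi-indices. A small illustrative sketch (not part of np, and the helper names are mine):

```python
from itertools import product

def additive_size(d):
    # Additive basis: intercept plus d_j pure powers per predictor.
    return 1 + sum(d)

def tensor_size(d):
    # Tensor basis: full cross product of per-predictor powers.
    out = 1
    for dj in d:
        out *= dj + 1
    return out

def glp_size(d):
    # GLP basis: multi-indices alpha with alpha_j <= d_j and
    # 0 < sum_j alpha_j <= max_j d_j, plus an intercept.
    dmax = max(d)
    count = sum(1 for a in product(*[range(dj + 1) for dj in d])
                if 0 < sum(a) <= dmax)
    return 1 + count

d = (2, 3)
print(additive_size(d), glp_size(d), tensor_size(d))  # 6 9 12
```

As expected, GLP sits between the additive and tensor extremes: additive <= GLP <= tensor for any degree vector.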
When npregbw(..., regtype="lp") is used with
degree.select="manual", the degree vector remains fixed user
input. When degree.select != "manual", npregbw can
jointly select polynomial degree and bandwidth using either the
cached cell-search backend or the direct
search.engine="nomad"/"nomad+powell" route described in
npregbw; the latter follows Hall and Racine (2015). For
practitioners who want that recommended route without spelling out all
LP tuning arguments, npreg(..., nomad=TRUE) and
npregbw(..., nomad=TRUE) expand missing settings to the same
documented automatic-LP NOMAD preset. Explicit incompatible settings
fail fast rather than being silently rewritten. The direct NOMAD
backend is provided by the suggested package crs, so install
crs before using search.engine="nomad",
"nomad+powell", or nomad=TRUE. For
bernstein.basis=TRUE, evaluation points for continuous predictors
must lie within training support; use bernstein.basis=FALSE for
extrapolation. For regtype="ll" and regtype="lp", the
training continuous design is checked for rank deficiency and extreme
condition number before estimation proceeds.
The use of compactly supported kernels or the occurrence of small bandwidths can lead to numerical problems for the local linear estimator when computing the locally weighted least squares solution. To overcome this problem we rely on a form of ‘ridging’ proposed by Cheng, Hall, and Titterington (1997), modified so that we solve the problem pointwise rather than globally (i.e. only when it is needed).
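A minimal sketch of pointwise ridging for a one-predictor local linear fit, assuming Gaussian weights and an illustrative condition-number threshold (this is a simplified stand-in, not the np implementation):

```python
import numpy as np

def local_linear_fit(x0, x, y, h, ridge_tol=1e10, eps=1e-6):
    # Local linear estimate at x0: weighted least squares on the
    # basis (1, x - x0) with kernel weights. If the weighted moment
    # matrix is ill-conditioned (tiny bandwidth, compact kernel),
    # add a small ridge to the diagonal at this point only.
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    X = np.column_stack([np.ones_like(x), x - x0])
    A = X.T @ (w[:, None] * X)
    b = X.T @ (w * y)
    while np.linalg.cond(A) > ridge_tol:
        A = A + eps * np.eye(2)   # pointwise ridging
    beta = np.linalg.solve(A, b)
    return beta[0]                # intercept = estimate of g(x0)

x = np.linspace(0.0, 2.0, 21)
y = 2.0 * x + 1.0   # a linear signal: local linear reproduces it exactly
print(local_linear_fit(1.0, x, y, h=0.5))   # 3.0 (up to rounding)
```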
npreg returns an npregression object.
The generic
functions fitted, residuals,
se, predict, and
gradients extract (or generate) estimated values,
residuals, asymptotic standard
errors on estimates, predictions, and gradients, respectively, from
the returned object. Furthermore, the functions summary
and plot support objects of this type. The returned object
has the following components:
eval
evaluation points

mean
estimates of the regression function (conditional mean) at the evaluation points

merr
standard errors of the regression function estimates

grad
estimates of the gradients at each evaluation point

gerr
standard errors of the gradient estimates

resid
if residuals = TRUE, in-sample or out-of-sample residuals where appropriate (or possible)

R2
coefficient of determination (Doksum and Samarov (1995))

MSE
mean squared error

MAE
mean absolute error

MAPE
mean absolute percentage error

CORR
absolute value of Pearson's correlation coefficient

SIGN
fraction of observations where fitted and observed values agree in sign
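These goodness-of-fit components can be computed directly from fitted and observed values. An illustrative sketch using conventional definitions (R2 here is taken as the squared Pearson correlation between observed and fitted values, one common form of the Doksum and Samarov measure; the helper name is mine):

```python
import math

def fit_stats(y, yhat):
    # Summary measures mirroring the npregression components.
    # Note MAPE is undefined when y contains zeros.
    n = len(y)
    my = sum(y) / n
    mf = sum(yhat) / n
    sxy = sum((a - my) * (b - mf) for a, b in zip(y, yhat))
    sxx = sum((a - my) ** 2 for a in y)
    syy = sum((b - mf) ** 2 for b in yhat)
    corr = sxy / math.sqrt(sxx * syy)
    return {
        "R2":   corr ** 2,
        "MSE":  sum((a - b) ** 2 for a, b in zip(y, yhat)) / n,
        "MAE":  sum(abs(a - b) for a, b in zip(y, yhat)) / n,
        "MAPE": sum(abs((a - b) / a) for a, b in zip(y, yhat)) / n,
        "CORR": abs(corr),
        "SIGN": sum((a > 0) == (b > 0) for a, b in zip(y, yhat)) / n,
    }

stats = fit_stats([1.0, -2.0, 3.0], [0.9, -2.2, 3.3])
print(round(stats["SIGN"], 3))   # 1.0: every fitted value matches the observed sign
```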
If you are using data of mixed types, then it is advisable to use the
data.frame function to construct your input data and not
cbind, since cbind will typically not work as
intended on mixed data types and will coerce the data to the same
type.
Tristen Hayfield tristen.hayfield@gmail.com, Jeffrey S. Racine racinej@mcmaster.ca
Aitchison, J. and C.G.G. Aitken (1976), “Multivariate binary discrimination by the kernel method,” Biometrika, 63, 413-420.
Cheng, M.-Y. and P. Hall and D.M. Titterington (1997), “On the shrinkage of local linear curve estimators,” Statistics and Computing, 7, 11-17.
Doksum, K. and A. Samarov (1995), “Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression,” The Annals of Statistics, 23, 1443-1473.
Fan, J. and I. Gijbels (1996), Local Polynomial Modelling and Its Applications, Chapman and Hall.
Hall, P. and Q. Li and J.S. Racine (2007), “Nonparametric estimation of regression functions in the presence of irrelevant regressors,” The Review of Economics and Statistics, 89, 784-789.
Hall, P. and J.S. Racine (2015), “Infinite Order Cross-Validated Local Polynomial Regression,” Journal of Econometrics, 185, 510-525.
Li, Q. and J.S. Racine (2007), Nonparametric Econometrics: Theory and Practice, Princeton University Press.
Li, Q. and J.S. Racine (2004), “Cross-validated local linear nonparametric regression,” Statistica Sinica, 14, 485-512.
Pagan, A. and A. Ullah (1999), Nonparametric Econometrics, Cambridge University Press.
Racine, J.S. and Q. Li (2004), “Nonparametric estimation of regression functions with both categorical and continuous data,” Journal of Econometrics, 119, 99-130.
Wang, M.C. and J. van Ryzin (1981), “A class of smooth estimators for discrete distributions,” Biometrika, 68, 301-309.
np.kernels, np.options, plot, plot.np
loess
## Not run:
# EXAMPLE 1 (INTERFACE=FORMULA): For this example, we compute a
# bivariate nonparametric regression estimate for Giovanni Baiocchi's
# Italian income panel (see Italy for details)
data("Italy")
attach(Italy)
# First, compute the least-squares cross-validated bandwidths for the
# local constant estimator (default).
bw <- npregbw(formula=gdp~ordered(year))
# Now take these bandwidths and fit the model and gradients
model <- npreg(bws = bw, gradients = TRUE)
summary(model)
# Use plot() to visualize the regression function, add bootstrap
# error bars, and overlay the data on the same plot.
# Note - this may take a minute or two depending on the speed of your
# computer due to bootstrapping being conducted (<ctrl>-C will
# interrupt). Note - nothing will appear in the graphics window until
# all computations are completed (if you use
# plot.errors.method="asymptotic" the figure will instantly appear).
if (interactive()) {
  plot(bw, plot.errors.method="bootstrap")
  points(ordered(year), gdp, cex=.2, col="red")
}
detach(Italy)
# Sleep for 5 seconds so that we can examine the output...
if (interactive()) Sys.sleep(5)
# EXAMPLE 1 (INTERFACE=DATA FRAME): For this example, we compute a
# bivariate nonparametric regression estimate for Giovanni Baiocchi's
# Italian income panel (see Italy for details)
data("Italy")
attach(Italy)
# First, compute the least-squares cross-validated bandwidths for the
# local constant estimator (default).
bw <- npregbw(xdat=ordered(year), ydat=gdp)
# Now take these bandwidths and fit the model and gradients
model <- npreg(bws = bw, gradients = TRUE)
summary(model)
# Use plot() to visualize the regression function, add bootstrap
# error bars, and overlay the data on the same plot.
# Note - this may take a minute or two depending on the speed of your
# computer due to bootstrapping being conducted (<ctrl>-C will
# interrupt). Note - nothing will appear in the graphics window until
# all computations are completed (if you use
# plot.errors.method="asymptotic" the figure will instantly appear).
if (interactive()) {
  plot(bw, plot.errors.method="bootstrap")
  points(ordered(year), gdp, cex=.2, col="red")
}
detach(Italy)
# Sleep for 5 seconds so that we can examine the output...
if (interactive()) Sys.sleep(5)
# EXAMPLE 2 (INTERFACE=FORMULA): For this example, we compute a local
# linear fit using the AIC_c bandwidth selection criterion. We then plot
# the estimator and its gradient using asymptotic standard errors.
data("cps71", package = "np")
bw <- npregbw(logwage~age, regtype="ll", bwmethod="cv.aic", data = cps71)
# Next, plot the regression function...
if (interactive()) {
  plot(bw, plot.errors.method="asymptotic")
  with(cps71, points(age, logwage, cex=.2, col="red"))
}
# Sleep for 5 seconds so that we can examine the output...
if (interactive()) Sys.sleep(5)
# Next, plot the derivative...
if (interactive()) plot(bw, plot.errors.method="asymptotic", gradient=TRUE)
# Sleep for 5 seconds so that we can examine the output...
if (interactive()) Sys.sleep(5)
# EXAMPLE 2 (INTERFACE=DATA FRAME): For this example, we compute a local
# linear fit using the AIC_c bandwidth selection criterion. We then plot
# the estimator and its gradient using asymptotic standard errors.
data("cps71", package = "np")
bw <- npregbw(xdat=cps71$age, ydat=cps71$logwage, regtype="ll", bwmethod="cv.aic")
# Next, plot the regression function...
if (interactive()) {
  plot(bw, plot.errors.method="asymptotic")
  with(cps71, points(age, logwage, cex=.2, col="red"))
}
# Sleep for 5 seconds so that we can examine the output...
if (interactive()) Sys.sleep(5)
# Next, plot the derivative...
if (interactive()) plot(bw, plot.errors.method="asymptotic", gradient=TRUE)
# Sleep for 5 seconds so that we can examine the output...
if (interactive()) Sys.sleep(5)
# EXAMPLE 3 (INTERFACE=FORMULA): For this example, we replicate the
# nonparametric regression in Maasoumi, Racine, and Stengos
# (2007) (see oecdpanel for details). Note that X is multivariate
# containing a mix of unordered, ordered, and continuous data types. Note
# - this may take a few minutes depending on the speed of your computer.
data("oecdpanel")
attach(oecdpanel)
bw <- npregbw(formula=growth~
factor(oecd)+
factor(year)+
initgdp+
popgro+
inv+
humancap)
if (interactive()) plot(bw, plot.errors.method="asymptotic")
detach(oecdpanel)
# EXAMPLE 3 (INTERFACE=DATA FRAME): For this example, we replicate the
# nonparametric regression in Maasoumi, Racine, and Stengos
# (2007) (see oecdpanel for details). Note that X is multivariate
# containing a mix of unordered, ordered, and continuous data types. Note
# - this may take a few minutes depending on the speed of your computer.
data("oecdpanel")
attach(oecdpanel)
y <- growth
X <- data.frame(factor(oecd), factor(year), initgdp, popgro, inv, humancap)
bw <- npregbw(xdat=X, ydat=y)
if (interactive()) plot(bw, plot.errors.method="asymptotic")
detach(oecdpanel)
# EXAMPLE 4 (INTERFACE=FORMULA): Experimental data - the effect of
# vitamin C on tooth growth in guinea pigs
#
# Description:
#
# The response is the length of odontoblasts (teeth) in each of 10
# guinea pigs at each of three dose levels of Vitamin C (0.5, 1, and
# 2 mg) with each of two delivery methods (orange juice or ascorbic
# acid).
#
# Usage:
#
# ToothGrowth
#
# Format:
#
# A data frame with 60 observations on 3 variables.
#
# [,1] len numeric Tooth length
# [,2] supp factor Supplement type (VC or OJ).
# [,3] dose numeric Dose in milligrams.
library("datasets")
attach(ToothGrowth)
# Note - in this example, there are six cells.
bw <- npregbw(formula=len~factor(supp)+ordered(dose))
# Now plot the partial regression surfaces with bootstrapped
# nonparametric confidence bounds
if (interactive()) plot(bw, plot.errors.method="bootstrap", plot.errors.type="simultaneous")
detach(ToothGrowth)
# EXAMPLE 4 (INTERFACE=DATA FRAME): Experimental data - the effect of
# vitamin C on tooth growth in guinea pigs
#
# Description:
#
# The response is the length of odontoblasts (teeth) in each of 10
# guinea pigs at each of three dose levels of Vitamin C (0.5, 1, and
# 2 mg) with each of two delivery methods (orange juice or ascorbic
# acid).
#
# Usage:
#
# ToothGrowth
#
# Format:
#
# A data frame with 60 observations on 3 variables.
#
# [,1] len numeric Tooth length
# [,2] supp factor Supplement type (VC or OJ).
# [,3] dose numeric Dose in milligrams.
library("datasets")
attach(ToothGrowth)
# Note - in this example, there are six cells.
y <- len
X <- data.frame(supp=factor(supp), dose=ordered(dose))
bw <- npregbw(X, y)
# Now plot the partial regression surfaces with bootstrapped
# nonparametric confidence bounds
if (interactive()) plot(bw, plot.errors.method="bootstrap", plot.errors.type="simultaneous")
detach(ToothGrowth)
# EXAMPLE 5 (INTERFACE=FORMULA): a pretty 2-d smoothing example adapted
# from the R mgcv library which was written by Simon N. Wood.
set.seed(12345)
# This function generates a smooth nonlinear DGP
dgp.func <- function(x, z, sx=0.3, sz=0.4)
{ (pi**sx*sz)*(1.2*exp(-(x-0.2)^2/sx^2-(z-0.3)^2/sz^2)+
0.8*exp(-(x-0.7)^2/sx^2-(z-0.8)^2/sz^2))
}
# Generate 500 observations, compute the true DGP (i.e., no noise),
# then a noisy sample
n <- 500
x <- runif(n)
z <- runif(n)
xs <- seq(0, 1, length=30)
zs <- seq(0, 1, length=30)
X.eval <- data.frame(x=rep(xs, 30), z=rep(zs, rep(30, 30)))
dgp <- matrix(dgp.func(X.eval$x, X.eval$z), 30, 30)
y <- dgp.func(x, z)+rnorm(n)*0.1
# Prepare the screen for output... first, plot the true DGP
split.screen(c(2, 1))
screen(1)
persp(xs, zs, dgp, xlab="x1", ylab="x2", zlab="y", main="True DGP")
# Next, compute a local linear fit and plot that
bw <- npregbw(formula=y~x+z, regtype="ll", bwmethod="cv.aic")
fit <- fitted(npreg(bws=bw, newdata=X.eval))
fit.mat <- matrix(fit, 30, 30)
screen(2)
persp(xs, zs, fit.mat, xlab="x1", ylab="x2", zlab="y",
main="Local linear estimate")
# EXAMPLE 5 (INTERFACE=DATA FRAME): a pretty 2-d smoothing example
# adapted from the R mgcv library which was written by Simon N. Wood.
set.seed(12345)
# This function generates a smooth nonlinear DGP
dgp.func <- function(x, z, sx=0.3, sz=0.4)
{ (pi**sx*sz)*(1.2*exp(-(x-0.2)^2/sx^2-(z-0.3)^2/sz^2)+
0.8*exp(-(x-0.7)^2/sx^2-(z-0.8)^2/sz^2))
}
# Generate 500 observations, compute the true DGP (i.e., no noise),
# then a noisy sample
n <- 500
x <- runif(n)
z <- runif(n)
xs <- seq(0, 1, length=30)
zs <- seq(0, 1, length=30)
X <- data.frame(x, z)
X.eval <- data.frame(x=rep(xs, 30), z=rep(zs, rep(30, 30)))
dgp <- matrix(dgp.func(X.eval$x, X.eval$z), 30, 30)
y <- dgp.func(x, z)+rnorm(n)*0.1
# Prepare the screen for output... first, plot the true DGP
split.screen(c(2, 1))
screen(1)
persp(xs, zs, dgp, xlab="x1", ylab="x2", zlab="y", main="True DGP")
# Next, compute a local linear fit and plot that
bw <- npregbw(xdat=X, ydat=y, regtype="ll", bwmethod="cv.aic")
fit <- fitted(npreg(exdat=X.eval, bws=bw))
fit.mat <- matrix(fit, 30, 30)
screen(2)
persp(xs, zs, fit.mat, xlab="x1", ylab="x2", zlab="y",
main="Local linear estimate")
## End(Not run)