lwr: Locally Weighted Regression

Description Usage Arguments Details Value References See Also Examples

View source: R/lwr.R

Description

Estimates a model of the form y = f(x) using locally weighted regression. x can include either one or two variables. Returns estimated values, derivatives, and standard errors for both f(x) and df(x)/dx.

Usage

1
2
3
 
lwr(form,window=.25,bandwidth=0,kern="tcub",distance="Mahal",
  target=NULL,data=NULL)

Arguments

form

Model formula

window

Window size. Default: 0.25.

bandwidth

Bandwidth. Default: not used.

kern

Kernel weighting function. Default is the tri-cube. Options include "rect", "tria", "epan", "bisq", "tcub", "trwt", and "gauss".

distance

Options: "Euclid", "Mahal", or "Latlong" for Euclidean, Mahalanobis, or "great-circle" geographic distance. May be abbreviated to the first letter but must be capitalized. Note: lwr looks for the first two letters to determine which variable is latitude and which is longitude, so the data set must be attached first or specified using the data option; options like data$latitude will not work. Default: Mahal.

target

If target = NULL, uses the maketarget command to form targets using the values specified for window, bandwidth, and kern. If target="alldata", each observation is used as a target value for x. A set of target values can be supplied directly.

data

A data frame containing the data. Default: use data in the current working directory.

Details

The estimated value of y at a target value x0 is the predicted value from a weighted least squares regression of y on x-x0 with weights given by K(ψ/h), where ψ is a measure of the distance between x and x0 and h is the bandwidth or window.

When x includes a single variable, ψ = x-x0. When x includes two variables, the method for specifying ψ depends on the distance option. If distance="Mahal" or distance="Euclid", the ith row of the matrix X = (x1, x2) is transformed such that x_i = sqrt(x_i * V * t(x_i)). Under the "Mahal" option, V is the inverse of cov(X). Under the "Euclid" option, V is the inverse of diag(cov(X)). By reducing x from two dimensions to one, this transformation leads again to the simple kernel weighting function K((x- x0 )/(sd(x)*h)).

The great circle formula is used to define K when distance = "Latlong"; in this case, the explanatory variable list must be specified as ~latitude+longitude (or ~lo+la or ~lat+long, etc), with the longitude and latitude variables expressed in degrees (e.g., -87.627800 and 41.881998 for one observation of longitude and latitude, respectively). The order in which latitude and longitude are listed does not matter and the function only looks for the first two letters to determine which variable is latitude and which is the longitude. It is important to note that the great circle distance measure is left in miles rather than being standardized. Thus, the window option should be specified when distance = "Latlong" or the bandwidth should be adjusted to account for the scale. The kernel weighting function becomes K(distance/h) under the "Latlong" option.

h is specified by the bandwidth or window options. The intercept, α, provides an estimate of y at x0 and β provides an estimate of the slope, dy/dx at x0. When target="alldata", each data point in turn is used as a target point, x0.

Since each estimate is a linear function of all n values for y, the full set of estimates takes the form yhat = LY, where L is an nxn matrix. Loader (1999) suggests two measures of the number of degrees of freedom used in estimation, df1 = tr(L) and df2 = tr(L'L), both of which are stored by lwr. The diagonal elements of tr(L) are stored in the array infl. Again following Loader (1999), the degrees of freedom correction used to estimate the error variance, sig2, is df = 2*df1 - df2. Let e represent the vector of residuals, e = y-yhat. The estimated variance is sig2 = sum(e^2)/(n-df). The covariance matrix is

σ^2(∑ Z_i K(ψ_i/h) Z_i')^{-1}(∑ Z_i (K(ψ_i/h))^2 Z_i' )(∑ Z_i K(ψ_i/h) Z_i')^{-1}.

where Z = (1 x-x0).

Estimation can be very slow when target = "alldata". The maketarget command can be used to identify target points. The smooth12 command is then used to interpolate the coefficient estimates, the standard errors, and the values used to form df1 and df2.

h can be specified to be either a fixed bandwidth or a window size set to a percentage of the sample size. Optionally, the lwrgrid command can be used to specify a vector of values for h with lwr picking the one that minimizes a criterion function. In general, the window option will be preferable because it provides more accurate estimates in regions where x is relatively sparse.

Available kernel weighting functions include the following:

Kernel Call abbreviation Kernel function K(z)
Rectangular ``rect'' 1/2 * I(|z|<1)
Triangular ``tria'' (1-|z|) * I(|z|<1)
Epanechnikov ``epan'' 3/4 * (1-z^2)*I(|z| < 1)
Bi-Square ``bisq'' 15/16 * (1-z^2)^2 * I(|z| < 1)
Tri-Cube ``tcub'' 70/81 * (1-|z|^3)^3 * I(|z| < 1)
Tri-Weight ``trwt'' 35/32 * (1-z^2)^3 * I(|z| < 1)
Gaussian ``gauss'' 2pi^{-.5} exp(-z^2/2)

Value

target

The target points for the original estimation of the function.

ytarget

The predicted values of y at the original target points.

dtarget1

The estimated derivatives dy/dx1 at the target points.

dtarget2

The estimated derivatives dy/dx2 at the target points. All zeros if the model has only one explanatory variable.

ytarget.se

Standard errors for the predicted values of y at the target points.

dtarget1.se

Standard errors for the derivatives dy/dx1 at the target points.

dtarget2.se

Standard errors for the derivatives dy/dx2 at the target points. All zeros if the model has only one explanatory variable.

yhat

The predicted values of y for the full data set.

dhat1

The estimated derivatives dy/dx1 for the full data set.

dhat2

The estimated derivatives dy/dx2 for the full data set. All zeros if the model has only one explanatory variable.

yhat.se

Standard errors for the predicted values of y for the full data set.

dhat1.se

Standard errors for the estimated derivatives dy/dx1 for the full data set.

dhat2.se

Standard errors for the estimated derivatives dy/dx2 for the full data set. All zeros if the model has only one explanatory variable.

df1

tr(L), a measure of the degrees of freedom used in estimation.

df2

tr(L'L), an alternative measure of the degrees of freedom used in estimation.

sig2

Estimated residual variance, sig2 = rss/(n-2*df1+df2).

cv

Cross-validation measure. cv = mean(((y-yhat)/(1-infl))^2), where yhat is vector of predicted values for y and infl is the vector of diagonal terms for L.

gcv

gcv = n*(n*sig2)/((n-nreg)^2), where sig2 is the estimated residual variance and nreg = 2*df1 - df2.

infl

A vector containing the diagonal elements of L.

References

Cleveland, William S. and Susan J. Devlin, "Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting," Journal of the American Statistical Association 83 (1988), 596-610.

Loader, Clive. Local Regression and Likelihood. New York: Springer, 1999.

McMillen, Daniel P., "Issues in Spatial Data Analysis," Journal of Regional Science 50 (2010), 119-141.

McMillen, Daniel P., "Employment Densities, Spatial Autocorrelation, and Subcenters in Large Metropolitan Areas," Journal of Regional Science 44 (2004), 225-243.

McMillen, Daniel P. and John F. McDonald, "A Nonparametric Analysis of Employment Density in a Polycentric City," Journal of Regional Science 37 (1997), 591-612.

McMillen, Daniel P. and Christian Redfearn, "Estimation and Hypothesis Testing for Nonparametric Hedonic House Price Functions," Journal of Regional Science 50 (2010), 712-733.

Pagan, Adrian and Aman Ullah. Nonparametric Econometrics. New York: Cambridge University Press, 1999.

Silverman, A. W., Density Estimation for Statistics and Data Analysis, Chapman and Hall, New York (1986).

See Also

cparlwr

cubespline

fourier

lwrgrid

maketarget

semip

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
# 1. Monte Carlo data
n = 1000
x <- runif(n,0,2*pi)
x <- sort(x)
ybase <- x - .1*(x^2) + sin(x) - cos(x) -.5*sin(2*x) + .5*cos(2*x)
sig = sd(ybase)/2
y <- ybase + rnorm(n,0,sig)
par(ask=TRUE)
plot(x,y)
lines(x,ybase,col="red")
fit <- lwr(y~x, window=.15)
# plot 95% confidence intervals for predicted y 
predse <- sqrt(fit$sig2 + fit$yhat.se^2)
lower <- fit$yhat + qnorm(.025)*predse
upper <- fit$yhat + qnorm(.975)*predse
plot(x, ybase, type="l", ylim=c(min(lower), max(upper)), 
  main="Estimated Function", xlab="x", ylab="y")
lines(x, fit$yhat, col="red")
lines(x, lower, lty="dashed", col="red")
lines(x, upper, lty="dashed", col="red")
legend("topleft", c("Base", "Predicted", "95 Percent CI"), 
 col=c("black", "red", "red"), lty=c("solid", "solid", "dashed"), lwd=1)

# plot 95%  confidence intervals for slopes
dxbase <- 1 - .2*x + cos(x) + sin(x) - cos(2*x) - sin(2*x)
lower <- fit$dhat1 + qnorm(.025)*fit$dhat1.se
upper <- fit$dhat1 + qnorm(.975)*fit$dhat1.se
plot(x, dxbase, type="l", ylim=c(min(lower), max(upper)), 
  main="Estimated Slopes", xlab="x", ylab="y")
lines(x, fit$dhat1, col="red")
lines(x, lower, lty="dashed", col="red")
lines(x, upper, lty="dashed", col="red")
legend("topright", c("Base", "Predicted", "95 Percent CI"), 
 col=c("black", "red", "red"),lty=c("solid", "solid", "dashed"), lwd=1)

# Derivative estimates with larger window size
fit <- lwr(y~x,window=.20)
lower <- fit$dhat1 + qnorm(.025)*fit$dhat1.se
upper <- fit$dhat1 + qnorm(.975)*fit$dhat1.se
plot(x, dxbase, type="l", ylim=c(min(lower), max(upper)), 
  main="Estimated Slopes", xlab="x", ylab="y")
lines(x, fit$dhat1, col="red")
lines(x, lower, lty="dashed", col="red")
lines(x, upper, lty="dashed", col="red")
legend("topright", c("Base", "Predicted", "95 Percent CI"), 
 col=c("black", "red", "red"), lty=c("solid", "solid", "dashed"), lwd=1)

## Not run: 
#2. Population density data
library(RColorBrewer)

cook <- readShapePoly(system.file("maps/CookCensusTracts.shp",
  package="McSpatial"))
cook$obs <- seq(1:nrow(cook))
# measure distance to Chicago city center
lmat <- coordinates(cook)
cook$LONGITUDE <- lmat[,1]
cook$LATITUDE  <- lmat[,2]
cook$DCBD <- geodistance(longvar=cook$LONGITUDE,latvar=cook$LATITUDE,
  lotarget=-87.627800,latarget=41.881998,dcoor=FALSE)$dist
# population density = population/acres,  acres = square mile x 640
cook$LNDENS <- log(cook$POPULATION/(cook$AREA*640))
densdata <- data.frame(cook[cook$POPULATION>0,])
par(ask=TRUE)

# lndens = f(longitude, latitude), weights are function of straight-line distance
fit <- lwr(LNDENS~LONGITUDE+LATITUDE,  window=.10, 
   distance="Latlong",data=densdata)
c(fit$df1, fit$df2, 2*fit$df1-fit$df2)
cook$lwrhat[densdata$obs] <- fit$yhat
brks <- seq(min(cook$lwrhat,na.rm=TRUE),max(cook$lwrhat,na.rm=TRUE),length=9)
spplot(cook,"lwrhat",at=brks,col.regions=rev(brewer.pal(9,"RdBu")),
   main="Log Density LWR Estimates")


## End(Not run)

Example output

Loading required package: lattice
Loading required package: locfit
locfit 1.5-9.1 	 2013-03-22
Loading required package: maptools
Loading required package: sp
Checking rgeos availability: FALSE
 	Note: when rgeos is not available, polygon geometry 	computations in maptools depend on gpclib,
 	which has a restricted licence. It is disabled by default;
 	to enable gpclib, type gpclibPermit()
Loading required package: quantreg
Loading required package: SparseM

Attaching package: 'SparseM'

The following object is masked from 'package:base':

    backsolve

Loading required package: RANN
Warning message:
readShapePoly is deprecated; use rgdal::readOGR or sf::st_read 
[1] 45.41411 31.35058 59.47763

McSpatial documentation built on May 2, 2019, 9:32 a.m.