Description
A full set of tools for kNN regression and classification, both for direct use and for assessing the fit of parametric models.
Usage

knnest(y,xdata,k,nearf=meany)
preprocessx(x,kmax,xval=FALSE)
meany(predpt,nearxy)
vary(predpt,nearxy)
loclin(predpt,nearxy)
## S3 method for class 'knn'
predict(object,...)
kmin(y,xdata,lossftn=l2,nk=5,nearf=meany)
parvsnonparplot(lmout,knnout,cex=1.0)
nonparvsxplot(knnout,lmout=NULL)
nonparvarplot(knnout,returnPts=FALSE)
l2(y,muhat)
l1(y,muhat)

Arguments

y: Response variable data in the training set. Vector or matrix, the latter case for a vector-valued response, e.g. multiclass classification.

x: X data, predictors, one row per data point, in the training set.

...: Needed for consistency with the generic. See Details below for the arguments.

xdata: X data and associated nearest-neighbor indices; output of preprocessx.

k: Number of nearest neighbors.

object: Output of knnest.

predpt: One point on which to predict, as a vector.

nearxy: A set of X neighbors of a point.

nearf: Function to apply to the nearest neighbors of a point.

kmax: Maximal number of nearest neighbors to find.

xval: Cross-validation flag. If TRUE, then the set of nearest neighbors of a point will not include the point itself.

lossftn: Loss function to be used in the cross-validation determination of the "best" number of nearest neighbors.

nk: Number of values of k to try in cross-validation; see Details.

lmout: Output of lm.

knnout: Output of knnest.

cex: R parameter controlling dot size in plots.

muhat: Vector of estimated regression function values.

returnPts: If TRUE, return the matrix of plotted points.
Details

The knnest function does k-nearest-neighbor regression function estimation, in any dimension, i.e. any number of predictor variables, and with any number of response variables. This of course includes the classification case: a scalar Y coded 0,1 would represent two classes, with the regression function reducing to the conditional probability of class 1, given the predictors.
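As a small illustration (made-up data, not from the package's own examples):

set.seed(1)
x <- matrix(rnorm(200),ncol=2)                   # 100 points, 2 predictors
y <- as.integer(x[,1] + x[,2] + rnorm(100) > 0)  # two classes, coded 0,1
xd <- preprocessx(x,25)
ko <- knnest(y,xd,25)
head(ko$regest)  # estimated P(Y = 1 | X) at the first training points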
The preprocessx function does the prep work. For each row in x, the code finds the kmax closest rows to that row. By separating this computation from knnest, one can save a lot of overall computing time: if for instance one wants to try the number of nearest neighbors k at 25, 50 and 100, one can call preprocessx with kmax equal to 100 and then reuse the results; in calling knnest for the several values of k, we do not need to call preprocessx again. Setting xval to TRUE turns on cross-validation, meaning that the neighborhood of a point will not include the point itself; note that this is set in preprocessx, not in knnest.
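A minimal sketch of this reuse pattern, again with made-up data:

x <- matrix(rnorm(2000),ncol=2)       # 1000 points, 2 predictors
y <- x[,1] + 0.5*x[,2] + rnorm(1000)
xd <- preprocessx(x,100)              # find up to 100 neighbors, once
ko25 <- knnest(y,xd,25)               # then fit several values of k
ko50 <- knnest(y,xd,50)               #   without recomputing neighbors
ko100 <- knnest(y,xd,100)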
One can specify various types of smoothing by proper specification of the nearf function. The default is meany, specifying the standard averaging of the neighbors' Y values. Another possible choice is vary, specifying calculation of the sample variance of those Y values; this is useful in assessing heteroscedasticity in a linear model.
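Continuing the made-up data above, for instance:

vout <- knnest(y,xd,25,nearf=vary)  # sample variance of neighbors' Y
head(vout$regest)                   # widely varying values suggest
                                    #   heteroscedasticity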
Another choice is to specify local linear smoothing, by setting nearf to loclin. Here the value of the regression function at a point is predicted from a linear fit to the point's neighbors. This may be especially helpful in counteracting bias near the edges of the data. As in any regression fit, the number of predictors should be considerably less than the number of neighbors.
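Again with the made-up data above (2 predictors, well under the 25 neighbors):

lout <- knnest(y,xd,25,nearf=loclin)  # local-linear fit within each
head(lout$regest)                     #   neighborhood of 25 points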
The X, i.e. predictor, data will be scaled by the code, so as to put all predictor variables on an equal footing. The scaling parameters will be recorded, and then applied later in prediction.
The function predict.knn uses the output of knnest to do regression estimation or prediction on new points. Since the output of knnest is of class 'knn', one invokes this function via the simpler predict. The second argument is the set of new points, named predpts within the code. It is specified as a matrix if there is more than one prediction point and more than one predictor variable; otherwise, use a vector.
A "1NN" method is used here: Given a new point u whose
Y value we wish to predict, the code finds the single closest row
in the training set, and returns the previouslyestimated regression
function value at that row. If u needs to be scaled, specify
TRUE
in the third argument of predict
;
otherwise specify FALSE.
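For instance, continuing the made-up data above:

ko <- knnest(y,xd,50)
predict(ko,rbind(c(0.5,1.0),c(0,0)),TRUE)   # two new points, one per row;
                                            #   TRUE: scale them as the
                                            #   training data were scaled
predict(ko,matrix(c(0.5,1.0),nrow=1),TRUE)  # a single new point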
It can be shown that nearest-neighbor (or kernel) regression estimates are subject to substantial bias near the fringes of the data; the further away from the center of the data, the worse the bias. This can be mitigated by specifying that a local linear regression be applied, as follows: For each new point u to predict, the r closest X rows in the training set to u will be found, and a linear regression of the corresponding Y values against those X values will be computed. The result of that operation is then used to predict the Y value at the point u. The value of r is specified as the third argument in the call to predict; if left unspecified, the 1-NN method is used as described above, and within the bulk of the data set it may be more accurate than the local-linear approach.
The functions ovaknntrn and ovaknnpred are multiclass wrappers for knnest and knnpred. Here y is coded 0, 1, ..., m-1 for the m classes.
The tools here can be useful for fit assessment of parametric models. The parvsnonparplot function plots the fitted values of the parametric model against the kNN fitted values; nonparvsxplot plots the kNN fitted values against each predictor, one at a time.
The functions l2 and l1 are used to define L2 and L1 loss.
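A plausible minimal reading, consistent with the signatures in the Usage section (the package's own definitions may differ in detail):

l2 <- function(y,muhat) (y - muhat)^2   # squared-error loss
l1 <- function(y,muhat) abs(y - muhat)  # absolute-error loss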
Value

The return value of preprocessx is an R list. Its x component is the scaled x matrix, with the scaling factors being recorded in the scaling component. The idxs component contains the indices of the nearest neighbors of each point in the predictor data, stored in a matrix with nrow(x) rows and kmax columns. Row i contains the indices of the nearest rows in x to row i of x; the first of these indices is for the closest point, then the second-closest, and so on. If cross-validation is requested (xval = TRUE), a point will not be considered a neighbor of itself.
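For example, on a small made-up data set:

x <- matrix(rnorm(60),ncol=2)     # 30 points, 2 predictors
xd <- preprocessx(x,5,xval=TRUE)
dim(xd$idxs)   # 30 x 5: one row of neighbor indices per data point
xd$idxs[1,]    # neighbors of point 1, closest first; point 1 itself
               #   is excluded, since xval = TRUE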
The knnest function returns an expanded version of xdata, with the expansion consisting of a new component regest, the estimated regression function values at the training set points.

The function predict.knn returns the predicted Y values at predpts. It is called simply via predict.
One can explore the effect of various numbers of nearest neighbors k through the kmin function. (This function should be considered experimental.) It will run knnest for the values of k specified in nk. If the latter is a number, the range from 0 to xdata$kmax will be divided into nk equal subintervals, and the values of k used will be the right endpoints of those subintervals. The function returns an R list, with the component meanerrs containing the cross-validated mean loss function values and ks containing the corresponding values of k; plot.knn then plots the former against the latter.
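A small made-up run, assuming the neighbors were found with xval = TRUE so that cross-validation is possible:

x <- matrix(rnorm(2000),ncol=2)
y <- x[,1] + 0.5*x[,2] + rnorm(1000)
xd <- preprocessx(x,100,xval=TRUE)
kmout <- kmin(y,xd,lossftn=l2,nk=5)  # try 5 values of k, up to 100
kmout$ks                             # the values of k tried
kmout$meanerrs                       # cross-validated mean loss at each k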
Author(s)

Norm Matloff
Examples

set.seed(9999)
x <- matrix(sample(1:100,30),ncol=3)
xd <- preprocessx(x[,1],2,TRUE) # just 1 predictor
ko <- knnest(x[,2],xd,2) # Y is x[,2]
ko$regest # 1st element = 74.5
predict(ko,matrix(76),TRUE) # 47.5
ko <- knnest(x[,-1],xd,2) # Y bivariate, columns 2 and 3
ko$regest # 1st row = (74.5,31.5)
predict(ko,matrix(76),TRUE) # 47.5, 65.0
set.seed(9999)
xe <- matrix(rnorm(30000),ncol=3)
xe[,3] <- xe[,3] + 2
# xe is 2 predictors and epsilon
y <- xe %*% c(1,0.5,0.2) # Y
x <- xe[,-3] # X, the first 2 columns
xdata <- preprocessx(x,500) # k as high as 500
zout <- knnest(y,xdata,200)
predict(zout,matrix(c(1,1),nrow=1),TRUE) # about 1.55
predict(zout,rbind(c(1,1),c(2,1.2)),TRUE) # about 1.55, 2.58
predict(zout,rbind(c(0,0)),TRUE) # about 0.63
## Not run:
data(prgeng)
pe <- prgeng
# dummies for MS, PhD
pe$ms <- as.integer(pe$educ == 14)
pe$phd <- as.integer(pe$educ == 16)
# computer occupations only
pecs <- pe[pe$occ >= 100 & pe$occ <= 109,]
# for simplicity, let's choose a few predictors
pecs1 <- pecs[,c(1,7,9,12,13,8)]
# will predict wage income from age, gender etc.
# prepare nearest-neighbor data, k up to 50
xdata <- preprocessx(pecs1[,1:5],50)
zout <- knnest(pecs1[,6],xdata,50) # k = 50
# find the est. mean income for 42-year-old women, 52 weeks worked, with
# a Master's
predict(zout,matrix(c(42,2,52,0,0),nrow=1),TRUE) # 62106
# try k = 25; don't need to call preprocessx() again
zout <- knnest(pecs1[,6],xdata,25)
predict(zout,matrix(c(42,2,52,0,0),nrow=1),TRUE) # 69104
# quite a difference; what k values are good?
kmout <- kmin(pecs1[,6],xdata) # at least 50
# what about a man?
zout <- knnest(pecs1[,6],xdata,50)
predict(zout,matrix(c(42,1,52,0,0),nrow=1),TRUE) # 78588
# form training and test sets, fit on the former and predict on the
# latter
fullidxs <- 1:nrow(pecs1)
train <- sample(fullidxs,10000)
xdata <- preprocessx(pecs1[train,1:5],50)
trainout <- knnest(pecs1[train,6],xdata,50)
testout <- predict(trainout,as.matrix(pecs1[-train,-6]),TRUE)
# find mean abs. prediction error (about $25K)
mean(abs(pecs1[-train,6] - testout))
# examples of fit assessment
# look for nonlinear relations between Y and each X
nonparvsxplot(zout) # keep hitting Enter for next plot
# there seem to be quadratic relations with age and wkswrkd, so add quad
# terms and run lm()
pecs2 <- pecs1
pecs2$age2 <- pecs1$age^2
pecs2$wks2 <- pecs1$wkswrkd^2
lmout2 <- lm(wageinc ~ .,data=pecs2)
# check parametric fit by comparing to kNN
parvsnonparplot(lmout2,zout)
# linear model line somewhat faint, due to large n;
# parametric model seems to overpredict at high end;
# to deal with faintness, reduce size of points
parvsnonparplot(lmout2,zout,cex=0.1)
# assess homogeneity of conditional variance
nonparvarplot(zout)
# hockey stick!
## End(Not run)
# Y vector-valued (3 classes)
# 3 clusters, equal wts, coded 0,1,2
n <- 1500
# within-group cov matrix
cv <- rbind(c(1,0.2),c(0.2,1))
xy <- NULL
for (i in 1:3)
xy <- rbind(xy,rmvnorm(n,mean=rep(i*2.0,2),sigma=cv))
# note: rmvnorm() above is from the 'mvtnorm' package; dummy() below
# builds 0/1 class indicator columns (e.g. from the 'dummies' package)
y <- rep(0:2,each=n)
xy <- cbind(xy,dummy(y))
xdata <- preprocessx(xy[,-(3:5)],20) # X is xy[,1:2], k <= 20
ko <- knnest(xy[,3:5],xdata,20)
# find predicted Y for each data pt
mx <- apply(as.matrix(ko$regest),1,which.max) - 1
# overall correct classification rate
mean(mx == y) # should be about 0.87
