Full set of tools for k-NN regression and classification, both for direct use and for assessing the fit of parametric models.
knnest(y,xdata,k,nearf=meany)
preprocessx(x,kmax,xval=FALSE)
meany(predpt,nearxy)
vary(predpt,nearxy)
loclin(predpt,nearxy)
## S3 method for class 'knn'
predict(object,...)
kmin(y,xdata,lossftn=l2,nk=5,nearf=meany)
parvsnonparplot(lmout,knnout,cex=1.0)
nonparvsxplot(knnout,lmout=NULL)
nonparvarplot(knnout,returnPts=FALSE)
l2(y,muhat)
l1(y,muhat)
y: Response variable data in the training set. Vector or matrix, the latter case for vector-valued response, e.g. multiclass classification.

x: X data, predictors, one row per data point, in the training set.

...: Needed for consistency with the generic. See Details below for the arguments.

xdata: X and associated neighbor indices; the output of preprocessx.

k: Number of nearest neighbors.

predpt: One point on which to predict, as a vector.

nearxy: A set of X neighbors of a point.

nearf: Function to apply to the nearest neighbors of a point.

kmax: Maximal number of nearest neighbors to find.

xval: Cross-validation flag. If TRUE, the set of nearest neighbors of a point will not include the point itself.

lossftn: Loss function to be used in cross-validation determination of the "best" k.

nk: Number of values of k to try in cross-validation.

cex: R parameter to control dot size in plots.

muhat: Vector of estimated regression function values.

returnPts: If TRUE, return the matrix of plotted points.
The knnest function does k-nearest neighbor regression function estimation, in any dimension (i.e. any number of predictor variables) and for any number of response variables. This of course includes the classification case: a scalar Y coded 0,1 would represent two classes, with the regression function reducing to the conditional probability of class 1, given the predictors.
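For concreteness, here is a minimal sketch of the two-class case on simulated data; the variable names (x, y, xd, ko) are illustrative, not part of the package.

set.seed(1)
x <- matrix(rnorm(200),ncol=2)  # 100 points, 2 predictors
y <- as.integer(x[,1] + x[,2] + rnorm(100) > 0)  # two classes, coded 0,1
xd <- preprocessx(x,25)  # neighbor info, kmax = 25
ko <- knnest(y,xd,25)  # k = 25
head(ko$regest)  # estimated P(Y = 1 | X) at the training points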
The preprocessx function does the prep work. For each row of x, the code finds the kmax closest rows to that row. By separating this computation from knnest, one can save a lot of overall computing time. If for instance one wants to try the number of nearest neighbors k at 25, 50 and 100, one can call preprocessx with kmax equal to 100 and then reuse the results; in calling knnest for several values of k, one need not call preprocessx again. Setting xval to TRUE turns on cross-validation: the neighborhood of a point will not include the point itself. Note that this is set in preprocessx, not in knnest.
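A sketch of that reuse pattern, with x and y as in the sketch above (smaller numbers of neighbors here, since that sketch has only 100 points):

xd <- preprocessx(x,50,xval=TRUE)  # one neighbor computation, kmax = 50
ko10 <- knnest(y,xd,10)  # now try several values of k...
ko25 <- knnest(y,xd,25)
ko50 <- knnest(y,xd,50)  # ...with no further calls to preprocessx()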
One can specify various types of smoothing through the nearf argument. The default, meany, is the standard averaging of the neighbor Y values. Another possible choice is vary, specifying calculation of the sample variance of those Y values; this is useful in assessing heteroscedasticity in a linear model.
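For instance, to gauge how the variability of Y changes across the predictor space (a sketch, reusing x, y and xd from above):

vo <- knnest(y,xd,50,nearf=vary)  # sample variance of Y in each neighborhood
plot(x[,1],vo$regest)  # a trend in spread would suggest heteroscedasticity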
Another choice is to specify local linear smoothing, by setting nearf to loclin. Here the value of the regression function at a point is predicted from a linear fit to the point's neighbors. This may be especially helpful in counteracting bias near the edges of the data. As in any regression fit, the number of predictors should be considerably less than the number of neighbors.
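A sketch of the local-linear variant, again with x, y and xd from above; k = 50 neighbors is comfortably larger than the two predictors:

lo <- knnest(y,xd,50,nearf=loclin)  # linear fit within each neighborhood
head(lo$regest)  # local-linear regression estimates at the training points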
The X, i.e. predictor, data will be scaled by the code, so as to put all predictor variables on an equal footing. The scaling parameters will be recorded, and then applied later in prediction.
The predict.knn function uses the output of knnest to do regression estimation or prediction on new points. Since the output of knnest is of class 'knn', one invokes this function via the simpler generic call predict. The second argument is the set of new points, named predpts within the code. It is specified as a matrix if there is more than one prediction point and more than one predictor variable; otherwise, use a vector.
A "1-NN" method is used here: Given a new point u whose
Y value we wish to predict, the code finds the single closest row
in the training set, and returns the previously-estimated regression
function value at that row. If u needs to be scaled, specify
TRUE in the third argument of
otherwise specify FALSE.
It can be shown that nearest-neighbor (or kernel) regression
estimates are subject to substantial bias near the fringes of the
data; the further away from the center of the data, the worse the
bias. This can be mitigated by user specification that a local
linear regression be applied, as follows: For each new point u to
predict, the r closest X rows in the training set to u will be found,
and a linear regression of the corresponding Y values against those X
values will be computed. The result of that operation will be used to
predict the Y value at the point u. The value of r is specified as
the third argument in the call to
predict; if left
unspecified, the 1-NN method is used as described above, and it may
be more accurate than the local-linear approach within the bulk of
the data set.
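A sketch of prediction on new points, reusing the ko object from the first sketch; TRUE in the third argument requests that the new points be scaled as the training data were:

newx <- rbind(c(0.5,-1.0),c(1.0,0.2))  # two new points, two predictors
predict(ko,newx,TRUE)  # predicted Y (here, est. class-1 probability) at each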
The ovaknnpred function and related tools handle the multiclass case, in which y is coded 0, 1, ..., m-1 for the m classes.
The tools here can be useful for fit assessment of parametric models. The parvsnonparplot function plots the fitted values of the parametric model against the kNN fitted values, and nonparvsxplot plots the k-NN fitted values against each predictor, one by one. The functions l2 and l1 are used to define L2 and L1 loss.
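A sketch of that workflow, fitting lm() to the same illustrative data used above and comparing it to the k-NN fit ko:

xyd <- data.frame(x1=x[,1],x2=x[,2],y=y)
lmo <- lm(y ~ x1 + x2,data=xyd)  # parametric (here, linear probability) model
parvsnonparplot(lmo,ko)  # points far off the diagonal suggest lack of fit
nonparvsxplot(ko,lmo)  # k-NN fitted values against each predictor in turn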
The return value of preprocessx is an R list. Its x component is the scaled x matrix, with the scaling factors being recorded in the scaling component. The idxs component contains the indices of the nearest neighbors of each point in the predictor data, stored in a matrix with nrow(x) rows and kmax columns. Row i contains the indices of the rows of x nearest to row i of x, the first of these indices being for the closest point, then the second-closest, and so on. If cross-validation was requested via xval = TRUE, a point will not be considered a neighbor of itself.
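For example (a sketch, with xd from above; the component names are as just described):

xd$scaling  # recorded scaling factors
dim(xd$idxs)  # nrow(x) rows, kmax columns
xd$idxs[1,1:5]  # the 5 rows of x closest to row 1, nearest first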
The knnest function returns an expanded version of xdata, the expansion consisting of a new component regest, the estimated regression function values at the training set points. The predict.knn function returns the predicted Y values at predpts; it is called simply via the generic predict.
One can explore the effects of various numbers of nearest neighbors k through the kmin function. (This function should be considered experimental.) It will run knnest for the values of k specified via nk. If the latter is a number, the range from 0 to xdata$kmax will be divided into nk subintervals, and the values of k used will be the right endpoints of those subintervals. The function returns an R list, with the component meanerrs containing the cross-validated mean loss function values and ks containing the corresponding values of k; plot.knn then plots the former against the latter.
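A sketch, reusing y and xd from above; the components accessed are those just described:

kmo <- kmin(y,xd,lossftn=l2,nk=5)  # cross-validate 5 values of k
kmo$ks  # the values of k tried
kmo$meanerrs  # the corresponding cross-validated mean losses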
set.seed(9999)
x <- matrix(sample(1:100,30),ncol=3)
xd <- preprocessx(x[,1],2,TRUE)  # just 1 predictor
ko <- knnest(x[,2],xd,2)  # Y is x[,2]
ko$regest  # 1st element = 74.5
predict(ko,matrix(76),TRUE)  # 47.5
ko <- knnest(x[,-1],xd,2)  # Y bivar
ko$regest  # 1st row = (74.5,31.5)
predict(ko,matrix(76),TRUE)  # 47.5, 65.0

set.seed(9999)
xe <- matrix(rnorm(30000),ncol=3)
xe[,-3] <- xe[,-3] + 2
# xe is 2 predictors and epsilon
y <- xe %*% c(1,0.5,0.2)  # Y
x <- xe[,-3]  # X
xdata <- preprocessx(x,500)  # k as high as 500
zout <- knnest(y,xdata,200)
predict(zout,matrix(c(1,1),nrow=1),TRUE)  # about 1.55
predict(zout,rbind(c(1,1),c(2,1.2)),TRUE)  # about 1.55, 2.58
predict(zout,rbind(c(0,0)),TRUE)  # about 0.63

## Not run:
data(prgeng)
pe <- prgeng
# dummies for MS, PhD
pe$ms <- as.integer(pe$educ == 14)
pe$phd <- as.integer(pe$educ == 16)
# computer occupations only
pecs <- pe[pe$occ >= 100 & pe$occ <= 109,]
# for simplicity, let's choose a few predictors
pecs1 <- pecs[,c(1,7,9,12,13,8)]
# will predict wage income from age, gender etc.
# prepare nearest-neighbor data, k up to 50
xdata <- preprocessx(pecs1[,1:5],50)
zout <- knnest(pecs1[,6],xdata,50)  # k = 50
# find the est. mean income for 42-year-old women, 52 weeks worked, with
# a Master's
predict(zout,matrix(c(42,2,52,0,0),nrow=1),TRUE)  # 62106
# try k = 25; don't need to call preprocessx() again
zout <- knnest(pecs1[,6],xdata,25)
predict(zout,matrix(c(42,2,52,0,0),nrow=1),TRUE)  # 69104
# quite a difference; what k values are good?
kmout <- kmin(pecs1[,6],xdata)  # at least 50
# what about a man?
zout <- knnest(pecs1[,6],xdata,50)
predict(zout,matrix(c(42,1,52,0,0),nrow=1),TRUE)  # 78588
# form training and test sets, fit on the former and predict on the
# latter
fullidxs <- 1:nrow(pecs1)
train <- sample(fullidxs,10000)
xdata <- preprocessx(pecs1[train,1:5],50)
trainout <- knnest(pecs1[train,6],xdata,50)
testout <- predict(trainout,as.matrix(pecs1[-train,-6]),TRUE)
# find mean abs. prediction error (about $25K)
mean(abs(pecs1[-train,6] - testout))
# examples of fit assessment
# look for nonlinear relations between Y and each X
nonparvsxplot(zout)  # keep hitting Enter for next plot
# there seem to be quadratic relations with age and wkswrkd, so add quad
# terms and run lm()
pecs2 <- pecs1
pecs2$age2 <- pecs1$age^2
pecs2$wks2 <- pecs1$wkswrkd^2
lmout2 <- lm(wageinc ~ .,data=pecs2)
# check parametric fit by comparing to kNN
parvsnonparplot(lmout2,zout)
# linear model line somewhat faint, due to large n;
# parametric model seems to overpredict at high end;
# to deal with faintness, reduce size of points
parvsnonparplot(lmout2,zout,cex=0.1)
# assess homogeneity of conditional variance
nonparvarplot(zout)  # hockey stick!

## End(Not run)

# Y vector-valued (3 classes)
# 3 clusters, equal wts, coded 0,1,2
n <- 1500
# within-grp cov matrix
cv <- rbind(c(1,0.2),c(0.2,1))
xy <- NULL
for (i in 1:3)
   xy <- rbind(xy,rmvnorm(n,mean=rep(i*2.0,2),sigma=cv))
y <- rep(0:2,each=n)
xy <- cbind(xy,dummy(y))
xdata <- preprocessx(xy[,-(3:5)],20)  # X is xy[,1:2], k <= 20
ko <- knnest(xy[,3:5],xdata,20)
# find predicted Y for each data pt
mx <- apply(as.matrix(ko$regest),1,which.max) - 1
# overall correct classification rate
mean(mx == y)  # should be about 0.87