knnest, meany, vary, loclin, predict.knn, preprocessx, kmin, parvsnonparplot, nonparvsxplot, l1, l2, kNN, bestKperPoint    R Documentation

Full set of tools for kNN regression and classification, both for direct use and as tools for assessing the fit of parametric models.
Usage

kNN(x, y, newx = x, kmax, scaleX = TRUE, PCAcomps = 0,
    expandVars = NULL, expandVals = NULL, smoothingFtn = mean,
    allK = FALSE, leave1out = FALSE, classif = FALSE,
    startAt1 = TRUE, saveNhbrs = FALSE, savedNhbrs = NULL)
knnest(y, xdata, k, nearf = meany)
preprocessx(x, kmax, xval = FALSE)
meany(nearIdxs, x, y, predpt)
mediany(nearIdxs, x, y, predpt)
vary(nearIdxs, x, y, predpt)
loclin(nearIdxs, x, y, predpt)
## S3 method for class 'knn'
predict(object, ...)
kmin(y, xdata, lossftn = l2, nk = 5, nearf = meany)
parvsnonparplot(lmout, knnout, cex = 1.0)
nonparvsxplot(knnout, lmout = NULL)
nonparvarplot(knnout, returnPts = FALSE)
l2(y, muhat)
l1(y, muhat)
MAPE(yhat, y)
bestKperPoint(x, y, maxK, lossFtn = "MAPE", classif = FALSE)
kNNallK(x, y, newx = x, kmax, scaleX = TRUE, PCAcomps = 0,
    expandVars = NULL, expandVals = NULL, smoothingFtn = mean,
    allK = FALSE, leave1out = FALSE, classif = FALSE, startAt1 = TRUE)
kNNxv(x, y, k, scaleX = TRUE, PCAcomps = 0, smoothingFtn = mean,
    nSubSam = 500)
loclogit(nearIdxs, x, y, predpt)
exploreExpVars(xtrn, ytrn, xtst, ytst, k, eVar, maxEVal, lossFtn,
    eValIncr = 0.05, classif = FALSE, leave1out = FALSE)
plotExpVars(xtrn, ytrn, xtst, ytst, k, eVars, maxEVal, lossFtn,
    ylim, eValIncr = 0.05, classif = FALSE, leave1out = FALSE)
Arguments

nearf
Function to be applied to a neighborhood.
ylim 
Range of Y values for plot. 
lossFtn 
Loss function for plot. 
eVar 
Variable to be expanded. 
eVars 
Variables to be expanded. 
maxEVal 
Maximum expansion value. 
eValIncr 
Increment in range of expansion value. 
xtrn 
Training set for X. 
ytrn 
Training set for Y. 
xtst 
Test set for X. 
ytst 
Test set for Y. 
nearIdxs 
Indices of the neighbors. 
nSubSam 
Number of folds. 
x 
"X" data, predictors, one row per data point, in the training set. 
y 
Response variable data in the training set. Vector or matrix, the latter case for a vector-valued response, e.g. multiclass classification. In the classification case, y may instead be a vector of class labels, coded either (0,1,2,...) or (1,2,3,...), which is automatically converted into a matrix of dummy variables.
newx
New data points to be predicted. If NULL in the call to kNN, the regression function is estimated at the points in x and saved for later prediction; see Details.
scaleX
If TRUE, call scale() on the data before the analysis.
PCAcomps
If positive, replace the predictors by this many of their principal components before the analysis.
expandVars
Indices of columns in x to be expanded (weighted up or down; see Details).
expandVals 
The corresponding expansion values. 
smoothingFtn
Function to apply to the "Y" values in the set of nearest neighbors. Built-in choices are meany, mediany, vary and loclin.
allK
If TRUE, find regression estimates for all values of k from 1 through kmax.
leave1out
If TRUE, omit the 1-nearest neighbor from the analysis.
classif 
If TRUE, compute the predicted class labels, not just the regression function values 
startAt1 
If TRUE, class labels start at 1, else 0. 
k 
Number of nearest neighbors 
saveNhbrs
If TRUE, save the computed neighbor information in the nhbrs component of the return value, for reuse in later calls.
savedNhbrs
If non-NULL, the nhbrs component from the return value of a previous call, reused so that neighbors need not be recomputed.
...
Needed for consistency with the generic predict. See Details below for the arguments.
xdata
X data and associated neighbor indices; output of preprocessx.
object
Output of knnest.
predpt 
One point on which to predict, as a vector. 
kmax 
Maximal number of nearest neighbors to find. 
maxK 
Maximal number of nearest neighbors to find. 
xval 
Cross-validation flag. If TRUE, then the set of nearest neighbors of a point will not include the point itself.
lossftn
Loss function to be used in cross-validation determination of the "best" value of k.
nk
Number of values of k to be tried.
lmout
Output of lm.
knnout
Output of kNN or knnest.
cex 
R parameter to control dot size in plot. 
muhat 
Vector of estimated regression function values. 
yhat 
Vector of estimated regression function values. 
returnPts 
If TRUE, return matrix of plotted points. 
Details

The kNN function is the main tool here; knnest is being deprecated. (Note too qeKNN, a wrapper for kNN; more on this below.) Here are the capabilities:
In its most basic form, the function will input training data and output predictions for new cases newx. By default this is done for a single value of the number k of nearest neighbors, but by setting allK to TRUE, the user can request that it be done for all k through the specified maximum.
In the second form, newx is set to NULL in the call to kNN. No predictions are made; instead, the regression function is estimated at all data points in x, and those estimates are saved in the return value. Future new cases can then be predicted from this saved object, via predict.kNN (called via the generic predict). The call form is predict(knnout,newx,newxK), with a default value of 1 for newxK.
In this second form, the closest k points to newx within x are determined as usual, but instead of averaging their Y values, the average is taken over the fitted regression estimates at those points. In this manner, there is almost no computational cost in the prediction stage. The second form is thus intended more for production use, where neighbor distances need not be repeatedly recomputed.
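This two-phase workflow can be sketched as follows, assuming the regtools package is installed; the toy data are the ones used in the Examples section of this page.

```r
library(regtools)  # provides kNN() and predict.kNN()

x <- rbind(c(1,0), c(2,5), c(0,5), c(3,3), c(6,3))
y <- c(8, 3, 10, 11, 4)

# phase 1: newx = NULL, so no predictions are made; the regression
# function is estimated at the points in x and saved in the return value
knnout <- kNN(x, y, NULL, 2, scaleX = FALSE)

# phase 2: predict a new case from the saved object, via the generic
pred <- predict(knnout, c(0, 0))
pred
```

Note that the prediction here averages the saved fitted estimates at the neighbors of c(0,0), rather than the raw Y values, so it need not match a direct one-phase kNN call exactly.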
Nearest-neighbor computation can be time-consuming. If more than one value of k is anticipated for the same x, y and newx, first run with the largest anticipated value of k, with saveNhbrs set to TRUE. Then, for other values of k, set savedNhbrs to the nhbrs component in the return value of the first call.
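A minimal sketch of this neighbor-reuse pattern, assuming the regtools package is installed and again using the toy data from the Examples section:

```r
library(regtools)  # provides kNN()

x <- rbind(c(1,0), c(2,5), c(0,5), c(3,3), c(6,3))
y <- c(8, 3, 10, 11, 4)
newx <- c(0, 0)

# first call: largest anticipated k, saving the neighbor information
out5 <- kNN(x, y, newx, 5, scaleX = FALSE, saveNhbrs = TRUE)

# later call with a smaller k: reuse the saved nhbrs component,
# avoiding a fresh nearest-neighbor computation
out2 <- kNN(x, y, newx, 2, scaleX = FALSE, savedNhbrs = out5$nhbrs)
out2$regests
```

The estimates from the reuse call should agree with those from a direct call with the same k.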
In addition, a novel feature allows the user to weight some predictors more than others. This is done by scaling the given predictor up or down, according to a specified value. Normally, this should be done with scaleX = TRUE, which applies scale() to the data. In other words, we first create a "level playing field" in which all predictors have standard deviation 1.0, then scale some of them up or down.
Alternatives are provided to calculating the mean Y in the given neighborhood, such as the median and the variance, the latter of possible use in dealing with variance heterogeneity in linear models.
Another choice of note is locally-linear smoothing, obtained by setting smoothingFtn to loclin. Here the value of the regression function at a point is predicted from a linear fit to the point's neighbors. This may be especially helpful in counteracting bias near the edges of the data. As in any regression fit, the number of predictors should be considerably less than the number of neighbors. Custom functions for smoothing can easily be written, say following the pattern of loclin.
The main alternative to kNN is qeKNN in the qe* ("quick and easy") series. It is more convenient, e.g. allowing factor inputs, but less flexible.
The functions ovaknntrn and ovaknnpred are multiclass wrappers for knnest and knnpred, and thus are also deprecated. Here y is coded 0,1,...,m-1 for the m classes.
The tools here can be useful for fit assessment of parametric models. The parvsnonparplot function plots fitted values of the parametric model against the kNN fitted values; nonparvsxplot plots the kNN fitted values against each predictor, one by one.
The functions l2 and l1 are used to define L2 and L1 loss.
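For concreteness, plausible one-line forms of these loss functions are sketched below; these are illustrative definitions matching the l2(y,muhat) and l1(y,muhat) signatures in the Usage section, and the package's own code may differ in detail.

```r
# sketch of the two loss functions; illustrative, not the package source
l2 <- function(y, muhat) (y - muhat)^2   # squared-error (L2) loss
l1 <- function(y, muhat) abs(y - muhat)  # absolute-error (L1) loss
```

Either can be passed as the lossftn argument of kmin when cross-validating the choice of k.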
Author(s)

Norm Matloff
Examples

x <- rbind(c(1,0), c(2,5), c(0,5), c(3,3), c(6,3))
y <- c(8,3,10,11,4)
newx <- c(0,0)
kNN(x,y,newx,2,scaleX=FALSE)
# $whichClosest
#      [,1] [,2]
# [1,]    1    4
# $regests
# [1] 9.5

kNN(x,y,newx,3,scaleX=FALSE,smoothingFtn=loclin)$regests
# 7.307692

knnout <- kNN(x,y,newx,2,scaleX=FALSE)
knnout
# $whichClosest
#      [,1] [,2]
# [1,]    1    4
# ...

## Not run:
data(mlb)
mlb <- mlb[,c(4,6,5)]  # height, age, weight

# fit, then predict 75", age 21, and 72", age 32
knnout <- kNN(mlb[,1:2],mlb[,3],rbind(c(75,21),c(72,32)),25)
knnout$regests
# [1] 202.72 195.72

# fit now, predict later
knnout <- kNN(mlb[,1:2],mlb[,3],NULL,25)
predict(knnout,c(70,28))
# [1] 186.48

data(peDumms)
names(peDumms)
ped <- peDumms[,c(1,20,22:27,29,31,32)]
names(ped)

# fit, and predict income of a 35-year-old man, MS degree, occupation 101,
# worked 50 weeks, using 25 nearest neighbors
kNN(ped[,-10],ped[,10],c(35,1,0,0,1,0,0,0,1,50),25)
# $regests
# [1] 67540

# fit, and predict occupation 101 for a 35-year-old man, MS degree,
# wage $55K, worked 50 weeks, using 25 nearest neighbors
z <- kNN(ped[,-c(4:8)],ped[,4],c(35,1,0,1,55,50),25,classif=TRUE)
z$regests
# [1] 0.24
z$ypreds
# [1] 0  class 0, i.e. not occupation 101; round(0.24) = 0,
# computed by user request, classif = TRUE

# the y argument must be either a vector (2-class setting) or a matrix
# (multiclass setting)
occs <- as.matrix(ped[,4:8])
z <- kNN(ped[,-c(4:8)],occs,c(35,1,0,1,72000,50),25,classif=TRUE)
z$ypreds
# [1] 3  occupation 3, i.e. occ.102, is predicted

# predict occupation in general; let's bring occ.141 back in (was
# excluded as a predictor due to redundancy)
names(peDumms)
#  [1] "age"     "cit.1"   "cit.2"   "cit.3"   "cit.4"   "cit.5"   "educ.1"
#  [8] "educ.2"  "educ.3"  "educ.4"  "educ.5"  "educ.6"  "educ.7"  "educ.8"
# [15] "educ.9"  "educ.10" "educ.11" "educ.12" "educ.13" "educ.14" "educ.15"
# [22] "educ.16" "occ.100" "occ.101" "occ.102" "occ.106" "occ.140" "occ.141"
# [29] "sex.1"   "sex.2"   "wageinc" "wkswrkd" "yrentry"
occs <- as.matrix(peDumms[,23:28])
z <- kNN(ped[,-c(4:8)],occs,c(35,1,0,1,72000,50),25,classif=TRUE)
z$ypreds
# [1] 3  prediction is occ.102

# try weight age 0.5, wkswrked 1.5; use leave1out to avoid overfit
knnout <- kNN(ped[,-10],ped[,10],ped[,-10],25,leave1out=TRUE)
mean(abs(knnout$regests - ped[,10]))
# [1] 25341.6

# use of the weighted-distance feature; deweight age by a factor of 0.5,
# put increased weight on weeks worked, factor of 1.5
knnout <- kNN(ped[,-10],ped[,10],ped[,-10],25,
   expandVars=c(1,10),expandVals=c(0.5,1.5),leave1out=TRUE)
mean(abs(knnout$regests - ped[,10]))
# [1] 25196.61

## End(Not run)