knnest: k-NN Nonparametric Regression and Classification

View source: R/KNN.R


k-NN Nonparametric Regression and Classification

Description

A full set of tools for k-NN regression and classification, both for direct use and for assessing the fit of parametric models.

Usage

kNN(x,y,newx=x,kmax,scaleX=TRUE,PCAcomps=0,expandVars=NULL,expandVals=NULL,
   smoothingFtn=mean,allK=FALSE,leave1out=FALSE, classif=FALSE,
   startAt1=TRUE,saveNhbrs=FALSE,savedNhbrs=NULL)
knnest(y,xdata,k,nearf=meany)
preprocessx(x,kmax,xval=FALSE)
meany(nearIdxs,x,y,predpt) 
mediany(nearIdxs,x,y,predpt) 
vary(nearIdxs,x,y,predpt) 
loclin(nearIdxs,x,y,predpt) 
## S3 method for class 'knn'
predict(object,...)
kmin(y,xdata,lossftn=l2,nk=5,nearf=meany) 

parvsnonparplot(lmout,knnout,cex=1.0) 
nonparvsxplot(knnout,lmout=NULL) 
nonparvarplot(knnout,returnPts=FALSE)
l2(y,muhat)
l1(y,muhat)
MAPE(yhat,y)
bestKperPoint(x,y,maxK,lossFtn="MAPE",classif=FALSE)
kNNallK(x,y,newx=x,kmax,scaleX=TRUE,PCAcomps=0,
   expandVars=NULL,expandVals=NULL,smoothingFtn=mean,
   allK=FALSE,leave1out=FALSE,classif=FALSE,startAt1=TRUE)
kNNxv(x,y,k,scaleX=TRUE,PCAcomps=0,smoothingFtn=mean,
   nSubSam=500)
loclogit(nearIdxs,x,y,predpt)
exploreExpVars(xtrn, ytrn, xtst, ytst, k, eVar, maxEVal, lossFtn, 
    eValIncr = 0.05, classif = FALSE, leave1out = FALSE) 
plotExpVars(xtrn,ytrn,xtst,ytst,k,eVars,maxEVal,lossFtn,
   ylim,eValIncr=0.05,classif=FALSE,leave1out=FALSE)

Arguments

nearf

Function to be applied to a neighborhood.

ylim

Range of Y values for plot.

lossFtn

Loss function for plot.

eVar

Variable to be expanded.

eVars

Variables to be expanded.

maxEVal

Maximum expansion value.

eValIncr

Increment in range of expansion value.

xtrn

Training set for X.

ytrn

Training set for Y.

xtst

Test set for X.

ytst

Test set for Y.

nearIdxs

Indices of the neighbors.

nSubSam

Number of folds.

x

"X" data, predictors, one row per data point, in the training set.

y

Response variable data in the training set. Vector or matrix, the latter for a vector-valued response, e.g. multiclass classification. In the multiclass case, y may instead be given as a vector of class labels, coded either (0,1,2,...) or (1,2,3,...), which is automatically converted into a matrix of dummies.

newx

New data points to be predicted. If NULL in kNN, the regression function estimates are computed on x and saved for future prediction with predict.kNN.

scaleX

If TRUE, call scale() on x and newx.

PCAcomps

If positive, transform x and newx by PCA, using the top PCAcomps principal components. Currently disabled.

expandVars

Indices of columns in x to expand.

expandVals

The corresponding expansion values.

smoothingFtn

Function to apply to the "Y" values in the set of nearest neighbors. Built-in choices are meany, mediany, vary, loclin and loclogit.

allK

If TRUE, find regression estimates for all k through kmax. Currently disabled.

leave1out

If TRUE, omit the 1-nearest neighbor from the analysis.

classif

If TRUE, compute the predicted class labels, not just the regression function values.

startAt1

If TRUE, class labels start at 1, else 0.

k

Number of nearest neighbors.

saveNhbrs

If TRUE, place the output of FNN::get.knnx into the nhbrs component of the return value.

savedNhbrs

If non-NULL, this is the nhbrs component in the return value of a previous call; newx must be the same in both calls.

...

Needed for consistency with the generic. See Details below for the arguments.

xdata

X and associated neighbor indices. Output of preprocessx.

object

Output of knnest.

predpt

One point on which to predict, as a vector.

kmax

Maximal number of nearest neighbors to find.

maxK

Maximal number of nearest neighbors to find.

xval

Cross-validation flag. If TRUE, then the set of nearest neighbors of a point will not include the point itself.

lossftn

Loss function to be used in cross-validation determination of "best" k.

nk

Number of values of k to try in cross-validation.

lmout

Output of lm.

knnout

Output of knnest.

cex

R parameter to control dot size in plot.

muhat

Vector of estimated regression function values.

yhat

Vector of estimated regression function values.

returnPts

If TRUE, return matrix of plotted points.

Details

The kNN function is the main tool here; knnest is being deprecated. (Note too qeKNN, a wrapper for kNN; more on this below.) Here are the capabilities:

In its most basic form, the function will input training data and output predictions for new cases newx. By default this is done for a single value of the number k of nearest neighbors, but by setting allK to TRUE, the user can request that it be done for all k through the specified maximum.

In the second form, newx is set to NULL in the call to kNN. No predictions are made; instead, the regression function is estimated at all data points in x, and those estimates are saved in the return value. Future new cases can then be predicted from this saved object, via predict.kNN (called via the generic predict). The call form is predict(knnout,newx,newxK), with a default value of 1 for newxK.

In this second form, the k closest points in x to a new case in newx are determined as usual, but instead of averaging their Y values, the average is taken over the fitted regression estimates at those points. In this manner, there is almost no computational cost in the prediction stage.

The second form is intended more for production use, so that neighbor distances need not be repeatedly recomputed.
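
For instance, the two stages might look like this (a minimal sketch, using the mlb data from the Examples below; newxK is left at its default of 1):

knnout <- kNN(mlb[,1:2],mlb[,3],NULL,25)  # fit only; no predictions yet
predict(knnout,c(70,28))  # later: predict weight for height 70, age 28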

Nearest-neighbor computation can be time-consuming. If more than one value of k is anticipated, for the same x, y and newx, first run with the largest anticipated value of k, with saveNhbrs set to TRUE. Then for other values of k, set savedNhbrs to the nhbrs component in the return value of the first call.
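
For example (a sketch; here 25 plays the role of the largest anticipated k):

out25 <- kNN(x,y,newx,25,saveNhbrs=TRUE)  # full neighbor computation
out10 <- kNN(x,y,newx,10,savedNhbrs=out25$nhbrs)  # reuses those neighbors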

In addition, a novel feature allows the user to weight some predictors more than others. This is done by scaling the given predictor up or down, according to a specified value. Normally, this should be done with scaleX = TRUE, which applies scale() to the data. In other words, first we create a "level playing field" in which all predictors have standard deviation 1.0, then scale some of them up or down.
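
For example (a sketch, paralleling the weighting example in the Examples section below):

# deweight predictor 1 by a factor of 0.5, upweight predictor 10 by 1.5;
# with the default scaleX=TRUE, this acts on the standardized predictors
knnout <- kNN(x,y,newx,25,expandVars=c(1,10),expandVals=c(0.5,1.5))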

Alternatives are provided to calculating the mean Y in the given neighborhood, such as the median and the variance, the latter of possible use in dealing with heterogeneity in linear models.
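
For instance (a sketch), setting smoothingFtn to vary replaces the neighborhood mean by the neighborhood variance, so that the returned estimates are of the conditional variance of Y given X:

knnout <- kNN(x,y,newx,25,smoothingFtn=vary)  # regests estimate Var(Y|X)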

Another choice of note is to allow local-linear smoothing, by setting smoothingFtn to loclin. Here the value of the regression function at a point is predicted from a linear fit to the point's neighbors. This may be especially helpful to counteract bias near the edges of the data. As in any regression fit, the number of predictors should be considerably less than the number of neighbors.

Custom functions for smoothing can easily be written, say following the pattern of loclin.
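
As an illustration, here is a hypothetical custom smoother, a trimmed mean, written to the same argument signature as meany and the other built-ins in the Usage section (a sketch only):

trimMeany <- function(nearIdxs,x,y,predpt) {
   # average the Y values of the neighbors, trimming 10% from each tail
   mean(y[nearIdxs],trim=0.10)
}
knnout <- kNN(x,y,newx,25,smoothingFtn=trimMeany)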

The main alternative to kNN is qeKNN in the qe* ("quick and easy") series. It is more convenient, e.g. allowing factor inputs, but less flexible.

The functions ovaknntrn and ovaknnpred are multiclass wrappers for knnest and knnpred, thus also deprecated. Here y is coded 0,1,...,m-1 for the m classes.

The tools here can also be useful for assessing the fit of parametric models. The parvsnonparplot function plots the fitted values of the parametric model against the kNN fitted values; nonparvsxplot plots the k-NN fitted values against each predictor, one by one.
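
For example (a sketch, again with the mlb data; knnest and preprocessx are called as in the Usage section above):

xd <- preprocessx(mlb[,1:2],25)  # neighbor info, kmax = 25
knnout <- knnest(mlb[,3],xd,25)  # k-NN fit
lmout <- lm(mlb[,3] ~ mlb[,1] + mlb[,2])  # parametric fit
parvsnonparplot(lmout,knnout)  # parametric vs. k-NN fitted values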

The functions l2 and l1 define L2 and L1 loss, e.g. for use in kmin; MAPE computes mean absolute prediction error.
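
For instance (a sketch), one might select k by cross-validation under L1 rather than the default L2 loss:

xd <- preprocessx(mlb[,1:2],50,xval=TRUE)  # exclude each point from its own neighborhood
kmin(mlb[,3],xd,lossftn=l1,nk=5)  # try 5 values of k, report the best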

Author(s)

Norm Matloff

Examples


x <- rbind(c(1,0),c(2,5),c(0,5),c(3,3),c(6,3))
y <- c(8,3,10,11,4)
newx <- c(0,0)

kNN(x,y,newx,2,scaleX=FALSE)
# $whichClosest
#      [,1] [,2]
# [1,]    1    4
# $regests
# [1] 9.5

kNN(x,y,newx,3,scaleX=FALSE,smoothingFtn=loclin)$regests
# 7.307692

knnout <- kNN(x,y,newx,2,scaleX=FALSE)
knnout
# $whichClosest
#      [,1] [,2]
# [1,]    1    4
# ...

## Not run: 
data(mlb) 
mlb <- mlb[,c(4,6,5)]  # height, age, weight
# fit, then predict 75", age 21, and 72", age 32
knnout <- kNN(mlb[,1:2],mlb[,3],rbind(c(75,21),c(72,32)),25) 
knnout$regests
# [1] 202.72 195.72

# fit now, predict later
knnout <- kNN(mlb[,1:2],mlb[,3],NULL,25) 
predict(knnout,c(70,28)) 
# [1] 186.48

data(peDumms) 
names(peDumms) 
ped <- peDumms[,c(1,20,22:27,29,31,32)] 
names(ped) 

# fit, and predict income of a 35-year-old man, MS degree, occupation 101,
# worked 50 weeks, using 25 nearest neighbors
kNN(ped[,-10],ped[,10],c(35,1,0,0,1,0,0,0,1,50),25)$regests
# [1] 67540

# fit, and predict occupation 101 for a 35-year-old man, MS degree, 
# wage $55K, worked 50 weeks, using 25 nearest neighbors
z <- kNN(ped[,-c(4:8)],ped[,4],c(35,1,0,1,55000,50),25,classif=TRUE)
z$regests
# [1] 0.16
z$ypreds
# [1] 0  class 0, i.e. not occupation 101; round(0.16) = 0, 
# computed by user request, classif = TRUE

# the y argument must be either a vector (2-class setting) or a matrix
# (multiclass setting)
occs <- as.matrix(ped[, 4:8])
z <- kNN(ped[,-c(4:8)],occs,c(35,1,0,1,72000,50),25,classif=TRUE)
z$ypreds
# [1] 3   occupation 3, i.e. 102, is predicted

# predict occupation in general; let's bring occ.141 back in (was
# excluded as a predictor due to redundancy)
names(peDumms)
#  [1] "age"     "cit.1"   "cit.2"   "cit.3"   "cit.4"   "cit.5"   "educ.1" 
#  [8] "educ.2"  "educ.3"  "educ.4"  "educ.5"  "educ.6"  "educ.7"  "educ.8" 
# [15] "educ.9"  "educ.10" "educ.11" "educ.12" "educ.13" "educ.14" "educ.15"
# [22] "educ.16" "occ.100" "occ.101" "occ.102" "occ.106" "occ.140" "occ.141"
# [29] "sex.1"   "sex.2"   "wageinc" "wkswrkd" "yrentry"
occs <- as.matrix(peDumms[,23:28])  
z <- kNN(ped[,-c(4:8)],occs,c(35,1,0,1,72000,50),25,classif=TRUE)
z$ypreds
# [1] 3   prediction is occ.102

# baseline for comparison: no weighting; use leave1out to avoid overfit
knnout <- kNN(ped[,-10],ped[,10],ped[,-10],25,leave1out=TRUE)
mean(abs(knnout$regests - ped[,10]))
# [1] 25341.6

# use of the weighted distance feature; deweight age by a factor of 0.5,
# put increased weight on weeks worked, factor of 1.5
knnout <- kNN(ped[,-10],ped[,10],ped[,-10],25,
   expandVars=c(1,10),expandVals=c(0.5,1.5),leave1out=TRUE)
mean(abs(knnout$regests - ped[,10]))
# [1] 25196.61




## End(Not run)

