# knnest: Nonparametric Regression and Classification In matloff/regtools: Regression and Classification Tools

## Description

A full set of tools for k-NN regression and classification, usable both directly and as aids in assessing the fit of parametric models.

## Usage

```r
knnest(y,xdata,k,nearf=meany)
preprocessx(x,kmax,xval=FALSE)
meany(predpt,nearxy)
vary(predpt,nearxy)
loclin(predpt,nearxy)
## S3 method for class 'knn'
predict(object,...)
kmin(y,xdata,lossftn=l2,nk=5,nearf=meany)
parvsnonparplot(lmout,knnout,cex=1.0)
nonparvsxplot(knnout,lmout=NULL)
nonparvarplot(knnout,returnPts=FALSE)
l2(y,muhat)
l1(y,muhat)
```

## Arguments

- `y`: Response variable data in the training set. Vector or matrix, the latter for a vector-valued response, e.g. multiclass classification.
- `x`: X data, predictors, one row per data point, in the training set.
- `...`: Needed for consistency with the generic. See Details below for the arguments.
- `xdata`: X and associated neighbor indices. Output of `preprocessx`.
- `k`: Number of nearest neighbors.
- `object`: Output of `knnest`.
- `predpt`: One point on which to predict, as a vector.
- `nearxy`: A set of X neighbors of a point.
- `nearf`: Function to apply to the nearest neighbors of a point.
- `kmax`: Maximal number of nearest neighbors to find.
- `xval`: Cross-validation flag. If TRUE, the set of nearest neighbors of a point will not include the point itself.
- `lossftn`: Loss function to be used in cross-validation determination of the "best" `k`.
- `nk`: Number of values of `k` to try in cross-validation.
- `lmout`: Output of `lm`.
- `knnout`: Output of `knnest`.
- `cex`: R parameter to control dot size in plots.
- `muhat`: Vector of estimated regression function values.
- `returnPts`: If TRUE, return the matrix of plotted points.

## Details

The `knnest` function does k-nearest-neighbor regression function estimation, in any dimension, i.e. with any number of predictor variables, and with any number of response variables. This of course includes the classification case: a scalar Y coded 0,1 would represent two classes, with the regression function reducing to the conditional probability of class 1, given the predictors.
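To see why 0/1 coding turns regression into classification, here is a minimal base-R sketch (not the package's implementation, and the function name `knnProb1` is invented for illustration): averaging the 0/1 Y values of the k nearest neighbors yields an estimate of P(Y = 1 | X = x).

```r
# Minimal sketch (base R, not regtools): with a 0/1 response, the k-NN
# regression estimate at a point is just the proportion of 1s among the
# k nearest neighbors, i.e. the estimated P(Y = 1 | X = x).
set.seed(1)
x <- matrix(runif(200), ncol = 2)      # 100 training points, 2 predictors
y <- as.integer(x[,1] + x[,2] > 1)     # class 1 above the diagonal
knnProb1 <- function(newx, x, y, k) {
  d <- sqrt(colSums((t(x) - newx)^2))  # distances to all training rows
  mean(y[order(d)[1:k]])               # average of 0/1 Ys = est. P(class 1)
}
knnProb1(c(0.9, 0.9), x, y, k = 10)    # near 1: deep in the class-1 region
knnProb1(c(0.1, 0.1), x, y, k = 10)    # near 0
```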

The `preprocessx` function does the prep work. For each row in `x`, the code finds the `kmax` closest rows to that row. Separating this computation from `knnest` can save a lot of overall computing time: if, for instance, one wants to try the number of nearest neighbors `k` at 25, 50 and 100, one can call `preprocessx` once with `kmax` equal to 100 and then reuse the results, calling `knnest` for each value of `k` without calling `preprocessx` again. Setting `xval` to TRUE turns on cross-validation: the neighborhood of a point will not include the point itself. Note that this is set in `preprocessx`, not in `knnest`.

One can specify various types of smoothing by proper specification of the `nearf` function. The default is `meany`, specifying the standard averaging of the neighbor Y values. Another possible choice is `vary`, specifying calculation of the sample variance of those Y values; this is useful in assessing heteroscedasticity in a linear model.
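The idea behind `vary` can be sketched in base R (this is illustrative only, not the package's code; `varySketch` is an invented name): the conditional variance at a point is estimated by the sample variance of the Y values of its k nearest neighbors, so comparing estimates at different points hints at heteroscedasticity.

```r
# Sketch of the idea behind vary() (base R, not the package's code):
# estimate the conditional variance at predpt as the sample variance of
# the Y values of its k nearest training neighbors.
varySketch <- function(predpt, x, y, k) {
  d <- sqrt(colSums((t(x) - predpt)^2))  # distances to training rows
  var(y[order(d)[1:k]])                  # sample variance of neighbor Ys
}
set.seed(2)
x <- matrix(runif(400), ncol = 2)        # 200 points, 2 predictors
y <- rnorm(200, sd = 1 + 3 * x[,1])      # noise variance grows with x1
varySketch(c(0.9, 0.5), x, y, 30)        # much larger than the next value,
varySketch(c(0.1, 0.5), x, y, 30)        # suggesting heteroscedasticity
```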

Another choice is to specify local linear smoothing by setting `nearf` to `loclin`. Here the value of the regression function at a point is predicted from a linear fit to the point's neighbors. This may be especially helpful to counteract bias near the edges of the data. As in any regression fit, the number of predictors should be considerably less than the number of neighbors.
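Local linear smoothing at a single point can be sketched as follows in base R (a simplified illustration of the technique, not the package's `loclin` code; `loclinSketch` is an invented name): fit a least-squares plane to the k nearest neighbors, then evaluate it at the prediction point.

```r
# Sketch of local-linear smoothing (base R, not the package's loclin):
# fit a least-squares plane to the k nearest neighbors and evaluate it
# at the prediction point.
loclinSketch <- function(predpt, x, y, k) {
  d <- sqrt(colSums((t(x) - predpt)^2))  # distances to training rows
  idx <- order(d)[1:k]                   # the k nearest neighbors
  X <- cbind(1, x[idx, , drop = FALSE])  # design matrix with intercept
  beta <- qr.solve(X, y[idx])            # least-squares coefficients
  sum(c(1, predpt) * beta)               # evaluate the fit at predpt
}
set.seed(3)
x <- matrix(runif(100), ncol = 2)        # 50 points, 2 predictors
y <- 1 + 2 * x[,1] + 3 * x[,2]           # exactly linear, no noise
loclinSketch(c(0.5, 0.5), x, y, 10)      # recovers 1 + 1 + 1.5 = 3.5
```

Note the point made above: with 2 predictors plus an intercept, k = 10 neighbors comfortably exceeds the number of fitted coefficients.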

The X, i.e. predictor, data will be scaled by the code, so as to put all predictor variables on an equal footing. The scaling parameters will be recorded, and then applied later in prediction.
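The scale-then-reapply pattern looks like this in base R (an illustration of the general idea; regtools handles this internally):

```r
# Center and scale the training X, record the parameters, then apply
# the same transform to a new point so distances are comparable.
set.seed(4)
x <- matrix(rnorm(60, mean = 10, sd = 5), ncol = 2)
xs <- scale(x)                       # each column: mean 0, sd 1
ctr <- attr(xs, "scaled:center")     # recorded centering parameters
sdv <- attr(xs, "scaled:scale")      # recorded scaling parameters
newpt <- c(9, 11)
(newpt - ctr) / sdv                  # new point on the same footing
```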

The function `predict.knn` uses the output of `knnest` to do regression estimation or prediction on new points. Since the output of `knnest` is of class `'knn'`, one invokes this function with the simpler `predict`. The second argument is the set of new points, named `predpts` within the code. It is specified as a matrix if there is more than one prediction point and more than one predictor variable; otherwise, use a vector.

A "1-NN" method is used here: Given a new point u whose Y value we wish to predict, the code finds the single closest row in the training set, and returns the previously-estimated regression function value at that row. If u needs to be scaled, specify `TRUE` in the third argument of `predict`; otherwise specify FALSE.
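The 1-NN lookup step can be sketched in base R (not the package's code; `predict1NN` is an invented name): find the single closest training row and return the regression estimate already stored for it.

```r
# Sketch of the 1-NN prediction step (base R, not the package's code):
# return the stored regression estimate at the single closest training row.
predict1NN <- function(u, x, regest) {
  d <- sqrt(colSums((t(x) - u)^2))   # distances from u to each training row
  regest[which.min(d)]               # estimate at the closest row
}
x <- rbind(c(0, 0), c(1, 1), c(2, 2))
regest <- c(10, 20, 30)              # previously estimated values, per row
predict1NN(c(0.9, 1.1), x, regest)   # closest row is (1,1), so returns 20
```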

It can be shown that nearest-neighbor (or kernel) regression estimates are subject to substantial bias near the fringes of the data; the further away from the center of the data, the worse the bias. This can be mitigated by user specification that a local linear regression be applied, as follows: For each new point u to predict, the r closest X rows in the training set to u will be found, and a linear regression of the corresponding Y values against those X values will be computed. The result of that operation will be used to predict the Y value at the point u. The value of r is specified as the third argument in the call to `predict`; if left unspecified, the 1-NN method is used as described above, and it may be more accurate than the local-linear approach within the bulk of the data set.

The functions `ovaknntrn` and `ovaknnpred` are multiclass wrappers for `knnest` and `knnpred`. Here `y` is coded 0,1,...,`m`-1 for the `m` classes.

The tools here can be useful for assessing the fit of parametric models. The `parvsnonparplot` function plots the fitted values of a parametric model against the kNN fitted values; `nonparvsxplot` plots the k-NN fitted values against each predictor, one by one.

The functions `l2` and `l1` are used to define L2 and L1 loss.
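Plausible forms of these two loss functions are sketched below (an assumption for illustration; check the package source for the exact definitions, and note the `sketch` suffixes are invented names):

```r
# Likely shapes of the two loss functions (a sketch, not the package source):
l2sketch <- function(y, muhat) mean((y - muhat)^2)   # squared-error loss
l1sketch <- function(y, muhat) mean(abs(y - muhat))  # absolute-error loss
l2sketch(c(1, 2, 3), c(1, 2, 5))   # (0 + 0 + 4)/3
l1sketch(c(1, 2, 3), c(1, 2, 5))   # (0 + 0 + 2)/3
```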

## Value

The return value of `preprocessx` is an R list. Its `x` component is the scaled `x` matrix, with the scaling factors recorded in the `scaling` component. The `idxs` component contains the indices of the nearest neighbors of each point in the predictor data, stored in a matrix with `nrow(x)` rows and `kmax` columns. Row i contains the indices of the rows in `x` nearest to row i of `x`, the first index being that of the closest point, then the second-closest, and so on. If cross-validation is requested (`xval = TRUE`), a point will not be considered a neighbor of itself.

The `knnest` function returns an expanded version of `xdata`, with the expansion consisting of a new component `regest`, the estimated regression function values at the training set points.

The function `predict.knn` returns the predicted Y values at `predpts`. It is called simply via `predict`.

One can explore the effect of various numbers of nearest neighbors `k` through the `kmin` function. (This function should be considered experimental.) It will run `knnest` for the values of `k` specified via `nk`. If the latter is a number, the range from 0 to `xdata$kmax` will be divided into `nk` equal subintervals, and the values of `k` used will be the right endpoints of those subintervals. The function returns an R list, with the component `meanerrs` containing the cross-validated mean loss values and `ks` containing the corresponding values of `k`; `plot.knn` then plots the former against the latter.
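The k grid just described can be sketched in base R (illustrative only, assuming `kmax` of 100 and the default `nk` of 5):

```r
# Right endpoints of nk equal subintervals of (0, kmax]:
kmax <- 100
nk <- 5
ks <- round((1:nk) * kmax / nk)
ks   # 20 40 60 80 100
```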

## Author(s)

Norm Matloff

## Examples

```r
set.seed(9999)
x <- matrix(sample(1:100,30),ncol=3)
xd <- preprocessx(x[,1],2,TRUE)  # just 1 predictor
ko <- knnest(x[,2],xd,2)  # Y is x[,2]
ko$regest  # 1st element = 74.5
predict(ko,matrix(76),TRUE)  # 47.5
ko <- knnest(x[,-1],xd,2)  # Y bivar
ko$regest  # 1st row = (74.5,31.5)
predict(ko,matrix(76),TRUE)  # 47.5, 65.0

set.seed(9999)
xe <- matrix(rnorm(30000),ncol=3)
xe[,-3] <- xe[,-3] + 2
# xe is 2 predictors and epsilon
y <- xe %*% c(1,0.5,0.2)  # Y
x <- xe[,-3]  # X
xdata <- preprocessx(x,500)  # k as high as 500
zout <- knnest(y,xdata,200)
predict(zout,matrix(c(1,1),nrow=1),TRUE)  # about 1.55
predict(zout,rbind(c(1,1),c(2,1.2)),TRUE)  # about 1.55, 2.58
predict(zout,rbind(c(0,0)),TRUE)  # about 0.63

## Not run:
data(prgeng)
pe <- prgeng
# dummies for MS, PhD
pe$ms <- as.integer(pe$educ == 14)
pe$phd <- as.integer(pe$educ == 16)
# computer occupations only
pecs <- pe[pe$occ >= 100 & pe$occ <= 109,]
# for simplicity, let's choose a few predictors
pecs1 <- pecs[,c(1,7,9,12,13,8)]
# will predict wage income from age, gender etc.
# prepare nearest-neighbor data, k up to 50
xdata <- preprocessx(pecs1[,1:5],50)
zout <- knnest(pecs1[,6],xdata,50)  # k = 50
# find the est. mean income for 42-year-old women, 52 weeks worked, with
# a Master's
predict(zout,matrix(c(42,2,52,0,0),nrow=1),TRUE)  # 62106
# try k = 25; don't need to call preprocessx() again
zout <- knnest(pecs1[,6],xdata,25)
predict(zout,matrix(c(42,2,52,0,0),nrow=1),TRUE)  # 69104
# quite a difference; what k values are good?
kmout <- kmin(pecs1[,6],xdata)  # at least 50
# what about a man?
zout <- knnest(pecs1[,6],xdata,50)
predict(zout,matrix(c(42,1,52,0,0),nrow=1),TRUE)  # 78588
# form training and test sets, fit on the former and predict on the
# latter
fullidxs <- 1:nrow(pecs1)
train <- sample(fullidxs,10000)
xdata <- preprocessx(pecs1[train,1:5],50)
trainout <- knnest(pecs1[train,6],xdata,50)
testout <- predict(trainout,as.matrix(pecs1[-train,-6]),TRUE)
# find mean abs. prediction error (about $25K)
mean(abs(pecs1[-train,6] - testout))
# examples of fit assessment
# look for nonlinear relations between Y and each X
nonparvsxplot(zout)  # keep hitting Enter for next plot
# there seem to be quadratic relations with age and wkswrkd, so add quad
# terms and run lm()
pecs2 <- pecs1
pecs2$age2 <- pecs1$age^2
pecs2$wks2 <- pecs1$wkswrkd^2
lmout2 <- lm(wageinc ~ .,data=pecs2)
# check parametric fit by comparing to kNN
parvsnonparplot(lmout2,zout)
# linear model line somewhat faint, due to large n;
# parametric model seems to overpredict at high end;
# to deal with faintness, reduce size of points
parvsnonparplot(lmout2,zout,cex=0.1)
# assess homogeneity of conditional variance
nonparvarplot(zout)  # hockey stick!
## End(Not run)

# Y vector-valued (3 classes)
# 3 clusters, equal wts, coded 0,1,2
n <- 1500
# within-grp cov matrix
cv <- rbind(c(1,0.2),c(0.2,1))
xy <- NULL
for (i in 1:3)
   xy <- rbind(xy,rmvnorm(n,mean=rep(i*2.0,2),sigma=cv))
y <- rep(0:2,each=n)
xy <- cbind(xy,dummy(y))
xdata <- preprocessx(xy[,-(3:5)],20)  # X is xy[,1:2], k <= 20
ko <- knnest(xy[,3:5],xdata,20)
# find predicted Y for each data pt
mx <- apply(as.matrix(ko$regest),1,which.max) - 1
# overall correct classification rate
mean(mx == y)  # should be about 0.87
```

matloff/regtools documentation built on Aug. 26, 2019, 5:27 p.m.