# yai: Find K nearest neighbors In yaImpute: Nearest Neighbor Observation Imputation and Evaluation Tools

## Description

Given a set of observations, `yai`

1. separates the observations into reference and target observations,

2. applies the specified method to project the X-variables into a Euclidean space (not done for all methods; see argument `method`), and

3. finds the k-nearest neighbors within the reference observations and between the reference and target observations.

An alternative method using `randomForest` classification and regression trees is provided for steps 2 and 3. Target observations are those with values for the X-variables but not for the Y-variables, while reference observations are those with no missing values for the X- and Y-variables (see Details for the exception).
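For example, the following minimal sketch (adapted from the Examples section below) shows how the reference/target split arises from which rows have Y-values:

```
# Adapted from the Examples below: rows without Y-values become targets;
# rows with both X- and Y-values become references.
require(yaImpute)
data(iris)
refs <- sample(rownames(iris), 50)   # only these rows receive Y-values
x <- iris[, 1:2]                     # X-variables for all observations
y <- iris[refs, 3:4]                 # Y-variables for the references only
fit <- yai(x = x, y = y, k = 1)      # default method is "msn"
head(fit$neiIdsTrgs)                 # nearest reference for each target row
```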

## Usage

```
yai(x=NULL, y=NULL, data=NULL, k=1, noTrgs=FALSE, noRefs=FALSE,
    nVec=NULL, pVal=.05, method="msn", ann=TRUE, mtry=NULL, ntree=500,
    rfMode="buildClasses", bootstrap=FALSE, ppControl=NULL, sampleVars=NULL,
    rfXsubsets=NULL)
```

## Arguments

- `x`: 1) a matrix or data frame containing the X-variables for all observations, with row names serving as observation identifiers, or 2) a one-sided formula defining the X-variables as a linear formula. If a formula is coded for `x`, one must be used for `y` as well, if needed.
- `y`: 1) a matrix or data frame containing the Y-variables for the reference observations, or 2) a one-sided formula defining the Y-variables as a linear formula.
- `data`: when `x` and `y` are formulas, a data frame or matrix that contains all the variables, with row names serving as observation identifiers. The observations are split by `yai` into two sets.
- `k`: the number of nearest neighbors; default is 1.
- `noTrgs`: when TRUE, skip finding neighbors for target observations.
- `noRefs`: when TRUE, skip finding neighbors for reference observations.
- `nVec`: number of canonical vectors to use (methods `msn` and `msn2`), or number of independent X-variables in the reference data when method is `mahalanobis`. When NULL, the number is set by the function.
- `pVal`: significance level for canonical vectors, used when `method` is `msn` or `msn2`.
- `method`: the strategy used for computing distance and therefore for finding neighbors; the options are quoted keywords (see Details):
  - `euclidean`: distance is computed in a normalized X space.
  - `raw`: like `euclidean`, except no normalization is done.
  - `mahalanobis`: distance is computed in its namesake space.
  - `ica`: like `mahalanobis`, but based on Independent Component Analysis using package `fastICA`.
  - `msn`: distance is computed in a projected canonical space.
  - `msn2`: like `msn`, but with variance weighting (canonical regression rather than correlation).
  - `msnPP`: like `msn`, except that the canonical correlation is computed using projection pursuit from ccaPP (see argument `ppControl`).
  - `gnn`: distance is computed using a projected ordination of the Xs found using canonical correspondence analysis (`cca` from package vegan). If `cca` fails, `rda` is used and a warning is issued.
  - `randomForest`: distance is one minus the proportion of randomForest trees where a target observation is in the same terminal node as a reference observation (see `randomForest`).
  - `random`: like `raw` except that the X space is a single vector of uniform random [0,1] numbers generated using `runif`; results in random assignment of neighbors and forces `ann` to be FALSE.
  - `gower`: distance is computed in its namesake space using function `gower_topn` from package gower; forces `ann` to be FALSE.
- `ann`: TRUE if `ann` is used to find neighbors, FALSE if a slow search is used.
- `mtry`: the number of X-variables picked at random when method is `randomForest`, see `randomForest`; default is sqrt(number of X-variables).
- `ntree`: the number of classification and regression trees when method is `randomForest`. When more than one Y-variable is used, the trees are divided among the variables. Alternatively, `ntree` can be a vector of values corresponding to each Y-variable.
- `rfMode`: when `buildClasses` and method is `randomForest`, continuous variables are internally converted to classes, forcing randomForest to build classification trees for the variable. Otherwise, regression trees are built if your version of randomForest is newer than `4.5-18`.
- `bootstrap`: if `TRUE`, the reference observations are sampled with replacement.
- `ppControl`: used to control how canonical correlation analysis via projection pursuit is done; see Details.
- `sampleVars`: the X- and/or Y-variables will be sampled (without replacement) if this is not NULL and greater than zero. If specified as a single unnamed value, that value controls the sample size for both X- and Y-variables. If two unnamed values are given, the first is taken for X-variables and the second for Y-variables. If zero, no sampling is done. Otherwise, values less than 1.0 are taken as the proportion of the number of variables, and values greater than or equal to 1 are the number of variables to include in the sample. Specifying a large number will cause the sequence of variables to be randomized (see the short sketch after this list).
- `rfXsubsets`: a named list of character vectors, with one vector for each Y-variable (see Details); only applies when `method="randomForest"`.
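A few illustrative calls, assuming `x` and `y` are built as in the Examples section; the specific argument values below are arbitrary and chosen only to show the syntax:

```
# Illustrative calls only; x and y are assumed to exist (see the Examples).
fit.k3  <- yai(x = x, y = y, k = 3, method = "mahalanobis")      # three neighbors
fit.sv  <- yai(x = x, y = y, sampleVars = c(0.5, 2))             # 50% of X-variables, 2 Y-variables
fit.ann <- yai(x = x, y = y, method = "euclidean", ann = FALSE)  # slow (non-ann) search
```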

## Details

See the paper at http://www.jstatsoft.org/v23/i10 (it includes examples).

The following information is in addition to the content of the paper.

You need not have any Y-variables to run yai for the following methods: `euclidean`, `raw`, `mahalanobis`, `ica`, `random`, and `randomForest` (in which case unsupervised classification is performed). Normally, however, `yai` classifies reference observations as those with no missing values for X- and Y-variables, and target observations as those with values for the X-variables but missing data for the Y-variables. When Y is NULL (there are no Y-variables), all the observations are considered references. See `newtargets` for an example of how to use yai in this situation.
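A minimal sketch of this unsupervised use, assuming the iris data as in the Examples:

```
# No Y-variables: every observation is treated as a reference.
data(iris)
unsup <- yai(x = iris[, 1:4], method = "mahalanobis", k = 2)
head(unsup$neiIdsRefs)   # neighbors found among the references themselves
# newtargets() can later be used to find neighbors for new observations
```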

When `bootstrap=TRUE` the reference observations are sampled with replacement. The sample size is set to the number of reference observations. Normally, about a third of the reference observations are left out of the sample; they are often called out-of-bag samples. The out-of-bag observations are then treated as targets.
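For instance, a sketch using `x` and `y` as defined in the Examples:

```
# Out-of-bag references become the targets when bootstrap = TRUE.
boot <- yai(x = x, y = y, method = "msn", bootstrap = TRUE)
length(boot$bootstrap)   # row names of the bootstrap sample of references
length(boot$trgRows)     # roughly a third of the references are out-of-bag
```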

When `method="msnPP"` projection pursuit from ccaPP is used. The method is further controlled using argument `ppControl` to specify a character vector that has has two named components.

• `method`: one of `"spearman"`, `"kendall"`, `"quadrant"`, `"M"`, `"pearson"`; default is `"spearman"`.

• `search`: if `"data"` or `"proj"`, then `ccaProj` is used; otherwise the default `ccaGrid` is used.
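This mirrors the call in the Examples section:

```
# Projection pursuit canonical correlation with Kendall correlation and
# the projection-based search (requires package ccaPP).
if (require(ccaPP)) {
  msnPP <- yai(x = x, y = y, method = "msnPP",
               ppControl = c(method = "kendall", search = "proj"))
}
```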

Here are some details on argument `rfXsubsets`. When `method="randomForest"`, one call to `randomForest` is generated for each Y-variable. When argument `rfXsubsets` is left `NULL`, all the X-variables are used for each of the Y-variables. However, sometimes better results can be achieved by using specific subsets of X-variables for each Y-variable. This is done by setting `rfXsubsets` equal to a named list of character vectors. The names correspond to the Y-variable names and the character vectors hold the list of X-variables for the corresponding Y-variable.
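A hedged sketch of such a call; the variable names below are hypothetical placeholders and must match the column names of `y` and `x` in your data:

```
# Hypothetical names: y1, y2 are Y-variables; elev, slope, aspect, ndvi are
# X-variables. Each Y-variable is fit with its own subset of X-variables.
subsets <- list(y1 = c("elev", "slope"),
                y2 = c("elev", "aspect", "ndvi"))
rf <- yai(x = x, y = y, method = "randomForest", rfXsubsets = subsets)
```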

## Value

An object of class `yai`, which is a list with the following tags:

- `call`: the call.
- `yRefs, xRefs`: matrices of the Y- and X-variables for just the reference observations (unscaled). The scale factors are attached as attributes.
- `obsDropped`: a list of the row names for observations dropped for various reasons (missing data).
- `trgRows`: a list of the row names for target observations as a subset of all observations.
- `xall`: the X-variables for all observations.
- `cancor`: returned from the cancor function when method `msn` or `msn2` is used (NULL otherwise).
- `ccaVegan`: an object of class cca (from package vegan) when method `gnn` is used.
- `ftest`: a list containing partial F statistics and a vector of Pr>F (pgf) corresponding to the canonical correlation coefficients when method `msn` or `msn2` is used (NULL otherwise).
- `yScale, xScale`: scale data used on yRefs and xRefs as needed.
- `k`: the value of k.
- `pVal`: as input; only used when method `msn`, `msn2`, or `msnPP` is used.
- `projector`: NULL when not used. For methods `msn`, `msn2`, `msnPP`, `gnn`, and `mahalanobis`, this is a matrix that projects normalized X-variables into a space suitable for computing Euclidean distances.
- `nVec`: number of canonical vectors used (methods `msn` and `msn2`), or number of independent X-variables in the reference data when method `mahalanobis` is used.
- `method`: as input, the method used.
- `ranForest`: a list of the forests if method `randomForest` is used. There is one forest for each Y-variable, or just one forest when there are no Y-variables.
- `ICA`: a list of information from `fastICA` when method `ica` is used.
- `ann`: the value of ann, TRUE when `ann` is used, FALSE otherwise.
- `xlevels`: NULL if no factors are used as predictors; otherwise a list of predictors that have factors and their levels (see `lm`).
- `neiDstTrgs`: a matrix of distances between a target (identified by its row name) and the k references. There are k columns.
- `neiIdsTrgs`: a matrix of reference identifications that correspond to neiDstTrgs.
- `neiDstRefs, neiIdsRefs`: counterparts for references.
- `bootstrap`: a vector of reference row names that constitute the bootstrap sample, or the value `FALSE` when bootstrap is not used.
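For example, components of a fitted object (here called `fit`, built as in the sketch in the Description) can be inspected directly:

```
# Assuming fit <- yai(x = x, y = y) as above.
fit$method              # the distance method used
fit$k                   # number of neighbors requested
head(fit$neiDstTrgs)    # distances from each target to its k nearest references
head(fit$neiIdsTrgs)    # row names of those reference neighbors
```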

## Author(s)

Nicholas L. Crookston [email protected]
John Coulston [email protected]
Andrew O. Finley [email protected]

## See Also

`grmsd`, `ensembleImpute`
## Examples

```
require (yaImpute)

data(iris)

# set the random number seed so that example results are consistent
# normally, leave out this command
set.seed(12345)

# form some test data, y's are defined only for reference
# observations.
refs <- sample(rownames(iris),50)
x <- iris[,1:2]      # Sepal.Length Sepal.Width
y <- iris[refs,3:4]  # Petal.Length Petal.Width

# build yai objects using 2 methods
msn <- yai(x=x,y=y)
mal <- yai(x=x,y=y,method="mahalanobis")

# compare these results using the generalized mean distances. mal wins!
grmsd(mal,msn)

# use projection pursuit and specify ppControl (loads package ccaPP)
if (require(ccaPP)) {
  msnPP <- yai(x=x,y=y,method="msnPP",ppControl=c(method="kendall",search="proj"))
  grmsd(mal,msnPP,msn)
}

#############

data(MoscowMtStJoe)

# convert polar slope and aspect measurements to cartesian
# (which is the same as Stage's (1976) transformation).
polar <- MoscowMtStJoe[,40:41]
polar[,1] <- polar[,1]*.01      # slope proportion
polar[,2] <- polar[,2]*(pi/180) # aspect radians
cartesian <- t(apply(polar,1,function (x)
  {return (c(x[1]*cos(x[2]),x[1]*sin(x[2]))) }))
colnames(cartesian) <- c("xSlAsp","ySlAsp")

x <- cbind(MoscowMtStJoe[,37:39],cartesian,MoscowMtStJoe[,42:64])
y <- MoscowMtStJoe[,1:35]

msn <- yai(x=x, y=y, method="msn", k=1)
mal <- yai(x=x, y=y, method="mahalanobis", k=1)

# the results can be plotted.
plot(mal,vars=yvars(mal)[1:16])

# compare these results using the generalized mean distances..
grmsd(mal,msn)

# try method="gower"
if (require(gower)) {
  gow <- yai(x=x, y=y, method="gower", k=1)
  # compare these results using the generalized mean distances..
  grmsd(mal,msn,gow)
}

# try method="randomForest"
if (require(randomForest)) {
  # reduce the plant community data for randomForest.
  yba  <- MoscowMtStJoe[,1:17]
  ybaB <- whatsMax(yba,nbig=7)  # see help on whatsMax

  rf <- yai(x=x, y=ybaB, method="randomForest", k=1)

  # build the imputations for the original y's
  rforig <- impute(rf,ancillaryData=y)

  # compare the results using individual rmsd's
  compare.yai(mal,msn,rforig)
  plot(compare.yai(mal,msn,rforig))

  # build another randomForest case forcing regression
  # to be used for continuous variables. The answers differ
  # but one is not clearly better than the other.
  rf2 <- yai(x=x, y=ybaB, method="randomForest", rfMode="regression")
  rforig2 <- impute(rf2,ancillaryData=y)
  compare.yai(rforig2,rforig)
}
```