ann: Approximate nearest neighbor search routines
In yaImpute: Nearest Neighbor Observation Imputation and Evaluation Tools

Description Usage Arguments Details Value Author(s) Examples

Given a set of reference data points S, ann constructs a kd-tree or box-decomposition tree (bd-tree) for efficient k-nearest neighbor searches.

1
2
3

ann(ref, target, k=1, eps=0.0, tree.type="kd",
    search.type="standard", bucket.size=1, split.rule="sl_midpt",
    shrink.rule="simple", verbose=TRUE, ...)

`ref`	an n x d matrix containing the reference point set S. Each row in `ref` corresponds to a point in d-dimensional space.
`target`	an m x d matrix containing the points for which k nearest neighbor reference points are sought.
`k`	defines the number of nearest neighbors to find. The default is k=1.
`eps`	the i-th nearest neighbor is at most (1+`eps`) from true i-th nearest neighbor, where `eps`>=0 . Specifically, the true (not squared) difference between the true i-th and the approximation of the i-th point is a factor of (1+`eps`). The default value of `eps`=0 is an exact search.
`tree.type`	the data structures kd-tree or bd-tree as quoted key words kd and bd, respectively. A brute force search can be specified with the quoted key word brute. If brute is specified, then all subsequent arguments are ignored. The default is the kd-tree.
`search.type`	either standard or priority search in the kd-tree or bd-tree, specified by quoted key words standard and priority, respectively. The default is the standard search.
`bucket.size`	the maximum number of reference points in the leaf nodes. The default is 1.
`split.rule`	is the strategy for the recursive splitting of those nodes with more points than the bucket size. The splitting rule applies to both the kd-tree and bd-tree. Splitting rule options are the quoted key words: `standard` - standard kd-tree `midpt` - midpoint `fair` - fair-split `sl\_midpt` - sliding-midpoint (default) `sl{fair` - fair-split rule See supporting documentation, reference below, for a thorough description and discussion of these splitting rules.
`shrink.rule`	applies only to the bd-tree and is an additional strategy (beyond the splitting rule) for the recursive partitioning of nodes. This argument is ignored if `tree.type` is specified as kd. Shrinking rule options are quoted key words: `none` - equivalent to the kd-tree `simple` - simple shrink (default) `centroid` - centroid shrink See supporting documentation, reference below, for a thorough description and discussion of these shrinking rules.
`verbose`	if true, search progress is printed to the screen.
`...`	currently no additional arguments.

The ann function calls portions of the Approximate Nearest Neighbor Library, written by David M. Mount. All of the ann function arguments are detailed in the ANN Programming Manual found at http://www.cs.umd.edu/~mount/ANN.

An object of class ann, which is a list with some or all of the following tags:

`knnIndexDist`	an m x 2k matrix. Each row corresponds to a target point in `target` and columns 1:k hold the `ref` matrix row indices of the nearest neighbors, such that column 1 index holds the `ref` matrix row index for the first nearest neighbor and column k is the k-th nearest neighbor index. Columns k+1:2k hold the Euclidean distance from the target to each of the k nearest neighbors indexed in columns 1:k.
`searchTime`	total search time, not including data structure construction, etc.
`k`	as defined in the `ann` function call.
`eps`	as defined in the `ann` function call.
`tree.type`	as defined in the `ann` function call.
`search.type`	as defined in the `ann` function call.
`bucket.size`	as defined in the `ann` function call.
`split.rule`	as defined in the `ann` function call.
`shrink.rule`	as defined in the `ann` function call.

Andrew O. Finley finleya@msu.edu

## Make a couple of bivariate normal classes
rmvn <- function(n, mu=0, V = matrix(1))
{
  p <- length(mu)
  if(any(is.na(match(dim(V),p))))
    stop("Dimension problem!")
  D <- chol(V)
  matrix(rnorm(n*p), ncol=p) %*% D + rep(mu,rep(n,p))
}

m <- 10000

## Class 1.
mu.1 <- c(20, 40)
V.1 <- matrix(c(-5,1,0,5),2,2); V.1 <- V.1%*%t(V.1)
c.1 <- cbind(rmvn(m, mu.1, V.1), rep(1, m))

## Class 2.
mu.2 <- c(30, 60)
V.2 <- matrix(c(4,2,0,2),2,2); V.2 <- V.2%*%t(V.2)
c.2 <- cbind(rmvn(m, mu.2, V.2), rep(2, m))

## Class 3.
mu.3 <- c(15, 60)
V.3 <- matrix(c(5,5,0,5),2,2); V.3 <- V.3%*%t(V.3)
c.3 <- cbind(rmvn(m, mu.3, V.3), rep(3, m))

c.all <- rbind(c.1, c.2, c.3)
max.x <- max(c.all[,1]); min.x <- min(c.all[,1])
max.y <- max(c.all[,2]); min.y <- min(c.all[,2])

## Check them out.
plot(c.1[,1], c.1[,2], xlim=c(min.x, max.x), ylim=c(min.y, max.y),
     pch=19, cex=0.5,
     col="blue", xlab="Variable 1", ylab="Variable 2")
points(c.2[,1], c.2[,2], pch=19, cex=0.5, col="green")
points(c.3[,1], c.3[,2], pch=19, cex=0.5, col="red")


## Take a reference sample.
n <- 2000
ref <- c.all[sample(1:nrow(c.all), n),]

## Compare search times
k <- 10
## Do a simple brute force search.
brute <- ann(ref=ref[,1:2], target=c.all[,1:2],
             tree.type="brute", k=k, verbose=FALSE)
print(brute$searchTime)

## Do an exact kd-tree search.
kd.exact <- ann(ref=ref[,1:2], target=c.all[,1:2],
                tree.type="kd", k=k, verbose=FALSE)
print(kd.exact$searchTime)

## Do an approximate kd-tree search.
kd.approx <- ann(ref=ref[,1:2], target=c.all[,1:2],
                 tree.type="kd", k=k, eps=100, verbose=FALSE)
print(kd.approx$searchTime)

## Takes too long to calculate for this many targets.
## Compare overall accuracy of the exact vs. approximate search
##knn.mode <- function(knn.indx, ref){
##  x <- ref[knn.indx,]
##  as.numeric(names(sort(as.matrix(table(x))[,1],
##                        decreasing=TRUE))[1])
##}