# lpm2_kdtree: Local Pivotal Method In SamplingBigData: Sampling Methods for Big Data

## Description

The local pivotal method provides a way to perform balanced sampling. This implementation replace linear searches in lpm2, with k-d trees. K-d trees are binary trees used to effectively search high dimensional spaces, and reduce the average computational complexity of lpm2 from O(N^2) to O(N log(N)). Both nearest neighbor and approximate nearest neighbor searching algorithms are provided.

## Usage

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13``` ``` lpm2_kdtree( prob, x, m=40, algorithm = "kdtree", maxCheck = 4, termDist = 0.1, inOrder = FALSE, resample = 1, probTree = FALSE, returnTree = FALSE, returnBounds = FALSE ) ```

## Arguments

 `prob` An array of length N such that the sum of prob is equal to the sample size, where the N is the number of rows of `x`. `x` A matrix of N rows and p columns, each row is assumed to be a sampling unit. `m` Max leaf size used as a terminal condition for building the k-d tree. When probTree is `FALSE`, `m` is the number of rows of `x` held within each leaf node. When probTree is `TRUE`, `m` is the sum of the probability held within each node. `algorithm` The algorithm used to search the k-d tree. The algorithms include "kdtree", "kdtree-count", and "kdtree-dist". The "kdtree" algorithm reproduces the lpm2 using a k-d tree for nearest neighbor search. "kdtree-count" and "kdtree-dist" use approximate nearest neighor searches based on number of nodes to check and minimal sufficient distance respectfully. `maxCheck` A positive integer scalar parameter only used when the algorithm "kdtree-count" is specified. This parameter is the maximum number of non-empty leaf nodes to check for a nearest neighbor. `termDist` A positive valued scalar parameter only used when the algorithm "kdtree-dist" is specified. This parameter specifies a minimal sufficient distance to be considered a nearest neighbor. No tie handling is performed; the first nearest neighbor found will be used. `inOrder` A boolean value, `TRUE` will return results in order of selection. `FALSE` will return in order of index. `resample` The number of samples to return. Resampling builds the k-d tree exactly once for all samples. Each sample will be a distinct column in a matrix. `probTree` A boolean value, `TRUE` will split the k-d tree based on the weighted median, using the values in `prob`. `returnTree` A boolean value, `TRUE` will return the node assigment. This assignment will be appended to a returned list.. `returnBounds` A boolean value, `TRUE` will return the bounds for each node assigment. These bounds will be appended to a returned list.

## Value

A vector of selected indexes from the matrix x. If using default values for `inOrder`, `resample`, and `algorithm`, the results identical to the lpm2 function when no ties exist in the distance function exist. `inOrder=TRUE` will return results in order of selection, and `resample > 1` will return a matrix with each set of samples returned as a column vector.

A list is returned includiing this vector if returnTree or returnBounds is set to TRUE.

Jonathan Lisic

## References

Lisic, L. and Cruze, N. (2016). Local Pivotal Methods for Large Surveys. In proceedings, ICES V, Geneva Switzerland 2016.

## Examples

 ```1 2 3 4 5 6 7 8 9``` ```N <- 1000 n <- 100 x <- cbind( runif(N), runif(N)) set.seed(100) Cprog <- proc.time() sampled <- lpm2_kdtree( rep(n/N,N), x) print("lpm2_kdtree running time") print(proc.time() - Cprog) ```

SamplingBigData documentation built on May 2, 2019, 1:44 p.m.