Compact multidimensional sorting and searching with kdtools

use_cached_data = TRUE
can_run = require(kdtools) &&
  require(ggplot2) &&
  require(tidytext) &&
  require(printr) &&
  require(scales) &&
  kdtools::has_cxx17()
bench_ntuples = 1e7
bench_ntrials = 21
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = can_run
)
if (can_run) theme_set(theme_classic())
if (has_cxx17()) {
  message("Required packages missing -- code will not be evaluated")
} else {
  message("kdtools needs C++17 for full functionality, code will not be evaluated")
}

Introduction

Sorting and searching are fundamental operations in computer and data science. The objective of the kdtools is to expose a C++ header file implementing efficient sorting and searching methods for multidimensional tuple-like data. The interface mirrors that of the C++ Standard Library sorting and searching functions. It is a single header-only library with no external dependencies other than std.

A common approach to improving search performance is to place data into an ordered tree structure, the most common being a binary tree and its variants. Computer textbooks and introductory courses are littered with examples of tree structures, however it has been pointed out by practitioners that many tree-type data structures have considerable storage overhead and poor memory locality. This results from using linked-list-like pointers to connect nodes. In many cases, a fully or partially sorted sequence combined with binary search will outperform pointer-based trees. Indeed, the Standard Library contains a set of algorithms for sorting or partitioning a data sequence and searching within partitioned sequences.

A limitation of the Standard Library algorithms and data structures is that they can only be ordered in a single dimension. Sort keys can possess higher cardinality however comparison operators and ordering are unidimensional. This is suboptimal for data that are intrinsically multidimensional, such as geographical coordinates or other multivariate data. In multiple dimensions, it is preferred to order a sequence of keys such the average distance between neighboring pairs in the sequence is small. There exist a vast array of techniques to accomplish sequence localization; however I will only discuss one known as the kd-tree.

A kd-tree is a type of binary tree that cycles among the dimensions of its stored objects at each level of the tree hierarchy. In the case of spatial data, the root node will contain the median key in the x-direction and each of its daughter nodes will contain the median in the y-direction of their respective partitions. Subsequent daughter nodes will again contain medians in the x-direction, and so on until there are no additional keys to the left or right. Searching a kd-tree involves recursive comparisons, cycling among the dimensions, until a leaf-node is encountered. Most searching operations can be accomplished in logarithmic time complexity yielding an efficient method.

Precisely as an ordinary binary tree can be replaced by a sorted range, a sequence of multidimensional elements can be sorted, in-place if desired, via recursive partitioning following the kd-tree scheme. Divide-and-conquer recursive partitioning is a well-known sorting algorithm and is the basis of the quicksort algorithm. Partitioning a sequence places all elements less than (or more generally meeting some predicate) to the left of a selected key, while all elements greater-than-or-equal-to follow to the right. The kd-sort algorithm presented here simply modifies quicksort to cycle among the dimensions of the stored elements. Searching follows exactly as the kd-tree except that nodes correspond to pivot-elements in the linear array.

Implementation of kd_sort enables a set of additional algorithms for searching and nearest neighbor queries. The following table gives the exposed functions and corresponding Standard Library routines.

Function templates | SL analog | Outcome -----------------|--------------|------------------------------------ kd_sort | sort | sorts range kd_is_sorted | is_sorted | returns true if sorted kd_lower_bound | lower_bound | finds first element not less than key for all dimensions kd_upper_bound | upper_bound | finds first element greater than key for all dimensions kd_binary_search | binary_search | returns true if key exists in range kd_equal_range | equal_range | returns an iterator pair spanning all occurrences of a key kd_range_query | lower_bound, upper_bound | finds all elements not less than lower and less than upper kd_nearest_neighbor | | finds nearest neighbor of key kd_nearest_neighbors, kd_nn_indices | | finds k-nearest neighbors of key lex_sort | | ordinary SL sort using kd_less for comparison

Sorting

The kd-sort algorithm in the kdtools package is implemented as a C++ function template parameterized by the starting dimension (usually 0) and the iterator type, which will be auto-deduced by the compiler.

```{Rcpp eval=FALSE} template void kd_sort(Iter first, Iter last) { using TupleType = iter_value_t; constexpr auto J = next_dim; if (distance(first, last) > 1) { auto pred = kd_less(); auto pivot = median_part(first, last, pred); kd_sort(next(pivot), last); kd_sort(first, pivot); } }

The ```median_part``` function is a thin wrapper around the [```nth_element```](http://en.cppreference.com/w/cpp/algorithm/nth_element) function from the standard library, which is used to find the median value in the current dimension. Elements to the left of the nth element are not greater than the nth element and elements to the right are not less than the nth element. This allows searching left or right depending on whether the sought value is not greater than the median or not less than the median (ties must search both directions).

Once the first dimension is partitioned, the ranges right and left of the pivot are partitioned, and so on until first and last span a single or no value. The dimension index ```I``` is incremented at compile-time using template metaprogramming. The ```value``` member of the templated struct ```next_dim``` holds the incremented value of ```I``` that has been wrapped to cycle through successive dimensions of ```TupleType```. The result of ```kd_sort``` is a sequence ordered as if it were inserted into a kd-tree data structure. The following figure illustrates the resulting order in 2-dimensions.

```r
nr = 5e3
x = matrix(runif(nr), nc = 2)
y = kd_sort(matrix_to_tuples(x))
z = as.data.frame(tuples_to_matrix(y))
z$i = 1:nrow(z)
names(z) = c("x", "y", "i")
ggplot(z, aes(x, y)) +
  geom_path(color = "darkgrey") +
  geom_point(size = 0.25) + 
  geom_point(aes(color = i), alpha = 0.15, size = 5) +
  scale_color_gradientn(colors = rainbow(100, s = 0.8, v = 0.8)) +
  guides(color = "none") +
  coord_fixed()

Here the sequence order is shown with a continuous color gradient. The resulting patchiness indicates a hierarchical tree-like partitioning of the space providing the desired locality. Placing nearby points in proximity is what allows for efficient searches once the data are sorted. Sort times for vectors of tuple-like objects are relatively efficient, taking roughly twice the time of an ordinary lexicographical sort.

```{Rcpp include=FALSE} // [[Rcpp::plugins(cpp17)]]

include

static std::chrono::time_point start, finish;

// [[Rcpp::export]] void start_timing() { start = std::chrono::high_resolution_clock::now(); }

// [[Rcpp::export]] void end_timing() { finish = std::chrono::high_resolution_clock::now(); }

// [[Rcpp::export]] double get_timing() { return std::chrono::duration(finish - start).count(); }

```r
cache_file = "../inst/extdata/sort_benchmark_data"
if (!file.exists(cache_file))
  cache_file = system.file("extdata/sort_benchmark_data", package = "kdtools")
if (use_cached_data && file.exists(cache_file)) {
  load(cache_file)
} else {
  reps = bench_ntrials
  ndim = seq(1, 9, 2)
  ntuples = round(seq(100, bench_ntuples, length.out = 5))
  what = factor(c("kd_sort", "kd_sort_threaded", "lex_sort"))
  res = expand.grid(ndim = ndim, ntuples = ntuples, what = what, time = 0)
  for (i in 1:nrow(res))
  {
    times = numeric(reps)
    for (j in seq_len(reps))
    {
      x = matrix(runif(prod(res[i, 1:2])), ncol = res[i, "ndim"])
      y = matrix_to_tuples(x)
      switch(as.character(res[i, "what"]),
             kd_sort = 
             {
                start_timing()
                kd_sort(y, inplace = TRUE, parallel = FALSE)
                end_timing()
             },
             kd_sort_threaded =
             {
                start_timing()
                kd_sort(y, inplace = TRUE, parallel = TRUE)
                end_timing()
             },
             lex_sort =
             {
               start_timing()
               lex_sort(y, inplace = TRUE)
               end_timing()
             })
      times[j] = get_timing()
    }
    res[i, "time"] = median(times)
  }
  save(res, file = cache_file)
}
ggplot(res) +
  geom_line(aes(x = ntuples, y = time, color = as.factor(ndim))) +
  scale_x_continuous("Number of Tuples",
      labels = trans_format("identity", scientific_format(digits = 1))) +
  facet_wrap(~what) + 
  theme(strip.background = element_blank(),
        legend.position = "bottom") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(color = "Number of dimensions",
       x = "Number of tuples",
       y = "Time (seconds)")

In the above results, lex_sort is the Standard Library built-in std::sort algorithm using kd_less as the predicate. The results indicate that kd_sort is reasonably fast, requiring roughly twice the time as the Standard Library built-in sorting algorithm. Because of the divide-and-conquer nature of kd_sort, it is trivially parallelizable. The threaded version is equal to or faster than the regular sort. Performance comparable to the build-in Standard Library sort function indicates that this reference implementation is reasonable.

Multidimensional comparisons

To ensure the correct behavior in the presence of multiple ties, or even when all keys in one dimension are equal, the kd_less predicate implements circular lexicographic comparison. If elements in one dimension are tied, the next dimension is interrogated to break the tie. The comparisons cycle from trailing dimensions to leading dimensions using modulo arithmetic. It is implemented as a recursive, templated function object with the dimension index incremented at compile-time. The recursion is terminated using constexpr if.

```{Rcpp eval=FALSE} template struct kd_less { template bool operator()(const T& lhs, const T& rhs) const { if constexpr (is_last) { return less_nth()(lhs, rhs); } else { constexpr auto J = next_dim; return equal_nth()(lhs, rhs) ? kd_less()(lhs, rhs) : less_nth()(lhs, rhs); } } };

The predicates ```equal_nth``` and ```less_nth``` compare the $I^{th}$ element of the tuple. They have specializations for handling pointers to tuple-like object, so all functions can operate on a range over pointers to tuples, allowing arbitrary indexing into the original data.

A more general function object template ```kd_compare``` that takes an arbitrary binary predicate is also provided by ```kdtools```. Some examples applying ```kd_less```, including degenerate, all-ties cases, are shown here.

```r
j = cbind(sample(1:9))
k = cbind(sample(1:9))
t(kd_sort(j))
t(kd_sort(cbind(j, k)))
t(kd_sort(cbind(0, j, k)))
t(kd_sort(cbind(j, k, 0)))
t(kd_sort(cbind(0, j, 0)))

Notice that kd_sort is a variant of the quicksort algorithm in one dimension. It also is not effected by all-ties. The order of subsequent dimensions can be changed when the leading dimension contains all ties, however the sequence is still sorted and can be used for searching.

Searching

Once data are ordered appropriately, searching can be achieved efficiently. The kdtools package provides analogs of the Standard Library lower_bound, upper_bound, equal_range and binary_search functions for querying sorted ranges. An additional function, kd_range_query, does efficient searching within a boxed region.

The kd_lower_bound, kd_upper_bound and kd_equal_range algorithms behave differently than their unidimensional analogs. This is because the only reasonable definition of multidimensional less is to require less in all dimensions simultaneously. As illustrated below, this leads to cases where some, but not all, elements are less to be included in the interval spanned by kd_lower_bound, kd_upper_bound and kd_equal_range. In other words, in a single dimension, these routines are both necessary and sufficient, whereas in multiple-dimensions, they are necessary, but not sufficient to uniquely identify elements in a boxed region.

In other schemes, such as Morton codes, elaborate subroutines are required to prune the extra elements falling outside the region of interest (see lower-left panel below). This is a result of using a fundamentally unidimensional search on multidimensional data. Instead kd_sort implements direct binary search for points falling within a boxed region.

```{Rcpp eval=FALSE} template void kd_range_query(Iter first, Iter last, const TupleType& lower, const TupleType& upper, OutIter outp) { if (distance(first, last) > 32) { auto pred = less_nth(); auto pivot = find_pivot(first, last); constexpr auto J = next_dim; if (within(pivot, lower, upper)) outp++ = pivot; if (!pred(pivot, lower)) // search left kd_range_query(first, pivot, lower, upper, outp); if (pred(*pivot, upper)) // search right kd_range_query(next(pivot), last, lower, upper, outp); } else { copy_if(first, last, outp, &{ return within(x, lower, upper); }); } return; }

Here, each pivot element is tested to see whether it falls within the boxed region, in which case, it is assigned to the output iterator. The template ```find_pivot``` searches for the partition point whose position may or may not be the middle of the sequence. Leaf tuples are also tested and copied to the output iterator. We check in each dimension whether there might be additional points in the region of interest to the left or right in the array. These are searched recursively until reaching the span falls below a threshold and is more profitably searched via sequential scan. In the following plot, the blue points indicate those not meeting the predicate and the aquamarine points are those where the predicate returns true.

```r
xy = matrix(runif(1e3), nc = 2)
p = c(0.33, 0.33, 0.66, 0.66)
xy = kd_sort(xy)
lb = kd_lower_bound(xy, p[1:2])
ub = kd_upper_bound(xy, p[3:4])
rq = kd_range_query(xy, p[1:2], p[3:4])
rq = kd_sort(rq)
i1 = 1:nrow(xy) >= lb
i2 = 1:nrow(xy) < ub
i3 = i1 & i2
i4 = sapply(1:nrow(xy), function(i) kd_binary_search(rq, xy[i, ]))
df = rbind(data.frame(x = xy[, 1], y = xy[, 2], i = i1, j = ">= lower bound"),
           data.frame(x = xy[, 1], y = xy[, 2], i = i2, j = "< upper bound"),
           data.frame(x = xy[, 1], y = xy[, 2], i = i3, j = ">= lower b. and < upper b."),
           data.frame(x = xy[, 1], y = xy[, 2], i = i4, j = "within(lower, upper)"))
ggplot(df, aes(x, y)) +
  geom_path(color = "darkgrey") +
  geom_point(size = 0.25) + 
  geom_point(aes(color = i), alpha = 0.15, size = 3) +
  scale_color_manual(values = c("steelblue", "aquamarine")) +
  scale_x_continuous(breaks = c(0.33, 0.66)) +
  scale_y_continuous(breaks = c(0.33, 0.66)) +
  geom_vline(xintercept = p[1], color = "red") + 
  geom_hline(yintercept = p[2], color = "red") +
  geom_vline(xintercept = p[3], color = "red") +
  geom_hline(yintercept = p[4], color = "red") +
  guides(color = "none") +
  coord_fixed() +
  facet_wrap(~j) + 
  theme(strip.background = element_blank())

Performance comparison to Boost Geometry

The Boost Geometry library is a highly sophisticated and performant framework for working with geometric features. It includes an R-Tree index that allows access to objects within a bounding region in logarithmic time. Here I compare the speed of building and querying a collection of points. The only purpose of this comparison is to look for large (i.e., 5 to 10 fold) differences, which would indicate a problem in the implementation. Small differences in measured times are probably not meaningful and will likely change on different systems. Reported results are the median of r bench_ntrials trials using r bench_ntuples tuples.

```{Rcpp include=FALSE} // [[Rcpp::depends(BH)]] // [[Rcpp::depends(kdtools)]] // [[Rcpp::plugins(cpp17)]]

define NO_TUPLEMAPR

include

include

include

include

include

include

include

include

using namespace keittlab;

using key_type = std::array; using box_type = std::pair; using range_type = std::vector;

BOOST_GEOMETRY_REGISTER_STD_ARRAY_CS(cs::cartesian) BOOST_GEOMETRY_REGISTER_BOX(box_type, key_type, first, second)

template double time_it(T x, U u, size_t n) { std::vector res; while (n--) { u(); auto start = std::chrono::high_resolution_clock::now(); x(); auto finish = std::chrono::high_resolution_clock::now(); res.push_back(std::chrono::duration(finish - start).count()); } auto med = kdtools::utils::middle_of(begin(res), end(res)); std::nth_element(begin(res), med, end(res)); return *med; }

// [[Rcpp::export]] Rcpp::List time_build(int n, int m) { Rcpp::CharacterVector test(6); Rcpp::NumericVector time(6);

range_type data(n);

auto prep_data = [&]{ std::generate(begin(data), end(data), []{ return key_type{{R::runif(0, 1), R::runif(0, 1)}}; }); };

auto do_kd_sort = [&]{ kdtools::kd_sort(begin(data), end(data)); };

auto t = time_it(do_kd_sort, prep_data, m); test[1] = "kd_sort inplace"; time[1] = t;

auto do_kd_sort_copy = [&]{ range_type x(data); kdtools::kd_sort(begin(x), end(x)); };

t = time_it(do_kd_sort_copy, prep_data, m); test[2] = "kd_sort w/ copy"; time[2] = t;

auto do_kd_sort_threaded = [&]{ kdtools::kd_sort_threaded(begin(data), end(data)); };

t = time_it(do_kd_sort_threaded, prep_data, m); test[0] = "kd_sort_threaded"; time[0] = t;

using rtree_type_1 = boost::geometry::index::rtree>;

auto do_rt_construct_1 = [&]{ rtree_type_1 x(data); };

t = time_it(do_rt_construct_1, prep_data, m); test[3] = "rtree pack linear"; time[3] = t;

using rtree_type_2 = boost::geometry::index::rtree>;

auto do_rt_construct_2 = [&]{ rtree_type_2 x(data); };

t = time_it(do_rt_construct_2, prep_data, m); test[4] = "rtree pack quadratic"; time[4] = t;

using rtree_type_3 = boost::geometry::index::rtree>;

auto do_rt_construct_3 = [&]{ rtree_type_3 x(data); };

t = time_it(do_rt_construct_3, prep_data, m); test[5] = "rtree pack rstar"; time[5] = t;

Rcpp::List out; out["Test"] = test; out["Time"] = time;

return out; }

// [[Rcpp::export]] Rcpp::List time_query(int n, int m) { Rcpp::CharacterVector test(4); Rcpp::NumericVector time(4);

range_type data(n), res; key_type p1{{0.1, 0.1}}, p2{{0.2, 0.2}};

auto prep_data = [&]{ std::generate(begin(data), end(data), []{ return key_type{{R::runif(0, 1), R::runif(0, 1)}}; }); kdtools::kd_sort_threaded(begin(data), end(data)); res.clear(); };

auto do_kd_range_query = [&]{ kdtools::kd_range_query(begin(data), end(data), p1, p2, back_inserter(res)); };

auto t = time_it(do_kd_range_query, prep_data, m); test[0] = "kd_range_query"; time[0] = t;

using rtree_type_1 = boost::geometry::index::rtree>;

rtree_type_1 rt1;

auto prep_rt_1 = [&]{ std::generate(begin(data), end(data), []{ return key_type{{R::runif(0, 1), R::runif(0, 1)}}; }); rt1.clear(); rt1.insert(begin(data), end(data)); res.clear(); };

const box_type box = std::make_pair(p1, p2);

auto do_rt_1_query = [&]{ rt1.query( boost::geometry::index::within(box), std::back_inserter(res)); };

t = time_it(do_rt_1_query, prep_rt_1, m); test[1] = "rtree query linear"; time[1] = t;

using rtree_type_2 = boost::geometry::index::rtree>;

rtree_type_2 rt2;

auto prep_rt_2 = [&]{ std::generate(begin(data), end(data), []{ return key_type{{R::runif(0, 1), R::runif(0, 1)}}; }); rt2.clear(); rt2.insert(begin(data), end(data)); res.clear(); };

auto do_rt_2_query = [&]{ rt2.query( boost::geometry::index::within(box), std::back_inserter(res)); };

t = time_it(do_rt_2_query, prep_rt_2, m); test[2] = "rtree query quadratic"; time[2] = t;

using rtree_type_3 = boost::geometry::index::rtree>;

rtree_type_3 rt3;

auto prep_rt_3 = [&]{ std::generate(begin(data), end(data), []{ return key_type{{R::runif(0, 1), R::runif(0, 1)}}; }); rt3.clear(); rt3.insert(begin(data), end(data)); res.clear(); };

auto do_rt_3_query = [&]{ rt3.query( boost::geometry::index::within(box), std::back_inserter(res)); };

t = time_it(do_rt_3_query, prep_rt_3, m); test[3] = "rtree query rstar"; time[3] = t;

Rcpp::List out; out["Test"] = test; out["Time"] = time;

return out; }

```r
cache_file = "../inst/extdata/query_benchmark_data"
if (!file.exists(cache_file))
  cache_file = system.file("extdata/query_benchmark_data", package = "kdtools")
if (use_cached_data && file.exists(cache_file)) {
  load(cache_file)
} else {
  build = as.data.frame(time_build(bench_ntuples, bench_ntrials))
  query = as.data.frame(time_query(bench_ntuples, bench_ntrials))
  build$Time = signif(build$Time, 3)
  query$Time = signif(query$Time, 3)
  build$Ratio = signif(build$Time / build$Time[1], 2)
  query$Ratio = signif(query$Time / query$Time[1], 2)
  save(build, query, file = cache_file)
}
build
query

The results indicate that sorting and searching with kdtools is at least as performant as using Boost Geometry. Generally, an R-Tree will have an advantage with many individual insertions and deletions, whereas a sorted range will be more compact and show equal, if not better, query performance.

Nearest-neighbor search

The kdtools package also provides routines for nearest- and k-nearest- neighbor search.

```{Rcpp include=FALSE} // [[Rcpp::depends(BH)]] // [[Rcpp::depends(kdtools)]] // [[Rcpp::plugins(cpp17)]]

define NO_TUPLEMAPR

include

include

include

include

include

include

include

include

using namespace keittlab;

using key_type = std::array; using box_type = std::pair; using range_type = std::vector;

BOOST_GEOMETRY_REGISTER_STD_ARRAY_CS(cs::cartesian) BOOST_GEOMETRY_REGISTER_BOX(box_type, key_type, first, second)

template double time_it(T x, U u, size_t n) { std::vector res; while (n--) { u(); auto start = std::chrono::high_resolution_clock::now(); x(); auto finish = std::chrono::high_resolution_clock::now(); res.push_back(std::chrono::duration(finish - start).count()); } auto med = kdtools::utils::middle_of(begin(res), end(res)); std::nth_element(begin(res), med, end(res)); return *med; }

// [[Rcpp::export]] Rcpp::List time_nn(int n, int m) { Rcpp::CharacterVector test(8); Rcpp::NumericVector time(8);

range_type data(n), res; key_type p{{0.1, 0.1}};

auto prep_data = [&]{ std::generate(begin(data), end(data), []{ return key_type{{R::runif(0, 1), R::runif(0, 1)}}; }); kdtools::kd_sort_threaded(begin(data), end(data)); res.clear(); };

auto do_nn = [&]{ kdtools::kd_nearest_neighbor(begin(data), end(data), p); };

auto t = time_it(do_nn, prep_data, m); test[0] = "kd_nearest_neighbor"; time[0] = t;

using rtree_type_1 = boost::geometry::index::rtree>;

rtree_type_1 rt1;

auto prep_rt_1 = [&]{ std::generate(begin(data), end(data), []{ return key_type{{R::runif(0, 1), R::runif(0, 1)}}; }); rt1.clear(); rt1.insert(begin(data), end(data)); res.clear(); };

auto do_rt_1_query = [&]{ rt1.query( boost::geometry::index::nearest(p, 1), std::back_inserter(res)); };

t = time_it(do_rt_1_query, prep_rt_1, m); test[1] = "rtree nearest neighbor linear"; time[1] = t;

using rtree_type_2 = boost::geometry::index::rtree>;

rtree_type_2 rt2;

auto prep_rt_2 = [&]{ std::generate(begin(data), end(data), []{ return key_type{{R::runif(0, 1), R::runif(0, 1)}}; }); rt2.clear(); rt2.insert(begin(data), end(data)); res.clear(); };

auto do_rt_2_query = [&]{ rt2.query( boost::geometry::index::nearest(p, 1), std::back_inserter(res)); };

t = time_it(do_rt_2_query, prep_rt_2, m); test[2] = "rtree nearest neighbor quadratic"; time[2] = t;

using rtree_type_3 = boost::geometry::index::rtree>;

rtree_type_3 rt3;

auto prep_rt_3 = [&]{ std::generate(begin(data), end(data), []{ return key_type{{R::runif(0, 1), R::runif(0, 1)}}; }); rt3.clear(); rt3.insert(begin(data), end(data)); res.clear(); };

auto do_rt_3_query = [&]{ rt3.query( boost::geometry::index::nearest(p, 1), std::back_inserter(res)); };

t = time_it(do_rt_3_query, prep_rt_3, m); test[3] = "rtree nearest neighbor rstar"; time[3] = t;

auto do_nns = [&]{ kdtools::kd_nearest_neighbors(begin(data), end(data), p, 100, back_inserter(res)); };

t = time_it(do_nns, prep_data, m); test[4] = "kd_nearest_neighbors 100"; time[4] = t;

auto do_rt_1_nn_query = [&]{ rt1.query( boost::geometry::index::nearest(p, 100), std::back_inserter(res)); };

t = time_it(do_rt_1_nn_query, prep_rt_1, m); test[5] = "rtree nearest 100 linear"; time[5] = t;

auto do_rt_2_nn_query = [&]{ rt2.query( boost::geometry::index::nearest(p, 100), std::back_inserter(res)); };

t = time_it(do_rt_2_nn_query, prep_rt_2, m); test[6] = "rtree nearest 100 quadratic"; time[6] = t;

auto do_rt_3_nn_query = [&]{ rt3.query( boost::geometry::index::nearest(p, 100), std::back_inserter(res)); };

t = time_it(do_rt_3_nn_query, prep_rt_3, m); test[7] = "rtree nearest 100 rstar"; time[7] = t;

Rcpp::List out; out["Test"] = test; out["Time"] = time;

return out; }

```r
cache_file = "../inst/extdata/nn_benchmark_data"
if (!file.exists(cache_file))
  cache_file = system.file("extdata/nn_benchmark_data", package = "kdtools")
if (use_cached_data && file.exists(cache_file)) {
  load(cache_file)
} else {
  res = as.data.frame(time_nn(bench_ntuples, bench_ntrials))
  res$Time = signif(res$Time, 3)
  res1 = res[1:4,]
  res2 = res[5:8,]
  rownames(res2) = NULL
  res1$Ratio = signif(res1$Time / res1$Time[1], 2)
  res2$Ratio = signif(res2$Time / res2$Time[1], 2)
  save(res1, res2, file = cache_file)
}
res1
res2

As before, the results show kdtools to be as fast or faster than Boost Geometry search. This is not surprising as the R-Tree data structure is optimized for indexing boxed regions, not points. The exception is rstar search for 100 nearest neighbors, which is faster. This might have to do with differences in the priority queue implementation or it could be that the rtree buckets allow rapid retrieval of groups of nearby points all at once. The main conclusion here is that the current implementation of kdtools is reasonably efficient.

Mixed-type searches

Notice that the only requirement of the algorithms is that the get<I> function return a type meeting the EqualityComparable and LessThanComparable C++ concepts. This means that different dimensions can contain different data types, including strings, for example. The current implementation uses std::get and so is restricted to std::pair, std::tuple and std::array, unless one chooses to specializes in the std namespace, which is generally not recommended. A suitable mechanism for providing type-specific specializations is to-be-implemented. The following example demonstrates sorting and searching a container holding keys of type std::tuple<double, std::string>.

header_code = '
#include <Rcpp.h>
using Rcpp::CharacterVector;
using Rcpp::NumericVector;
using Rcpp::Rcout;
using Rcpp::stop;

#include <chrono>

#include <utility>
using std::begin;
using std::end;

#include <tuple>

#include <vector>
using std::vector;

#include <string>
using std::string;

#define NO_TUPLEMAPR

#include <kdtools.h>
using namespace keittlab;
using kdtools::kd_sort;
using kdtools::kd_range_query;
using kdtools::kd_nearest_neighbors;

using key_type = std::tuple<double, string>;
using range_type = vector<key_type>;
using pointers_type = vector<key_type*>;

#define N 3

void print_range(const range_type& x)
{
  int n = x.size();
  if (n < 2 * N) stop("Not enough rows");
  for (int i = 0; i != N; ++i)
    Rcout << i << " "
          << std::get<0>(x[i]) << " " 
          << std::get<1>(x[i]) << std::endl;
  Rcout << "..." << std::endl;
  for (int i = n - N; i != n; ++i)
    Rcout << i << " "
          << std::get<0>(x[i]) << " " 
          << std::get<1>(x[i]) << std::endl;
}

void print_range(const pointers_type& x)
{
  int n = x.size();
  if (n < 2 * N) stop("Not enough rows");
  for (int i = 0; i != N; ++i)
    Rcout << i << " "
          << std::get<0>(*x[i]) << " " 
          << std::get<1>(*x[i]) << std::endl;
  Rcout << "..." << std::endl;
  for (int i = n - N; i != n; ++i)
    Rcout << i << " "
          << std::get<0>(*x[i]) << " " 
          << std::get<1>(*x[i]) << std::endl;
}

template <typename T>
std::chrono::duration<double> time_it(T x)
{
  auto start = std::chrono::high_resolution_clock::now();
  x();
  auto finish = std::chrono::high_resolution_clock::now();
  return finish - start;
}

struct make_key
{
  key_type operator()(const double a, const char* b){ return key_type(a, b); }
};

struct greater_key
{
  bool operator()(const double& lhs, const double& rhs)
  {
    return lhs > rhs;
  }
  bool operator()(const string& lhs, const string& rhs)
  {
    return lhs > rhs;
  }
};

template<class InputIt1, class InputIt2>
double set_similarity(InputIt1 first1, InputIt1 last1,
                      InputIt2 first2, InputIt2 last2)
{
  double num = 0, denom = 0;
  while (first1 != last1 && first2 != last2) {
    if (*first1 < *first2) {
      ++first1; ++denom;
    } else {
      if (!(*first2 < *first1)) {
        ++first1; ++num;
      }
      ++first2; ++denom;
    }
  }
  denom += std::distance(first1, last1) + std::distance(first2, last2);
  return num / denom;
}

namespace keittlab {
namespace kdtools {

template <>
double scalar_dist(const std::string& lhs, const std::string& rhs)
{
  std::string a(lhs), b(rhs);
  std::sort(begin(a), end(a));
  std::sort(begin(b), end(b));
  return 1 - set_similarity(begin(a), end(a),
                            begin(b), end(b));
}

} // namespace kdtools
} // namespace keittlab
'

header_file = "mixed_query.h"
cat(header_code, file = file.path(tempdir(), header_file))
pkg_cppflags = Sys.getenv("PKG_CPPFLAGS")
pkg_cppflags = paste(pkg_cppflags, paste0("-I\"", tempdir(), "\""))
Sys.setenv(PKG_CPPFLAGS = pkg_cppflags)
// [[Rcpp::depends(kdtools)]]
// [[Rcpp::plugins(cpp17)]]

#include "mixed_query.h"

// [[Rcpp::export]]
void mixed_query(NumericVector c1, CharacterVector c2)
{
  auto n = c1.size();

  range_type data;
  data.reserve(n);

  transform(begin(c1), end(c1), begin(c2), back_inserter(data), make_key());

  auto t = time_it([&]{

    kd_sort(begin(data), end(data));

  });
  Rcout << "\nSort time: " << t.count() << " seconds" << std::endl;

  Rcout << "\nFirst and last " << N << " elements of sorted data:\n" << std::endl;
  print_range(data);

  range_type result;

  auto u = time_it([&]{

    kd_range_query(begin(data), end(data),
                   key_type(0.4, "w"), key_type(0.6, "z"),
                   back_inserter(result));

  });

  Rcout << "\nQuery time: " << u.count() << " seconds" << std::endl;

  Rcout << "\nFirst and last " << N << " elements of query return:\n" << std::endl;
  kd_sort(begin(result), end(result)); print_range(result);

  result.clear();

  auto v = time_it([&]{

    kd_nearest_neighbors(begin(data), end(data),
                         key_type(0.5, "kdtools"), 100,
                         back_inserter(result));

  });

  Rcout << "\nNeighbors query time: " << v.count() << " seconds" << std::endl;

  Rcout << "\nFirst and last " << N << " elements of neighbors query return:\n" << std::endl;
  print_range(result);
}

First the data are sorted and then a range query is conducted. The range query selects tuples where the first element ranges from 0.4 to 0.6 and the second element begins with letters w through y. For the test, we borrow the words from the parts_of_speech dataset from the tidytext package. It contains more than $10^5$ words.

data("parts_of_speech")
numbers = signif(runif(nrow(parts_of_speech)), 4)
strings = sample(tolower(parts_of_speech[[1]]))
mixed_query(numbers, strings)

The timing results indicate that sorting $2 \times 10 ^ 5$ tuples is on the order of 1/10th second, whereas the range query returned in less than a millisecond.

Nearest neighbor searching requires a suitable distance function be defined for all types. The template scalar_dist can be specialized for different types. It returns the distance along a single dimension. Currently, nearest neighbor distances are hardwired as the L2 norm of the returned scalar differences computed for each dimension. The computation above uses the following definitions.

```{Rcpp eval=FALSE} template double set_similarity(InputIt1 first1, InputIt1 last1, InputIt2 first2, InputIt2 last2) { double num = 0, denom = 0; while (first1 != last1 && first2 != last2) { if (first1 < first2) { ++first1; ++denom; } else { if (!(first2 < first1)) { ++first1; ++num; } ++first2; ++denom; } } denom += std::distance(first1, last1) + std::distance(first2, last2); return num / denom; }

namespace kdtools {

template <> double scalar_dist(const std::string& lhs, const std::string& rhs) { std::string a(lhs), b(rhs); std::sort(begin(a), end(a)); std::sort(begin(b), end(b)); return 1 - set_similarity(begin(a), end(a), begin(b), end(b)); }

} // namespace kdtools

The ```set_similarity``` template is adapted from the Standard Library ```set_intersection``` template and produces a number between 0 and 1. I have not attempted to prove this distance is metric, so the results may not be entirely correct. Also, there is no facility currently available to weight distances in different dimensions or override the default root-sum-of-squares algorithm for the multivariate distance.

### Indexing with pointer arrays

Since version 0.4.0, ```kd_tools``` all predicates and distance functions have specializations or overloads to handle pointers to tuple-like objects. Hence, any of the routines can sort or search a vector of pointers to tuples, which act as indices into the original data.

```{Rcpp}
// [[Rcpp::depends(kdtools)]]
// [[Rcpp::plugins(cpp17)]]

#include "mixed_query.h"

// [[Rcpp::export]]
void sort_pointers(NumericVector c1, CharacterVector c2)
{
  auto n = c1.size();

  range_type data;
  data.reserve(n);

  transform(begin(c1), end(c1), begin(c2), back_inserter(data), make_key());

  pointers_type idx1, idx2;
  idx1.reserve(n); idx2.reserve(n);

  for (auto& x : data) {
    idx1.push_back(&x);
    idx2.push_back(&x);
  }

  kd_sort(begin(idx1), end(idx1));
  kd_sort(begin(idx2), end(idx2), greater_key());

  Rcout << "Original order\n\n"; print_range(data);
  Rcout << "\nOrder by less\n\n"; print_range(idx1);
  Rcout << "\nOrder by greater\n\n"; print_range(idx2);
}

Here, greater_key has overloads for both string and double types.

sort_pointers(numbers, strings)

Note that the forward and reverse sorting do not give exactly the same result (in reverse) owing to ties.

Conclusions

The kdtools package implements efficient sorting and searching of arbitrary tuple-like object in C++. One of the major design decisions was to make the number of dimensions fixed at compile time. This is a reasonable tradeoff as these methods break down in high dimensions where search times will be no better than that of brute-force random searching. Nonetheless, a runtime-dimensioned extension would be easy to add and could allow working with runtime-determined types as is common in R. The current implementation is a demonstration, however it does illustrate the value of the Standard Library as the algorithms are constructed from Standard Library components. The generic nature of the kdtools implementation is an attractive feature as it permits flexibility and can be adapted to arbitrary tuple-like data as needed. Modern development of C++ has emphasized tuple-style data and algorithms and kdtools extends these developments.



Try the kdtools package in your browser

Any scripts or data that you put into this service are public.

kdtools documentation built on Oct. 8, 2021, 9:07 a.m.