# API ---------------------------------------------------------------------
#' Build approximate nearest neighbors index and neighbor graph
#'
#' This function builds an approximate nearest neighbors graph with convenient
#' defaults, then prepares the index for querying new data, for later use with
#' [rnnd_query()]. For more control over the process, please see the other
#' functions in the package.
#'
#' The k-nearest neighbor graph is built using Random Projection
#' Forests (Dasgupta and Freund, 2008) for initialization and Nearest Neighbor
#' Descent (Dong and co-workers, 2011) for refinement. Index preparation uses
#' the graph diversification method of Harwood and Drummond (2016).
#'
#' @param data Matrix of `n` items to generate neighbors for, with observations
#' in the rows and features in the columns. Optionally, input can be passed
#' with observations in the columns, by setting `obs = "C"`, which should be
#' more efficient. Possible formats are [base::data.frame()], [base::matrix()]
#' or [Matrix::sparseMatrix()]. Sparse matrices should be in `dgCMatrix`
#' format. Dataframes will be converted to `numerical` matrix format
#' internally, so if your data columns are `logical` and intended to be used
#' with the specialized binary `metric`s, you should convert it to a logical
#' matrix first (otherwise you will get the slower dense numerical version).
#' @param k Number of nearest neighbors to build the index for. You can specify
#' a different number when running `rnnd_query`, but the index is calibrated
#' using this value so it's recommended to set `k` according to the likely
#' number of neighbors you will want to retrieve. Optional if `init` is
#' specified, in which case `k` can be inferred from the `init` data. If you
#' specify both, the explicitly supplied `k` takes precedence.
#' @param metric Type of distance calculation to use. One of:
#' - `"braycurtis"`
#' - `"canberra"`
#' - `"chebyshev"`
#' - `"correlation"` (1 minus the Pearson correlation)
#' - `"cosine"`
#' - `"dice"`
#' - `"euclidean"`
#' - `"hamming"`
#' - `"hellinger"`
#' - `"jaccard"`
#' - `"jensenshannon"`
#' - `"kulsinski"`
#' - `"sqeuclidean"` (squared Euclidean)
#' - `"manhattan"`
#' - `"rogerstanimoto"`
#' - `"russellrao"`
#' - `"sokalmichener"`
#' - `"sokalsneath"`
#' - `"spearmanr"` (1 minus the Spearman rank correlation)
#' - `"symmetrickl"` (symmetric Kullback-Leibler divergence)
#' - `"tsss"` (Triangle Area Similarity-Sector Area Similarity or TS-SS
#' metric)
#' - `"yule"`
#'
#' For non-sparse data, the following variants are available with
#' preprocessing: this trades memory for a potential speed up during the
#' distance calculation. Some minor numerical differences should be expected
#' compared to the non-preprocessed versions:
#' - `"cosine-preprocess"`: `cosine` with preprocessing.
#' - `"correlation-preprocess"`: `correlation` with preprocessing.
#'
#' For non-sparse binary data passed as a `logical` matrix, the following
#' metrics have specialized variants which should be substantially faster than
#' the non-binary variants (in other cases the logical data will be treated as
#' a dense numeric vector of 0s and 1s):
#' - `"dice"`
#' - `"hamming"`
#' - `"jaccard"`
#' - `"kulsinski"`
#' - `"matching"`
#' - `"rogerstanimoto"`
#' - `"russellrao"`
#' - `"sokalmichener"`
#' - `"sokalsneath"`
#' - `"yule"`
#' @param use_alt_metric If `TRUE`, use faster metrics that maintain the
#' ordering of distances internally (e.g. squared Euclidean distances if using
#' `metric = "euclidean"`), then apply a correction at the end. Probably
#' the only reason to set this to `FALSE` is if you suspect that some
#' sort of numeric issue is occurring with your data in the alternative code
#' path.
#' @param init Name of the initialization strategy or initial `data` neighbor
#' graph to optimize. One of:
#' - `"rand"` random initialization (the default).
#' - `"tree"` use the random projection tree method of Dasgupta and Freund
#' (2008).
#' - a pre-calculated neighbor graph. A list containing:
#' - `idx` an `n` by `k` matrix containing the nearest neighbor indices.
#' - `dist` (optional) an `n` by `k` matrix containing the nearest
#' neighbor distances. If the input distances are omitted, they will be
#' calculated for you.
#'
#' If `k` and `init` are specified as arguments to this function, and the
#' number of neighbors provided in `init` is not equal to `k` then:
#'
#' * if `k` is smaller, only the `k` closest values in `init` are retained.
#' * if `k` is larger, then random neighbors will be chosen to fill `init` to
#' the size of `k`. Note that there is no checking if any of the random
#' neighbors are duplicates of what is already in `init` so effectively fewer
#' than `k` neighbors may be chosen for some observations under these
#' circumstances.
#' @param n_trees The number of trees to use in the RP forest. A larger number
#' will give more accurate results at the cost of a longer computation time.
#' The default of `NULL` means that the number is chosen based on the number
#' of observations in `data`. Only used if `init = "tree"`.
#' @param leaf_size The maximum number of items that can appear in a leaf. This
#' value should be chosen to match the expected number of neighbors you will
#' want to retrieve when running queries (e.g. if you want to find 50 nearest
#' neighbors set `leaf_size = 50`) and should not be set to a value smaller
#' than `10`. Only used if `init = "tree"`.
#' @param max_tree_depth The maximum depth of the tree to build (default = 200).
#' If the maximum tree depth is exceeded then the leaf size of a tree may
#' exceed `leaf_size` which can result in a large number of neighbor distances
#' being calculated. If `verbose = TRUE` a message will be logged to indicate
#' that the leaf size is large. However, increasing the `max_tree_depth` may
#' not help: it may be that there is something unusual about the distribution
#' of your data set under your chosen `metric` that makes a tree-based
#' initialization inappropriate. Only used if `init = "tree"`.
#' @param margin A character string specifying the method used to assign points
#' to one side of the hyperplane or the other. Possible values are:
#' - `"explicit"` categorizes all distance metrics as either Euclidean or
#' Angular (Euclidean after normalization), explicitly calculates a hyperplane
#' and offset, and then calculates the margin based on the dot product with
#' the hyperplane.
#' - `"implicit"` calculates the distance from a point to each of the
#' points defining the normal vector. The margin is calculated by comparing the
#' two distances: the point is assigned to the side of the hyperplane that
#' the normal vector point with the closest distance belongs to.
#' - `"auto"` (the default) picks the margin method depending on whether a
#' binary-specific `metric` such as `"bhamming"` is chosen, in which case
#' `"implicit"` is used, and `"explicit"` otherwise: binary-specific metrics
#' involve storing the data in a way that isn't very efficient for the
#' `"explicit"` method, and the binary-specific metric is usually so much
#' faster than the generic equivalent that even the two distance calculations
#' needed by the `"implicit"` margin method are still faster overall.
#'
#' Only used if `init = "tree"`.
#' @param n_iters Number of iterations of nearest neighbor descent to carry out.
#' By default, this will be chosen based on the number of observations in
#' `data`.
#' @param max_candidates Maximum number of candidate neighbors to try for each
#' item in each iteration. Use relative to `k` to emulate the "rho"
#' sampling parameter in the nearest neighbor descent paper. By default, this
#' is set to `k` or `60`, whichever is smaller.
#' @param delta The minimum relative change in the neighbor graph allowed before
#' early stopping. Should be a value between 0 and 1. The smaller the value,
#' the less progress per iteration is needed for the optimization to continue.
#' The default value of `0.001` means that at least 0.1% of the neighbor graph
#' must be updated at each iteration.
#' @param low_memory If `TRUE`, use a lower memory, but more
#' computationally expensive approach to index construction. If set to
#' `FALSE`, you should see a noticeable speed improvement, especially when
#' using a smaller number of threads, so this is worth trying if you have the
#' memory to spare.
#' @param weight_by_degree If `TRUE`, then candidates for the local join are
#' weighted according to their in-degree, so that if there are more than
#' `max_candidates` in a candidate list, candidates with a smaller degree are
#' favored for retention. This prevents items with large numbers of edges
#' crowding out other items and for high-dimensional data is likely to provide
#' a small improvement in accuracy. Because this incurs a small extra cost of
#' counting the degree of each node, and because it tends to delay early
#' convergence, by default this is `FALSE`.
#' @param n_search_trees The number of trees to keep in the search forest as
#' part of index preparation. The default is `1`.
#' @param diversify_prob the degree of diversification of the search graph
#' by removing unnecessary edges through occlusion pruning. This should take a
#' value between `0` (no diversification) and `1` (remove as many edges as
#' possible) and is treated as the probability of a neighbor being removed if
#' it is found to be an "occlusion". If items `p` and `q`, two members of the
#' neighbor list of item `i`, are closer to each other than they are to `i`,
#' then the nearer neighbor `p` is said to "occlude" `q`. It is likely that
#' `q` will be in the neighbor list of `p` so there is no need to retain it in
#' the neighbor list of `i`. You may also set this to `NULL` to skip any
#' occlusion pruning. Note that occlusion pruning is carried out twice, once
#' to the forward neighbors, and once to the reverse neighbors.
#' @param pruning_degree_multiplier How strongly to truncate the final neighbor
#' list for each item. The neighbor list of each item will be truncated to
#' retain only the closest `d` neighbors, where
#' `d = k * pruning_degree_multiplier`, and `k` is the
#' original number of neighbors per item in the graph. Roughly, values
#' larger than `1` will keep all the nearest neighbors of an item, plus
#' the given fraction of reverse neighbors (if they exist). For example,
#' setting this to `1.5` will keep all the forward neighbors and then
#' half as many of the reverse neighbors, although exactly which neighbors are
#' retained is also dependent on any occlusion pruning that occurs. Set this
#' to `NULL` to skip this step.
#' @param prune_reverse If `TRUE`, prune the reverse neighbors of each item
#' before the reverse graph diversification step using
#' `pruning_degree_multiplier`. Because the number of reverse neighbors can be
#' much larger than the number of forward neighbors, this can help to avoid
#' excessive computation during the diversification step, with little overall
#' effect on the final search graph. Default is `FALSE`.
#' @param n_threads Number of threads to use.
#' @param verbose If `TRUE`, log information to the console.
#' @param progress Determines the type of progress information logged during the
#' nearest neighbor descent stage when `verbose = TRUE`. Options are:
#' * `"bar"`: a simple text progress bar.
#' * `"dist"`: the sum of the distances in the approximate knn graph at the
#' end of each iteration.
#' @param obs set to `"C"` to indicate that the input `data` orientation stores
#' each observation as a column. The default `"R"` means that observations are
#' stored in each row. Storing the data by row is usually more convenient, but
#' internally your data will be converted to column storage. Passing it
#' already column-oriented will save some memory and (a small amount of) CPU
#' usage.
#' @return the approximate nearest neighbor index, a list containing:
#' * `graph` the k-nearest neighbor graph, a list containing:
#' * `idx` an n by k matrix containing the nearest neighbor indices.
#' * `dist` an n by k matrix containing the nearest neighbor distances.
#' * Other list items are intended only for internal use by other functions
#' such as [rnnd_query()].
#' @seealso [rnnd_query()]
#' @examples
#' iris_even <- iris[seq_len(nrow(iris)) %% 2 == 0, ]
#' iris_odd <- iris[seq_len(nrow(iris)) %% 2 == 1, ]
#'
#' # Find 4 (approximate) nearest neighbors using Euclidean distance
#' iris_even_index <- rnnd_build(iris_even, k = 4)
#' iris_odd_nbrs <- rnnd_query(index = iris_even_index, query = iris_odd, k = 4)
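#'
#' # A sketch of tuning index preparation for the same data: a smaller
#' # pruning_degree_multiplier prunes the search graph more aggressively
#' # (smaller index, possibly lower query accuracy), and prune_reverse can
#' # reduce the cost of the diversification step
#' iris_small_index <- rnnd_build(iris_even,
#'   k = 4,
#'   pruning_degree_multiplier = 1.0,
#'   prune_reverse = TRUE
#' )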
#'
#' @references
#' Dasgupta, S., & Freund, Y. (2008, May).
#' Random projection trees and low dimensional manifolds.
#' In *Proceedings of the fortieth annual ACM symposium on Theory of computing*
#' (pp. 537-546).
#' \doi{10.1145/1374376.1374452}.
#'
#' Dong, W., Moses, C., & Li, K. (2011, March).
#' Efficient k-nearest neighbor graph construction for generic similarity measures.
#' In *Proceedings of the 20th international conference on World Wide Web*
#' (pp. 577-586).
#' ACM.
#' \doi{10.1145/1963405.1963487}.
#'
#' Harwood, B., & Drummond, T. (2016).
#' FANNG: Fast approximate nearest neighbour graphs.
#' In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*
#' (pp. 5713-5722).
#' @export
rnnd_build <- function(data,
k = 30,
metric = "euclidean",
use_alt_metric = TRUE,
init = "tree",
n_trees = NULL,
leaf_size = NULL,
max_tree_depth = 200,
margin = "auto",
n_iters = NULL,
delta = 0.001,
max_candidates = NULL,
low_memory = TRUE,
weight_by_degree = FALSE,
n_search_trees = 1,
pruning_degree_multiplier = 1.5,
diversify_prob = 1.0,
prune_reverse = FALSE,
n_threads = 0,
verbose = FALSE,
progress = "bar",
obs = "R") {
data <- x2m(data)
if (obs == "R") {
data <- Matrix::t(data)
}
index <- nnd_knn(
data,
k = k,
metric = metric,
use_alt_metric = use_alt_metric,
init = init,
init_args = list(
n_trees = n_trees,
leaf_size = leaf_size,
max_tree_depth = max_tree_depth,
margin = margin
),
ret_forest = TRUE,
n_iters = n_iters,
delta = delta,
max_candidates = max_candidates,
low_memory = low_memory,
weight_by_degree = weight_by_degree,
n_threads = n_threads,
verbose = verbose,
progress = progress,
obs = "C"
)
index <- list(
graph = list(idx = index$idx, dist = index$dist),
forest = index$forest
)
index$prep <- list(
pruning_degree_multiplier = pruning_degree_multiplier,
diversify_prob = diversify_prob,
n_search_trees = n_search_trees,
prune_reverse = prune_reverse,
is_prepared = FALSE
)
index$data <- data
index$original_metric <- metric
index$use_alt_metric <- use_alt_metric
if (!is.null(index$forest)) {
index$search_forest <-
rpf_filter(
nn = index$graph,
forest = index$forest,
n_trees = index$prep$n_search_trees,
n_threads = n_threads,
verbose = verbose
)
}
search_graph <- prepare_search_graph(
data = index$data,
graph = index$graph,
metric = index$original_metric,
use_alt_metric = index$use_alt_metric,
diversify_prob = index$prep$diversify_prob,
pruning_degree_multiplier = index$prep$pruning_degree_multiplier,
prune_reverse = index$prep$prune_reverse,
n_threads = n_threads,
verbose = verbose,
obs = "C"
)
index$prep$is_prepared <- TRUE
index$search_graph <- search_graph
# RP Forests can be large so after preparation we delete this
index$forest <- NULL
index
}
#' Query an index for approximate nearest neighbors
#'
#' Takes a nearest neighbor index produced by [rnnd_build()] and uses it to
#' find the nearest neighbors of a query set of observations, using a
#' back-tracking search with the search size determined by the method of
#' Iwasaki and Miyazaki (2018). For further control over the search effort, the
#' total number of distance calculations can also be bounded, similar to the
#' method of Harwood and Drummond (2016).
#'
#' @param index A nearest neighbor index produced by [rnnd_build()].
#' @param query Matrix of `n` query items, with observations in the rows and
#' features in the columns. Optionally, the data may be passed with the
#' observations in the columns, by setting `obs = "C"`, which should be more
#' efficient. Possible formats are [base::data.frame()], [base::matrix()]
#' or [Matrix::sparseMatrix()]. Sparse matrices should be in `dgCMatrix`
#' format. Dataframes will be converted to `numerical` matrix format
#' internally, so if your data columns are `logical` and intended to be used
#' with the specialized binary `metric`s, you should convert it to a logical
#' matrix first (otherwise you will get the slower dense numerical version).
#' Sparse and non-sparse data cannot be mixed, so if the data used to build
#' the index was sparse, the `query` data must also be sparse, and vice versa.
#' @param k Number of nearest neighbors to return.
#' @param epsilon Controls trade-off between accuracy and search cost, as
#' described by Iwasaki and Miyazaki (2018). Setting `epsilon` to a positive
#' value specifies a distance tolerance on whether to explore the neighbors of
#' candidate points. The larger the value, the more neighbors will be
#' searched. A value of 0.1 allows query-candidate distances to be 10% larger
#' than the current most-distant neighbor of the query point, 0.2 means 20%,
#' and so on. Suggested values are between 0 and 0.5, although this value is
#' highly dependent on the distribution of distances in the dataset (higher
#' dimensional data should choose a smaller cutoff). Too large a value of
#' `epsilon` will result in the query search approaching brute force
#' comparison. Use this parameter in conjunction with `max_search_fraction` to
#' prevent excessive run time. Default is 0.1. If you set `verbose = TRUE`,
#' statistics of the number of distance calculations will be logged which can
#' help you tune `epsilon`.
#' @param max_search_fraction Maximum fraction of the reference data to search.
#' This is a value between 0 (search none of the reference data) and 1 (search
#' all of the data if necessary). This works in conjunction with `epsilon` and
#' will terminate the search early if the specified fraction of the reference
#' data has been searched. Default is 1.
#' @param n_threads Number of threads to use.
#' @param init An optional matrix of `k` initial nearest neighbors for each
#' query point.
#' @param verbose If `TRUE`, log information to the console.
#' @param obs set to `"C"` to indicate that the input `data` orientation stores
#' each observation as a column. The default `"R"` means that observations are
#' stored in each row. Storing the data by row is usually more convenient, but
#' internally your data will be converted to column storage. Passing it
#' already column-oriented will save some memory and (a small amount of) CPU
#' usage.
#' @return the approximate nearest neighbor graph, a list containing:
#' * `idx` an n by k matrix containing the nearest neighbor indices.
#' * `dist` an n by k matrix containing the nearest neighbor distances.
#' @seealso [rnnd_build()]
#' @examples
#' iris_even <- iris[seq_len(nrow(iris)) %% 2 == 0, ]
#' iris_odd <- iris[seq_len(nrow(iris)) %% 2 == 1, ]
#'
#' iris_even_index <- rnnd_build(iris_even, k = 4)
#' iris_odd_nbrs <- rnnd_query(index = iris_even_index, query = iris_odd, k = 4)
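#'
#' # A sketch of trading accuracy for query speed: a larger epsilon explores
#' # more candidate neighbors, while max_search_fraction caps how much of the
#' # reference data is searched
#' iris_odd_nbrs_wide <- rnnd_query(
#'   index = iris_even_index,
#'   query = iris_odd,
#'   k = 4,
#'   epsilon = 0.2,
#'   max_search_fraction = 0.5
#' )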
#'
#' @references
#' Harwood, B., & Drummond, T. (2016).
#' FANNG: Fast approximate nearest neighbour graphs.
#' In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*
#' (pp. 5713-5722).
#'
#' Iwasaki, M., & Miyazaki, D. (2018).
#' Optimization of indexing based on k-nearest neighbor graph for proximity search in high-dimensional data.
#' *arXiv preprint* *arXiv:1810.07355*.
#' <https://arxiv.org/abs/1810.07355>
#' @export
rnnd_query <-
function(index,
query,
k = 30,
epsilon = 0.1,
max_search_fraction = 1,
init = NULL,
n_threads = 0,
verbose = FALSE,
obs = "R") {
if (is.null(init) && !is.null(index$search_forest)) {
init <- index$search_forest
}
query <- x2m(query)
if (obs == "R") {
query <- Matrix::t(query)
}
res <- graph_knn_query(
query = query,
reference = index$data,
reference_graph = index$search_graph,
k = k,
metric = index$original_metric,
init = init,
epsilon = epsilon,
max_search_fraction = max_search_fraction,
use_alt_metric = index$use_alt_metric,
n_threads = n_threads,
verbose = verbose,
obs = "C"
)
res
}
#' Find approximate nearest neighbors
#'
#' This function builds an approximate nearest neighbors graph of the provided
#' data using convenient defaults. It does not return an index for later
#' querying, which speeds up graph construction and reduces the size and
#' complexity of the return value.
#'
#' The k-nearest neighbor graph is built using Random Projection
#' Forests (Dasgupta and Freund, 2008) for initialization and Nearest Neighbor
#' Descent (Dong and co-workers, 2011) for refinement. If you are sure you will
#' not want to query new data then compared to [rnnd_build()] this function has
#' the advantage of not storing the index, which can be very large.
#'
#' @param data Matrix of `n` items to generate neighbors for, with observations
#' in the rows and features in the columns. Optionally, input can be passed
#' with observations in the columns, by setting `obs = "C"`, which should be
#' more efficient. Possible formats are [base::data.frame()], [base::matrix()]
#' or [Matrix::sparseMatrix()]. Sparse matrices should be in `dgCMatrix`
#' format. Dataframes will be converted to `numerical` matrix format
#' internally, so if your data columns are `logical` and intended to be used
#' with the specialized binary `metric`s, you should convert it to a logical
#' matrix first (otherwise you will get the slower dense numerical version).
#' @param k Number of nearest neighbors to return. Optional if `init` is
#' specified.
#' @param metric Type of distance calculation to use. One of:
#' - `"braycurtis"`
#' - `"canberra"`
#' - `"chebyshev"`
#' - `"correlation"` (1 minus the Pearson correlation)
#' - `"cosine"`
#' - `"dice"`
#' - `"euclidean"`
#' - `"hamming"`
#' - `"hellinger"`
#' - `"jaccard"`
#' - `"jensenshannon"`
#' - `"kulsinski"`
#' - `"sqeuclidean"` (squared Euclidean)
#' - `"manhattan"`
#' - `"rogerstanimoto"`
#' - `"russellrao"`
#' - `"sokalmichener"`
#' - `"sokalsneath"`
#' - `"spearmanr"` (1 minus the Spearman rank correlation)
#' - `"symmetrickl"` (symmetric Kullback-Leibler divergence)
#' - `"tsss"` (Triangle Area Similarity-Sector Area Similarity or TS-SS
#' metric)
#' - `"yule"`
#'
#' For non-sparse data, the following variants are available with
#' preprocessing: this trades memory for a potential speed up during the
#' distance calculation. Some minor numerical differences should be expected
#' compared to the non-preprocessed versions:
#' - `"cosine-preprocess"`: `cosine` with preprocessing.
#' - `"correlation-preprocess"`: `correlation` with preprocessing.
#'
#' For non-sparse binary data passed as a `logical` matrix, the following
#' metrics have specialized variants which should be substantially faster than
#' the non-binary variants (in other cases the logical data will be treated as
#' a dense numeric vector of 0s and 1s):
#' - `"dice"`
#' - `"hamming"`
#' - `"jaccard"`
#' - `"kulsinski"`
#' - `"matching"`
#' - `"rogerstanimoto"`
#' - `"russellrao"`
#' - `"sokalmichener"`
#' - `"sokalsneath"`
#' - `"yule"`
#' @param use_alt_metric If `TRUE`, use faster metrics that maintain the
#' ordering of distances internally (e.g. squared Euclidean distances if using
#' `metric = "euclidean"`), then apply a correction at the end. Probably
#' the only reason to set this to `FALSE` is if you suspect that some
#' sort of numeric issue is occurring with your data in the alternative code
#' path.
#' @param init Name of the initialization strategy or initial `data` neighbor
#' graph to optimize. One of:
#' - `"rand"` random initialization (the default).
#' - `"tree"` use the random projection tree method of Dasgupta and Freund
#' (2008).
#' - a pre-calculated neighbor graph. A list containing:
#' - `idx` an `n` by `k` matrix containing the nearest neighbor indices.
#' - `dist` (optional) an `n` by `k` matrix containing the nearest
#' neighbor distances. If the input distances are omitted, they will be
#' calculated for you.
#'
#' If `k` and `init` are specified as arguments to this function, and the
#' number of neighbors provided in `init` is not equal to `k` then:
#'
#' * if `k` is smaller, only the `k` closest values in `init` are retained.
#' * if `k` is larger, then random neighbors will be chosen to fill `init` to
#' the size of `k`. Note that there is no checking if any of the random
#' neighbors are duplicates of what is already in `init` so effectively fewer
#' than `k` neighbors may be chosen for some observations under these
#' circumstances.
#' @param n_trees The number of trees to use in the RP forest. A larger number
#' will give more accurate results at the cost of a longer computation time.
#' The default of `NULL` means that the number is chosen based on the number
#' of observations in `data`. Only used if `init = "tree"`.
#' @param leaf_size The maximum number of items that can appear in a leaf. This
#' value should be chosen to match the expected number of neighbors you will
#' want to retrieve when running queries (e.g. if you want to find 50 nearest
#' neighbors set `leaf_size = 50`) and should not be set to a value smaller
#' than `10`. Only used if `init = "tree"`.
#' @param max_tree_depth The maximum depth of the tree to build (default = 200).
#' If the maximum tree depth is exceeded then the leaf size of a tree may
#' exceed `leaf_size` which can result in a large number of neighbor distances
#' being calculated. If `verbose = TRUE` a message will be logged to indicate
#' that the leaf size is large. However, increasing the `max_tree_depth` may
#' not help: it may be that there is something unusual about the distribution
#' of your data set under your chosen `metric` that makes a tree-based
#' initialization inappropriate. Only used if `init = "tree"`.
#' @param margin A character string specifying the method used to assign points
#' to one side of the hyperplane or the other. Possible values are:
#' - `"explicit"` categorizes all distance metrics as either Euclidean or
#' Angular (Euclidean after normalization), explicitly calculates a hyperplane
#' and offset, and then calculates the margin based on the dot product with
#' the hyperplane.
#' - `"implicit"` calculates the distance from a point to each of the
#' points defining the normal vector. The margin is calculated by comparing the
#' two distances: the point is assigned to the side of the hyperplane that
#' the normal vector point with the closest distance belongs to.
#' - `"auto"` (the default) picks the margin method depending on whether a
#' binary-specific `metric` such as `"bhamming"` is chosen, in which case
#' `"implicit"` is used, and `"explicit"` otherwise: binary-specific metrics
#' involve storing the data in a way that isn't very efficient for the
#' `"explicit"` method, and the binary-specific metric is usually so much
#' faster than the generic equivalent that even the two distance calculations
#' needed by the `"implicit"` margin method are still faster overall.
#'
#' Only used if `init = "tree"`.
#' @param n_iters Number of iterations of nearest neighbor descent to carry out.
#' By default, this will be chosen based on the number of observations in
#' `data`.
#' @param max_candidates Maximum number of candidate neighbors to try for each
#' item in each iteration. Use relative to `k` to emulate the "rho"
#' sampling parameter in the nearest neighbor descent paper. By default, this
#' is set to `k` or `60`, whichever is smaller.
#' @param delta The minimum relative change in the neighbor graph allowed before
#' early stopping. Should be a value between 0 and 1. The smaller the value,
#' the less progress per iteration is needed for the optimization to continue.
#' The default value of `0.001` means that at least 0.1% of the neighbor graph
#' must be updated at each iteration.
#' @param low_memory If `TRUE`, use a lower memory, but more
#' computationally expensive approach to index construction. If set to
#' `FALSE`, you should see a noticeable speed improvement, especially when
#' using a smaller number of threads, so this is worth trying if you have the
#' memory to spare.
#' @param weight_by_degree If `TRUE`, then candidates for the local join are
#' weighted according to their in-degree, so that if there are more than
#' `max_candidates` in a candidate list, candidates with a smaller degree are
#' favored for retention. This prevents items with large numbers of edges
#' crowding out other items and for high-dimensional data is likely to provide
#' a small improvement in accuracy. Because this incurs a small extra cost of
#' counting the degree of each node, and because it tends to delay early
#' convergence, by default this is `FALSE`.
#' @param n_threads Number of threads to use.
#' @param verbose If `TRUE`, log information to the console.
#' @param progress Determines the type of progress information logged during the
#' nearest neighbor descent stage when `verbose = TRUE`. Options are:
#' * `"bar"`: a simple text progress bar.
#' * `"dist"`: the sum of the distances in the approximate knn graph at the
#' end of each iteration.
#' @param obs set to `"C"` to indicate that the input `data` orientation stores
#' each observation as a column. The default `"R"` means that observations are
#' stored in each row. Storing the data by row is usually more convenient, but
#' internally your data will be converted to column storage. Passing it
#' already column-oriented will save some memory and (a small amount of) CPU
#' usage.
#' @return the approximate nearest neighbor graph, a list containing:
#' * `idx` an n by k matrix containing the nearest neighbor indices.
#' * `dist` an n by k matrix containing the nearest neighbor distances.
#' @seealso [rnnd_build()], [rnnd_query()]
#' @examples
#'
#' # Find 4 (approximate) nearest neighbors using Euclidean distance
#' iris_knn <- rnnd_knn(iris, k = 4)
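#'
#' # A sketch of other documented options: cosine distance and random
#' # initialization instead of the default RP tree initialization
#' iris_knn_cos <- rnnd_knn(iris, k = 4, metric = "cosine", init = "rand")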
#'
#' @references
#' Dasgupta, S., & Freund, Y. (2008, May).
#' Random projection trees and low dimensional manifolds.
#' In *Proceedings of the fortieth annual ACM symposium on Theory of computing*
#' (pp. 537-546).
#' \doi{10.1145/1374376.1374452}.
#'
#' Dong, W., Moses, C., & Li, K. (2011, March).
#' Efficient k-nearest neighbor graph construction for generic similarity measures.
#' In *Proceedings of the 20th international conference on World Wide Web*
#' (pp. 577-586).
#' ACM.
#' \doi{10.1145/1963405.1963487}.
#'
#' @export
rnnd_knn <- function(data,
k = 30,
metric = "euclidean",
use_alt_metric = TRUE,
init = "tree",
n_trees = NULL,
leaf_size = NULL,
max_tree_depth = 200,
margin = "auto",
n_iters = NULL,
delta = 0.001,
max_candidates = NULL,
weight_by_degree = FALSE,
low_memory = TRUE,
n_threads = 0,
verbose = FALSE,
progress = "bar",
obs = "R") {
data <- x2m(data)
if (obs == "R") {
data <- Matrix::t(data)
}
nnd_knn(
data,
k = k,
metric = metric,
use_alt_metric = use_alt_metric,
init = init,
init_args = list(
n_trees = n_trees,
leaf_size = leaf_size,
max_tree_depth = max_tree_depth,
margin = margin
),
ret_forest = FALSE,
n_iters = n_iters,
delta = delta,
max_candidates = max_candidates,
low_memory = low_memory,
weight_by_degree = weight_by_degree,
n_threads = n_threads,
verbose = verbose,
progress = progress,
obs = "C"
)
}
# kNN Construction --------------------------------------------------------
#' Find exact nearest neighbors by brute force
#'
#' Returns the exact nearest neighbors of a dataset. A brute force search is
#' carried out: all possible pairs of points are compared, and the nearest
#' neighbors are returned.
#'
#' This method is accurate but scales poorly with dataset size, so use with
#' caution with larger datasets. Having the exact neighbors as a ground truth to
#' compare with approximate results is useful for benchmarking and determining
#' parameter settings of the approximate methods.
#'
#' @param data Matrix of `n` items to generate neighbors for, with observations
#' in the rows and features in the columns. Optionally, input can be passed
#' with observations in the columns, by setting `obs = "C"`, which should be
#' more efficient. Possible formats are [base::data.frame()], [base::matrix()]
#' or [Matrix::sparseMatrix()]. Sparse matrices should be in `dgCMatrix`
#' format. Dataframes will be converted to `numerical` matrix format
#' internally, so if your data columns are `logical` and intended to be used
#' with the specialized binary `metric`s, you should convert it to a logical
#' matrix first (otherwise you will get the slower dense numerical version).
#' @param k Number of nearest neighbors to return.
#' @param metric Type of distance calculation to use. One of:
#' - `"braycurtis"`
#' - `"canberra"`
#' - `"chebyshev"`
#' - `"correlation"` (1 minus the Pearson correlation)
#' - `"cosine"`
#' - `"dice"`
#' - `"euclidean"`
#' - `"hamming"`
#' - `"hellinger"`
#' - `"jaccard"`
#' - `"jensenshannon"`
#' - `"kulsinski"`
#' - `"sqeuclidean"` (squared Euclidean)
#' - `"manhattan"`
#' - `"rogerstanimoto"`
#' - `"russellrao"`
#' - `"sokalmichener"`
#' - `"sokalsneath"`
#' - `"spearmanr"` (1 minus the Spearman rank correlation)
#' - `"symmetrickl"` (symmetric Kullback-Leibler divergence)
#' - `"tsss"` (Triangle Area Similarity-Sector Area Similarity or TS-SS
#' metric)
#' - `"yule"`
#'
#' For non-sparse data, the following variants are available with
#' preprocessing: this trades memory for a potential speed up during the
#' distance calculation. Some minor numerical differences should be expected
#' compared to the non-preprocessed versions:
#' - `"cosine-preprocess"`: `cosine` with preprocessing.
#' - `"correlation-preprocess"`: `correlation` with preprocessing.
#'
#' For non-sparse binary data passed as a `logical` matrix, the following
#' metrics have specialized variants which should be substantially faster than
#' the non-binary variants (in other cases the logical data will be treated as
#' a dense numeric vector of 0s and 1s):
#' - `"dice"`
#' - `"hamming"`
#' - `"jaccard"`
#' - `"kulsinski"`
#' - `"matching"`
#' - `"rogerstanimoto"`
#' - `"russellrao"`
#' - `"sokalmichener"`
#' - `"sokalsneath"`
#' - `"yule"`
#' @param use_alt_metric If `TRUE`, use faster metrics that maintain the
#' ordering of distances internally (e.g. squared Euclidean distances if using
#' `metric = "euclidean"`), then apply a correction at the end. Probably
#' the only reason to set this to `FALSE` is if you suspect that some
#' sort of numeric issue is occurring with your data in the alternative code
#' path.
#' @param n_threads Number of threads to use.
#' @param verbose If `TRUE`, log information to the console.
#' @param obs set to `"C"` to indicate that the input `data` orientation stores
#' each observation as a column. The default `"R"` means that observations are
#' stored in each row. Storing the data by row is usually more convenient, but
#' internally your data will be converted to column storage. Passing it
#' already column-oriented will save some memory and (a small amount of) CPU
#' usage.
#' @return the nearest neighbor graph as a list containing:
#' * `idx` an n by k matrix containing the nearest neighbor indices.
#' * `dist` an n by k matrix containing the nearest neighbor distances.
#' @examples
#' # Find the 4 nearest neighbors using Euclidean distance
#' # If you pass a data frame, non-numeric columns are removed
#' iris_nn <- brute_force_knn(iris, k = 4, metric = "euclidean")
#'
#' # Manhattan (l1) distance
#' iris_nn <- brute_force_knn(iris, k = 4, metric = "manhattan")
#'
#' # Multi-threading: you can choose the number of threads to use: in real
#' # usage, you will want to set n_threads to at least 2
#' iris_nn <- brute_force_knn(iris, k = 4, metric = "manhattan", n_threads = 1)
#'
#' # Use verbose flag to see information about progress
#' iris_nn <- brute_force_knn(iris, k = 4, metric = "euclidean", verbose = TRUE)
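#'
#' # A sketch of the binary-specialized code path described above: pass a
#' # logical matrix (here a small random one) with a binary metric such as
#' # hamming to use the faster binary implementation
#' binary_data <- matrix(as.logical(rbinom(200, 1, 0.5)), ncol = 10)
#' binary_nn <- brute_force_knn(binary_data, k = 4, metric = "hamming")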
#' @export
brute_force_knn <- function(data,
k,
metric = "euclidean",
use_alt_metric = TRUE,
n_threads = 0,
verbose = FALSE,
obs = "R") {
obs <- match.arg(toupper(obs), c("C", "R"))
n_obs <- switch(obs,
R = nrow,
C = ncol,
stop("Unknown obs type")
)
data <- x2m(data)
check_k(k, n_obs(data))
actual_metric <- get_actual_metric(use_alt_metric, metric, data, verbose)
tsmessage(
thread_msg(
"Calculating brute force k-nearest neighbors with k = ",
k,
n_threads = n_threads
)
)
if (obs == "R") {
data <- Matrix::t(data)
}
if (is_sparse(data)) {
res <-
rnn_sparse_brute_force(
ind = data@i,
ptr = data@p,
data = data@x,
ndim = nrow(data),
nnbrs = k,
metric = actual_metric,
n_threads = n_threads,
verbose = verbose
)
} else if (is.logical(data)) {
res <-
rnn_logical_brute_force(data,
k,
actual_metric,
n_threads = n_threads,
verbose = verbose
)
} else {
res <-
rnn_brute_force(data,
k,
actual_metric,
n_threads = n_threads,
verbose = verbose
)
}
res$idx <- res$idx + 1
if (use_alt_metric) {
res$dist <-
apply_alt_metric_correction(metric, res$dist, is_sparse(data))
}
tsmessage("Finished")
res
}
#' Find nearest neighbors by random selection
#'
#' Create a neighbor graph by randomly selecting neighbors. This is not a useful
#' nearest neighbor method on its own, but can be used with other methods which
#' require initialization, such as [nnd_knn()].
#'
#' @param data Matrix of `n` items to generate random neighbors for, with
#' observations in the rows and features in the columns. Optionally, input can
#' be passed with observations in the columns, by setting `obs = "C"`, which
#' should be more efficient. Possible formats are [base::data.frame()],
#' [base::matrix()] or [Matrix::sparseMatrix()]. Sparse matrices should be in
#' `dgCMatrix` format. Dataframes will be converted to `numerical` matrix
#' format internally, so if your data columns are `logical` and intended to be
#' used with the specialized binary `metric`s, you should convert it to a
#' logical matrix first (otherwise you will get the slower dense numerical
#' version).
#' @param k Number of nearest neighbors to return.
#' @param metric Type of distance calculation to use. One of:
#' - `"braycurtis"`
#' - `"canberra"`
#' - `"chebyshev"`
#' - `"correlation"` (1 minus the Pearson correlation)
#' - `"cosine"`
#' - `"dice"`
#' - `"euclidean"`
#' - `"hamming"`
#' - `"hellinger"`
#' - `"jaccard"`
#' - `"jensenshannon"`
#' - `"kulsinski"`
#' - `"sqeuclidean"` (squared Euclidean)
#' - `"manhattan"`
#' - `"rogerstanimoto"`
#' - `"russellrao"`
#' - `"sokalmichener"`
#' - `"sokalsneath"`
#' - `"spearmanr"` (1 minus the Spearman rank correlation)
#' - `"symmetrickl"` (symmetric Kullback-Leibler divergence)
#' - `"tsss"` (Triangle Area Similarity-Sector Area Similarity or TS-SS
#' metric)
#' - `"yule"`
#'
#' For non-sparse data, the following variants are available with
#' preprocessing: this trades memory for a potential speed up during the
#' distance calculation. Some minor numerical differences should be expected
#' compared to the non-preprocessed versions:
#' - `"cosine-preprocess"`: `cosine` with preprocessing.
#' - `"correlation-preprocess"`: `correlation` with preprocessing.
#'
#' For non-sparse binary data passed as a `logical` matrix, the following
#' metrics have specialized variants which should be substantially faster than
#' the non-binary variants (in other cases the logical data will be treated as
#' a dense numeric vector of 0s and 1s):
#' - `"dice"`
#' - `"hamming"`
#' - `"jaccard"`
#' - `"kulsinski"`
#' - `"matching"`
#' - `"rogerstanimoto"`
#' - `"russellrao"`
#' - `"sokalmichener"`
#' - `"sokalsneath"`
#' - `"yule"`
#' @param use_alt_metric If `TRUE`, use faster metrics that maintain the
#' ordering of distances internally (e.g. squared Euclidean distances if using
#' `metric = "euclidean"`), then apply a correction at the end. Probably
#' the only reason to set this to `FALSE` is if you suspect that some
#' sort of numeric issue is occurring with your data in the alternative code
#' path.
#' @param order_by_distance If `TRUE` (the default), then results for each
#' item are returned by increasing distance. If you don't need the results
#' sorted, e.g. you are going to pass the results as initialization to another
#' routine like [nnd_knn()], set this to `FALSE` to save a small amount of
#' computational time.
#' @param n_threads Number of threads to use.
#' @param verbose If `TRUE`, log information to the console.
#' @param obs set to `"C"` to indicate that the input `data` orientation stores
#' each observation as a column. The default `"R"` means that observations are
#' stored in each row. Storing the data by row is usually more convenient, but
#' internally your data will be converted to column storage. Passing it
#' already column-oriented will save some memory and (a small amount of) CPU
#' usage.
#' @return a random neighbor graph as a list containing:
#' * `idx` an n by k matrix containing the nearest neighbor indices.
#' * `dist` an n by k matrix containing the nearest neighbor distances.
#' @examples
#' # Find 4 random neighbors and calculate their Euclidean distance
#' # If you pass a data frame, non-numeric columns are removed
#' iris_nn <- random_knn(iris, k = 4, metric = "euclidean")
#'
#' # Manhattan (l1) distance
#' iris_nn <- random_knn(iris, k = 4, metric = "manhattan")
#'
#' # Multi-threading: you can choose the number of threads to use: in real
#' # usage, you will want to set n_threads to at least 2
#' iris_nn <- random_knn(iris, k = 4, metric = "manhattan", n_threads = 1)
#'
#' # Use verbose flag to see information about progress
#' iris_nn <- random_knn(iris, k = 4, metric = "euclidean", verbose = TRUE)
#'
#' # These results can be improved by nearest neighbors descent. You don't need
#' # to specify k here because this is worked out from the initial input
#' iris_nn <- nnd_knn(iris, init = iris_nn, metric = "euclidean", verbose = TRUE)
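#'
#' # A sketch: when the random neighbors are only used as initialization for
#' # nnd_knn, skip the sorting step to save a little time
#' iris_rnn <- random_knn(iris, k = 4, order_by_distance = FALSE)
#' iris_nn <- nnd_knn(iris, init = iris_rnn, metric = "euclidean")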
#' @export
random_knn <-
function(data,
k,
metric = "euclidean",
use_alt_metric = TRUE,
order_by_distance = TRUE,
n_threads = 0,
verbose = FALSE,
obs = "R") {
obs <- match.arg(toupper(obs), c("C", "R"))
n_obs <- switch(obs,
R = nrow,
C = ncol,
stop("Unknown obs type")
)
data <- x2m(data)
check_k(k, n_obs(data))
actual_metric <- get_actual_metric(use_alt_metric, metric, data, verbose)
if (obs == "R") {
data <- Matrix::t(data)
}
res <- random_knn_impl(
reference = data,
k = k,
actual_metric = actual_metric,
order_by_distance = order_by_distance,
n_threads = n_threads,
verbose = verbose
)
res$idx <- res$idx + 1
if (use_alt_metric) {
res$dist <-
apply_alt_metric_correction(metric, res$dist, is_sparse(data))
}
tsmessage("Finished")
res
}
#' Find nearest neighbors using nearest neighbor descent
#'
#' Uses the Nearest Neighbor Descent method due to Dong and co-workers (2011)
#' to optimize an approximate nearest neighbor graph.
#'
#' If no initial graph is provided, a random graph is generated, or you may also
#' specify the use of a graph generated from a forest of random projection
#' trees, using the method of Dasgupta and Freund (2008).
#'
#' @param data Matrix of `n` items to generate neighbors for, with observations
#' in the rows and features in the columns. Optionally, input can be passed
#' with observations in the columns, by setting `obs = "C"`, which should be
#' more efficient. Possible formats are [base::data.frame()], [base::matrix()]
#' or [Matrix::sparseMatrix()]. Sparse matrices should be in `dgCMatrix`
#' format. Dataframes will be converted to `numerical` matrix format
#' internally, so if your data columns are `logical` and intended to be used
#' with the specialized binary `metric`s, you should convert it to a logical
#' matrix first (otherwise you will get the slower dense numerical version).
#' @param k Number of nearest neighbors to return. Optional if `init` is
#' specified.
#' @param metric Type of distance calculation to use. One of:
#' - `"braycurtis"`
#' - `"canberra"`
#' - `"chebyshev"`
#' - `"correlation"` (1 minus the Pearson correlation)
#' - `"cosine"`
#' - `"dice"`
#' - `"euclidean"`
#' - `"hamming"`
#' - `"hellinger"`
#' - `"jaccard"`
#' - `"jensenshannon"`
#' - `"kulsinski"`
#' - `"sqeuclidean"` (squared Euclidean)
#' - `"manhattan"`
#' - `"rogerstanimoto"`
#' - `"russellrao"`
#' - `"sokalmichener"`
#' - `"sokalsneath"`
#' - `"spearmanr"` (1 minus the Spearman rank correlation)
#' - `"symmetrickl"` (symmetric Kullback-Leibler divergence)
#' - `"tsss"` (Triangle Area Similarity-Sector Area Similarity or TS-SS
#' metric)
#' - `"yule"`
#'
#' For non-sparse data, the following variants are available with
#' preprocessing: this trades memory for a potential speed up during the
#' distance calculation. Some minor numerical differences should be expected
#' compared to the non-preprocessed versions:
#' - `"cosine-preprocess"`: `cosine` with preprocessing.
#' - `"correlation-preprocess"`: `correlation` with preprocessing.
#'
#' For non-sparse binary data passed as a `logical` matrix, the following
#' metrics have specialized variants which should be substantially faster than
#' the non-binary variants (in other cases the logical data will be treated as
#' a dense numeric vector of 0s and 1s):
#' - `"dice"`
#' - `"hamming"`
#' - `"jaccard"`
#' - `"kulsinski"`
#' - `"matching"`
#' - `"rogerstanimoto"`
#' - `"russellrao"`
#' - `"sokalmichener"`
#' - `"sokalsneath"`
#' - `"yule"`
#' @param init Name of the initialization strategy or initial `data` neighbor
#' graph to optimize. One of:
#' - `"rand"` random initialization (the default).
#' - `"tree"` use the random projection tree method of Dasgupta and Freund
#' (2008).
#' - a pre-calculated neighbor graph. A list containing:
#' - `idx` an `n` by `k` matrix containing the nearest neighbor indices.
#' - `dist` (optional) an `n` by `k` matrix containing the nearest
#' neighbor distances. If the input distances are omitted, they will be
#' calculated for you.
#'
#' If `k` and `init` are specified as arguments to this function, and the
#' number of neighbors provided in `init` is not equal to `k` then:
#'
#' * if `k` is smaller, only the `k` closest values in `init` are retained.
#' * if `k` is larger, then random neighbors will be chosen to fill `init` to
#' the size of `k`. Note that there is no checking if any of the random
#' neighbors are duplicates of what is already in `init` so effectively fewer
#' than `k` neighbors may be chosen for some observations under these
#' circumstances.
#' @param init_args a list containing arguments to pass to the random partition
#' forest initialization. See [rpf_knn()] for possible arguments. To avoid
#' inconsistencies with the tree calculation and subsequent nearest neighbor
#' descent optimization, if you attempt to provide a `metric` or
#' `use_alt_metric` option in this list it will be ignored.
#' @param n_iters Number of iterations of nearest neighbor descent to carry out.
#' By default, this will be chosen based on the number of observations in
#' `data`.
#' @param max_candidates Maximum number of candidate neighbors to try for each
#' item in each iteration. Use relative to `k` to emulate the "rho"
#' sampling parameter in the nearest neighbor descent paper. By default, this
#' is set to `k` or `60`, whichever is smaller.
#' @param delta The minimum relative change in the neighbor graph allowed before
#' early stopping. Should be a value between 0 and 1. The smaller the value,
#' the less progress per iteration is needed for the optimization to continue.
#' The default value of `0.001` means that at least 0.1% of the neighbor graph
#' must be updated at each iteration.
#' @param low_memory If `TRUE`, use a lower memory, but more
#' computationally expensive approach to index construction. If set to
#' `FALSE`, you should see a noticeable speed improvement, especially
#' when using a smaller number of threads, so this is worth trying if you have
#' the memory to spare.
#' @param weight_by_degree If `TRUE`, then candidates for the local join are
#' weighted according to their in-degree, so that if there are more than
#' `max_candidates` in a candidate list, candidates with a smaller degree are
#' favored for retention. This prevents items with large numbers of edges
#' crowding out other items and for high-dimensional data is likely to provide
#' a small improvement in accuracy. Because this incurs a small extra cost of
#' counting the degree of each node, and because it tends to delay early
#' convergence, by default this is `FALSE`.
#' @param use_alt_metric If `TRUE`, use faster metrics that maintain the
#' ordering of distances internally (e.g. squared Euclidean distances if using
#' `metric = "euclidean"`), then apply a correction at the end. Probably
#' the only reason to set this to `FALSE` is if you suspect that some
#' sort of numeric issue is occurring with your data in the alternative code
#' path.
#' @param ret_forest If `TRUE` and `init = "tree"` then the RP forest used to
#' initialize the nearest neighbors will be returned with the nearest neighbor
#' data. See the `Value` section for details. The returned forest can be used
#' as part of initializing the search for new data: see [rpf_knn_query()] and
#' [rpf_filter()] for more details.
#' @param n_threads Number of threads to use.
#' @param verbose If `TRUE`, log information to the console.
#' @param progress Determines the type of progress information logged if
#' `verbose = TRUE`. Options are:
#' * `"bar"`: a simple text progress bar.
#' * `"dist"`: the sum of the distances in the approximate knn graph at the
#' end of each iteration.
#' @param obs set to `"C"` to indicate that the input `data` orientation stores
#' each observation as a column. The default `"R"` means that observations are
#' stored in each row. Storing the data by row is usually more convenient, but
#' internally your data will be converted to column storage. Passing it
#' already column-oriented will save some memory and (a small amount of) CPU
#' usage.
#' @return the approximate nearest neighbor graph as a list containing:
#' * `idx` an n by k matrix containing the nearest neighbor indices.
#' * `dist` an n by k matrix containing the nearest neighbor distances.
#' * `forest` (if `init = "tree"` and `ret_forest = TRUE` only): the RP forest
#' used to initialize the neighbor data.
#' @examples
#' # Find 4 (approximate) nearest neighbors using Euclidean distance
#' # If you pass a data frame, non-numeric columns are removed
#' iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean")
#'
#' # Manhattan (l1) distance
#' iris_nn <- nnd_knn(iris, k = 4, metric = "manhattan")
#'
#' # Multi-threading: you can choose the number of threads to use: in real
#' # usage, you will want to set n_threads to at least 2
#' iris_nn <- nnd_knn(iris, k = 4, metric = "manhattan", n_threads = 1)
#'
#' # Use verbose flag to see information about progress
#' iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", verbose = TRUE)
#'
#' # Nearest neighbor descent uses random initialization, but you can pass any
#' # approximation using the init argument (as long as the metrics used to
#' # calculate the initialization are compatible with the metric options used
#' # by nnd_knn).
#' iris_nn <- random_knn(iris, k = 4, metric = "euclidean")
#' iris_nn <- nnd_knn(iris, init = iris_nn, metric = "euclidean", verbose = TRUE)
#'
#' # Number of iterations controls how much optimization is attempted. A smaller
#' # value will run faster but give poorer results
#' iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", n_iters = 2)
#'
#' # You can also control the amount of work done within an iteration by
#' # setting max_candidates
#' iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", max_candidates = 50)
#'
#' # Optimization may also stop early if not much progress is being made. This
#' # convergence criterion can be controlled via delta. A larger value will
#' # stop progress earlier. The verbose flag will provide some information if
#' # convergence is occurring before all iterations are carried out.
#' set.seed(1337)
#' iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", n_iters = 5, delta = 0.5)
#'
#' # To ensure that descent only stops if no improvements are made, set delta = 0
#' set.seed(1337)
#' iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", n_iters = 5, delta = 0)
#'
#' # A faster version of the algorithm is available that avoids repeated
#' # distance calculations at the cost of using more RAM. Set low_memory to
#' # FALSE to try it.
#' set.seed(1337)
#' iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", low_memory = FALSE)
#'
#' # Using init = "tree" is usually more efficient than random initialization.
#' # arguments to the tree initialization method can be passed via the init_args
#' # list
#' set.seed(1337)
#' iris_nn <- nnd_knn(iris, k = 4, init = "tree", init_args = list(n_trees = 5))
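#'
#' # A sketch: with init = "tree", set ret_forest = TRUE to also return the RP
#' # forest used for initialization (as iris_nn$forest), which can be reused
#' # with rpf_filter() or rpf_knn_query()
#' set.seed(1337)
#' iris_nn <- nnd_knn(iris, k = 4, init = "tree", ret_forest = TRUE)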
#' @references
#' Dasgupta, S., & Freund, Y. (2008, May).
#' Random projection trees and low dimensional manifolds.
#' In *Proceedings of the fortieth annual ACM symposium on Theory of computing*
#' (pp. 537-546).
#' \doi{10.1145/1374376.1374452}.
#'
#' Dong, W., Moses, C., & Li, K. (2011, March).
#' Efficient k-nearest neighbor graph construction for generic similarity measures.
#' In *Proceedings of the 20th international conference on World Wide Web*
#' (pp. 577-586).
#' ACM.
#' \doi{10.1145/1963405.1963487}.
#' @export
nnd_knn <- function(data,
k = NULL,
metric = "euclidean",
init = "rand",
init_args = NULL,
n_iters = NULL,
max_candidates = NULL,
delta = 0.001,
low_memory = TRUE,
weight_by_degree = FALSE,
use_alt_metric = TRUE,
n_threads = 0,
verbose = FALSE,
progress = "bar",
obs = "R",
ret_forest = FALSE) {
stopifnot(tolower(progress) %in% c("bar", "dist"))
obs <- match.arg(toupper(obs), c("C", "R"))
actual_metric <-
get_actual_metric(use_alt_metric, metric, data, verbose)
data <- x2m(data)
if (obs == "R") {
data <- Matrix::t(data)
}
# data must be column-oriented at this point
if (is.character(init)) {
if (is.null(k)) {
stop("Must provide k")
}
check_k(k, ncol(data))
init <- match.arg(tolower(init), c("rand", "tree"))
# if the user provided tree init args that could clash with the nnd args,
# ignore them and use the nnd values instead
if (init == "tree" && is.list(init_args)) {
if ("use_alt_metric" %in% names(init_args)) {
init_args$use_alt_metric <- use_alt_metric
}
if ("metric" %in% names(init_args)) {
init_args$metric <- metric
}
}
tsmessage("Initializing neighbors using '", init, "' method")
init <- switch(init,
"rand" = random_knn_impl(
reference = data,
k = k,
actual_metric = actual_metric,
order_by_distance = FALSE,
n_threads = n_threads,
verbose = verbose
),
"tree" = do.call(rpf_knn_impl, lmerge(
# defaults we have no reason to change; these should match rpf_knn or rpf_build
list(
data,
k = k,
metric = metric,
use_alt_metric = use_alt_metric,
actual_metric = actual_metric,
n_trees = NULL,
leaf_size = NULL,
max_tree_depth = 200,
include_self = FALSE, # this is changed from default on purpose
ret_forest = ret_forest,
margin = "auto",
unzero = FALSE,
n_threads = n_threads,
verbose = verbose
),
init_args
)),
stop("Unknown initialization option '", init, "'")
)
# FIXME: can we just turn off unzero in tree and random return?
init$idx <- init$idx + 1
if (any(init$idx == 0)) {
tsmessage(
"Warning: initialization failed to find ",
k,
" neighbors for all points"
)
}
} else {
tsmessage("Initializing from user-supplied graph")
# user-supplied input may need to be transformed to the actual metric
if (use_alt_metric &&
!is.null(init) && is.list(init) && !is.null(init$dist)) {
tsmessage(
"Applying metric correction to initial distances from '",
metric, "' to '", actual_metric, "'"
)
init$dist <-
apply_alt_metric_uncorrection(metric, init$dist, is_sparse(data))
}
if (is.null(k)) {
k <- ncol(init$idx)
}
}
forest <- NULL
if (ret_forest && is.list(init) && !is.null(init$forest)) {
forest <- init$forest
init$forest <- NULL
}
init <-
prepare_init_graph(
init,
k,
data = data,
metric = actual_metric,
n_threads = n_threads,
verbose = verbose
)
if (is.null(max_candidates)) {
max_candidates <- min(k, 60)
}
if (is.null(n_iters)) {
n_iters <- max(5, round(log2(ncol(data))))
}
tsmessage(
thread_msg(
"Running nearest neighbor descent for ",
n_iters,
" iterations",
n_threads = n_threads
)
)
nnd_args <- list(
nn_idx = init$idx,
nn_dist = init$dist,
metric = actual_metric,
n_iters = n_iters,
max_candidates = max_candidates,
delta = delta,
low_memory = low_memory,
weight_by_degree = weight_by_degree,
n_threads = n_threads,
verbose = verbose,
progress_type = progress
)
if (is_sparse(data)) {
nnd_fun <- rnn_sparse_descent
nnd_args$data <- data@x
nnd_args$ind <- data@i
nnd_args$ptr <- data@p
nnd_args$ndim <- nrow(data)
} else if (is.logical(data)) {
nnd_fun <- rnn_logical_descent
nnd_args$data <- data
} else {
nnd_fun <- rnn_descent
nnd_args$data <- data
}
res <- do.call(nnd_fun, nnd_args)
if (use_alt_metric) {
res$dist <-
apply_alt_metric_correction(metric, res$dist, is_sparse(data))
}
if (any(res$idx == 0)) {
tsmessage(
"Warning: NN Descent failed to find ",
k,
" neighbors for all points"
)
}
tsmessage("Finished")
if (!is.null(forest)) {
res$forest <- forest
}
res
}
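# A minimal usage sketch (illustrative only; `refine_random_knn` below is not
# part of the package API): the user-supplied `init` branch above can be used
# to refine an existing neighbor graph, e.g. one built by random_knn(), with
# `k` inferred from the initialization.
refine_random_knn <- function(data, k = 10) {
  # crude starting graph: k random neighbors per item
  rand_graph <- random_knn(data, k = k)
  # refine it with nearest neighbor descent; k is taken from rand_graph
  nnd_knn(data, init = rand_graph, metric = "euclidean", verbose = TRUE)
}
# e.g. refine_random_knn(iris, k = 10)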
# kNN Queries -------------------------------------------------------------
#' Query exact nearest neighbors by brute force
#'
#' Returns the exact nearest neighbors of query data to the reference data. A
#' brute force search is carried out: all possible pairs of reference and query
#' points are compared, and the nearest neighbors are returned.
#'
#' This is accurate but scales poorly with dataset size, so use with caution
#' with larger datasets. Having the exact neighbors as a ground truth to compare
#' with approximate results is useful for benchmarking and determining
#' parameter settings of the approximate methods.
#'
#' @param query Matrix of `n` query items, with observations in the rows and
#' features in the columns. Optionally, the data may be passed with the
#' observations in the columns, by setting `obs = "C"`, which should be more
#' efficient. The `reference` data must be passed in the same orientation as
#' `query`. Possible formats are [base::data.frame()], [base::matrix()]
#' or [Matrix::sparseMatrix()]. Sparse matrices should be in `dgCMatrix`
#' format. Dataframes will be converted to `numerical` matrix format
#' internally, so if your data columns are `logical` and intended to be used
#' with the specialized binary `metric`s, you should convert it to a logical
#' matrix first (otherwise you will get the slower dense numerical version).
#' @param reference Matrix of `m` reference items, with observations in the rows
#' and features in the columns. The nearest neighbors to the queries are
#' calculated from this data. Optionally, the data may be passed with the
#' observations in the columns, by setting `obs = "C"`, which should be more
#' efficient. The `query` data must be passed in the same format and
#' orientation as `reference`. Possible formats are [base::data.frame()],
#' [base::matrix()] or [Matrix::sparseMatrix()]. Sparse matrices should be in
#' `dgCMatrix` format.
#' @param k Number of nearest neighbors to return.
#' @param metric Type of distance calculation to use. One of:
#' - `"braycurtis"`
#' - `"canberra"`
#' - `"chebyshev"`
#' - `"correlation"` (1 minus the Pearson correlation)
#' - `"cosine"`
#' - `"dice"`
#' - `"euclidean"`
#' - `"hamming"`
#' - `"hellinger"`
#' - `"jaccard"`
#' - `"jensenshannon"`
#' - `"kulsinski"`
#' - `"sqeuclidean"` (squared Euclidean)
#' - `"manhattan"`
#' - `"rogerstanimoto"`
#' - `"russellrao"`
#' - `"sokalmichener"`
#' - `"sokalsneath"`
#' - `"spearmanr"` (1 minus the Spearman rank correlation)
#' - `"symmetrickl"` (symmetric Kullback-Leibler divergence)
#' - `"tsss"` (Triangle Area Similarity-Sector Area Similarity or TS-SS
#' metric)
#' - `"yule"`
#'
#' For non-sparse data, the following variants are available with
#' preprocessing: this trades memory for a potential speed up during the
#' distance calculation. Some minor numerical differences should be expected
#' compared to the non-preprocessed versions:
#' - `"cosine-preprocess"`: `cosine` with preprocessing.
#' - `"correlation-preprocess"`: `correlation` with preprocessing.
#'
#' For non-sparse binary data passed as a `logical` matrix, the following
#' metrics have specialized variants which should be substantially faster than
#' the non-binary variants (in other cases the logical data will be treated as
#' a dense numeric vector of 0s and 1s):
#' - `"dice"`
#' - `"hamming"`
#' - `"jaccard"`
#' - `"kulsinski"`
#' - `"matching"`
#' - `"rogerstanimoto"`
#' - `"russellrao"`
#' - `"sokalmichener"`
#' - `"sokalsneath"`
#' - `"yule"`
#' @param use_alt_metric If `TRUE`, use faster metrics that maintain the
#' ordering of distances internally (e.g. squared Euclidean distances if using
#' `metric = "euclidean"`), then apply a correction at the end. Probably
#' the only reason to set this to `FALSE` is if you suspect that some
#' sort of numeric issue is occurring with your data in the alternative code
#' path.
#' @param n_threads Number of threads to use.
#' @param verbose If `TRUE`, log information to the console.
#' @param obs set to `"C"` to indicate that the input `query` and `reference`
#' orientation stores each observation as a column (the orientation must be
#' consistent). The default `"R"` means that observations are stored in each
#' row. Storing the data by row is usually more convenient, but internally
#' your data will be converted to column storage. Passing it already
#' column-oriented will save some memory and (a small amount of) CPU usage.
#' @return the nearest neighbor graph as a list containing:
#' * `idx` an n by k matrix containing the nearest neighbor indices in
#' `reference`.
#' * `dist` an n by k matrix containing the nearest neighbor distances to the
#' items in `reference`.
#' @examples
#' # 100 reference iris items
#' iris_ref <- iris[iris$Species %in% c("setosa", "versicolor"), ]
#'
#' # 50 query items
#' iris_query <- iris[iris$Species == "versicolor", ]
#'
#' # For each item in iris_query find the 4 nearest neighbors in iris_ref
#' # If you pass a data frame, non-numeric columns are removed
#' # set verbose = TRUE to get details on the progress being made
#' iris_query_nn <- brute_force_knn_query(iris_query,
#' reference = iris_ref,
#' k = 4, metric = "euclidean", verbose = TRUE
#' )
#'
#' # Manhattan (l1) distance
#' iris_query_nn <- brute_force_knn_query(iris_query,
#' reference = iris_ref,
#' k = 4, metric = "manhattan"
#' )
#' @export
brute_force_knn_query <- function(query,
reference,
k,
metric = "euclidean",
use_alt_metric = TRUE,
n_threads = 0,
verbose = FALSE,
obs = "R") {
obs <- match.arg(toupper(obs), c("C", "R"))
actual_metric <- get_actual_metric(use_alt_metric, metric, reference, verbose)
check_sparse(reference, query)
reference <- x2m(reference)
query <- x2m(query)
if (obs == "R") {
reference <- Matrix::t(reference)
query <- Matrix::t(query)
}
check_k(k, ncol(reference))
tsmessage(
thread_msg(
"Calculating brute force k-nearest neighbors from reference with k = ",
k,
n_threads = n_threads
)
)
if (is_sparse(reference)) {
res <- rnn_sparse_brute_force_query(
ref_ind = reference@i,
ref_ptr = reference@p,
ref_data = reference@x,
query_ind = query@i,
query_ptr = query@p,
query_data = query@x,
ndim = nrow(query),
nnbrs = k,
metric = actual_metric,
n_threads = n_threads,
verbose = verbose
)
} else if (is.logical(reference)) {
res <- rnn_logical_brute_force_query(reference,
query,
k,
actual_metric,
n_threads = n_threads,
verbose = verbose
)
} else {
res <- rnn_brute_force_query(reference,
query,
k,
actual_metric,
n_threads = n_threads,
verbose = verbose
)
}
res$idx <- res$idx + 1
if (use_alt_metric) {
res$dist <-
apply_alt_metric_correction(metric, res$dist, is_sparse(reference))
}
tsmessage("Finished")
res
}
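# A hedged sketch (the function name is illustrative, not exported) of the
# benchmarking use case described above: treat the brute force result as the
# ground truth and measure the recall of an approximate search with
# neighbor_overlap(), defined later in this file.
benchmark_query_recall <- function() {
  iris_ref <- iris[iris$Species %in% c("setosa", "versicolor"), ]
  iris_query <- iris[iris$Species == "versicolor", ]
  # exact neighbors: the ground truth
  exact <- brute_force_knn_query(iris_query, reference = iris_ref, k = 4)
  # an approximate alternative to compare against
  ref_graph <- nnd_knn(iris_ref, k = 4)
  approx <- graph_knn_query(iris_query, iris_ref, ref_graph, k = 4)
  # mean fraction of exact neighbors recovered by the approximate query
  neighbor_overlap(approx, exact)
}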
#' Query nearest neighbors by random selection
#'
#' Run queries against reference data to return randomly selected neighbors.
#' This is not a useful query method on its own, but can be used with other
#' methods which require initialization.
#'
#' @param query Matrix of `n` query items, with observations in the rows and
#' features in the columns. Optionally, the data may be passed with the
#' observations in the columns, by setting `obs = "C"`, which should be more
#' efficient. The `reference` data must be passed in the same orientation as
#' `query`. Possible formats are [base::data.frame()], [base::matrix()]
#' or [Matrix::sparseMatrix()]. Sparse matrices should be in `dgCMatrix`
#' format. Dataframes will be converted to `numerical` matrix format
#' internally, so if your data columns are `logical` and intended to be used
#' with the specialized binary `metric`s, you should convert it to a logical
#' matrix first (otherwise you will get the slower dense numerical version).
#' @param reference Matrix of `m` reference items, with observations in the rows
#' and features in the columns. The nearest neighbors to the queries are
#' randomly selected from this data. Optionally, the data may be passed with
#' the observations in the columns, by setting `obs = "C"`, which should be
#' more efficient. The `query` data must be passed in the same orientation
#' and format as `reference`. Possible formats are [base::data.frame()],
#' [base::matrix()] or [Matrix::sparseMatrix()]. Sparse matrices should be in
#' `dgCMatrix` format.
#' @param k Number of nearest neighbors to return.
#' @param metric Type of distance calculation to use. One of:
#' - `"braycurtis"`
#' - `"canberra"`
#' - `"chebyshev"`
#' - `"correlation"` (1 minus the Pearson correlation)
#' - `"cosine"`
#' - `"dice"`
#' - `"euclidean"`
#' - `"hamming"`
#' - `"hellinger"`
#' - `"jaccard"`
#' - `"jensenshannon"`
#' - `"kulsinski"`
#' - `"sqeuclidean"` (squared Euclidean)
#' - `"manhattan"`
#' - `"rogerstanimoto"`
#' - `"russellrao"`
#' - `"sokalmichener"`
#' - `"sokalsneath"`
#' - `"spearmanr"` (1 minus the Spearman rank correlation)
#' - `"symmetrickl"` (symmetric Kullback-Leibler divergence)
#' - `"tsss"` (Triangle Area Similarity-Sector Area Similarity or TS-SS
#' metric)
#' - `"yule"`
#'
#' For non-sparse data, the following variants are available with
#' preprocessing: this trades memory for a potential speed up during the
#' distance calculation. Some minor numerical differences should be expected
#' compared to the non-preprocessed versions:
#' - `"cosine-preprocess"`: `cosine` with preprocessing.
#' - `"correlation-preprocess"`: `correlation` with preprocessing.
#'
#' For non-sparse binary data passed as a `logical` matrix, the following
#' metrics have specialized variants which should be substantially faster than
#' the non-binary variants (in other cases the logical data will be treated as
#' a dense numeric vector of 0s and 1s):
#' - `"dice"`
#' - `"hamming"`
#' - `"jaccard"`
#' - `"kulsinski"`
#' - `"matching"`
#' - `"rogerstanimoto"`
#' - `"russellrao"`
#' - `"sokalmichener"`
#' - `"sokalsneath"`
#' - `"yule"`
#' @param use_alt_metric If `TRUE`, use faster metrics that maintain the
#' ordering of distances internally (e.g. squared Euclidean distances if using
#' `metric = "euclidean"`), then apply a correction at the end. Probably
#' the only reason to set this to `FALSE` is if you suspect that some
#' sort of numeric issue is occurring with your data in the alternative code
#' path.
#' @param order_by_distance If `TRUE` (the default), then results for each
#' item are returned by increasing distance. If you don't need the results
#' sorted, e.g. you are going to pass the results as initialization to another
#' routine like [graph_knn_query()], set this to `FALSE` to save a
#' small amount of computational time.
#' @param n_threads Number of threads to use.
#' @param verbose If `TRUE`, log information to the console.
#' @param obs set to `"C"` to indicate that the input `query` and `reference`
#' orientation stores each observation as a column (the orientation must be
#' consistent). The default `"R"` means that observations are stored in each
#' row. Storing the data by row is usually more convenient, but internally
#' your data will be converted to column storage. Passing it already
#' column-oriented will save some memory and (a small amount of) CPU usage.
#' @return an approximate nearest neighbor graph as a list containing:
#' * `idx` an n by k matrix containing the nearest neighbor indices.
#' * `dist` an n by k matrix containing the nearest neighbor distances.
#' @examples
#' # 100 reference iris items
#' iris_ref <- iris[iris$Species %in% c("setosa", "versicolor"), ]
#'
#' # 50 query items
#' iris_query <- iris[iris$Species == "versicolor", ]
#'
#' # For each item in iris_query find 4 random neighbors in iris_ref
#' # If you pass a data frame, non-numeric columns are removed
#' # set verbose = TRUE to get details on the progress being made
#' iris_query_random_nbrs <- random_knn_query(iris_query,
#' reference = iris_ref,
#' k = 4, metric = "euclidean", verbose = TRUE
#' )
#'
#' # Manhattan (l1) distance
#' iris_query_random_nbrs <- random_knn_query(iris_query,
#' reference = iris_ref,
#' k = 4, metric = "manhattan"
#' )
#' @export
random_knn_query <-
function(query,
reference,
k,
metric = "euclidean",
use_alt_metric = TRUE,
order_by_distance = TRUE,
n_threads = 0,
verbose = FALSE,
obs = "R") {
obs <- match.arg(toupper(obs), c("C", "R"))
n_obs <- switch(obs,
R = nrow,
C = ncol,
stop("Unknown obs type")
)
check_sparse(reference, query)
reference <- x2m(reference)
query <- x2m(query)
check_k(k, n_obs(reference))
actual_metric <-
get_actual_metric(use_alt_metric, metric, reference, verbose)
if (obs == "R") {
reference <- Matrix::t(reference)
query <- Matrix::t(query)
}
res <- random_knn_impl(
reference = reference,
query = query,
k = k,
actual_metric = actual_metric,
order_by_distance = order_by_distance,
n_threads = n_threads,
verbose = verbose
)
res$idx <- res$idx + 1
if (use_alt_metric) {
res$dist <-
apply_alt_metric_correction(metric, res$dist, is_sparse(reference))
}
tsmessage("Finished")
res
}
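# A hedged sketch (function name illustrative only) of the main use case
# mentioned above: random neighbors as the starting point for a graph search.
# Sorting is skipped because graph_knn_query() does not need sorted input.
random_init_graph_query <- function() {
  iris_ref <- iris[iris$Species %in% c("setosa", "versicolor"), ]
  iris_query <- iris[iris$Species == "versicolor", ]
  ref_graph <- nnd_knn(iris_ref, k = 4)
  rand_init <- random_knn_query(iris_query,
    reference = iris_ref, k = 4,
    order_by_distance = FALSE
  )
  graph_knn_query(iris_query, iris_ref, ref_graph, init = rand_init, k = 4)
}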
#' Query a search graph for nearest neighbors
#'
#' Run queries against a search graph, to return nearest neighbors taken from
#' the reference data used to build that graph.
#'
#' A greedy beam search is used to query the graph, combining two search pruning
#' strategies. The first, due to Iwasaki and Miyazaki (2018), only considers
#' new candidates within a relative distance of the current furthest neighbor
#' in the query's graph. The second, due to Harwood and Drummond (2016), puts a
#' limit on the absolute number of distance calculations to carry out. See the
#' `epsilon` and `max_search_fraction` parameters respectively.
#'
#' @param query Matrix of `n` query items, with observations in the rows and
#' features in the columns. Optionally, the data may be passed with the
#' observations in the columns, by setting `obs = "C"`, which should be more
#' efficient. The `reference` data must be passed in the same orientation as
#' `query`. Possible formats are [base::data.frame()], [base::matrix()]
#' or [Matrix::sparseMatrix()]. Sparse matrices should be in `dgCMatrix`
#' format. Dataframes will be converted to `numerical` matrix format
#' internally, so if your data columns are `logical` and intended to be used
#' with the specialized binary `metric`s, you should convert it to a logical
#' matrix first (otherwise you will get the slower dense numerical version).
#' @param reference Matrix of `m` reference items, with observations in the rows
#' and features in the columns. The nearest neighbors to the queries are
#' calculated from this data. Optionally, the data may be passed with the
#' observations in the columns, by setting `obs = "C"`, which should be more
#' efficient. The `query` data must be passed in the same format and
#' orientation as `reference`. Possible formats are [base::data.frame()],
#' [base::matrix()] or [Matrix::sparseMatrix()]. Sparse matrices should be in
#' `dgCMatrix` format.
#' @param reference_graph Search graph of the `reference` data. A neighbor
#' graph, such as that output from [nnd_knn()] can be used, but
#' preferably a suitably prepared sparse search graph should be used, such as
#' that output by [prepare_search_graph()].
#' @param k Number of nearest neighbors to return. Optional if `init` is
#' specified.
#' @param metric Type of distance calculation to use. One of:
#' - `"braycurtis"`
#' - `"canberra"`
#' - `"chebyshev"`
#' - `"correlation"` (1 minus the Pearson correlation)
#' - `"cosine"`
#' - `"dice"`
#' - `"euclidean"`
#' - `"hamming"`
#' - `"hellinger"`
#' - `"jaccard"`
#' - `"jensenshannon"`
#' - `"kulsinski"`
#' - `"sqeuclidean"` (squared Euclidean)
#' - `"manhattan"`
#' - `"rogerstanimoto"`
#' - `"russellrao"`
#' - `"sokalmichener"`
#' - `"sokalsneath"`
#' - `"spearmanr"` (1 minus the Spearman rank correlation)
#' - `"symmetrickl"` (symmetric Kullback-Leibler divergence)
#' - `"tsss"` (Triangle Area Similarity-Sector Area Similarity or TS-SS
#' metric)
#' - `"yule"`
#'
#' For non-sparse data, the following variants are available with
#' preprocessing: this trades memory for a potential speed up during the
#' distance calculation. Some minor numerical differences should be expected
#' compared to the non-preprocessed versions:
#' - `"cosine-preprocess"`: `cosine` with preprocessing.
#' - `"correlation-preprocess"`: `correlation` with preprocessing.
#'
#' For non-sparse binary data passed as a `logical` matrix, the following
#' metrics have specialized variants which should be substantially faster than
#' the non-binary variants (in other cases the logical data will be treated as
#' a dense numeric vector of 0s and 1s):
#' - `"dice"`
#' - `"hamming"`
#' - `"jaccard"`
#' - `"kulsinski"`
#' - `"matching"`
#' - `"rogerstanimoto"`
#' - `"russellrao"`
#' - `"sokalmichener"`
#' - `"sokalsneath"`
#' - `"yule"`
#' @param use_alt_metric If `TRUE`, use faster metrics that maintain the
#' ordering of distances internally (e.g. squared Euclidean distances if using
#' `metric = "euclidean"`), then apply a correction at the end. Probably the
#' only reason to set this to `FALSE` is if you suspect that some sort of
#' numeric issue is occurring with your data in the alternative code path. If
#' a search forest is used for initialization via the `init` parameter, then
#' the metric is fetched from there and this setting is ignored.
#' @param init Initial `query` neighbor graph to optimize. If not provided, `k`
#' random neighbors are created. If provided, the input format must be one of:
#' 1. A list containing:
#'
#' * `idx` an `n` by `k` matrix containing the nearest neighbor indices.
#' * `dist` (optional) an `n` by `k` matrix containing the nearest neighbor
#' distances.
#'
#' If `k` and `init` are specified as arguments to this function, and the
#' number of neighbors provided in `init` is not equal to `k` then:
#'
#' * if `k` is smaller, only the `k` closest values in `init` are retained.
#' * if `k` is larger, then random neighbors will be chosen to fill `init` to
#' the size of `k`. Note that there is no check for whether any of the random
#' neighbors duplicate those already in `init`, so effectively fewer than `k`
#' unique neighbors may be chosen for some observations under these
#' circumstances.
#'
#' If the input distances are omitted, they will be calculated for you.
#' 1. A random projection forest, such as that returned from [rpf_build()] or
#' [rpf_knn()] with `ret_forest = TRUE`.
#' @param epsilon Controls trade-off between accuracy and search cost, as
#' described by Iwasaki and Miyazaki (2018), by specifying a distance
#' tolerance on whether to explore the neighbors of candidate points. The
#' larger the value, the more neighbors will be searched. A value of 0.1
#' allows query-candidate distances to be 10% larger than the current
#' most-distant neighbor of the query point, 0.2 means 20%, and so on.
#' Suggested values are between 0-0.5, although this value is highly dependent
#' on the distribution of distances in the dataset (higher dimensional data
#' should choose a smaller cutoff). Too large a value of `epsilon` will result
#' in the query search approaching brute force comparison. Use this parameter
#' in conjunction with `max_search_fraction` and [prepare_search_graph()] to
#' prevent excessive run time. Default is 0.1. If you set `verbose = TRUE`,
#' statistics of the number of distance calculations will be logged which
#' can help you tune `epsilon`.
#' @param max_search_fraction Maximum fraction of the reference data to search.
#' This is a value between 0 (search none of the reference data) and 1 (search
#' all of the data if necessary). This works in conjunction with `epsilon` and
#' will terminate the search early if the specified fraction of the reference
#' data has been searched. Default is 1.
#' @param n_threads Number of threads to use.
#' @param verbose If `TRUE`, log information to the console.
#' @param obs set to `"C"` to indicate that the input `query` and `reference`
#' orientation stores each observation as a column (the orientation must be
#' consistent). The default `"R"` means that observations are stored in each
#' row. Storing the data by row is usually more convenient, but internally
#' your data will be converted to column storage. Passing it already
#' column-oriented will save some memory and (a small amount of) CPU usage.
#' @return the approximate nearest neighbor graph as a list containing:
#' * `idx` a `n` by `k` matrix containing the nearest neighbor indices
#' specifying the row of the neighbor in `reference`.
#' * `dist` a `n` by `k` matrix containing the nearest neighbor distances.
#' @examples
#' # 100 reference iris items
#' iris_ref <- iris[iris$Species %in% c("setosa", "versicolor"), ]
#'
#' # 50 query items
#' iris_query <- iris[iris$Species == "versicolor", ]
#'
#' # First, find the approximate 4-nearest neighbor graph for the references:
#' iris_ref_graph <- nnd_knn(iris_ref, k = 4)
#'
#' # For each item in iris_query find the 4 nearest neighbors in iris_ref.
#' # You need to pass both the reference data and the reference graph.
#' # If you pass a data frame, non-numeric columns are removed.
#' # set verbose = TRUE to get details on the progress being made
#' iris_query_nn <- graph_knn_query(iris_query, iris_ref, iris_ref_graph,
#' k = 4, metric = "euclidean", verbose = TRUE
#' )
#'
#' # A more complete example, converting the initial knn into a search graph
#' # and using a filtered random projection forest to initialize the search
#' # create initial knn and forest
#' iris_ref_graph <- nnd_knn(iris_ref, k = 4, init = "tree", ret_forest = TRUE)
#' # keep the best tree in the forest
#' forest <- rpf_filter(iris_ref_graph, n_trees = 1)
#' # expand the knn into a search graph
#' iris_ref_search_graph <- prepare_search_graph(iris_ref, iris_ref_graph)
#' # run the query with the improved graph and initialization
#' iris_query_nn <- graph_knn_query(iris_query, iris_ref, iris_ref_search_graph,
#' init = forest, k = 4
#' )
#'
#' @references
#' Harwood, B., & Drummond, T. (2016).
#' Fanng: Fast approximate nearest neighbour graphs.
#' In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*
#' (pp. 5713-5722).
#'
#' Iwasaki, M., & Miyazaki, D. (2018).
#' Optimization of indexing based on k-nearest neighbor graph for proximity
#' search in high-dimensional data.
#' *arXiv preprint arXiv:1810.07355*.
#' @export
graph_knn_query <- function(query,
reference,
reference_graph,
k = NULL,
metric = "euclidean",
init = NULL,
epsilon = 0.1,
max_search_fraction = 1.0,
use_alt_metric = TRUE,
n_threads = 0,
verbose = FALSE,
obs = "R") {
obs <- match.arg(toupper(obs), c("C", "R"))
check_sparse(reference, query)
reference <- x2m(reference)
query <- x2m(query)
if (obs == "R") {
reference <- Matrix::t(reference)
query <- Matrix::t(query)
}
if (is.list(init) && is_rpforest(init)) {
tsmessage("Reading metric data from forest")
actual_metric <- init$actual_metric
metric <- init$original_metric
use_alt_metric <- init$use_alt_metric
if (use_alt_metric && metric != actual_metric) {
tsmessage("Using alt metric '", actual_metric, "' for '", metric, "'")
} else {
tsmessage("Using metric '", metric, "'")
}
} else {
actual_metric <-
get_actual_metric(use_alt_metric, metric, reference, verbose)
}
# reference and query must be column-oriented at this point
if (is.null(init)) {
if (is.null(k)) {
if (is.list(reference_graph)) {
k <- get_reference_graph_k(reference_graph)
tsmessage("Using k = ", k, " from graph")
} else {
stop("Must provide k")
}
}
check_k(k, ncol(reference))
tsmessage("Initializing from random neighbors")
init <- random_knn_impl(
query = query,
reference = reference,
k = k,
actual_metric = actual_metric,
order_by_distance = FALSE,
n_threads = n_threads,
verbose = verbose
)
# FIXME: can we just do the unzeroing inside the init?
init$idx <- init$idx + 1
} else if (is.list(init) && is_rpforest(init)) {
init <- rpf_knn_query(
query = query,
reference = reference,
forest = init,
k = k,
cache = TRUE,
n_threads = n_threads,
verbose = verbose,
obs = "C"
)
if (use_alt_metric) {
init$dist <-
apply_alt_metric_uncorrection(metric, init$dist, is_sparse(reference))
}
} else if ((is.list(init) && !is.null(init$idx)) || is.matrix(init)) {
tsmessage("Initializing from user-supplied graph")
if (is.matrix(init)) {
init <- list(idx = init)
}
# user-supplied distances may need to be transformed to the actual metric
if (use_alt_metric && !is.null(init$dist)) {
init$dist <-
apply_alt_metric_uncorrection(metric, init$dist, is_sparse(reference))
}
if (is.null(k)) {
k <- ncol(init$idx)
tsmessage("Using k = ", k, " from initial graph")
}
} else {
stop("Unsupported type of 'init'")
}
init <-
prepare_init_graph(
nn = init,
k = k,
query = query,
data = reference,
metric = actual_metric,
n_threads = n_threads,
verbose = verbose
)
if (is.list(reference_graph) && any(reference_graph$idx == 0)) {
tsmessage("Warning: reference knn graph contains missing data")
}
stopifnot(
!is.null(query),
(
methods::is(query, "matrix") ||
methods::is(query, "sparseMatrix")
)
)
stopifnot(
!is.null(init$idx),
methods::is(init$idx, "matrix"),
ncol(init$idx) == k,
nrow(init$idx) == ncol(query)
)
stopifnot(
!is.null(init$dist),
methods::is(init$dist, "matrix"),
ncol(init$dist) == k,
nrow(init$dist) == ncol(query)
)
if (is.list(reference_graph)) {
reference_dist <- reference_graph$dist
reference_idx <- reference_graph$idx
stopifnot(!is.null(reference), (
methods::is(reference, "matrix") ||
methods::is(reference, "sparseMatrix")
))
stopifnot(
!is.null(reference_idx),
methods::is(reference_idx, "matrix"),
nrow(reference_idx) == ncol(reference)
)
stopifnot(
!is.null(reference_dist),
methods::is(reference_dist, "matrix"),
nrow(reference_dist) == ncol(reference)
)
reference_graph_list <- graph_to_list(reference_graph)
} else {
stopifnot(methods::is(reference_graph, "sparseMatrix"))
reference_graph_list <- tcsparse_to_list(reference_graph)
}
tsmessage(
thread_msg(
"Searching nearest neighbor graph with epsilon = ",
epsilon,
" and max_search_fraction = ",
max_search_fraction,
n_threads = n_threads
)
)
args <- list(
reference_graph_list = reference_graph_list,
nn_idx = init$idx,
nn_dist = init$dist,
metric = actual_metric,
epsilon = epsilon,
max_search_fraction = max_search_fraction,
n_threads = n_threads,
verbose = verbose
)
if (is_sparse(reference)) {
res <- do.call(
rnn_sparse_query,
c(
list(
ref_ind = reference@i,
ref_ptr = reference@p,
ref_data = reference@x,
query_ind = query@i,
query_ptr = query@p,
query_data = query@x,
ndim = nrow(reference)
),
args
)
)
} else if (is.logical(reference)) {
res <- do.call(
rnn_logical_query,
c(
list(reference = reference, query = query),
args
)
)
} else {
res <- do.call(
rnn_query,
c(
list(reference = reference, query = query),
args
)
)
}
if (use_alt_metric) {
res$dist <-
apply_alt_metric_correction(metric, res$dist, is_sparse(reference))
}
tsmessage("Finished")
res
}
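# A hedged sketch (function name illustrative only) of how the two search
# pruning parameters described above trade accuracy for run time: `epsilon`
# widens the beam, while `max_search_fraction` caps the work done per query.
tune_graph_query <- function() {
  iris_ref <- iris[iris$Species %in% c("setosa", "versicolor"), ]
  iris_query <- iris[iris$Species == "versicolor", ]
  ref_graph <- nnd_knn(iris_ref, k = 4)
  search_graph <- prepare_search_graph(iris_ref, ref_graph)
  # larger epsilon: broader (slower, usually more accurate) search;
  # verbose = TRUE logs distance calculation statistics to help tuning
  broad <- graph_knn_query(iris_query, iris_ref, search_graph,
    k = 4, epsilon = 0.3, verbose = TRUE
  )
  # additionally cap the search at 20% of the reference data per query
  capped <- graph_knn_query(iris_query, iris_ref, search_graph,
    k = 4, epsilon = 0.3, max_search_fraction = 0.2
  )
  list(broad = broad, capped = capped)
}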
# Search Graph Preparation ------------------------------------------------
#' Convert a nearest neighbor graph into a search graph
#'
#' Create a graph using existing nearest neighbor data to balance search
#' speed and accuracy using the occlusion pruning and truncation strategies
#' of Harwood and Drummond (2016). The resulting search graph should be more
#' efficient for querying new data than the original nearest neighbor graph.
#'
#' An approximate nearest neighbor graph is not very useful for querying via
#' [graph_knn_query()], especially if the query data is initialized randomly:
#' some items in the data set may not be in the nearest neighbor list of any
#' other item and can therefore never be returned as a neighbor, no matter how
#' close they are to the query. Even those which do appear in at least one
#' neighbor list may not be reachable by expanding an arbitrary starting list if
#' the neighbor graph contains disconnected components.
#'
#' Converting the directed graph represented by the neighbor graph to an
#' undirected graph by adding an edge from item `j` to `i` if
#' an edge exists from `i` to `j` (i.e. creating the mutual neighbor
#' graph) solves the problems above, but can result in inefficient searches.
#' Although the out-degree of each item is restricted to the number of
#' neighbors, the in-degree has no such restriction: a given item could be very
#' "popular" and appear in a large number of neighbor lists. Therefore
#' mutualizing the neighbor graph can leave some items with a large number of
#' neighbors to search.
#' These usually have very similar neighborhoods so there is nothing to be
#' gained from searching all of them.
#'
#' To balance accuracy and search time, the following procedure is carried out:
#'
#' 1. The graph is "diversified" by occlusion pruning.
#' 1. The reverse graph is formed by reversing the direction of all edges in
#' the pruned graph.
#' 1. The reverse graph is diversified by occlusion pruning.
#' 1. The pruned forward and pruned reverse graph are merged.
#' 1. The outdegree of each node in the merged graph is truncated.
#' 1. The truncated merged graph is returned as the prepared search graph.
#'
#' Explicit zero distances in the `graph` will be converted to a small positive
#' number to avoid being dropped in the sparse representation. The one exception
#' is the "self" distance, i.e. any edge in the `graph` which links a node to
#' itself (the diagonal of the sparse distance matrix). These trivial edges
#' aren't useful for search purposes and are always dropped.
#'
#' @param data Matrix of `n` items, with observations in the rows and features
#' in the columns. Optionally, input can be passed with observations in the
#' columns, by setting `obs = "C"`, which should be more efficient. Possible
#' formats are [base::data.frame()], [base::matrix()] or
#' [Matrix::sparseMatrix()]. Sparse matrices should be in `dgCMatrix` format.
#' Dataframes will be converted to `numerical` matrix format internally, so if
#' your data columns are `logical` and intended to be used with the
#' specialized binary `metric`s, you should convert it to a logical matrix
#' first (otherwise you will get the slower dense numerical version).
#' @param graph neighbor graph for `data`, a list containing:
#' * `idx` an `n` by `k` matrix containing the nearest neighbor indices of
#' the data in `data`.
#' * `dist` an `n` by `k` matrix containing the nearest neighbor distances.
#' @param metric Type of distance calculation to use. One of:
#' - `"braycurtis"`
#' - `"canberra"`
#' - `"chebyshev"`
#' - `"correlation"` (1 minus the Pearson correlation)
#' - `"cosine"`
#' - `"dice"`
#' - `"euclidean"`
#' - `"hamming"`
#' - `"hellinger"`
#' - `"jaccard"`
#' - `"jensenshannon"`
#' - `"kulsinski"`
#' - `"sqeuclidean"` (squared Euclidean)
#' - `"manhattan"`
#' - `"rogerstanimoto"`
#' - `"russellrao"`
#' - `"sokalmichener"`
#' - `"sokalsneath"`
#' - `"spearmanr"` (1 minus the Spearman rank correlation)
#' - `"symmetrickl"` (symmetric Kullback-Leibler divergence)
#' - `"tsss"` (Triangle Area Similarity-Sector Area Similarity or TS-SS
#' metric)
#' - `"yule"`
#'
#' For non-sparse data, the following variants are available with
#' preprocessing: this trades memory for a potential speed up during the
#' distance calculation. Some minor numerical differences should be expected
#' compared to the non-preprocessed versions:
#' - `"cosine-preprocess"`: `cosine` with preprocessing.
#' - `"correlation-preprocess"`: `correlation` with preprocessing.
#'
#' For non-sparse binary data passed as a `logical` matrix, the following
#' metrics have specialized variants which should be substantially faster than
#' the non-binary variants (in other cases the logical data will be treated as
#' a dense numeric vector of 0s and 1s):
#' - `"dice"`
#' - `"hamming"`
#' - `"jaccard"`
#' - `"kulsinski"`
#' - `"matching"`
#' - `"rogerstanimoto"`
#' - `"russellrao"`
#' - `"sokalmichener"`
#' - `"sokalsneath"`
#' - `"yule"`
#' @param use_alt_metric If `TRUE`, use faster metrics that maintain the
#' ordering of distances internally (e.g. squared Euclidean distances if using
#' `metric = "euclidean"`), then apply a correction at the end. Probably
#' the only reason to set this to `FALSE` is if you suspect that some
#' sort of numeric issue is occurring with your data in the alternative code
#' path.
#' @param diversify_prob the degree of diversification of the search graph
#' by removing unnecessary edges through occlusion pruning. This should take a
#' value between `0` (no diversification) and `1` (remove as many edges as
#' possible) and is treated as the probability of a neighbor being removed if
#' it is found to be an "occlusion". If item `p` and `q`, two members of the
#' neighbor list of item `i`, are closer to each other than they are to `i`,
#' then the nearer neighbor `p` is said to "occlude" `q`. It is likely that
#' `q` will be in the neighbor list of `p` so there is no need to retain it in
#' the neighbor list of `i`. You may also set this to `NULL` to skip any
#' occlusion pruning. Note that occlusion pruning is carried out twice, once
#' to the forward neighbors, and once to the reverse neighbors. Reducing this
#' value will result in a denser graph. This is similar to increasing the
#' "alpha" parameter used in the DiskAnn pruning method of Subramanya and
#' co-workers (2019).
#' @param pruning_degree_multiplier How strongly to truncate the final neighbor
#' list for each item. The neighbor list of each item will be truncated to
#' retain only the closest `d` neighbors, where
#' `d = k * pruning_degree_multiplier`, and `k` is the
#' original number of neighbors per item in `graph`. Roughly, values
#' larger than `1` will keep all the nearest neighbors of an item, plus
#' the given fraction of reverse neighbors (if they exist). For example,
#' setting this to `1.5` will keep all the forward neighbors and then
#' half as many of the reverse neighbors, although exactly which neighbors are
#' retained is also dependent on any occlusion pruning that occurs. Set this
#' to `NULL` to skip this step.
#' @param prune_reverse If `TRUE`, prune the reverse neighbors of each item
#' before the reverse graph diversification step using
#' `pruning_degree_multiplier`. Because the number of reverse neighbors can be
#' much larger than the number of forward neighbors, this can help to avoid
#' excessive computation during the diversification step, with little overall
#' effect on the final search graph. Default is `FALSE`.
#' @param n_threads Number of threads to use.
#' @param verbose If `TRUE`, log information to the console.
#' @param obs set to `"C"` to indicate that the input `data` orientation stores
#' each observation as a column. The default `"R"` means that observations are
#' stored in each row. Storing the data by row is usually more convenient, but
#' internally your data will be converted to column storage. Passing it
#' already column-oriented will save some memory and (a small amount of) CPU
#' usage.
#' @return a search graph for `data` based on `graph`, represented as a sparse
#' matrix, suitable for use with [graph_knn_query()].
#' @examples
#' # 100 reference iris items
#' iris_ref <- iris[iris$Species %in% c("setosa", "versicolor"), ]
#'
#' # 50 query items
#' iris_query <- iris[iris$Species == "versicolor", ]
#'
#' # First, find the approximate 4-nearest neighbor graph for the references:
#' ref_ann_graph <- nnd_knn(iris_ref, k = 4)
#'
#' # Create a graph for querying with
#' ref_search_graph <- prepare_search_graph(iris_ref, ref_ann_graph)
#'
#' # Using the search graph rather than the ref_ann_graph directly may give
#' # more accurate or faster results
#' iris_query_nn <- graph_knn_query(
#' query = iris_query, reference = iris_ref,
#' reference_graph = ref_search_graph, k = 4, metric = "euclidean",
#' verbose = TRUE
#' )
#' @references
#' Harwood, B., & Drummond, T. (2016).
#' Fanng: Fast approximate nearest neighbour graphs.
#' In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*
#' (pp. 5713-5722).
#'
#' Jayaram Subramanya, S., Devvrit, F., Simhadri, H. V., Krishnaswamy, R., & Kadekodi, R. (2019).
#' Diskann: Fast accurate billion-point nearest neighbor search on a single node.
#' *Advances in Neural Information Processing Systems*, *32*.
#' @seealso [graph_knn_query()]
#' @export
prepare_search_graph <- function(data,
graph,
metric = "euclidean",
use_alt_metric = TRUE,
diversify_prob = 1.0,
pruning_degree_multiplier = 1.5,
prune_reverse = FALSE,
n_threads = 0,
verbose = FALSE,
obs = "R") {
obs <- match.arg(toupper(obs), c("C", "R"))
if (!is.null(pruning_degree_multiplier)) {
stopifnot(pruning_degree_multiplier > 0)
}
if (!is.null(diversify_prob)) {
stopifnot(
diversify_prob <= 1,
diversify_prob >= 0
)
}
actual_metric <-
get_actual_metric(use_alt_metric, metric, data, verbose)
if (is_sparse(graph)) {
sp <- Matrix::t(graph)
n_nbrs <- mean(diff(sp@p))
} else {
n_nbrs <- check_graph(graph)$k
tsmessage("Converting graph to sparse format")
sp <- graph_to_csparse(graph)
}
if (use_alt_metric) {
sp@x <-
apply_alt_metric_uncorrection(metric, sp@x, is_sparse(data))
}
sp <- preserve_zeros(sp)
data <- x2m(data)
if (obs == "R") {
data <- Matrix::t(data)
}
if (!is.null(diversify_prob) && diversify_prob > 0) {
tsmessage("Diversifying forward graph")
fdiv <- diversify(
data,
sp,
metric = actual_metric,
prune_probability = diversify_prob,
verbose = verbose,
n_threads = n_threads
)
} else {
fdiv <- sp
tsmessage(
"Forward graph has # edges = ",
Matrix::nnzero(fdiv),
" (",
formatC(100 * nn_sparsity_sp(fdiv)),
"% sparse)"
)
}
rsp <- reverse_knn_sp(fdiv)
if (!is.null(diversify_prob) && diversify_prob > 0) {
if (prune_reverse && !is.null(pruning_degree_multiplier) &&
!is.infinite(pruning_degree_multiplier)) {
max_degree <- max(round(n_nbrs * pruning_degree_multiplier), 1)
tsmessage("Degree pruning reverse graph to max degree: ", max_degree)
rsp <-
degree_prune(
rsp,
max_degree = max_degree,
verbose = verbose,
n_threads = n_threads
)
}
tsmessage("Diversifying reverse graph")
rdiv <- diversify(
data,
rsp,
metric = actual_metric,
prune_probability = diversify_prob,
verbose = verbose,
n_threads = n_threads
)
} else {
rdiv <- rsp
tsmessage(
"Reverse graph has # edges = ",
Matrix::nnzero(rdiv),
" (",
formatC(100 * nn_sparsity_sp(rdiv)),
"% sparse)"
)
}
tsmessage("Merging diversified forward and reverse graph")
merged <- merge_graphs_sp(fdiv, rdiv)
if (!is.null(pruning_degree_multiplier) &&
!is.infinite(pruning_degree_multiplier)) {
max_degree <- max(round(n_nbrs * pruning_degree_multiplier), 1)
tsmessage("Degree pruning merged graph to max degree: ", max_degree)
res <-
degree_prune(
merged,
max_degree = max_degree,
verbose = verbose,
n_threads = n_threads
)
} else {
res <- merged
tsmessage(
"Merged graph has # edges = ",
Matrix::nnzero(res),
" (",
formatC(100 * nn_sparsity_sp(res)),
"% sparse)"
)
}
if (use_alt_metric) {
res@x <-
apply_alt_metric_correction(metric, res@x, is_sparse(data))
}
tsmessage("Finished preparing search graph")
# return this pre-transposed so we don't have to do it in graph_knn_query
Matrix::t(res)
}
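# A hedged sketch (function name illustrative only) of how the two pruning
# controls documented above change the density of the prepared search graph:
# a lower diversify_prob and a higher pruning_degree_multiplier keep more
# edges, at the cost of a slower (but usually more accurate) query.
compare_search_graph_density <- function() {
  iris_ref <- iris[iris$Species %in% c("setosa", "versicolor"), ]
  ref_graph <- nnd_knn(iris_ref, k = 4)
  denser <- prepare_search_graph(iris_ref, ref_graph,
    diversify_prob = 0.5, pruning_degree_multiplier = 2.0
  )
  sparser <- prepare_search_graph(iris_ref, ref_graph,
    diversify_prob = 1.0, pruning_degree_multiplier = 1.0
  )
  # number of retained edges in each search graph
  c(denser = Matrix::nnzero(denser), sparser = Matrix::nnzero(sparser))
}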
# Harwood, B., & Drummond, T. (2016).
# Fanng: Fast approximate nearest neighbour graphs.
# In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*
# (pp. 5713-5722).
# "Occlusion pruning"
#
# prune_probability behaves a bit like (but in the opposite direction of) alpha
# Jayaram Subramanya, S., Devvrit, F., Simhadri, H. V., Krishnaswamy, R., & Kadekodi, R. (2019).
# Diskann: Fast accurate billion-point nearest neighbor search on a single node.
# Advances in Neural Information Processing Systems, 32.
# the pynndescent implementation at cmuparlay/pbbsbench uses alpha instead of
# a prune_probability (see also https://arxiv.org/abs/2305.04359)
diversify <- function(data,
graph,
metric = "euclidean",
prune_probability = 1.0,
n_threads = 0,
verbose = FALSE) {
nnz_before <- Matrix::nnzero(graph)
sp_before <- nn_sparsity_sp(graph)
stopifnot(methods::is(graph, "sparseMatrix"))
gl <- csparse_to_list(graph)
tsmessage("Occlusion pruning with probability: ", prune_probability)
args <- list(
graph_list = gl,
metric = metric,
prune_probability = prune_probability,
n_threads = n_threads,
verbose = verbose
)
if (is_sparse(data)) {
gl_div <- do.call(rnn_sparse_diversify, c(args, list(
ind = data@i,
ptr = data@p,
data = data@x,
ndim = nrow(data)
)))
} else if (is.logical(data)) {
gl_div <- do.call(rnn_logical_diversify, c(args, list(
data = data
)))
} else {
gl_div <- do.call(rnn_diversify, c(args, list(
data = data
)))
}
res <- list_to_sparse(gl_div)
nnz_after <- Matrix::nnzero(res)
tsmessage(
"Diversifying reduced # edges from ",
nnz_before,
" to ",
nnz_after,
" (",
formatC(100 * sp_before),
"% to ",
formatC(100 * nn_sparsity_sp(res)),
"% sparse)"
)
res
}
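# A pure R sketch of the occlusion rule applied above (illustrative only; the
# real work is done in C++ by the rnn_*_diversify routines). For one item,
# candidates are visited in order of increasing distance and a candidate q is
# dropped, with probability prune_prob, if an already retained neighbor p
# satisfies d(p, q) < d(item, q).
occlusion_prune_one <- function(data, item, nbrs, dist_fun, prune_prob = 1.0) {
  # distances from the item to each of its neighbor candidates
  nbr_dists <- vapply(
    nbrs, function(q) dist_fun(data[item, ], data[q, ]), numeric(1)
  )
  ord <- order(nbr_dists)
  nbrs <- nbrs[ord]
  nbr_dists <- nbr_dists[ord]
  retained <- integer(0)
  for (j in seq_along(nbrs)) {
    q <- nbrs[j]
    occluded <- any(vapply(retained, function(p) {
      dist_fun(data[p, ], data[q, ]) < nbr_dists[j]
    }, logical(1)))
    if (occluded && stats::runif(1) < prune_prob) next
    retained <- c(retained, q)
  }
  retained
}
# e.g. occlusion_prune_one(as.matrix(iris[, 1:4]), 1, 2:10,
#                          function(x, y) sqrt(sum((x - y)^2)))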
# FANNG: Fast Approximate Nearest Neighbour Graphs
# "truncating"
# Harwood, B., & Drummond, T. (2016).
# Fanng: Fast approximate nearest neighbour graphs.
# In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*
# (pp. 5713-5722).
degree_prune <-
function(graph,
max_degree = 20,
n_threads = 0,
verbose = FALSE) {
stopifnot(methods::is(graph, "sparseMatrix"))
nnz_before <- Matrix::nnzero(graph)
sp_before <- nn_sparsity_sp(graph)
gl <- csparse_to_list(graph)
gl_div <-
rnn_degree_prune(gl, max_degree, n_threads = n_threads)
res <- list_to_sparse(gl_div)
nnz_after <- Matrix::nnzero(res)
tsmessage(
"Degree pruning to max ",
max_degree,
" reduced # edges from ",
nnz_before,
" to ",
nnz_after,
" (",
formatC(100 * sp_before),
"% to ",
formatC(100 * nn_sparsity_sp(res)),
"% sparse)"
)
res
}
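# The same truncation step, sketched in plain R for a single node's neighbor
# list (illustrative only; the function above works on the whole sparse graph
# via rnn_degree_prune): keep only the max_degree nearest neighbors.
truncate_neighbors <- function(nbr_idx, nbr_dist, max_degree = 20) {
  keep <- order(nbr_dist)[seq_len(min(max_degree, length(nbr_dist)))]
  list(idx = nbr_idx[keep], dist = nbr_dist[keep])
}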
merge_graphs_sp <- function(g1, g2) {
gl1 <- csparse_to_list(g1)
gl2 <- csparse_to_list(g2)
gl_merge <- rnn_merge_graph_lists(gl1, gl2)
list_to_sparse(gl_merge)
}
nn_sparsity <- function(graph) {
sum(graph$idx == 0) / prod(dim(graph$idx))
}
nn_sparsity_sp <- function(graph) {
Matrix::nnzero(graph) / prod(graph@Dim)
}
reverse_knn_sp <- function(graph) {
stopifnot(methods::is(graph, "sparseMatrix"))
Matrix::t(Matrix::drop0(graph))
}
# Merge -------------------------------------------------------------------
#' Merge multiple approximate nearest neighbors graphs
#'
#' `merge_knn` takes a list of nearest neighbor graphs and merges them into a
#' single graph, with the same number of neighbors as the first graph. This is
#' useful to combine the results of multiple different nearest neighbor
#' searches: the output will be at least as accurate as the most accurate of
#' the input graphs, and ideally will be more accurate than any single one.
#'
#' @param graphs A list of nearest neighbor graphs to merge. Each item in the
#' list should consist of a sub-list containing:
#' * `idx` an n by k matrix containing the k nearest neighbor indices.
#' * `dist` an n by k matrix containing k nearest neighbor distances.
#' The number of neighbors can differ between graphs, but the merged result
#' will have the same number of neighbors as the first graph in the list.
#' @param is_query If `TRUE` then the graphs are treated as the result of a knn
#' query, not a knn building process. Or: is the graph bipartite? This should
#' be set to `TRUE` if the `graphs` are the results of using e.g.
#' [graph_knn_query()] or [random_knn_query()], and set to `FALSE` if these
#' are the results of [nnd_knn()] or [random_knn()]. The difference is that if
#' `is_query = FALSE` and an index `p` is found in row `i` of one graph, i.e. `p`
#' is a neighbor of `i` with distance `d`, then it is assumed that `i` is a
#' neighbor of `p` with the same distance. If `is_query = TRUE`, then `i` and
#' `p` are indexes into two different datasets and the symmetry does not hold.
#' If you aren't sure what case applies to you, it's safe (but potentially
#' inefficient) to set `is_query = TRUE`.
#' @param n_threads Number of threads to use.
#' @param verbose If `TRUE`, log information to the console.
#' @return a list containing:
#' * `idx` an n by k matrix containing the merged nearest neighbor indices.
#' * `dist` an n by k matrix containing the merged nearest neighbor distances.
#'
#' The size of `k` in the output graph is the same as that of the first
#' item in `graphs`.
#' @examples
#' set.seed(1337)
#' # Nearest neighbor descent with 15 neighbors for iris three times,
#' # starting from a different random initialization each time
#' iris_rnn1 <- nnd_knn(iris, k = 15, n_iters = 1)
#' iris_rnn2 <- nnd_knn(iris, k = 15, n_iters = 1)
#' iris_rnn3 <- nnd_knn(iris, k = 15, n_iters = 1)
#'
#' # Merged results should be an improvement over individual results
#' iris_mnn <- merge_knn(list(iris_rnn1, iris_rnn2, iris_rnn3))
#' sum(iris_mnn$dist) < sum(iris_rnn1$dist)
#' sum(iris_mnn$dist) < sum(iris_rnn2$dist)
#' sum(iris_mnn$dist) < sum(iris_rnn3$dist)
#' @export
merge_knn <- function(graphs,
is_query = FALSE,
n_threads = 0,
verbose = FALSE) {
if (length(graphs) == 0) {
return(list())
}
validate_are_mergeablel(graphs)
rnn_merge_nn_all(graphs,
is_query,
n_threads = n_threads,
verbose = verbose
)
}
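# A hedged sketch (function name illustrative only) of merging bipartite query
# results, where is_query = TRUE is required because query and reference items
# index different datasets.
merge_query_results <- function() {
  iris_ref <- iris[iris$Species %in% c("setosa", "versicolor"), ]
  iris_query <- iris[iris$Species == "versicolor", ]
  q1 <- random_knn_query(iris_query, reference = iris_ref, k = 4)
  q2 <- random_knn_query(iris_query, reference = iris_ref, k = 4)
  merge_knn(list(q1, q2), is_query = TRUE)
}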
# Overlap -----------------------------------------------------------------
#' Overlap between the indices of two nearest neighbor graphs
#'
#' Calculates the mean proportion of neighbors in common between the two
#' graphs. The per-item overlap can also be returned. This function can be
#' useful as a measure of accuracy of approximation algorithms, if the
#' exact nearest neighbors are known, or as a measure of diversity of two
#' different approximate graphs.
#'
#' The graph format is the same as that returned by e.g. [nnd_knn()] and should
#' be of dimensions n by k, where n is the number of points and k is the number
#' of neighbors. If you pass a neighbor graph directly, the index matrix will be
#' extracted if present. If the two graphs have different numbers of neighbors,
#' then the smaller number of neighbors is used.
#'
#' @param idx1 Indices of a nearest neighbor graph, i.e. a matrix of nearest
#' neighbor indices. Can also be a list containing an `idx` element.
#' @param idx2 Indices of a nearest neighbor graph, i.e. a matrix of nearest
#' neighbor indices. Can also be a list containing an `idx` element. This is
#' considered to be the ground truth.
#' @param k Number of neighbors to consider. If `NULL`, then the minimum of the
#' number of neighbors in `idx1` and `idx2` is used.
#' @param ret_vec If `TRUE`, also return a vector containing the per-item overlap.
#' @return The mean overlap between `idx1` and `idx2`. If `ret_vec = TRUE`,
#' then a list containing the mean overlap and the overlap of each item is
#' returned, with names `mean` and `overlaps`, respectively.
#' @examples
#' set.seed(1337)
#' # Generate two random neighbor graphs for iris
#' iris_rnn1 <- random_knn(iris, k = 15)
#' iris_rnn2 <- random_knn(iris, k = 15)
#'
#' # Overlap between the two graphs
#' mean_overlap <- neighbor_overlap(iris_rnn1, iris_rnn2)
#'
#' # Also get a vector of per-item overlap
#' overlap_res <- neighbor_overlap(iris_rnn1, iris_rnn2, ret_vec = TRUE)
#' summary(overlap_res$overlaps)
#' @export
neighbor_overlap <-
function(idx1,
idx2,
k = NULL,
ret_vec = FALSE) {
vec <- neighbor_overlapv(idx1, idx2, k)
mean_overlap <- mean(vec)
if (ret_vec) {
res <- list(
mean = mean_overlap,
overlaps = vec
)
} else {
res <- mean_overlap
}
res
}
neighbor_overlapv <-
function(idx,
ref_idx,
k = NULL) {
if (is.list(idx)) {
idx <- idx$idx
}
if (is.list(ref_idx)) {
ref_idx <- ref_idx$idx
}
if (is.null(k)) {
k <- min(ncol(idx), ncol(ref_idx))
}
if (ncol(ref_idx) < k) {
stop("Not enough columns in ref_idx for k = ", k)
}
n <- nrow(idx)
if (nrow(ref_idx) != n) {
stop("Not enough rows in ref_idx")
}
nbr_start <- 1
nbr_end <- k
ref_start <- nbr_start
ref_end <- nbr_end
nbr_range <- nbr_start:nbr_end
ref_range <- ref_start:ref_end
total_intersect <- rep(0, times = n)
for (i in 1:n) {
total_intersect[i] <-
length(intersect(idx[i, nbr_range], ref_idx[i, ref_range]))
}
total_intersect / k
}
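# A tiny worked example of the per-item overlap computed above (illustrative
# only): each row's overlap is the size of the intersection divided by k.
overlap_toy_example <- function() {
  idx_a <- rbind(c(1, 2, 3), c(4, 5, 6))
  idx_b <- rbind(c(1, 2, 4), c(6, 5, 4))
  # first row shares {1, 2} -> 2/3; second row shares {4, 5, 6} -> 1
  neighbor_overlapv(idx_a, idx_b)
}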