# R/rnndescent.R

# API ---------------------------------------------------------------------

#' Build approximate nearest neighbors index and neighbor graph
#'
#' This function builds an approximate nearest neighbors graph with convenient
#' defaults, then prepares the index for querying new data, for later use with
#' [rnnd_query()]. For more control over the process, please see the other
#' functions in the package.
#'
#' The k-nearest neighbor graph is built using Random Projection Forests
#' (Dasgupta and Freund, 2008) for initialization and Nearest Neighbor Descent
#' (Dong and co-workers, 2011) for refinement. Index preparation uses the
#' graph diversification method of Harwood and Drummond (2016).
#'
#' @param data Matrix of `n` items to generate neighbors for, with observations
#'   in the rows and features in the columns. Optionally, input can be passed
#'   with observations in the columns, by setting `obs = "C"`, which should be
#'   more efficient. Possible formats are [base::data.frame()], [base::matrix()]
#'   or [Matrix::sparseMatrix()]. Sparse matrices should be in `dgCMatrix`
#'   format. Dataframes will be converted to `numerical` matrix format
#'   internally, so if your data columns are `logical` and intended to be used
#'   with the specialized binary `metric`s, you should convert it to a logical
#'   matrix first (otherwise you will get the slower dense numerical version).
#' @param k Number of nearest neighbors to build the index for. You can specify
#'   a different number when running `rnnd_query`, but the index is calibrated
#'   using this value so it's recommended to set `k` according to the likely
#'   number of neighbors you will want to retrieve. Optional if `init` is
#'   specified, in which case `k` can be inferred from the `init` data. If you
#'   do both, then the specified version of `k` will take precedence.
#' @param metric Type of distance calculation to use. One of:
#'   - `"braycurtis"`
#'   - `"canberra"`
#'   - `"chebyshev"`
#'   - `"correlation"` (1 minus the Pearson correlation)
#'   - `"cosine"`
#'   - `"dice"`
#'   - `"euclidean"`
#'   - `"hamming"`
#'   - `"hellinger"`
#'   - `"jaccard"`
#'   - `"jensenshannon"`
#'   - `"kulsinski"`
#'   - `"sqeuclidean"` (squared Euclidean)
#'   - `"manhattan"`
#'   - `"rogerstanimoto"`
#'   - `"russellrao"`
#'   - `"sokalmichener"`
#'   - `"sokalsneath"`
#'   - `"spearmanr"` (1 minus the Spearman rank correlation)
#'   - `"symmetrickl"` (symmetric Kullback-Leibler divergence)
#'   - `"tsss"` (Triangle Area Similarity-Sector Area Similarity or TS-SS
#'   metric)
#'   - `"yule"`
#'
#'   For non-sparse data, the following variants are available with
#'   preprocessing: this trades memory for a potential speed up during the
#'   distance calculation. Some minor numerical differences should be expected
#'   compared to the non-preprocessed versions:
#'   - `"cosine-preprocess"`: `cosine` with preprocessing.
#'   - `"correlation-preprocess"`: `correlation` with preprocessing.
#'
#'   For non-sparse binary data passed as a `logical` matrix, the following
#'   metrics have specialized variants which should be substantially faster than
#'   the non-binary variants (in other cases the logical data will be treated as
#'   a dense numeric vector of 0s and 1s):
#'   - `"dice"`
#'   - `"hamming"`
#'   - `"jaccard"`
#'   - `"kulsinski"`
#'   - `"matching"`
#'   - `"rogerstanimoto"`
#'   - `"russellrao"`
#'   - `"sokalmichener"`
#'   - `"sokalsneath"`
#'   - `"yule"`
#' @param use_alt_metric If `TRUE`, use faster metrics that maintain the
#'   ordering of distances internally (e.g. squared Euclidean distances if using
#'   `metric = "euclidean"`), then apply a correction at the end. Probably
#'   the only reason to set this to `FALSE` is if you suspect that some
#'   sort of numeric issue is occurring with your data in the alternative code
#'   path.
#' @param init Name of the initialization strategy or initial `data` neighbor
#'   graph to optimize. One of:
#'   - `"rand"` random initialization (the default).
#'   - `"tree"` use the random projection tree method of Dasgupta and Freund
#'   (2008).
#'   - a pre-calculated neighbor graph. A list containing:
#'       - `idx` an `n` by `k` matrix containing the nearest neighbor indices.
#'       - `dist` (optional) an `n` by `k` matrix containing the nearest
#'       neighbor distances. If the input distances are omitted, they will be
#'       calculated for you.
#'
#'   If `k` and `init` are specified as arguments to this function, and the
#'   number of neighbors provided in `init` is not equal to `k` then:
#'
#'   * if `k` is smaller, only the `k` closest values in `init` are retained.
#'   * if `k` is larger, then random neighbors will be chosen to fill `init` to
#'   the size of `k`. Note that there is no checking if any of the random
#'   neighbors are duplicates of what is already in `init` so effectively fewer
#'   than `k` neighbors may be chosen for some observations under these
#'   circumstances.
#' @param n_trees The number of trees to use in the RP forest. A larger number
#'   will give more accurate results at the cost of a longer computation time.
#'   The default of `NULL` means that the number is chosen based on the number
#'   of observations in `data`. Only used if `init = "tree"`.
#' @param leaf_size The maximum number of items that can appear in a leaf. This
#'   value should be chosen to match the expected number of neighbors you will
#'   want to retrieve when running queries (e.g. if you want to find 50 nearest
#'   neighbors set `leaf_size = 50`) and should not be set to a value smaller
#'   than `10`. Only used if `init = "tree"`.
#' @param max_tree_depth The maximum depth of the tree to build (default = 200).
#'   If the maximum tree depth is exceeded then the leaf size of a tree may
#'   exceed `leaf_size` which can result in a large number of neighbor distances
#'   being calculated. If `verbose = TRUE` a message will be logged to indicate
#'   that the leaf size is large. However, increasing the `max_tree_depth` may
#'   not help: it may be that there is something unusual about the distribution
#'   of your data set under your chosen `metric` that makes a tree-based
#'   initialization inappropriate. Only used if `init = "tree"`.
#' @param margin A character string specifying the method used to assign points
#'   to one side of the hyperplane or the other. Possible values are:
#'   - `"explicit"` categorizes all distance metrics as either Euclidean or
#'   Angular (Euclidean after normalization), explicitly calculates a hyperplane
#'   and offset, and then calculates the margin based on the dot product with
#'   the hyperplane.
#'   - `"implicit"` calculates the distance from a point to each of the
#'   points defining the normal vector. The margin is calculated by comparing the
#'   two distances: the point is assigned to the side of the hyperplane that
#'   the normal vector point with the closest distance belongs to.
#'   - `"auto"` (the default) picks the margin method depending on whether a
#'   binary-specific `metric` such as `"bhamming"` is chosen, in which case
#'   `"implicit"` is used, and `"explicit"` otherwise: binary-specific metrics
#'   involve storing the data in a way that isn't very efficient for the
#'   `"explicit"` method and the binary-specific metric is usually a lot faster
#'   than the generic equivalent such that the cost of two distance calculations
#'   for the margin method is still faster.
#'
#'   Only used if `init = "tree"`.
#' @param n_iters Number of iterations of nearest neighbor descent to carry out.
#'   By default, this will be chosen based on the number of observations in
#'   `data`.
#' @param max_candidates Maximum number of candidate neighbors to try for each
#'   item in each iteration. Use relative to `k` to emulate the "rho"
#'   sampling parameter in the nearest neighbor descent paper. By default, this
#'   is set to `k` or `60`, whichever is smaller.
#' @param delta The minimum relative change in the neighbor graph allowed before
#'   early stopping. Should be a value between 0 and 1. The smaller the value,
#'   the less progress between iterations is needed for the optimization to
#'   continue. The default value of `0.001` means that at least 0.1% of the
#'   neighbor graph must be updated at each iteration to avoid early stopping.
#' @param low_memory If `TRUE`, use a lower memory, but more
#'   computationally expensive approach to index construction. If set to
#'   `FALSE`, you should see a noticeable speed improvement, especially when
#'   using a smaller number of threads, so this is worth trying if you have the
#'   memory to spare.
#' @param weight_by_degree If `TRUE`, then candidates for the local join are
#'   weighted according to their in-degree, so that if there are more than
#'   `max_candidates` in a candidate list, candidates with a smaller degree are
#'   favored for retention. This prevents items with large numbers of edges
#'   crowding out other items and for high-dimensional data is likely to provide
#'   a small improvement in accuracy. Because this incurs a small extra cost of
#'   counting the degree of each node, and because it tends to delay early
#'   convergence, by default this is `FALSE`.
#' @param n_search_trees The number of trees to keep in the search forest as
#'   part of index preparation. The default is `1`.
#' @param diversify_prob The degree of diversification of the search graph
#'   by removing unnecessary edges through occlusion pruning. This should take a
#'   value between `0` (no diversification) and `1` (remove as many edges as
#'   possible) and is treated as the probability of a neighbor being removed if
#'   it is found to be an "occlusion". If item `p` and `q`, two members of the
#'   neighbor list of item `i`, are closer to each other than they are to `i`,
#'   then the nearer neighbor `p` is said to "occlude" `q`. It is likely that
#'   `q` will be in the neighbor list of `p` so there is no need to retain it in
#'   the neighbor list of `i`. You may also set this to `NULL` to skip any
#'   occlusion pruning. Note that occlusion pruning is carried out twice, once
#'   to the forward neighbors, and once to the reverse neighbors.
#' @param pruning_degree_multiplier How strongly to truncate the final neighbor
#'   list for each item. The neighbor list of each item will be truncated to
#'   retain only the closest `d` neighbors, where
#'   `d = k * pruning_degree_multiplier`, and `k` is the
#'   original number of neighbors per item used to build the index. Roughly, values
#'   larger than `1` will keep all the nearest neighbors of an item, plus
#'   the given fraction of reverse neighbors (if they exist). For example,
#'   setting this to `1.5` will keep all the forward neighbors and then
#'   half as many of the reverse neighbors, although exactly which neighbors are
#'   retained is also dependent on any occlusion pruning that occurs. Set this
#'   to `NULL` to skip this step.
#' @param prune_reverse If `TRUE`, prune the reverse neighbors of each item
#'   before the reverse graph diversification step using
#'   `pruning_degree_multiplier`. Because the number of reverse neighbors can be
#'   much larger than the number of forward neighbors, this can help to avoid
#'   excessive computation during the diversification step, with little overall
#'   effect on the final search graph. Default is `FALSE`.
#' @param n_threads Number of threads to use.
#' @param verbose If `TRUE`, log information to the console.
#' @param progress Determines the type of progress information logged during the
#'   nearest neighbor descent stage when `verbose = TRUE`. Options are:
#'   * `"bar"`: a simple text progress bar.
#'   * `"dist"`: the sum of the distances in the approximate knn graph at the
#'     end of each iteration.
#' @param obs set to `"C"` to indicate that the input `data` orientation stores
#'   each observation as a column. The default `"R"` means that observations are
#'   stored in each row. Storing the data by row is usually more convenient, but
#'   internally your data will be converted to column storage. Passing it
#'   already column-oriented will save some memory and (a small amount of) CPU
#'   usage.
#' @return the approximate nearest neighbor index, a list containing:
#'    * `graph` the k-nearest neighbor graph, a list containing:
#'        * `idx` an n by k matrix containing the nearest neighbor indices.
#'        * `dist` an n by k matrix containing the nearest neighbor distances.
#'    * Other list items are intended only for internal use by other functions
#'    such as [rnnd_query()].
#' @seealso [rnnd_query()]
#' @examples
#' iris_even <- iris[seq_len(nrow(iris)) %% 2 == 0, ]
#' iris_odd <- iris[seq_len(nrow(iris)) %% 2 == 1, ]
#'
#' # Find 4 (approximate) nearest neighbors using Euclidean distance
#' iris_even_index <- rnnd_build(iris_even, k = 4)
#' iris_odd_nbrs <- rnnd_query(index = iris_even_index, query = iris_odd, k = 4)
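#'
#' # A sketch (values are illustrative, not recommendations): the index
#' # preparation parameters documented above can be tuned at build time
#' iris_even_index2 <- rnnd_build(
#'   iris_even,
#'   k = 4,
#'   diversify_prob = 0.5,
#'   pruning_degree_multiplier = 2
#' )
#' # The neighbor graph itself is in iris_even_index2$graph (idx and dist)
#'
#' # If your data is already stored with observations in columns, pass
#' # obs = "C" to avoid an internal transpose; t() here only fakes
#' # column-oriented input from iris_even
#' iris_even_col <- t(as.matrix(iris_even[, 1:4]))
#' iris_even_index_c <- rnnd_build(iris_even_col, k = 4, obs = "C")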
#'
#' @references
#' Dasgupta, S., & Freund, Y. (2008, May).
#' Random projection trees and low dimensional manifolds.
#' In *Proceedings of the fortieth annual ACM symposium on Theory of computing*
#' (pp. 537-546).
#' \doi{10.1145/1374376.1374452}.
#'
#' Dong, W., Moses, C., & Li, K. (2011, March).
#' Efficient k-nearest neighbor graph construction for generic similarity measures.
#' In *Proceedings of the 20th international conference on World Wide Web*
#' (pp. 577-586).
#' ACM.
#' \doi{10.1145/1963405.1963487}.
#'
#' Harwood, B., & Drummond, T. (2016).
#' FANNG: Fast approximate nearest neighbour graphs.
#' In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*
#' (pp. 5713-5722).
#' @export
rnnd_build <- function(data,
                       k = 30,
                       metric = "euclidean",
                       use_alt_metric = TRUE,
                       init = "tree",
                       n_trees = NULL,
                       leaf_size = NULL,
                       max_tree_depth = 200,
                       margin = "auto",
                       n_iters = NULL,
                       delta = 0.001,
                       max_candidates = NULL,
                       low_memory = TRUE,
                       weight_by_degree = FALSE,
                       n_search_trees = 1,
                       pruning_degree_multiplier = 1.5,
                       diversify_prob = 1.0,
                       prune_reverse = FALSE,
                       n_threads = 0,
                       verbose = FALSE,
                       progress = "bar",
                       obs = "R") {
  data <- x2m(data)
  if (obs == "R") {
    data <- Matrix::t(data)
  }

  index <- nnd_knn(
    data,
    k = k,
    metric = metric,
    use_alt_metric = use_alt_metric,
    init = init,
    init_args = list(
      n_trees = n_trees,
      leaf_size = leaf_size,
      max_tree_depth = max_tree_depth,
      margin = margin
    ),
    ret_forest = TRUE,
    n_iters = n_iters,
    delta = delta,
    max_candidates = max_candidates,
    low_memory = low_memory,
    weight_by_degree = weight_by_degree,
    n_threads = n_threads,
    verbose = verbose,
    progress = progress,
    obs = "C"
  )

  index <- list(
    graph = list(idx = index$idx, dist = index$dist),
    forest = index$forest
  )

  index$prep <- list(
    pruning_degree_multiplier = pruning_degree_multiplier,
    diversify_prob = diversify_prob,
    n_search_trees = n_search_trees,
    prune_reverse = prune_reverse,
    is_prepared = FALSE
  )

  index$data <- data
  index$original_metric <- metric
  index$use_alt_metric <- use_alt_metric

  if (!is.null(index$forest)) {
    index$search_forest <-
      rpf_filter(
        nn = index$graph,
        forest = index$forest,
        n_trees = index$prep$n_search_trees,
        n_threads = n_threads,
        verbose = verbose
      )
  }

  search_graph <- prepare_search_graph(
    data = index$data,
    graph = index$graph,
    metric = index$original_metric,
    use_alt_metric = index$use_alt_metric,
    diversify_prob = index$prep$diversify_prob,
    pruning_degree_multiplier = index$prep$pruning_degree_multiplier,
    prune_reverse = index$prep$prune_reverse,
    n_threads = n_threads,
    verbose = verbose,
    obs = "C"
  )
  index$prep$is_prepared <- TRUE
  index$search_graph <- search_graph

  # RP Forests can be large so after preparation we delete this
  index$forest <- NULL
  index
}

#' Query an index for approximate nearest neighbors
#'
#' Takes a nearest neighbor index produced by [rnnd_build()] and uses it to
#' find the nearest neighbors of a query set of observations, using a
#' back-tracking search with the search size determined by the method of
#' Iwasaki and Miyazaki (2018). For further control over the search effort, the
#' total number of distance calculations can also be bounded, similar to the
#' method of Harwood and Drummond (2016).
#'
#' @param index A nearest neighbor index produced by [rnnd_build()].
#' @param query Matrix of `n` query items, with observations in the rows and
#'   features in the columns. Optionally, the data may be passed with the
#'   observations in the columns, by setting `obs = "C"`, which should be more
#'   efficient. Possible formats are [base::data.frame()], [base::matrix()]
#'   or [Matrix::sparseMatrix()]. Sparse matrices should be in `dgCMatrix`
#'   format. Dataframes will be converted to `numerical` matrix format
#'   internally, so if your data columns are `logical` and intended to be used
#'   with the specialized binary `metric`s, you should convert it to a logical
#'   matrix first (otherwise you will get the slower dense numerical version).
#'   Sparse and non-sparse data cannot be mixed, so if the data used to build
#'   the index was sparse, the `query` data must also be sparse, and vice versa.
#' @param k Number of nearest neighbors to return.
#' @param epsilon Controls trade-off between accuracy and search cost, as
#'   described by Iwasaki and Miyazaki (2018). Setting `epsilon` to a positive
#'   value specifies a distance tolerance on whether to explore the neighbors of
#'   candidate points. The larger the value, the more neighbors will be
#'   searched. A value of 0.1 allows query-candidate distances to be 10% larger
#'   than the current most-distant neighbor of the query point, 0.2 means 20%,
#'   and so on. Suggested values are between 0 and 0.5, although this value is
#'   highly dependent on the distribution of distances in the dataset (higher
#'   dimensional data should choose a smaller cutoff). Too large a value of
#'   `epsilon` will result in the query search approaching brute force
#'   comparison. Use this parameter in conjunction with `max_search_fraction` to
#'   prevent excessive run time. Default is 0.1. If you set `verbose = TRUE`,
#'   statistics of the number of distance calculations will be logged which can
#'   help you tune `epsilon`.
#' @param max_search_fraction Maximum fraction of the reference data to search.
#'  This is a value between 0 (search none of the reference data) and 1 (search
#'  all of the data if necessary). This works in conjunction with `epsilon` and
#'  will terminate the search early if the specified fraction of the reference
#'  data has been searched. Default is 1.
#' @param n_threads Number of threads to use.
#' @param init An optional matrix of `k` initial nearest neighbors for each
#'  query point.
#' @param verbose If `TRUE`, log information to the console.
#' @param obs set to `"C"` to indicate that the input `data` orientation stores
#'   each observation as a column. The default `"R"` means that observations are
#'   stored in each row. Storing the data by row is usually more convenient, but
#'   internally your data will be converted to column storage. Passing it
#'   already column-oriented will save some memory and (a small amount of) CPU
#'   usage.
#' @return the approximate nearest neighbor graph as a list containing:
#'  * `idx` an n by k matrix containing the nearest neighbor indices.
#'  * `dist` an n by k matrix containing the nearest neighbor distances.
#' @seealso [rnnd_build()]
#' @examples
#' iris_even <- iris[seq_len(nrow(iris)) %% 2 == 0, ]
#' iris_odd <- iris[seq_len(nrow(iris)) %% 2 == 1, ]
#'
#' iris_even_index <- rnnd_build(iris_even, k = 4)
#' iris_odd_nbrs <- rnnd_query(index = iris_even_index, query = iris_odd, k = 4)
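#'
#' # A sketch of tuning search effort (illustrative values): a larger epsilon
#' # explores more candidates, while max_search_fraction caps how much of the
#' # reference data may be examined before the search stops
#' iris_odd_nbrs_wide <- rnnd_query(
#'   index = iris_even_index,
#'   query = iris_odd,
#'   k = 4,
#'   epsilon = 0.2,
#'   max_search_fraction = 0.5
#' )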
#'
#' @references
#' Harwood, B., & Drummond, T. (2016).
#' FANNG: Fast approximate nearest neighbour graphs.
#' In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*
#' (pp. 5713-5722).
#'
#' Iwasaki, M., & Miyazaki, D. (2018).
#' Optimization of indexing based on k-nearest neighbor graph for proximity search in high-dimensional data.
#' *arXiv preprint* *arXiv:1810.07355*.
#' <https://arxiv.org/abs/1810.07355>
#' @export
rnnd_query <-
  function(index,
           query,
           k = 30,
           epsilon = 0.1,
           max_search_fraction = 1,
           init = NULL,
           n_threads = 0,
           verbose = FALSE,
           obs = "R") {
    if (is.null(init) && !is.null(index$search_forest)) {
      init <- index$search_forest
    }

    query <- x2m(query)
    if (obs == "R") {
      query <- Matrix::t(query)
    }
    res <- graph_knn_query(
      query = query,
      reference = index$data,
      reference_graph = index$search_graph,
      k = k,
      metric = index$original_metric,
      init = init,
      epsilon = epsilon,
      max_search_fraction = max_search_fraction,
      use_alt_metric = index$use_alt_metric,
      n_threads = n_threads,
      verbose = verbose,
      obs = "C"
    )
    res
  }


#' Find approximate nearest neighbors
#'
#' This function builds an approximate nearest neighbors graph of the provided
#' data using convenient defaults. It does not return an index for later
#' querying, to speed the graph construction and reduce the size and complexity
#' of the return value.
#'
#' The k-nearest neighbor graph is built using Random Projection Forests
#' (Dasgupta and Freund, 2008) for initialization and Nearest Neighbor Descent
#' (Dong and co-workers, 2011) for refinement. If you are sure you will not
#' want to query new data, then compared to [rnnd_build()] this function has
#' the advantage of not storing the index, which can be very large.
#'
#' @param data Matrix of `n` items to generate neighbors for, with observations
#'   in the rows and features in the columns. Optionally, input can be passed
#'   with observations in the columns, by setting `obs = "C"`, which should be
#'   more efficient. Possible formats are [base::data.frame()], [base::matrix()]
#'   or [Matrix::sparseMatrix()]. Sparse matrices should be in `dgCMatrix`
#'   format. Dataframes will be converted to `numerical` matrix format
#'   internally, so if your data columns are `logical` and intended to be used
#'   with the specialized binary `metric`s, you should convert it to a logical
#'   matrix first (otherwise you will get the slower dense numerical version).
#' @param k Number of nearest neighbors to return. Optional if `init` is
#'   specified.
#' @param metric Type of distance calculation to use. One of:
#'   - `"braycurtis"`
#'   - `"canberra"`
#'   - `"chebyshev"`
#'   - `"correlation"` (1 minus the Pearson correlation)
#'   - `"cosine"`
#'   - `"dice"`
#'   - `"euclidean"`
#'   - `"hamming"`
#'   - `"hellinger"`
#'   - `"jaccard"`
#'   - `"jensenshannon"`
#'   - `"kulsinski"`
#'   - `"sqeuclidean"` (squared Euclidean)
#'   - `"manhattan"`
#'   - `"rogerstanimoto"`
#'   - `"russellrao"`
#'   - `"sokalmichener"`
#'   - `"sokalsneath"`
#'   - `"spearmanr"` (1 minus the Spearman rank correlation)
#'   - `"symmetrickl"` (symmetric Kullback-Leibler divergence)
#'   - `"tsss"` (Triangle Area Similarity-Sector Area Similarity or TS-SS
#'   metric)
#'   - `"yule"`
#'
#'   For non-sparse data, the following variants are available with
#'   preprocessing: this trades memory for a potential speed up during the
#'   distance calculation. Some minor numerical differences should be expected
#'   compared to the non-preprocessed versions:
#'   - `"cosine-preprocess"`: `cosine` with preprocessing.
#'   - `"correlation-preprocess"`: `correlation` with preprocessing.
#'
#'   For non-sparse binary data passed as a `logical` matrix, the following
#'   metrics have specialized variants which should be substantially faster than
#'   the non-binary variants (in other cases the logical data will be treated as
#'   a dense numeric vector of 0s and 1s):
#'   - `"dice"`
#'   - `"hamming"`
#'   - `"jaccard"`
#'   - `"kulsinski"`
#'   - `"matching"`
#'   - `"rogerstanimoto"`
#'   - `"russellrao"`
#'   - `"sokalmichener"`
#'   - `"sokalsneath"`
#'   - `"yule"`
#' @param use_alt_metric If `TRUE`, use faster metrics that maintain the
#'   ordering of distances internally (e.g. squared Euclidean distances if using
#'   `metric = "euclidean"`), then apply a correction at the end. Probably
#'   the only reason to set this to `FALSE` is if you suspect that some
#'   sort of numeric issue is occurring with your data in the alternative code
#'   path.
#' @param init Name of the initialization strategy or initial `data` neighbor
#'   graph to optimize. One of:
#'   - `"rand"` random initialization (the default).
#'   - `"tree"` use the random projection tree method of Dasgupta and Freund
#'   (2008).
#'   - a pre-calculated neighbor graph. A list containing:
#'       - `idx` an `n` by `k` matrix containing the nearest neighbor indices.
#'       - `dist` (optional) an `n` by `k` matrix containing the nearest
#'       neighbor distances. If the input distances are omitted, they will be
#'       calculated for you.
#'
#'   If `k` and `init` are specified as arguments to this function, and the
#'   number of neighbors provided in `init` is not equal to `k` then:
#'
#'   * if `k` is smaller, only the `k` closest values in `init` are retained.
#'   * if `k` is larger, then random neighbors will be chosen to fill `init` to
#'   the size of `k`. Note that there is no checking if any of the random
#'   neighbors are duplicates of what is already in `init` so effectively fewer
#'   than `k` neighbors may be chosen for some observations under these
#'   circumstances.
#' @param n_trees The number of trees to use in the RP forest. A larger number
#'   will give more accurate results at the cost of a longer computation time.
#'   The default of `NULL` means that the number is chosen based on the number
#'   of observations in `data`. Only used if `init = "tree"`.
#' @param leaf_size The maximum number of items that can appear in a leaf. This
#'   value should be chosen to match the expected number of neighbors you will
#'   want to retrieve when running queries (e.g. if you want to find 50 nearest
#'   neighbors set `leaf_size = 50`) and should not be set to a value smaller
#'   than `10`. Only used if `init = "tree"`.
#' @param max_tree_depth The maximum depth of the tree to build (default = 200).
#'   If the maximum tree depth is exceeded then the leaf size of a tree may
#'   exceed `leaf_size` which can result in a large number of neighbor distances
#'   being calculated. If `verbose = TRUE` a message will be logged to indicate
#'   that the leaf size is large. However, increasing the `max_tree_depth` may
#'   not help: it may be that there is something unusual about the distribution
#'   of your data set under your chosen `metric` that makes a tree-based
#'   initialization inappropriate. Only used if `init = "tree"`.
#' @param margin A character string specifying the method used to assign points
#'   to one side of the hyperplane or the other. Possible values are:
#'   - `"explicit"` categorizes all distance metrics as either Euclidean or
#'   Angular (Euclidean after normalization), explicitly calculates a hyperplane
#'   and offset, and then calculates the margin based on the dot product with
#'   the hyperplane.
#'   - `"implicit"` calculates the distance from a point to each of the
#'   points defining the normal vector. The margin is calculated by comparing the
#'   two distances: the point is assigned to the side of the hyperplane that
#'   the normal vector point with the closest distance belongs to.
#'   - `"auto"` (the default) picks the margin method depending on whether a
#'   binary-specific `metric` such as `"bhamming"` is chosen, in which case
#'   `"implicit"` is used, and `"explicit"` otherwise: binary-specific metrics
#'   involve storing the data in a way that isn't very efficient for the
#'   `"explicit"` method and the binary-specific metric is usually a lot faster
#'   than the generic equivalent such that the cost of two distance calculations
#'   for the margin method is still faster.
#'
#'   Only used if `init = "tree"`.
#' @param n_iters Number of iterations of nearest neighbor descent to carry out.
#'   By default, this will be chosen based on the number of observations in
#'   `data`.
#' @param max_candidates Maximum number of candidate neighbors to try for each
#'   item in each iteration. Use relative to `k` to emulate the "rho"
#'   sampling parameter in the nearest neighbor descent paper. By default, this
#'   is set to `k` or `60`, whichever is smaller.
#' @param delta The minimum relative change in the neighbor graph allowed before
#'   early stopping. Should be a value between 0 and 1. The smaller the value,
#'   the less progress between iterations is needed for the optimization to
#'   continue. The default value of `0.001` means that at least 0.1% of the
#'   neighbor graph must be updated at each iteration to avoid early stopping.
#' @param low_memory If `TRUE`, use a lower memory, but more
#'   computationally expensive approach to index construction. If set to
#'   `FALSE`, you should see a noticeable speed improvement, especially when
#'   using a smaller number of threads, so this is worth trying if you have the
#'   memory to spare.
#' @param weight_by_degree If `TRUE`, then candidates for the local join are
#'   weighted according to their in-degree, so that if there are more than
#'   `max_candidates` in a candidate list, candidates with a smaller degree are
#'   favored for retention. This prevents items with large numbers of edges
#'   crowding out other items and for high-dimensional data is likely to provide
#'   a small improvement in accuracy. Because this incurs a small extra cost of
#'   counting the degree of each node, and because it tends to delay early
#'   convergence, by default this is `FALSE`.
#' @param n_threads Number of threads to use.
#' @param verbose If `TRUE`, log information to the console.
#' @param progress Determines the type of progress information logged during the
#'   nearest neighbor descent stage when `verbose = TRUE`. Options are:
#'   * `"bar"`: a simple text progress bar.
#'   * `"dist"`: the sum of the distances in the approximate knn graph at the
#'     end of each iteration.
#' @param obs set to `"C"` to indicate that the input `data` orientation stores
#'   each observation as a column. The default `"R"` means that observations are
#'   stored in each row. Storing the data by row is usually more convenient, but
#'   internally your data will be converted to column storage. Passing it
#'   already column-oriented will save some memory and (a small amount of) CPU
#'   usage.
#' @return the approximate nearest neighbor graph as a list containing:
#'    * `idx` an n by k matrix containing the nearest neighbor indices.
#'    * `dist` an n by k matrix containing the nearest neighbor distances.
#' @seealso [rnnd_build()], [rnnd_query()]
#' @examples
#'
#' # Find 4 (approximate) nearest neighbors using Euclidean distance
#' iris_knn <- rnnd_knn(iris, k = 4)
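#'
#' # A sketch of the init options documented above (illustrative, not a
#' # recommendation): random initialization instead of the default RP-tree start
#' iris_knn_rand <- rnnd_knn(iris, k = 4, init = "rand")
#'
#' # An existing graph can also be refined by passing it as init; k here is
#' # kept at the size of the initial graph
#' iris_knn_refined <- rnnd_knn(iris, k = 4, init = iris_knn_rand)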
#'
#' @references
#' Dasgupta, S., & Freund, Y. (2008, May).
#' Random projection trees and low dimensional manifolds.
#' In *Proceedings of the fortieth annual ACM symposium on Theory of computing*
#' (pp. 537-546).
#' \doi{10.1145/1374376.1374452}.
#'
#' Dong, W., Moses, C., & Li, K. (2011, March).
#' Efficient k-nearest neighbor graph construction for generic similarity measures.
#' In *Proceedings of the 20th international conference on World Wide Web*
#' (pp. 577-586).
#' ACM.
#' \doi{10.1145/1963405.1963487}.
#'
#' @export
rnnd_knn <- function(data,
                     k = 30,
                     metric = "euclidean",
                     use_alt_metric = TRUE,
                     init = "tree",
                     n_trees = NULL,
                     leaf_size = NULL,
                     max_tree_depth = 200,
                     margin = "auto",
                     n_iters = NULL,
                     delta = 0.001,
                     max_candidates = NULL,
                     weight_by_degree = FALSE,
                     low_memory = TRUE,
                     n_threads = 0,
                     verbose = FALSE,
                     progress = "bar",
                     obs = "R") {
  data <- x2m(data)
  if (obs == "R") {
    data <- Matrix::t(data)
  }

  nnd_knn(
    data,
    k = k,
    metric = metric,
    use_alt_metric = use_alt_metric,
    init = init,
    init_args = list(
      n_trees = n_trees,
      leaf_size = leaf_size,
      max_tree_depth = max_tree_depth,
      margin = margin
    ),
    ret_forest = FALSE,
    n_iters = n_iters,
    delta = delta,
    max_candidates = max_candidates,
    low_memory = low_memory,
    weight_by_degree = weight_by_degree,
    n_threads = n_threads,
    verbose = verbose,
    progress = progress,
    obs = "C"
  )
}


# kNN Construction --------------------------------------------------------

#' Find exact nearest neighbors by brute force
#'
#' Returns the exact nearest neighbors of a dataset. A brute force search is
#' carried out: all possible pairs of points are compared, and the nearest
#' neighbors are returned.
#'
#' This method is accurate but scales poorly with dataset size, so use with
#' caution with larger datasets. Having the exact neighbors as a ground truth to
#' compare with approximate results is useful for benchmarking and determining
#' parameter settings of the approximate methods.
#'
#' @param data Matrix of `n` items to generate neighbors for, with observations
#'   in the rows and features in the columns. Optionally, input can be passed
#'   with observations in the columns, by setting `obs = "C"`, which should be
#'   more efficient. Possible formats are [base::data.frame()], [base::matrix()]
#'   or [Matrix::sparseMatrix()]. Sparse matrices should be in `dgCMatrix`
#'   format. Dataframes will be converted to `numerical` matrix format
#'   internally, so if your data columns are `logical` and intended to be used
#'   with the specialized binary `metric`s, you should convert it to a logical
#'   matrix first (otherwise you will get the slower dense numerical version).
#' @param k Number of nearest neighbors to return.
#' @param metric Type of distance calculation to use. One of:
#'   - `"braycurtis"`
#'   - `"canberra"`
#'   - `"chebyshev"`
#'   - `"correlation"` (1 minus the Pearson correlation)
#'   - `"cosine"`
#'   - `"dice"`
#'   - `"euclidean"`
#'   - `"hamming"`
#'   - `"hellinger"`
#'   - `"jaccard"`
#'   - `"jensenshannon"`
#'   - `"kulsinski"`
#'   - `"sqeuclidean"` (squared Euclidean)
#'   - `"manhattan"`
#'   - `"rogerstanimoto"`
#'   - `"russellrao"`
#'   - `"sokalmichener"`
#'   - `"sokalsneath"`
#'   - `"spearmanr"` (1 minus the Spearman rank correlation)
#'   - `"symmetrickl"` (symmetric Kullback-Leibler divergence)
#'   - `"tsss"` (Triangle Area Similarity-Sector Area Similarity or TS-SS
#'   metric)
#'   - `"yule"`
#'
#'   For non-sparse data, the following variants are available with
#'   preprocessing: this trades memory for a potential speed up during the
#'   distance calculation. Some minor numerical differences should be expected
#'   compared to the non-preprocessed versions:
#'   - `"cosine-preprocess"`: `cosine` with preprocessing.
#'   - `"correlation-preprocess"`: `correlation` with preprocessing.
#'
#'   For non-sparse binary data passed as a `logical` matrix, the following
#'   metrics have specialized variants which should be substantially faster than
#'   the non-binary variants (in other cases the logical data will be treated as
#'   a dense numeric vector of 0s and 1s):
#'   - `"dice"`
#'   - `"hamming"`
#'   - `"jaccard"`
#'   - `"kulsinski"`
#'   - `"matching"`
#'   - `"rogerstanimoto"`
#'   - `"russellrao"`
#'   - `"sokalmichener"`
#'   - `"sokalsneath"`
#'   - `"yule"`
#' @param use_alt_metric If `TRUE`, use faster metrics that maintain the
#'   ordering of distances internally (e.g. squared Euclidean distances if using
#'   `metric = "euclidean"`), then apply a correction at the end. Probably
#'   the only reason to set this to `FALSE` is if you suspect that some
#'   sort of numeric issue is occurring with your data in the alternative code
#'   path.
#' @param n_threads Number of threads to use.
#' @param verbose If `TRUE`, log information to the console.
#' @param obs set to `"C"` to indicate that the input `data` orientation stores
#'   each observation as a column. The default `"R"` means that observations are
#'   stored in each row. Storing the data by row is usually more convenient, but
#'   internally your data will be converted to column storage. Passing it
#'   already column-oriented will save some memory and (a small amount of) CPU
#'   usage.
#' @return the nearest neighbor graph as a list containing:
#'   * `idx` an n by k matrix containing the nearest neighbor indices.
#'   * `dist` an n by k matrix containing the nearest neighbor distances.
#' @examples
#' # Find the 4 nearest neighbors using Euclidean distance
#' # If you pass a data frame, non-numeric columns are removed
#' iris_nn <- brute_force_knn(iris, k = 4, metric = "euclidean")
#'
#' # Manhattan (l1) distance
#' iris_nn <- brute_force_knn(iris, k = 4, metric = "manhattan")
#'
#' # Multi-threading: you can choose the number of threads to use: in real
#' # usage, you will want to set n_threads to at least 2
#' iris_nn <- brute_force_knn(iris, k = 4, metric = "manhattan", n_threads = 1)
#'
#' # Use verbose flag to see information about progress
#' iris_nn <- brute_force_knn(iris, k = 4, metric = "euclidean", verbose = TRUE)
#' @export
brute_force_knn <- function(data,
                            k,
                            metric = "euclidean",
                            use_alt_metric = TRUE,
                            n_threads = 0,
                            verbose = FALSE,
                            obs = "R") {
  obs <- match.arg(toupper(obs), c("C", "R"))
  n_obs <- switch(obs,
    R = nrow,
    C = ncol,
    stop("Unknown obs type")
  )

  data <- x2m(data)
  check_k(k, n_obs(data))

  actual_metric <- get_actual_metric(use_alt_metric, metric, data, verbose)

  tsmessage(
    thread_msg(
      "Calculating brute force k-nearest neighbors with k = ",
      k,
      n_threads = n_threads
    )
  )
  if (obs == "R") {
    data <- Matrix::t(data)
  }
  if (is_sparse(data)) {
    res <-
      rnn_sparse_brute_force(
        ind = data@i,
        ptr = data@p,
        data = data@x,
        ndim = nrow(data),
        nnbrs = k,
        metric = actual_metric,
        n_threads = n_threads,
        verbose = verbose
      )
  } else if (is.logical(data)) {
    res <-
      rnn_logical_brute_force(data,
        k,
        actual_metric,
        n_threads = n_threads,
        verbose = verbose
      )
  } else {
    res <-
      rnn_brute_force(data,
        k,
        actual_metric,
        n_threads = n_threads,
        verbose = verbose
      )
  }

  res$idx <- res$idx + 1

  if (use_alt_metric) {
    res$dist <-
      apply_alt_metric_correction(metric, res$dist, is_sparse(data))
  }
  tsmessage("Finished")
  res
}

#' Find nearest neighbors by random selection
#'
#' Create a neighbor graph by randomly selecting neighbors. This is not a useful
#' nearest neighbor method on its own, but can be used with other methods which
#' require initialization, such as [nnd_knn()].
#'
#' @param data Matrix of `n` items to generate random neighbors for, with
#'   observations in the rows and features in the columns. Optionally, input can
#'   be passed with observations in the columns, by setting `obs = "C"`, which
#'   should be more efficient. Possible formats are [base::data.frame()],
#'   [base::matrix()] or [Matrix::sparseMatrix()]. Sparse matrices should be in
#'   `dgCMatrix` format. Dataframes will be converted to `numerical` matrix
#'   format internally, so if your data columns are `logical` and intended to be
#'   used with the specialized binary `metric`s, you should convert it to a
#'   logical matrix first (otherwise you will get the slower dense numerical
#'   version).
#' @param k Number of nearest neighbors to return.
#' @param metric Type of distance calculation to use. One of:
#'   - `"braycurtis"`
#'   - `"canberra"`
#'   - `"chebyshev"`
#'   - `"correlation"` (1 minus the Pearson correlation)
#'   - `"cosine"`
#'   - `"dice"`
#'   - `"euclidean"`
#'   - `"hamming"`
#'   - `"hellinger"`
#'   - `"jaccard"`
#'   - `"jensenshannon"`
#'   - `"kulsinski"`
#'   - `"sqeuclidean"` (squared Euclidean)
#'   - `"manhattan"`
#'   - `"rogerstanimoto"`
#'   - `"russellrao"`
#'   - `"sokalmichener"`
#'   - `"sokalsneath"`
#'   - `"spearmanr"` (1 minus the Spearman rank correlation)
#'   - `"symmetrickl"` (symmetric Kullback-Leibler divergence)
#'   - `"tsss"` (Triangle Area Similarity-Sector Area Similarity or TS-SS
#'   metric)
#'   - `"yule"`
#'
#'   For non-sparse data, the following variants are available with
#'   preprocessing: this trades memory for a potential speed up during the
#'   distance calculation. Some minor numerical differences should be expected
#'   compared to the non-preprocessed versions:
#'   - `"cosine-preprocess"`: `cosine` with preprocessing.
#'   - `"correlation-preprocess"`: `correlation` with preprocessing.
#'
#'   For non-sparse binary data passed as a `logical` matrix, the following
#'   metrics have specialized variants which should be substantially faster than
#'   the non-binary variants (in other cases the logical data will be treated as
#'   a dense numeric vector of 0s and 1s):
#'   - `"dice"`
#'   - `"hamming"`
#'   - `"jaccard"`
#'   - `"kulsinski"`
#'   - `"matching"`
#'   - `"rogerstanimoto"`
#'   - `"russellrao"`
#'   - `"sokalmichener"`
#'   - `"sokalsneath"`
#'   - `"yule"`
#' @param use_alt_metric If `TRUE`, use faster metrics that maintain the
#'   ordering of distances internally (e.g. squared Euclidean distances if using
#'   `metric = "euclidean"`), then apply a correction at the end. Probably
#'   the only reason to set this to `FALSE` is if you suspect that some
#'   sort of numeric issue is occurring with your data in the alternative code
#'   path.
#' @param order_by_distance If `TRUE` (the default), then results for each
#'   item are returned by increasing distance. If you don't need the results
#'   sorted, e.g. you are going to pass the results as initialization to another
#'   routine like [nnd_knn()], set this to `FALSE` to save a small amount of
#'   computational time.
#' @param n_threads Number of threads to use.
#' @param verbose If `TRUE`, log information to the console.
#' @param obs set to `"C"` to indicate that the input `data` orientation stores
#'   each observation as a column. The default `"R"` means that observations are
#'   stored in each row. Storing the data by row is usually more convenient, but
#'   internally your data will be converted to column storage. Passing it
#'   already column-oriented will save some memory and (a small amount of) CPU
#'   usage.
#' @return a random neighbor graph as a list containing:
#'   * `idx` an n by k matrix containing the nearest neighbor indices.
#'   * `dist` an n by k matrix containing the nearest neighbor distances.
#' @examples
#' # Find 4 random neighbors and calculate their Euclidean distance
#' # If you pass a data frame, non-numeric columns are removed
#' iris_nn <- random_knn(iris, k = 4, metric = "euclidean")
#'
#' # Manhattan (l1) distance
#' iris_nn <- random_knn(iris, k = 4, metric = "manhattan")
#'
#' # Multi-threading: you can choose the number of threads to use: in real
#' # usage, you will want to set n_threads to at least 2
#' iris_nn <- random_knn(iris, k = 4, metric = "manhattan", n_threads = 1)
#'
#' # Use verbose flag to see information about progress
#' iris_nn <- random_knn(iris, k = 4, metric = "euclidean", verbose = TRUE)
#'
#' # These results can be improved by nearest neighbors descent. You don't need
#' # to specify k here because this is worked out from the initial input
#' iris_nn <- nnd_knn(iris, init = iris_nn, metric = "euclidean", verbose = TRUE)
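#'
#' # A sketch using order_by_distance (documented above): if the random graph
#' # is only going to seed nnd_knn, the sorting step can be skipped
#' iris_rnn <- random_knn(iris, k = 4, order_by_distance = FALSE)
#' iris_nn <- nnd_knn(iris, init = iris_rnn, metric = "euclidean")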
#' @export
random_knn <-
  function(data,
           k,
           metric = "euclidean",
           use_alt_metric = TRUE,
           order_by_distance = TRUE,
           n_threads = 0,
           verbose = FALSE,
           obs = "R") {
    obs <- match.arg(toupper(obs), c("C", "R"))
    n_obs <- switch(obs,
      R = nrow,
      C = ncol,
      stop("Unknown obs type")
    )
    data <- x2m(data)
    check_k(k, n_obs(data))

    actual_metric <- get_actual_metric(use_alt_metric, metric, data, verbose)

    if (obs == "R") {
      data <- Matrix::t(data)
    }
    res <- random_knn_impl(
      reference = data,
      k = k,
      actual_metric = actual_metric,
      order_by_distance = order_by_distance,
      n_threads = n_threads,
      verbose = verbose
    )
    res$idx <- res$idx + 1

    if (use_alt_metric) {
      res$dist <-
        apply_alt_metric_correction(metric, res$dist, is_sparse(data))
    }
    tsmessage("Finished")
    res
  }

#' Find nearest neighbors using nearest neighbor descent
#'
#' Uses the Nearest Neighbor Descent method due to Dong and co-workers (2011)
#' to optimize an approximate nearest neighbor graph.
#'
#' If no initial graph is provided, a random graph is generated, or you may also
#' specify the use of a graph generated from a forest of random projection
#' trees, using the method of Dasgupta and Freund (2008).
#'
#' @param data Matrix of `n` items to generate neighbors for, with observations
#'   in the rows and features in the columns. Optionally, input can be passed
#'   with observations in the columns, by setting `obs = "C"`, which should be
#'   more efficient. Possible formats are [base::data.frame()], [base::matrix()]
#'   or [Matrix::sparseMatrix()]. Sparse matrices should be in `dgCMatrix`
#'   format. Dataframes will be converted to `numerical` matrix format
#'   internally, so if your data columns are `logical` and intended to be used
#'   with the specialized binary `metric`s, you should convert it to a logical
#'   matrix first (otherwise you will get the slower dense numerical version).
#' @param k Number of nearest neighbors to return. Optional if `init` is
#'   specified.
#' @param metric Type of distance calculation to use. One of:
#'   - `"braycurtis"`
#'   - `"canberra"`
#'   - `"chebyshev"`
#'   - `"correlation"` (1 minus the Pearson correlation)
#'   - `"cosine"`
#'   - `"dice"`
#'   - `"euclidean"`
#'   - `"hamming"`
#'   - `"hellinger"`
#'   - `"jaccard"`
#'   - `"jensenshannon"`
#'   - `"kulsinski"`
#'   - `"sqeuclidean"` (squared Euclidean)
#'   - `"manhattan"`
#'   - `"rogerstanimoto"`
#'   - `"russellrao"`
#'   - `"sokalmichener"`
#'   - `"sokalsneath"`
#'   - `"spearmanr"` (1 minus the Spearman rank correlation)
#'   - `"symmetrickl"` (symmetric Kullback-Leibler divergence)
#'   - `"tsss"` (Triangle Area Similarity-Sector Area Similarity or TS-SS
#'   metric)
#'   - `"yule"`
#'
#'   For non-sparse data, the following variants are available with
#'   preprocessing: this trades memory for a potential speed up during the
#'   distance calculation. Some minor numerical differences should be expected
#'   compared to the non-preprocessed versions:
#'   - `"cosine-preprocess"`: `cosine` with preprocessing.
#'   - `"correlation-preprocess"`: `correlation` with preprocessing.
#'
#'   For non-sparse binary data passed as a `logical` matrix, the following
#'   metrics have specialized variants which should be substantially faster than
#'   the non-binary variants (in other cases the logical data will be treated as
#'   a dense numeric vector of 0s and 1s):
#'   - `"dice"`
#'   - `"hamming"`
#'   - `"jaccard"`
#'   - `"kulsinski"`
#'   - `"matching"`
#'   - `"rogerstanimoto"`
#'   - `"russellrao"`
#'   - `"sokalmichener"`
#'   - `"sokalsneath"`
#'   - `"yule"`
#' @param init Name of the initialization strategy or initial `data` neighbor
#'   graph to optimize. One of:
#'   - `"rand"` random initialization (the default).
#'   - `"tree"` use the random projection tree method of Dasgupta and Freund
#'   (2008).
#'   - a pre-calculated neighbor graph. A list containing:
#'       - `idx` an `n` by `k` matrix containing the nearest neighbor indices.
#'       - `dist` (optional) an `n` by `k` matrix containing the nearest
#'       neighbor distances. If the input distances are omitted, they will be
#'       calculated for you.
#'
#'   If `k` and `init` are specified as arguments to this function, and the
#'   number of neighbors provided in `init` is not equal to `k` then:
#'
#'   * if `k` is smaller, only the `k` closest values in `init` are retained.
#'   * if `k` is larger, then random neighbors will be chosen to fill `init` to
#'   the size of `k`. Note that there is no checking if any of the random
#'   neighbors are duplicates of what is already in `init` so effectively fewer
#'   than `k` neighbors may be chosen for some observations under these
#'   circumstances.
#' @param init_args a list containing arguments to pass to the random partition
#'   forest initialization. See [rpf_knn()] for possible arguments. To avoid
#'   inconsistencies with the tree calculation and subsequent nearest neighbor
#'   descent optimization, if you attempt to provide a `metric` or
#'   `use_alt_metric` option in this list it will be ignored.
#' @param n_iters Number of iterations of nearest neighbor descent to carry out.
#'   By default, this will be chosen based on the number of observations in
#'   `data`.
#' @param max_candidates Maximum number of candidate neighbors to try for each
#'   item in each iteration. Use relative to `k` to emulate the "rho"
#'   sampling parameter in the nearest neighbor descent paper. By default, this
#'   is set to `k` or `60`, whichever is smaller.
#' @param delta The minimum relative change in the neighbor graph allowed before
#'   early stopping. Should be a value between 0 and 1. The smaller the value,
#'   the less progress between iterations is needed for the optimization to
#'   continue. The default value of `0.001` means that at least 0.1% of the
#'   neighbor graph must be updated at each iteration to avoid early stopping.
#' @param low_memory If `TRUE`, use a lower memory, but more
#'   computationally expensive approach to index construction. If set to
#'   `FALSE`, you should see a noticeable speed improvement, especially
#'   when using a smaller number of threads, so this is worth trying if you have
#'   the memory to spare.
#' @param weight_by_degree If `TRUE`, then candidates for the local join are
#'   weighted according to their in-degree, so that if there are more than
#'   `max_candidates` in a candidate list, candidates with a smaller degree are
#'   favored for retention. This prevents items with large numbers of edges
#'   crowding out other items and for high-dimensional data is likely to provide
#'   a small improvement in accuracy. Because this incurs a small extra cost of
#'   counting the degree of each node, and because it tends to delay early
#'   convergence, by default this is `FALSE`.
#' @param use_alt_metric If `TRUE`, use faster metrics that maintain the
#'   ordering of distances internally (e.g. squared Euclidean distances if using
#'   `metric = "euclidean"`), then apply a correction at the end. Probably
#'   the only reason to set this to `FALSE` is if you suspect that some
#'   sort of numeric issue is occurring with your data in the alternative code
#'   path.
#' @param ret_forest If `TRUE` and `init = "tree"` then the RP forest used to
#'   initialize the nearest neighbors will be returned with the nearest neighbor
#'   data. See the `Value` section for details. The returned forest can be used
#'   as part of initializing the search for new data: see [rpf_knn_query()] and
#'   [rpf_filter()] for more details.
#' @param n_threads Number of threads to use.
#' @param verbose If `TRUE`, log information to the console.
#' @param progress Determines the type of progress information logged if
#'   `verbose = TRUE`. Options are:
#'   * `"bar"`: a simple text progress bar.
#'   * `"dist"`: the sum of the distances in the approximate knn graph at the
#'     end of each iteration.
#' @param obs set to `"C"` to indicate that the input `data` orientation stores
#'   each observation as a column. The default `"R"` means that observations are
#'   stored in each row. Storing the data by row is usually more convenient, but
#'   internally your data will be converted to column storage. Passing it
#'   already column-oriented will save some memory and (a small amount of) CPU
#'   usage.
#' @return the approximate nearest neighbor graph as a list containing:
#'   * `idx` an n by k matrix containing the nearest neighbor indices.
#'   * `dist` an n by k matrix containing the nearest neighbor distances.
#'   * `forest` (if `init = "tree"` and `ret_forest = TRUE` only): the RP forest
#'      used to initialize the neighbor data.
#' @examples
#' # Find 4 (approximate) nearest neighbors using Euclidean distance
#' # If you pass a data frame, non-numeric columns are removed
#' iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean")
#'
#' # Manhattan (l1) distance
#' iris_nn <- nnd_knn(iris, k = 4, metric = "manhattan")
#'
#' # Multi-threading: you can choose the number of threads to use: in real
#' # usage, you will want to set n_threads to at least 2
#' iris_nn <- nnd_knn(iris, k = 4, metric = "manhattan", n_threads = 1)
#'
#' # Use verbose flag to see information about progress
#' iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", verbose = TRUE)
#'
#' # Nearest neighbor descent uses random initialization, but you can pass any
#' # approximation using the init argument (as long as the metrics used to
#' # calculate the initialization are compatible with the metric options used
#' # by nnd_knn).
#' iris_nn <- random_knn(iris, k = 4, metric = "euclidean")
#' iris_nn <- nnd_knn(iris, init = iris_nn, metric = "euclidean", verbose = TRUE)
#'
#' # Number of iterations controls how much optimization is attempted. A smaller
#' # value will run faster but give poorer results
#' iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", n_iters = 2)
#'
#' # You can also control the amount of work done within an iteration by
#' # setting max_candidates
#' iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", max_candidates = 50)
#'
#' # Optimization may also stop early if not much progress is being made. This
#' # convergence criterion can be controlled via delta. A larger value will
#' # stop progress earlier. The verbose flag will provide some information if
#' # convergence is occurring before all iterations are carried out.
#' set.seed(1337)
#' iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", n_iters = 5, delta = 0.5)
#'
#' # To ensure that descent only stops if no improvements are made, set delta = 0
#' set.seed(1337)
#' iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", n_iters = 5, delta = 0)
#'
#' # A faster version of the algorithm is available that avoids repeated
#' # distance calculations at the cost of using more RAM. Set low_memory to
#' # FALSE to try it.
#' set.seed(1337)
#' iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", low_memory = FALSE)
#'
#' # Using init = "tree" is usually more efficient than random initialization.
#' # arguments to the tree initialization method can be passed via the init_args
#' # list
#' set.seed(1337)
#' iris_nn <- nnd_knn(iris, k = 4, init = "tree", init_args = list(n_trees = 5))
#' @references
#' Dasgupta, S., & Freund, Y. (2008, May).
#' Random projection trees and low dimensional manifolds.
#' In *Proceedings of the fortieth annual ACM symposium on Theory of computing*
#' (pp. 537-546).
#' \doi{10.1145/1374376.1374452}.
#'
#' Dong, W., Moses, C., & Li, K. (2011, March).
#' Efficient k-nearest neighbor graph construction for generic similarity measures.
#' In *Proceedings of the 20th international conference on World Wide Web*
#' (pp. 577-586).
#' ACM.
#' \doi{10.1145/1963405.1963487}.
#' @export
nnd_knn <- function(data,
                    k = NULL,
                    metric = "euclidean",
                    init = "rand",
                    init_args = NULL,
                    n_iters = NULL,
                    max_candidates = NULL,
                    delta = 0.001,
                    low_memory = TRUE,
                    weight_by_degree = FALSE,
                    use_alt_metric = TRUE,
                    n_threads = 0,
                    verbose = FALSE,
                    progress = "bar",
                    obs = "R",
                    ret_forest = FALSE) {
  stopifnot(tolower(progress) %in% c("bar", "dist"))
  obs <- match.arg(toupper(obs), c("C", "R"))

  actual_metric <-
    get_actual_metric(use_alt_metric, metric, data, verbose)

  data <- x2m(data)
  if (obs == "R") {
    data <- Matrix::t(data)
  }

  # data must be column-oriented at this point
  if (is.character(init)) {
    if (is.null(k)) {
      stop("Must provide k")
    }
    check_k(k, ncol(data))
    init <- match.arg(tolower(init), c("rand", "tree"))

    # if the user provided tree init args that could clash with the nnd args,
    # override them with the values passed to this function
    if (init == "tree" && is.list(init_args)) {
      if ("use_alt_metric" %in% names(init_args)) {
        init_args$use_alt_metric <- use_alt_metric
      }
      if ("metric" %in% names(init_args)) {
        init_args$metric <- metric
      }
    }
    tsmessage("Initializing neighbors using '", init, "' method")
    init <- switch(init,
      "rand" = random_knn_impl(
        reference = data,
        k = k,
        actual_metric = actual_metric,
        order_by_distance = FALSE,
        n_threads = n_threads,
        verbose = verbose
      ),
      "tree" = do.call(rpf_knn_impl, lmerge(
        # defaults we have no reason to change should match rpf_knn or rpf_build
        list(
          data,
          k = k,
          metric = metric,
          use_alt_metric = use_alt_metric,
          actual_metric = actual_metric,
          n_trees = NULL,
          leaf_size = NULL,
          max_tree_depth = 200,
          include_self = FALSE, # this is changed from default on purpose
          ret_forest = ret_forest,
          margin = "auto",
          unzero = FALSE,
          n_threads = n_threads,
          verbose = verbose
        ),
        init_args
      )),
      stop("Unknown initialization option '", init, "'")
    )
    # FIXME: can we just turn off unzero in tree and random return?
    init$idx <- init$idx + 1
    if (any(init$idx == 0)) {
      tsmessage(
        "Warning: initialization failed to find ",
        k,
        " neighbors for all points"
      )
    }
  } else {
    tsmessage("Initializing from user-supplied graph")
    # user-supplied input may need to be transformed to the actual metric
    if (use_alt_metric &&
      !is.null(init) && is.list(init) && !is.null(init$dist)) {
      tsmessage(
        "Applying metric correction to initial distances from '",
        metric, "' to '", actual_metric, "'"
      )
      init$dist <-
        apply_alt_metric_uncorrection(metric, init$dist, is_sparse(data))
    }

    if (is.null(k)) {
      k <- ncol(init$idx)
    }
  }
  forest <- NULL
  if (ret_forest && is.list(init) && !is.null(init$forest)) {
    forest <- init$forest
    init$forest <- NULL
  }

  init <-
    prepare_init_graph(
      init,
      k,
      data = data,
      metric = actual_metric,
      n_threads = n_threads,
      verbose = verbose
    )

  if (is.null(max_candidates)) {
    max_candidates <- min(k, 60)
  }
  if (is.null(n_iters)) {
    n_iters <- max(5, round(log2(ncol(data))))
  }
  tsmessage(
    thread_msg(
      "Running nearest neighbor descent for ",
      n_iters,
      " iterations",
      n_threads = n_threads
    )
  )

  nnd_args <- list(
    nn_idx = init$idx,
    nn_dist = init$dist,
    metric = actual_metric,
    n_iters = n_iters,
    max_candidates = max_candidates,
    delta = delta,
    low_memory = low_memory,
    weight_by_degree = weight_by_degree,
    n_threads = n_threads,
    verbose = verbose,
    progress_type = progress
  )
  if (is_sparse(data)) {
    nnd_fun <- rnn_sparse_descent
    nnd_args$data <- data@x
    nnd_args$ind <- data@i
    nnd_args$ptr <- data@p
    nnd_args$ndim <- nrow(data)
  } else if (is.logical(data)) {
    nnd_fun <- rnn_logical_descent
    nnd_args$data <- data
  } else {
    nnd_fun <- rnn_descent
    nnd_args$data <- data
  }
  res <- do.call(nnd_fun, nnd_args)

  if (use_alt_metric) {
    res$dist <-
      apply_alt_metric_correction(metric, res$dist, is_sparse(data))
  }
  if (any(res$idx == 0)) {
    tsmessage(
      "Warning: NN Descent failed to find ",
      k,
      " neighbors for all points"
    )
  }
  tsmessage("Finished")
  if (!is.null(forest)) {
    res$forest <- forest
  }
  res
}


# kNN Queries -------------------------------------------------------------

#' Query exact nearest neighbors by brute force
#'
#' Returns the exact nearest neighbors of query data to the reference data. A
#' brute force search is carried out: all possible pairs of reference and query
#' points are compared, and the nearest neighbors are returned.
#'
#' This is accurate but scales poorly with dataset size, so use it with caution
#' on larger datasets. Having the exact neighbors as a ground truth to compare
#' with approximate results is useful for benchmarking and determining
#' parameter settings of the approximate methods.
#'
#' @param query Matrix of `n` query items, with observations in the rows and
#'   features in the columns. Optionally, the data may be passed with the
#'   observations in the columns, by setting `obs = "C"`, which should be more
#'   efficient. The `reference` data must be passed in the same orientation as
#'   `query`. Possible formats are [base::data.frame()], [base::matrix()]
#'   or [Matrix::sparseMatrix()]. Sparse matrices should be in `dgCMatrix`
#'   format. Dataframes will be converted to `numerical` matrix format
#'   internally, so if your data columns are `logical` and intended to be used
#'   with the specialized binary `metric`s, you should convert it to a logical
#'   matrix first (otherwise you will get the slower dense numerical version).
#' @param reference Matrix of `m` reference items, with observations in the rows
#'   and features in the columns. The nearest neighbors to the queries are
#'   calculated from this data. Optionally, the data may be passed with the
#'   observations in the columns, by setting `obs = "C"`, which should be more
#'   efficient. The `query` data must be passed in the same format and
#'   orientation as `reference`. Possible formats are [base::data.frame()],
#'   [base::matrix()] or [Matrix::sparseMatrix()]. Sparse matrices should be in
#'   `dgCMatrix` format.
#' @param k Number of nearest neighbors to return.
#' @param metric Type of distance calculation to use. One of:
#'   - `"braycurtis"`
#'   - `"canberra"`
#'   - `"chebyshev"`
#'   - `"correlation"` (1 minus the Pearson correlation)
#'   - `"cosine"`
#'   - `"dice"`
#'   - `"euclidean"`
#'   - `"hamming"`
#'   - `"hellinger"`
#'   - `"jaccard"`
#'   - `"jensenshannon"`
#'   - `"kulsinski"`
#'   - `"sqeuclidean"` (squared Euclidean)
#'   - `"manhattan"`
#'   - `"rogerstanimoto"`
#'   - `"russellrao"`
#'   - `"sokalmichener"`
#'   - `"sokalsneath"`
#'   - `"spearmanr"` (1 minus the Spearman rank correlation)
#'   - `"symmetrickl"` (symmetric Kullback-Leibler divergence)
#'   - `"tsss"` (Triangle Area Similarity-Sector Area Similarity or TS-SS
#'   metric)
#'   - `"yule"`
#'
#'   For non-sparse data, the following variants are available with
#'   preprocessing: this trades memory for a potential speed up during the
#'   distance calculation. Some minor numerical differences should be expected
#'   compared to the non-preprocessed versions:
#'   - `"cosine-preprocess"`: `cosine` with preprocessing.
#'   - `"correlation-preprocess"`: `correlation` with preprocessing.
#'
#'   For non-sparse binary data passed as a `logical` matrix, the following
#'   metrics have specialized variants which should be substantially faster than
#'   the non-binary variants (in other cases the logical data will be treated as
#'   a dense numeric vector of 0s and 1s):
#'   - `"dice"`
#'   - `"hamming"`
#'   - `"jaccard"`
#'   - `"kulsinski"`
#'   - `"matching"`
#'   - `"rogerstanimoto"`
#'   - `"russellrao"`
#'   - `"sokalmichener"`
#'   - `"sokalsneath"`
#'   - `"yule"`
#' @param use_alt_metric If `TRUE`, use faster metrics that maintain the
#'   ordering of distances internally (e.g. squared Euclidean distances if using
#'   `metric = "euclidean"`), then apply a correction at the end. Probably
#'   the only reason to set this to `FALSE` is if you suspect that some
#'   sort of numeric issue is occurring with your data in the alternative code
#'   path.
#' @param n_threads Number of threads to use.
#' @param verbose If `TRUE`, log information to the console.
#' @param obs set to `"C"` to indicate that the input `query` and `reference`
#'   orientation stores each observation as a column (the orientation must be
#'   consistent). The default `"R"` means that observations are stored in each
#'   row. Storing the data by row is usually more convenient, but internally
#'   your data will be converted to column storage. Passing it already
#'   column-oriented will save some memory and (a small amount of) CPU usage.
#' @return the nearest neighbor graph as a list containing:
#'   * `idx` an n by k matrix containing the nearest neighbor indices in
#'   `reference`.
#'   * `dist` an n by k matrix containing the nearest neighbor distances to the
#'   items in `reference`.
#' @examples
#' # 100 reference iris items
#' iris_ref <- iris[iris$Species %in% c("setosa", "versicolor"), ]
#'
#' # 50 query items
#' iris_query <- iris[iris$Species == "versicolor", ]
#'
#' # For each item in iris_query find the 4 nearest neighbors in iris_ref
#' # If you pass a data frame, non-numeric columns are removed
#' # set verbose = TRUE to get details on the progress being made
#' iris_query_nn <- brute_force_knn_query(iris_query,
#'   reference = iris_ref,
#'   k = 4, metric = "euclidean", verbose = TRUE
#' )
#'
#' # Manhattan (l1) distance
#' iris_query_nn <- brute_force_knn_query(iris_query,
#'   reference = iris_ref,
#'   k = 4, metric = "manhattan"
#' )
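#'
#' # If your data is already stored with observations in the columns, pass
#' # obs = "C" to avoid an internal transpose. A minimal sketch: here the
#' # numeric iris columns are transposed purely for illustration.
#' iris_query_nn <- brute_force_knn_query(t(as.matrix(iris_query[, -5])),
#'   reference = t(as.matrix(iris_ref[, -5])),
#'   k = 4, metric = "euclidean", obs = "C"
#' )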
#' @export
brute_force_knn_query <- function(query,
                                  reference,
                                  k,
                                  metric = "euclidean",
                                  use_alt_metric = TRUE,
                                  n_threads = 0,
                                  verbose = FALSE,
                                  obs = "R") {
  obs <- match.arg(toupper(obs), c("C", "R"))

  actual_metric <- get_actual_metric(use_alt_metric, metric, reference, verbose)

  check_sparse(reference, query)
  reference <- x2m(reference)
  query <- x2m(query)
  if (obs == "R") {
    reference <- Matrix::t(reference)
    query <- Matrix::t(query)
  }
  check_k(k, ncol(reference))

  tsmessage(
    thread_msg(
      "Calculating brute force k-nearest neighbors from reference with k = ",
      k,
      n_threads = n_threads
    )
  )

  if (is_sparse(reference)) {
    res <- rnn_sparse_brute_force_query(
      ref_ind = reference@i,
      ref_ptr = reference@p,
      ref_data = reference@x,
      query_ind = query@i,
      query_ptr = query@p,
      query_data = query@x,
      ndim = nrow(query),
      nnbrs = k,
      metric = actual_metric,
      n_threads = n_threads,
      verbose = verbose
    )
  } else if (is.logical(reference)) {
    res <- rnn_logical_brute_force_query(reference,
      query,
      k,
      actual_metric,
      n_threads = n_threads,
      verbose = verbose
    )
  } else {
    res <- rnn_brute_force_query(reference,
      query,
      k,
      actual_metric,
      n_threads = n_threads,
      verbose = verbose
    )
  }
  res$idx <- res$idx + 1

  if (use_alt_metric) {
    res$dist <-
      apply_alt_metric_correction(metric, res$dist, is_sparse(reference))
  }
  tsmessage("Finished")

  res
}

#' Query nearest neighbors by random selection
#'
#' Run queries against reference data to return randomly selected neighbors.
#' This is not a useful query method on its own, but can be used with other
#' methods which require initialization.
#'
#' @param query Matrix of `n` query items, with observations in the rows and
#'   features in the columns. Optionally, the data may be passed with the
#'   observations in the columns, by setting `obs = "C"`, which should be more
#'   efficient. The `reference` data must be passed in the same orientation as
#'   `query`. Possible formats are [base::data.frame()], [base::matrix()]
#'   or [Matrix::sparseMatrix()]. Sparse matrices should be in `dgCMatrix`
#'   format. Dataframes will be converted to `numerical` matrix format
#'   internally, so if your data columns are `logical` and intended to be used
#'   with the specialized binary `metric`s, you should convert it to a logical
#'   matrix first (otherwise you will get the slower dense numerical version).
#' @param reference Matrix of `m` reference items, with observations in the rows
#'   and features in the columns. The nearest neighbors to the queries are
#'   randomly selected from this data. Optionally, the data may be passed with
#'   the observations in the columns, by setting `obs = "C"`, which should be
#'   more efficient. The `query` data must be passed in the same orientation
#'   and format as `reference`. Possible formats are [base::data.frame()],
#'   [base::matrix()] or [Matrix::sparseMatrix()]. Sparse matrices should be in
#'   `dgCMatrix` format.
#' @param k Number of nearest neighbors to return.
#' @param metric Type of distance calculation to use. One of:
#'   - `"braycurtis"`
#'   - `"canberra"`
#'   - `"chebyshev"`
#'   - `"correlation"` (1 minus the Pearson correlation)
#'   - `"cosine"`
#'   - `"dice"`
#'   - `"euclidean"`
#'   - `"hamming"`
#'   - `"hellinger"`
#'   - `"jaccard"`
#'   - `"jensenshannon"`
#'   - `"kulsinski"`
#'   - `"sqeuclidean"` (squared Euclidean)
#'   - `"manhattan"`
#'   - `"rogerstanimoto"`
#'   - `"russellrao"`
#'   - `"sokalmichener"`
#'   - `"sokalsneath"`
#'   - `"spearmanr"` (1 minus the Spearman rank correlation)
#'   - `"symmetrickl"` (symmetric Kullback-Leibler divergence)
#'   - `"tsss"` (Triangle Area Similarity-Sector Area Similarity or TS-SS
#'   metric)
#'   - `"yule"`
#'
#'   For non-sparse data, the following variants are available with
#'   preprocessing: this trades memory for a potential speed up during the
#'   distance calculation. Some minor numerical differences should be expected
#'   compared to the non-preprocessed versions:
#'   - `"cosine-preprocess"`: `cosine` with preprocessing.
#'   - `"correlation-preprocess"`: `correlation` with preprocessing.
#'
#'   For non-sparse binary data passed as a `logical` matrix, the following
#'   metrics have specialized variants which should be substantially faster than
#'   the non-binary variants (in other cases the logical data will be treated as
#'   a dense numeric vector of 0s and 1s):
#'   - `"dice"`
#'   - `"hamming"`
#'   - `"jaccard"`
#'   - `"kulsinski"`
#'   - `"matching"`
#'   - `"rogerstanimoto"`
#'   - `"russellrao"`
#'   - `"sokalmichener"`
#'   - `"sokalsneath"`
#'   - `"yule"`
#' @param use_alt_metric If `TRUE`, use faster metrics that maintain the
#'   ordering of distances internally (e.g. squared Euclidean distances if using
#'   `metric = "euclidean"`), then apply a correction at the end. Probably
#'   the only reason to set this to `FALSE` is if you suspect that some
#'   sort of numeric issue is occurring with your data in the alternative code
#'   path.
#' @param order_by_distance If `TRUE` (the default), then results for each
#'   item are returned by increasing distance. If you don't need the results
#'   sorted, e.g. you are going to pass the results as initialization to another
#'   routine like [graph_knn_query()], set this to `FALSE` to save a
#'   small amount of computational time.
#' @param n_threads Number of threads to use.
#' @param verbose If `TRUE`, log information to the console.
#' @param obs set to `"C"` to indicate that the input `query` and `reference`
#'   orientation stores each observation as a column (the orientation must be
#'   consistent). The default `"R"` means that observations are stored in each
#'   row. Storing the data by row is usually more convenient, but internally
#'   your data will be converted to column storage. Passing it already
#'   column-oriented will save some memory and (a small amount of) CPU usage.
#' @return an approximate nearest neighbor graph as a list containing:
#'   * `idx` an n by k matrix containing the nearest neighbor indices.
#'   * `dist` an n by k matrix containing the nearest neighbor distances.
#' @examples
#' # 100 reference iris items
#' iris_ref <- iris[iris$Species %in% c("setosa", "versicolor"), ]
#'
#' # 50 query items
#' iris_query <- iris[iris$Species == "versicolor", ]
#'
#' # For each item in iris_query find 4 random neighbors in iris_ref
#' # If you pass a data frame, non-numeric columns are removed
#' # set verbose = TRUE to get details on the progress being made
#' iris_query_random_nbrs <- random_knn_query(iris_query,
#'   reference = iris_ref,
#'   k = 4, metric = "euclidean", verbose = TRUE
#' )
#'
#' # Manhattan (l1) distance
#' iris_query_random_nbrs <- random_knn_query(iris_query,
#'   reference = iris_ref,
#'   k = 4, metric = "manhattan"
#' )
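#'
#' # If the random neighbors are only needed to initialize another routine
#' # (e.g. graph_knn_query), skip sorting by distance to save a little time.
#' # Illustrative only; the unsorted results are otherwise equivalent.
#' iris_query_init <- random_knn_query(iris_query,
#'   reference = iris_ref,
#'   k = 4, metric = "euclidean", order_by_distance = FALSE
#' )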
#' @export
random_knn_query <-
  function(query,
           reference,
           k,
           metric = "euclidean",
           use_alt_metric = TRUE,
           order_by_distance = TRUE,
           n_threads = 0,
           verbose = FALSE,
           obs = "R") {
    obs <- match.arg(toupper(obs), c("C", "R"))
    n_obs <- switch(obs,
      R = nrow,
      C = ncol,
      stop("Unknown obs type")
    )

    check_sparse(reference, query)

    reference <- x2m(reference)
    query <- x2m(query)
    check_k(k, n_obs(reference))

    actual_metric <-
      get_actual_metric(use_alt_metric, metric, reference, verbose)

    if (obs == "R") {
      reference <- Matrix::t(reference)
      query <- Matrix::t(query)
    }

    res <- random_knn_impl(
      reference = reference,
      query = query,
      k = k,
      actual_metric = actual_metric,
      order_by_distance = order_by_distance,
      n_threads = n_threads,
      verbose = verbose
    )
    res$idx <- res$idx + 1

    if (use_alt_metric) {
      res$dist <-
        apply_alt_metric_correction(metric, res$dist, is_sparse(reference))
    }

    tsmessage("Finished")
    res
  }

#' Query a search graph for nearest neighbors
#'
#' Run queries against a search graph, to return nearest neighbors taken from
#' the reference data used to build that graph.
#'
#' A greedy beam search is used to query the graph, combining two search pruning
#' strategies. The first, due to Iwasaki and Miyazaki (2018), only considers
#' new candidates within a relative distance of the current furthest neighbor
#' in the query's graph. The second, due to Harwood and Drummond (2016), puts a
#' limit on the absolute number of distance calculations to carry out. See the
#' `epsilon` and `max_search_fraction` parameters respectively.
#'
#' @param query Matrix of `n` query items, with observations in the rows and
#'   features in the columns. Optionally, the data may be passed with the
#'   observations in the columns, by setting `obs = "C"`, which should be more
#'   efficient. The `reference` data must be passed in the same orientation as
#'   `query`. Possible formats are [base::data.frame()], [base::matrix()]
#'   or [Matrix::sparseMatrix()]. Sparse matrices should be in `dgCMatrix`
#'   format. Dataframes will be converted to `numerical` matrix format
#'   internally, so if your data columns are `logical` and intended to be used
#'   with the specialized binary `metric`s, you should convert it to a logical
#'   matrix first (otherwise you will get the slower dense numerical version).
#' @param reference Matrix of `m` reference items, with observations in the rows
#'   and features in the columns. The nearest neighbors to the queries are
#'   calculated from this data. Optionally, the data may be passed with the
#'   observations in the columns, by setting `obs = "C"`, which should be more
#'   efficient. The `query` data must be passed in the same format and
#'   orientation as `reference`. Possible formats are [base::data.frame()],
#'   [base::matrix()] or [Matrix::sparseMatrix()]. Sparse matrices should be in
#'   `dgCMatrix` format.
#' @param reference_graph Search graph of the `reference` data. A neighbor
#'   graph, such as that output from [nnd_knn()], can be used, but
#'   preferably a suitably prepared sparse search graph should be used, such as
#'   that output by [prepare_search_graph()].
#' @param k Number of nearest neighbors to return. Optional if `init` is
#'   specified.
#' @param metric Type of distance calculation to use. One of:
#'   - `"braycurtis"`
#'   - `"canberra"`
#'   - `"chebyshev"`
#'   - `"correlation"` (1 minus the Pearson correlation)
#'   - `"cosine"`
#'   - `"dice"`
#'   - `"euclidean"`
#'   - `"hamming"`
#'   - `"hellinger"`
#'   - `"jaccard"`
#'   - `"jensenshannon"`
#'   - `"kulsinski"`
#'   - `"sqeuclidean"` (squared Euclidean)
#'   - `"manhattan"`
#'   - `"rogerstanimoto"`
#'   - `"russellrao"`
#'   - `"sokalmichener"`
#'   - `"sokalsneath"`
#'   - `"spearmanr"` (1 minus the Spearman rank correlation)
#'   - `"symmetrickl"` (symmetric Kullback-Leibler divergence)
#'   - `"tsss"` (Triangle Area Similarity-Sector Area Similarity or TS-SS
#'   metric)
#'   - `"yule"`
#'
#'   For non-sparse data, the following variants are available with
#'   preprocessing: this trades memory for a potential speed up during the
#'   distance calculation. Some minor numerical differences should be expected
#'   compared to the non-preprocessed versions:
#'   - `"cosine-preprocess"`: `cosine` with preprocessing.
#'   - `"correlation-preprocess"`: `correlation` with preprocessing.
#'
#'   For non-sparse binary data passed as a `logical` matrix, the following
#'   metrics have specialized variants which should be substantially faster than
#'   the non-binary variants (in other cases the logical data will be treated as
#'   a dense numeric vector of 0s and 1s):
#'   - `"dice"`
#'   - `"hamming"`
#'   - `"jaccard"`
#'   - `"kulsinski"`
#'   - `"matching"`
#'   - `"rogerstanimoto"`
#'   - `"russellrao"`
#'   - `"sokalmichener"`
#'   - `"sokalsneath"`
#'   - `"yule"`
#' @param use_alt_metric If `TRUE`, use faster metrics that maintain the
#'   ordering of distances internally (e.g. squared Euclidean distances if using
#'   `metric = "euclidean"`), then apply a correction at the end. Probably the
#'   only reason to set this to `FALSE` is if you suspect that some sort of
#'   numeric issue is occurring with your data in the alternative code path. If
#'   a search forest is used for initialization via the `init` parameter, then
#'   the metric is fetched from there and this setting is ignored.
#' @param init Initial `query` neighbor graph to optimize. If not provided, `k`
#'   random neighbors are created. If provided, the input format must be one of:
#'   1. A list containing:
#'
#'       * `idx` an `n` by `k` matrix containing the nearest neighbor indices.
#'       * `dist` (optional) an `n` by `k` matrix containing the nearest neighbor
#'       distances.
#'
#'       If `k` and `init` are specified as arguments to this function, and the
#'       number of neighbors provided in `init` is not equal to `k` then:
#'
#'       * if `k` is smaller, only the `k` closest values in `init` are retained.
#'       * if `k` is larger, then random neighbors will be chosen to fill `init` to
#'       the size of `k`. Note that there is no checking if any of the random
#'       neighbors are duplicates of what is already in `init` so effectively fewer
#'       than `k` neighbors may be chosen for some observations under these
#'       circumstances.
#'
#'       If the input distances are omitted, they will be calculated for you.
#'  1. A random projection forest, such as that returned from [rpf_build()] or
#'     [rpf_knn()] with `ret_forest = TRUE`.
#' @param epsilon Controls trade-off between accuracy and search cost, as
#'   described by Iwasaki and Miyazaki (2018), by specifying a distance
#'   tolerance on whether to explore the neighbors of candidate points. The
#'   larger the value, the more neighbors will be searched. A value of 0.1
#'   allows query-candidate distances to be 10% larger than the current
#'   most-distant neighbor of the query point, 0.2 means 20%, and so on.
#'   Suggested values are between 0 and 0.5, although this value is highly dependent
#'   on the distribution of distances in the dataset (higher dimensional data
#'   should choose a smaller cutoff). Too large a value of `epsilon` will result
#'   in the query search approaching brute force comparison. Use this parameter
#'   in conjunction with `max_search_fraction` and [prepare_search_graph()] to
#'   prevent excessive run time. Default is 0.1. If you set `verbose = TRUE`,
#'   statistics of the number of distance calculations will be logged which
#'   can help you tune `epsilon`.
#' @param max_search_fraction Maximum fraction of the reference data to search.
#'  This is a value between 0 (search none of the reference data) and 1 (search
#'  all of the data if necessary). This works in conjunction with `epsilon` and
#'  will terminate the search early if the specified fraction of the reference
#'  data has been searched. Default is 1.
#' @param n_threads Number of threads to use.
#' @param verbose If `TRUE`, log information to the console.
#' @param obs set to `"C"` to indicate that the input `query` and `reference`
#'   orientation stores each observation as a column (the orientation must be
#'   consistent). The default `"R"` means that observations are stored in each
#'   row. Storing the data by row is usually more convenient, but internally
#'   your data will be converted to column storage. Passing it already
#'   column-oriented will save some memory and (a small amount of) CPU usage.
#' @return the approximate nearest neighbor graph as a list containing:
#'   * `idx` an `n` by `k` matrix containing the nearest neighbor indices
#'     specifying the row of the neighbor in `reference`.
#'   * `dist` an `n` by `k` matrix containing the nearest neighbor distances.
#' @examples
#' # 100 reference iris items
#' iris_ref <- iris[iris$Species %in% c("setosa", "versicolor"), ]
#'
#' # 50 query items
#' iris_query <- iris[iris$Species == "versicolor", ]
#'
#' # First, find the approximate 4-nearest neighbor graph for the references:
#' iris_ref_graph <- nnd_knn(iris_ref, k = 4)
#'
#' # For each item in iris_query find the 4 nearest neighbors in iris_ref.
#' # You need to pass both the reference data and the reference graph.
#' # If you pass a data frame, non-numeric columns are removed.
#' # set verbose = TRUE to get details on the progress being made
#' iris_query_nn <- graph_knn_query(iris_query, iris_ref, iris_ref_graph,
#'   k = 4, metric = "euclidean", verbose = TRUE
#' )
#'
#' # A more complete example, converting the initial knn into a search graph
#' # and using a filtered random projection forest to initialize the search
#' # create initial knn and forest
#' iris_ref_graph <- nnd_knn(iris_ref, k = 4, init = "tree", ret_forest = TRUE)
#' # keep the best tree in the forest
#' forest <- rpf_filter(iris_ref_graph, n_trees = 1)
#' # expand the knn into a search graph
#' iris_ref_search_graph <- prepare_search_graph(iris_ref, iris_ref_graph)
#' # run the query with the improved graph and initialization
#' iris_query_nn <- graph_knn_query(iris_query, iris_ref, iris_ref_search_graph,
#'   init = forest, k = 4
#' )
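#' # The accuracy/cost trade-off of the search can be tuned: a larger epsilon
#' # explores more candidates, while max_search_fraction caps how much of the
#' # reference data may be examined. The values below are illustrative only;
#' # good settings depend on your data.
#' iris_query_nn <- graph_knn_query(iris_query, iris_ref, iris_ref_search_graph,
#'   k = 4, epsilon = 0.2, max_search_fraction = 0.5, verbose = TRUE
#' )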
#'
#' @references
#' Harwood, B., & Drummond, T. (2016).
#' Fanng: Fast approximate nearest neighbour graphs.
#' In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*
#' (pp. 5713-5722).
#'
#' Iwasaki, M., & Miyazaki, D. (2018).
#' Optimization of indexing based on k-nearest neighbor graph for proximity
#' search in high-dimensional data.
#' *arXiv preprint arXiv:1810.07355*.
#' @export
graph_knn_query <- function(query,
                            reference,
                            reference_graph,
                            k = NULL,
                            metric = "euclidean",
                            init = NULL,
                            epsilon = 0.1,
                            max_search_fraction = 1.0,
                            use_alt_metric = TRUE,
                            n_threads = 0,
                            verbose = FALSE,
                            obs = "R") {
  obs <- match.arg(toupper(obs), c("C", "R"))
  check_sparse(reference, query)
  reference <- x2m(reference)
  query <- x2m(query)
  if (obs == "R") {
    reference <- Matrix::t(reference)
    query <- Matrix::t(query)
  }

  if (is.list(init) && is_rpforest(init)) {
    tsmessage("Reading metric data from forest")
    actual_metric <- init$actual_metric
    metric <- init$original_metric
    use_alt_metric <- init$use_alt_metric
    if (use_alt_metric && metric != actual_metric) {
      tsmessage("Using alt metric '", actual_metric, "' for '", metric, "'")
    } else {
      tsmessage("Using metric '", metric, "'")
    }
  } else {
    actual_metric <-
      get_actual_metric(use_alt_metric, metric, reference, verbose)
  }

  # reference and query must be column-oriented at this point
  if (is.null(init)) {
    if (is.null(k)) {
      if (is.list(reference_graph)) {
        k <- get_reference_graph_k(reference_graph)
        tsmessage("Using k = ", k, " from graph")
      } else {
        stop("Must provide k")
      }
    }
    check_k(k, ncol(reference))
    tsmessage("Initializing from random neighbors")
    init <- random_knn_impl(
      query = query,
      reference = reference,
      k = k,
      actual_metric = actual_metric,
      order_by_distance = FALSE,
      n_threads = n_threads,
      verbose = verbose
    )
    # FIXME: can we just do the unzeroing inside the init?
    init$idx <- init$idx + 1
  } else if (is.list(init) && is_rpforest(init)) {
    init <- rpf_knn_query(
      query = query,
      reference = reference,
      forest = init,
      k = k,
      cache = TRUE,
      n_threads = n_threads,
      verbose = verbose,
      obs = "C"
    )
    if (use_alt_metric) {
      init$dist <-
        apply_alt_metric_uncorrection(metric, init$dist, is_sparse(reference))
    }
  } else if ((is.list(init) && !is.null(init$idx)) || is.matrix(init)) {
    tsmessage("Initializing from user-supplied graph")
    if (is.matrix(init)) {
      init <- list(idx = init)
    }

    # user-supplied distances may need to be transformed to the actual metric
    if (use_alt_metric && !is.null(init$dist)) {
      init$dist <-
        apply_alt_metric_uncorrection(metric, init$dist, is_sparse(reference))
    }

    if (is.null(k)) {
      k <- ncol(init$idx)
      tsmessage("Using k = ", k, " from initial graph")
    }
  } else {
    stop("Unsupported type of 'init'")
  }

  init <-
    prepare_init_graph(
      nn = init,
      k = k,
      query = query,
      data = reference,
      metric = actual_metric,
      n_threads = n_threads,
      verbose = verbose
    )

  if (is.list(reference_graph) && any(reference_graph$idx == 0)) {
    tsmessage("Warning: reference knn graph contains missing data")
  }

  stopifnot(
    !is.null(query),
    (
      methods::is(query, "matrix") ||
        methods::is(query, "sparseMatrix")
    )
  )
  stopifnot(
    !is.null(init$idx),
    methods::is(init$idx, "matrix"),
    ncol(init$idx) == k,
    nrow(init$idx) == ncol(query)
  )
  stopifnot(
    !is.null(init$dist),
    methods::is(init$dist, "matrix"),
    ncol(init$dist) == k,
    nrow(init$dist) == ncol(query)
  )
  if (is.list(reference_graph)) {
    reference_dist <- reference_graph$dist
    reference_idx <- reference_graph$idx
    stopifnot(!is.null(reference), (
      methods::is(reference, "matrix") ||
        methods::is(reference, "sparseMatrix")
    ))
    stopifnot(
      !is.null(reference_idx),
      methods::is(reference_idx, "matrix"),
      nrow(reference_idx) == ncol(reference)
    )
    stopifnot(
      !is.null(reference_dist),
      methods::is(reference_dist, "matrix"),
      nrow(reference_dist) == ncol(reference)
    )
    reference_graph_list <- graph_to_list(reference_graph)
  } else {
    stopifnot(methods::is(reference_graph, "sparseMatrix"))
    reference_graph_list <- tcsparse_to_list(reference_graph)
  }

  tsmessage(
    thread_msg(
      "Searching nearest neighbor graph with epsilon = ",
      epsilon,
      " and max_search_fraction = ",
      max_search_fraction,
      n_threads = n_threads
    )
  )

  args <- list(
    reference_graph_list = reference_graph_list,
    nn_idx = init$idx,
    nn_dist = init$dist,
    metric = actual_metric,
    epsilon = epsilon,
    max_search_fraction = max_search_fraction,
    n_threads = n_threads,
    verbose = verbose
  )
  if (is_sparse(reference)) {
    res <- do.call(
      rnn_sparse_query,
      c(
        list(
          ref_ind = reference@i,
          ref_ptr = reference@p,
          ref_data = reference@x,
          query_ind = query@i,
          query_ptr = query@p,
          query_data = query@x,
          ndim = nrow(reference)
        ),
        args
      )
    )
  } else if (is.logical(reference)) {
    res <- do.call(
      rnn_logical_query,
      c(
        list(reference = reference, query = query),
        args
      )
    )
  } else {
    res <- do.call(
      rnn_query,
      c(
        list(reference = reference, query = query),
        args
      )
    )
  }
  if (use_alt_metric) {
    res$dist <-
      apply_alt_metric_correction(metric, res$dist, is_sparse(reference))
  }
  tsmessage("Finished")
  res
}

# Search Graph Preparation ------------------------------------------------

#' Convert a nearest neighbor graph into a search graph
#'
#' Create a graph using existing nearest neighbor data to balance search
#' speed and accuracy using the occlusion pruning and truncation strategies
#' of Harwood and Drummond (2016). The resulting search graph should be more
#' efficient for querying new data than the original nearest neighbor graph.
#'
#' An approximate nearest neighbor graph is not very useful for querying via
#' [graph_knn_query()], especially if the query data is initialized randomly:
#' some items in the data set may not be in the nearest neighbor list of any
#' other item and can therefore never be returned as a neighbor, no matter how
#' close they are to the query. Even those which do appear in at least one
#' neighbor list may not be reachable by expanding an arbitrary starting list if
#' the neighbor graph contains disconnected components.
#'
#' Converting the directed graph represented by the neighbor graph to an
#' undirected graph by adding an edge from item `j` to `i` if
#' an edge exists from `i` to `j` (i.e. creating the mutual neighbor
#' graph) solves the problems above, but can result in inefficient searches.
#' Although the out-degree of each item is restricted to the number of
#' neighbors, the in-degree has no such restriction: a given item could be very
#' "popular" and appear in a large number of neighbor lists. Therefore,
#' mutualizing the neighbor graph can result in some items with a large number
#' of neighbors to search.
#' These usually have very similar neighborhoods so there is nothing to be
#' gained from searching all of them.
#'
#' To balance accuracy and search time, the following procedure is carried out:
#'
#' 1. The graph is "diversified" by occlusion pruning.
#' 1. The reverse graph is formed by reversing the direction of all edges in
#' the pruned graph.
#' 1. The reverse graph is diversified by occlusion pruning.
#' 1. The pruned forward and pruned reverse graph are merged.
#' 1. The outdegree of each node in the merged graph is truncated.
#' 1. The truncated merged graph is returned as the prepared search graph.
#'
#' Explicit zero distances in the `graph` will be converted to a small positive
#' number to avoid being dropped in the sparse representation. The one exception
#' is the "self" distance, i.e. any edge in the `graph` which links a node to
#' itself (the diagonal of the sparse distance matrix). These trivial edges
#' aren't useful for search purposes and are always dropped.
#'
#' @param data Matrix of `n` items, with observations in the rows and features
#'   in the columns. Optionally, input can be passed with observations in the
#'   columns, by setting `obs = "C"`, which should be more efficient. Possible
#'   formats are [base::data.frame()], [base::matrix()] or
#'   [Matrix::sparseMatrix()]. Sparse matrices should be in `dgCMatrix` format.
#'   Dataframes will be converted to `numerical` matrix format internally, so if
#'   your data columns are `logical` and intended to be used with the
#'   specialized binary `metric`s, you should convert it to a logical matrix
#'   first (otherwise you will get the slower dense numerical version).
#' @param graph neighbor graph for `data`, a list containing:
#'   * `idx` an `n` by `k` matrix containing the nearest neighbor indices of
#'   the data in `data`.
#'   * `dist` an `n` by `k` matrix containing the nearest neighbor distances.
#' @param metric Type of distance calculation to use. One of:
#'   - `"braycurtis"`
#'   - `"canberra"`
#'   - `"chebyshev"`
#'   - `"correlation"` (1 minus the Pearson correlation)
#'   - `"cosine"`
#'   - `"dice"`
#'   - `"euclidean"`
#'   - `"hamming"`
#'   - `"hellinger"`
#'   - `"jaccard"`
#'   - `"jensenshannon"`
#'   - `"kulsinski"`
#'   - `"sqeuclidean"` (squared Euclidean)
#'   - `"manhattan"`
#'   - `"rogerstanimoto"`
#'   - `"russellrao"`
#'   - `"sokalmichener"`
#'   - `"sokalsneath"`
#'   - `"spearmanr"` (1 minus the Spearman rank correlation)
#'   - `"symmetrickl"` (symmetric Kullback-Leibler divergence)
#'   - `"tsss"` (Triangle Area Similarity-Sector Area Similarity or TS-SS
#'   metric)
#'   - `"yule"`
#'
#'   For non-sparse data, the following variants are available with
#'   preprocessing: this trades memory for a potential speed up during the
#'   distance calculation. Some minor numerical differences should be expected
#'   compared to the non-preprocessed versions:
#'   - `"cosine-preprocess"`: `cosine` with preprocessing.
#'   - `"correlation-preprocess"`: `correlation` with preprocessing.
#'
#'   For non-sparse binary data passed as a `logical` matrix, the following
#'   metrics have specialized variants which should be substantially faster than
#'   the non-binary variants (in other cases the logical data will be treated as
#'   a dense numeric vector of 0s and 1s):
#'   - `"dice"`
#'   - `"hamming"`
#'   - `"jaccard"`
#'   - `"kulsinski"`
#'   - `"matching"`
#'   - `"rogerstanimoto"`
#'   - `"russellrao"`
#'   - `"sokalmichener"`
#'   - `"sokalsneath"`
#'   - `"yule"`
#' @param use_alt_metric If `TRUE`, use faster metrics that maintain the
#'   ordering of distances internally (e.g. squared Euclidean distances if using
#'   `metric = "euclidean"`), then apply a correction at the end. Probably
#'   the only reason to set this to `FALSE` is if you suspect that some
#'   sort of numeric issue is occurring with your data in the alternative code
#'   path.
#' @param diversify_prob the degree of diversification of the search graph
#'   by removing unnecessary edges through occlusion pruning. This should take a
#'   value between `0` (no diversification) and `1` (remove as many edges as
#'   possible) and is treated as the probability of a neighbor being removed if
#'   it is found to be an "occlusion". If item `p` and `q`, two members of the
#'   neighbor list of item `i`, are closer to each other than they are to `i`,
#'   then the nearer neighbor `p` is said to "occlude" `q`. It is likely that
#'   `q` will be in the neighbor list of `p` so there is no need to retain it in
#'   the neighbor list of `i`. You may also set this to `NULL` to skip any
#'   occlusion pruning. Note that occlusion pruning is carried out twice, once
#'   to the forward neighbors, and once to the reverse neighbors. Reducing this
#'   value will result in a denser graph. This is similar to increasing the
#'   "alpha" parameter used in the DiskANN pruning method of Subramanya and
#'   co-workers (2019).
#' @param pruning_degree_multiplier How strongly to truncate the final neighbor
#'   list for each item. The neighbor list of each item will be truncated to
#'   retain only the closest `d` neighbors, where
#'   `d = k * pruning_degree_multiplier`, and `k` is the
#'   original number of neighbors per item in `graph`. Roughly, values
#'   larger than `1` will keep all the nearest neighbors of an item, plus
#'   the given fraction of reverse neighbors (if they exist). For example,
#'   setting this to `1.5` will keep all the forward neighbors and then
#'   half as many of the reverse neighbors, although exactly which neighbors are
#'   retained is also dependent on any occlusion pruning that occurs. Set this
#'   to `NULL` to skip this step.
#' @param prune_reverse If `TRUE`, prune the reverse neighbors of each item
#'   before the reverse graph diversification step using
#'   `pruning_degree_multiplier`. Because the number of reverse neighbors can be
#'   much larger than the number of forward neighbors, this can help to avoid
#'   excessive computation during the diversification step, with little overall
#'   effect on the final search graph. Default is `FALSE`.
#' @param n_threads Number of threads to use.
#' @param verbose If `TRUE`, log information to the console.
#' @param obs set to `"C"` to indicate that the input `data` orientation stores
#'   each observation as a column. The default `"R"` means that observations are
#'   stored in each row. Storing the data by row is usually more convenient, but
#'   internally your data will be converted to column storage. Passing it
#'   already column-oriented will save some memory and (a small amount of) CPU
#'   usage.
#' @return a search graph for `data` based on `graph`, represented as a sparse
#'   matrix, suitable for use with [graph_knn_query()].
#' @examples
#' # 100 reference iris items
#' iris_ref <- iris[iris$Species %in% c("setosa", "versicolor"), ]
#'
#' # 50 query items
#' iris_query <- iris[iris$Species == "versicolor", ]
#'
#' # First, find the approximate 4-nearest neighbor graph for the references:
#' ref_ann_graph <- nnd_knn(iris_ref, k = 4)
#'
#' # Create a graph for querying with
#' ref_search_graph <- prepare_search_graph(iris_ref, ref_ann_graph)
#'
#' # Using the search graph rather than the ref_ann_graph directly may give
#' # more accurate or faster results
#' iris_query_nn <- graph_knn_query(
#'   query = iris_query, reference = iris_ref,
#'   reference_graph = ref_search_graph, k = 4, metric = "euclidean",
#'   verbose = TRUE
#' )
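#'
#' # The density of the search graph can be tuned: lowering diversify_prob and
#' # raising pruning_degree_multiplier keeps more edges, which may give more
#' # accurate but slower queries. The values below are illustrative only.
#' ref_search_graph_dense <- prepare_search_graph(iris_ref, ref_ann_graph,
#'   diversify_prob = 0.5, pruning_degree_multiplier = 2.0
#' )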
#' @references
#' Harwood, B., & Drummond, T. (2016).
#' Fanng: Fast approximate nearest neighbour graphs.
#' In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*
#' (pp. 5713-5722).
#'
#' Jayaram Subramanya, S., Devvrit, F., Simhadri, H. V., Krishnaswamy, R., & Kadekodi, R. (2019).
#' Diskann: Fast accurate billion-point nearest neighbor search on a single node.
#' *Advances in Neural Information Processing Systems*, *32*.
#' @seealso [graph_knn_query()]
#' @export
prepare_search_graph <- function(data,
                                 graph,
                                 metric = "euclidean",
                                 use_alt_metric = TRUE,
                                 diversify_prob = 1.0,
                                 pruning_degree_multiplier = 1.5,
                                 prune_reverse = FALSE,
                                 n_threads = 0,
                                 verbose = FALSE,
                                 obs = "R") {
  obs <- match.arg(toupper(obs), c("C", "R"))

  if (!is.null(pruning_degree_multiplier)) {
    stopifnot(pruning_degree_multiplier > 0)
  }
  if (!is.null(diversify_prob)) {
    stopifnot(
      diversify_prob <= 1,
      diversify_prob >= 0
    )
  }

  actual_metric <-
    get_actual_metric(use_alt_metric, metric, data, verbose)

  if (is_sparse(graph)) {
    sp <- Matrix::t(graph)
    n_nbrs <- mean(diff(sp@p))
  } else {
    n_nbrs <- check_graph(graph)$k
    tsmessage("Converting graph to sparse format")
    sp <- graph_to_csparse(graph)
  }

  sp <- preserve_zeros(sp)

  if (use_alt_metric) {
    sp@x <-
      apply_alt_metric_uncorrection(metric, sp@x, is_sparse(data))
  }

  data <- x2m(data)
  if (obs == "R") {
    data <- Matrix::t(data)
  }
  if (!is.null(diversify_prob) && diversify_prob > 0) {
    tsmessage("Diversifying forward graph")
    fdiv <- diversify(
      data,
      sp,
      metric = actual_metric,
      prune_probability = diversify_prob,
      verbose = verbose,
      n_threads = n_threads
    )
  } else {
    fdiv <- sp
    tsmessage(
      "Forward graph has # edges = ",
      Matrix::nnzero(fdiv),
      " (",
      formatC(100 * nn_sparsity_sp(fdiv)),
      "% sparse)"
    )
  }
  rsp <- reverse_knn_sp(fdiv)
  if (!is.null(diversify_prob) && diversify_prob > 0) {
    if (prune_reverse && !is.null(pruning_degree_multiplier) &&
        !is.infinite(pruning_degree_multiplier)) {
      max_degree <- max(round(n_nbrs * pruning_degree_multiplier), 1)
      tsmessage("Degree pruning reverse graph to max degree: ", max_degree)
      rsp <-
        degree_prune(
          rsp,
          max_degree = max_degree,
          verbose = verbose,
          n_threads = n_threads
        )
    }

    tsmessage("Diversifying reverse graph")
    rdiv <- diversify(
      data,
      rsp,
      metric = actual_metric,
      prune_probability = diversify_prob,
      verbose = verbose,
      n_threads = n_threads
    )
  } else {
    rdiv <- rsp
    tsmessage(
      "Reverse graph has # edges = ",
      Matrix::nnzero(rdiv),
      " (",
      formatC(100 * nn_sparsity_sp(rdiv)),
      "% sparse)"
    )
  }
  tsmessage("Merging diversified forward and reverse graph")
  merged <- merge_graphs_sp(fdiv, rdiv)

  if (!is.null(pruning_degree_multiplier) &&
    !is.infinite(pruning_degree_multiplier)) {
    max_degree <- max(round(n_nbrs * pruning_degree_multiplier), 1)
    tsmessage("Degree pruning merged graph to max degree: ", max_degree)
    res <-
      degree_prune(
        merged,
        max_degree = max_degree,
        verbose = verbose,
        n_threads = n_threads
      )
  } else {
    res <- merged
    tsmessage(
      "Merged graph has # edges = ",
      Matrix::nnzero(res),
      " (",
      formatC(100 * nn_sparsity_sp(res)),
      "% sparse)"
    )
  }

  if (use_alt_metric) {
    res@x <-
      apply_alt_metric_correction(metric, res@x, is_sparse(data))
  }

  tsmessage("Finished preparing search graph")
  # return this pre-transposed so we don't have to do it in graph_knn_query
  Matrix::t(res)
}

# Harwood, B., & Drummond, T. (2016).
# Fanng: Fast approximate nearest neighbour graphs.
# In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*
# (pp. 5713-5722).
# "Occlusion pruning"
#
# prune_probability behaves a bit like (but in the opposite direction of) alpha
# Jayaram Subramanya, S., Devvrit, F., Simhadri, H. V., Krishnaswamy, R., & Kadekodi, R. (2019).
# Diskann: Fast accurate billion-point nearest neighbor search on a single node.
# Advances in Neural Information Processing Systems, 32.
# the pynndescent implementation at cmuparlay/pbbsbench uses alpha instead of
# a prune_probability (see also https://arxiv.org/abs/2305.04359)
diversify <- function(data,
                      graph,
                      metric = "euclidean",
                      prune_probability = 1.0,
                      n_threads = 0,
                      verbose = FALSE) {
  nnz_before <- Matrix::nnzero(graph)
  sp_before <- nn_sparsity_sp(graph)
  stopifnot(methods::is(graph, "sparseMatrix"))
  gl <- csparse_to_list(graph)

  tsmessage("Occlusion pruning with probability: ", prune_probability)
  args <- list(
    graph_list = gl,
    metric = metric,
    prune_probability = prune_probability,
    n_threads = n_threads,
    verbose = verbose
  )
  if (is_sparse(data)) {
    gl_div <- do.call(rnn_sparse_diversify, c(args, list(
      ind = data@i,
      ptr = data@p,
      data = data@x,
      ndim = nrow(data)
    )))
  } else if (is.logical(data)) {
    gl_div <- do.call(rnn_logical_diversify, c(args, list(
      data = data
    )))
  } else {
    gl_div <- do.call(rnn_diversify, c(args, list(
      data = data
    )))
  }
  res <- list_to_sparse(gl_div)
  nnz_after <- Matrix::nnzero(res)
  tsmessage(
    "Diversifying reduced # edges from ",
    nnz_before,
    " to ",
    nnz_after,
    " (",
    formatC(100 * sp_before),
    "% to ",
    formatC(100 * nn_sparsity_sp(res)),
    "% sparse)"
  )
  res
}

# FANNG: Fast Approximate Nearest Neighbour Graphs
# "truncating"
# Harwood, B., & Drummond, T. (2016).
# Fanng: Fast approximate nearest neighbour graphs.
# In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*
# (pp. 5713-5722).
degree_prune <-
  function(graph,
           max_degree = 20,
           n_threads = 0,
           verbose = FALSE) {
    stopifnot(methods::is(graph, "sparseMatrix"))
    nnz_before <- Matrix::nnzero(graph)
    sp_before <- nn_sparsity_sp(graph)
    gl <- csparse_to_list(graph)

    gl_div <-
      rnn_degree_prune(gl, max_degree, n_threads = n_threads)
    res <- list_to_sparse(gl_div)
    nnz_after <- Matrix::nnzero(res)
    tsmessage(
      "Degree pruning to max ",
      max_degree,
      " reduced # edges from ",
      nnz_before,
      " to ",
      nnz_after,
      " (",
      formatC(100 * sp_before),
      "% to ",
      formatC(100 * nn_sparsity_sp(res)),
      "% sparse)"
    )
    res
  }


merge_graphs_sp <- function(g1, g2) {
  gl1 <- csparse_to_list(g1)
  gl2 <- csparse_to_list(g2)

  gl_merge <- rnn_merge_graph_lists(gl1, gl2)
  list_to_sparse(gl_merge)
}

nn_sparsity <- function(graph) {
  sum(graph$idx == 0) / prod(dim(graph$idx))
}

nn_sparsity_sp <- function(graph) {
  Matrix::nnzero(graph) / prod(graph@Dim)
}

reverse_knn_sp <- function(graph) {
  stopifnot(methods::is(graph, "sparseMatrix"))
  Matrix::t(Matrix::drop0(graph))
}

# Merge -------------------------------------------------------------------

#' Merge multiple approximate nearest neighbors graphs
#'
#' `merge_knn` takes a list of nearest neighbor graphs and merges them into a
#' single graph, with the same number of neighbors as the first graph. This is
#' useful to combine the results of multiple different nearest neighbor
#' searches: the output will be at least as accurate as the most accurate of
#' the input graphs, and ideally will be more accurate than any of them.
#'
#' @param graphs A list of nearest neighbor graphs to merge. Each item in the
#'   list should consist of a sub-list containing:
#'   * `idx` an n by k matrix containing the k nearest neighbor indices.
#'   * `dist` an n by k matrix containing k nearest neighbor distances.
#'   The number of neighbors can differ between graphs, but the merged result
#'   will have the same number of neighbors as the first graph in the list.
#' @param is_query If `TRUE` then the graphs are treated as the result of a knn
#'   query, not a knn building process. Or: is the graph bipartite? This should
#'   be set to `TRUE` if `nn_graphs` are the results of using e.g.
#'   [graph_knn_query()] or [random_knn_query()], and set to `FALSE` if these
#'   are the results of [nnd_knn()] or [random_knn()]. The difference is that if
#'   `is_query = FALSE`, if an index `p` is found in `nn_graph1[i, ]`, i.e. `p`
#'   is a neighbor of `i` with distance `d`, then it is assumed that `i` is a
#'   neighbor of `p` with the same distance. If `is_query = TRUE`, then `i` and
#'   `p` are indexes into two different datasets and the symmetry does not hold.
#'   If you aren't sure what case applies to you, it's safe (but potentially
#'   inefficient) to set `is_query = TRUE`.
#' @param n_threads Number of threads to use.
#' @param verbose If `TRUE`, log information to the console.
#' @return a list containing:
#'   * `idx` an n by k matrix containing the merged nearest neighbor indices.
#'   * `dist` an n by k matrix containing the merged nearest neighbor distances.
#'
#'   The size of `k` in the output graph is the same as that of the first
#'   item in `graphs`.
#' @examples
#' set.seed(1337)
#' # Nearest neighbor descent with 15 neighbors for iris three times,
#' # starting from a different random initialization each time
#' iris_rnn1 <- nnd_knn(iris, k = 15, n_iters = 1)
#' iris_rnn2 <- nnd_knn(iris, k = 15, n_iters = 1)
#' iris_rnn3 <- nnd_knn(iris, k = 15, n_iters = 1)
#'
#' # Merged results should be an improvement over individual results
#' iris_mnn <- merge_knn(list(iris_rnn1, iris_rnn2, iris_rnn3))
#' sum(iris_mnn$dist) < sum(iris_rnn1$dist)
#' sum(iris_mnn$dist) < sum(iris_rnn2$dist)
#' sum(iris_mnn$dist) < sum(iris_rnn3$dist)
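#'
#' # Graphs with different numbers of neighbors can also be merged; the result
#' # keeps the neighbor count of the first graph in the list (here, 15).
#' # Illustrative sketch only.
#' iris_rnn20 <- nnd_knn(iris, k = 20, n_iters = 1)
#' iris_mnn15 <- merge_knn(list(iris_rnn1, iris_rnn20))
#' ncol(iris_mnn15$idx)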
#' @export
merge_knn <- function(graphs,
                      is_query = FALSE,
                      n_threads = 0,
                      verbose = FALSE) {
  if (length(graphs) == 0) {
    return(list())
  }
  validate_are_mergeablel(graphs)

  rnn_merge_nn_all(graphs,
    is_query,
    n_threads = n_threads,
    verbose = verbose
  )
}

# Overlap -----------------------------------------------------------------

#' Overlap between the indices of two nearest neighbor graphs
#'
#' Calculates the mean number of neighbors in common between the two
#' graphs. The per-item overlap can also be returned. This function can be
#' useful as a measure of accuracy of approximation algorithms, if the
#' exact nearest neighbors are known, or as a measure of diversity of two
#' different approximate graphs.
#'
#' The graph format is the same as that returned by e.g. [nnd_knn()] and should
#' be of dimensions n by k, where n is the number of points and k is the number
#' of neighbors. If you pass a neighbor graph directly, the index matrix will be
#' extracted if present. If the two graphs have different numbers of neighbors,
#' then the smaller number of neighbors is used.
#'
#' @param idx1 Indices of a nearest neighbor graph, i.e. a matrix of nearest
#'   neighbor indices. Can also be a list containing an `idx` element.
#' @param idx2 Indices of a nearest neighbor graph, i.e. a matrix of nearest
#'   neighbor indices. Can also be a list containing an `idx` element. This is
#'   considered to be the ground truth.
#' @param k Number of neighbors to consider. If `NULL`, then the minimum of the
#'   number of neighbors in `idx1` and `idx2` is used.
#' @param ret_vec If `TRUE`, also return a vector containing the per-item overlap.
#' @return The mean overlap between `idx1` and `idx2`. If `ret_vec = TRUE`,
#'  then a list containing the mean overlap and the per-item overlap is
#'  returned, with names `mean` and `overlaps`, respectively.
#' @examples
#' set.seed(1337)
#' # Generate two random neighbor graphs for iris
#' iris_rnn1 <- random_knn(iris, k = 15)
#' iris_rnn2 <- random_knn(iris, k = 15)
#'
#' # Overlap between the two graphs
#' mean_overlap <- neighbor_overlap(iris_rnn1, iris_rnn2)
#'
#' # Also get a vector of per-item overlap
#' overlap_res <- neighbor_overlap(iris_rnn1, iris_rnn2, ret_vec = TRUE)
#' summary(overlap_res$overlaps)
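#'
#' # If exact neighbors are available, the overlap measures the accuracy of an
#' # approximate graph (1 means perfect recall). Illustrative sketch only:
#' # brute force search is slow for larger datasets.
#' iris_exact <- brute_force_knn(iris, k = 15)
#' accuracy <- neighbor_overlap(iris_rnn1, iris_exact)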
#' @export
neighbor_overlap <-
  function(idx1,
           idx2,
           k = NULL,
           ret_vec = FALSE) {
    vec <- neighbor_overlapv(idx1, idx2, k)
    mean_overlap <- mean(vec)
    if (ret_vec) {
      res <- list(
        mean = mean_overlap,
        overlaps = vec
      )
    } else {
      res <- mean_overlap
    }
    res
  }

neighbor_overlapv <-
  function(idx,
           ref_idx,
           k = NULL) {
    if (is.list(idx)) {
      idx <- idx$idx
    }
    if (is.list(ref_idx)) {
      ref_idx <- ref_idx$idx
    }

    if (is.null(k)) {
      k <- min(ncol(idx), ncol(ref_idx))
    }

    if (ncol(ref_idx) < k) {
      stop("Not enough columns in ref_idx for k = ", k)
    }

    n <- nrow(idx)
    if (nrow(ref_idx) != n) {
      stop("Not enough rows in ref_idx")
    }

    nbr_start <- 1
    nbr_end <- k

    ref_start <- nbr_start
    ref_end <- nbr_end

    nbr_range <- nbr_start:nbr_end
    ref_range <- ref_start:ref_end

    total_intersect <- rep(0, times = n)
    for (i in 1:n) {
      total_intersect[i] <-
        length(intersect(idx[i, nbr_range], ref_idx[i, ref_range]))
    }

    total_intersect / k
  }