similarity_graph | R Documentation |
Create a graph (as a sparse symmetric weighted adjacency matrix) representing the similarities between items in a data set. No dimensionality reduction is carried out. By default, the similarities are calculated using the merged fuzzy simplicial set approach in the Uniform Manifold Approximation and Projection (UMAP) method (McInnes et al., 2018), but the approach from LargeVis (Tang et al., 2016) can also be used.
similarity_graph(
X = NULL,
n_neighbors = NULL,
metric = "euclidean",
scale = NULL,
set_op_mix_ratio = 1,
local_connectivity = 1,
nn_method = NULL,
n_trees = 50,
search_k = 2 * n_neighbors * n_trees,
perplexity = 50,
method = "umap",
y = NULL,
target_n_neighbors = n_neighbors,
target_metric = "euclidean",
target_weight = 0.5,
pca = NULL,
pca_center = TRUE,
ret_extra = c(),
n_threads = NULL,
grain_size = 1,
kernel = "gauss",
tmpdir = tempdir(),
verbose = getOption("verbose", TRUE),
pca_method = NULL,
binary_edge_weights = FALSE,
nn_args = list()
)
X |
Input data. Can be a |
n_neighbors |
The size of local neighborhood (in terms of number of
neighboring sample points) used for manifold approximation. Larger values
result in more global views of the manifold, while smaller values result in
more local data being preserved. In general values should be in the range
|
metric |
Type of distance metric to use to find nearest neighbors. For
For
If rnndescent is
installed and
For more details see the package documentation of If Each metric calculation results in a separate fuzzy simplicial set, which are intersected together to produce the final set. Metric names can be repeated. Because non-numeric columns are removed from the data frame, it is safer to use column names than integer ids. Factor columns can also be used by specifying the metric name
For a given data block, you may override the |
scale |
Scaling to apply to
For |
set_op_mix_ratio |
Interpolate between (fuzzy) union and intersection as
the set operation used to combine local fuzzy simplicial sets to obtain a
global fuzzy simplicial sets. Both fuzzy set operations use the product
t-norm. The value of this parameter should be between |
local_connectivity |
The local connectivity required – i.e. the number
of nearest neighbors that should be assumed to be connected at a local
level. The higher this value the more connected the manifold becomes
locally. In practice this should be not more than the local intrinsic
dimension of the manifold. Ignored if |
nn_method |
Method for finding nearest neighbors. Options are:
By default, if
or a sparse distance matrix of type |
n_trees |
Number of trees to build when constructing the nearest
neighbor index. The more trees specified, the larger the index, but the
better the results. With |
search_k |
Number of nodes to search during the neighbor retrieval. The
larger k, the more the accurate results, but the longer the search takes.
With |
perplexity |
Used only if |
method |
How to generate the similarities between items. One of:
|
y |
Optional target data to add supervised or semi-supervised weighting
to the similarity graph . Can be a vector, matrix or data frame. Use the
Unlike |
target_n_neighbors |
Number of nearest neighbors to use to construct the
target simplicial set. Default value is |
target_metric |
The metric used to measure distance for |
target_weight |
Weighting factor between data topology and target
topology. A value of 0.0 weights entirely on data, a value of 1.0 weights
entirely on target. The default of 0.5 balances the weighting equally
between data and target. Only applies if |
pca |
If set to a positive integer value, reduce data to this number of
columns using PCA. Doesn't applied if the distance |
pca_center |
If |
ret_extra |
A vector indicating what extra data to return. May contain any combination of the following strings:
|
n_threads |
Number of threads to use. Default is half the number of
concurrent threads supported by the system. For nearest neighbor search,
only applies if |
grain_size |
The minimum amount of work to do on each thread. If this
value is set high enough, then less than |
kernel |
Used only if |
tmpdir |
Temporary directory to store nearest neighbor indexes during
nearest neighbor search. Default is |
verbose |
If |
pca_method |
Method to carry out any PCA dimensionality reduction when
the
|
binary_edge_weights |
If |
nn_args |
A list containing additional arguments to pass to the nearest
neighbor method. For
For
|
This is equivalent to running umap
with the
ret_extra = c("fgraph")
parameter, but without the overhead of
calculating (or returning) the optimized low-dimensional coordinates.
A sparse symmetrized matrix of the similarities between the items in
X
or if nn_method
contains pre-computed nearest neighbor
data, the items in nn_method
. Because of the symmetrization, there
may be more non-zero items in each column than the specified value of
n_neighbors
(or pre-computed neighbors in nn_method
).
If ret_extra
is specified then the return value will be a list
containing:
similarity_graph
the similarity graph as a sparse matrix
as described above.
nn
(if ret_extra
contained "nn"
) the nearest
neighbor data as a list called nn
. This contains one list for each
metric
calculated, itself containing a matrix idx
with the
integer ids of the neighbors; and a matrix dist
with the
distances. The nn
list (or a sub-list) can be used as input to the
nn_method
parameter.
sigma
(if ret_extra
contains "sigma"
),
a vector of calibrated parameters, one for each item in the input data,
reflecting the local data density for that item. The exact definition of
the values depends on the choice of the method
parameter.
rho
(if ret_extra
contains "sigma"
), a
vector containing the largest distance to the locally connected neighbors
of each item in the input data. This will exist only if
method = "umap"
.
localr
(if ret_extra
contains "localr"
) a
vector of the estimated local radii, the sum of "sigma"
and
"rho"
. This will exist only if method = "umap"
.
Dong, W., Moses, C., & Li, K. (2011, March). Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World Wide Web (pp. 577-586). ACM. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1145/1963405.1963487")}.
Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence, 42(4), 824-836.
McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction arXiv preprint arXiv:1802.03426. https://arxiv.org/abs/1802.03426
Tang, J., Liu, J., Zhang, M., & Mei, Q. (2016, April). Visualizing large-scale and high-dimensional data. In Proceedings of the 25th International Conference on World Wide Web (pp. 287-297). International World Wide Web Conferences Steering Committee. https://arxiv.org/abs/1602.00370
iris30 <- iris[c(1:10, 51:60, 101:110), ]
# return a 30 x 30 sparse matrix with similarity data based on 10 nearest
# neighbors per item
iris30_sim_graph <- similarity_graph(iris30, n_neighbors = 10)
# Default is to use the UMAP method of calculating similarities, but LargeVis
# is also available: for that method, use perplexity instead of n_neighbors
# to control neighborhood size. Use ret_extra = "nn" to return nearest
# neighbor data as well as the similarity graph. Return value is a list
# containing similarity_graph' and 'nn' items.
iris30_lv_graph <- similarity_graph(iris30,
perplexity = 10,
method = "largevis", ret_extra = "nn"
)
# If you have the neighbor information you don't need the original data
iris30_lv_graph_nn <- similarity_graph(
nn_method = iris30_lv_graph$nn,
perplexity = 10, method = "largevis"
)
all(iris30_lv_graph_nn == iris30_lv_graph$similarity_graph)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.