lvish | R Documentation |
Carry out dimensionality reduction of a dataset using a method similar to LargeVis (Tang et al., 2016).
lvish(
X,
perplexity = 50,
n_neighbors = perplexity * 3,
n_components = 2,
metric = "euclidean",
n_epochs = -1,
learning_rate = 1,
scale = "maxabs",
init = "lvrandom",
init_sdev = NULL,
repulsion_strength = 7,
negative_sample_rate = 5,
nn_method = NULL,
n_trees = 50,
search_k = 2 * n_neighbors * n_trees,
n_threads = NULL,
n_sgd_threads = 0,
grain_size = 1,
kernel = "gauss",
pca = NULL,
pca_center = TRUE,
pcg_rand = TRUE,
fast_sgd = FALSE,
ret_nn = FALSE,
ret_extra = c(),
tmpdir = tempdir(),
verbose = getOption("verbose", TRUE),
batch = FALSE,
opt_args = NULL,
epoch_callback = NULL,
pca_method = NULL,
binary_edge_weights = FALSE,
nn_args = list(),
rng_type = NULL
)
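Two defaults in the signature are derived from other arguments: n_neighbors from perplexity, and search_k from n_neighbors and n_trees. The sketch below reproduces that arithmetic in base R. The helper name lvish_default_sizes is hypothetical and is not part of the package.

```r
# Hypothetical helper (not part of the package): reproduces the default
# arithmetic shown in the usage signature.
lvish_default_sizes <- function(perplexity = 50, n_trees = 50) {
  n_neighbors <- perplexity * 3            # n_neighbors = perplexity * 3
  search_k <- 2 * n_neighbors * n_trees    # search_k = 2 * n_neighbors * n_trees
  list(n_neighbors = n_neighbors, search_k = search_k)
}

sizes <- lvish_default_sizes()
sizes$n_neighbors  # 150
sizes$search_k     # 15000
```

Changing perplexity therefore changes both the neighbor count and the search budget unless they are set explicitly.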
X |
Input data. Can be a |
perplexity |
Controls the size of the local neighborhood used for
manifold approximation. This is analogous to |
n_neighbors |
The number of neighbors to use when calculating the
|
n_components |
The dimension of the space to embed into. This defaults
to |
metric |
Type of distance metric to use to find nearest neighbors. For
For
If rnndescent is
installed and
For more details see the package documentation of If Each metric calculation results in a separate fuzzy simplicial set, which are intersected together to produce the final set. Metric names can be repeated. Because non-numeric columns are removed from the data frame, it is safer to use column names than integer ids. Factor columns can also be used by specifying the metric name
For a given data block, you may override the |
n_epochs |
Number of epochs to use during the optimization of the
embedded coordinates. The default is to calculate the number of epochs
dynamically based on dataset size, to give the same number of edge samples
as the LargeVis defaults. This is usually substantially larger than the
UMAP defaults. If |
learning_rate |
Initial learning rate used in optimization of the coordinates. |
scale |
Scaling to apply to
For lvish, the default is |
init |
Type of initialization for the coordinates. Options are:
For spectral initializations, ( |
init_sdev |
If non- |
repulsion_strength |
Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples. |
negative_sample_rate |
The number of negative edge/1-simplex samples to use per positive edge/1-simplex sample in optimizing the low dimensional embedding. |
nn_method |
Method for finding nearest neighbors. Options are:
By default, if
Multiple nearest neighbor data (e.g. from two different precomputed
metrics) can be passed by passing a list containing the nearest neighbor
data lists as items.
The |
n_trees |
Number of trees to build when constructing the nearest
neighbor index. The more trees specified, the larger the index, but the
better the results. With |
search_k |
Number of nodes to search during the neighbor retrieval. The
larger k, the more accurate the results, but the longer the search takes.
With |
n_threads |
Number of threads to use (except during stochastic gradient
descent). Default is half the number of concurrent threads supported by the
system. For nearest neighbor search, only applies if
|
n_sgd_threads |
Number of threads to use during stochastic gradient
descent. If set to > 1, then be aware that if |
grain_size |
The minimum amount of work to do on each thread. If this
value is set high enough, then fewer than |
kernel |
Type of kernel function to create input probabilities. Can be
one of |
pca |
If set to a positive integer value, reduce data to this number of
columns using PCA. Not applied if the distance |
pca_center |
If |
pcg_rand |
If |
fast_sgd |
If |
ret_nn |
If |
ret_extra |
A vector indicating what extra data to return. May contain any combination of the following strings:
|
tmpdir |
Temporary directory to store nearest neighbor indexes during
nearest neighbor search. Default is |
verbose |
If |
batch |
If |
opt_args |
A list of optimizer parameters, used when
|
epoch_callback |
A function which will be invoked at the end of every
epoch. Its signature should be:
|
pca_method |
Method to carry out any PCA dimensionality reduction when
the
|
binary_edge_weights |
If |
nn_args |
A list containing additional arguments to pass to the nearest
neighbor method. For
For
|
rng_type |
The type of random number generator to use during optimization. One of:
For backwards compatibility, by default this is unset and the choice of
|
lvish differs from the official LargeVis implementation in the following ways:
Only the nearest-neighbor index search phase is multi-threaded.
Matrix input data is not normalized.
The n_trees parameter cannot be dynamically chosen based on data set size.
Nearest neighbor results are not refined via the neighbor-of-my-neighbor method. The search_k parameter is twice as large as the default to compensate.
Gradient values are clipped to 4.0 rather than 5.0.
Negative edges are generated by uniform sampling of vertexes rather than their degree ^ 0.75.
The default number of samples is much reduced. The default number of epochs, n_epochs, is set to 5000, much larger than for umap, but may need to be increased further depending on your dataset. Using init = "spectral" can help.
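Two of the differences listed above, gradient clipping and negative sampling, can be illustrated in a few lines of base R. This is only a sketch of the ideas, not the package's internal code. The names clip_gradient and degree, and the toy values, are assumptions made for illustration.

```r
# Illustrative only: clamp each gradient component to [-limit, limit].
# The text states a limit of 4.0 for lvish and 5.0 for LargeVis.
clip_gradient <- function(g, limit = 4.0) {
  pmax(-limit, pmin(limit, g))
}

clipped <- clip_gradient(c(-10, 0.5, 7))  # -4.0  0.5  4.0

# Toy vertex degrees (made up) to compare the two negative sampling schemes.
degree <- c(1, 4, 16)

# Uniform sampling of vertexes, as described for lvish.
p_uniform <- rep(1 / length(degree), length(degree))

# Sampling proportional to degree ^ 0.75, as described for LargeVis.
p_largevis <- degree^0.75 / sum(degree^0.75)
```

Under the degree-weighted scheme, high-degree vertexes are drawn as negative samples more often than under uniform sampling.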
A matrix of optimized coordinates, or:
if ret_nn = TRUE (or ret_extra contains "nn"), returns the nearest neighbor data as a list called nn. This contains one list for each metric calculated, itself containing a matrix idx with the integer ids of the neighbors; and a matrix dist with the distances. The nn list (or a sub-list) can be used as input to the nn_method parameter.
if ret_extra contains "P", returns the high dimensional probability matrix as a sparse matrix called P, of type dgCMatrix-class.
if ret_extra contains "sigma", returns a vector of the high dimensional gaussian bandwidths for each point, and "dint" a vector of estimates of the intrinsic dimensionality at each point, based on the method given by Lee and co-workers (2015).
The returned list contains the combined data from any combination of specifying ret_nn and ret_extra.
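To show how a per-point Gaussian bandwidth relates to perplexity, here is a minimal base R sketch of the standard t-SNE-style calibration: a bisection search for the bandwidth whose neighbor distribution has the target perplexity. It assumes lvish does something similar, and it is not the package's implementation. The function names and the toy distances are hypothetical.

```r
# Perplexity of Gaussian weights over squared distances d2, at precision
# beta = 1 / (2 * sigma^2).
perplexity_at <- function(d2, beta) {
  w <- exp(-beta * (d2 - min(d2)))  # shift by the minimum for stability
  p <- w / sum(w)
  p <- p[p > 0]                     # avoid 0 * log(0)
  exp(-sum(p * log(p)))
}

# Bisection on beta until the perplexity matches the target.
calibrate_sigma <- function(d, target, tol = 1e-5, max_iter = 200) {
  d2 <- d^2
  lo <- 0
  hi <- Inf
  beta <- 1
  for (i in seq_len(max_iter)) {
    perp <- perplexity_at(d2, beta)
    if (abs(perp - target) < tol) break
    if (perp > target) {
      # Distribution is too flat: a larger beta (smaller sigma) is needed.
      lo <- beta
      beta <- if (is.finite(hi)) (lo + hi) / 2 else beta * 2
    } else {
      # Distribution is too peaked: a smaller beta (larger sigma) is needed.
      hi <- beta
      beta <- (lo + hi) / 2
    }
  }
  sqrt(1 / (2 * beta))
}

# Toy distances from one point to its 30 nearest neighbors (made up).
d <- seq(0.1, 3, length.out = 30)
sigma <- calibrate_sigma(d, target = 10)
```

The target perplexity must lie between 1 and the number of neighbors supplied.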
Belkin, M., & Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in neural information processing systems (pp. 585-591). http://papers.nips.cc/paper/1961-laplacian-eigenmaps-and-spectral-techniques-for-embedding-and-clustering.pdf
Böhm, J. N., Berens, P., & Kobak, D. (2020). A unifying perspective on neighbor embeddings along the attraction-repulsion spectrum. arXiv preprint arXiv:2007.08902. https://arxiv.org/abs/2007.08902
Damrich, S., & Hamprecht, F. A. (2021). On UMAP's true loss function. Advances in Neural Information Processing Systems, 34. https://proceedings.neurips.cc/paper/2021/hash/2de5d16682c3c35007e4e92982f1a2ba-Abstract.html
Dong, W., Moses, C., & Li, K. (2011, March). Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World Wide Web (pp. 577-586). ACM. doi:10.1145/1963405.1963487.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. https://arxiv.org/abs/1412.6980
Lee, J. A., Peluffo-Ordóñez, D. H., & Verleysen, M. (2015). Multi-scale similarities in stochastic neighbour embedding: Reducing dimensionality while preserving both local and global structure. Neurocomputing, 169, 246-261.
Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence, 42(4), 824-836.
McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426. https://arxiv.org/abs/1802.03426
O'Neill, M. E. (2014). PCG: A family of simple fast space-efficient statistically good algorithms for random number generation (Report No. HMC-CS-2014-0905). Harvey Mudd College.
Tang, J., Liu, J., Zhang, M., & Mei, Q. (2016, April). Visualizing large-scale and high-dimensional data. In Proceedings of the 25th International Conference on World Wide Web (pp. 287-297). International World Wide Web Conferences Steering Committee. https://arxiv.org/abs/1602.00370
Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9 (2579-2605). https://www.jmlr.org/papers/v9/vandermaaten08a.html
Wang, Y., Huang, H., Rudin, C., & Shaposhnik, Y. (2021). Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMap, and PaCMAP for Data Visualization. Journal of Machine Learning Research, 22(201), 1-73. https://www.jmlr.org/papers/v22/20-1061.html
# Default number of epochs is much larger than for UMAP and assumes random
# initialization. Use perplexity rather than n_neighbors to control the size
# of the local neighborhood. 20 epochs may be too small for a random
# initialization.
library(uwot)
iris_lvish <- lvish(iris,
  perplexity = 50, learning_rate = 0.5,
  init = "random", n_epochs = 20
)
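The Value section says the nn list returned with ret_nn = TRUE can be passed back through nn_method. The sketch below shows that round trip. It assumes lvish comes from the uwot package and skips the calls if the package is not installed. The perplexity and n_epochs values are arbitrary choices for a quick run.

```r
# Guarded so the snippet does nothing when uwot is not installed.
if (requireNamespace("uwot", quietly = TRUE)) {
  # First run: also return the nearest neighbor data.
  iris_res <- uwot::lvish(iris,
    perplexity = 15, n_epochs = 20,
    ret_nn = TRUE, verbose = FALSE
  )

  # One entry per metric; each holds an idx and a dist matrix.
  str(iris_res$nn, max.level = 2)

  # Second run: reuse the neighbors rather than searching again.
  iris_again <- uwot::lvish(iris,
    perplexity = 15, n_epochs = 20,
    nn_method = iris_res$nn, verbose = FALSE
  )
}
```

Reusing the neighbor data can save time when trying several optimization settings on the same dataset.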