umap_transform: Add New Points to an Existing Embedding


View source: R/transform.R

Description

Carry out an embedding of new data using an existing embedding. Requires using the result of calling umap or tumap with ret_model = TRUE.

Usage

umap_transform(
  X = NULL,
  model = NULL,
  nn_method = NULL,
  init_weighted = TRUE,
  search_k = NULL,
  tmpdir = tempdir(),
  n_epochs = NULL,
  n_threads = NULL,
  n_sgd_threads = 0,
  grain_size = 1,
  verbose = FALSE,
  init = "weighted",
  batch = NULL,
  learning_rate = NULL,
  opt_args = NULL,
  epoch_callback = NULL
)

Arguments

X

The new data to be transformed, either a matrix or a data frame. Must have the same columns in the same order as the input data used to generate the model.

model

Data associated with an existing embedding.

nn_method

Optional pre-calculated nearest neighbor data.

The format is a list consisting of two elements:

  • "idx". An n_vertices x n_neighbors matrix where n_vertices is the number of items to be transformed. The contents of the matrix should be the integer indexes of the data used to generate the model, which are the n_neighbors-nearest neighbors of the data to be transformed.

  • "dist". An n_vertices x n_neighbors matrix containing the distances of the nearest neighbors.

Multiple nearest neighbor data (e.g. from two different pre-calculated metrics) can be passed by passing a list containing the nearest neighbor data lists as items. The X parameter is ignored when using pre-calculated nearest neighbor data.
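
As an illustration of the list format described above, the following sketch builds the "idx" and "dist" matrices by brute force in base R, using the iris split from the Examples section. Euclidean distance and k = 15 neighbors are assumptions made for this example only; in practice they would need to agree with the metric and number of neighbors used to build the model.

```r
# Training data (used to build the model) and new data, numeric columns only
train <- as.matrix(iris[1:100, 1:4])
test <- as.matrix(iris[101:150, 1:4])
k <- 15 # assumed number of neighbors for this sketch

# Euclidean distance between every new row and every training row
d2 <- outer(rowSums(test^2), rowSums(train^2), "+") - 2 * tcrossprod(test, train)
d <- sqrt(pmax(d2, 0)) # clamp tiny negative values caused by rounding

# One row per new item, one column per neighbor
nn_idx <- matrix(0L, nrow(test), k)
nn_dist <- matrix(0, nrow(test), k)
for (i in seq_len(nrow(test))) {
  ord <- order(d[i, ])[seq_len(k)] # indexes into the training data
  nn_idx[i, ] <- ord
  nn_dist[i, ] <- d[i, ord]
}

nn_new <- list(idx = nn_idx, dist = nn_dist)
```

A list of this shape is what nn_method describes. It could then be supplied as nn_method = nn_new in place of X, assuming a model that was built with the same metric and the same number of neighbors.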

init_weighted

If TRUE, then initialize the embedded coordinates of X using a weighted average of the coordinates of the nearest neighbors from the original embedding in model, where the weights used are the edge weights from the UMAP smoothed knn distances. Otherwise, use an un-weighted average. This parameter will be deprecated and removed at version 1.0 of this package. Use the init parameter as a replacement, replacing init_weighted = TRUE with init = "weighted" and init_weighted = FALSE with init = "average".

search_k

Number of nodes to search during the neighbor retrieval. The larger the value, the more accurate the results, but the longer the search takes. Defaults to the value used in building the model.

tmpdir

Temporary directory to store nearest neighbor indexes during nearest neighbor search. Default is tempdir. The index is only written to disk if n_threads > 1; otherwise, this parameter is ignored.

n_epochs

Number of epochs to use during the optimization of the embedded coordinates. A value between 30 and 100 is a reasonable trade-off between speed and thoroughness. By default, this value is set to one third of the number of epochs used to build the model.

n_threads

Number of threads to use (except during stochastic gradient descent). Default is half the number of concurrent threads supported by the system.

n_sgd_threads

Number of threads to use during stochastic gradient descent. If set to > 1, then be aware that if batch = FALSE, results will not be reproducible, even if set.seed is called with a fixed seed before running. Set to "auto" to use the same value as n_threads.

grain_size

Minimum batch size for multithreading. If the number of items to process in a thread falls below this number, then no threads will be used. Used in conjunction with n_threads and n_sgd_threads.

verbose

If TRUE, log details to the console.

init

How to initialize the transformed coordinates. One of:

  • "weighted" (The default). Use a weighted average of the coordinates of the nearest neighbors from the original embedding in model, where the weights used are the edge weights from the UMAP smoothed knn distances. Equivalent to init_weighted = TRUE.

  • "average". Use the mean average of the coordinates of the nearest neighbors from the original embedding in model. Equivalent to init_weighted = FALSE.

  • A matrix of user-specified input coordinates, which must have dimensions the same as (nrow(X), ncol(model$embedding)).

This parameter should be used in preference to init_weighted.
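
The matrix option for init can be sketched as follows. The helper centroid_init is hypothetical and not part of the package. It starts every new point at the centroid of the existing embedding with a little noise added, which is only one possible choice of starting layout. The call to umap_transform is guarded so that the sketch still runs when uwot is not installed.

```r
# Hypothetical helper (not part of uwot): place n_new points at the centroid
# of an existing embedding, with Gaussian noise of standard deviation sd so
# that the points do not coincide exactly.
centroid_init <- function(embedding, n_new, sd = 1e-4) {
  centroid <- colMeans(embedding)
  coords <- matrix(centroid, nrow = n_new, ncol = ncol(embedding), byrow = TRUE)
  coords + rnorm(length(coords), sd = sd)
}

if (requireNamespace("uwot", quietly = TRUE)) {
  set.seed(42)
  iris_train <- iris[1:100, ]
  iris_test <- iris[101:150, ]
  model <- uwot::umap(iris_train, ret_model = TRUE)

  # Dimensions must be (nrow(X), ncol(model$embedding))
  init_coords <- centroid_init(model$embedding, nrow(iris_test))
  iris_test_umap <- uwot::umap_transform(iris_test, model, init = init_coords)
}
```

For most uses the built-in "weighted" or "average" options are likely to be a better starting point than a hand-made matrix like this one.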

batch

If TRUE, then embedding coordinates are updated at the end of each epoch rather than during the epoch. In batch mode, results are reproducible with a fixed random seed even with n_sgd_threads > 1, at the cost of a slightly higher memory use. You may also have to modify learning_rate and increase n_epochs, so whether this provides a speed increase over the single-threaded optimization is likely to be dataset and hardware-dependent. If NULL, the transform will use the value provided in the model, if available. Default: FALSE.

learning_rate

Initial learning rate used in optimization of the coordinates. This overrides the value associated with the model. This should be left unspecified under most circumstances.

opt_args

A list of optimizer parameters, used when batch = TRUE. The default optimization method used is Adam (Kingma and Ba, 2014).

  • method The optimization method to use. Either "adam" or "sgd" (stochastic gradient descent). Default: "adam".

  • beta1 (Adam only). The weighting parameter for the exponential moving average of the first moment estimator. Effectively the momentum parameter. Should be a floating point value between 0 and 1. Higher values can smooth oscillatory updates in poorly-conditioned situations and may allow for a larger learning_rate to be specified, but too high can cause divergence. Default: 0.5.

  • beta2 (Adam only). The weighting parameter for the exponential moving average of the uncentered second moment estimator. Should be a floating point value between 0 and 1. Controls the degree of adaptivity in the step-size. Higher values put more weight on previous time steps. Default: 0.9.

  • eps (Adam only). Intended to be a small value to prevent division by zero, but in practice can also affect convergence due to its interaction with beta2. Higher values reduce the effect of the step-size adaptivity and bring the behavior closer to stochastic gradient descent with momentum. Typical values are between 1e-8 and 1e-3. Default: 1e-7.

  • alpha The initial learning rate. Default: the value of the learning_rate parameter.

If NULL, the transform will use the value provided in the model, if available.
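
A minimal sketch of how batch, n_sgd_threads and opt_args might be combined is shown below. The opt_args values simply restate the documented defaults, and the iris split, the seeds and the helper run_once are assumptions made for illustration. The sketch runs the same transform twice with the same seed so the two results can be compared; it does not assert what the comparison will show on any particular system.

```r
# Optimizer settings for batch mode; these restate the documented defaults
opt_args <- list(method = "adam", beta1 = 0.5, beta2 = 0.9, eps = 1e-7)

if (requireNamespace("uwot", quietly = TRUE)) {
  set.seed(42)
  model <- uwot::umap(iris[1:100, ], ret_model = TRUE)

  # Hypothetical helper: one multi-threaded batch-mode transform with a fixed seed
  run_once <- function() {
    set.seed(1337)
    uwot::umap_transform(
      iris[101:150, ], model,
      batch = TRUE, n_sgd_threads = 2, opt_args = opt_args
    )
  }

  first <- run_once()
  second <- run_once()

  # In batch mode the two runs are expected to agree; check rather than assume
  print(all.equal(first, second))
}
```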

epoch_callback

A function which will be invoked at the end of every epoch. Its signature should be: (epoch, n_epochs, coords, fixed_coords), where:

  • epoch The current epoch number (between 1 and n_epochs).

  • n_epochs Number of epochs to use during the optimization of the embedded coordinates.

  • coords The embedded coordinates as of the end of the current epoch, as a matrix with dimensions (N, n_components).

  • fixed_coords The originally embedded coordinates from the model. These are fixed and do not change. A matrix with dimensions (Nmodel, n_components) where Nmodel is the number of observations in the original data.
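
The callback signature above can be illustrated with a small function that records one number per epoch. The names epoch_log and track_spread are hypothetical, and the quantity logged (the mean distance of the new points from the centroid of the fixed model coordinates) is chosen only as an example of something that can be computed from coords and fixed_coords.

```r
# Filled in by the callback, one entry per epoch
epoch_log <- numeric(0)

# Hypothetical callback with the documented signature
track_spread <- function(epoch, n_epochs, coords, fixed_coords) {
  centre <- colMeans(fixed_coords)
  # Mean Euclidean distance of the new points from the model centroid
  epoch_log[epoch] <<- mean(sqrt(rowSums(sweep(coords, 2, centre)^2)))
}

if (requireNamespace("uwot", quietly = TRUE)) {
  set.seed(42)
  model <- uwot::umap(iris[1:100, ], ret_model = TRUE)
  new_coords <- uwot::umap_transform(
    iris[101:150, ], model,
    n_epochs = 20, epoch_callback = track_spread
  )
  print(length(epoch_log)) # number of epochs that were logged
}
```

Because the callback writes to a variable outside its own scope with <<-, epoch_log should be reset to numeric(0) before each new run.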

Details

Note that some settings are incompatible with the production of a UMAP model via umap: external neighbor data (passed as a list to the nn_method parameter), and factor columns that were included in the UMAP calculation via the metric parameter. In the latter case, the model produced is based only on the numeric data. A transformation is still possible, but factor columns in the new data are ignored.

Value

A matrix of coordinates for X transformed into the space of the model.

Examples

library(uwot)

iris_train <- iris[1:100, ]
iris_test <- iris[101:150, ]

# You must set ret_model = TRUE to return the extra data needed by umap_transform
iris_train_umap <- umap(iris_train, ret_model = TRUE)
iris_test_umap <- umap_transform(iris_test, iris_train_umap)

jlmelville/uwot documentation built on Nov. 29, 2021, 5:38 a.m.