sneer: R Documentation
A package for exploring probability-based embedding and related forms of dimensionality reduction. Its main goal is to implement multiple embedding methods within a single framework so comparison between them is easier, without worrying about the effect of differences in preprocessing, optimization and heuristics.
Carries out an embedding of a high-dimensional dataset into a two-dimensional scatter plot, using distance-based methods (e.g. Sammon maps) and probability-based methods (e.g. t-distributed Stochastic Neighbor Embedding).
sneer(df, indexes = NULL, ndim = 2, method = "tsne", alpha = 0.5, dof = 10, dyn = c(), lambda = 0.5, kappa = 0.5, scale_type = "none", perplexity = 32, perp_scale = "single", perp_scale_iter = NULL, perp_kernel_fun = "exp", prec_scale = "none", init = "pca", opt = "L-BFGS", eta = 1, max_iter = 1000, max_fn = Inf, max_gr = Inf, max_fg = Inf, report_every = 50, tol = 1e-04, exaggerate = NULL, exaggerate_off_iter = 100, plot_type = "plot", colors = NULL, color_name = NULL, labels = NULL, label_name = NULL, label_chars = NULL, point_size = 1, plot_labels = FALSE, color_scheme = grDevices::rainbow, equal_axes = FALSE, legend = TRUE, legend_rows = NULL, quality_measures = NULL, ret = c())
df
Data frame or distance matrix (as a dist object) to embed.

indexes
Indexes of the columns of the numerical variables to use in the embedding. The default of NULL uses all numeric columns.

ndim
Number of output dimensions (normally 2).
method
Embedding method. See 'Details'.

alpha
Heavy-tailedness parameter. Used only if the method is "hssne" or "dhssne".

dof
Initial number of degrees of freedom. Used only if the method is "itsne".

dyn
List containing kernel parameters to be optimized. See 'Details'.

lambda
NeRV parameter. Used only if the method is "nerv".

kappa
JSE parameter. Used only if the method is "jse".
scale_type
Type of scaling to carry out on the input data. See 'Details'.

perplexity
Target perplexity, or vector of trial perplexities (if perp_scale is not "single"). See 'Details'.

perp_scale
Type of perplexity scaling to apply. See 'Details'. Ignored by non-probability-based methods.

perp_scale_iter
Number of iterations to scale perplexity values over. Must be smaller than max_iter.

perp_kernel_fun
The input data weight function: either "exp" (exponential) or "step" (step function). See 'Details'.

prec_scale
Whether to scale the output kernel precision based on the input perplexity results. See 'Details'. Ignored by non-probability-based methods. Can't be used if perp_kernel_fun is "step".

init
Type of initialization of the output coordinates. See 'Details'.
opt
Type of optimizer. See 'Details'.
eta
Learning rate, used when opt = "TSNE". See 'Details'.
max_iter
Maximum number of iterations to carry out during the embedding. Ignored if the method does not require iterative optimization (e.g. "pca").

max_fn
Maximum number of cost function evaluations to carry out during the embedding. Ignored if the method does not require iterative optimization.

max_gr
Maximum number of cost gradient evaluations to carry out during the embedding. Ignored if the method does not require iterative optimization.

max_fg
Maximum number of the total of cost function and gradient evaluations to carry out during the embedding. Ignored if the method does not require iterative optimization.
report_every
Frequency (in terms of iteration number) with which to update the plot and report the cost function.

tol
Tolerance for the change in cost (evaluated at the interval determined by report_every); if the change falls below this value, the embedding terminates early.

exaggerate
If non-NULL, scale the input probabilities by this value during the early iterations of the optimization ("early exaggeration").

exaggerate_off_iter
Iteration number at which to stop the "early exaggeration" scaling specified by exaggerate.

plot_type
String code indicating the type of plot of the embedding to display: "plot" (the default) or "ggplot2". See 'Details'.
colors
Vector of colors to use to color each point in the embedding plot.

color_name
Name of a column of colors in df, used to color each point in the embedding plot.

labels
Factor vector associated with (but not necessarily in) df, used to label or color each point in the embedding plot.

label_name
Name of a factor column in df, used like labels.

label_chars
Number of characters to use for the labels in the embedding plot. Applies only when labels are displayed on the plot.

point_size
Size of the points (or label text) in the embedding plot.

plot_labels
If TRUE, display the text of the factor levels given by labels (or label_name) in place of points in the embedding plot.

color_scheme
Either a color ramp function, or the name of a ColorBrewer palette to use for mapping the factor specified by labels or label_name. See 'Details'.

equal_axes
If TRUE, the embedding plot axes are given equal ranges.

legend
If TRUE (the default), display a legend in the embedding plot.

legend_rows
Number of rows to use for displaying the legend in an embedding plot. Applies only if a legend is displayed.

quality_measures
Vector of names of quality measures to apply to the finished embedding. See 'Details'. Values of the quality measures will be printed to the screen after embedding and retained in the list returned from this function.

ret
Vector of names of extra data to return from the embedding. See 'Details'.
The embedding methods available are:
"pca"
The first two principal components.
"mmds"
Metric multidimensional scaling.
"sammon"
Sammon map.
"tsne"
t-Distributed Stochastic Neighbor Embedding of van der
Maaten and Hinton (2008).
"asne"
Asymmetric Stochastic Neighbor Embedding of Hinton and
Roweis (2002).
"ssne"
Symmetric Stochastic Neighbor Embedding of Cook et al
(2007).
"wssne"
Weighted Symmetric Stochastic Neighbor Embedding of
Yang et al (2014). Note that despite its name this version is a
modification of t-SNE, not SSNE.
"hssne"
Heavy-tailed Symmetric Stochastic Neighbor Embedding of
Yang et al (2009).
"nerv"
Neighbor Retrieval Visualizer of Venna et al (2010).
NB: The original paper suggests setting the output weight function
precisions to be equal to those of the input weights. Later papers don't
mention this. For consistency with other embedding methods, the default
behavior is not to transfer the precisions to the output function.
To transfer precisions, set prec_scale = "transfer"
.
"jse"
Jensen-Shannon Embedding of Lee at al (2013).
"itsne"
Inhomogeneous t-SNE method of Kitazono et al (2016).
"dhssne"
A "dynamic" version of HSSNE, inspired by the
inhomogeneous t-SNE Method of Kitazono et al.
Custom embedding methods can also be used, via the embedder
function.
The "dyn"
parameter allows for kernel parameters to be optimized, if
the output kernel is exponential or heavy-tailed, i.e. methods asne
,
ssne
, nerv
and jse
(which use the exponential kernel)
and hssne
(which uses the heavy-tailed kernel). The parameter
should be a list consisting of the following names:
For exponential kernels, "beta"
(the precision of the
exponential.)
For the heavy-tailed kernel, "alpha"
(the heavy-tailedness),
and "beta"
(analogous to the precision of the exponential).
alt_opt
If TRUE
, then optimize non-coordinates
separately from coordinates.
"kernel_opt_iter"
Wait this number of iterations before
beginning to optimize non-coordinate parameters.
The values of the list "beta"
and "alpha"
items should be one
of:
"global"
The parameter is the same for every point.
"point"
The value is applied per point, and can be different
for each point.
"static"
The value is fixed at its initial value and is not
optimized.
Setting a value to "static"
only makes sense for kernels where there
is more than one parameter that could be optimized and you don't want all of
them optimized (e.g. you may only want to optimize alpha in the heavy-tailed
kernel). It's an error to specify all parameters as "static"
.
The methods "dhssne"
and "itsne"
already use dynamic kernel
optimization and don't require any further specification, but specifying the
alt_opt
and kernel_opt_iter
list members will affect their
behavior.
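For example (mirroring the Examples section below), a dynamic HSSNE-like embedding can be requested by optimizing a single global heavy-tailedness parameter while keeping the kernel precisions fixed:

# alpha is optimized as one global value; beta stays at its initial value
res <- sneer(iris, method = "hssne", dyn = list(alpha = "global", beta = "static"))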
The following scaling options can be applied via the scale_type
parameter:
"none"
Do nothing. The default.
"matrix"
Range scale the entire data so that the maximum value
is 1 and the minimum 0.
"range"
Range scale each column that the maximum value in each
column is 1 and the minimum 0.
"sd"
Scale each column so that its mean is 0 and standard
deviation is 1.
"tsne"
Center each column, then scale each element by the
absolute maximum element value. This is the scaling carried out in
Barnes-Hut t-SNE.
These options can be abbreviated. The default is to do no scaling. Zero-variance columns will be removed even if no preprocessing is carried out.
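For example (as in the Examples section below), centering each input column and scaling it to unit standard deviation before embedding:

res <- sneer(iris, scale_type = "sd")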
The perplexity parameter is used in combination with the perp_scale parameter, which can take the following values:

"single"
perplexity should be a single value, which will be used over the entire course of the embedding.

"step"
perplexity should be a vector of perplexity values. Each perplexity will be used in turn over the course of the embedding, in sequential order. By starting with a large perplexity and ending with the desired perplexity, it has been suggested by some researchers that local minima can be avoided.

"multi"
The multiscaling method of Lee et al (2015). perplexity should be a vector of perplexity values. Each perplexity will be used in turn over the course of the embedding, in sequential order. Unlike with the "step" method, probability matrices from earlier perplexities are retained and combined by averaging. N.B. Multiscaling is not compatible with the methods "itsne" or "dhssne".

These options can be abbreviated.
For perp_scale values that aren't "single", if a non-vector argument is supplied to the perplexity argument, it will be ignored, and a suitable vector of perplexity values will be used instead. For "multi", these will range from the number of observations in the dataset divided by four down to 2, in descending powers of 2. For "step", 5 equally spaced values are used, ranging from the number of observations divided by 2 down to 32 (or the number of observations divided by 4, if the dataset is smaller than 65 observations).
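For example (as in the Examples section below), NeRV can be started at a large, more global perplexity and stepped down towards the target value:

res <- sneer(iris, scale_type = "sd", method = "nerv", perp_scale = "step")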
The prec_scale parameter determines whether the input weighting kernel precision parameters should be used to modify the output kernel parameter after the input probability calculation for a given perplexity value completes. The options are:

"none"
Do nothing. Most embedding methods follow this strategy, leaving the output similarity kernels to all have unit precision.

"transfer"
Transfer the input similarity kernel parameters to the output similarity kernel. This method was suggested by Venna et al (2010). It is only compatible with the methods "asne", "jse" and "nerv".

"scale"
Scale the output kernel precisions based on the target perplexity and the intrinsic dimensionality of the input data. This method is part of the multiscaling technique proposed by Lee et al (2015).

These options can be abbreviated.

The prec_scale parameter will be ignored if the method used does not have an output similarity kernel with a free parameter, e.g. "tsne" or "wtsne". Also, because the input and output similarity kernels must be of the same type, prec_scale is incompatible with setting perp_kernel_fun to "step".
For initializing the output coordinates, the options for the init parameter are:

"pca"
Initialize using the first two scores of the PCA (using classical MDS if df is a distance matrix). Data will be centered, but not scaled unless the scale_type parameter is used.

"random"
Initialize each coordinate value from a normal random distribution with a standard deviation of 1e-4, as suggested by van der Maaten and Hinton (2008).

"uniform"
Initialize each coordinate value from a uniform random distribution between 0 and 1, as suggested by Venna et al (2010).

Coordinates may also be passed directly as a matrix. The dimensions must be correct for the input data.
Character arguments can be abbreviated.
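A minimal sketch of passing a coordinate matrix directly (this assumes the 150-row iris data frame and the default ndim = 2, so the matrix must be 150 x 2):

# any 150 x 2 matrix would do; here, small random values
init_coords <- matrix(stats::rnorm(150 * 2, sd = 1e-4), ncol = 2)
res <- sneer(iris, scale_type = "sd", init = init_coords)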
For configuring the optimization method, the options for the opt parameter are:

"TSNE"
The optimization method used in the original t-SNE paper: the Jacobs method for step size selection and a step function for the momentum, switching from 0.4 to 0.8 after 250 steps. You may need to modify the eta parameter to get good results, depending on how you have scaled and preprocessed your data and the embedding method used.

"BFGS"
The Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. Requires storing an approximation to the Hessian, so not good for large datasets.

"L-BFGS"
The limited-memory BFGS method (using the last ten updates). The default method.

"NEST"
Momentum emulating Nesterov Accelerated Gradient (Sutskever and co-workers 2013).

"CG"
Conjugate gradient.

"SPEC"
The spectral direction partial Hessian method of Vladymyrov and Carreira-Perpinan (2012). Requires a probability-based embedding method and that the input probability matrix be symmetric. Some probability-based methods are not compatible (e.g. NeRV and JSE; t-SNE works with it, however). Also, while it works with the dense matrices used by sneer, the method relies on a Cholesky decomposition of the input probability matrix, which has O(N^3) complexity, and is really intended for sparse matrices; as included here, it is only practical for smaller datasets.
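For example (as in the Examples section below), the original t-SNE optimizer can be selected, usually with a larger learning rate:

res <- sneer(iris, scale_type = "m", perplexity = 25, opt = "tsne", eta = 500)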
For the quality_measures argument, a vector with one or more of the following options can be supplied:

"rocauc"
Calculate the area under the ROC curve, averaged over each observation, using the output distance matrix to rank each observation. Observations are partitioned into the positive and negative class depending upon the value of the label determined by the label_name argument. Only calculated if the label_name parameter is supplied.

"prauc"
Calculate the area under the Precision-Recall curve. Only calculated if the label_name parameter is supplied.

"rnxauc"
Calculate the area under the RNX curve, using the method of Lee et al (2015).

Options may be abbreviated.
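For example, requesting the RNX and ROC AUC measures (the ROC measure requires label_name to define the classes):

res <- sneer(iris, scale_type = "sd", label_name = "Species",
             quality_measures = c("rnxauc", "rocauc"))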
Progress of the embedding is logged to the standard output every report_every iterations (50 by default). The raw cost of the embedding is reported, along with tolerances measuring how the embedding or the cost has changed.
Because the different costs are not always scaled in a way that makes it obvious how well the embedding has performed, a normalized cost is also shown, where 0 is the minimum possible cost (coinciding with the probabilities or distances in the input and output space being matched), and a normalized cost of 1 is what you would get if you just set all the distances and probabilities to be equal to each other (i.e. ignoring any information from the input space).
Also, the embedding will be plotted. Plotting can be done either with the standard plot function (the default, or explicitly with plot_type = "plot") or with the ggplot2 library (which you need to install and load yourself), by using plot_type = "ggplot2" (these arguments may be abbreviated). The goal has been to provide enough customization to give intelligible results for most datasets. The following are things to consider:
The plot symbols are normally filled circles. However, if you set the plot_labels argument to TRUE, the labels argument can be used to provide a factor vector that gives a meaningful label for each data point. In this case, the text of each factor level will be used as the label. This creates a mess with all but the shortest labels and smallest datasets. There's also a label_fn parameter that lets you provide a function to convert the vector of labels to a different (preferably shorter) form, but you may want to just do it yourself ahead of time and add it to the data frame.
Points are colored using one of two strategies. The most straightforward way is to provide a vector of RGB color strings as an argument to colors. Each element of colors will be used to color the corresponding point in the data frame. Note, however, that this is currently ignored when plotting with ggplot2.
The second way to color the embedding plot uses the labels parameter mentioned above. Each level of the factor used for labels will be mapped to a color, and that color is used for each point. The mapping is handled by the color_scheme parameter. It can be either a color ramp function like rainbow or the name of a color scheme in the RColorBrewer package (e.g. "Set3"). The latter requires the RColorBrewer package to have been installed and loaded.
Unlike colors, providing a labels argument works with ggplot2 plots. In fact, you may find it preferable to use ggplot2, because if the legend argument is TRUE (the default), you will get a legend with the plot. Unfortunately, getting a legend with an arbitrary number of elements to fit on an image created with the graphics::plot function, without obscuring the points, proved beyond my capabilities. Even with ggplot2, a dataset with a large number of categories can generate a large and unwieldy legend.
Additionally, instead of providing the vectors directly, there are
color_name
and label_name
arguments that take a string
containing the name of a column in the data frame, e.g. you can use
labels = iris$Species
or label_name = "Species"
and get the
same result.
If you don't care that much about the colors, provide none of these options and sneer will try to work out a suitable column to use. If it finds at least one color column in the data frame (i.e. a string column where every element can be parsed as a color), it will use the last such column found, as if you had provided it as the colors argument. Otherwise, it will repeat the process looking for a factor column. If it finds one, it will map it to colors via the color_scheme, just as if you had provided the labels argument. The default color scheme is the rainbow function, so you should normally get a colorful, albeit potentially garish, result.
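For example (assuming the ggplot2 and RColorBrewer packages are installed), plotting with ggplot2 and coloring points by the Species factor using a ColorBrewer palette:

library(ggplot2)
library(RColorBrewer)
res <- sneer(iris, scale_type = "sd", label_name = "Species",
             plot_type = "ggplot2", color_scheme = "Dark2")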
For the ret argument, a vector with one or more of the following options can be supplied:

"pcost"
The final cost function value, decomposed into n contributions, where n is the number of points embedded.

"x"
The input coordinates after scaling and column filtering.

"dx"
The input distance matrix. Calculated if not present.

"dy"
The output distance matrix. Calculated if not present.

"p"
The input probability matrix.

"q"
The output probability matrix.

"prec"
The input kernel precisions (inverse of the squared bandwidth).

"dim"
The intrinsic dimensionality for each observation, calculated according to the method of Lee et al (2015). These are meaningless if not using the default exponential perp_kernel_fun.

"deg"
Degree centrality of the input probability. Calculated if not present.

"dyn"
A list of "dynamic" parameters, i.e. any non-coordinate parameters which were optimized. Only used if the dyn input parameter was non-NULL. The list will contain the values of the optimized parameters. If the alt_opt flag was set in the dyn input list, then this return list will also contain the number of cost function and gradient evaluations associated with the optimization of the parameters, as "nf" and "ng", respectively.

"costs"
A matrix containing the costs and the iterations at which they were calculated, as reported during the optimization. The number of these results is controlled by the report_every parameter.

"nf"
The number of cost function evaluations carried out during the optimization. If "dynamic" parameters were used and optimized separately from the coordinates, then this count does not include any contribution from the parameter optimization. Those counts can be found in the "dyn" return list.

"ng"
The number of cost gradient evaluations carried out during the optimization. If "dynamic" parameters were used and optimized separately from the coordinates, then this count does not include any contribution from the parameter optimization. Those counts can be found in the "dyn" return list.
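For example, returning the input and output distance matrices and the per-point costs along with the embedding:

res <- sneer(iris, scale_type = "sd", method = "wtsne",
             ret = c("dx", "dy", "pcost"))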
The color_scheme parameter is used to set the color scheme for the embedding plot that is displayed during the optimization. It can be either a color ramp function (e.g. grDevices::rainbow), accepting an integer n as an argument and returning n colors, or the name of a ColorBrewer color scheme (e.g. "Spectral"). Using a ColorBrewer scheme requires the RColorBrewer package to be installed.
For some applicable color ramp functions, see the Palettes help page in the grDevices package (e.g. by running the ?rainbow command).
The sneer function returns a list with the following elements:

coords
Embedded coordinates.

cost
Cost function value for the embedded coordinates. The type of the cost depends on the method, but lower is better.

norm_cost
cost, normalized so that a perfect embedding gives a value of 0 and one where all the distances were equal would give a value of 1.

iter
Iteration number when the embedding terminated.

Additional elements will be present in the list if ret or quality_measures are non-empty.
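For example, inspecting the returned coordinates and the normalized cost after an embedding:

res <- sneer(iris, scale_type = "sd")
head(res$coords)  # embedded coordinates, one row per observation
res$norm_cost     # 0 = perfect match, 1 = no better than equal distances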
The sneer
function provides a variety of methods for embedding,
including:
Stochastic Neighbor Embedding and variants (ASNE, SSNE and TSNE)
Metric MDS using the STRESS and SSTRESS functions
Sammon Mapping
Heavy-tailed Symmetric Stochastic Neighbor Embedding (HSSNE)
Neighbor Retrieval Visualizer (NeRV)
Jensen-Shannon Embedding (JSE)
Inhomogeneous t-SNE
See the documentation for the sneer function for the exact list of methods and variations. If you want to create variations on these methods by trying different cost functions, weighting functions and normalization schemes, see the embedder function.
Optimization is carried out with the mize package (https://github.com/jlmelville/mize), using the limited-memory BFGS method by default. Other optimization methods include the Nesterov Accelerated Gradient method (Sutskever et al 2013) with adaptive restart (O'Donoghue and Candes 2013), which tends to be more robust than the usual t-SNE optimization method across the different methods exposed by sneer.
The embed_plot
function will take the output of the
sneer
function and provide a visualization of the embedding.
If you have the RColorBrewer package installed, you can use ColorBrewer palettes by name.
Some functions are available for attempting to quantify embedding quality,
independent of the particular loss function used for an embedding method.
The nbr_pres
function will measure how well the embedding
preserves a neighborhood of a given size around each observation. The
rnx_auc_embed
function implements the Area Under the Curve
of the RNX curve (Lee et al. 2015), which generalizes the neighborhood
preservation to account for all neighborhood sizes, with a bias towards
smaller neighborhoods.
If your observations have labels which could be used for a classification task, then there are also functions which will use these labels to calculate the area under the ROC or PR (precision-recall) curve, using the embedded distances to rank each observation: these are the roc_auc_embed and pr_auc_embed functions, respectively. Note that to use these two functions, you must have the PRROC package installed.
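For example (as in the Examples section below), exporting the distance matrices from an embedding and then computing quality measures separately:

res <- sneer(iris, scale_type = "sd", ret = c("dx", "dy"))
rnx <- rnx_auc_embed(res$dx, res$dy)     # area under the RNX curve
pres32 <- nbr_pres(res$dx, res$dy, 32)   # 32-nearest-neighbor preservation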
There's a synthetic dataset in this package, called s1k. It consists of 1,000 points representing a fuzzy 9D simplex. It's intended to demonstrate the "crowding effect" and to require the sort of probability-based embedding methods provided in this package (PCA does a horrible job of separating the 10 clusters in the data). See s1k for more details.
t-SNE, SNE and ASNE: Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579-2605.
NeRV: Venna, J., Peltonen, J., Nybo, K., Aidos, H., & Kaski, S. (2010). Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research, 11, 451-490.
JSE: Lee, J. A., Renard, E., Bernard, G., Dupont, P., & Verleysen, M. (2013). Type 1 and 2 mixtures of Kullback-Leibler divergences as cost functions in dimensionality reduction based on similarity preservation. Neurocomputing, 112, 92-108.
Inhomogeneous t-SNE: Kitazono, J., Grozavu, N., Rogovschi, N., Omori, T., & Ozawa, S. (2016, October). t-Distributed Stochastic Neighbor Embedding with Inhomogeneous Degrees of Freedom. In International Conference on Neural Information Processing (ICONIP 2016) (pp. 119-128). Springer International Publishing.
Nesterov Accelerated Gradient: Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In Proceedings of the 30th international conference on machine learning (ICML-13) (pp. 1139-1147).
O'Donoghue, B., & Candes, E. (2013). Adaptive restart for accelerated gradient schemes. Foundations of computational mathematics, 15(3), 715-732.
Spectral Direction: Vladymyrov, M., & Carreira-Perpinan, M. A. (2012). Partial-Hessian Strategies for Fast Learning of Nonlinear Embeddings. In Proceedings of the 29th International Conference on Machine Learning (ICML-12) (pp. 345-352).
Cook, J., Sutskever, I., Mnih, A., & Hinton, G. E. (2007). Visualizing similarity data with a mixture of maps. In International Conference on Artificial Intelligence and Statistics (pp. 67-74).
Hinton, G. E., & Roweis, S. T. (2002). Stochastic neighbor embedding. In Advances in neural information processing systems (pp. 833-840).
Lee, J. A., Peluffo-Ordóñez, D. H., & Verleysen, M. (2015). Multi-scale similarities in stochastic neighbour embedding: Reducing dimensionality while preserving both local and global structure. Neurocomputing, 169, 246-261.
Yang, Z., King, I., Xu, Z., & Oja, E. (2009). Heavy-tailed symmetric stochastic neighbor embedding. In Advances in neural information processing systems (pp. 2169-2177).
Yang, Z., Peltonen, J., & Kaski, S. (2014). Optimization equivalence of divergences improves neighbor embedding. In Proceedings of the 31st International Conference on Machine Learning (ICML-14) (pp. 460-468).
## Not run:
# Do t-SNE on the iris dataset, scaling columns to zero mean and
# unit standard deviation.
res <- sneer(iris, scale_type = "sd")

# Use the weighted TSNE variant and export the input and output distance
# matrices.
res <- sneer(iris, scale_type = "sd", method = "wtsne", ret = c("dx", "dy"))

# calculate the 32-nearest neighbor preservation for each observation
# 0 means no neighbors preserved, 1 means all of them
pres32 <- nbr_pres(res$dx, res$dy, 32)

# Calculate the Area Under the RNX Curve
rnx_auc <- rnx_auc_embed(res$dx, res$dy)

# Load the PRROC library
library(PRROC)

# Calculate the Area Under the Precision Recall Curve for the embedding
pr <- pr_auc_embed(res$dy, iris$Species)

# Similarly, for the ROC curve:
roc <- roc_auc_embed(res$dy, iris$Species)

# Load the RColorBrewer library
library(RColorBrewer)

# Plot the embedding, with points colored by the neighborhood preservation
embed_plot(res$coords, x = pres32, color_scheme = "Blues")

## End(Not run)

## Not run:
# PCA on iris dataset and plot result using Species label name
res <- sneer(iris, indexes = 1:4, label_name = "Species", method = "pca")

# Same as above, but with sensible defaults (use all numeric columns, plot
# with first factor column found)
res <- sneer(iris, method = "pca")

# Can use a distance matrix as input with external vector of labels
res <- sneer(dist(iris[1:4]), method = "pca", labels = iris$Species)

# scale columns so each one has mean 0 and variance 1
res <- sneer(iris, method = "pca", scale_type = "sd")

# full species name on plot is cluttered, so just use the first two
# letters and half size
res <- sneer(iris, method = "pca", scale_type = "sd", label_chars = 2, point_size = 0.5)

library(ggplot2)
library(RColorBrewer)
# Use ggplot2 and RColorBrewer palettes for the plot
res <- sneer(iris, method = "pca", scale_type = "sd", plot_type = "g")

# Use a different ColorBrewer palette, bigger points, and range scale each
# column
res <- sneer(iris, method = "pca", scale_type = "r", plot_type = "g", color_scheme = "Dark2", point_size = 2)

# metric MDS starting from the PCA
res <- sneer(iris, method = "mmds", scale_type = "sd", init = "p")

# Sammon map starting from random distribution
res <- sneer(iris, method = "sammon", scale_type = "sd", init = "r")

# TSNE with a perplexity of 32, initialize from PCA
res <- sneer(iris, method = "tsne", scale_type = "sd", init = "p", perplexity = 32)

# default settings are to use TSNE with perplexity 32 and initialization
# from PCA so the following is the equivalent of the above
res <- sneer(iris, scale_type = "sd")

# Use the standard tSNE optimization method (Jacobs step size method) with
# step momentum. Range scale the matrix and use an aggressive learning
# rate (eta).
res <- sneer(iris, scale_type = "m", perplexity = 25, opt = "tsne", eta = 500) # Use the L-BFGS optimization method res <- sneer(iris, scale_type = "sd", opt = "L-BFGS") # Use the Spectral Directions method res <- sneer(iris, scale_type = "sd", opt = "SPEC") # Use Conjugate Gradient res <- sneer(iris, scale_type = "sd", opt = "CG") # NeRV method, starting at a more global perplexity and slowly stepping # towards a value of 32 (might help avoid local optima) res <- sneer(iris, scale_type = "sd", method = "nerv", perp_scale = "step") # NeRV method has a lambda parameter - closer to 1 it gets, the more it # tries to avoid false positives (close points in the map that aren't close # in the input space): res <- sneer(iris, scale_type = "sd", method = "nerv", perp_scale = "step", lambda = 1) # Original NeRV paper transferred input exponential similarity kernel # precisions to the output kernel, and initialized from a uniform random # distribution res <- sneer(iris, scale_type = "sd", method = "nerv", perp_scale = "step", lambda = 1, prec_scale = "t", init = "u") # Like NeRV, the JSE method also has a controllable parameter that goes # between 0 and 1, called kappa. It gives similar results to NeRV at 0 and # 1 but unfortunately the opposite way round! The following gives similar # results to the NeRV embedding above: res <- sneer(iris, scale_type = "sd", method = "jse", perp_scale = "step", kappa = 0) # Rather than step perplexities, use multiscaling to combine and average # probabilities across multiple perplexities. Output kernel precisions # can be scaled based on the perplexity value (compare to NeRV example # which transferred the precision directly from the input kernel) res <- sneer(iris, scale_type = "sd", method = "jse", perp_scale = "multi", prec_scale = "s") # HSSNE has a controllable parameter, alpha, that lets you control how # much extra space to give points compared to the input distances. # Setting it to 1 is equivalent to TSNE, so 1.1 is a bit of an extra push: res <- sneer(iris, scale_type = "sd", method = "hssne", alpha = 1.1) # DHSSNE is a "dynamic" extension to HSSNE which will modify alpha from # its starting point, similar to how it-SNE works (except there's # only one global value being optimized) # Setting alpha simply chooses the initial value res <- sneer(iris, method = "dhssne", alpha = 0.5) # Can make other embedding methods "dynamic" in the style of it-SNE and # DSSNE. Here we let the ASNE output kernel have different precision # parameters: res <- sneer(iris, method = "asne", dyn = list(beta = "point")) # DHSSNE could be defined manually like this: alpha is optimized as a single # global parameter, while the beta parameters are not optimized res <- sneer(iris, method = "hssne", dyn = list(alpha = "global", beta = "static")) # Allow both alpha and beta in the heavy-tailed function to vary per-point: res <- sneer(iris, method = "hssne", dyn = list(alpha = "point", beta = "point")) # it-SNE has a similar degree of freedom parameter to HSSNE's alpha, but # applies independently to each point and is optimized as part of the # embedding. # Setting dof chooses the initial value (1 is like t-SNE, large values # approach ASNE) # kernel_opt_iter sets how many iterations with just coordinate # optimization before including dof optimization too. 
res <- sneer(iris, method = "itsne", dof = 10, dyn = list(kernel_opt_iter = 50)) # wTSNE treats the input probability like a graph where the probabilities # are weighted edges and adds extra repulsion to nodes with higher degrees res <- sneer(iris, scale_type = "sd", method = "wtsne") # can use a step-function input kernel to make input probability more like # a k-nearest neighbor graph (but note that we don't take advantage of the # sparsity for performance purposes, sadly) res <- sneer(iris, scale_type = "sd", method = "wtsne", perp_kernel_fun = "step") # Some quality measures are available to quantify embeddings # The area under the RNX curve measures whether neighbors in the input # are still neighors in the output space res <- sneer(iris, scale_type = "sd", method = "wtsne", quality_measures = c("rnxauc")) # Create a 5D gaussian with its own column specifying colors to use # for each point (in this case, random) g5d <- data.frame(matrix(rnorm(100 * 5), ncol = 5), color = rgb(runif(100), runif(100), runif(100)), stringsAsFactors = FALSE) # Specify the name of the color column and the plot will use it rather than # trying to map factor levels to colors res <- sneer(g5d, method = "pca", color_name = "color") # If your dataset labels divide the data into natural classes, can # calculate average area under the ROC and/or precision-recall curve too, # but you need to have installed the PRROC package. # All these techniques can be slow (scale with the square of the number of # observations). library(PRROC) res <- sneer(iris, scale_type = "sd", method = "wtsne", quality_measures = c("rnx", "roc", "pr")) # export the distance matrices and do whatever quality measures we # want at our leisure res <- sneer(iris, scale_type = "sd", method = "wtsne", ret = c("dx", "dy")) # Calculate the Area Under the Precision Recall Curve for the embedding pr <- pr_auc_embed(res$dy, iris$Species) # Similarly, for the ROC curve: roc <- roc_auc_embed(res$dy, iris$Species) # export per-point error, degree centrality, input weight function # precision parameters and intrinsic dimensionality res <- sneer(iris, scale_type = "sd", method = "wtsne", ret = c("pcost", "deg", "prec", "dim")) # Plot the embedding as points colored by category, using the rainbow # color ramp function: embed_plot(res$coords, iris$Species, color_scheme = rainbow) # Load the RColorBrewer Library library(RColorBrewer) # Use a ColorBrewer Qualitative color scheme name (pass a string, not # a function!) embed_plot(res$coords, iris$Species, color_scheme = "Dark2") # Visualize embedding colored by various values: # Per-point embedding error embed_plot(res$coords, x = res$pcost) # Degree centrality embed_plot(res$coords, x = res$deg) # Intrinsic Dimensionality using the PRGn palette embed_plot(res$coords, x = res$dim, color_scheme = "PRGn") # Input weight function precision parameter with the Spectral palette embed_plot(res$coords, x = res$prec, color_scheme = "Spectral") # calculate the 32-nearest neighbor preservation for each observation # 0 means no neighbors preserved, 1 means all of them pres32 <- nbr_pres(res$dx, res$dy, 32) embed_plot(res$coords, x = pres32, cex = 1.5) ## End(Not run)