title: "Visualization" output: html_document: theme: cosmo toc: true toc_float: collapsed: false
Previous: Analysis. Next: References. Up: Index.
During the embedding, sneer
will display the current state of the output
configuration. But you can also use a standalone function to visualize the
coordinates (or any 2D matrix), so we'll look at that first while we run
through the different options available.
embed_plot
embed_plot
is a function you can use to easily view the results of the plot.
Just looking at the output as an undifferentiated mass is rarely enlightening,
so you will normally want to color the points in some way. Three main strategies
come to mind:
embed_plot
can help you with that. However, if you go with the
categories-to-colors plot, you won't get a legend, so you will be able to
discern any clustering by the colors, but you won't know which color maps
to which category. Sorry. I was unable to find a satisfactory solution with
the standard graphics::plot
commands that could guarantee the legend would
be in a sensible place and location for arbitrary numbers of categories and
category names of arbitrary length.
Anyway, let's run through some examples. Let's do a quick tsne embedding on our
old friend the iris dataset, and lets extract the degree centrality (deg
) and
embedding error (pcost
) of each node too so we have something numeric to
project onto the plot.
tsne_iris <- sneer(iris, max_iter = 200, ret = c("deg", "pcost"))
Let's take a look at the embedding:
Well, that's garish. This is because you haven't provided any color information,
so the default is that each point gets its own color from the default color
scheme, which happens to be the grDevices::rainbow
function. This might be
useful for some simulation data sets where the data is produced by a sequential
function of some kind (e.g. a random walk or a path around a circle or some
other parametric function) and the progression of the color gives some clue
about location.
You can also omit the "$coords" bit when passing the result of sneer
to
embed_plot
:
embed_plot(tsne_iris$coords)
# Same as the above, less typing
embed_plot(tsne_iris)
colors
If you just want everything to be one color, pass that to the colors
parameter:
If you have a preset vector of colors, that makes a lot more sense. Perhaps, inspired by the technicolor dream plot of the first example, you want each iris to have its own random color:
niris <- nrow(iris)
iris_colors <- rgb(runif(niris), runif(niris), runif(niris))
Beautiful.
More sensibly, the datasets used in
How to use t-SNE Effectively, and
available in the github package snedata
have a pre-calculated color
column.
Do the installation, and then carry out PCA on the three clusters simulation data:
devtools::install_github("jlmelville/snedata")
library("snedata")
three_c <- three_clusters_data(50)
pca_3c <- sneer(three_c, method = "pca")
View the results with a less foolish set of colors than random:
Back to our long-suffering iris example. It consists of measurements of the
sepal and petal dimensions of 150 flowers, from 3 species: Iris setosa,
Iris versicolor and Iris virginica. That species information is in the
iris$Species
column, so it would be natural to map from those categories
to colors.
To do this, pass the iris$Species
column to the x
argument. This argument
can be a data frame or a column. embed_plot
will try and work out what to
do based on what it's passed. In this case, the result is:
Three species, three colors (sorry you can't see a legend).
color_scheme
If you don't like the color scheme used, you can change it by passing a different
color function to color_scheme
. For example, here's the use of the
topo.colors
function, also from the grDevices
package, like rainbow
:
You can write your own color function as long as it follows the same interface
as the ones in grDevices
: you pass in a single integer, n
, and you get back
a vector of n
colors. For instance, perhaps you really liked those random
colors. Enshrine your non-deterministic creativity as a function:
random_colors <- function(n) {
rgb(runif(n), runif(n), runif(n)
}
Now, whenever you want each iris species to get a random color, it's as simple as:
RColorBrewer
If, like me, you have little creativity when it comes to picking color schemes,
I recommend just deferring to the ColorBrewer
color schemes (although other color schemes exist).
These are available both for numerical scales, and for mapping categorically.
Use the RColorBrewer
package to get access to them in R. You need to install and load it
manually if you want to use it with embed_plot
:
install.packages("RColorBrewer")
library(RColorBrewer)
To use a scheme from ColorBrewer, pass the name of a scheme to the
color_scheme
parameter, rather than a function. Find the names by running
RColorBrewer::brewer.pal.info()
or RColorBrewer::display.brewer.all()
.
For mapping factors to colors, you should be using one of the "qual" (short for "qualitative") schemes, for example "Accent":
If the data set you are using has more factors than the color scheme has colors,
embed_plot
attempts to interpolate the colors into a new set of colors. Not
ideal, but better than nothing.
x
Previously, I mentioned that if you pass a data frame rather than a vector to
the x
parameter, the function will attempt to do the right thing. What this
means in practice is:
This is a convenience for when you have multiple data sets you're looking at, they have different column names and types that you want to use for coloring and you don't want to remember them all.
As x
is the second argument to embed_plot
positionally, you don't even
need to refer to it by name, which makes using embed_plot
even easier. For
example, the following pairs of invocations are equivalent to each other:
embed_plot(pca_3d$coords, colors = three_c$color)
embed_plot(pca_3d$coords, three_c) # will find the "color" column
embed_plot(tsne_iris$coords, x = iris$Species)
embed_plot(tsne_iris$coords, iris) # will find the "Species" column
embed_plot(tsne_iris$coords, colors = iris_colors)
# detects it's already been passed a color vector when passed as 'x' argument
embed_plot(tsne_iris$coords, iris_colors)
In some cases, you may want to plot the factor levels you're using for the color
scheme as labels on the plot, instead of the points themselves, especially
because of the lack of a legend. To do this, set the text
argument to the
column you want to use. Let's finally find out which species is which in the
iris plot:
Um, I suppose you can just about make out the names. Perhaps it'll look better
if we reduce the size of the text with the cex
parameter and only plot
the first two characters of the species name?
That's a bit better. As you can see, this probably isn't a great option, unless you have a small dataset with short labels.
You can also map a numeric vector to a point. For example, if you use return
the point-wise decomposition of the cost function, pcost
, you can get a
sense for which points have been well-embedded. It's important to avoid
drawing conclusions about similarity of points based on their embedded
proximity if they have a high individual embedding error.
Choosing a good color scheme is very important here, so I recommend the ColorBrewer sequential and diverging schemes. Here's the iris t-SNE absolute error plot with the "Blues" sequential color scheme:
I've use abs(tsne_iris$pcost)
, because for the Kullback-Leibler divergence,
the individual point-wise error can be negative or positive. For this example,
though, I only care about the magnitude of the embedding error. The setosa
class down in the bottom left is very well embedded, the light blue color
indicating little contribution to the embedding error. The embedding clearly had
a bit more trouble where the virginica and versicolor clusters overlap.
The degree centrality - a measure of how "important" each point is - might also provide some information, although note that it is calculated using the input probability only. It's not using the output coordinates or error directly. Once again, we'll use the Blues color scheme:
The dark blue points are inside the clusters, with the least important points on the edges. Makes sense. Good job, t-SNE (the degree centrality
As you can see, sequential schemes highlight large values. If you want to
highlight values at both ends of the scale, use a diverging scale, like
RdYlGn
:
There are some extra parameters that can be used to control the display of the
numerical mapping. Sometimes, you may only want to see the points with the
largest values. Use the top
argument for this. For example, here are the
top 15 most important iris points:
By default the color scheme is mapped to the extent of the numerical vector you provide. Sometimes, there's an absolute scale you may want to use, however. For instance, the neighborhood preservation value, discussed in greater detail in the Analysis section takes a value between 0 and 1, representing an embedded point that has retained none or all of its neighbors, respectively.
Let's see how well PCA does at preserving each point's 25 neighbors. Note
that we need to ret
urn the input distance matrix (dx
) and output distance
matrix (dy
) for the nbr_pres
calcuation:
pca_iris <- sneer(iris, ret = c("dx", "dy"), method = "pca")
pca_nbrp25 <- nbr_pres(pca_iris$dx, pca_iris$dy, 25)
Running summary(pca_nbrp25)
shows that PCA does a pretty good job:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.5600 0.8800 0.9200 0.8939 0.9600 1.0000
If we plot the nbr preservation values as colors (with the size of the points
increased via the cex
parameter):
There are a few red points in the middle of the two right hand clusters - that's
where the points have the lowest neighborhood preservations are. But given
they're never lower than 0.56, you could replot this, using the full preservation
scale, by passing the lower and upper limits to the limits
argument:
Doesn't seem so bad now those red points have turned yellow.
There are a few other commands that can modify the look of the plot.
As we've seen in a few places, cex
scales the size of the points (or the text
if the text
argument is used). Set it to less than 1 to make the points
smaller than default, and greater than 1 to make them bigger.
equal_axes
, if set to TRUE
will make the extent of the X and Y axis the
same. This is useful if you would rather have a 1:1 aspect ratio, at the
expense of more whitespace and in your plot. Here's the iris plot yet again,
now with equal X and Y axes:
The clusters now look a lot less "tall" and stretched in the vertical direction.
sneer
Much of what has been described here applies to the plots which appear during
the embedding process when running sneer
. Here are the options to be aware
of.
colors
, color_name
You may provide a vector of colors directly to sneer
via the colors
parameter, or the name of a column in the data frame via color_name
.
sneer(three_c, colors = three_c$color)
sneer(three_c, color_name = "color") # the same as the previous command
labels
, label_name
Similarly, you can map from a factor to colors with the labels
and
label_name
parameters:
sneer(iris, labels = iris$Species)
sneer(iris, label_name = "Species") # the same as the previous command
Just like embed_plot
, sneer
will try and find a color of label column from
the data frame you pass it:
sneer(iris, labels = iris$Species)
sneer(iris) # equivalent to the above
The same rules for finding columns applies for sneer
as for embed_plot
:
color columns are returned in preference to factor columns, and if nothing is
found, one color is used per point. If more than one column is found, the
last column is used.
color_scheme
You can specify the color scheme just like in embed_plot
: pass either a color
function like rainbow
or the name of a ColorBrewer color scheme, e.g.
"Set1"
.
plot_type
If you embed in other than 2 dimensions, you won't see a plot. You can also
turn off plotting by setting plot_type = "none"
.
There is also support for using ggplot2
. As usual, you need to install and
load it yourself:
install.package("ggplot2")
library(ggplot2)
Then, set plot_type = "ggplot2"
(or just "g"
):
Could that be..? Yes it is. It's a legend at long, long last. This is the big
advantage of using the ggplot2
plotting during the embedding. It's not
currently supported by embed_plot
.
Before you get too excited about legends and everything there are some downsides
to using it. First, even though positioning the legend works much better with
ggplot2
than using plot
, you still can have a horrible things happen if
you have a very large legend (because of lots of categories or long category
names). You can try fiddling with the legend_rows
parameter in this case,
which specifies how many rows to try and lay the legend out on. This might help
prevent the legend from getting too high or wide. If the worst comes to the
worst you can set the legend
parameter to FALSE
to turn it off.
Also, coloring the plot has to be done by mapping from a factor to a color scheme currently (i.e. no columns of explicit color names). Finally, you can't choose to display factor labels instead of points (but at least you have a legend).
cex
As with embed_plot
, the cex
parameter changes the size of the plotted
points.
equal_axes
Set this to TRUE
to have the axes have the same scale.
plot_labels
If TRUE
, whatever factor column provided to sneer
(or it finds itself)
will be plotted as labels on the embedding plot. As discussed under the
text
parameter of embed_plot
, has limited (but not zero) use.
embed_plotly
As an alternative to embed_plot
, you can try out embed_plotly
, which
uses the JavaScript charting library plotly and its R API
to generate embedding plots. You will need to install the
plotly package, ensuring the
version you are using is at least version 4:
install.package("plotly")
library("plotly")
packageVersion("plotly") # This needs to begin with '4.'
If you're using RStudio, calling embed_plotly
won't seem all that different to using embed_plot
. However, if you're using
the usual R shell, then it will open the plot in your default web browser. If
you can handle that, then there are a few advantages to using plotly.
First:
Yep, you get a legend.
Second:
And numeric vectors gets a color scale. Also, if you pass in some labels as
the text
argument, they will show up in a tooltip if you hover over a point.
Third, there are lots of panning and zooming options if you want to interactively explore different areas of the plot in more detail.
The only minor downside to the use of embed_plotly
is that it doesn't support
the limits
and top
arguments when used with a numeric vector. A small
price to pay.
Previous: Analysis. Next: References. Up: Index.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.