---
title: "Preprocessing"
output:
  html_document:
    theme: cosmo
    toc: true
    toc_float:
      collapsed: false
---
Previous: Data Sets. Next: Input Initialization. Up: Index.
Sneer doesn't offer many preprocessing options, but the ones it exposes are useful: preprocessing can affect the final look of the embedding, and you may need to scale the data in a particular way when comparing with other embedding methods.
The following options mainly apply to input supplied as a data frame. If you supply a distance matrix, sneer won't do anything to it.
First, some preprocessing you don't need to do: some embedding methods ask you to weed out duplicates in the data before proceeding, but sneer tries to be robust enough to avoid that.
However, columns with very low or zero variance (for instance, a column that contains identical values for all observations) can cause problems. Rather than requiring you to remove these ahead of time, sneer removes them from the data frame automatically before any other preprocessing takes place, and logs to the console the number of columns it filters in this way.
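To make the filtering step concrete, here is a minimal sketch in base R of what dropping low-variance columns might look like. The helper name `drop_low_variance` and the tolerance `tol` are hypothetical, chosen for illustration; this is not sneer's internal code, and the threshold sneer actually uses may differ.

```r
# Hypothetical helper (not part of sneer): drop numeric columns whose
# variance is at or below a small tolerance. The tolerance is an assumption.
drop_low_variance <- function(df, tol = 1e-8) {
  numeric_cols <- vapply(df, is.numeric, logical(1))
  variances <- vapply(df[numeric_cols], stats::var, numeric(1))
  low <- names(variances)[variances <= tol]
  # Report how many columns were filtered, mimicking a console log
  message("Filtered ", length(low), " low-variance column(s)")
  df[, !(names(df) %in% low), drop = FALSE]
}

# Toy data: column b is constant, so it has zero variance
toy <- data.frame(a = c(1, 2, 3), b = c(5, 5, 5), c = c(0.1, 0.4, 0.2))
filtered <- drop_low_variance(toy)
names(filtered)
```

You don't need to run anything like this yourself before calling sneer; it is only meant to show which columns are at risk of being dropped.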
scale_type Parameter

By default, no scaling is applied to the input data. By providing an argument to scale_type you can scale the data in the following ways:
"none": No scaling.

"sd": Each column is centered so that it has a mean of zero and then scaled so that its variance is 1. I've seen this called "autoscaling" in chemical and biological data analysis, but I've not seen it referred to that way in other fields. Nonetheless, you may also refer to it as "auto".

"range": Range scales each column of data: each column is scaled so that its values range from 0 to 1.

"matrix": Range scales the entire matrix: like range scaling, except the entire matrix is treated like one big column.

"tsne": Columns are centered and then each element is divided by the absolute maximum value, so that all elements lie between -1 and 1. The element with the largest magnitude becomes exactly 1 or -1, depending on whether it was positive or negative. This is the scaling carried out in Barnes-Hut t-SNE.

All these arguments can be abbreviated (e.g. to "m" instead of "matrix").
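The scaling options described above can be sketched in a few lines of base R. These functions are illustrative re-implementations written from the descriptions, not sneer's own code, and the function names are made up for this example. They also assume no constant columns, which sneer filters out beforehand.

```r
# Illustrative versions of each scaling option (names are hypothetical).

# "sd": center each column and scale it to unit variance
scale_sd <- function(m) scale(m, center = TRUE, scale = TRUE)

# "range": scale each column separately to the range 0 to 1
scale_range <- function(m) {
  apply(m, 2, function(x) (x - min(x)) / (max(x) - min(x)))
}

# "matrix": scale the whole matrix to 0 to 1 as if it were one big column
scale_matrix <- function(m) (m - min(m)) / (max(m) - min(m))

# "tsne": center each column, then divide by the largest absolute value
scale_tsne <- function(m) {
  centered <- sweep(m, 2, colMeans(m))
  centered / max(abs(centered))
}

m <- as.matrix(iris[, 1:4])
range(scale_matrix(m))      # whole matrix spans 0 to 1
max(abs(scale_tsne(m)))     # largest magnitude is 1
```

Note the difference between "range" and "matrix": the former discards the relative sizes of the columns, while the latter preserves them.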
Image data is often treated by range scaling the entire matrix. Using the mnist data set as an example:
res_mnist <- sneer(mnist, scale_type = "m")
sneer Doesn't Do

Some data sets can be very high dimensional, and processing time can be greatly reduced by doing PCA on the input data and keeping only a certain number of the score vectors that result.
Similarly, it's common to carry out various forms of whitening on image data sets.
While this could become part of the preprocessing workflow, I've kept this out of sneer for a couple of reasons.

First, if you run an embedding with different parameters on the same dataset more than once, it's a bit of a waste of time to repeat PCA and/or whitening inside the internals of sneer rather than have you do it yourself once before doing any further embedding.
Second, there are often subjective criteria involved in this form of preprocessing. Everyone has their own favorite way of doing it, so rather than pointlessly expand the options even further, you should once again just do it yourself once ahead of time.
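As one possible way of doing the PCA step yourself, here is a minimal sketch using prcomp from base R's stats package. The iris data and the choice of keeping two score vectors (n_keep) are arbitrary assumptions for illustration; for a genuinely high-dimensional data set you would pick a larger number to suit your data.

```r
# Run PCA once, ahead of time, and keep only the first few score vectors.
x <- as.matrix(iris[, 1:4])
pca <- prcomp(x, center = TRUE, scale. = FALSE, retx = TRUE)

n_keep <- 2  # arbitrary choice for this example
scores <- as.data.frame(pca$x[, seq_len(n_keep), drop = FALSE])
dim(scores)

# The reduced data frame can then be reused across embedding runs, e.g.
# (assumed usage, not run here):
# res <- sneer(scores)
```

Because the scores are computed once and stored, repeated embeddings with different parameters don't pay the PCA cost again.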
At least that was a nice and gentle start to the option-wrangling. Having preprocessed the input data, we can now move on to initializing the input data that the embedding directly works on.