spe: Implements the stochastic proximity embedding algorithm
In spe: Stochastic Proximity Embedding



Embeds an N-dimensional dataset in M dimensions such that the distances (or similarities) between observations in the original N dimensions are preserved as closely as possible in the final M dimensions.

spe(coord, rcutpercent = 1, maxdist = 0,
    nobs = 0, ndim = 0, edim,
    lambda0 = 2.0, lambda1 = 0.01,
    nstep = 1e6, ncycle = 100,
    evalstress = FALSE, sampledist = TRUE, samplesize = 1e6)

`coord`	A matrix with one row per observation and one column per input dimension. A data.frame may also be supplied; it will be converted to a matrix, so all names will be lost.
`rcutpercent`	The percentage of the maximum distance (as determined by probability sampling) that will be used as the neighborhood radius. Setting rcutpercent to a value greater than 1 effectively makes the neighborhood radius infinite.
`maxdist`	If you have already calculated a maximum distance, you can supply it here. Setting maxdist to a nonzero value skips the probability sampling that would otherwise estimate the maximum distance, even if sampledist=TRUE. The default (0) is to carry out sampling.
`nobs`	The number of observations. If not specified, nobs is taken as nrow(coord).
`ndim`	The number of input dimensions. If not specified, ndim is taken as ncol(coord).
`edim`	The number of dimensions to embed in
`lambda0`	The starting value of the learning parameter
`lambda1`	The ending value of the learning parameter
`nstep`	The number of refinement steps
`ncycle`	The number of cycles to carry out refinement for
`evalstress`	If TRUE the function will evaluate the Sammon stress on the final embedding
`sampledist`	If TRUE an approximation to the maximum distance in the input dimensions will be obtained via probability sampling
`samplesize`	The number of iterations for probability sampling. For a dataset of 6070 observations there are 6070 x 6069 / 2 pairwise distances. The default value gives a close approximation and runs quickly. If you want a better approximation, 1e7 is a good value, although results will vary with the dataset.
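To illustrate how maxdist, sampledist, samplesize and rcutpercent relate, the following is a minimal sketch in base R of estimating the maximum pairwise distance by sampling random pairs of observations. The helper approx_max_dist is hypothetical and is not the package's sample.max.distance; treating rcutpercent as a fraction between 0 and 1 is also an assumption based on the default value of 1.

```r
# Hypothetical helper (NOT the package's sample.max.distance): estimate the
# maximum pairwise Euclidean distance from randomly sampled pairs of rows.
approx_max_dist <- function(coord, samplesize = 1e4) {
  coord <- as.matrix(coord)
  n <- nrow(coord)
  i <- sample.int(n, samplesize, replace = TRUE)
  j <- sample.int(n, samplesize, replace = TRUE)
  # Row-wise Euclidean distance between the sampled pairs
  d <- sqrt(rowSums((coord[i, , drop = FALSE] - coord[j, , drop = FALSE])^2))
  max(d)
}

set.seed(1)
x <- matrix(runif(200 * 5), ncol = 5)   # synthetic data, 200 observations

est   <- approx_max_dist(x, samplesize = 5000)  # sampled estimate
exact <- max(dist(x))                           # exhaustive, for comparison

# Number of distinct pairs for 6070 observations: 18419415
n_pairs <- 6070 * 6069 / 2

# Assumed interpretation: neighborhood radius as a fraction of the maximum
rcut <- 0.5 * est
```

The sampled estimate can never exceed the exhaustive maximum, and it approaches it as samplesize grows. Exhaustive evaluation scales with the square of the number of observations, which is presumably why sampling is the default.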

Efficient determination of rcut (using the connected component method) is not yet implemented. As a result, you will have to determine a suitable value of rcutpercent by trial and error. The pivot SPE method (J. Mol. Graph. Model., 2003, 22, 133-140) is also not yet implemented.
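For orientation, the pairwise refinement rule described in the SPE literature can be sketched in plain R as follows. This is an illustrative sketch only, not the package's implementation: the function name spe_sketch, the linear decrease of the learning parameter from lambda0 to lambda1 across cycles, the random initialization, and the small constant eps are all assumptions, and the default step counts are reduced so the example runs quickly.

```r
# Illustrative sketch of the SPE refinement loop (NOT the package's code).
spe_sketch <- function(coord, edim, rcut = Inf, lambda0 = 2.0, lambda1 = 0.01,
                       nstep = 1000, ncycle = 50, eps = 1e-10) {
  coord <- as.matrix(coord)
  n <- nrow(coord)
  x <- matrix(runif(n * edim), nrow = n)          # random initial embedding
  # Assumed schedule: learning parameter decreases linearly over the cycles
  lambdas <- seq(lambda0, lambda1, length.out = ncycle)
  for (lambda in lambdas) {
    for (s in seq_len(nstep)) {
      ij <- sample.int(n, 2)                      # two distinct observations
      i <- ij[1]
      j <- ij[2]
      r <- sqrt(sum((coord[i, ] - coord[j, ])^2)) # distance in input space
      d <- sqrt(sum((x[i, ] - x[j, ])^2))         # distance in the embedding
      # Adjust the pair if it lies within the neighborhood radius, or if the
      # points are embedded closer together than they are in the input space
      if (r <= rcut || d < r) {
        delta <- lambda * 0.5 * (r - d) / (d + eps) * (x[i, ] - x[j, ])
        x[i, ] <- x[i, ] + delta
        x[j, ] <- x[j, ] - delta
      }
    }
  }
  x
}

set.seed(42)
pts <- matrix(rnorm(60), ncol = 2)   # 30 synthetic observations in 2D
emb <- spe_sketch(pts, edim = 2)     # embed 2D into 2D as a sanity check
```

With rcut left at Inf every sampled pair is adjusted, which corresponds to the behavior described for rcutpercent values greater than 1. A finite rcut restricts the adjustment to neighboring pairs, plus any distant pairs that have been embedded too close together.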

If evalstress is TRUE, the return value is a list with two components named x and stress: x is the matrix of the final embedding and stress is the final Sammon stress.
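As a rough guide to what the stress value measures, the textbook form of Sammon stress can be computed in base R as sketched below. The function sammon_stress is hypothetical; the package's eval.stress may normalize or sample differently, so the numbers need not match exactly. The argument order (embedding first, original data second) mirrors the eval.stress call in the examples.

```r
# Hypothetical helper (NOT the package's eval.stress): textbook Sammon stress,
#   sum((d_orig - d_emb)^2 / d_orig) / sum(d_orig)
# taken over all distinct pairs of observations.
sammon_stress <- function(embedded, original) {
  d_orig <- as.vector(dist(original))   # pairwise distances, input space
  d_emb  <- as.vector(dist(embedded))   # pairwise distances, embedding
  keep <- d_orig > 0                    # skip duplicates to avoid dividing by 0
  sum((d_orig[keep] - d_emb[keep])^2 / d_orig[keep]) / sum(d_orig[keep])
}

set.seed(7)
orig <- matrix(rnorm(40 * 3), ncol = 3)

s_same   <- sammon_stress(orig, orig)      # identical distances
s_scaled <- sammon_stress(2 * orig, orig)  # every distance doubled
```

A perfect embedding gives a stress of 0, and the stress grows as embedded distances drift away from the original ones. Doubling every distance gives a stress of exactly 1 under this definition.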

Rajarshi Guha rajarshi@presidency.com

A Self Organizing Principle for Learning Nonlinear Manifolds, Proc. Nat. Acad. Sci., 2002, 99, 15869-15872

Stochastic Proximity Embedding, J. Comput. Chem., 2003, 24, 1215-1221

A Modified Rule for Stochastic Proximity Embedding, J. Mol. Graph. Model., 2003, 22, 133-140

A Geodesic Framework for Analyzing Molecular Similarities, J. Chem. Inf. Comput. Sci., 2003, 43, 475-484

eval.stress, sample.max.distance

## load the phone dataset
data(phone)

## run SPE, embed$stress should be 0 or very close to it
## You can plot the embedding using the scatterplot3d package
## (This will take a few minutes to run)
embed <- spe(phone, edim=3, evalstress=TRUE)

## evaluate the Sammon stress
stress <- eval.stress(embed$x, phone)

## embed the Swiss Roll dataset in 2D
data(swissroll)
embed <- spe(swissroll, edim=2, evalstress=TRUE)