NC.dist | R Documentation
Calculates the normalized compression distance
NC.dist(data, method="gzip", character=TRUE)
data: Matrix (or data frame) with the variables used to compute the distances between rows.
method: Compression type passed to memCompress(): "gzip", "bzip2", or "xz"; the last is very slow.
character: If TRUE (default), convert the data to character mode; if FALSE, use it as raw.
NC.dist() computes the distance based on the sizes of the compressed vectors. It is calculated as
dissimilarity(x, y) = B(x, y) - max(B(x), B(y)) / min(B(x), B(y))
where B(x) and B(y) are the byte sizes of the compressed 'x' and 'y', and B(x, y) is the compressed byte size of 'x' and 'y' concatenated. Compression is done with the base R function memCompress().
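The byte sizes that enter the formula can be inspected directly with memCompress(). The following is a minimal sketch, not the package's implementation: the helper name bsize() and the choice of iris rows are assumptions made only for illustration.

```r
## Hypothetical helper (not part of the package): compressed size in bytes
bsize <- function(v, method = "gzip") {
  # memCompress() turns a character vector into raw, joining elements with "\n"
  length(memCompress(as.character(v), type = method))
}

x <- unlist(iris[1, -5])    # one setosa row
y <- unlist(iris[51, -5])   # one versicolor row

Bx  <- bsize(x)          # B(x)
By  <- bsize(y)          # B(y)
Bxy <- bsize(c(x, y))    # B(x, y): size of the concatenated vectors

c(Bx = Bx, By = By, Bxy = Bxy)
```

For vectors this short, a good part of each size is likely to be fixed compressor overhead, so differences between rows are small in absolute terms.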
If 'data' is a data frame, NC.dist() converts it to a matrix internally. By default all columns are converted to character mode (or to raw mode if 'character=FALSE'). This default makes NC.dist() a universal distance that also tolerates NAs and zeroes.
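To see what each mode hands to the compressor, the conversion can be tried by hand. This sketch assumes the conversion behaves like base R's `mode<-` and as.raw(); the object names are illustrative and not taken from the package.

```r
m <- as.matrix(iris[1:3, -5])   # data frame -> numeric matrix

## As with character=TRUE: every value becomes a string
m_chr <- m
mode(m_chr) <- "character"
m_chr[1, ]

## As with character=FALSE: each value is coerced to a single byte (0-255);
## as.raw() drops fractional parts, so close values can collapse together
r <- as.raw(m[1, ])
r
```

This loss of precision in raw mode may be why the examples recommend it mainly for uniform variables.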
Distance object with distances among rows of 'data'
Alexey Shipunov
Cilibrasi, R., & Vitanyi, P. M. (2005). Clustering by compression. IEEE Transactions on Information Theory, 51(4), 1523-1545.
memCompress
## converts variables into character, universal method
iris.nc <- NC.dist(iris[, -5])
iris.hnc <- hclust(iris.nc, method="ward.D2")
## amazingly, it works even for vectors with length=4 (iris data rows)
plot(prcomp(iris[, -5])$x, col=cutree(iris.hnc, 3))

## using variables as raw, it is good when they are uniform
iris.nc2 <- NC.dist(iris[, -5], character=FALSE)
iris.hnc2 <- hclust(iris.nc2, method="ward.D2")
plot(prcomp(iris[, -5])$x, col=cutree(iris.hnc2, 3))

## bzip2 uses Burrows-Wheeler transform
NC.dist(matrix(runif(100), ncol=10), method="bzip2")