knitr::opts_chunk$set( collapse = TRUE , comment = "#>" )
disto is a R package that provides a high level API to interface over backends storing distance, dissimilarity, similarity matrices with matrix style extraction, replacement and other utilities. Disto supports in-memory distance matrices of 'dist' class and on-disk distance matrices of class 'bigdist'.
R provides "dist" class for storing distance objects. Under the hood, it is a numeric vector storing lower triangular matrix (diagonal excluded) in column order along with a few attributes. There are methods to subset ([[
), print and coerce them from and to matrices using as.dist
and as.matrix
respectively.
In general,
d[1:5, ]
would not work.d[1, 2] <- 3
disto was conceived to address these issues while keeping dist object as the back-end with the philosophy of minimal copies. This evolved into high-level API for dealing with generic distance objects irrespective of whether the object is in memory, disk or a database. Currently, the bindings are provided for in-memory objects of class 'dist' and on-disk distance matrices of class 'bigdist' from bigdist package.
library("disto") # create a dist object do <- dist(mtcars) # create a disto connection (does not nake a copy of do) dio <- disto(objectname = "do") # what's dio dio # what does it actually contain unclass(dio) # summary of the distance object underneath summary(dio) # what is the size? size(dio) # what are the names? names(dio) # quick plot plot(dio)
The idea is to provide an interface so that user does not worry about the storage and interacts with a matrix-like distance object without coercing as a matrix. Matrix coercion can be costly memory-wise when the dist object is large.
# what is the distance between 1st and 2nd element # note that this returns a matrix dio[1, 2] # this should be same as above, except the matrix is transposed dio[2, 1] # extract using names/labels dio["Mazda RX4 Wag", "Mazda RX4"] # for a single value extraction, `[[` is efficient as it does less work dio[[3, 4]] # dio[["Mazda RX4 Wag", "Mazda RX4"]] wont work, only integer index is supported in `[[` # neither would dio[[c(1, 2), 3]] # extract dio[1:5, 9:12] # extract mixed dio[1:5, c("Merc 240D", "Merc 230")] # exclude i or j dim(dio[1:2, ]) dim(dio[, 1:2]) dim(dio[,]) # All examples worked in outer product way # Specify product type as inner to extract diagonals only dio[1:5, 9:12, product = "inner"] # use lower triangular indexing dio[k = 1] # same as dio[1, 2] dio[k = 1:5]
# replace a value dio[1, 2] <- 100 # did it replace? dio[1, 2] # did it really replace at source do[1] # yes, it did # replacement is vectorized in inner product sense dio[1:5, 2:6] <- 7:11 dio[1:5, 2:6, product = "inner"]
The flow of as.matrix(do) %>% apply(1, somefunction)
is convenient. dapply
provides 'lapply' like functionality without coercion to a matrix. This slower than the above flow but consumes much less memory. dapply
is parallelized on UNIX-based systems.
# lets find indexes of five nearest neighbors for each observation/item # function to pick unsorted indexes of 5 nearest neighbors (excepting itself) udf_nn <- function(distances, index){ nnPlusOne <- which(data.table::frankv(distances, ties.method = "dense") <= 6) setdiff(nnPlusOne, index) } hi <- dapply(dio, 1, udf_nn) head(hi)
Two nearest methods are implemented: k nearest neighbors and fixed radius nearest neighbors for 'disto' handles and matrices.
# k nearest neighbors nn(dio, k = 1) knn <- nn(as.matrix(mtcars), k = 1) # distance method defaults to 'euclidean' knn str(knn) # observe the structure of the output # frnn nn(dio, r = 3) nn(as.matrix(mtcars), r = 3)
The workhorse functions for the dist class are dist_extract
and dist_replace
.
dist_extract(do, 1:5, 2:7) do <- dist_replace(do, 1:3, 4:6, 101:103) dist_extract(do, 1:3, 4:6, product = "inner")
Author: Srikanth KS, sri.teach@gmail.com
URL: https://github.com/talegari/disto
BugReports: https://github.com/talegari/disto/issues
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.