knitr::opts_chunk$set(
  collapse = TRUE
  , comment = "#>"
)

Introduction

disto is a R package that provides a high level API to interface over backends storing distance, dissimilarity, similarity matrices with matrix style extraction, replacement and other utilities. Disto supports in-memory distance matrices of 'dist' class and on-disk distance matrices of class 'bigdist'.

Why disto?

R provides "dist" class for storing distance objects. Under the hood, it is a numeric vector storing lower triangular matrix (diagonal excluded) in column order along with a few attributes. There are methods to subset ([[), print and coerce them from and to matrices using as.dist and as.matrix respectively.

In general,

disto was conceived to address these issues while keeping dist object as the back-end with the philosophy of minimal copies. This evolved into high-level API for dealing with generic distance objects irrespective of whether the object is in memory, disk or a database. Currently, the bindings are provided for in-memory objects of class 'dist' and on-disk distance matrices of class 'bigdist' from bigdist package.

Examples

Creating disto and exploration

library("disto")

# create a dist object
do <- dist(mtcars)

# create a disto connection (does not nake a copy of do)
dio <- disto(objectname = "do")

# what's dio
dio

# what does it actually contain
unclass(dio)

# summary of the distance object underneath
summary(dio)

# what is the size?
size(dio)

# what are the names?
names(dio)

# quick plot
plot(dio)

Extract and Replace

Extract

The idea is to provide an interface so that user does not worry about the storage and interacts with a matrix-like distance object without coercing as a matrix. Matrix coercion can be costly memory-wise when the dist object is large.

# what is the distance between 1st and 2nd element
# note that this returns a matrix
dio[1, 2]

# this should be same as above, except the matrix is transposed
dio[2, 1]

# extract using names/labels
dio["Mazda RX4 Wag", "Mazda RX4"]

# for a single value extraction, `[[` is efficient as it does less work
dio[[3, 4]] 
# dio[["Mazda RX4 Wag", "Mazda RX4"]] wont work, only integer index is supported in `[[`
# neither would dio[[c(1, 2), 3]]

# extract
dio[1:5, 9:12]

# extract mixed
dio[1:5, c("Merc 240D", "Merc 230")]

# exclude i or j
dim(dio[1:2, ])
dim(dio[, 1:2])
dim(dio[,])

# All examples worked in outer product way
# Specify product type as inner to extract diagonals only
dio[1:5, 9:12, product = "inner"]

# use lower triangular indexing
dio[k = 1] # same as dio[1, 2]
dio[k = 1:5]

Replace

# replace a value
dio[1, 2] <- 100

# did it replace?
dio[1, 2]

# did it really replace at source
do[1] # yes, it did

# replacement is vectorized in inner product sense
dio[1:5, 2:6] <- 7:11
dio[1:5, 2:6, product = "inner"]

'apply' like function

The flow of as.matrix(do) %>% apply(1, somefunction) is convenient. dapply provides 'lapply' like functionality without coercion to a matrix. This slower than the above flow but consumes much less memory. dapply is parallelized on UNIX-based systems.

# lets find indexes of five nearest neighbors for each observation/item

# function to pick unsorted indexes of 5 nearest neighbors (excepting itself)
udf_nn <- function(distances, index){
  nnPlusOne <- which(data.table::frankv(distances, ties.method = "dense") <= 6)
  setdiff(nnPlusOne, index)
}
hi <- dapply(dio, 1, udf_nn)
head(hi)

Exact Nearest neighbors

Two nearest methods are implemented: k nearest neighbors and fixed radius nearest neighbors for 'disto' handles and matrices.

# k nearest neighbors
nn(dio, k = 1)
knn <- nn(as.matrix(mtcars), k = 1) # distance method defaults to 'euclidean'
knn
str(knn) # observe the structure of the output

# frnn
nn(dio, r = 3)
nn(as.matrix(mtcars), r = 3)

Extract and replace functions for 'dist' objects

The workhorse functions for the dist class are dist_extract and dist_replace.

dist_extract(do, 1:5, 2:7)
do <- dist_replace(do, 1:3, 4:6, 101:103)
dist_extract(do, 1:3, 4:6, product = "inner")

Author: Srikanth KS, sri.teach@gmail.com

URL: https://github.com/talegari/disto

BugReports: https://github.com/talegari/disto/issues



talegari/disto documentation built on May 10, 2019, 3:19 a.m.