simdist-package: High performance computation of similarities and distances...
In vspinu/simdist: High performance distance and similarity functions

Description

The package contains efficient parallel functions for computation of similarity and distance metrics on various sparse and dense representations. Canonical applications of these functions are natural language processing and recommender systems.

Details

The simdist package uses a higher-level abstraction for 2D sparse representations than standard sparse-matrix software. For every supported 2D representation, a primary and a secondary dimension of variation are defined, and every function in the package acts along either the primary or the secondary dimension. The main reason for this division is computational: computing along the primary dimension is usually more efficient than computing along the secondary dimension. Even for dense matrices, the "mental model" used in the package is that of nested lists: higher-order groupings (e.g. documents) are stored as entries along the primary dimension, and their inner elements (e.g. terms) are stored along the secondary dimension.
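The nested-list mental model can be illustrated with a small base R sketch. The helper `as_nested` and the toy term-document matrix are hypothetical and are not part of simdist; they only show how a dense matrix with columns as the primary dimension maps onto a two-level list.

```r
# Toy dense matrix: columns are documents (primary dimension),
# rows are terms (secondary dimension). matrix() fills column by column.
m <- matrix(c(1, 0, 2,
              0, 3, 0),
            nrow = 3,
            dimnames = list(c("apple", "pear", "plum"), c("doc1", "doc2")))

# Hypothetical helper (not a simdist function): view the matrix as a
# two-level nested list, keeping only non-zero entries as a sparse
# representation would.
as_nested <- function(x) {
  out <- lapply(seq_len(ncol(x)), function(j) {
    col <- x[, j]        # named numeric vector, names are the row names
    col[col != 0]        # drop zeros, keep names
  })
  setNames(out, colnames(x))
}

nested <- as_nested(m)
str(nested)
```

Each outer element corresponds to one entry along the primary dimension (a document), and each inner named vector holds that entry's values along the secondary dimension (its terms).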

Supported representations

The supported 2D representations and their primary and secondary dimensions are as follows:

matrix: primary - columns, secondary - rows
Matrix::dgCMatrix: primary - columns, secondary - rows
Matrix::dgRMatrix: primary - rows, secondary - columns
Matrix::dgTMatrix: primary - rows, secondary - columns
slam::simple_triplet_matrix: primary - rows, secondary - columns (not yet supported)
data frames in primary-secondary-value (psv) format: primary - first id column, secondary - second id column. The id and value columns can be marked explicitly with the psv function.
list of named numeric or character vectors: primary - first list level, secondary - inner vector level (not yet implemented)

To minimize the risk of logical errors due to mismatched dimensions, only distances between objects of the same type are currently implemented.
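As a rough illustration of how the same data look in two of the layouts listed above, the following base R sketch flattens a dense matrix (primary: columns) into a primary-secondary-value data frame. The helper `to_psv` and the column names `primary`, `secondary` and `value` are assumptions made for this example; simdist's own psv function is not called here because its signature is not described in this text.

```r
# Same toy matrix layout: columns are primary, rows are secondary.
m <- matrix(c(1, 0, 2,
              0, 3, 0),
            nrow = 3,
            dimnames = list(c("apple", "pear", "plum"), c("doc1", "doc2")))

# Hypothetical helper (not a simdist function): one row per non-zero
# cell, with the primary id first, the secondary id second, then the value.
to_psv <- function(x) {
  nz <- which(x != 0, arr.ind = TRUE)                      # columns "row", "col"
  nz <- nz[order(nz[, "col"], nz[, "row"]), , drop = FALSE] # group by primary
  data.frame(
    primary   = colnames(x)[nz[, "col"]],
    secondary = rownames(x)[nz[, "row"]],
    value     = x[nz],                                      # index by (row, col) pairs
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

psv_df <- to_psv(m)
print(psv_df)
```

Both objects carry the same information, but they are different types, which is why mixing them in a single distance call is a plausible source of dimension mix-ups.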

Non-conventional approach

The primary/secondary dimension distinction allows all representations to be treated as two-level nested lists.
For named matrices, entries along the secondary dimension are matched by name, not by position. This means that even for matrices the sizes of the secondary dimension need not match. All rows in X that are not in Y are treated as missing (i.e. 0) in Y, as if Y were a sparse matrix.
No normalization is applied by default. All sim and dist functions accept normalization or scaling functions (transformers), which allow arbitrary transformations of the input matrices.
The cosine similarity of a vector X with the zero vector is 0, in contrast to the proxy package, where it is 1. This preserves coordinate-wise continuity at 0 and allows for a more efficient implementation.
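The conventions above can be sketched in base R for a single pair of named vectors. This is a minimal illustration and not simdist's implementation: the function `cosine_by_name` and its `transform` argument are hypothetical, with `transform` standing in loosely for the idea of a transformer applied to the inputs.

```r
# Hypothetical helper (not a simdist function): cosine similarity of two
# named numeric vectors, matched by name rather than by position.
cosine_by_name <- function(x, y, transform = identity) {
  keys <- union(names(x), names(y))

  # Align both vectors on the union of names; absent names count as 0.
  xa <- setNames(numeric(length(keys)), keys)
  ya <- xa
  xa[names(x)] <- x
  ya[names(y)] <- y

  # No normalization unless the caller supplies a transformer.
  xa <- transform(xa)
  ya <- transform(ya)

  denom <- sqrt(sum(xa^2)) * sqrt(sum(ya^2))
  if (denom == 0) return(0)  # convention from the text: similarity with a zero vector is 0
  sum(xa * ya) / denom
}

x    <- c(apple = 1, plum = 2)
y    <- c(plum = 2, kiwi = 5)   # different length and different names
zero <- c(apple = 0, plum = 0)

sim_xy    <- cosine_by_name(x, y)     # only "plum" is shared: 4 / sqrt(5 * 29)
sim_xzero <- cosine_by_name(x, zero)  # 0 by convention
```

Here `x` and `y` differ in length and share only one name, yet the similarity is well defined because unmatched names contribute zeros. Returning 0 for a zero vector also avoids a division by zero without any special handling by the caller.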