simdist-package: High performance computation of similarities and distances...


Description

The package contains efficient parallel functions for computing similarity and distance metrics on various sparse and dense representations. Typical applications of these functions are natural language processing and recommender systems.

Details

The simdist package uses a higher-level abstraction for 2d sparse representations than standard sparse matrix software. For every supported 2d representation, a primary and a secondary dimension of variation of the measurement are defined, and every function in the package acts along either the primary or the secondary dimension. The main reason for the primary/secondary division is computational: computing along the primary dimension is usually more efficient than along the secondary dimension. Even for dense matrices, the "mental model" used in the package is that of nested lists: the higher-order grouping (e.g. documents) is stored as entries along the primary dimension, and the inner elements (e.g. terms) are stored along the secondary dimension.
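
The nested-list mental model can be illustrated with a minimal base R sketch. It does not call any simdist function; the toy matrix, the document and term names, and the cosine computation are assumptions chosen only to show what "primary = columns, secondary = rows" means for a dense matrix.

```r
# Toy dense matrix: columns (primary) are documents, rows (secondary) are terms.
# matrix() fills column by column, so doc1 = (1, 0, 2) and doc2 = (0, 3, 1).
m <- matrix(c(1, 0, 2,
              0, 3, 1),
            nrow = 3,
            dimnames = list(c("apple", "pear", "plum"), c("doc1", "doc2")))

# Nested-list view: one named vector per primary entry (document),
# holding only the non-zero secondary entries (terms).
nested <- lapply(seq_len(ncol(m)), function(j) {
  col <- m[, j]
  col[col != 0]
})
names(nested) <- colnames(m)

# Cosine similarity along the primary dimension (between columns).
norms <- sqrt(colSums(m^2))
cos_sim <- crossprod(m) / outer(norms, norms)
```

Here `nested$doc1` holds the terms `apple` and `plum`, and `cos_sim["doc1", "doc2"]` is the similarity between the two documents. Comparing terms instead would mean working along the secondary dimension, for example on `t(m)`.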

Supported representations

The supported 2d representations and primary-secondary dimensions are as follows:

matrix

primary - columns, secondary - rows

Matrix::dgCMatrix

primary - columns, secondary - rows

Matrix::dgRMatrix

primary - rows, secondary - columns

Matrix::dgTMatrix

primary - rows, secondary - columns

slam::simple_triplet_matrix

primary - rows, secondary - columns (not yet supported)

data frames in primary-secondary-value (psv) format

primary - first id column, secondary - second id column. Id and value columns can be explicitly marked with the psv function.

list of named numeric or character vectors

primary - first list level, secondary - inner vector level (not yet implemented)

To minimize the risk of logical errors due to mismatched dimensions, only distances between objects of the same type are currently implemented.
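
As a rough illustration of how the same data looks in several of the representations listed above, the sketch below builds a long-format data frame and converts it with the Matrix package. The column names `doc`, `term` and `value` are hypothetical, and the psv function is deliberately not called because its signature is not shown here; only base R and Matrix functions are used.

```r
library(Matrix)

# Long-format table in primary-secondary-value order:
# first id column (doc) is primary, second id column (term) is secondary.
psv_df <- data.frame(
  doc   = c("doc1", "doc1", "doc2", "doc2"),
  term  = c("apple", "plum", "pear", "plum"),
  value = c(1, 2, 3, 1),
  stringsAsFactors = FALSE
)

docs  <- unique(psv_df$doc)
terms <- sort(unique(psv_df$term))

# dgCMatrix: documents in columns (primary), terms in rows (secondary).
csc <- sparseMatrix(
  i = match(psv_df$term, terms),
  j = match(psv_df$doc, docs),
  x = psv_df$value,
  dims = c(length(terms), length(docs)),
  dimnames = list(terms, docs)
)

# dgRMatrix: the transposed layout, documents in rows (primary).
csr <- as(t(csc), "RsparseMatrix")
```

Both objects hold the same measurements. What differs is which matrix dimension plays the primary role, which is why mixing object types in one distance computation invites the mismatched-dimension errors mentioned above.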

Non-conventional approach


vspinu/simdist documentation built on May 3, 2019, 7:09 p.m.