aggregated_distances: Aggregation distances for sparse 3d representations


Description

Distances for the situation where every entry on the secondary dimension is characterized by a numeric vector (embedding). For example, in a term-document matrix where the document is the primary dimension, each term has a numeric representation in an N-dimensional space. For user-movie ratings, the vectors for movies can represent various movie characteristics. The aggregation distances (adist for short) aggregate these vectors in various ways before or while computing distances.

Usage

adist_centroid(x, y = NULL, vecs, ptrans = NULL, strans = NULL,
  by = c("primary", "secondary", "row", "column"), pairwise = FALSE,
  precompute = !pairwise, dist_type = "cosine")

adist_semantic_min_sum(x, y = NULL, vecs, ptrans = NULL, strans = NULL,
  by = c("primary", "secondary", "row", "column"), pairwise = FALSE,
  precompute = !pairwise, dist_type = "cosine")

adist_semantic_min_max(x, y = NULL, vecs, ptrans = NULL, strans = NULL,
  by = c("primary", "secondary", "row", "column"), pairwise = FALSE,
  precompute = !pairwise, dist_type = "cosine")

adist_rwmd(x, y = NULL, vecs, ptrans = NULL, strans = NULL,
  by = c("primary", "secondary", "row", "column"), pairwise = FALSE,
  precompute = !pairwise, dist_type = "cosine")

Arguments

x

sparse or dense objects supported by simdist. See simdist-package for description of supported types and related terminology.

y

sparse or dense objects supported by simdist. See simdist-package for description of supported types and related terminology.

vecs

Dense matrix with columns

ptrans

Primary and secondary transformations. Can be a function, a string or a numeric vector. When a function, it must take 3 arguments: an object supported by simdist distance measures, the name of a dimension ("primary" or "secondary") and a logical na.rm. When a string, it must name a transformation function without the semantic prefix (norm_, scale_ or trans_); see transformations. When a numeric vector, it specifies weights to scale by along the corresponding dimension.

strans

Primary and secondary transformations. Can be a function, a string or a numeric vector. When a function, it must take 3 arguments: an object supported by simdist distance measures, the name of a dimension ("primary" or "secondary") and a logical na.rm. When a string, it must name a transformation function without the semantic prefix (norm_, scale_ or trans_); see transformations. When a numeric vector, it specifies weights to scale by along the corresponding dimension.
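As an illustration of the function form, here is a minimal base R sketch of a transformation with the three-argument shape described above. The name l1_by_dim is hypothetical, and several points are assumptions not confirmed by this page: that the object is a dense base R matrix, that "primary" maps to columns, and that the function is expected to return the transformed object.

```r
# Hypothetical transformation with the (object, dimension, na.rm) shape.
# Assumptions: dense base R matrix, "primary" = columns,
# return value = transformed matrix.
l1_by_dim <- function(x, dimension = c("primary", "secondary"), na.rm = FALSE) {
  dimension <- match.arg(dimension)
  margin <- if (dimension == "primary") 2 else 1
  # L1 norm of each column (primary) or row (secondary)
  totals <- apply(abs(x), margin, sum, na.rm = na.rm)
  # Leave all-zero rows/columns unchanged instead of dividing by zero
  totals[!is.na(totals) & totals == 0] <- 1
  sweep(x, margin, totals, "/")
}

x <- matrix(c(1, 3, 0,
              2, 2, 4), nrow = 3)
by_primary   <- l1_by_dim(x, "primary")    # each column sums to 1
by_secondary <- l1_by_dim(x, "secondary")  # each row sums to 1
```

Whether simdist accepts exactly this function should be checked against the package's transformations help page.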

by

Dimension along which to perform the distance computation. For all supported data structures, computation along the primary dimension is at least as efficient as along the secondary dimension.

precompute

Logical. Whether to optimize the computation for speed by precomputing individual distances. The optimization is method specific but generally should be TRUE (the default) unless memory usage is a concern.

dist_type

distance to use across individual vectors in vecs

Details

centroid

Within each primary entry (document, user etc.) the vectors of secondary entries (terms, movies etc.) are averaged element-wise, and dist_type is applied to the resulting vectors.
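The centroid idea can be sketched in base R without calling simdist. The toy data and the helper cosine_dist are hypothetical. The sketch assumes that x holds terms in rows and documents in columns, that vecs has one column per term, and that the average is weighted by the entries of x; none of these is confirmed by this page.

```r
# Toy data (hypothetical): 3 terms, 2 documents, 2-dimensional embeddings.
x <- matrix(c(1, 1, 0,    # document A uses t1 and t2
              0, 1, 1),   # document B uses t2 and t3
            nrow = 3,
            dimnames = list(c("t1", "t2", "t3"), c("A", "B")))
vecs <- matrix(c(1, 0,    # t1
                 0, 1,    # t2
                 1, 1),   # t3
               nrow = 2,
               dimnames = list(NULL, c("t1", "t2", "t3")))

# Weighted element-wise average of term vectors within each document
centroids <- sweep(vecs %*% x, 2, colSums(x), "/")

# Cosine distance between two vectors
cosine_dist <- function(a, b) 1 - sum(a * b) / sqrt(sum(a^2) * sum(b^2))

d_AB <- cosine_dist(centroids[, "A"], centroids[, "B"])
```

Here the centroids are (0.5, 0.5) for A and (0.5, 1) for B, so d_AB equals 1 - 3/sqrt(10).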

semantic_min_sum

Measure of semantic distance proposed in [1]. In a nutshell, to compute the semantic distance between documents A (a column in x) and B (a column in y), first the minimal distance from each term a in A to the terms in B is computed with the dist_type distance. Then these values are summed using the corresponding weights (entries of the x matrix). The same procedure is applied to the terms in B, and the two resulting values are summed:

DIST(A, B) = \sum_a x_{A,a} \min_b D(a,b) + \sum_b x_{B,b} \min_a D(b,a)

Note that in [1] the authors weight each term by normalized IDF weight. The formulation in this package is more general. You can achieve their formula by applying "idf" strans and "l1" ptrans transformations. See examples.
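The formula for DIST(A, B) can be traced step by step in base R. This is a sketch under stated assumptions, not the package's implementation: the toy data and the helper min_sum are hypothetical, x is assumed to hold terms in rows and documents in columns, vecs is assumed to have one column per term, and only terms with a positive weight are treated as belonging to a document.

```r
# Toy data (hypothetical): 3 terms, 2 documents, 2-dimensional embeddings.
x <- matrix(c(1, 1, 0,    # document A uses t1 and t2
              0, 1, 1),   # document B uses t2 and t3
            nrow = 3,
            dimnames = list(c("t1", "t2", "t3"), c("A", "B")))
vecs <- matrix(c(1, 0,  0, 1,  1, 1), nrow = 2,
               dimnames = list(NULL, c("t1", "t2", "t3")))

# Pairwise cosine distances D(a, b) between term vectors
unit <- sweep(vecs, 2, sqrt(colSums(vecs^2)), "/")
D <- 1 - crossprod(unit)

# sum_a x_{A,a} min_b D(a,b) + sum_b x_{B,b} min_a D(b,a)
min_sum <- function(wa, wb, D) {
  a <- which(wa > 0)                 # terms present in A
  b <- which(wb > 0)                 # terms present in B
  Dab <- D[a, b, drop = FALSE]
  sum(wa[a] * apply(Dab, 1, min)) +  # A -> B direction
    sum(wb[b] * apply(Dab, 2, min))  # B -> A direction
}

d_AB <- min_sum(x[, "A"], x[, "B"], D)
```

In this example t2 is shared by both documents and contributes 0, while t1 and t3 each contribute their distance to the nearest term on the other side, 1 - 1/sqrt(2), so d_AB equals 2 - sqrt(2).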

semantic_min_max

Measure of semantic similarity proposed in [2]. The authors used the name "Relaxed Word Mover Distance" to emphasize that the measure is a lower bound of the well-known "Earth Mover Distance" transportation problem. The metric is a variation of semantic_min_sum where, in the last step, the maximum of the two directional values is taken instead of their sum.
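Read literally, the last step becomes DIST(A, B) = \max(\sum_a x_{A,a} \min_b D(a,b), \sum_b x_{B,b} \min_a D(b,a)). The base R sketch below illustrates that reading. The toy data and the helper min_max are hypothetical, the layout of x (terms in rows, documents in columns) and vecs (one column per term) is assumed, and L1-normalising each document's weights beforehand is an assumption made here so that the two directions are comparable.

```r
# Toy data (hypothetical): document A uses only t1, document B uses t2 and t3.
x <- matrix(c(1, 0, 0,
              0, 1, 1),
            nrow = 3,
            dimnames = list(c("t1", "t2", "t3"), c("A", "B")))
vecs <- matrix(c(1, 0,  0, 1,  1, 1), nrow = 2,
               dimnames = list(NULL, c("t1", "t2", "t3")))

# Pairwise cosine distances between term vectors
unit <- sweep(vecs, 2, sqrt(colSums(vecs^2)), "/")
D <- 1 - crossprod(unit)

# L1-normalise the weights of each document (assumption, see lead-in)
w <- sweep(x, 2, colSums(x), "/")

# Maximum of the two directional weighted sums of minimal distances
min_max <- function(wa, wb, D) {
  a <- which(wa > 0)
  b <- which(wb > 0)
  Dab <- D[a, b, drop = FALSE]
  a_to_b <- sum(wa[a] * apply(Dab, 1, min))
  b_to_a <- sum(wb[b] * apply(Dab, 2, min))
  max(a_to_b, b_to_a)
}

d_AB <- min_max(w[, "A"], w[, "B"], D)
```

The two directions differ here: A to B gives 1 - 1/sqrt(2), while B to A gives 1 - 1/(2 * sqrt(2)), and the larger of the two is returned.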

adist_rwmd

Relaxed Word Mover Distance - same as adist_semantic_min_max.

Value

A matrix of distances. If y = NULL, distances are computed between the entries of x itself.

References

[1] Mihalcea, Rada, Courtney Corley, and Carlo Strapparava. ‘Corpus-Based and Knowledge-Based Measures of Text Semantic Similarity’. In AAAI, 6:775–80, 2006.

[2] Ye, Xin, Hui Shen, Xiao Ma, Razvan Bunescu, and Chang Liu. ‘From Word Embeddings to Document Similarities for Improved Information Retrieval in Software Engineering’. In Proceedings of the 38th International Conference on Software Engineering, 404–415. ICSE ’16. New York, NY, USA: ACM, 2016. doi:10.1145/2884781.2884862.


vspinu/simdist documentation built on May 3, 2019, 7:09 p.m.