lma_simets: Similarity Calculations


lma_simets    R Documentation

Similarity Calculations

Description

Enter a numerical matrix, set of vectors, or set of matrices to calculate similarity per vector.

Usage

lma_simets(a, b = NULL, metric = NULL, group = NULL, lag = 0,
  agg = TRUE, agg.mean = TRUE, pairwise = TRUE, symmetrical = FALSE,
  mean = FALSE, return.list = FALSE)

Arguments

a

A vector or matrix. If a vector, b must also be provided. If a matrix and b is missing, each row will be compared. If a matrix and b is not missing, each row will be compared with b or each row of b.

b

A vector or matrix to be compared with a or rows of a.

metric

A character or vector of characters at least partially matching one of the available metric names (or 'all' to explicitly include all metrics), or a number or vector of numbers indicating the metric by index:

  • jaccard: sum(a & b) / sum(a | b)

  • euclidean: 1 / (1 + sqrt(sum((a - b) ^ 2)))

  • canberra: mean(1 - abs(a - b) / (a + b))

  • cosine: sum(a * b) / sqrt(sum(a ^ 2) * sum(b ^ 2))

  • pearson: (mean(a * b) - (mean(a) * mean(b))) /
    sqrt(mean(a ^ 2) - mean(a) ^ 2) / sqrt(mean(b ^ 2) - mean(b) ^ 2)
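
The formulas in this list can be transcribed almost directly into base R. The following is a minimal sketch for two plain numeric vectors, not the package's compiled implementation; the sim_* function names are hypothetical and exist only for this illustration.

```r
# Base R transcriptions of the listed formulas (illustrative only).
sim_jaccard <- function(a, b) sum(a & b) / sum(a | b)
sim_euclidean <- function(a, b) 1 / (1 + sqrt(sum((a - b)^2)))
# Note: positions where a + b == 0 give NaN in this direct transcription.
sim_canberra <- function(a, b) mean(1 - abs(a - b) / (a + b))
sim_cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
sim_pearson <- function(a, b) {
  (mean(a * b) - mean(a) * mean(b)) /
    sqrt(mean(a^2) - mean(a)^2) / sqrt(mean(b^2) - mean(b)^2)
}

# Example vectors chosen so that a + b is never 0.
a <- c(1, 0, 2, 1)
b <- c(1, 1, 0, 1)

sims <- c(
  jaccard = sim_jaccard(a, b),
  euclidean = sim_euclidean(a, b),
  canberra = sim_canberra(a, b),
  cosine = sim_cosine(a, b),
  pearson = sim_pearson(a, b)
)
print(round(sims, 4))
```

The pearson formula written this way agrees with base R's cor(a, b), which gives a quick sanity check on the transcription.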

group

A vector of group labels with one entry per row of a (as in the examples). If b is missing and a has multiple rows, it is used to make comparisons between rows of a, as modified by agg and agg.mean.

lag

Amount to adjust the b index; either rows if b has multiple rows (e.g., for lag = 1, a[1, ] is compared with b[2, ]), or values otherwise (e.g., for lag = 1, a[1] is compared with b[2]). If b is not supplied, b is a copy of a, resulting in lagged self-comparisons or autocorrelations.
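
To make the index shift concrete, here is a rough base R sketch of a lagged row comparison using the cosine formula. The helper lagged_cosine is hypothetical, handles only non-negative lags, and simply drops rows that have no partner; how lma_simets itself treats those end rows is not covered here.

```r
cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Hypothetical helper: compare a[i, ] with b[i + lag, ] for a non-negative lag.
# Assumption: rows without a partner at the end are dropped.
lagged_cosine <- function(a, b = a, lag = 0) {
  i <- seq_len(nrow(a) - lag)
  vapply(i, function(r) cosine(a[r, ], b[r + lag, ]), numeric(1))
}

m <- rbind(c(1, 0, 1), c(1, 1, 0), c(0, 1, 1), c(1, 0, 1))

# With b omitted, b defaults to a copy of m: row 1 vs row 2, row 2 vs row 3, ...
lag1 <- lagged_cosine(m, lag = 1)
print(lag1)
```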

agg

Logical: if FALSE, only the boundary rows between groups will be compared (see the Examples section).

agg.mean

Logical: if FALSE, aggregated rows are summed instead of averaged.
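
The aggregation step that agg and agg.mean describe can be pictured with a small base R sketch. It assumes that "aggregated rows" means each run of consecutive rows sharing a group label is collapsed to one row; aggregate_runs is a hypothetical helper, not a lingmatch function.

```r
# Hypothetical helper: collapse each run of consecutive same-group rows.
aggregate_runs <- function(m, group, agg.mean = TRUE) {
  runs <- rle(as.character(group))
  run_id <- rep(seq_along(runs$lengths), runs$lengths)
  out <- rowsum(m, run_id) # sum the rows within each run
  # Dividing by the run lengths turns each row's sum into a mean.
  if (agg.mean) out <- out / runs$lengths
  rownames(out) <- runs$values
  out
}

m <- rbind(c(1, 1, 0), c(1, 0, 1), c(0, 1, 1), c(0, 0, 2))
speaker <- c("A", "A", "B", "B")

averaged <- aggregate_runs(m, speaker)                 # like agg.mean = TRUE
summed <- aggregate_runs(m, speaker, agg.mean = FALSE) # like agg.mean = FALSE
print(averaged)
print(summed)
```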

pairwise

Logical: if FALSE and a and b are matrices with the same number of rows, only paired rows are compared. Otherwise (and if only a is supplied), all pairwise comparisons are made.
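
The difference between paired and all-pairwise comparisons can be sketched in base R as follows. This is only an illustration with the cosine formula and made-up matrices; the variable names are assumptions and the actual return types of lma_simets are described under Value.

```r
cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

a <- rbind(c(1, 0, 1), c(0, 1, 1))
b <- rbind(c(1, 1, 0), c(0, 1, 1))

# Paired rows only (the pairwise = FALSE idea): a[i, ] with b[i, ].
paired <- vapply(
  seq_len(nrow(a)), function(i) cosine(a[i, ], b[i, ]), numeric(1)
)

# All pairs (the pairwise = TRUE idea): element [i, j] compares a[i, ] with b[j, ].
all_pairs <- outer(
  seq_len(nrow(a)), seq_len(nrow(b)),
  Vectorize(function(i, j) cosine(a[i, ], b[j, ]))
)

print(paired)
print(all_pairs)
```

In this sketch the paired results are just the diagonal of the all-pairs matrix.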

symmetrical

Logical: if TRUE and pairwise comparisons between a rows were made, the results in the lower triangle are copied to the upper triangle.
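
As an illustration of what copying the lower triangle to the upper triangle means, here is a base R sketch on an ordinary dense matrix with made-up values. The function itself returns sparse matrices, so this shows the idea rather than the internal mechanics.

```r
# A 3 x 3 similarity matrix with only the lower triangle and diagonal filled.
s <- matrix(0, 3, 3)
s[lower.tri(s)] <- c(0.5, 0.2, 0.7) # fills [2,1], [3,1], [3,2]
diag(s) <- 1

# Mirror the lower triangle into the upper triangle.
full <- s
full[upper.tri(full)] <- t(s)[upper.tri(s)]

print(s)
print(full)
```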

mean

Logical: if TRUE, a single mean for each metric is returned per row of a.

return.list

Logical: if TRUE, a list-like object will always be returned, with an entry for each metric, even when only one metric is requested.

Details

Use setThreadOptions to change parallelization options; e.g., run RcppParallel::setThreadOptions(4) before a call to lma_simets to set the number of CPU threads to 4.

Value

Output varies based on the dimensions of a and b:

  • Out: A vector with a value per metric.
    In: Only when a and b are both vectors.

  • Out: A vector with a value per row.
    In: Any time a single value is expected per row and only one metric is requested: a or b is a vector; a and b are matrices with the same number of rows and pairwise = FALSE; a group is specified; or mean = TRUE.

  • Out: A data.frame with a column per metric.
    In: When multiple metrics are requested in the previous case.

  • Out: A sparse matrix with a metric attribute with the metric name.
    In: Pairwise comparisons within an a matrix or between an a and b matrix, when only 1 metric is requested.

  • Out: A list with a sparse matrix per metric.
    In: When multiple metrics are requested in the previous case.

Examples

library(lingmatch)

text <- c(
  "words of speaker A", "more words from speaker A",
  "words from speaker B", "more words from speaker B"
)
(dtm <- lma_dtm(text))

# compare each entry
lma_simets(dtm)

# compare each entry with the mean of all entries
lma_simets(dtm, colMeans(dtm))

# compare by group (corresponding to speakers and turns in this case)
speaker <- c("A", "A", "B", "B")

## by default, consecutive rows from the same group are averaged:
lma_simets(dtm, group = speaker)

## with agg = FALSE, only the rows at the boundary between
## groups (rows 2 and 3 in this case) are used:
lma_simets(dtm, group = speaker, agg = FALSE)

lingmatch documentation built on Aug. 29, 2023, 1:09 a.m.