tcrossprod_sparse: tcrossprod with benefits, for people that like parameters

View source: R/crossprod.r

tcrossprod_sparse    R Documentation


Description

This function (including the underlying C++ function batched_tcrossprod_cpp) is the workhorse of the RNewsflow package. It has unnervingly many arguments for a tcrossprod because it needs to do many things efficiently. While it is mostly a backend function, we expose it because it has applications outside of RNewsflow. We make no excuses, however, for the fact that readability is very much sacrificed here for the convenience of being able to keep adding the features we need for RNewsflow.

Usage

tcrossprod_sparse(
  m,
  m2 = NULL,
  min_value = NULL,
  max_value = NULL,
  only_upper = F,
  diag = T,
  top_n = NULL,
  rowsum_div = F,
  max_p = 1,
  pvalue = c("disparity", "normal", "lognormal", "nz_normal", "nz_lognormal"),
  normalize = c("none", "l2", "softl2"),
  crossfun = c("prod", "min", "softprod", "maxproduct", "lookup", "cp_lookup",
    "cp_lookup_norm"),
  group = NULL,
  group2 = NULL,
  date = NULL,
  date2 = NULL,
  lwindow = -1,
  rwindow = 1,
  date_unit = c("days", "hours", "minutes", "seconds"),
  simmat = NULL,
  simmat_thres = NULL,
  row_attr = F,
  col_attr = F,
  lag_attr = F,
  batchsize = 1000,
  verbose = F
)

Arguments

m

A CsparseMatrix

m2

A CsparseMatrix

min_value

Optionally, a numerical value, specifying the threshold for including a score in the output.

max_value

Optionally, a numerical value for the upper limit for including a score in the output.

only_upper

If true, only the upper triangle of the matrix is returned. Only possible for symmetrical output (m and m2 have the same number of columns).

diag

If false, the diagonal of the matrix is not returned. Only possible for symmetrical output (m and m2 have the same number of columns).

top_n

An integer specifying how many of the strongest similarities to keep per row. For each row in m, at most top_n scores are returned.

rowsum_div

If true, divide the crossproduct by the column sums of m. (This has to happen within the loop for min_value and top_n filtering.)

max_p

A threshold for the maximum p value.

pvalue

If max_p < 1, edges are removed based on a p value. For each document (row) in m, a p value is calculated over its outward edges. The default is a p value based on the uniform distribution, akin to a "disparity" filter (see Serrano et al., DOI: 10.1073/pnas.0808904106) but without filtering on inward edges.

normalize

Normalize rows by a given norm score (before calculating similarity). Default is "none" (no normalization). "l2" is the l2 norm (use in combination with the "prod" crossfun for cosine similarity). "softl2" is the adaptation of l2 for soft similarity (use in combination with the "softprod" crossfun for soft cosine).

crossfun

The function used in the vector operations. Normally this is "prod", for product (dot product). We also allow "min", for the minimum value, which we use in our document overlap_pct score. In addition, there is the (experimental) "softprod", which can be used in combination with "softl2" normalization to get the soft cosine similarity. "maxproduct" is a special case used in the query_lookup measure: it uses the product but only returns the score of the strongest matching term. "cp_lookup" and "cp_lookup_norm" are special cases for conditional probability sensitive lookup.

group

Optionally, a character vector that specifies a group (e.g., source) for each row in m. If given, only pairs of rows with the same group are calculated.

group2

If m2 and group are used, group2 has to be used to specify the groups for the rows in m2 (otherwise group will be ignored).

date

Optionally, a POSIXct vector (or a vector that can be converted with as.POSIXct) that specifies a date for each row in m. If given, only pairs of rows within a given date range (see lwindow, rwindow and date_unit) are calculated.

date2

If m2 and date are used, date2 has to be used to specify the dates for the rows in m2 (otherwise date will be ignored).

lwindow

If date (and date2) are used, lwindow determines the left side of the date window. For example, -10 means that rows are only matched with rows whose date is up to 10 units (see date_unit) earlier.

rwindow

Like lwindow, but for the right side. For example, an lwindow of -1 and an rwindow of 1, with date_unit "days", means that rows are only matched if their dates are within a 1 day distance of each other.

date_unit

The date unit used in lwindow and rwindow. Supports "days", "hours", "minutes" and "seconds". Note that this refers to the time distance between two rows ("days" doesn't refer to calendar days, but to periods of 24 hours).

simmat

If softcos is used (the "softprod" crossfun), a symmetric matrix of terms that indicates the similarity between terms (i.e. an adjacency matrix). If NULL, a cosine similarity matrix will be created on the fly.

simmat_thres

If softcos is used (the "softprod" crossfun), a threshold for the term similarity.

row_attr

If TRUE, add the "row_n" and "row_sum" elements to the "margin" attribute.

col_attr

Like row_attr, but adding "col_n" and "col_sum" to the "margin" attribute.

lag_attr

If TRUE, adds "lag_n" and "lag_sum" to the "margin" attribute. These are the margin scores for rows, where the date of the column is before (lag) the date of the row. Only possible if date argument is given.

batchsize

If group and/or date are used, the size of the batches.

verbose

If TRUE, report progress.

Details

Enables limiting row combinations to within specified groups and date windows, and filters out results that do not pass the threshold on the fly. To achieve this, options for similarity measures are included in the function. For example, to get the cosine similarity, you can normalize with "l2" and use the "prod" (product) function for the crossfun argument.
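The cosine similarity case can be checked with the Matrix package alone. The sketch below is not the package's implementation: it normalizes the rows of a small, made-up matrix by their l2 norm and takes the plain product, which is what the "l2" and "prod" options are described as doing. The matrix values and the variable names are illustrative assumptions.

```r
library(Matrix)

# Small, hand-made sparse matrix (illustrative values):
# row 1 = (1, 2, 0, 0), row 2 = (0, 2, 1, 0), row 3 = (0, 0, 0, 3)
m <- sparseMatrix(i = c(1, 1, 2, 2, 3),
                  j = c(1, 2, 2, 3, 4),
                  x = c(1, 2, 2, 1, 3),
                  dims = c(3, 4))

# l2-normalize each row: divide it by its Euclidean norm.
row_norm <- sqrt(rowSums(m^2))
row_norm[row_norm == 0] <- 1  # guard against division by zero for empty rows
m_l2 <- Diagonal(x = 1 / row_norm) %*% m

# The plain product of the normalized rows is the cosine similarity.
cos_sim <- as.matrix(tcrossprod(m_l2))

# Cross-check one pair against the textbook formula.
x <- c(1, 2, 0, 0)
y <- c(0, 2, 1, 0)
manual <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# Assumption: with RNewsflow loaded, this should correspond to
# tcrossprod_sparse(m, normalize = "l2", crossfun = "prod")
```

Doing the normalization inside the function, rather than beforehand, is what allows min_value and top_n to be applied to the final scores while they are being computed.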

This function is called by the document comparison functions (newsflow_compare, delete_duplicates). We only expose it here for additional flexibility, and because it could be useful outside of the purpose of this package.

The output matrix also has an attribute "margin", which contains margin scores (e.g., row_sum) if the row_attr or col_attr arguments are used. The reason for including this is that some values that are normally available in the output of a cross product are broken if certain filter options are used. If group or date is used, we don't know how many columns a row has been compared to (normally this is all columns). If a min/max or top_n filter is used, we don't know the true row sums (and thus row means).
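Why filtering breaks the row sums can be seen in a minimal sketch that uses only the Matrix package. The filter here is emulated by hand on a dense copy; the threshold of 5 and its inclusive comparison are assumptions for illustration, not a statement about how min_value is implemented.

```r
library(Matrix)

# Same kind of small, hand-made matrix (illustrative values).
m <- sparseMatrix(i = c(1, 1, 2, 2, 3),
                  j = c(1, 2, 2, 3, 4),
                  x = c(1, 2, 2, 1, 3),
                  dims = c(3, 4))

# Unfiltered cross product: every row is compared to every row.
full <- as.matrix(tcrossprod(m))
true_row_sum <- unname(rowSums(full))

# Emulate a minimum-value filter by dropping scores below a
# hypothetical threshold of 5.
filtered <- full
filtered[filtered < 5] <- 0
filtered_row_sum <- unname(rowSums(filtered))

# The row sums recovered from the filtered output no longer match
# the true ones, so they have to be recorded separately.
print(rbind(true = true_row_sum, filtered = filtered_row_sum))
```

This is the kind of information the "margin" attribute is described as preserving when row_attr or col_attr is set.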

Value

A CsparseMatrix

Examples

set.seed(1)
m = Matrix::rsparsematrix(5, 10, 0.5)
# full output, including the diagonal
tcrossprod_sparse(m, min_value = 0, only_upper = FALSE, diag = TRUE)
# drop the diagonal
tcrossprod_sparse(m, min_value = 0, only_upper = FALSE, diag = FALSE)
# only the upper triangle, without the diagonal
tcrossprod_sparse(m, min_value = 0, only_upper = TRUE, diag = FALSE)
# raise the threshold for including a score
tcrossprod_sparse(m, min_value = 0.2, only_upper = TRUE, diag = FALSE)
# return at most one score per row
tcrossprod_sparse(m, min_value = 0, only_upper = TRUE, diag = FALSE, top_n = 1)
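The date window can be illustrated in base R, without calling the package. This is a sketch under assumptions: the dates are made up, the lag is taken as the column date minus the row date, and both window bounds are treated as inclusive. None of these choices is confirmed by the documentation above.

```r
# Hypothetical dates for four rows.
date <- as.POSIXct(c("2020-01-01 12:00:00", "2020-01-02 06:00:00",
                     "2020-01-02 18:00:00", "2020-01-05 00:00:00"),
                   tz = "UTC")
lwindow <- -1
rwindow <- 1

# Pairwise time distance in "days", i.e. in periods of 24 hours
# (86400 seconds), not in calendar days.
secs <- as.numeric(date)
lag <- outer(secs, secs, function(a, b) (b - a) / 86400)

# Row pairs that fall inside the window (assumed inclusive bounds).
in_window <- lag >= lwindow & lag <= rwindow
print(in_window)
```

Rows 1 and 3 fall on consecutive calendar days but are 1.25 days apart, so they are not matched under this window, whereas rows 1 and 2 (0.75 days apart) are.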

kasperwelbers/RNewsflow documentation built on April 8, 2024, 4:39 p.m.