newsflow_compare: Create a network of document similarities over time

View source: R/newsflow.r

newsflow_compare    R Documentation

Create a network of document similarities over time

Description

This is a wrapper for the compare_documents function, specialised for the case of analyzing documents over time. The difference is that using date_var is mandatory, and the output is returned as an igraph network (using as_document_network).

Usage

newsflow_compare(
  dtm,
  dtm_y = NULL,
  date_var = "date",
  hour_window = c(-24, 24),
  group_var = NULL,
  measure = c("cosine", "overlap_pct", "overlap", "dot_product", "softcosine"),
  tf_idf = F,
  min_similarity = 0,
  n_topsim = NULL,
  only_complete_window = T,
  ...
)

Arguments

dtm

A quanteda dfm. Note that it is common to first weight the dtm(s) before calculating document similarity. For this you can use quanteda's dfm_tfidf and dfm_weight.

dtm_y

Optionally, another dtm. If given, the documents in dtm will be compared to the documents in dtm_y.

date_var

The name of the column in meta that specifies the document date. The default is "date". The values should be of type POSIXct, or coercible with as.POSIXct. The hour_window argument uses these dates to compare only documents within a time window.

hour_window

A vector of length 2, in which the first and second value determine the left and right side of the window, respectively. For example, c(-10, 36) will compare each document to all documents between the previous 10 and the next 36 hours. It is possible to specify time windows down to the level of seconds by using fractions (hours / 60 / 60).
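As a small illustration of the fractional-hour notation, the sketch below builds a window from minutes and seconds. The helpers minutes and seconds are hypothetical conveniences written for this example; they are not part of RNewsflow.

```r
# Hypothetical helpers (not part of RNewsflow): convert to hours
minutes = function(x) x / 60
seconds = function(x) x / 60 / 60

# From 30 minutes before to 90 seconds after each document
hour_window = c(-minutes(30), seconds(90))
print(hour_window)
```

The resulting vector could then be passed as the hour_window argument.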

group_var

Optionally, the name of the column in meta that specifies a group (e.g., source, sourcetype). If given, only documents within the same group will be compared.

measure

The measure that should be used to calculate similarity/distance/adjacency. Currently supports the symmetrical measure "cosine" (cosine similarity), the asymmetrical measures "overlap_pct" (percentage of term scores in the document that also occur in the other document) and "overlap" (like overlap_pct, but as the sum of overlap instead of the percentage), and the symmetrical soft cosine measure "softcosine" (experimental). The regular dot product ("dot_product") is also supported.
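To make the difference between symmetrical and asymmetrical measures concrete, here is a minimal base R sketch on two toy term-score vectors. The overlap functions follow one reading of the description above and express overlap_pct as a proportion; this is an assumption for illustration, not the package's own implementation.

```r
# Toy term-score vectors for two documents over the same five terms
x = c(2, 1, 0, 3, 0)
y = c(1, 0, 4, 3, 0)

# Cosine similarity: symmetrical, so cosine(x, y) equals cosine(y, x)
cosine = function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Overlap as read from the description (an assumption, not the package
# source): the sum of a's scores for terms that also occur in b
overlap = function(a, b) sum(a[b > 0])
overlap_pct = function(a, b) overlap(a, b) / sum(a)

cosine(x, y)       # same value in both directions
overlap_pct(x, y)  # 5/6: share of x's scores on terms found in y
overlap_pct(y, x)  # 4/8: share of y's scores on terms found in x
```

Because overlap_pct depends on which document is treated as the reference, the two directions generally give different scores.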

tf_idf

If TRUE, weight the dtm (and dtm_y) by term frequency - inverse document frequency. For more control over weighting, we recommend using quanteda's dfm_tfidf or dfm_weight on dtm and dtm_y.

min_similarity

A threshold for similarity. Lower values are deleted. For all available similarity measures, zero means no similarity.

n_topsim

An alternative or additional threshold for similarity. Only keep the [n_topsim] highest similarity scores for x. Can return more than [n_topsim] similarity scores in the case of tied (duplicate) similarities.
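The combined effect of min_similarity and n_topsim, including the tie behaviour, can be sketched in base R on a toy vector of scores. This only mimics the behaviour described above under the assumption that ties at the cutoff are all kept; it is not the package's implementation.

```r
# Toy similarity scores of one document x against five candidates
sims = c(a = 0.9, b = 0.4, c = 0.4, d = 0.1, e = 0.0)

min_similarity = 0.2
n_topsim = 2

# Threshold: drop scores below min_similarity
kept = sims[sims >= min_similarity]

# Top-n with ties: keep every score at least as high as the n-th best
cutoff = sort(kept, decreasing = TRUE)[min(n_topsim, length(kept))]
top = kept[kept >= cutoff]
print(top)
```

Here b and c are tied at the second-highest score, so three scores remain even though n_topsim is 2.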

only_complete_window

If TRUE, only compare articles (x) for which a full window of reference articles (y) is available. Thus, for the first and last [window.size] days, there will be no results for x.
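The following base R sketch shows which documents in a toy date range would have a full window. It assumes that a window counts as complete when it lies entirely within the range of dates in the data; the package may determine this differently.

```r
# Toy publication times: one document every 12 hours over three days
date = as.POSIXct("2023-01-01 00:00:00", tz = "UTC") + (0:6) * 12 * 3600
hour_window = c(-24, 24)

# Assumption: a window is "complete" when it lies entirely within the
# date range covered by the documents
window_start = date + hour_window[1] * 3600
window_end = date + hour_window[2] * 3600
complete = window_start >= min(date) & window_end <= max(date)

print(data.frame(date, complete))
```

Under this assumption, only the documents at least 24 hours away from both ends of the date range are compared.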

...

Other arguments passed to compare_documents.

Value

An igraph network.
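A returned network can be inspected with standard igraph accessors. The sketch below assumes that the RNewsflow, quanteda and igraph packages are installed and that vertices correspond to documents and edges to compared document pairs; the exact edge attributes are not listed on this page, so the code prints their names rather than assuming them.

```r
library(RNewsflow)

dtm = quanteda::dfm_tfidf(rnewsflow_dfm)
g = newsflow_compare(dtm, date_var = "date", hour_window = c(0.1, 36))

igraph::vcount(g)           # number of vertices (assumed: documents)
igraph::ecount(g)           # number of edges (assumed: document pairs)
igraph::edge_attr_names(g)  # names of the stored edge attributes

# Edge list as a data frame, one row per document pair
head(igraph::as_data_frame(g, what = "edges"))
```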

Examples

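library(RNewsflow)  # provides rnewsflow_dfm and newsflow_compare
dtm = quanteda::dfm_tfidf(rnewsflow_dfm)  # weight terms by tf-idf
# compare each document to documents published 0.1 to 36 hours later
g = newsflow_compare(dtm, date_var='date', hour_window = c(0.1, 36))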

RNewsflow documentation built on May 31, 2023, 6:53 p.m.