Compare the documents in a dtm with a sliding window over time

Description

Given a document-term matrix (DTM) and corresponding document meta data, calculates the document similarities over time using a sliding window.

The meta data.frame should have a column containing document ids that match the rownames of the DTM (i.e. document names) and a column indicating the publication time. By default these columns should be labeled "document_id" and "date", but the column labels can also be set using the 'id.var' and 'date.var' parameters. Any other columns will automatically be included as document meta information in the output.
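As an illustration of the expected layout, the following base R sketch builds a small meta data.frame. The document names, dates and the extra "source" column are hypothetical; in practice the ids would be taken from rownames(dtm).

```r
# Hypothetical document names; in practice these would be rownames(dtm).
doc_names <- c("doc1", "doc2", "doc3")

meta <- data.frame(
  document_id = doc_names,                       # default label for 'id.var'
  date = as.POSIXct(c("2013-06-01 08:00:00",     # default label for 'date.var'
                      "2013-06-01 20:30:00",
                      "2013-06-02 09:15:00"), tz = "UTC"),
  source = c("Newspaper A", "Newspaper B", "Newspaper A"),  # extra meta column
  stringsAsFactors = FALSE
)

# Check that every document name has a matching row in the meta data.
all(doc_names %in% meta$document_id)
```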

Usage

newsflow.compare(dtm, meta, id.var = "document_id", date.var = "date",
  hour.window = c(-24, 24), measure = "cosine", min.similarity = 0,
  n.topsim = NULL, only.from = NULL, only.to = NULL,
  return.zeros = FALSE, only.complete.window = TRUE)

Arguments

dtm

A document-term matrix in the tm DocumentTermMatrix class. It is recommended to weight the DTM beforehand, for instance using weightTfIdf.

meta

A data.frame where rows are documents and columns are document meta information. It should contain at least two columns: the document name/id and the date. The name/id column should match the document names/ids of the DTM (its rownames), and its label is specified in the 'id.var' argument. The date column should be interpretable with as.POSIXct, and its label is specified in the 'date.var' argument.

id.var

The label for the document name/id column in the 'meta' data.frame. Default is "document_id".

date.var

The label for the document date column in the 'meta' data.frame. Default is "date".

hour.window

A vector of length 2, in which the first and second value determine the left and right side of the window, respectively. For example, c(-10, 36) will compare each document to all documents between the previous 10 and the next 36 hours.
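The window logic can be sketched in base R as follows. The dates and ids are made up, and treating both window boundaries as inclusive is an assumption; the function itself may handle the boundaries differently.

```r
# Hypothetical publication times for four documents.
ids <- c("a", "b", "c", "d")
dates <- as.POSIXct(c("2013-06-01 00:00:00",
                      "2013-06-01 06:00:00",
                      "2013-06-02 00:00:00",
                      "2013-06-03 00:00:00"), tz = "UTC")

hour.window <- c(-10, 36)

# Time difference in hours between document "b" (position 2) and every document.
x <- 2
hourdiff <- as.numeric(difftime(dates, dates[x], units = "hours"))

# Documents inside the window around x, excluding x itself.
# Inclusive boundaries are an assumption of this sketch.
in_window <- hourdiff >= hour.window[1] & hourdiff <= hour.window[2]
in_window[x] <- FALSE

ids[in_window]
```

Here "a" (6 hours earlier) and "c" (18 hours later) fall inside the window around "b", while "d" (42 hours later) does not.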

measure

The measure that should be used to calculate similarity/distance/adjacency. Currently supports the symmetrical measure "cosine" (cosine similarity) and the asymmetrical measure "overlap_pct" (percentage of term scores in the document that also occur in the other document).
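To make the difference between the two measures concrete, here is a minimal base R sketch on a toy matrix. The helper functions are hypothetical and not part of the package, and the overlap_pct function is one reading of the description above, not necessarily the exact computation the package performs.

```r
# Toy weighted DTM: rows are documents, columns are terms (hypothetical values).
m <- matrix(c(1, 2, 0,
              0, 2, 2),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("x", "y"), c("t1", "t2", "t3")))

# Cosine similarity: inner product divided by the product of the vector norms.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Assumed reading of "overlap_pct": the share of a's total term score that
# belongs to terms that also occur in b.
overlap_pct <- function(a, b) sum(a[b > 0]) / sum(a)

cosine(m["x", ], m["y", ])       # symmetrical: same value in both directions
overlap_pct(m["x", ], m["y", ])  # x compared to y
overlap_pct(m["y", ], m["x", ])  # y compared to x, generally a different value
```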

min.similarity

A threshold for similarity. Lower values are deleted. Set to 0 by default (see Usage).

n.topsim

An alternative or additional sort of threshold for similarity. Only keep the [n.topsim] highest similarity scores for x. Can return more than [n.topsim] similarity scores in the case of duplicate similarities.
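How the two thresholds can interact, including the effect of tied scores, is sketched below in base R. The similarity scores are invented, and the filtering code only illustrates the described behaviour; it is not the package's implementation.

```r
# Hypothetical similarity scores of one document x to five other documents.
sim <- c(a = 0.9, b = 0.4, c = 0.4, d = 0.05, e = 0.2)

min.similarity <- 0.1
n.topsim <- 2

# Drop scores below the similarity threshold.
kept <- sim[sim >= min.similarity]

# Keep the n.topsim highest scores. Scores tied with the cutoff are all kept,
# so more than n.topsim scores can remain.
cutoff <- sort(kept, decreasing = TRUE)[min(n.topsim, length(kept))]
top <- kept[kept >= cutoff]

names(top)
```

With n.topsim = 2, three scores remain ("a", "b" and "c") because "b" and "c" are tied at the cutoff value.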

only.from

A vector with names/ids of documents (dtm rownames), or a logical vector that matches the rows of the dtm. Use to compare only these documents to other documents.

only.to

A vector with names/ids of documents (dtm rownames), or a logical vector that matches the rows of the dtm. Use to compare other documents to only these documents.
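The two accepted forms of selection (a logical vector or a vector of names) can be illustrated with a small base R sketch. The document names and the "paper" variable are hypothetical, and the pair table only shows which comparisons the selection implies, ignoring the time window.

```r
# Stand-in for rownames(dtm) and a hypothetical meta variable.
doc_names <- c("doc1", "doc2", "doc3", "doc4")
paper <- c("A", "B", "A", "B")

only.from <- paper == "A"            # logical vector matching the DTM rows
only.to <- doc_names[paper == "B"]   # the same kind of selection, by name

# Document pairs implied by the selection: paper A documents compared to
# paper B documents (the time window is ignored in this sketch).
from_ids <- doc_names[only.from]
pairs <- expand.grid(from = from_ids, to = only.to, stringsAsFactors = FALSE)
pairs
```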

return.zeros

If TRUE, all comparison results are returned, including those with zero similarity (rarely useful and problematic with large data).

only.complete.window

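If TRUE, only compare articles (x) for which a full window of reference articles (y) is available. Thus, there will be no results for x published at the very start or end of the period covered by the data, where the window (see 'hour.window') extends beyond the first or last date.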
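One way to picture a "complete window" is sketched below in base R. The dates are invented, and the rule used here (the whole window must lie between the earliest and latest date in the data) is an assumption about how completeness is determined.

```r
# Hypothetical publication times for four documents.
ids <- c("a", "b", "c", "d")
dates <- as.POSIXct(c("2013-06-01 00:00:00", "2013-06-02 12:00:00",
                      "2013-06-04 00:00:00", "2013-06-05 12:00:00"), tz = "UTC")

hour.window <- c(-24, 24)

# Window boundaries for each document (POSIXct arithmetic is in seconds).
window_start <- dates + hour.window[1] * 3600
window_end <- dates + hour.window[2] * 3600

# Assumed rule: a window is complete if it lies entirely within the
# period covered by the data.
complete <- window_start >= min(dates) & window_end <= max(dates)

ids[complete]
```

In this sketch the first and last documents ("a" and "d") are dropped as x, because part of their window falls outside the data.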

Details

The calculation of document similarity is performed using a vector space model approach. Inner-product based similarity measures are used, such as cosine similarity. It is recommended to weight the DTM beforehand, for instance using term frequency-inverse document frequency (tf-idf).
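The effect of tf-idf weighting can be sketched in base R on a toy count matrix. This uses a common variant (raw term frequency times log2 of the inverse document frequency); tm::weightTfIdf may differ in details such as normalization, so treat this as an illustration only.

```r
# Toy term counts: rows are documents, columns are terms (hypothetical values).
counts <- matrix(c(2, 1, 0,
                   0, 1, 3,
                   1, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("d1", "d2", "d3"), c("t1", "t2", "t3")))

# Document frequency: in how many documents each term occurs.
df <- colSums(counts > 0)

# Inverse document frequency: terms occurring in every document get weight 0.
idf <- log2(nrow(counts) / df)

# Multiply each term column by its idf weight.
tfidf <- sweep(counts, 2, idf, `*`)
tfidf
```

The term "t2" occurs in every document and therefore no longer contributes to the inner product, while the rarer "t3" is weighted up.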

Value

A network/graph in the igraph class

Examples

data(dtm)
data(meta)

dtm = tm::weightTfIdf(dtm)
g = newsflow.compare(dtm, meta, hour.window = c(0.1, 36))

igraph::vcount(g) # number of documents, or vertices
igraph::ecount(g) # number of document pairs, or edges

head(igraph::get.data.frame(g, 'vertices'))
head(igraph::get.data.frame(g, 'edges'))