documents.window.compare: Compare the documents in a dtm per time frame

Description

Compare all documents within a document-term matrix that are dated (e.g., published) within a given time window (window.size, in days by default) of each other.

Usage

documents.window.compare(dtm, document.date, window.size = 3,
  time.unit = "days", window.direction = "<=>", measure = "cosine",
  min.similarity = NULL, n.topsim = NULL, only.from = NULL,
  return.date = F, return.datedif = T, return.zeros = F,
  only.complete.window = F)

Arguments

dtm

a document-term matrix in the tm format

document.date

a vector of date class, of the same length and order as the documents (rows) of the dtm.

window.size

The time frame, in units of time.unit (days by default), within which articles must occur in order to be compared. For example, if 0, articles are only compared to articles of the same day. If 1, articles are compared to all articles of the previous, same, or next day.
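To make the window concrete, here is a minimal base R sketch of which document pairs fall within a window of one day. It does not call the package; the example dates are made up, and excluding self-comparisons is an assumption for illustration rather than documented behaviour.

```r
# Hypothetical example data: five documents with publication dates.
dates <- as.Date(c("2019-01-01", "2019-01-01", "2019-01-02",
                   "2019-01-04", "2019-01-08"))
window.size <- 1  # in days

# Pairwise date difference in days: datedif[i, j] = date of j minus date of i.
d <- as.numeric(dates)
datedif <- outer(d, d, function(x, y) y - x)

# Pairs whose dates lie within window.size days of each other.
in.window <- abs(datedif) <= window.size
diag(in.window) <- FALSE  # assumption: a document is not compared to itself

idx <- which(in.window, arr.ind = TRUE)
pairs <- data.frame(x = idx[, "row"], y = idx[, "col"])
pairs
```

Documents 1 to 3 are paired with each other, while documents 4 and 5 have no partner within one day and so do not appear.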

time.unit

a string indicating what time unit to use. Can be 'mins','hours','days','months' or 'years'.

window.direction

For a more specific selection of which articles in the window to compare to. This is given as a combination of the symbols '<' (before x), '=' (simultaneous with x) and '>' (after x). The default is '<=>', which means all articles. '<>' means all articles before or after the [time.unit] of the article itself, '<' means all previous articles, '<=' means all previous and simultaneous articles, and so on.
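The symbol combinations can be read as a filter on the time difference between y and x. The sketch below uses a hypothetical helper, direction.filter, which is not part of the package; it assumes the difference is expressed in whole time units as (date of y) minus (date of x).

```r
# Hypothetical helper for illustration only; not a function of the package.
# datedif: time of y minus time of x, in whole time units.
direction.filter <- function(datedif, window.direction = "<=>") {
  symbols <- strsplit(window.direction, "")[[1]]
  keep <- rep(FALSE, length(datedif))
  if ("<" %in% symbols) keep <- keep | datedif < 0   # y before x
  if ("=" %in% symbols) keep <- keep | datedif == 0  # y simultaneous with x
  if (">" %in% symbols) keep <- keep | datedif > 0   # y after x
  keep
}

datedif <- c(-2, -1, 0, 1, 2)
direction.filter(datedif, "<=")  # TRUE TRUE TRUE FALSE FALSE
direction.filter(datedif, "<>")  # TRUE TRUE FALSE TRUE TRUE
```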

measure

The measure that should be used to calculate similarity/distance/adjacency. Currently, only cosine is supported.
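For reference, cosine similarity between the rows of a small dense matrix can be sketched in base R as follows. The toy matrix and document names are made up, and this is an illustration of the measure, not the package's own implementation (which may work on sparse matrices).

```r
# Toy document-term counts (rows are documents, columns are terms).
m <- matrix(c(1, 1, 0,
              2, 2, 0,
              0, 0, 3),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("doc1", "doc2", "doc3"), c("a", "b", "c")))

# Cosine similarity: dot product of two rows divided by the product of their lengths.
norms <- sqrt(rowSums(m^2))
cosine <- tcrossprod(m) / outer(norms, norms)
round(cosine, 2)
```

Here doc1 and doc2 use the same terms in the same proportions, so their similarity is 1, while doc3 shares no terms with either and scores 0.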

min.similarity

A threshold for similarity. Values below the threshold are deleted.

n.topsim

An alternative or additional sort of threshold for similarity. Only keep the [n.topsim] highest similarities for x.
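The combined effect of min.similarity and n.topsim can be sketched on a hand-made result table in base R. The data are invented, and both the treatment of values exactly at the threshold and the order in which the two filters are applied are assumptions here.

```r
# Hypothetical comparison results in the x, y, similarity layout.
res <- data.frame(x = c("a", "a", "a", "b", "b"),
                  y = c("b", "c", "d", "a", "c"),
                  similarity = c(0.9, 0.2, 0.5, 0.9, 0.1),
                  stringsAsFactors = FALSE)
min.similarity <- 0.15
n.topsim <- 2

# Drop similarities below the threshold (assumption: values equal to it are kept).
res <- res[res$similarity >= min.similarity, ]

# For each x, keep only the n.topsim highest similarities.
res <- res[order(res$x, -res$similarity), ]
rank.in.x <- ave(res$similarity, res$x, FUN = seq_along)
res <- res[rank.in.x <= n.topsim, ]
res
```

With these settings, three rows remain: a with b, a with d, and b with a.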

only.from

A vector of ids that match the documents (rownames) in dtm. Use to compare only these documents to other documents.

return.date

If true, the dates for x and y are given in the output

return.zeros

If true, all comparison results are returned, including those with zero similarity (quite possibly the worst thing to do with large data)

only.complete.window

If TRUE, only compare articles (x) for which a full window of reference articles (y) is available. Thus, for the first and last [window.size] days, there will be no results for x.
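As a rough illustration of which documents would still act as x, the base R sketch below marks the dates that have a full window on both sides. It assumes a time unit of days and a two-sided window; the dates are invented and the package's exact boundary handling may differ.

```r
# Hypothetical data: one document per day for ten days.
dates <- as.Date("2019-01-01") + 0:9
window.size <- 3

# A document has a complete window if window.size days exist both before and after it.
complete <- dates >= min(dates) + window.size &
            dates <= max(dates) - window.size
dates[complete]
```

Only the documents dated 4 to 7 January qualify; the first and last three days are left out as x.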

get.overlap.terms

Add the overlapping terms of documents to the output.

Value

A data frame with columns x, y and similarity. If return.date is TRUE, date.x and date.y are returned as well.


kasperwelbers/corpus-tools documentation built on May 20, 2019, 7:37 a.m.