documents.compare: Compare the documents in two corpora/dtms


Description

Compare the documents in corpus dtm.x with reference corpus dtm.y.

Usage

documents.compare(dtm.x, dtm.y = NULL, measure = "cosine",
  min.similarity = 0.1, n.topsim = NULL, only.from = NULL,
  return.zeros = F)

Arguments

dtm.x

the main document-term matrix

dtm.y

the 'reference' document-term matrix. If NULL, the documents of dtm.x are compared to each other.

measure

the measure used to calculate similarity/distance/adjacency. Currently only "cosine" is supported.

min.similarity

a threshold for similarity. Pairs with a lower similarity are dropped from the result. Defaults to 0.1.

n.topsim

An alternative or additional threshold for similarity. Only the n.topsim highest similarity scores for x are kept. More than n.topsim scores can be returned when several scores are tied.

only.from

A vector of ids that match the documents (rownames) in the dtm. If given, only these documents are compared to the other documents.

return.zeros

If TRUE, all comparison results are returned, including those with zero similarity (quite possibly the worst thing to do with large data, since every compared pair then takes up a row).
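
The interplay of measure, min.similarity and return.zeros can be illustrated with a minimal base-R sketch. This is not the package's implementation: cosine_pairs is a hypothetical helper, small dense matrices stand in for real document-term matrices, the output column names (x, y, similarity) are assumptions, and the threshold is assumed to keep scores equal to min.similarity.

```r
# Illustrative sketch only: cosine_pairs is a hypothetical helper, and the
# column names x, y, similarity are assumptions, not the package's output.
cosine_pairs <- function(x, y, min.similarity = 0.1, return.zeros = FALSE) {
  # Scale each row (document) to unit length; all-zero rows stay zero.
  unit_rows <- function(m) {
    len <- sqrt(rowSums(m^2))
    len[len == 0] <- 1
    m / len
  }
  # Cosine similarity of every x document with every y document.
  sim <- unit_rows(x) %*% t(unit_rows(y))
  # Long format: one row per (x, y) pair.
  pairs <- data.frame(
    x = rownames(x)[as.vector(row(sim))],
    y = rownames(y)[as.vector(col(sim))],
    similarity = as.vector(sim),
    stringsAsFactors = FALSE
  )
  if (return.zeros) return(pairs)
  # Assumption: scores equal to the threshold are kept.
  pairs[pairs$similarity >= min.similarity, , drop = FALSE]
}

terms <- c("apple", "pear", "plum")
dtm.x <- matrix(c(1, 1, 0,
                  0, 0, 2),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("x1", "x2"), terms))
dtm.y <- matrix(c(1, 1, 0,
                  1, 0, 0,
                  0, 0, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("y1", "y2", "y3"), terms))

filtered <- cosine_pairs(dtm.x, dtm.y)                       # thresholded
everything <- cosine_pairs(dtm.x, dtm.y, return.zeros = TRUE) # all pairs
```

In this toy case the unfiltered result has one row per pair (2 x 3 = 6), while the thresholded result keeps only the three pairs that share terms. With thousands of documents on each side the unfiltered row count grows with the product of the two corpus sizes, which is why return.zeros is best left at its default.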

Value

A data frame with sets of documents and their similarities.
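
To make the shape of such a result concrete, and to show how the n.topsim tie rule described under Arguments plays out, here is a small base-R sketch. The data frame and its column names (x, y, similarity) are assumptions, keep_topsim is a hypothetical helper rather than part of the package, and "for x" is read here as "per document in dtm.x".

```r
# Hypothetical result-shaped data frame; the column names are assumptions.
pairs <- data.frame(
  x = c("x1", "x1", "x1", "x2", "x2", "x2"),
  y = c("y1", "y2", "y3", "y1", "y2", "y3"),
  similarity = c(0.9, 0.5, 0.5, 0.3, 0.8, 0.6),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the n.topsim highest scores per x document.
keep_topsim <- function(pairs, n.topsim) {
  # Rank within each x document, highest similarity first. Tied scores share
  # the lowest rank of their group ("min"), so ties at the cutoff all survive.
  rank_in_x <- ave(-pairs$similarity, pairs$x,
                   FUN = function(s) rank(s, ties.method = "min"))
  pairs[rank_in_x <= n.topsim, , drop = FALSE]
}

top2 <- keep_topsim(pairs, n.topsim = 2)
```

With n.topsim = 2, document x1 keeps three rows because its second and third scores are tied at 0.5, whereas x2 keeps exactly two.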


kasperwelbers/corpus-tools documentation built on May 20, 2019, 7:37 a.m.