text_to_DTM: Convert a vector of text documents into a Document Term...
In bakaburg1/BaySREn: BaySREn. An R package to automatise citation collection and screening in Systematic Reviews. Based on Bayesian active machine learning

text_to_DTM

R Documentation

Convert a vector of text documents into a Document Term Matrix

Description

A Document Term Matrix (DTM) is a structure describing the association of terms with documents. Here the DTM is a binary matrix, with a one if a term is present in a document and a zero otherwise.
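
As a rough illustration of the binary structure described above, the following base R sketch builds a small presence/absence matrix from a toy corpus. It is not the package's implementation: the toy corpus is invented, and a plain whitespace split stands in for the cleaning, lemmatization and part-of-speech filtering that `text_to_DTM()` performs.

```r
# Toy corpus (invented for illustration). Tokenization is a plain whitespace
# split, not the package's tokenize_text().
corpus <- c(
  "bayesian active learning",
  "active screening",
  "systematic review screening"
)
tokens <- strsplit(corpus, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# One row per document, one column per term: 1 if the term is present in the
# document, 0 otherwise.
dtm <- t(vapply(
  tokens,
  function(tk) as.integer(terms %in% tk),
  integer(length(terms))
))
colnames(dtm) <- terms
dtm
```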

Usage

text_to_DTM(
  corpus,
  min.freq = 20,
  ids = 1:length(corpus),
  freq.subset.ids = ids,
  included.pos = c("Noun", "Verb", "Adjective"),
  tokenize.fun = tokenize_text,
  add.ngrams = TRUE,
  aggr.synonyms = TRUE,
  n.gram.thresh = 0.5,
  syn.thresh = 0.9,
  label = "TERM__",
  na.as.missing = TRUE
)

Arguments

`corpus`	A vector of text documents.
`min.freq`	Minimum number of documents in which a term needs to be present to be considered.
`ids`	Identifiers of the documents.
`freq.subset.ids`	IDs to consider when computing term frequency.
`included.pos`	Parts of speech (POS) to consider when building the DTM. See `lexicon::hash_grady_pos()` for a list of recognized POS.
`tokenize.fun`	Function to use to clean up text.
`add.ngrams`	Whether to search for and add non-consecutive n-grams. See `DTM.add_ngrams()`.
`aggr.synonyms`	Whether to aggregate terms which almost always appear together. See `DTM.aggr_synonyms()`.
`n.gram.thresh`	The threshold to use to identify the network of non-consecutive n-grams if `add.ngrams` is `TRUE`.
`syn.thresh`	The threshold to use to identify the network of terms to aggregate if `aggr.synonyms` is `TRUE`.
`label`	A label to prepend to term columns in the DTM.
`na.as.missing`	Whether to set the DTM cells of empty documents to `NA`. If `FALSE`, those cells are set to zero.
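
The effect of `na.as.missing` can be pictured with a small base R sketch. This is only an illustration under assumptions: the matrix and corpus are made up, and treating `NA` or blank strings as "empty" documents is a guess at what the function considers empty, not something the documentation states.

```r
# Hypothetical three-document corpus; the third document is missing.
corpus <- c("active", "review", NA)

# A hand-written binary DTM with the default "TERM__" label on the columns.
dtm <- matrix(
  c(1L, 0L, 0L,   # TERM__active
    0L, 1L, 0L),  # TERM__review
  nrow = 3,
  dimnames = list(NULL, c("TERM__active", "TERM__review"))
)

# Assumption: a document counts as empty if it is NA or only whitespace.
empty <- is.na(corpus) | !nzchar(trimws(corpus))

# Like na.as.missing = TRUE: rows of empty documents become NA.
dtm_na <- dtm
dtm_na[empty, ] <- NA

# Like na.as.missing = FALSE: rows of empty documents stay at zero.
dtm_zero <- dtm
dtm_zero[empty, ] <- 0L
```

Keeping `NA` distinguishes "no text available" from "text available but term absent", which can matter for a downstream model that handles missing values explicitly.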

Details

Before computing the DTM, document terms are cleaned, tokenized and lemmatized, and stop-words are removed.

To reduce noise, only terms that appear in at least `min.freq` documents (counted among the documents listed in `freq.subset.ids`) are considered. The function also uses cosine similarity between terms to identify subclusters of related terms (non-consecutive n-grams) and redundant terms (synonyms).
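
The two ideas in this paragraph, document-frequency filtering and cosine similarity between term columns, can be sketched in base R as follows. This is a minimal sketch on invented data, not the package's code: the variable names `freq_subset_ids` and `min_freq` simply mirror the arguments, and how `DTM.add_ngrams()` and `DTM.aggr_synonyms()` actually apply their thresholds is not shown here.

```r
# Invented binary DTM: 4 documents, 4 terms.
dtm <- rbind(
  c(1, 1, 0, 0),
  c(1, 1, 1, 1),
  c(1, 1, 1, 0),
  c(0, 0, 1, 0)
)
colnames(dtm) <- c("machine", "learning", "review", "rare")
ids <- 1:4

# Count document frequency only on a subset of documents (mirrors
# freq.subset.ids), then keep terms present in at least min_freq of them.
freq_subset_ids <- 1:3
min_freq <- 2
doc_freq <- colSums(dtm[ids %in% freq_subset_ids, , drop = FALSE])
kept <- dtm[, doc_freq >= min_freq, drop = FALSE]

# Cosine similarity between every pair of remaining term columns.
norms <- sqrt(colSums(kept^2))
cos_sim <- crossprod(kept) / outer(norms, norms)

# Pairs above a threshold are candidates for aggregation; 0.9 is the default
# of syn.thresh, used here purely as an example cut-off.
candidates <- which(cos_sim > 0.9 & upper.tri(cos_sim), arr.ind = TRUE)
```

In this toy matrix "machine" and "learning" occur in exactly the same documents, so their cosine similarity is 1 and they are the only candidate pair, while "rare" is dropped by the frequency filter.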

Value

A Document Term Matrix with a row for each document and a column for each term, plus a column with the document IDs.

Examples

## Not run: 

Records <- import_data(get_session_files("Session1")$Records)

Title_DTM <- with(
  Records,
  text_to_DTM(Title,
    min.freq = 20, label = "TITLE__", ids = ID,
    freq.subset.ids = ID[Target %in% c("y", "n")]
  )
)

## End(Not run)

bakaburg1/BaySREn documentation built on March 30, 2022, 12:16 a.m.
