docTermMatrix: Generate a document-term matrix
In koRpus: Text Analysis with Emphasis on POS Tagging, Readability, and Lexical Diversity


Returns a sparse document-term matrix calculated from a given TIF[1] compliant token data frame or an object of class kRp.text. Optionally, the term frequency-inverse document frequency (tf-idf) value can be calculated for each term instead of its absolute frequency.

docTermMatrix(obj, terms = "token", case.sens = FALSE, tfidf = FALSE, ...)

## S4 method for signature 'data.frame'
docTermMatrix(obj, terms = "token", case.sens = FALSE,
      tfidf = FALSE)

## S4 method for signature 'kRp.text'
docTermMatrix(obj, terms = "token", case.sens = FALSE, tfidf = FALSE)

`obj`	Either an object of class `kRp.text`, or a TIF[1] compliant token data frame.
`terms`	A character string defining the `tokens` column to be used for calculating the matrix.
`case.sens`	Logical, whether terms should be counted case-sensitively.
`tfidf`	Logical, if `TRUE` calculates term frequency–inverse document frequency (tf-idf) values instead of absolute frequency.
`...`	Additional arguments depending on the particular method.
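
To make the `tfidf` option more concrete, here is a minimal base R sketch of one common tf-idf definition: relative term frequency multiplied by the log of the inverse document frequency. The toy counts are made up, and the exact weighting koRpus applies internally may differ, so treat this as an illustration of the idea rather than the package's implementation.

```r
# toy count matrix (made-up data): rows are documents, columns are terms
counts <- matrix(
  c(2, 1, 0,
    1, 0, 3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("the", "cat", "dog"))
)

# term frequency: each count relative to the length of its document
tf <- counts / rowSums(counts)

# inverse document frequency: log(number of documents /
# number of documents that contain the term)
idf <- log(nrow(counts) / colSums(counts > 0))

# weight every term column of tf by its idf value
tfidf <- sweep(tf, 2, idf, `*`)
```

Under this definition a term that occurs in every document, like "the" in the toy data, gets a weight of 0, while terms concentrated in a single document are weighted up.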

This is usually more interesting when done with more than one text. If you're interested in full corpus analysis, the tm.plugin.koRpus[2] package is worth checking out. Alternatively, a data frame with multiple doc_id entries can be used.
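
As a sketch of the data frame route, the following builds a tiny hand-made token data frame with two `doc_id` values and passes it to `docTermMatrix()`. The texts and the name `toy_tokens` are invented for illustration, and the sketch assumes the data frame method needs no columns beyond `doc_id` and `token`. The call is only attempted if koRpus is installed.

```r
# hand-made token data frame (made-up data) in TIF style:
# first column doc_id, second column token, one row per token
toy_tokens <- data.frame(
  doc_id = c("text1", "text1", "text1", "text2", "text2"),
  token  = c("The", "cat", "sleeps", "The", "dog"),
  stringsAsFactors = FALSE
)

# assumption: doc_id and token are the only columns the method needs
if (requireNamespace("koRpus", quietly = TRUE)) {
  multiDTMatrix <- koRpus::docTermMatrix(toy_tokens)
}
```

With the default `case.sens = FALSE`, "The" and "the" would be counted as the same term.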

See the examples to learn how to limit the analysis to desired word classes.

A sparse matrix of class dgCMatrix.
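
To give an idea of what such a `dgCMatrix` looks like, here is a small sketch that builds one by hand with `sparseMatrix()` from the Matrix package: one row per document, one column per term, and only non-zero counts stored. This is not how koRpus constructs its result; the data and object names are made up for illustration.

```r
library(Matrix)

# made-up tokens: one row per token occurrence
tok <- data.frame(
  doc_id = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  token  = c("the", "cat", "the", "the", "dog"),
  stringsAsFactors = FALSE
)

docs  <- unique(tok$doc_id)
terms <- unique(tok$token)

# each token occurrence contributes 1 to its (document, term) cell;
# sparseMatrix() sums entries that share the same cell
toy_dtm <- sparseMatrix(
  i = match(tok$doc_id, docs),
  j = match(tok$token, terms),
  x = rep(1, nrow(tok)),
  dims = c(length(docs), length(terms)),
  dimnames = list(docs, terms)
)
```

Individual counts can then be looked up by name, for example `toy_dtm["doc1", "the"]`, and the object can be converted with `as.matrix()` where a dense matrix is needed.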

[1] Text Interchange Formats: https://github.com/ropensci/tif
[2] tm.plugin.koRpus: https://CRAN.R-project.org/package=tm.plugin.koRpus

# code is only run when the English language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  # of course this makes more sense with a corpus of
  # multiple texts, see the tm.plugin.koRpus[2] package
  # for that
  tokenized.obj <- tokenize(
    txt=sample_file,
    lang="en"
  )
  # get the document-term frequencies in a sparse matrix
  myDTMatrix <- docTermMatrix(tokenized.obj)

  # combine with filterByClass() to, e.g., exclude all punctuation
  myDTMatrix <- docTermMatrix(filterByClass(tokenized.obj))

  # instead of absolute frequencies, get the tf-idf values
  myDTMatrix <- docTermMatrix(
    filterByClass(tokenized.obj),
    tfidf=TRUE
  )
} else {}

koRpus documentation built on May 18, 2021, 1:13 a.m.
