DTM: Document Term Matricizer


View source: R/DTM.R

Description

Turns a character vector of texts into a matrix of feature counts (a document-term matrix).

Usage

DTM(
  texts,
  sparse = 0.99,
  wstem = "all",
  ngrams = 1,
  language = "english",
  vocabmatch = NULL,
  stop.words = TRUE,
  punct = FALSE,
  POS = FALSE,
  dependency = FALSE,
  tag.sub = 0,
  overlap = 0.8,
  group.conc = NULL,
  group.conc.cutoff = 0.8,
  TPformat = FALSE,
  verbose = FALSE,
  mc.cores = 1
)

Arguments

texts

a character vector of texts.

sparse

maximum feature sparsity for inclusion (1 = include all features)
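As a rough illustration of what a sparsity threshold does, the base R sketch below filters a toy count matrix. It assumes that a feature's sparsity is the share of documents in which it never appears; this page does not spell out the definition, and `filter_sparse` is a hypothetical helper, not part of DTMtools.

```r
# Toy document-term matrix: 4 documents (rows) by 3 features (columns).
counts <- matrix(
  c(1, 0, 0,
    2, 0, 1,
    1, 0, 0,
    3, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("good", "rare", "okay"))
)

# Assumed definition: sparsity = share of documents with a zero count.
feature_sparsity <- colMeans(counts == 0)

# Hypothetical helper: keep features whose sparsity does not exceed `sparse`.
filter_sparse <- function(m, sparse) {
  m[, colMeans(m == 0) <= sparse, drop = FALSE]
}

kept_strict <- colnames(filter_sparse(counts, sparse = 0.7))  # only "good"
kept_all    <- colnames(filter_sparse(counts, sparse = 1))    # all three features
```

Under this reading, sparse = 1 keeps every feature, which matches the note in the argument description.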

wstem

character - which words should be stemmed?

ngrams

numeric vector of ngram sizes to include (maximum is 1:3)

language

character - what language are you parsing?

vocabmatch

matrix - a previously created feature matrix; the new matrix is built with features identical to it
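The idea of matching a new matrix to an earlier vocabulary can be sketched in base R as follows. This is only an illustration of the concept, not the package's implementation; `match_vocab` and the toy matrices are hypothetical.

```r
# Hypothetical helper: reshape `new` so its columns match `old` exactly.
# Features missing from `new` are filled with zeros; features that are
# only in `new` are dropped.
match_vocab <- function(new, old) {
  out <- matrix(0, nrow = nrow(new), ncol = ncol(old),
                dimnames = list(rownames(new), colnames(old)))
  shared <- intersect(colnames(old), colnames(new))
  out[, shared] <- new[, shared]
  out
}

# Earlier matrix with features cat, dog, fish.
old <- matrix(c(1, 0, 2, 1, 0, 3), nrow = 2,
              dimnames = list(NULL, c("cat", "dog", "fish")))

# New matrix with features dog, bird.
new <- matrix(c(4, 1, 0, 2), nrow = 2,
              dimnames = list(NULL, c("dog", "bird")))

aligned <- match_vocab(new, old)
colnames(aligned)  # same columns, in the same order, as `old`
```

Aligning columns this way is what lets a model trained on one set of texts be applied to features extracted from another.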

stop.words

logical - should stop words be included? Default is TRUE.

punct

logical - should exclamation points and question marks be included as features?

POS

logical - should features have part-of-speech tags appended? Default is FALSE.

dependency

logical - should features have dependency relations appended? Default is FALSE.

tag.sub

numeric - what fraction of features should be replaced by POS tags? Default is 0 (no features); fractions are not supported yet.

overlap

numeric - how dissimilar (in cosine distance) must an ngram be from all (n-1)grams to be added to the feature set?
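To make the cosine comparison concrete, here is a small base R sketch. It shows one plausible reading, in which an ngram is treated as redundant when its count vector is at least `overlap`-similar to one of its lower-order grams; how DTM() actually applies the threshold is not documented on this page. The helpers `cosine_similarity` and `is_redundant` and the toy counts are hypothetical.

```r
# Cosine similarity between two count vectors (cosine distance is 1 minus this).
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Counts across 4 documents.
new_york <- c(1, 0, 2, 0)  # bigram "new_york"
york     <- c(1, 0, 2, 0)  # unigram "york": only ever appears inside the bigram
new      <- c(1, 3, 2, 1)  # unigram "new": also appears in other contexts

sim_york <- cosine_similarity(new_york, york)  # identical vectors
sim_new  <- cosine_similarity(new_york, new)   # partial overlap

# Assumed rule: flag the ngram if any lower-order gram is too similar to it.
is_redundant <- function(ngram, lower_grams, overlap) {
  sims <- vapply(lower_grams, cosine_similarity, numeric(1), y = ngram)
  max(sims) >= overlap
}

flag_both <- is_redundant(new_york, list(york = york, new = new), overlap = 0.8)
flag_new  <- is_redundant(new_york, list(new = new), overlap = 0.8)
```

In this toy case the bigram carries no information beyond "york", so it is flagged when compared against both unigrams, but not when compared against "new" alone.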

group.conc

character - group IDs for removing group-specific words

group.conc.cutoff

numeric - threshold for group-specificity of words, as the proportion of occurrences in the main group.
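The group-specificity measure can be illustrated with base R. The sketch assumes that a word's "main group" is the group where it occurs most often, and that words at or above the cutoff are dropped; both are assumptions, and the toy data and variable names are hypothetical.

```r
# Toy counts: 4 documents (rows) by 2 features (columns).
counts <- matrix(
  c(3, 1,
    2, 0,
    0, 1,
    0, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("refund", "thanks"))
)
group_ids <- c("a", "a", "b", "b")  # one group ID per document

# Total occurrences of each feature within each group (groups x features).
by_group <- rowsum(counts, group_ids)

# Share of each feature's occurrences that falls in its main group.
concentration <- apply(by_group, 2, max) / colSums(by_group)

# Assumed rule: drop features whose concentration reaches the cutoff.
cutoff <- 0.8
kept <- counts[, concentration < cutoff, drop = FALSE]
colnames(kept)
```

Here "refund" occurs only in group a (concentration 1) and is removed, while "thanks" has 3 of its 4 occurrences in group b (concentration 0.75) and is kept.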

TPformat

logical - return in stm::textProcessor() format?

verbose

logical - report interim steps during processing

Value

Feature counts, as a matrix (or in stm format)
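A minimal call might look like the sketch below. It assumes the package is installed under the name DTMtools (for example from the myeomans/DTMtools GitHub repository) and that DTM() is exported; the example texts are made up, and the call is guarded so the snippet still runs when the package is absent. The shape of the result depends on the texts and settings, so no output is shown.

```r
# Made-up example texts.
texts <- c(
  "I love this product!",
  "Is this a refund request?",
  "Thanks, it works well."
)

# Only call DTM() if the package is available (assumed name: DTMtools).
if (requireNamespace("DTMtools", quietly = TRUE)) {
  dtm <- DTMtools::DTM(
    texts,
    sparse = 1,     # include all features
    ngrams = 1:2,   # unigrams and bigrams
    punct = TRUE    # count "!" and "?" as features
  )
  dim(dtm)          # documents by features
}
```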


myeomans/DTMtools documentation built on March 2, 2020, 8:57 p.m.