qe_texts: Quanteda analysis of text vector
In abresler/govtrackR: Federal government tracking tools

Quanteda analysis of text vector

qe_texts(
  texts = NULL,
  method = "word",
  remove_words = c("\\.", "\\-", "\\#", "\\'", "\\,", "\\;", "\\_",
    "\\DESCRIPTION:", "\\:", "\\SBIR", "\\I ", "\\II ", "\\III ", "PHASE"),
  dfm_dictionary = NULL,
  n_top_features = 10,
  stem = F,
  exclude_features = F,
  remove_numbers = T,
  remove_punct = T,
  remove_symbols = T,
  remove_separators = TRUE,
  remove_twitter = T,
  remove_hyphens = T,
  collocation_size = 3,
  include_textstat = T,
  remove_url = FALSE,
  stop_sources = c("smart", "snowball", "stopwords-iso"),
  n_gram_tokens = 2,
  include_dfm = F,
  verbose = T
)
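
A minimal call might look like the sketch below. It assumes the govtrackR package is installed and that `qe_texts` is exported from it. The sample strings are made up for illustration. The structure of the returned object is not described on this page, so the sketch only inspects it with `str()`. Whether the remaining defaults run cleanly may depend on the installed quanteda version.

```r
# Hypothetical sample input: two short award-style descriptions.
texts <- c(
  "SBIR Phase I: low-cost radar sensors for small unmanned aircraft.",
  "SBIR Phase II: machine learning tools for satellite image analysis."
)

# Only run the call when the package is available (assumption: qe_texts is exported).
if (requireNamespace("govtrackR", quietly = TRUE)) {
  result <- govtrackR::qe_texts(
    texts = texts,
    method = "word",
    n_top_features = 5,
    stem = FALSE,
    verbose = FALSE
  )
  # The return structure is not documented here, so just look at the top level.
  str(result, max.level = 1)
}
```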

`texts`	vector of text
`method`	the unit for splitting the text. Available alternatives are: `"word"` (recommended default), the smartest but slowest word tokenization method (see stringi-search-boundaries for details); `"fasterword"`, a dumber but faster word tokenization method that uses `stri_split_charclass(x, "[\\p{Z}\\p{C}]+")`; `"fastestword"`, the dumbest but fastest word tokenization method, which calls `stri_split_fixed(x, " ")`; `"character"`, tokenization into individual characters; `"sentence"`, a sentence segmenter smart enough to handle some exceptions in English such as "Prof. Plum killed Mrs. Peacock." (but far from perfect).
`dfm_dictionary`	if not `NULL`, a dictionary of word meanings
`n_top_features`	if not `NULL`, the number of top features for the feature count
`stem`	if `TRUE`, stem words
`exclude_features`	if `TRUE`, remove the nested feature list
`remove_numbers`	logical; if `TRUE` remove tokens that consist only of numbers, but not words that start with digits, e.g. `2day`
`remove_separators`	logical; if `TRUE`, remove separators and separator characters (Unicode "Separator" [Z] and "Control" [C] categories). Only applicable for `method = "character"` (when you probably want it to be `FALSE`) and for `method = "word"` (when you probably want it to be `TRUE`).
`remove_twitter`	logical; if `TRUE`, remove the Twitter characters `@` and `#`. Note that this will always be set to `FALSE` if `remove_punct = FALSE`.
`remove_hyphens`	logical; if `TRUE`, split words that are connected by hyphenation and hyphenation-like characters in between words, e.g. `"self-storage"` becomes `c("self", "storage")`. Set to `FALSE` to preserve such words as is, with the hyphens; the default in this function's signature is `TRUE`. Only applies if `method = "word"` or `method = "fasterword"`.
`collocation_size`	integer; collocation size for the textstat parameter
`include_textstat`	if `TRUE`, applies the textstat algorithm to the text vector
`remove_url`	logical; if `TRUE`, find and eliminate URLs beginning with http(s); see the section "Dealing with URLs" in the quanteda `tokens()` documentation.
`stop_sources`	stop word sources; `NULL` or any of `"smart"`, `"snowball"`, `"stopwords-iso"`
`n_gram_tokens`	integer; size of the n-gram tokens (default `2`)
`include_dfm`	if `TRUE`, includes the document-feature matrix
`verbose`	if `TRUE`, verbose output
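
To make a few of the tokenization arguments above more concrete, the following base-R sketch approximates what `method`, `remove_numbers`, `remove_hyphens` and `n_gram_tokens` describe. It is not the code `qe_texts` runs. The helper names (`strip_edge_punct`, `drop_numbers`, `split_hyphens`, `make_ngrams`) are hypothetical, and the regular expressions are simplified stand-ins. Joining n-grams with `"_"` is an assumption, and so is producing only n-grams of exactly size `n`. The `"word"` and `"sentence"` methods rely on ICU boundary analysis and are not reproduced here.

```r
# Made-up sample text; note the double space after "rose".
text <- "Self-storage demand rose  2day, up 15 percent since 2019."

# method = "fastestword": split on a single literal space.
# Adjacent spaces leave an empty string in the result.
fastest <- strsplit(text, " ", fixed = TRUE)[[1]]

# method = "fasterword": split on runs of Unicode separator/control characters.
faster <- strsplit(text, "[\\p{Z}\\p{C}]+", perl = TRUE)[[1]]

# method = "character": one token per character.
chars <- strsplit("SBIR", "")[[1]]

# Simplified punctuation step: strip punctuation at token edges only,
# so the hyphen inside "Self-storage" is kept for now.
strip_edge_punct <- function(tokens) {
  gsub("^[[:punct:]]+|[[:punct:]]+$", "", tokens)
}

# remove_numbers: drop tokens made up only of digits; "2day" survives.
drop_numbers <- function(tokens) {
  tokens[!grepl("^[0-9]+$", tokens)]
}

# remove_hyphens: split hyphenated words into their parts.
split_hyphens <- function(tokens) {
  unlist(strsplit(tokens, "-", fixed = TRUE))
}

# n_gram_tokens: join each run of n adjacent tokens (separator is an assumption).
make_ngrams <- function(tokens, n = 2L, sep = "_") {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1L)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1L)], collapse = sep),
    character(1)
  )
}

tokens <- split_hyphens(drop_numbers(strip_edge_punct(faster)))
bigrams <- make_ngrams(tokens, n = 2L)

print(tokens)
print(bigrams)
```

The order of the steps matters in this sketch: numbers are dropped before hyphens are split, so a token such as `"covid-19"` would keep its `"19"` part. The order `qe_texts` actually applies is not stated on this page.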