qe_texts: Quanteda analysis of text vector


Description

Quanteda analysis of text vector

Usage

qe_texts(
  texts = NULL,
  method = "word",
  remove_words = c("\\.", "\\-", "\\#", "\\'", "\\,", "\\;", "\\_",
    "\\DESCRIPTION:", "\\:", "\\SBIR", "\\I ", "\\II ", "\\III ", "PHASE"),
  dfm_dictionary = NULL,
  n_top_features = 10,
  stem = F,
  exclude_features = F,
  remove_numbers = T,
  remove_punct = T,
  remove_symbols = T,
  remove_separators = TRUE,
  remove_twitter = T,
  remove_hyphens = T,
  collocation_size = 3,
  include_textstat = T,
  remove_url = FALSE,
  stop_sources = c("smart", "snowball", "stopwords-iso"),
  n_gram_tokens = 2,
  include_dfm = F,
  verbose = T
)

Arguments

texts

vector of text

method

the unit for splitting the text; available alternatives are:

"word"

(recommended default) smartest, but slowest, word tokenization method; see stringi-search-boundaries for details.

"fasterword"

dumber, but faster, word tokenization method; uses stri_split_charclass(x, "[\\p{Z}\\p{C}]+")

"fastestword"

dumbest, but fastest, word tokenization method, calls stri_split_fixed(x, " ")

"character"

tokenization into individual characters

"sentence"

sentence segmenter, smart enough to handle some exceptions in English such as "Prof. Plum killed Mrs. Peacock." (but far from perfect).
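As a rough illustration of how the two fast word methods differ, the sketch below uses base R strsplit() as a stand-in for the stringi calls named above. It is an approximation for plain ASCII input, not the package's internal code, and the sample string is made up for the illustration.

```r
# Base-R approximation of the two fast tokenizers described above.
# The input string is a made-up example, not taken from the package.
x <- "self-storage  units\tin 2day"

# "fastestword": split on every single literal space
fastest <- strsplit(x, " ", fixed = TRUE)[[1]]

# "fasterword": split on runs of Unicode separator (Z) or control (C) characters
faster <- strsplit(x, "[\\p{Z}\\p{C}]+", perl = TRUE)[[1]]

print(fastest)
print(faster)
```

With the literal-space split, the doubled space leaves an empty token and the tab stays inside a token; the character-class split treats any run of separators or control characters as one boundary.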

dfm_dictionary

if not NULL dictionary of word meanings

n_top_features

if not NULL number of top features for feature count
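To make the idea of a "top features" count concrete, here is a minimal base-R sketch that tallies tokens and keeps the most frequent ones. The token vector is invented, and the function itself may compute this differently (for example from a document-feature matrix).

```r
# Hypothetical tokens; in practice these would come from the tokenized texts.
tokens <- c("ceramic", "electrodes", "ceramic", "area", "ceramic", "electrodes")
n_top_features <- 2

# Count each distinct token and sort from most to least frequent
counts <- sort(table(tokens), decreasing = TRUE)

# Keep only the n_top_features most frequent tokens
top <- head(counts, n_top_features)
print(top)
```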

stem

if TRUE stem words

exclude_features

if TRUE remove nested feature list

remove_numbers

logical; if TRUE remove tokens that consist only of numbers, but not words that start with digits, e.g. 2day
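The rule described above can be sketched in base R as follows. This is only an illustration of "digits only" versus "starts with a digit"; the real tokenizer may recognise more number formats (decimals, separators) than this simple pattern does.

```r
# Hypothetical tokens for illustration
tokens <- c("phase", "2", "2day", "2020", "area")

# TRUE for tokens made up entirely of digits
is_number_only <- grepl("^[0-9]+$", tokens)

# Drop digit-only tokens; "2day" survives because it also contains letters
kept <- tokens[!is_number_only]
print(kept)
```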

remove_separators

logical; if TRUE remove separators and separator characters (Unicode "Separator" [Z] and "Control" [C] categories). Only applicable for method = "character" (when you probably want it to be FALSE) and for method = "word" (when you probably want it to be TRUE).

remove_twitter

logical; if TRUE remove Twitter characters @ and #; set to TRUE if you wish to eliminate these. Note that this will always be set to FALSE if remove_punct = FALSE.

remove_hyphens

logical; if TRUE split words that are connected by hyphenation and hyphenation-like characters in between words, e.g. "self-storage" becomes c("self", "storage"). Set to FALSE to preserve such words as is, with the hyphens; note that the signature under Usage defaults to TRUE. Only applies if method = "word" or method = "fasterword".
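The effect of hyphen splitting can be approximated in base R as shown below. This sketch handles only the ASCII hyphen, whereas the description above also covers hyphenation-like characters, so treat it as an illustration rather than the actual implementation.

```r
# Hypothetical tokens, two of them hyphenated
tokens <- c("self-storage", "non-oxide", "ceramic")

# Split each token on the ASCII hyphen and flatten the result
split_tokens <- unlist(strsplit(tokens, "-", fixed = TRUE))
print(split_tokens)
```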

collocation_size

integer collocation size for textstat parameter

include_textstat

if TRUE applies textstat algorithm to text vector

remove_url

logical; if TRUE find and eliminate URLs beginning with http(s) – see section "Dealing with URLs".

stop_sources

stop word source

  • NULL

  • "smart"

  • "stopwords-iso"

  • "snowball"
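The source names listed above match sources offered by the stopwords package, so one plausible way to combine them is sketched here. Whether qe_texts builds its list this way is an assumption; the fallback vector is a tiny stand-in used only when the stopwords package is not installed.

```r
stop_sources <- c("smart", "snowball", "stopwords-iso")

if (requireNamespace("stopwords", quietly = TRUE)) {
  # Union of the English stop words from each requested source
  stop_words <- unique(unlist(lapply(stop_sources, function(src) {
    stopwords::stopwords("en", source = src)
  })))
} else {
  # Stand-in list for illustration only
  stop_words <- c("the", "of", "for", "and")
}

# Hypothetical tokens; drop any that appear in the combined list
tokens <- c("electrodes", "for", "ultracapacitors")
kept <- tokens[!tokens %in% stop_words]
print(kept)
```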

n_gram_tokens

integer size(s) of the token n-grams; default 2 (the call under Examples passes 1:4)
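For readers unfamiliar with token n-grams, the following base-R sketch builds them from a token vector for one or more sizes. The helper name make_ngrams and the "_" joiner are assumptions made for this illustration; they are not part of the package.

```r
# Hypothetical helper: build n-grams of each requested size from a token vector
make_ngrams <- function(tokens, n = 2L, sep = "_") {
  out <- character(0)
  for (size in n) {
    if (size > length(tokens)) next  # not enough tokens for this size
    starts <- seq_len(length(tokens) - size + 1L)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + size - 1L)], collapse = sep),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

tokens <- c("high", "surface", "area")
bigrams <- make_ngrams(tokens, 2L)
all_sizes <- make_ngrams(tokens, 1:3)
print(bigrams)
print(all_sizes)
```

Passing a vector of sizes, as with 1:3 here, simply concatenates the n-grams of each size in turn.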

include_dfm

if TRUE includes document feature matrix

verbose

if TRUE, produce verbose output

Examples

qe_texts(texts = "HIGH SURFACE AREA NON-OXIDE CERAMIC ELECTRODES FOR ULTRACAPACITORS", n_gram_tokens = 1:4)

abresler/govtrackR documentation built on July 11, 2020, 12:30 a.m.