corpora_to_word_list: Corpora to Word List


View source: R/keyToEnglish.R

Description

Converts a collection of documents to a word list

Usage

corpora_to_word_list(
  paths,
  ascii_only = TRUE,
  custom_regex = NA,
  max_word_length = 20,
  stopword_fn = DEFAULT_STOPWORDS,
  min_word_count = 5,
  max_size = 16^3,
  min_word_length = 3,
  output_file = NA,
  json_path = NA
)

Arguments

paths

Paths of plaintext documents

ascii_only

If TRUE, non-ASCII characters are omitted

custom_regex

If not NA, overrides ascii_only; this regular expression then determines what a valid word consists of

max_word_length

Maximum length of extracted words

stopword_fn

Either the name of a file containing the stopwords to use, or the stopwords themselves (if length > 1)

min_word_count

Minimum number of occurrences for a word to be added to word list

max_size

Maximum size of list

min_word_length

Minimum length of words

output_file

File to write list to

json_path

If the input text is JSON, it is parsed as JSON when this is a character vector of JSON keys to follow
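
To make the interaction of the filtering arguments concrete, here is a minimal base-R sketch of one way such a pipeline could work. It is not the keyToEnglish implementation: the helper name sketch_word_list, the lowercasing step, the "[a-z]+" pattern standing in for ascii_only = TRUE, and the order in which the filters are applied are all assumptions made for illustration.

```r
# Illustrative sketch only: NOT the keyToEnglish implementation.
# sketch_word_list is a hypothetical helper; the filter order is assumed.
sketch_word_list <- function(text, stopwords = character(0),
                             min_word_length = 3, max_word_length = 20,
                             min_word_count = 5, max_size = 16^3) {
  # Lowercase, then extract runs of ASCII letters as candidate words
  lowered <- tolower(text)
  words <- unlist(regmatches(lowered, gregexpr("[a-z]+", lowered)))

  # Keep only words within the length bounds
  n <- nchar(words)
  words <- words[n >= min_word_length & n <= max_word_length]

  # Drop stopwords
  words <- words[!(words %in% stopwords)]

  # Count occurrences and keep words that appear often enough
  counts <- sort(table(words), decreasing = TRUE)
  counts <- counts[counts >= min_word_count]

  # Cap the result at max_size words, most frequent first
  as.character(head(names(counts), max_size))
}

text <- c("the cat sat on the mat", "the cat ate; the cat slept", "a dog sat")
sketch_word_list(text, stopwords = c("the"), min_word_count = 2)
# "cat" "sat"
```

In this toy input, "on" and "a" fall below min_word_length, "the" is removed as a stopword, and only "cat" (3 occurrences) and "sat" (2 occurrences) reach min_word_count = 2.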

Value

A 'character' vector of words
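
A hedged usage sketch follows, using only the arguments listed above. The temporary file and its contents are invented for illustration, and the call is guarded so that the snippet still runs when keyToEnglish is not installed. The exact words returned depend on the default stopword list and on tokenization details not described here, so no output is shown.

```r
# Build a small throwaway corpus so the example does not depend on real files
corpus_path <- tempfile(fileext = ".txt")
writeLines(rep("alpha beta gamma alpha beta alpha", 5), corpus_path)

if (requireNamespace("keyToEnglish", quietly = TRUE)) {
  words <- keyToEnglish::corpora_to_word_list(
    paths = corpus_path,
    min_word_count = 5,   # keep words seen at least 5 times
    min_word_length = 3   # skip very short words
  )
  # According to the Value section, the result is a character vector
  print(is.character(words))
  print(head(words))
}
```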


keyToEnglish documentation built on Feb. 14, 2021, 1:05 a.m.