freqlist: Build the frequency list of a corpus
In wai-wong-reimagine/mclm: Mastering Corpus Linguistics Methods


Builds the word frequency list of a corpus.

freqlist(x,
         re_drop_line = NA,
         line_glue = NA,
         re_cut_area = NA,
         re_token_splitter = "\\s+",
         re_token_extractor = "[^\\s]+",
         re_drop_token = NA,
         re_token_transf_in = NA,
         token_transf_out = NA,
         token_to_lower = TRUE,
         perl = TRUE,
         blocksize = 300,
         verbose = FALSE,
         show_dots = FALSE,
         dot_blocksize = 10,
         file_encoding = "UTF-8",
         as_text = FALSE)

`x`	the object `x` contains either the list of filenames of the corpus files (if `as_text` is `FALSE`) or the actual text of the corpus (if `as_text` is `TRUE`). If `as_text` is `TRUE` and the length of the vector `x` is greater than one, then each item in `x` is treated as a separate line (or a separate series of lines) in the corpus text. Within each item of `x`, the character `"\\n"` is also treated as a line separator.
`re_drop_line`	if `re_drop_line` is `NA`, then this argument is ignored. Otherwise, `re_drop_line` is a character vector (assumed to be of length 1) containing a regular expression. Lines in `x` that contain a match for `re_drop_line` are treated as not belonging to the corpus and are excluded from the results.
`line_glue`	if `line_glue` is `NA`, then this argument is ignored. Otherwise, all lines in a corpus file (or in `x`, if `as_text` is `TRUE`) are glued together in one character vector of length 1, with the string `line_glue` pasted in between consecutive lines. The value of `line_glue` can also be equal to the empty string `""`. The ‘line glue’ operation is conducted immediately after the ‘drop line’ operation.
`re_cut_area`	if `re_cut_area` is `NA`, then this argument is ignored. Otherwise, `re_cut_area` is a regular expression, and all matches for it in a corpus file (or in `x`, if `as_text` is `TRUE`) are 'cut out' of the text prior to the identification of the tokens in the text (and are therefore not taken into account when identifying the tokens). The ‘cut area’ operation is conducted immediately after the ‘line glue’ operation.
`re_token_splitter`	the actual token identification is based either on `re_token_splitter`, a regular expression that identifies the areas between the tokens, or on `re_token_extractor`, a regular expression that identifies the areas that are the tokens. The first mechanism is the default: the argument `re_token_extractor` is only used if `re_token_splitter` is `NA`. More specifically, `re_token_splitter` is a regular expression that identifies the locations where lines in the corpus files are split into tokens. The ‘token identification’ operation is conducted immediately after the ‘cut area’ operation.
`re_token_extractor`	a regular expression that identifies the locations of the actual tokens. This argument is only used if `re_token_splitter` is `NA`. Whereas matches for `re_token_splitter` are identified as the areas between the tokens, matches for `re_token_extractor` are identified as the areas of the actual tokens. Currently the implementation of `re_token_extractor` is a lot less time-efficient than that of `re_token_splitter`. The ‘token identification’ operation is conducted immediately after the ‘cut area’ operation.
`re_drop_token`	a regular expression that identifies tokens that are to be excluded from the results. Any token that contains a match for `re_drop_token` is removed from the results. If `re_drop_token` is `NA`, this argument is ignored. The ‘drop token’ operation is conducted immediately after the ‘token identification’ operation.
`re_token_transf_in`	a regular expression that identifies areas in the tokens that are to be transformed. This argument works together with the argument `token_transf_out`. If both `re_token_transf_in` and `token_transf_out` differ from `NA`, then all matches, in the tokens, for the regular expression `re_token_transf_in` are replaced with the replacement string `token_transf_out`. The ‘token transformation’ operation is conducted immediately after the ‘drop token’ operation.
`token_transf_out`	a ‘replacement string’. This argument works together with `re_token_transf_in` and is ignored if `re_token_transf_in` is `NA`.
`token_to_lower`	a boolean value that determines whether or not tokens must be converted to lowercase before returning the result. The ‘token to lower’ operation is conducted immediately after the ‘token transformation’ operation.
`perl`	a boolean value that determines whether or not the PCRE regular expression flavor is being used in the arguments that contain regular expressions.
`blocksize`	number that indicates how many corpus files are read into memory at each individual step of the procedure. Normally the default value of `300` should not be changed, but with exceptionally small corpus files it may be worthwhile to use a higher number, and with exceptionally large corpus files it may be worthwhile to use a lower number.
`verbose`	if `verbose` is `TRUE`, messages are printed to the console to indicate progress.
`show_dots`	if `show_dots` is `TRUE`, dots are printed to the console to indicate progress.
`dot_blocksize`	used together with `show_dots` when dots are printed to the console to indicate progress; presumably the number of corpus files processed per printed dot (default `10`).
`file_encoding`	file encoding that is assumed in the corpus files.
`as_text`	boolean vector, assumed to be of length 1, which determines whether `x` is to be interpreted as a character vector containing the actual contents of the corpus (if `as_text` is `TRUE`) or as a character vector containing the names of the corpus files (if `as_text` is `FALSE`). If `as_text` is `TRUE`, then the arguments `blocksize`, `verbose`, `show_dots`, `dot_blocksize`, and `file_encoding` are ignored.
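
The arguments above describe a fixed order of operations: drop line, line glue, cut area, token identification, drop token, token transformation, token to lower. The following is a minimal base R sketch of that order for the `as_text = TRUE` case, using the splitter mechanism only. It is not the code used by `freqlist`; the function name `sketch_freqlist` is hypothetical, and the result is a plain sorted `table`, not an object of class "freqlist".

```r
# Base R sketch of the documented order of operations.
# NOT mclm's implementation; sketch_freqlist is a hypothetical name.
sketch_freqlist <- function(lines,
                            re_drop_line = NA,
                            line_glue = NA,
                            re_cut_area = NA,
                            re_token_splitter = "\\s+",
                            re_drop_token = NA,
                            re_token_transf_in = NA,
                            token_transf_out = NA,
                            token_to_lower = TRUE) {
  # Newlines inside an item are also treated as line separators.
  lines <- unlist(strsplit(lines, "\n", fixed = TRUE))
  # 1. drop line: remove lines that contain a match
  if (!is.na(re_drop_line)) {
    lines <- lines[!grepl(re_drop_line, lines, perl = TRUE)]
  }
  # 2. line glue: paste all remaining lines into one string
  if (!is.na(line_glue)) {
    lines <- paste(lines, collapse = line_glue)
  }
  # 3. cut area: delete every match before tokens are identified
  if (!is.na(re_cut_area)) {
    lines <- gsub(re_cut_area, "", lines, perl = TRUE)
  }
  # 4. token identification (splitter mechanism only in this sketch)
  tokens <- unlist(strsplit(lines, re_token_splitter, perl = TRUE))
  tokens <- tokens[nzchar(tokens)]
  # 5. drop token: remove tokens that contain a match
  if (!is.na(re_drop_token)) {
    tokens <- tokens[!grepl(re_drop_token, tokens, perl = TRUE)]
  }
  # 6. token transformation: needs both the pattern and the replacement
  if (!is.na(re_token_transf_in) && !is.na(token_transf_out)) {
    tokens <- gsub(re_token_transf_in, token_transf_out, tokens, perl = TRUE)
  }
  # 7. token to lower
  if (token_to_lower) tokens <- tolower(tokens)
  sort(table(tokens), decreasing = TRUE)
}

# Markup lines are dropped before tokens are counted.
x <- c("<doc id='1'>", "The cat sat.", "The cat slept.", "</doc>")
fl <- sketch_freqlist(x, re_drop_line = "^<", re_token_splitter = "[\\s.]+")

# A tag that spans two lines can only be cut out after the lines are glued.
x2 <- c("a <b", "c> d")
fl2 <- sketch_freqlist(x2, line_glue = " ", re_cut_area = "<[^>]*>")
```

The second call shows why the order matters: without `line_glue`, the pattern `<[^>]*>` would not match on either line separately, and the tag fragments would be counted as tokens.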

This function returns a frequency list, i.e. an object of the class "freqlist".

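# A toy corpus passed directly as text (hence as_text = TRUE)
toy_corpus <- "Once upon a time there was a tiny toy corpus.
It consisted of three sentences. And it lived happily ever after."
# Default settings: split on whitespace, convert tokens to lowercase
(flist <- freqlist(toy_corpus, as_text = TRUE))
print(flist, n = 20)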

# Split on whitespace and punctuation, so punctuation is not part of the tokens
t_splitter <- "(?xi) [:\\s.;,?!\"]+"
freqlist(toy_corpus,
         re_token_splitter = t_splitter,
         as_text = TRUE)

# Same splitter, but keep the original capitalization of the tokens
t_splitter <- "(?xi) [:\\s.;,?!\"]+"
freqlist(toy_corpus,
         re_token_splitter = t_splitter,
         token_to_lower = FALSE,
         as_text = TRUE)

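# Use an extractor instead of a splitter (re_token_splitter must be NA);
# the punctuation marks matched by the extractor are counted as tokens
t_extractor <- "(?xi) ( [:;?!] | [.]+ | [\\w'-]+ )"
freqlist(toy_corpus,
         re_token_splitter = NA,
         re_token_extractor = t_extractor,
         as_text = TRUE)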
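
The difference between the two token identification mechanisms can be illustrated outside the package. The following base R sketch only mimics the idea (splitter matches lie between tokens, extractor matches are the tokens); it makes no claim about how `freqlist` implements either mechanism, and the sample text and variable names are made up for the illustration.

```r
# Illustration only: base R equivalents of the two mechanisms.
txt <- "It lived happily... ever after!"

# Splitter: matches mark the areas BETWEEN the tokens.
split_tokens <- strsplit(txt, "[\\s.!]+", perl = TRUE)[[1]]
split_tokens <- split_tokens[nzchar(split_tokens)]

# Extractor: matches ARE the tokens, so punctuation can be kept as tokens.
extract_tokens <- regmatches(
  txt,
  gregexpr("[.]+|[!]|[\\w'-]+", txt, perl = TRUE)
)[[1]]

print(split_tokens)
print(extract_tokens)
```

With the splitter, the ellipsis and the exclamation mark disappear because they belong to the separating areas; with the extractor, they are returned as tokens of their own.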