Builds the word frequency list of a corpus.
freqlist(x,
         re_drop_line = NA,
         line_glue = NA,
         re_cut_area = NA,
         re_token_splitter = "\\s+",
         re_token_extractor = "[^\\s]+",
         re_drop_token = NA,
         re_token_transf_in = NA,
         token_transf_out = NA,
         token_to_lower = TRUE,
         perl = TRUE,
         blocksize = 300,
         verbose = FALSE,
         show_dots = FALSE,
         dot_blocksize = 10,
         file_encoding = "UTF-8",
         as_text = FALSE)
x
    the object … If (if …
re_drop_line
    if …
line_glue
    if …
re_cut_area
    if …
re_token_splitter
    the actual token identification is either based on … more specifically, …
re_token_extractor
    a regular expression that identifies the locations of the actual tokens. This argument is only used if …
re_drop_token
    a regular expression that identifies tokens that are to be excluded from the results. Any token that contains a match for …
re_token_transf_in
    a regular expression that identifies areas in the tokens that are to be transformed. This argument works together with the argument … If both … The ‘token transformation’ operation is conducted immediately after the ‘drop token’ operation.
token_transf_out
    a ‘replacement string’. This argument works together with …
token_to_lower
    a boolean value that determines whether or not tokens must be converted to lowercase before returning the result. The ‘token to lower’ operation is conducted immediately after the ‘token transformation’ operation.
perl
    a boolean value that determines whether or not the PCRE regular expression flavor is being used in the arguments that contain regular expressions.
blocksize
    number that indicates how many corpus files are read to memory ‘at each individual step’ during the steps in the procedure; normally the default value of …
verbose
    if …
show_dots
    if …
dot_blocksize
    if …
file_encoding
    file encoding that is assumed in the corpus files.
as_text
    boolean vector, assumed to be of length 1, which determines whether …
This function returns a frequency list, i.e. an object of the class "freqlist".
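To make the notion of a frequency list concrete, here is a minimal base-R sketch that counts word types in a short string. It is an illustration only: it does not use or reproduce the "freqlist" class, and the names `toy_text`, `tokens` and `counts` are hypothetical. It mirrors just two documented defaults, splitting on `"\\s+"` and converting tokens to lowercase.

```r
# Base-R illustration only; this is not how freqlist() is implemented.
toy_text <- "a tiny toy corpus about a toy"   # hypothetical input

# Split on runs of whitespace (mirrors the default re_token_splitter = "\\s+")
tokens <- strsplit(toy_text, "\\s+", perl = TRUE)[[1]]

# Lowercase the tokens (mirrors the default token_to_lower = TRUE)
tokens <- tolower(tokens)

# Count each word type and sort by decreasing frequency
counts <- sort(table(tokens), decreasing = TRUE)
print(counts)
```

The result is a plain named table of counts, which is a much simpler structure than the object returned by `freqlist`.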
# A toy corpus supplied directly as text rather than read from files
toy_corpus <- "Once upon a time there was a tiny toy corpus.
It consisted of three sentence. And it lived happily ever after."

# Default settings: split on whitespace, convert tokens to lowercase
(flist <- freqlist(toy_corpus, as_text = TRUE))
print(flist, n = 20)

# Custom splitter: runs of whitespace and punctuation separate the tokens
t_splitter <- "(?xi) [:\\s.;,?!\"]+"
freqlist(toy_corpus,
         re_token_splitter = t_splitter,
         as_text = TRUE)

# Same splitter, but keep the original case of the tokens
freqlist(toy_corpus,
         re_token_splitter = t_splitter,
         token_to_lower = FALSE,
         as_text = TRUE)

# Extractor instead of splitter: punctuation marks are tokens too
t_extractor <- "(?xi) ( [:;?!] | [.]+ | [\\w'-]+ )"
freqlist(toy_corpus,
         re_token_splitter = NA,
         re_token_extractor = t_extractor,
         as_text = TRUE)
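The argument descriptions state that tokens are found either with a splitter or with an extractor, and that the ‘drop token’, ‘token transformation’ and ‘token to lower’ operations run in that order. The base-R sketch below approximates that sequence so the effect of each setting can be inspected on plain character vectors. `sketch_tokens` is a hypothetical helper, not part of the package. It assumes that the extractor is used only when the splitter is `NA` (as in the last example above) and that the transformation applies only when both transformation arguments are set; neither assumption is confirmed by the text here.

```r
# Hypothetical helper: a base-R approximation, not the package's implementation.
sketch_tokens <- function(text,
                          re_splitter = "\\s+",
                          re_extractor = "[^\\s]+",
                          re_drop = NA,
                          re_transf_in = NA,
                          transf_out = NA,
                          to_lower = TRUE) {
  if (!is.na(re_splitter)) {
    # Splitter mode: the regex matches the material between tokens
    tokens <- strsplit(text, re_splitter, perl = TRUE)[[1]]
  } else {
    # Extractor mode (assumed fallback): the regex matches the tokens themselves
    tokens <- regmatches(text, gregexpr(re_extractor, text, perl = TRUE))[[1]]
  }
  tokens <- tokens[nzchar(tokens)]  # discard empty strings left by splitting

  # Step 1, 'drop token': remove tokens that contain a match
  if (!is.na(re_drop)) {
    tokens <- tokens[!grepl(re_drop, tokens, perl = TRUE)]
  }
  # Step 2, 'token transformation': assumed to need both arguments
  if (!is.na(re_transf_in) && !is.na(transf_out)) {
    tokens <- gsub(re_transf_in, transf_out, tokens, perl = TRUE)
  }
  # Step 3, 'token to lower'
  if (to_lower) {
    tokens <- tolower(tokens)
  }
  tokens
}

text <- "It lived happily. The End!"  # hypothetical input

split_tokens   <- sketch_tokens(text)
# "it" "lived" "happily." "the" "end!"

extract_tokens <- sketch_tokens(text, re_splitter = NA,
                                re_extractor = "[.!]|\\w+")
# "it" "lived" "happily" "." "the" "end" "!"

kept_tokens    <- sketch_tokens(text, re_drop = "[.!]")
# "it" "lived" "the"
```

With the default splitter, punctuation stays attached to the neighbouring word, whereas the extractor pattern turns each punctuation mark into a token of its own. This is the same contrast the `t_splitter` and `t_extractor` examples are meant to show.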