ngram_freq: Produce a frequency list of ngrams.
In ekbrown/corpling: Simple Tools for Corpus Linguistics


Produces a frequency list of ngrams and returns it as a data frame. Users specify the number of words in the ngrams, whether the list is ordered alphabetically or by frequency, and whether that order is ascending or descending, among other options.

ngram_freq(text, num_wd = 1, ignore_case = TRUE, order_by = "alpha", descending = FALSE, min_freq = 1, word_char = NULL)

`text`	The text with the ngrams whose frequencies are to be determined, as either a character vector or something coercible to it, such as a list of character vectors.
`num_wd`	Specifies the number of words in the ngrams, whether single words (`num_wd = 1`, the default), bigrams (`num_wd = 2`), trigrams (`num_wd = 3`), or even larger ngrams.
`ignore_case`	Specifies whether the frequency list be case-insensitive (`ignore_case = TRUE`, the default) or case-sensitive (`ignore_case = FALSE`). If case-insensitive, the ngrams are converted to upper-case.
`order_by`	Specifies whether the frequency list be ordered alphabetically (`order_by = "alpha"`, the default) or by frequency (`order_by = "freq"`).
`descending`	Specifies whether the frequency list be ordered in ascending order (`descending = FALSE`, the default) or descending order (`descending = TRUE`).
`min_freq`	Specifies the minimum frequency that an ngram must have in order to be included in the frequency list. With `min_freq = 1` (the default), all ngrams are included.
`word_char`	If `word_char = NULL` (the default), the user's locale is used to distinguish word characters (e.g., "abc") from non-word characters (e.g., ".?!"). A user's locale can be determined with `sessionInfo()`. If words are split where they shouldn't be, users can supply a character class that specifies the word characters. For example, `word_char = "[-'a-z]+"` specifies that any run of one or more contiguous dashes, apostrophes, or letters "a" to "z" be treated as a word. As a result, the sentence "It's a hard-knock life, for us!" has six words, rather than more from splitting "It's" and "hard-knock" at the apostrophe and the dash.
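To make the effect of a `word_char` character class concrete, the following is a minimal sketch in base R. It is not the package's internal code, only an illustration of how a pattern such as `[-'a-z]+` picks out words. The lower-casing step is an assumption added here so that the `a-z` class also matches the capital "I".

```r
# Illustrative only: base R tokenization, not corpling's implementation.
sentence <- "It's a hard-knock life, for us!"

# Assumption for this sketch: lower-case first so that a-z matches "I".
lowered <- tolower(sentence)

# Extract every maximal run of dashes, apostrophes, or letters a-z.
tokens <- regmatches(lowered, gregexpr("[-'a-z]+", lowered))[[1]]

print(tokens)
length(tokens)  # 6
```

Because the apostrophe and the dash are inside the character class, "it's" and "hard-knock" each stay in one piece.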

A two-column local data frame, the first column with the ngrams and the second column with the frequencies.

For more info about local data frames, see https://cran.r-project.org/web/packages/dplyr/vignettes/data_frames.html.
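As a rough picture of what a two-column ngram frequency table contains, the sketch below counts bigrams with base R only. It is not how `ngram_freq` is implemented. The column names `ngram` and `freq` are hypothetical, and the sketch treats the input as one already-tokenized word sequence, which may differ from how the package handles sentence or element boundaries.

```r
# Illustrative only: base R, not corpling's implementation.
words <- c("here", "too", "with", "another", "thought", "here", "too")

# Pair each word with its successor to form bigrams.
bigrams <- paste(head(words, -1), tail(words, -1))

# Tabulate the bigrams and store them in a two-column data frame.
# The column names are hypothetical, chosen for this sketch.
counts <- table(bigrams)
freq_df <- data.frame(ngram = names(counts),
                      freq = as.integer(counts),
                      stringsAsFactors = FALSE)

# Order by descending frequency, breaking ties alphabetically.
freq_df <- freq_df[order(-freq_df$freq, freq_df$ngram), ]

# Show only ngrams that occur at least twice (analogous to min_freq = 2).
freq_df[freq_df$freq >= 2, ]
```

Here only the bigram "here too" occurs twice, so it is the single row left after filtering.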

library(corpling)

text <- c("First sentence here. Short, but sweet.")
text <- c(text, "Second one here, maybe?")
text <- c(text, "Third short paragraph here!")
text <- c(text, "Here too; with another thought.", "Here too.")

ngram_freq(text)
ngram_freq(text, ignore_case = FALSE)
ngram_freq(text, order_by = "freq", descending = TRUE)
ngram_freq(text, order_by = "freq", descending = TRUE, min_freq = 2)

# view difference (if any, given your locale)
ngram_freq("It's a hard-knock life, for us!")
ngram_freq("It's a hard-knock life, for us!", word_char = "[-'a-z]+")

# gets bigram frequencies
ngram_freq(text, num_wd = 2)
ngram_freq(text, num_wd = 2, order_by = "freq", descending = TRUE)

# gets trigram frequencies
ngram_freq(text, num_wd = 3)