ngram_freq: Produce a frequency list of ngrams.

Description

Produces a frequency list of ngrams and returns a data frame. Users specify the number of words in the ngrams, the order of the frequency list as either alphabetical or by frequency, and whether the list be in ascending or descending order, among other options.

Usage

ngram_freq(text, num_wd = 1, ignore_case = TRUE, order_by = "alpha",
  descending = FALSE, min_freq = 1, word_char = NULL)

Arguments

text

The text with the ngrams whose frequencies are to be determined, as either a character vector or something coercible to it, such as a list of character vectors.

num_wd

Specifies the number of words in the ngrams, whether single words (num_wd = 1, the default), bigrams (num_wd = 2), trigrams (num_wd = 3), or even larger ngrams.
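As a rough illustration of what num_wd controls, the sketch below builds ngrams from a vector of words using base R only. The helper make_ngrams() is hypothetical and is not part of corpling; how ngram_freq() itself tokenizes, or whether it lets ngrams span separate elements of text, is not shown here.

```r
# Hypothetical helper, not part of corpling: joins each run of n adjacent
# words into one space-separated ngram.
make_ngrams <- function(words, n) {
  n_out <- length(words) - n + 1
  # Fewer than n words means no complete ngram can be formed.
  if (n_out < 1) return(character(0))
  vapply(seq_len(n_out),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

words <- c("third", "short", "paragraph", "here")
make_ngrams(words, 1)  # single words
make_ngrams(words, 2)  # bigrams: "third short" "short paragraph" "paragraph here"
make_ngrams(words, 3)  # trigrams: "third short paragraph" "short paragraph here"
```

A vector of k words yields k - n + 1 ngrams of size n, so larger values of num_wd give fewer (and usually rarer) ngrams.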

ignore_case

Specifies whether the frequency list be case-insensitive (ignore_case = TRUE, the default) or case-sensitive (ignore_case = FALSE). If case-insensitive, the ngrams are converted to upper-case.

order_by

Specifies whether the frequency list be ordered alphabetically (order_by = "alpha", the default) or by frequency (order_by = "freq").

descending

Specifies whether the frequency list be ordered in ascending order (descending = FALSE, the default) or descending order (descending = TRUE).

min_freq

Specifies the minimum frequency that an ngram must have in order to be included in the frequency list. With min_freq = 1 (the default), all ngrams are included.
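The combined effect of order_by, descending, and min_freq can be sketched with base R on a small, made-up vector of tokens. This is only an approximation of the behavior described above, not the package's implementation, and the column names ngram and freq are assumptions chosen for the illustration.

```r
# Made-up tokens, already upper-cased as with ignore_case = TRUE.
tokens <- c("HERE", "TOO", "HERE", "FIRST", "HERE", "TOO")

# Count each distinct token; table() returns the names in alphabetical order.
counts <- table(tokens)
freq <- data.frame(ngram = names(counts),
                   freq = as.integer(counts),
                   stringsAsFactors = FALSE)

# Analogue of min_freq = 2: drop tokens that occur only once.
freq <- freq[freq$freq >= 2, ]

# Analogue of order_by = "freq", descending = TRUE.
freq <- freq[order(freq$freq, decreasing = TRUE), ]
freq
```

Here "FIRST" is dropped by the frequency threshold, and "HERE" (3 occurrences) is listed before "TOO" (2 occurrences).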

word_char

If word_char = NULL (the default), the user's locale is used to distinguish word characters (e.g., "abc") from non-word characters (e.g., ".?!"). A user's locale can be determined with sessionInfo(). If words are split that shouldn't be, users can supply a character class that specifies word characters. For example, word_char = "[-'a-z]+" specifies that any run of one or more contiguous dashes, apostrophes, or letters "a" to "z" be treated as a word. As a result, the sentence "It's a hard-knock life, for us!" has six words ("It's", "a", "hard-knock", "life", "for", "us") rather than being split further at the apostrophe and the dash.
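The effect of such a character class can be checked directly with base R's regular expression functions. The sketch below is independent of corpling. It matches case-insensitively so that the class "a-z" also covers the capital "I", which is an assumption about how the pattern is applied, and it uses a letters-only class as a stand-in for a setting in which apostrophes and dashes are not word characters.

```r
sentence <- "It's a hard-knock life, for us!"

# Character class from the word_char example: dashes, apostrophes, letters.
hits <- gregexpr("[-'a-z]+", sentence, ignore.case = TRUE)
words <- regmatches(sentence, hits)[[1]]
words          # "It's" "a" "hard-knock" "life" "for" "us"
length(words)  # 6

# Letters only: the apostrophe and the dash now split words.
hits_narrow <- gregexpr("[a-z]+", sentence, ignore.case = TRUE)
words_narrow <- regmatches(sentence, hits_narrow)[[1]]
words_narrow          # "It" "s" "a" "hard" "knock" "life" "for" "us"
length(words_narrow)  # 8
```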

Value

A two-column local data frame, the first column with the ngrams and the second column with the frequencies.

See Also

For more info about local data frames, see https://cran.r-project.org/web/packages/dplyr/vignettes/data_frames.html.

Examples

text <- c("First sentence here. Short, but sweet.")
text <- c(text, "Second one here, maybe?")
text <- c(text, "Third short paragraph here!")
text <- c(text, "Here too; with another thought.", "Here too.")

ngram_freq(text)
ngram_freq(text, ignore_case = FALSE)
ngram_freq(text, order_by = "freq", descending = TRUE)
ngram_freq(text, order_by = "freq", descending = TRUE, min_freq = 2)

# view difference (if any, given your locale)
ngram_freq("It's a hard-knock life, for us!")
ngram_freq("It's a hard-knock life, for us!", word_char = "[-'a-z]+")

# gets bigram frequencies
ngram_freq(text, num_wd = 2)
ngram_freq(text, num_wd = 2, order_by = "freq", descending = TRUE)

# gets trigram frequencies
ngram_freq(text, num_wd = 3)

ekbrown/corpling documentation built on May 16, 2019, 2:24 a.m.