TokenGenerator: Generates n-grams from text files


Description

It generates n-gram tokens along with their frequencies. The data may be saved to a file, either in plain text format or as an R object.

Super class

wordpredictor::Base -> TokenGenerator

Methods

Public methods

Method new()

It initializes the current object. It sets the file name, the tokenization options and the verbose option.

Usage
TokenGenerator$new(fn = NULL, opts = list(), ve = 0)
Arguments
fn

The path to the input file.

opts

The options for generating the n-gram tokens.

  • n. The n-gram size.

  • save_ngrams. Whether the n-gram data should be saved.

  • min_freq. All n-grams with frequency less than min_freq are ignored.

  • line_count. The number of lines to process at a time.

  • stem_words. Whether words should be transformed to their stems.

  • dir. The directory where the output file should be saved.

  • format. The format for the output. There are two options.

    • plain. The data is stored in plain text.

    • obj. The data is stored as an R object.

ve

The level of detail in the information messages.
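As a rough illustration of how the options above fit together, the sketch below builds an options list that names every documented option and passes it to the constructor. The option values, the input file name and its contents are assumptions chosen for illustration; they are not the package defaults, which this page does not state. The constructor call is guarded so that the snippet still runs when wordpredictor is not installed.

```r
# Hypothetical option values, chosen only for illustration. The package
# defaults are not listed in this page and may differ.
tg_opts <- list(
    "n" = 2,                # generate bi-grams
    "save_ngrams" = TRUE,   # write the n-gram data to a file
    "min_freq" = 2,         # ignore n-grams that occur fewer than 2 times
    "line_count" = 5000,    # number of lines to process at a time
    "stem_words" = FALSE,   # keep the words as they are
    "dir" = tempdir(),      # output directory
    "format" = "plain"      # "plain" (text) or "obj" (R object)
)

# A small, hypothetical input file created in the temp directory
input_file <- file.path(tempdir(), "input.txt")
writeLines(c("the cat sat on the mat", "the cat ran"), input_file)

# The object is created only if the package is available
if (requireNamespace("wordpredictor", quietly = TRUE)) {
    tg <- wordpredictor::TokenGenerator$new(input_file, tg_opts, ve = 0)
}
```

Options left out of the list are presumably filled in by the class with its own defaults, so in practice only the values that need to change have to be supplied.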


Method generate_tokens()

It generates n-gram tokens and their frequencies from the given input file. The tokens may be saved to a file, either as plain text or as an R object.

Usage
TokenGenerator$generate_tokens()
Returns

The data frame containing n-gram tokens along with their frequencies.
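To make the shape of such a result concrete, here is a minimal base-R sketch of counting n-grams and filtering them by a minimum frequency. It is not the wordpredictor implementation: the function name count_ngrams and the column names ngram and freq are hypothetical, and n-grams are built per line using simple whitespace splitting.

```r
# Illustrative sketch only; not how TokenGenerator is implemented.
count_ngrams <- function(lines, n = 2, min_freq = 1) {
    # Lower-case each line and split it into words on whitespace
    words_per_line <- strsplit(tolower(trimws(lines)), "\\s+")
    # Build the n-grams of each line separately
    ngrams <- unlist(lapply(words_per_line, function(w) {
        w <- w[nzchar(w)]
        # A line with fewer than n words yields no n-grams
        if (length(w) < n) return(character(0))
        starts <- seq_len(length(w) - n + 1)
        vapply(starts, function(i) {
            paste(w[i:(i + n - 1)], collapse = " ")
        }, character(1))
    }))
    if (length(ngrams) == 0) {
        return(data.frame(
            ngram = character(0), freq = integer(0),
            stringsAsFactors = FALSE
        ))
    }
    # Count how often each n-gram occurs
    counts <- table(ngrams)
    df <- data.frame(
        ngram = names(counts),
        freq = as.integer(counts),
        stringsAsFactors = FALSE
    )
    # Drop n-grams with frequency less than min_freq
    df <- df[df$freq >= min_freq, , drop = FALSE]
    # Most frequent first, ties broken alphabetically
    df <- df[order(-df$freq, df$ngram), , drop = FALSE]
    rownames(df) <- NULL
    df
}

lines <- c("the cat sat", "the cat ran", "a dog sat")
bigrams <- count_ngrams(lines, n = 2, min_freq = 1)
frequent <- count_ngrams(lines, n = 2, min_freq = 2)
```

Here bigrams holds five distinct bi-grams, with "the cat" first because it occurs twice, while frequent keeps only "the cat".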

Examples
# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("test-clean.txt")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve, rp = "./")
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code

# The n-gram size
n <- 4
# The test file name
tfn <- file.path(ed, "test-clean.txt")
# The tokenization options are set
tg_opts <- list("n" = n, "save_ngrams" = TRUE, "dir" = ed)
# The TokenGenerator object is created
tg <- TokenGenerator$new(tfn, tg_opts, ve = ve)
# The n-gram tokens are generated
tg$generate_tokens()

# The test environment is removed. Comment out the line below to view
# the files generated by the function
em$td_env()

Method clone()

The objects of this class are cloneable with this method.

Usage
TokenGenerator$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.
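The difference between a shallow and a deep clone matters only when an object holds other R6 objects in its fields. The sketch below illustrates this with two hypothetical classes, Inner and Outer, which are not part of wordpredictor. It assumes the R6 package is installed.

```r
library(R6)

# Hypothetical classes, used only to illustrate clone semantics
Inner <- R6Class("Inner", public = list(value = 1))
Outer <- R6Class("Outer", public = list(
    inner = NULL,
    initialize = function() {
        self$inner <- Inner$new()
    }
))

a <- Outer$new()
# The shallow clone shares the Inner object with a
shallow <- a$clone()
# The deep clone gets its own copy of the Inner object
deep <- a$clone(deep = TRUE)

# The Inner object of a is modified
a$inner$value <- 2

shallow$inner$value  # 2, since the Inner object is shared
deep$inner$value     # 1, since the Inner object was copied
```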

Examples

## ------------------------------------------------
## Method `TokenGenerator$generate_tokens`
## ------------------------------------------------

# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("test-clean.txt")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve, rp = "./")
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code

# The n-gram size
n <- 4
# The test file name
tfn <- file.path(ed, "test-clean.txt")
# The tokenization options are set
tg_opts <- list("n" = n, "save_ngrams" = TRUE, "dir" = ed)
# The TokenGenerator object is created
tg <- TokenGenerator$new(tfn, tg_opts, ve = ve)
# The n-gram tokens are generated
tg$generate_tokens()

# The test environment is removed. Comment out the line below to view
# the files generated by the function
em$td_env()

wordpredictor documentation built on June 19, 2021, 5:06 p.m.