TokenGenerator: Generates n-grams from text files

TokenGenerator    R Documentation

Generates n-grams from text files

Description

It generates n-gram tokens along with their frequencies. The data may be saved to a file in plain text format or as an R object.
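
To make "n-gram tokens along with their frequencies" concrete, here is a minimal base R sketch. It is an illustration only, not the package's internal code: the helper `count_ngrams` and the column names `ngram` and `freq` are hypothetical, and the tokenization rule (lower-case words split on non-letters) is an assumption.

```r
# Illustrative only: a base R n-gram counter, not the package's implementation.
# count_ngrams, and the columns "ngram" and "freq", are hypothetical names.
count_ngrams <- function(lines, n) {
  # Split each line into lower-case word tokens (assumed tokenization rule)
  words_per_line <- strsplit(tolower(lines), "[^a-z']+")
  ngrams <- unlist(lapply(words_per_line, function(w) {
    w <- w[nzchar(w)]
    if (length(w) < n) return(character(0))
    # Build every run of n consecutive words
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(ngrams) == 0) {
    return(data.frame(ngram = character(0), freq = integer(0),
                      stringsAsFactors = FALSE))
  }
  # Tabulate how often each n-gram occurs
  freq <- table(ngrams)
  df <- data.frame(ngram = names(freq), freq = as.integer(freq),
                   stringsAsFactors = FALSE)
  # Most frequent n-grams first, ties broken alphabetically
  df <- df[order(-df$freq, df$ngram), ]
  rownames(df) <- NULL
  df
}

bigrams <- count_ngrams(c("the cat sat on the mat", "the cat ran"), n = 2)
print(bigrams)
```

In this toy input the bigram "the cat" occurs twice and every other bigram occurs once.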

Super class

wordpredictor::Base -> TokenGenerator

Methods

Public methods


Method new()

It initializes the current object. It sets the input file name, the tokenization options and the verbosity level.

Usage
TokenGenerator$new(fn = NULL, opts = list(), ve = 0)
Arguments
fn

The path to the input file.

opts

The options for generating the n-gram tokens.

  • n. The n-gram size.

  • save_ngrams. If the n-gram data should be saved.

  • min_freq. All n-grams with frequency less than min_freq are ignored.

  • line_count. The number of lines to process at a time.

  • stem_words. If words should be transformed to their stems.

  • dir. The directory where the output file should be saved.

  • format. The format for the output. There are two options.

    • plain. The data is stored in plain text.

    • obj. The data is stored as an R object.

ve

The level of detail in the information messages.
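
The effect of the min_freq and format options described above can be sketched on a toy table in base R. This is an assumption-laden illustration, not the package's code: the column names, the file names, the tab-separated layout for "plain" and the use of `saveRDS` for "obj" are all hypothetical choices.

```r
# Toy n-gram table; the column names are illustrative assumptions
ngrams <- data.frame(
  ngram = c("the cat", "cat sat", "sat on"),
  freq = c(5L, 1L, 2L),
  stringsAsFactors = FALSE
)

# min_freq: n-grams with a frequency below the threshold are dropped
min_freq <- 2
kept <- ngrams[ngrams$freq >= min_freq, ]

# format = "plain": one possible plain text layout (tab separated)
out_dir <- tempdir()
plain_path <- file.path(out_dir, "ngrams-example.txt")  # hypothetical name
write.table(kept, plain_path, sep = "\t", quote = FALSE, row.names = FALSE)

# format = "obj": one possible way to store an R object, using saveRDS
obj_path <- file.path(out_dir, "ngrams-example.RDS")    # hypothetical name
saveRDS(kept, obj_path)

# Both files can be read back into a data frame
from_plain <- read.table(plain_path, sep = "\t", header = TRUE,
                         stringsAsFactors = FALSE)
from_obj <- readRDS(obj_path)
```

The plain text file can be inspected in any editor, while the R object file preserves column types exactly when it is read back.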


Method generate_tokens()

It generates n-gram tokens and their frequencies from the given input file. The tokens may be saved to a file as plain text or as an R object.

Usage
TokenGenerator$generate_tokens()
Returns

The data frame containing n-gram tokens along with their frequencies.
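
The line_count option sets how many lines are processed at a time. The following base R sketch shows one way to read a file in batches and merge per-batch counts into running totals. It is not the package's implementation: the batch size, the temporary file, and the simplification of counting whole lines instead of n-grams are all assumptions made for brevity.

```r
# Write a small temporary input file for the demonstration
path <- tempfile(fileext = ".txt")
writeLines(c("a b", "b c", "a b", "c d", "a b"), path)

line_count <- 2       # hypothetical batch size
counts <- integer(0)  # named integer vector: item -> running frequency

con <- file(path, open = "r")
repeat {
  # Read at most line_count lines; an empty result means end of file
  batch <- readLines(con, n = line_count)
  if (length(batch) == 0) break
  # Count the items in this batch and merge them into the running totals
  batch_counts <- table(batch)
  for (key in names(batch_counts)) {
    prev <- if (key %in% names(counts)) counts[[key]] else 0L
    counts[[key]] <- prev + as.integer(batch_counts[key])
  }
}
close(con)

print(counts)
```

Reading in batches keeps memory use bounded by the batch size rather than by the size of the input file.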

Examples
# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("test-clean.txt")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve, rp = "./")
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code

# The n-gram size
n <- 4
# The test file name
tfn <- paste0(ed, "/test-clean.txt")
# The tokenization options are set
tg_opts <- list("n" = n, "save_ngrams" = TRUE, "dir" = ed)
# The TokenGenerator object is created
tg <- TokenGenerator$new(tfn, tg_opts, ve = ve)
# The n-gram tokens are generated
tg$generate_tokens()

# The test environment is removed. Comment out the line below to keep
# the files generated by the function so they can be viewed
em$td_env()

Method clone()

The objects of this class are cloneable with this method.

Usage
TokenGenerator$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.
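
The difference between a shallow and a deep clone matters when an object holds other R6 objects in its fields. The sketch below uses two hypothetical classes, `Inner` and `Outer`, that are not part of wordpredictor, and assumes the R6 package is installed.

```r
library(R6)

# Hypothetical classes used only to illustrate clone semantics
Inner <- R6Class("Inner", public = list(value = 1))
Outer <- R6Class("Outer", public = list(
  inner = NULL,
  initialize = function() {
    self$inner <- Inner$new()
  }
))

a <- Outer$new()
shallow <- a$clone()          # deep = FALSE: the inner object is shared
deep <- a$clone(deep = TRUE)  # deep = TRUE: the inner object is copied

# Modify the original's inner object after cloning
a$inner$value <- 99

print(shallow$inner$value)
print(deep$inner$value)
```

The shallow clone sees the change because it shares the same inner object as the original, while the deep clone keeps its own copy with the earlier value.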


wordpredictor documentation built on Oct. 8, 2024, 5:10 p.m.