ModelGenerator: Generates n-gram models from a text file

ModelGeneratorR Documentation

Generates n-gram models from a text file

Description

It provides a method for generating n-gram models. The n-gram models may be customized by specifying data cleaning and tokenization options.

Details

It provides a method that generates a n-gram model. The n-gram model may be customized by specifying the data cleaning and tokenization options.

The data cleaning options include removal of punctuation, stop words, extra space, non-dictionary words and bad words. The tokenization options include n-gram number and word stemming.

Super class

wordpredictor::Base -> ModelGenerator

Methods

Public methods


Method new()

It initializes the current object. It is used to set the maximum n-gram number, sample size, input file name, data cleaner options, tokenization options and verbose option.

Usage
ModelGenerator$new(
  name = NULL,
  desc = NULL,
  fn = NULL,
  df = NULL,
  n = 4,
  ssize = 0.3,
  dir = ".",
  dc_opts = list(),
  tg_opts = list(),
  ve = 0
)
Arguments
name

The model name.

desc

The model description.

fn

The model file name.

df

The path of the input text file. It should be the short file name and should be present in the data directory.

n

The n-gram size of the model.

ssize

The sample size as a proportion of the input file.

dir

The directory containing the input and output files.

dc_opts

The data cleaner options.

tg_opts

The token generator options.

ve

The level of detail in the information messages.


Method generate_model()

It generates the model using the parameters passed to the object's constructor. It generates a n-gram model file and saves it to the model directory.

Usage
ModelGenerator$generate_model()
Examples
# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("input.txt")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve, rp = "./")
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code

# ModelGenerator class object is created
mg <- ModelGenerator$new(
    name = "default-model",
    desc = "1 MB size and default options",
    fn = "def-model.RDS",
    df = "input.txt",
    n = 4,
    ssize = 0.99,
    dir = ed,
    dc_opts = list(),
    tg_opts = list(),
    ve = ve
)
# The n-gram model is generated
mg$generate_model()

# The test environment is removed. Comment the below line, so the
# files generated by the function can be viewed
em$td_env()

Method clone()

The objects of this class are cloneable with this method.

Usage
ModelGenerator$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

Examples


## ------------------------------------------------
## Method `ModelGenerator$generate_model`
## ------------------------------------------------

# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("input.txt")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve, rp = "./")
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code

# ModelGenerator class object is created
mg <- ModelGenerator$new(
    name = "default-model",
    desc = "1 MB size and default options",
    fn = "def-model.RDS",
    df = "input.txt",
    n = 4,
    ssize = 0.99,
    dir = ed,
    dc_opts = list(),
    tg_opts = list(),
    ve = ve
)
# The n-gram model is generated
mg$generate_model()

# The test environment is removed. Comment the below line, so the
# files generated by the function can be viewed
em$td_env()

wordpredictor documentation built on Oct. 8, 2024, 5:10 p.m.