ModelPredictor: Allows predicting text, calculating word probabilities and...

ModelPredictor R Documentation

Allows predicting text, calculating word probabilities and Perplexity

Description

It provides a method for predicting the next word given a set of previous words. It also provides a method for calculating the Perplexity score for a set of words. Furthermore, it provides a method for calculating the probability of a given word given a set of previous words.

Super class

wordpredictor::Base -> ModelPredictor

Methods

Public methods


Method new()

It initializes the current object. It is used to set the model file name and verbose options.

Usage
ModelPredictor$new(mf, ve = 0)
Arguments
mf

The model file name.

ve

The level of detail in the information messages.


Method get_model()

Returns the Model class object.

Usage
ModelPredictor$get_model()
Returns

The Model class object is returned.


Method calc_perplexity()

The Perplexity for the given sentence is calculated. For each word, the probability of the word given the previous words is calculated. The probabilities are multiplied and then inverted. The nth root of the result is the perplexity, where n is the number of words in the sentence. If the stem_words tokenization option was specified when creating the given model file, then the previous words are converted to their stems.
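The calculation described above can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: the probabilities in word_probs are made up, and the function names perplexity_direct and perplexity_log are hypothetical. The second function computes the same quantity in log space, which is a common way to avoid numeric underflow on long sentences.

```r
# Hypothetical conditional probabilities, one per word of a sentence.
# The values are made up for illustration and do not come from a model.
word_probs <- c(0.1, 0.2, 0.05, 0.4)

# Direct form of the description: multiply the probabilities, invert the
# product, then take the nth root, where n is the number of words
perplexity_direct <- function(probs) {
    n <- length(probs)
    (1 / prod(probs))^(1 / n)
}

# Equivalent form in log space: exp of the negative mean log probability
perplexity_log <- function(probs) {
    exp(-mean(log(probs)))
}

p1 <- perplexity_direct(word_probs)
p2 <- perplexity_log(word_probs)
print(c(direct = p1, log_space = p2))
```

A lower value means the sentence is less surprising under the probabilities used. For example, a sentence in which every word has probability 0.5 has a perplexity of 2.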

Usage
ModelPredictor$calc_perplexity(words)
Arguments
words

The list of words.

Returns

The perplexity of the given list of words.

Examples
# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("def-model.RDS")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve, rp = "./")
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code

# The model file name
mfn <- paste0(ed, "/def-model.RDS")
# ModelPredictor class object is created
mp <- ModelPredictor$new(mf = mfn, ve = ve)
# The sentence whose Perplexity is to be calculated
l <- "last year at this time i was preparing for a trip to rome"
# The line is split into words
w <- strsplit(l, " ")[[1]]
# The Perplexity of the sentence is calculated
p <- mp$calc_perplexity(w)
# The sentence Perplexity is printed
print(p)
# The test environment is removed. Comment the below line, so the
# files generated by the function can be viewed
em$td_env()

Method predict_word()

Predicts the next word given a list of previous words. It checks the last n previous words in the transition probabilities data, where n is equal to the n-gram size of the model minus 1. If there is a match, the next words with the highest probabilities are returned (the top 3 by default, as set by count). If there is no match, then the last n-1 previous words are checked. This process is continued until only the last word is checked. If there is still no match, then an empty result is returned. The given words may optionally be stemmed.
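The back-off lookup described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the transition table tp, its probabilities and the function predict_backoff are hypothetical, and plain strings are used as lookup keys for readability.

```r
# Hypothetical transition table for a 3-gram model. Each key is a context of
# previous words, each value is a named vector of next-word probabilities.
tp <- list(
    "today is" = c(a = 0.30, the = 0.25, my = 0.10, going = 0.05),
    "is" = c(not = 0.20, a = 0.15)
)

predict_backoff <- function(words, tp, ngram_size = 3, count = 3) {
    # Accept a single string such as "today is" as well as a word vector
    if (length(words) == 1) {
        words <- strsplit(words, " ", fixed = TRUE)[[1]]
    }
    # Start with at most (n-gram size - 1) previous words
    n <- min(length(words), ngram_size - 1)
    while (n >= 1) {
        key <- paste(tail(words, n), collapse = " ")
        if (!is.null(tp[[key]])) {
            # Match found: return the most probable next words
            res <- head(sort(tp[[key]], decreasing = TRUE), count)
            return(data.frame(
                word = names(res), prob = unname(res),
                stringsAsFactors = FALSE
            ))
        }
        # No match: retry with one fewer previous word
        n <- n - 1
    }
    # No context matched at all, so an empty result is returned
    data.frame(word = character(0), prob = numeric(0), stringsAsFactors = FALSE)
}

r1 <- predict_backoff("today is", tp)
r2 <- predict_backoff(c("tomorrow", "is"), tp)
r3 <- predict_backoff("hello there", tp)
print(r1)
```

Here r1 matches the full two-word context, r2 falls back to the single word "is", and r3 matches nothing and is empty.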

Usage
ModelPredictor$predict_word(words, count = 3, dc = NULL)
Arguments
words

A character vector of previous words or a single string containing the text of the previous words.

count

The number of results to return.

dc

A DataCleaner object. If it is given, then the given words

Returns

The top predicted words (3 by default, as set by count) along with their probabilities.

Examples
# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("def-model.RDS")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve, rp = "./")
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code

# The model file name
mfn <- paste0(ed, "/def-model.RDS")
# ModelPredictor class object is created
mp <- ModelPredictor$new(mf = mfn, ve = ve)
# The next word is predicted
nws <- mp$predict_word("today is", count = 10)
# The predicted next words are printed
print(nws)

# The test environment is removed. Comment the below line, so the
# files generated by the function can be viewed
em$td_env()

Method get_word_prob()

Calculates the probability of the given word given the previous words. The last n previous words are converted to a numeric hash using the digest2int function, where n is equal to the size of the n-gram model minus 1. All other words are ignored. The hash is looked up in a data frame of transition probabilities. The given word is converted to a number by checking its position in a list of unique words. If the hash and the word position are found, then the corresponding probability is returned. If they are not found, then the hash of the last n-1 previous words is taken and the process is repeated. This is known as back-off. If no match is found in the data frame, then the probability of the word on its own is returned. If the word probability cannot be found either, then the default probability is returned. The default probability is calculated as 1/(N+V), where N is the number of words in the corpus and V is the number of dictionary words.
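The back-off steps and the default probability can be sketched in base R as follows. This is a simplified illustration, not the package's implementation: the data frame tp, the vector word_probs, the counts N and V, and the function word_prob_backoff are all hypothetical, and plain strings stand in for the digest2int hashes and word positions so that the example needs no extra packages.

```r
# Hypothetical transition probabilities (context, next word, probability)
tp <- data.frame(
    pre = c("how are", "are", "are"),
    nw = c("you", "you", "we"),
    prob = c(0.6, 0.3, 0.1),
    stringsAsFactors = FALSE
)
# Hypothetical probabilities of single words
word_probs <- c(you = 0.01, we = 0.005)
# Hypothetical corpus size (N) and dictionary size (V)
N <- 1000
V <- 200
default_prob <- 1 / (N + V)

word_prob_backoff <- function(word, pw, ngram_size = 3) {
    # Use at most (n-gram size - 1) previous words, ignore the rest
    n <- min(length(pw), ngram_size - 1)
    while (n >= 1) {
        key <- paste(tail(pw, n), collapse = " ")
        hit <- tp$prob[tp$pre == key & tp$nw == word]
        if (length(hit) > 0) {
            return(hit[1])
        }
        # Back off to a shorter context
        n <- n - 1
    }
    # No context matched: fall back to the probability of the word itself
    if (word %in% names(word_probs)) {
        return(unname(word_probs[word]))
    }
    # Unknown word: default probability 1/(N+V)
    default_prob
}

pr1 <- word_prob_backoff("you", c("how", "are"))
pr2 <- word_prob_backoff("we", c("how", "are"))
pr3 <- word_prob_backoff("you", c("hello"))
pr4 <- word_prob_backoff("xyz", c("how", "are"))
print(c(pr1, pr2, pr3, pr4))
```

The four calls exercise, in order, a full-context match, a back-off to a shorter context, the fallback to the word's own probability, and the default probability.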

Usage
ModelPredictor$get_word_prob(word, pw)
Arguments
word

The word whose probability is to be calculated.

pw

The previous words.

Returns

The probability of the word given the previous words.

Examples
# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("def-model.RDS")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve, rp = "./")
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code

# The model file name
mfn <- paste0(ed, "/def-model.RDS")
# ModelPredictor class object is created
mp <- ModelPredictor$new(mf = mfn, ve = ve)
# The probability that the next word is "you" given the prev words
# "how" and "are"
prob <- mp$get_word_prob(word = "you", pw = c("how", "are"))
# The probability is printed
print(prob)

# The test environment is removed. Comment the below line, so the
# files generated by the function can be viewed
em$td_env()

Method clone()

The objects of this class are cloneable with this method.

Usage
ModelPredictor$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

wordpredictor documentation built on Oct. 8, 2024, 5:10 p.m.