ANLP is an R package that provides all the functionality needed to build a text prediction model.
Functionalities supported by the ANLP package: reading text data from files, sampling, cleaning the text corpus, building N-gram models, and predicting the next word with the backoff algorithm.
This function reads text data from a file in the specified encoding.
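A minimal usage sketch, assuming the reader function is ANLP's readTextFile; the file name here is a placeholder for your own text file of tweets:

```r
library(ANLP)

# Hypothetical path; replace with your own plain-text file of tweets.
# Reads the file in the given encoding and returns a character vector,
# one element per line of text.
twitter.data <- readTextFile("en_US.twitter.txt", "UTF-8")
```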
library(ANLP)
print(length(twitter.data))
There are more than 100k tweets in the dataset. We will sample 10% of them, roughly 10k tweets, to build our model. To do this, we use the sampleTextData function as follows:
train.data <- sampleTextData(twitter.data, 0.1)
print(length(train.data))
head(train.data)
Now we have about 10k tweets, but the data is still quite noisy: it contains punctuation, abbreviations, and contractions.
train.data.cleaned <- cleanTextData(train.data)
train.data.cleaned[[1]]$content[1:5]
As we can see, the text is now cleaned and looks good :)
The next step is to build N-gram models from our cleaned corpus. We will build 1-, 2-, and 3-gram models and generate a term frequency matrix for each.
unigramModel <- generateTDM(train.data.cleaned, 1)
head(unigramModel)

bigramModel <- generateTDM(train.data.cleaned, 2)
head(bigramModel)

trigramModel <- generateTDM(train.data.cleaned, 3)
head(trigramModel)
Good work :) Now that we have all three models, let's predict.
The predict_Backoff function accepts a list of all the N-gram models, so let's merge them into a single list.
Note: Remember to merge the N-gram models in descending order (3-, 2-, then 1-gram).
nGramModelsList <- list(trigramModel, bigramModel, unigramModel)
Let's predict some strings:
testString <- "I am the one who"
predict_Backoff(testString, nGramModelsList)

testString <- "what is my"
predict_Backoff(testString, nGramModelsList)

testString <- "the best movie"
predict_Backoff(testString, nGramModelsList)
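For intuition, the backoff idea behind predict_Backoff can be sketched as follows. This is a simplified illustration, not ANLP's actual implementation: the helper name and the toy frequency tables are made up, and the real package scores candidates from the term frequency matrices built above.

```r
# Toy N-gram frequency tables: names are N-grams, values are counts (hypothetical data).
unigrams <- c("the" = 50, "knows" = 7)
bigrams  <- c("who knows" = 5, "who said" = 2)
trigrams <- c("one who knows" = 3, "one who cares" = 1)

# Try the highest-order model first; if no N-gram matches the last words of the
# input, back off to the next lower-order model.
predictBackoffSketch <- function(input, models) {
  words <- strsplit(tolower(input), "\\s+")[[1]]
  for (n in seq(length(models), 1)) {        # models[[n]] is the n-gram table
    if (length(words) >= n - 1) {
      context <- paste(tail(words, n - 1), collapse = " ")
      prefix  <- if (n == 1) "" else paste0("^", context, " ")
      matches <- models[[n]][grepl(prefix, names(models[[n]]))]
      if (length(matches) > 0) {
        best <- names(which.max(matches))
        # Predict the last word of the most frequent matching n-gram.
        return(tail(strsplit(best, " ")[[1]], 1))
      }
    }
  }
  NA
}

predictBackoffSketch("I am the one who", list(unigrams, bigrams, trigrams))
```

Here the trigram model already contains "one who knows" with the highest count, so the sketch never needs to back off; with an unseen context it would fall through to the bigram and then the unigram table.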
Enjoy, and feel free to send feedback to achalshah20@gmail.com.