ANLP is an R package that provides all the functionality needed to build a text prediction model.
Functionalities supported by the ANLP package: reading text data from files, sampling, cleaning the text corpus, building N-gram models, and predicting the next word with the backoff algorithm.
This function reads text data from a file in the specified encoding.
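A minimal usage sketch, assuming the reader function is ANLP's readTextFile; the file name here is a placeholder for your own text file of tweets:

```r
library(ANLP)

# Hypothetical path; replace with your own plain-text file of tweets.
# Reads the file in the given encoding and returns a character vector,
# one element per line of text.
twitter.data <- readTextFile("en_US.twitter.txt", "UTF-8")
```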
library(ANLP)
print(length(twitter.data))
There are more than 100k tweets in the dataset. We will sample 10% of them, roughly 10k tweets, to build our model. To do this, we use the sampleTextData function as follows:
train.data <- sampleTextData(twitter.data, 0.1)
print(length(train.data))
head(train.data)
Now we have about 10k tweets, but the data is still quite noisy: it contains punctuation, abbreviations, and contractions.
train.data.cleaned <- cleanTextData(train.data)
train.data.cleaned[[1]]$content[1:5]
As we can see, the text is now cleaned and looks good :)
The next step is to build N-gram models from our cleaned corpus. We will build 1-, 2-, and 3-gram models and generate a term frequency matrix for each.
unigramModel <- generateTDM(train.data.cleaned, 1)
head(unigramModel)

bigramModel <- generateTDM(train.data.cleaned, 2)
head(bigramModel)

trigramModel <- generateTDM(train.data.cleaned, 3)
head(trigramModel)
Good work :) Now that we have all three models, let's predict.
The predict_Backoff function accepts a list of all the N-gram models, so let's merge them into a single list.
Note: Remember to merge the N-gram models in descending order (3-, 2-, then 1-gram).
nGramModelsList <- list(trigramModel, bigramModel, unigramModel)
Let's predict some strings:
testString <- "I am the one who"
predict_Backoff(testString, nGramModelsList)

testString <- "what is my"
predict_Backoff(testString, nGramModelsList)

testString <- "the best movie"
predict_Backoff(testString, nGramModelsList)
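For intuition, the backoff idea behind predict_Backoff can be sketched as follows. This is a simplified illustration, not ANLP's actual implementation: the helper name and the toy frequency tables are made up, and the real package scores candidates from the term frequency matrices built above.

```r
# Toy N-gram frequency tables: names are N-grams, values are counts (hypothetical data).
unigrams <- c("the" = 50, "knows" = 7)
bigrams  <- c("who knows" = 5, "who said" = 2)
trigrams <- c("one who knows" = 3, "one who cares" = 1)

# Try the highest-order model first; if no N-gram matches the last words of the
# input, back off to the next lower-order model.
predictBackoffSketch <- function(input, models) {
  words <- strsplit(tolower(input), "\\s+")[[1]]
  for (n in seq(length(models), 1)) {        # models[[n]] is the n-gram table
    if (length(words) >= n - 1) {
      context <- paste(tail(words, n - 1), collapse = " ")
      prefix  <- if (n == 1) "" else paste0("^", context, " ")
      matches <- models[[n]][grepl(prefix, names(models[[n]]))]
      if (length(matches) > 0) {
        best <- names(which.max(matches))
        # Predict the last word of the most frequent matching n-gram.
        return(tail(strsplit(best, " ")[[1]], 1))
      }
    }
  }
  NA
}

predictBackoffSketch("I am the one who", list(unigrams, bigrams, trigrams))
```

Here the trigram model already contains "one who knows" with the highest count, so the sketch never needs to back off; with an unseen context it would fall through to the bigram and then the unigram table.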
Enjoy, and feel free to send feedback to achalshah20@gmail.com.