ANLP Package

ANLP is a package that provides all the functionality needed to build a text prediction model.

Functions

Functionality supported by the ANLP package:

readTextFile

This function reads text data from a file in the specified encoding.
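For example, you could read your own corpus from disk like this (the file name below is hypothetical; the arguments are the file path and the text encoding):

my.data <- readTextFile("en_US.twitter.txt", "UTF-8")  # hypothetical file path

Here, though, we will use the twitter.data sample corpus bundled with the package: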

library(ANLP)
print(length(twitter.data))

There are more than 100k tweets in the dataset. To start, we will sample about 10k of them to build our model.

sampleTextData

We need to sample 10% of the data, so we will use the sampleTextData function as follows:

train.data <- sampleTextData(twitter.data, 0.1)
print(length(train.data))
head(train.data)
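Since the sample is drawn at random, each run can return a different subset. If you want reproducible results, set a seed first (standard base R; this assumes sampleTextData samples randomly):

set.seed(42)  # fix the random seed so the same 10% sample is drawn every run
train.data <- sampleTextData(twitter.data, 0.1)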

Now we have about 10k tweets, but the data is clearly noisy: it is full of punctuation, abbreviations, and contractions.

cleanTextData

This function cleans the text corpus so it is ready for N-gram modeling:

train.data.cleaned <- cleanTextData(train.data)
train.data.cleaned[[1]]$content[1:5]

As we can see, the text is now cleaned and looks good :)
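For reference, cleaning of this kind typically looks like the tm pipeline below. This is a sketch of comparable steps (lowercasing, stripping punctuation, numbers, and extra whitespace), not necessarily the exact transformations cleanTextData applies:

library(tm)
corpus <- VCorpus(VectorSource(train.data))            # wrap the raw character vector in a corpus
corpus <- tm_map(corpus, content_transformer(tolower)) # lowercase everything
corpus <- tm_map(corpus, removePunctuation)            # strip punctuation
corpus <- tm_map(corpus, removeNumbers)                # strip digits
corpus <- tm_map(corpus, stripWhitespace)              # collapse repeated whitespace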

The next step is to build N-gram models from our cleaned corpus.
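An N-gram is just a run of N consecutive words, so for "I am the one" the bigrams are "I am", "am the", and "the one". A quick base-R illustration of the idea (a toy tokenizer, not what generateTDM does internally):

# Toy N-gram extraction from a single sentence, using base R only
tokens <- strsplit("I am the one", " ")[[1]]
ngrams <- function(tokens, n) {
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}
ngrams(tokens, 2)  # "I am"  "am the"  "the one"
ngrams(tokens, 3)  # "I am the"  "am the one"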

generateTDM

We will build 1-, 2-, and 3-gram models and generate a term frequency matrix for each.

unigramModel <- generateTDM(train.data.cleaned, 1)
head(unigramModel)
bigramModel <- generateTDM(train.data.cleaned, 2)
head(bigramModel)
trigramModel <- generateTDM(train.data.cleaned, 3)
head(trigramModel)

Good work :) Now that we have all 3 models, let's predict.

predict_Backoff

This function accepts a list of all the N-gram models, so let's merge them into a single list.
Note: Remember to merge the N-gram models in descending order (3-, 2-, 1-gram).

nGramModelsList <- list(trigramModel, bigramModel, unigramModel)
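Under the hood, a backoff predictor tries the highest-order model first and falls back to lower orders when the context has no match. The sketch below illustrates that idea with toy count tables; the table layout and names here are hypothetical and do not reflect the internal representation used by generateTDM or predict_Backoff:

# Illustrative backoff over toy N-gram count tables (hypothetical layout).
# Each table maps an N-gram string to how often it was seen.
toy.models <- list(
  c("i am the" = 5, "am the one" = 3),              # trigrams
  c("the one" = 7, "one who" = 2, "the best" = 4),  # bigrams
  c("the" = 50, "one" = 12, "movie" = 6)            # unigrams
)

toy_backoff <- function(input, models) {
  tokens <- strsplit(tolower(input), " ")[[1]]
  for (n in c(3, 2, 1)) {
    grams <- models[[4 - n]]
    # Match on the last n-1 words of the input; unigrams match anything
    prefix <- if (n > 1) paste0("^", paste(tail(tokens, n - 1), collapse = " "), " ") else "^"
    hits <- grams[grepl(prefix, names(grams))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(tail(strsplit(best, " ")[[1]], 1))  # predict the N-gram's last word
    }
  }
  NA_character_
}

toy_backoff("I am", toy.models)  # finds trigram "i am the" and predicts "the"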

Let's predict some strings:

testString <- "I am the one who"
predict_Backoff(testString, nGramModelsList)

testString <- "what is my"
predict_Backoff(testString, nGramModelsList)

testString <- "the best movie"
predict_Backoff(testString, nGramModelsList)

Enjoy, and feel free to send feedback to achalshah20@gmail.com.


