knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
SentiAnalyzer is a straightforward solution for analyzing consumer reviews. Consumer reviews typically consist of the text that customers write after visiting a place or using a service. Through these reviews, consumers give the manager useful information that may affect the future of the company, and many potential consumers read them to decide whether to use the service or not. It is therefore necessary to know what is inside the review data and whether the general view of the service provided is positive or negative.

Nowadays, machine learning (ML) algorithms can find useful patterns in data and make predictions on new data based on what they learned from previous data. However, processing review data is not easy, since it consists of natural language, and it must be preprocessed to be consistent with ML algorithms, which work well on numeric and categorical data. The dataset that SentiAnalyzer can work with, for now, is a short review text together with a binary class indicating whether the consumer review is positive or negative.

First, we use natural language processing (NLP) techniques to preprocess the text of the reviews and prepare it for the ML algorithms. Next, the ML algorithms train on the data and build models for predicting the sentiment of new reviews. We also compare the efficiency of the different ML algorithms and let the user know which algorithm works best with her/his data. The SentiAnalyzer package also provides interactive plots through Shiny apps, such as a high-term-frequency plot, a word cloud of positive and negative words, different confusion matrix plots such as a heatmap, and a parameter-tuning plot for the ML algorithms.
install.packages("SentiAnalyzer")
The package's workflow consists of the following steps:

1. Balancing the dataset (BalanceData() function)
2. Cleaning, tokenizing and building the term matrix of the text (CleanText() function)
3. Visualizing the clean term matrix (VisualizeData() function)
4. Training different classification algorithms (e.g., SVM, NB, RF, KNN, GBM) and choosing the best parameters for each one to get the highest possible classification accuracy for a specific dataset (BuildTraining() function)
5. Choosing the best trained classification algorithm for the specific dataset according to different measures (e.g., F-score, Recall, Precision, Accuracy) (comparison() function)
6. Predicting on the new data (BuildPrediction() function)
7. Visualizing the output of the confusion matrix, that is, the accuracy of the training model in predicting the sentiment of the consumer review
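Putting the steps together, here is a condensed sketch of the whole pipeline, using the calls demonstrated in the rest of this vignette (the input data frames are loaded in the sections below):

library(SentiAnalyzer)

balanced <- BalanceData(imbalance_data)  # balance the positive/negative classes
cleaned <- CleanText(balanced, dtm_method = 1, reductionrate = 0.99)  # build the term matrix
VisualizeData(balanced, 15)  # word cloud and term-frequency plots
trained <- BuildTraining(cleaned)  # train and tune the five classifiers
predicted <- BuildPrediction(trained)  # confusion matrices per classifier
comparison(predicted)  # compare F-score, Recall, Precision and Accuracy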
The first step in processing your data is to make sure that there are equal numbers of review rows on the positive and the negative side. SentiAnalyzer checks this with BalanceData(). This function can be used to balance the given dataset; we use the package ROSE to balance the data based on the class (dependent) variable. If the data is already balanced, we simply message the user that there is no need to balance the data and return the original dataset; otherwise the balanced dataset is returned to the user.
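To illustrate the underlying idea (a minimal sketch, not the package's exact implementation), ROSE's ovun.sample() can oversample the minority class until the two classes are even; the balance_sketch() helper below is hypothetical:

library(ROSE)

# hypothetical sketch of the balancing step; BalanceData() may differ in details
balance_sketch <- function(dataset) {
  class_col <- colnames(dataset)[ncol(dataset)]  # assume the class is the last column
  counts <- table(dataset[[class_col]])
  if (length(unique(counts)) == 1) {
    message("Data is already balanced; returning the original dataset.")
    return(dataset)
  }
  f <- as.formula(paste(class_col, "~ ."))
  # oversample the minority class; ROSE's default p = 0.5 evens out the classes
  ROSE::ovun.sample(f, data = dataset, method = "over")$data
}

The packaged BalanceData() is demonstrated below on an imbalanced example dataset: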
imb <- system.file(package = "SentiAnalyzer", "extdata/Imbalance_Restaurant_Reviews.tsv")
imbalance_data <- read.delim(imb, quote = '', stringsAsFactors = FALSE)
head(imbalance_data)
SentiAnalyzer::BalanceData(imbalance_data)
Input the balanced dataset and check how the function responds:
direction <- system.file(package = "SentiAnalyzer", "extdata/Restaurant_Reviews.tsv")
balanced_data <- read.delim(direction, quote = '', stringsAsFactors = FALSE)
dim(balanced_data)
str(balanced_data)
The CleanText() function is used to preprocess the text and build a term matrix, which will later be the input to the ML algorithms. Its input and output are as follows:

input: CleanText() takes 3 arguments:
a. the dataset
b. the document-term matrix structure of choice (choose 1 of 3)
c. the reduction rate (range 0-1)

output: CleanText() returns a term matrix, clean_dataset, saved as data/clean_dataset.rda
We used the packages tm and SnowballC in CleanText() to tokenize and clean the text and to build the term matrix. Integrating built-in functions from tm, CleanText() "cleans up" the words of the text, mining for the words that convey sentiment and converting some formatting; what remains is called a token (a single unit) or, collectively, the corpus (all the tokens). The steps are:
a. convert all text to lower case
b. remove numbers from the text
c. remove punctuation
d. remove stop words, e.g. "the", "a", "for", "and", etc.
e. extract the stems of the given words using Porter's stemming algorithm
f. remove the extra white space left behind by the removed text
Let's go through the process step by step:
library(tm)
corpus <- VCorpus(VectorSource(balanced_data$Review))
corpus <- tm_map(corpus, content_transformer(tolower))  # convert all reviews to lower case
corpus <- tm_map(corpus, removeNumbers)                 # remove numbers from reviews
corpus <- tm_map(corpus, removePunctuation)             # remove punctuation from reviews
corpus <- tm_map(corpus, removeWords, stopwords())      # remove stop words from reviews
corpus <- tm_map(corpus, stemDocument)                  # stemming
corpus <- tm_map(corpus, stripWhitespace)               # remove extra spaces created by the cleaning steps above
corpus$content[[1]][[1]]
corpus$content[[2]][[1]]
corpus$content[[3]][[1]]
corpus$content[[4]][[1]]
corpus$content[[5]][[1]]
corpus$content[[6]][[1]]
Next, still within the same function, CleanText(), the corpus is formatted into a document-term matrix (DTM) of the words in the reviews. Essentially this creates one column for every token in the corpus, with the frequency of occurrence of each token counted in each row (one row per review).
The user can also choose among the document-term matrix structures (this argument is likewise passed to CleanText()).
The document-term matrix expands the dimensions of the dataset quickly and significantly, especially with sparse terms. Depending on the dataset, the dimension can be adjusted through the reduction rate, within the range 0 to 1; essentially this calls tm::removeSparseTerms().
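As a rough sketch of what CleanText() wraps (building the DTM and trimming sparse terms), assuming the corpus built above:

dtm <- DocumentTermMatrix(corpus)  # one column per token, one row per review
dtm <- removeSparseTerms(dtm, 0.99)  # drop terms absent from more than 99% of reviews
clean_df <- as.data.frame(as.matrix(dtm))  # a data frame the ML algorithms can consume
dim(clean_df)

The packaged call bundles all of these steps: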
clean_dataset <- SentiAnalyzer::CleanText(balanced_data, dtm_method = 1, reductionrate = 0.99)
Inspect the result:
dim(clean_dataset)
names(clean_dataset)[1:50]
head(clean_dataset[1:10])
tail(clean_dataset[1:10])
Before we start any further major processing of the data, we can quickly get an insight into it through a word cloud visualization. The term count argument filters for the terms repeated most often in the reviews, usually those with a count > 10.
Check out the Shiny version of this function.
input: a balanced dataset of review text and binary sentiment, and a baseline for the term count
output: a word cloud plot and a bar plot of word frequencies
# balanced_data <- read.delim(direction, quote = '', stringsAsFactors = FALSE)
# SentiAnalyzer::VisualizeData(balanced_data, 15)
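Since the packaged example above is commented out, here is a hand-rolled sketch of the same idea; it assumes the wordcloud and RColorBrewer packages and that the last column of clean_dataset holds the class:

library(wordcloud)

freq <- colSums(clean_dataset[, -ncol(clean_dataset)])  # total count per term
top <- sort(freq[freq > 10], decreasing = TRUE)  # keep terms repeated more than 10 times
wordcloud(names(top), top, colors = RColorBrewer::brewer.pal(8, "Dark2"))
barplot(top[1:20], las = 2, main = "Most frequent terms")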
The BuildTraining() function trains different classification algorithms and chooses the best parameters for each one of the available algorithms to get the highest possible classification accuracy for a specific dataset. The algorithms trained in the package are:
a. Gradient-boosting machine (GBM)
b. k-Nearest Neighbors (KNN)
c. Naive Bayes (NB)
d. Random Forest (RF)
e. svm_Poly
(Warning: this function takes a long time to run, so we show a reduced version of it.)
SentiAnalyzer::BuildTraining(clean_dataset)
input: a document-term matrix (the result of CleanText(), i.e., clean_dataset)
output: a list of trained models with the best parameters from 5 machine learning algorithms (GBM, KNN, NB, RF, SVM_Poly)
Using cross-validation for a valid testing accuracy

This function uses the caret package and its trainControl(), with the whole dataset used for training; we use 10-fold cross-validation to get a valid estimate of the testing accuracy.
The output list contains the trained models (GBM, kNN, Naive Bayes, Random Forest, svm_Poly). The performance of a trained model can be viewed/visualized by indexing into the list and calling plot(), an embedded function from caret.
For example, to plot the GBM model:
data(package = "SentiAnalyzer", trained_models)
plot(trained_models[[2]])
The model training consists of expand.grid() and train() (taking the formula, data, method, train control, tuneGrid, etc.).
We train KNN with the best parameters using caret's train() function: the KNN classifier is trained for k = 1:80 nearest neighbors, and train() selects the parameter with the highest accuracy.
# csv_data <- read.csv(system.file(package = "SentiAnalyzer", "extdata/testing1.csv"))
data(package = "SentiAnalyzer", testing1)

# the response is the last column; build the model formula from its name
res_name <- colnames(testing1)[ncol(testing1)]
formula <- as.formula(paste(res_name, ' ~ .'))

# recode the binary response to a "Yes"/"No" factor (caret needs valid level names)
if (!is.factor(testing1[, ncol(testing1)])) {
  testing1[, ncol(testing1)] <- ifelse(testing1[, ncol(testing1)] == 1, "Yes", "No")
  testing1[, ncol(testing1)] <- factor(testing1[, ncol(testing1)], levels = c("Yes", "No"))
} else {
  testing1[, ncol(testing1)] <- ifelse(testing1[, ncol(testing1)] == 1, "Yes", "No")
}

# train control: 10-fold cross-validation, saving predictions and class probabilities
ctrl_cv10 <- caret::trainControl(
  method = "cv",
  number = 10,
  savePredictions = TRUE,
  classProbs = TRUE
)

# tuning grid: k = 1, ..., 80 nearest neighbors
treeGrid_knn <- expand.grid(k = c(1:80))

###### kNN with cross-validation ############
model_knn_10 <- caret::train(
  formula,
  data = testing1,
  method = "knn",
  trControl = ctrl_cv10,
  tuneGrid = treeGrid_knn
)
Visualization of the KNN training:
plot(model_knn_10)
As shown in the plot above, training explores up to 80 neighbours for the KNN algorithm and selects the best one for the trained KNN model. We can see that for small values of k there is overfitting, while from about k = 50 upwards the model simply predicts the majority class. There needs to be a trade-off between bias and variance.
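The selected k and the cross-validated accuracy of each candidate can also be inspected directly on the caret object:

model_knn_10$bestTune  # the k chosen by the grid search
head(model_knn_10$results)  # cross-validated accuracy and kappa for each k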
Another example: the output of the trained Gradient Boosting model.
data(package = "SentiAnalyzer", trained_models)
plot(trained_models[[2]])
As shown, the search covers tree depth and the number of boosting iterations; for this specific dataset, a tree depth of less than 2 with 50 boosting iterations does better than the other settings.
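For reference, a grid of that shape could be expressed with caret's tuning parameters for method = "gbm" as below; this is a sketch, and the exact grid used by BuildTraining() may differ:

treeGrid_gbm <- expand.grid(
  interaction.depth = c(1, 2, 3),  # tree depth
  n.trees = c(50, 100, 150),  # boosting iterations
  shrinkage = 0.1,  # learning rate
  n.minobsinnode = 10  # minimum observations per terminal node
)
# model_gbm_10 <- caret::train(formula, data = testing1, method = "gbm",
#                              trControl = ctrl_cv10, tuneGrid = treeGrid_gbm)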
BuildPrediction() classifies the target value of the input data using the different trained models (the output of the BuildTraining() function) and returns a list of confusion matrices; these matrices can be used for any purpose. Using the comparison() function, we intend to extract different measurements (F1, Recall, Precision, Accuracy) from them. The user can also pass a term-vector matrix as input, and the function, depending on the type of the input, automatically runs BuildTraining() if needed. The goal here is to have the confusion matrix information as a list.
input: a document-term matrix (e.g., the result of the CleanText() function, cleaned_dataset) or a list of trained models, i.e., the output of BuildTraining() (e.g., the trained_models R object saved in the package data)
output: a list of confusion matrices from the 5 machine learning algorithms (GBM, KNN, NB, RF, SVM_Poly)
The example below uses the BuildPrediction() function with our example trained object (the list of trained models, trained_models). As mentioned earlier, the input can also be a cleaned document-term matrix (e.g., the output of the CleanText() function).
data(package = "SentiAnalyzer", trained_models)
df_predicted <- BuildPrediction(trained_models)
As mentioned earlier, the input can also be a cleaned document-term matrix (e.g., the output of the CleanText() function, cleaned_dataset):
BuildPrediction(cleaned_dataset)
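Assuming each element of the returned list is a caret::confusionMatrix() result (as the description above suggests), the individual matrices can be inspected directly:

df_predicted[[1]]$table  # the raw confusion matrix of the first model
df_predicted[[1]]$byClass  # sensitivity, specificity, precision, recall, F1, ...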
This function extracts F1, Recall, Precision and Accuracy from the list of confusion matrices and returns a dataframe of those measurements.
input: depending on the type of the input, the function runs whatever other functions are needed. The input can be a document-term matrix (e.g., the result of the CleanText() function, cleaned_dataset) or a list (e.g., the output of the BuildPrediction() function)
output: a dataframe of the four measurements (F1, Recall, Precision, Accuracy) for the 5 machine learning algorithms
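For reference, these measurements are derived from the confusion-matrix counts in the standard way (TP, FP, FN, TN denote true/false positives/negatives):

precision <- function(TP, FP) TP / (TP + FP)  # how many predicted positives are real
recall <- function(TP, FN) TP / (TP + FN)  # how many real positives are found
f1 <- function(p, r) 2 * p * r / (p + r)  # harmonic mean of precision and recall
accuracy <- function(TP, TN, FP, FN) (TP + TN) / (TP + TN + FP + FN)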
The example below uses the comparison() function with our example R object, the list of confusion matrices df_predicted.
comparison(df_predicted)
As mentioned earlier, the input can also be a cleaned document-term matrix (e.g., the output of the CleanText() function, cleaned_dataset):
comparison(cleaned_dataset)
Example confusion-matrix measurements for the different algorithms on our example dataset cleaned_dataset:
data(package = "SentiAnalyzer", trained_models)
df_predicted <- SentiAnalyzer::BuildPrediction(trained_models)
Cmx <- SentiAnalyzer::comparison(df_predicted)
Cmx
As shown in the table above, for this dataset KNN does better than the others with respect to accuracy, while Random Forest does better in terms of the F1 measure. As we can see, referring to a single measurement to summarize the confusion matrix is not suitable: selecting the best method cannot be based on classification accuracy alone.
What now? How about visualizing these numbers to aid our interpretation when comparing the measurements?
For a more interactive version of the functions, access our Shiny apps.
Here is an example of a plotly heat map and table of the confusion-matrix measurements.
input: the confusion-matrix measurements from comparison()
output: an interactive plotly heat map and table
Cmx_4param <- Cmx[1:4]
Cmx_4param <- as.data.frame(Cmx_4param)
# colnames(Cmx_4param) <- c("accuracy", "precision", "recall", "f1-score")
rownames(Cmx_4param) <- c("Gradient-Boosting", "k-Nearest Neighbors", "Naive Bayes",
                          "Random Forest", "SVM")

# heatmap of the 4 confusion-matrix measurements for the 5 trained ML models
plotly::plot_ly(
  z = data.matrix(Cmx_4param),
  x = colnames(Cmx_4param),
  y = rownames(Cmx_4param),
  type = "heatmap",
  textfont = list(size = 20, color = "black")
)

# # plotly table of the 4 confusion-matrix measurements for the 5 trained ML models
# plotly::plot_ly(
#   type = 'table',
#   header = list(
#     values = c('<b>Confusion Matrix</b>', rownames(Cmx_4param)),
#     line = list(color = '#506784'),
#     fill = list(color = '#119DFF'),
#     align = c('left', 'center'),
#     font = list(color = 'white', size = 12)
#   ),
#   cells = list(
#     values = rbind(colnames(Cmx_4param), data.matrix(Cmx_4param)),
#     line = list(color = '#506784'),
#     fill = list(color = c('#25FEFD', 'white')),
#     align = c('left', 'left'),
#     font = list(color = 'black', size = 12)
#   )
# )
Deep learning algorithms can also be used to classify textual content. We used bag-of-words techniques, but more sophisticated NLP techniques, such as word embeddings, could be incorporated.