knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
op = options(width = 80, str = strOptions(strict.width = "cut"))
This vignette uses text2vec and Stonybrook's novels dataset to create a classification and regression model to predict the likelihood of success for novels and other literary texts. The dataset and various approaches to creating a classifier are discussed in the original article "Success with Style" [@ashok_success_2013].
The authors of the original study chose to publish the raw data used for their dataset but did not publish the source code to their classifier. In the original study, the author's built a classifier using LibLinear SVM. They used a variety of approaches to transform text into "features" suitable for training a classifier. In this example, we make use of text2vec which, as its name implies, provides the tools to transform text into feature vectors.
library(dplyr) library(novels) library(data.table) library(text2vec) library(glmnet) library(tools) library(knitr) library(kableExtra)
The novels dataset consists of r dim(novels)[1]
entries. Some of the entries refer
to the same text file, e.g. in the case where a literary work belongs in more than one category (fiction and short story). For training, it's important that we only train on a given text once. In the entire dataset there are r sum(novels %>% select(file.name) %>% duplicated)
duplicates.
Also, author's may have multiple works in the dataset. While we have a total of r dim(novels)[1]
entries in the dataset, there are fewer authors r length(unique(novels$author))
.
# Make response value an integer for AUC computation. Test to prevent # confusion when rerunning code chunks. if(class(novels$response) == 'factor') { stopifnot(novels$response == 'success' || novels$response == 'failure') novels$response <- as.integer(novels$response == 'success') } novels = copy(novels::novels) setDT(novels) set.seed(2019L) all_ids = unique(novels$author) indices <- createSplit(novels) train = novels[indices$train] test = novels[indices$test]
Create a text2vec iterator, vocabulary, vectorizer and document text matrix (dtm). The dtm represents the novels data set as trainable features.
it <- novels::getIterator(train$text, train$title) vocab <- novels::text2vocab(it, length(train$text)) vectorizer = vocab_vectorizer(vocab) dtm = create_dtm(it, vectorizer)
Transform the dtm to an idf to normalize the data and make it machine learning friendly.
# define tfidf model tfidf = TfIdf$new() # fit model to train data and transform train data with fitted model dtm_tfidf = fit_transform(dtm, tfidf)
rm(dtm)
Build a classifier using a generalized linear model (glm) with cross-validation (cv).
glmnet_classifier = cv.glmnet(x = dtm_tfidf, y = train$response, family = 'binomial', alpha = 0.1, type.measure = "auc", nfolds = 5, thresh = 1e-5, maxit = 1e4)
Area under the curve (AUC) is a time-honored approach to evaluating the effectiveness of a model. Here we use the glmnet package to plot the AUC for different values of the control variable "lambda".
plot(glmnet_classifier)
To understand the quality of our classifier, we want the model's best AUC result for both training and test data sets. We want AUC scores as close to 1 as possible, ideally over .9.
Although the test set rarely exceeds the training set results, we want a test set AUC result that is close to the training set AUC.
classifier <- createPredictor(vectorizer, tfidf, glmnet_classifier) test_results <- classifier(test) test_auc <- glmnet:::auc(test$response, test_results$prediction) train_auc <- max(glmnet_classifier$cvm)
The classifier produces an AUC of r sprintf("%02.2f", train_auc * 100)
on training data and an AUC of r sprintf("%02.2f", test_auc * 100)
on test data.
summarizeScores <- function(results) { results %>% mutate( author = toTitleCase(as.character(author)), title = toTitleCase(as.character(title)), prediction = novels::results2Text(prediction), response = novels::results2Text(response) ) %>% select(author, title, score, download.count, prediction, response) %>% arrange(desc(score)) } prettyPrintScores <- function(results) { kable(summarizeScores(results) %>% head(n = 100)) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) }
Let's look at the relative scores for examples of fiction in the test set just to get a sense for how it performs.
prettyPrintScores(test_results %>% filter(genre == 'fiction'))
evaluatePerformance <- function(results) { results %>% createConfusionMatrix %>% addInformationRetrievalMetrics }
This table shows how the classifier performs for a given genre.
genres <- levels(novels$genre) resultsByGenre <- cbind(data.frame(Genre = c('all')), evaluatePerformance(test_results)) for(g in genres) { resultsByGenre <- rbind( resultsByGenre, cbind( data.frame(Genre = c(g)), evaluatePerformance(test_results %>% filter(genre == g)) ) ) } resultsByGenre %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
This table shows how the GLMNET / text2vec classifier stacks up to the results reported in the "Success with Style" article. The GLMNET column shows the performance of our classifier. The best per genre column shows the best result obtained by each of Sunnybrook's 15 models in each genre. The best overall column shows the performance of the best single classifier over all the genres.
In three genres, viz. fiction, poetry and short-stories, the GLMNET / text2vec classify has noticably poorer results.
# c('all', 'adventure', 'fiction', 'historical.fiction', 'love.story', 'mystery', 'poetry', 'science.fiction', 'short.stories') sunnyBrookAccuracybyGenre <- data.frame( Genre = resultsByGenre$Genre, BestPerGenre = c(73.5, 84.0, 75.0, 61.0, 82.0, 75.0, 74.0, 76.0, 77.0), BestOverall = c(73.5, 75.0, 75.0, 58.0, 81.0, 75.0, 72.0, 76.0, 77.0) ) resultsByGenre %>% inner_join(sunnyBrookAccuracybyGenre, by = 'Genre') %>% mutate(GLMNET = Accuracy) %>% select(Genre, GLMNET, BestPerGenre, BestOverall) %>% kable %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.