knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
op = options(width = 80, str = strOptions(strict.width = "cut"))

Introduction

This vignette uses text2vec and Stony Brook's novels dataset to build a classification model that predicts the likelihood of success for novels and other literary texts. The dataset and several approaches to building such a classifier are discussed in the original article "Success with Style" [@ashok_success_2013].

The authors of the original study published the raw data behind their dataset but did not publish the source code for their classifier. In the original study, the authors built a classifier using LibLinear SVMs and used a variety of approaches to transform text into "features" suitable for training. In this example, we use text2vec which, as its name implies, provides the tools to transform text into feature vectors.

library(dplyr)
library(novels)
library(data.table)
library(text2vec)
library(glmnet)
library(tools)
library(knitr)
library(kableExtra)

The "novels" dataset

The novels dataset consists of `r nrow(novels)` entries. Some of the entries refer to the same text file, e.g. when a literary work belongs to more than one category (such as fiction and short story). For training, it is important that we train on a given text only once. Across the entire dataset there are `r sum(duplicated(novels$file.name))` duplicates.

Also, authors may have multiple works in the dataset: while there are `r nrow(novels)` entries in total, there are only `r length(unique(novels$author))` distinct authors.
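
For illustration only, here is one way to drop those duplicates before training, assuming file.name reliably identifies the underlying text (the splitting code below may already account for this):

# Keep only the first entry for each distinct text file.
deduped <- novels %>% distinct(file.name, .keep_all = TRUE)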

Create training and test sets

# Work from a fresh copy of the packaged dataset so reruns of this
# chunk always start from a known state.
novels = copy(novels::novels)
setDT(novels)

# Make the response an integer (1 = success, 0 = failure) for AUC
# computation. The is.factor() test prevents confusion when rerunning
# code chunks.
if (is.factor(novels$response)) {
  stopifnot(all(novels$response %in% c('success', 'failure')))
  novels$response <- as.integer(novels$response == 'success')
}

set.seed(2019L)
indices <- createSplit(novels)
train = novels[indices$train]
test = novels[indices$test]
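
createSplit comes from the novels package, so its implementation is not shown here. As a hedged sketch only: if the goal is an author-disjoint split (the original code computed unique(novels$author), suggesting authors should not straddle the two sets), an equivalent helper might look like this:

# Hypothetical stand-in for novels::createSplit. Holds out roughly 20%
# of authors so that no author appears in both train and test; the real
# helper's split ratio and strategy may differ.
createSplitSketch <- function(novels, test_fraction = 0.2) {
  all_ids <- unique(novels$author)
  test_authors <- sample(all_ids, size = floor(test_fraction * length(all_ids)))
  list(
    train = which(!novels$author %in% test_authors),
    test = which(novels$author %in% test_authors)
  )
}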

Feature extraction

Create a text2vec iterator, vocabulary, vectorizer and document-term matrix (DTM). The DTM represents the novels dataset as trainable features: one row per document, one column per vocabulary term.

it <- novels::getIterator(train$text, train$title)
vocab <- novels::text2vocab(it, length(train$text))
vectorizer = vocab_vectorizer(vocab)
dtm = create_dtm(it, vectorizer)
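
A quick sanity check on the result is often worthwhile. The snippet below assumes text2vocab returns a standard text2vec vocabulary, which has term and term_count columns:

# One row per training document, one column per retained vocabulary term.
dim(dtm)
# The most frequent terms in the vocabulary.
head(vocab[order(-vocab$term_count), ])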

Normalization

Apply a TF-IDF (term frequency-inverse document frequency) transformation to the DTM. This normalizes for document length and down-weights terms that appear in most documents, making the features better suited to machine learning.
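
In the standard formulation, the weight of term $t$ in document $d$ is

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)}$$

where $\mathrm{tf}(t, d)$ is the frequency of $t$ in $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents containing $t$. Note that text2vec's TfIdf defaults also L1-normalize the term frequencies and smooth the IDF, so its exact weights can differ slightly from this textbook form.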

# define tfidf model
tfidf = TfIdf$new()
# fit model to train data and transform train data with fitted model
dtm_tfidf = fit_transform(dtm, tfidf)
rm(dtm)

Build classifier

Build a classifier by fitting a regularized logistic regression with glmnet, using cross-validation (cv.glmnet) to choose the penalty strength lambda. The alpha = 0.1 argument sets the elastic-net mixing parameter: 0 is pure ridge and 1 is pure lasso, so this fit is mostly ridge with a small lasso component.

glmnet_classifier = cv.glmnet(x = dtm_tfidf, y = train$response,
                              family = 'binomial',
                              alpha = 0.1,
                              type.measure = "auc",
                              nfolds = 5,
                              thresh = 1e-5,
                              maxit = 1e4)
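
Standard glmnet accessors let us check which penalty value won and which terms carry the most weight. This is optional inspection, not part of the original pipeline:

# Penalty (lambda) value with the best cross-validated AUC.
glmnet_classifier$lambda.min
# Coefficients at that penalty; the largest magnitudes belong to the
# terms that most influence the success/failure prediction.
beta <- coef(glmnet_classifier, s = "lambda.min")[, 1]
head(beta[order(abs(beta), decreasing = TRUE)], 10)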

Display AUC for the resulting model

Area under the ROC curve (AUC) is a time-honored way to evaluate the effectiveness of a model. Here we use the glmnet package to plot the cross-validated AUC across values of the regularization penalty lambda.

plot(glmnet_classifier)

Evaluate test set

To understand the quality of our classifier, we want the model's best AUC result for both the training and test data sets. We want AUC scores as close to 1 as possible, ideally over 0.9.

Although the test set rarely exceeds the training set results, we want a test-set AUC result that is close to the training-set AUC.

classifier <- createPredictor(vectorizer, tfidf, glmnet_classifier)
test_results <- classifier(test)
test_auc <- glmnet:::auc(test$response, test_results$prediction)
train_auc <- max(glmnet_classifier$cvm)
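
Note that glmnet:::auc is an unexported internal function and could change between glmnet releases. The same value can be computed independently with a small rank-based (Mann-Whitney) sketch:

# AUC as the probability that a randomly chosen success scores higher
# than a randomly chosen failure (Mann-Whitney U / rank formulation).
auc_rank <- function(labels, scores) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc_rank(test$response, test_results$prediction)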

The classifier produces an AUC of `r sprintf("%02.2f", train_auc * 100)` on training data and an AUC of `r sprintf("%02.2f", test_auc * 100)` on test data (both scaled to 0-100).

summarizeScores <- function(results) {
  results %>%
    mutate(
      author = toTitleCase(as.character(author)),
      title = toTitleCase(as.character(title)),
      prediction = novels::results2Text(prediction),
      response = novels::results2Text(response)
    ) %>%
    select(author, title, score, download.count, prediction, response) %>%
    arrange(desc(score))
}

prettyPrintScores <- function(results) {
  kable(summarizeScores(results) %>% head(n = 100)) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
}

Some Test Scores for the Fiction Genre

Let's look at the relative scores for fiction examples in the test set, just to get a sense of how the classifier performs.

prettyPrintScores(test_results %>% filter(genre == 'fiction'))

Define function to evaluate performance

evaluatePerformance <- function(results) {
  results %>%
    createConfusionMatrix %>%
    addInformationRetrievalMetrics
}
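
createConfusionMatrix and addInformationRetrievalMetrics are helpers from the novels package. As a sketch of the kind of computation involved, assuming predictions are probabilities thresholded at 0.5 against a 0/1 response, the core metrics reduce to four cell counts:

# Hypothetical equivalent of the two helpers combined: build the
# confusion-matrix counts and derive accuracy, precision, recall, F1.
confusionSketch <- function(results, threshold = 0.5) {
  pred <- as.integer(results$prediction > threshold)
  tp <- sum(pred == 1 & results$response == 1)
  fp <- sum(pred == 1 & results$response == 0)
  fn <- sum(pred == 0 & results$response == 1)
  tn <- sum(pred == 0 & results$response == 0)
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  data.frame(
    Accuracy = 100 * (tp + tn) / length(pred),
    Precision = precision,
    Recall = recall,
    F1 = 2 * precision * recall / (precision + recall)
  )
}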

Classifier Results by Genre

This table shows how the classifier performs for a given genre.

genres <- levels(novels$genre)
# Evaluate the classifier overall, then once per genre, and bind the
# rows into a single table.
resultsByGenre <- rbind(
  cbind(data.frame(Genre = 'all'), evaluatePerformance(test_results)),
  do.call(rbind, lapply(genres, function(g) {
    cbind(data.frame(Genre = g),
          evaluatePerformance(test_results %>% filter(genre == g)))
  }))
)
resultsByGenre %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Comparison of text2vec performance versus the best of Stony Brook

This table shows how the GLMNET / text2vec classifier stacks up against the results reported in the "Success with Style" article. The GLMNET column shows the performance of our classifier. The "best per genre" column shows the best result obtained in each genre by any of Stony Brook's 15 models. The "best overall" column shows the performance of the best single classifier across all genres.

In three genres, viz. fiction, poetry and short stories, the GLMNET / text2vec classifier has noticeably poorer results.

# Genres, in order: all, adventure, fiction, historical.fiction,
# love.story, mystery, poetry, science.fiction, short.stories
stonyBrookAccuracyByGenre <- data.frame(
  Genre = resultsByGenre$Genre,
  BestPerGenre = c(73.5, 84.0, 75.0, 61.0, 82.0, 75.0, 74.0, 76.0, 77.0),
  BestOverall = c(73.5, 75.0, 75.0, 58.0, 81.0, 75.0, 72.0, 76.0, 77.0)
)
resultsByGenre %>%
  inner_join(stonyBrookAccuracyByGenre, by = 'Genre') %>%
  mutate(GLMNET = Accuracy) %>%
  select(Genre, GLMNET, BestPerGenre, BestOverall) %>%
  kable %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

