knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, fig.align = "center", fig.width=7, fig.height=5)
library(pander)
library(TagR)
library(tm)
library(tidyverse)
library(DT)
This notebook provides a full example of how a user can utilise the TagR package.
We will go through the following steps:

- pre-processing the labelled and unlabelled text data
- exploring the labelled data
- creating document-term matrices and adding numerical variables
- tuning the Xgboost hyper-parameters
- training a model for each author and predicting the authors of the unlabelled posts
- explaining individual predictions
The labelled dataset consists of 644 tweets where the author has been identified as an IT professional, a developer, or neither. The unlabelled dataset consists of 644 tweets where the author has not yet been identified.
labelled_raw <- TagR::labelled
unlabelled_raw <- TagR::unlabelled
rownames(labelled_raw) <- NULL
rownames(unlabelled_raw) <- NULL
The labelled and unlabelled datasets have the following variables:
names(unlabelled_raw)
The labelled dataset also has the variables
names(labelled_raw[,10:11])
which identify the type of author each tweet belongs to, using a binary 0 or 1 in each column.
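As a quick sanity check, the label columns can be inspected directly. Assuming they are named 'DEV' and 'ITPRO' (matching the topics argument used later in this notebook), something like the following shows the first few labels:

# Peek at the binary label columns (column names DEV and ITPRO are assumed here)
head(labelled_raw[, c('DEV', 'ITPRO')])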
The text data that I wish to include in the model is stored in the variable 'text'.
This text data is going to be pre-processed, which here includes tokenising and stemming the words.
The "preprocess_data" function also outputs a long-style dataframe that will be useful in performing exploratory data analysis.
preprocessed_data <- TagR::preprocess_data(labelled_raw = labelled_raw,
                                           unlabelled_raw = unlabelled_raw,
                                           topics = c('DEV', 'ITPRO'),
                                           text_vars = c('text'))
labelled <- preprocessed_data$labelled_data
unlabelled <- preprocessed_data$unlabelled_data
long_labelled <- preprocessed_data$long_labelled_data
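For context, this kind of text cleaning can be done by hand with tm. The sketch below is purely illustrative and is not the package's internal code; the exact steps "preprocess_data" applies are defined inside TagR.

# Illustrative only: a typical tm cleaning pipeline (lowercase, strip
# punctuation and numbers, drop stopwords, stem, tidy whitespace).
# stemDocument() needs the SnowballC package to be installed.
example_corpus <- tm::VCorpus(tm::VectorSource(labelled_raw$text)) %>%
  tm::tm_map(tm::content_transformer(tolower)) %>%
  tm::tm_map(tm::removePunctuation) %>%
  tm::tm_map(tm::removeNumbers) %>%
  tm::tm_map(tm::removeWords, tm::stopwords('english')) %>%
  tm::tm_map(tm::stemDocument) %>%
  tm::tm_map(tm::stripWhitespace)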
I would now like to explore the 10 most frequent terms for each author, how often each type of author appears in the dataset, and how IT professionals and developers differ in the words they mention.
Top 10 terms:
top_n_terms(long_labelled_data = long_labelled, text_var = text, top_n = 10)
Author frequency:
topic_frequency(labelled_data = labelled)
Important terms:
This function uses the weighted log odds ratio to identify words that are particularly important in identifying whether a post belongs to each category. The terms 'javascript' and 'cybersecurity' are key in identifying a developer and IT pro, respectively, even though they may not be the most frequent terms.
plot_important_terms(long_labelled_data = long_labelled, text_var = text, log_odds_n = 30)
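The weighted log odds can also be computed directly with the tidylo package if you want the underlying numbers rather than the plot. The sketch below is illustrative only: it assumes the long-format data has one row per token, with the token in a column called 'text' and the author category in a column called 'topic' (check names(long_labelled) for the actual column names).

# Illustrative only: count tokens per author category, then compute
# weighted log odds with tidylo and sort by the most distinctive terms.
long_labelled %>%
  dplyr::count(topic, text) %>%
  tidylo::bind_log_odds(set = topic, feature = text, n = n) %>%
  dplyr::arrange(dplyr::desc(log_odds_weighted))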
I now want to transform my data into sparse document-term matrices, where each column represents a word and each row represents a post. For example, the following sentences
sentences <- c('I love machine learning', 'I love to code')
can be transformed into this matrix
sample_dtm <- tm::VCorpus(tm::VectorSource(sentences)) %>%
  tm::DocumentTermMatrix() %>%
  as.matrix()
pander::pandoc.table(sample_dtm)
To create my document-term matrices I will set the maximum sparsity level to 1 as I want to include all terms. For much larger datasets this may need to be reduced to the default value of 0.999 to decrease the size of the matrix. The validation split is set to 0.2; this splits the labelled dataset into a training set and a validation set ready for hyper-parameter tuning.
dtms <- create_dt_matrices(labelled_data = labelled,
                           unlabelled_data = unlabelled,
                           text_vars = c('text'),
                           topics = c('DEV', 'ITPRO'),
                           max_sparsity = 1,
                           val_split = 0.2)
full_labelled_dtm <- dtms$labelled_dtm
train_labelled_dtm <- dtms$train_labelled_dtm
valid_labelled_dtm <- dtms$valid_labelled_dtm
unlabelled_dtm <- dtms$unlabelled_dtm
full_labels <- dtms$labels
train_labels <- dtms$train_labels
val_labels <- dtms$val_labels
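The max_sparsity argument appears to mirror the sparsity threshold used by tm. As a standalone illustration (not TagR's internal code), dropping sparse terms from a document-term matrix looks like this:

# Illustrative only: removeSparseTerms() drops any term that is absent from
# more than the given proportion of documents. With 0.999 only the very
# rarest terms are removed; values closer to 0 prune more aggressively.
dense_dtm <- tm::VCorpus(tm::VectorSource(sentences)) %>%
  tm::DocumentTermMatrix() %>%
  tm::removeSparseTerms(sparse = 0.999)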
Now that the text data is prepared, I want to add any numerical variables that I think could be useful: reach, favourites and retweets. The validation split needs to match the previous step, 0.2.
dtms <- bind_numerical_vars(labelled_raw = labelled_raw,
                            unlabelled_raw = unlabelled_raw,
                            full_labelled_dtm = full_labelled_dtm,
                            train_labelled_dtm = train_labelled_dtm,
                            valid_labelled_dtm = valid_labelled_dtm,
                            unlabelled_dtm = unlabelled_dtm,
                            numerical_vars = c('reach', 'favourites', 'retweets'),
                            val_split = 0.2)
full_labelled_dtm <- dtms$full_labelled_dtm
train_labelled_dtm <- dtms$train_labelled_dtm
valid_labelled_dtm <- dtms$valid_labelled_dtm
unlabelled_dtm <- dtms$unlabelled_dtm
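Conceptually, this step just appends the chosen numeric columns to the document-term matrix. A toy illustration of the idea, not the package's internal code (the 'likes' column below is a hypothetical numeric feature):

# Illustrative only: column-bind numeric features onto a document-term matrix.
toy_dtm <- tm::VCorpus(tm::VectorSource(sentences)) %>%
  tm::DocumentTermMatrix() %>%
  as.matrix()
toy_with_nums <- cbind(toy_dtm, likes = c(10, 3))  # hypothetical numeric feature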
Xgboost models have a range of hyper-parameters whose values can affect how well the model performs.
The parameters we are going to use are:
The "hyperparameter tuning" function goes through a number of random values for each parameter and finds the set of parameters that produced the highest accuracy score. The number of iterations should be kept high to increase the chance of finding better parameters, but for the purpose of this example I will use a lower number of 500.
parameters <- hyperparameter_tuning(train_labelled_dtm = train_labelled_dtm,
                                    valid_labelled_dtm = valid_labelled_dtm,
                                    train_labels = train_labels,
                                    val_labels = val_labels,
                                    topics = c("DEV", "ITPRO"),
                                    num_its = 500)
The best parameter sets for each author are:
datatable(parameters, rownames = FALSE, options = list(scrollX = T))
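For intuition, a single iteration of this kind of random search might look roughly like the sketch below: draw a parameter set at random, fit a binary Xgboost model on the training matrix, and score it on the validation matrix. This is illustrative only; the real search lives inside "hyperparameter_tuning", the parameter names and ranges here are standard xgboost ones rather than values taken from the package, and the 'DEV' column name in the label matrices is an assumption.

# Illustrative only: one random-search iteration for the 'DEV' label.
random_params <- list(
  objective = 'binary:logistic',
  eta = runif(1, 0.01, 0.3),           # learning rate
  max_depth = sample(3:10, 1),         # maximum tree depth
  subsample = runif(1, 0.5, 1),        # row sampling per tree
  colsample_bytree = runif(1, 0.5, 1)  # column sampling per tree
)
candidate <- xgboost::xgboost(data = as.matrix(train_labelled_dtm),
                              label = train_labels[, 'DEV'],  # assumed column name
                              params = random_params,
                              nrounds = 50,
                              verbose = 0)
val_probs <- predict(candidate, as.matrix(valid_labelled_dtm))
val_accuracy <- mean((val_probs > 0.5) == val_labels[, 'DEV'])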
The next step is to train an Xgboost model for each author using the best parameter set found in the previous step. After training, the model is then used to predict whether the unlabelled posts are written by either author, or neither.
This function also has the option to use some default parameter values if hyper-parameter tuning has not been performed.
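Conceptually, for each topic the function fits one binary Xgboost classifier on the full labelled matrix and then scores the unlabelled matrix. Stripped of the tuned parameters and packaging, that per-topic step looks roughly like the sketch below (illustrative only; it uses default parameters and again assumes a 'DEV' column in the label matrix). The actual call to predict_topics follows.

# Illustrative only: fit one binary model on the labelled data and score the
# unlabelled posts with the probability of belonging to the 'DEV' topic.
dev_fit <- xgboost::xgboost(data = as.matrix(full_labelled_dtm),
                            label = full_labels[, 'DEV'],  # assumed column name
                            objective = 'binary:logistic',
                            nrounds = 100,
                            verbose = 0)
dev_probs <- predict(dev_fit, as.matrix(unlabelled_dtm))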
prediction_output <- predict_topics(unlabelled_raw = unlabelled_raw,
                                    labelled_dtm = full_labelled_dtm,
                                    unlabelled_dtm = unlabelled_dtm,
                                    labels_matrix = full_labels,
                                    text_vars = c('text'),
                                    num_vars = c('reach', 'favourites', 'retweets'),
                                    topics = c('DEV', 'ITPRO'),
                                    parameters_df = parameters,
                                    nrounds = 1000)
models <- prediction_output$models
predictions <- prediction_output$predictions
devmodel <- models$DEV
itmodel <- models$ITPRO
The output then looks like the following:
# pander::pandoc.table(predictions[1:3,])
datatable(predictions[1:3,], rownames = FALSE, options = list(scrollX = T))
In this case, the model has predicted that none of the first 3 posts are likely to be written by a developer or IT professional.
It's all very well that we now have predictions, but it is difficult to tell how the model actually made those predictions, and it would be useful if we could visualise the thought process.
An Xgboost model uses decision trees to make predictions. That is, it will go through a series of nodes that will ask different questions, which branch off in different directions depending on the answer. The final decision is the leaf of the tree, which will be our prediction.
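Those trees can be inspected directly on the fitted model. Assuming the model objects returned by predict_topics are standard xgb.Booster objects, xgboost itself can list the trees as a table or plot an individual one (the plot needs the DiagrammeR package installed):

# Inspect the trees behind the developer model: a per-node table, plus a plot
# of the first tree (tree indices start at 0).
dev_trees <- xgboost::xgb.model.dt.tree(model = devmodel)
xgboost::xgb.plot.tree(model = devmodel, trees = 0)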
To better visualise the process, let's first build our explainer. I would like to understand how the algorithm decides the probability of a post being written by a developer.
model_explainer <- build_topic_explainer(full_labelled_dtm = full_labelled_dtm,
                                         full_labels = full_labels,
                                         topic = 'DEV',
                                         devmodel)
I would like to see how the model predicted a particular post, so let's see which posts were predicted to be written by a developer, using a probability boundary of 0.7.
devs <- subset(predictions, DEV > 0.7)
devs['468', 'text']
The comment explainer threshold is set to 0.02 so that the graph doesn't become messy and unreadable. The resulting graph shows that the inclusion of words such as 'web' and 'dev' (stemmed words) has contributed to the prediction that the post was written by a developer. The absence of words such as 'privaci' and 'secur' has also influenced the decision, as these are more common words for an IT professional.
comment_explainer <- explain_comment(xgbmodel = devmodel,
                                     explainer = model_explainer,
                                     unlabelled_dtm = unlabelled_dtm,
                                     comment_index = 55,
                                     threshold = 0.02)
comment_explainer