knitr::opts_chunk$set(echo = TRUE,
                      message = FALSE,
                      warning = FALSE,
                      fig.align = "center",
                      fig.width = 7,
                      fig.height = 5)
library(pander)
library(TagR)
library(tm)
library(tidyverse)
library(DT)

This notebook provides a full example of how a user can utilise the TagR package.

We will go through the following steps:

  1. Pre-process the text data
  2. Conduct exploratory data analysis
  3. Create document-term matrices
  4. Bind numerical variables to the document-term matrices
  5. Find an optimal set of parameters for the Xgboost model for each type of author
  6. Train a model for each author type and predict the probability that each post in the unlabelled dataset was written by that type of author
  7. Build an explainer that helps to visualise how the Xgboost model makes its predictions

The Data

The labelled dataset consists of 644 tweets whose authors have been identified as an IT professional, a developer, or neither. The unlabelled dataset consists of 644 tweets whose authors have not yet been identified.

labelled_raw <- TagR::labelled
unlabelled_raw <- TagR::unlabelled
rownames(labelled_raw) <- NULL
rownames(unlabelled_raw) <- NULL
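
As a quick check on the sizes quoted above, both data frames can be inspected directly:

nrow(labelled_raw)
nrow(unlabelled_raw)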

The labelled and unlabelled datasets have the following variables:

names(unlabelled_raw)

The labelled dataset also has the variables

names(labelled_raw[,10:11])

which identify the type of author who wrote each tweet, using a binary 0 or 1 in each column.
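
For illustration only, the label columns are binary indicators of this shape (hypothetical values, not rows from the real dataset):

# Hypothetical illustration of the label columns: one 0/1 indicator per author type
data.frame(DEV = c(1, 0, 0), ITPRO = c(0, 1, 0))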

Step 1: Pre-process the text data

The text data that I wish to include in the model is held in the variable 'text'.

This text is cleaned by the "preprocess_data" function, which applies standard text pre-processing steps before the document-term matrices are built; the stemmed terms that appear later in this notebook (such as 'privaci' and 'secur') are a result of this.

The "preprocess_data" function also outputs a long-style dataframe that will be useful when performing exploratory data analysis.

preprocessed_data <- TagR::preprocess_data(labelled_raw = labelled_raw,
                                           unlabelled_raw = unlabelled_raw,
                                           topics = c('DEV', 'ITPRO'),
                                           text_vars = c('text'))

labelled <- preprocessed_data$labelled_data
unlabelled <- preprocessed_data$unlabelled_data
long_labelled <- preprocessed_data$long_labelled_data

Step 2: Perform exploratory data analysis

I would now like to explore the 10 most frequent terms for each author type, how often each author type appears in the dataset, and how the words mentioned by IT professionals differ from those mentioned by developers.

Top 10 terms:

top_n_terms(long_labelled_data = long_labelled,
            text_var = text,
            top_n = 10)

Author frequency:

topic_frequency(labelled_data = labelled)

Important terms:

This function uses the weighted log odds ratio to identify words that are particularly important in identifying whether a post belongs to each category. The terms 'javascript' and 'cybersecurity' are key in identifying a developer and IT pro, respectively, even though they may not be the most frequent terms.

plot_important_terms(long_labelled_data = long_labelled,
                     text_var = text,
                     log_odds_n = 30)
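
For reference, weighted log odds of this kind can also be computed directly with the tidylo package. The sketch below assumes the long dataframe has one row per token with columns named 'topic' and 'text'; these column names are assumptions for illustration, not guaranteed by TagR.

# Sketch of a weighted log odds calculation with tidylo (column names assumed)
library(tidylo)
long_labelled %>%
  dplyr::count(topic, text, name = "n") %>%
  tidylo::bind_log_odds(set = topic, feature = text, n = n) %>%
  dplyr::arrange(dplyr::desc(log_odds_weighted))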

Step 3: Create document-term matrices

I now want to transform my data into sparse document-term matrices, where each column represents a word and each row represents a post. For example, the following dataset

sentences <- c('I love machine learning',
               'I love to code')

can be transformed into this matrix

sample_dtm <- tm::VCorpus(tm::VectorSource(sentences)) %>%
  tm::DocumentTermMatrix() %>%
  as.matrix()
pander::pandoc.table(sample_dtm)

To create my document-term matrices I will set the maximum sparsity level to 1, as I want to include all terms. For much larger datasets this may need to be reduced to the default value of 0.999 to decrease the size of the matrix. The validation split is set to 0.2; this splits the labelled dataset into a training dataset and a validation dataset ready for hyper-parameter tuning.
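
For intuition, sparsity removal in the tm package drops any term whose proportion of zero entries across documents exceeds the threshold. The snippet below illustrates the concept on the toy example from above; it is not TagR's internal code.

# Illustration of sparsity removal with tm (not TagR's internals):
# terms absent from more than 99% of documents would be dropped at sparse = 0.99
toy_dtm <- tm::VCorpus(tm::VectorSource(sentences)) %>%
  tm::DocumentTermMatrix()
tm::removeSparseTerms(toy_dtm, sparse = 0.99)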

dtms <- create_dt_matrices(labelled_data = labelled,
                           unlabelled_data = unlabelled,
                           text_vars = c('text'),
                           topics = c('DEV', 'ITPRO'),
                           max_sparsity = 1,
                           val_split = 0.2)

full_labelled_dtm <- dtms$labelled_dtm
train_labelled_dtm <- dtms$train_labelled_dtm
valid_labelled_dtm <- dtms$valid_labelled_dtm
unlabelled_dtm <- dtms$unlabelled_dtm
full_labels <- dtms$labels
train_labels <- dtms$train_labels
val_labels <- dtms$val_labels

Step 4: Bind numerical variables to matrices

Now that the text data is sorted, I want to add any numerical variables that I think could be useful. I'm going to choose reach, favourites and retweets. The validation split needs to match the one used in Step 3, 0.2.
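
Conceptually, this step just appends the chosen numeric columns alongside the term counts. A rough illustration using the toy matrix from Step 3 (the numeric values here are made up):

# Rough illustration: numeric features become extra columns next to the term counts
toy_numeric <- matrix(c(100, 250, 3, 7), ncol = 2,
                      dimnames = list(NULL, c("reach", "retweets")))
pander::pandoc.table(cbind(sample_dtm, toy_numeric))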

dtms <- bind_numerical_vars(labelled_raw = labelled_raw,
                            unlabelled_raw = unlabelled_raw,
                            full_labelled_dtm = full_labelled_dtm,
                            train_labelled_dtm = train_labelled_dtm,
                            valid_labelled_dtm = valid_labelled_dtm,
                            unlabelled_dtm = unlabelled_dtm,
                            numerical_vars = c('reach', 'favourites', 'retweets'),
                            val_split = 0.2)

full_labelled_dtm <- dtms$full_labelled_dtm
train_labelled_dtm <- dtms$train_labelled_dtm
valid_labelled_dtm <- dtms$valid_labelled_dtm
unlabelled_dtm <- dtms$unlabelled_dtm

Step 5: Hyper-parameter tuning

Xgboost models use a range of different parameters, the values of which can affect how the model performs.

Common Xgboost parameters include the learning rate (eta), the maximum tree depth, and the row and column subsampling ratios.

The "hyperparameter_tuning" function tries a number of random values for each parameter and keeps the set that produces the highest accuracy score on the validation data. The number of iterations should be kept high to increase the chance of finding better parameters, but for the purpose of this example I will use a lower number of 500.

parameters <- hyperparameter_tuning(train_labelled_dtm = train_labelled_dtm,
                                    valid_labelled_dtm = valid_labelled_dtm,
                                    train_labels = train_labels,
                                    val_labels = val_labels,
                                    topics = c("DEV", "ITPRO"),
                                    num_its = 500)

The best parameter sets for each author are:

datatable(parameters, rownames = FALSE, options = list(scrollX = T))

Step 6: Predict authors for unlabelled posts

The next step is to train an Xgboost model for each author type using the best parameter set found in the previous step. After training, the models are used to predict the probability that each unlabelled post was written by a developer, an IT professional, or neither.

This function also has the option to use some default parameter values if hyper-parameter tuning has not been performed.

predictions <- predict_topics(unlabelled_raw = unlabelled_raw,
                              labelled_dtm = full_labelled_dtm,
                              unlabelled_dtm = unlabelled_dtm,
                              labels_matrix = full_labels,
                              text_vars = c('text'),
                              num_vars = c('reach', 'favourites', 'retweets'),
                              topics = c('DEV', 'ITPRO'),
                              parameters_df = parameters,
                              nrounds = 1000)

models <- predictions$models
predictions <- predictions$predictions

devmodel <- models$DEV
itmodel <- models$ITPRO

The output then looks like the following:

datatable(predictions[1:3,], rownames = FALSE, options = list(scrollX = T))

In this case, the model has predicted that none of the first 3 posts are likely to be written by a developer or IT professional.
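
To get a feel for the full output rather than just the first three rows, you can count how many posts pass a given probability threshold. This assumes the prediction columns are named after the topics, as the DEV column used further below suggests.

# Number of unlabelled posts scoring above 0.7 for each author type
colSums(predictions[, c("DEV", "ITPRO")] > 0.7)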

Step 7: Build and visualise an explainer

It's all very well that we now have predictions, but it is difficult to tell how the model actually made them; it would be useful to visualise its decision process.

An Xgboost model is an ensemble of decision trees. Each tree passes a post through a series of nodes, each asking a question about one of its features and branching in different directions depending on the answer; the post ends at a leaf, and the model's prediction combines the leaf values across all of the trees.
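
If you want to look at the underlying trees directly, the xgboost package can dump or plot them. This is an optional aside rather than part of the TagR workflow, and plotting requires the DiagrammeR package.

# Optional: inspect the boosted trees of the developer model directly
xgboost::xgb.model.dt.tree(model = devmodel)
# Plot the first tree (tree indices are 0-based)
xgboost::xgb.plot.tree(model = devmodel, trees = 0)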

To better visualise the process, let's first build our explainer. I would like to understand how the algorithm decides the probability of a post being written by a developer.

model_explainer <- build_topic_explainer(full_labelled_dtm = full_labelled_dtm,
                                         full_labels = full_labels,
                                         topic = 'DEV',
                                         devmodel)

I would like to see how the model predicted a particular post, so let's see which posts were predicted to be written by a developer, using a probability boundary of 0.7.

devs <- subset(predictions, DEV > 0.7)
devs['468', 'text']

The comment explainer threshold is set at 0.02 so the graph doesn't become messy and unreadable. The resulting graph shows that the inclusion of words such as 'web' and 'dev' (stemmed terms) has contributed to the prediction that the post was written by a developer. The absence of words such as 'privaci' and 'secur' has also influenced the decision, as these words are more typical of an IT professional.

comment_explainer <- explain_comment(xgbmodel = devmodel,
                                     explainer = model_explainer,
                                     unlabelled_dtm = unlabelled_dtm,
                                     comment_index = 55,
                                     threshold = 0.02)
comment_explainer

