knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, fig.align = "center", fig.width=7, fig.height=5)
library(pander)
library(TagR)
library(tm)
library(tidyverse)
library(DT)
This notebook provides a full example of how a user can utilise the TagR package.
We will go through the following steps:

- pre-processing the labelled and unlabelled text data
- exploring the labelled data
- creating document-term matrices and adding numerical variables
- tuning the Xgboost hyper-parameters
- training a model for each author and predicting the authors of the unlabelled posts
- explaining individual predictions
The labelled dataset consists of 644 tweets where the author has been identified as an IT professional, a developer, or neither. The unlabelled dataset consists of 644 tweets where the author has not yet been identified.
labelled_raw <- TagR::labelled
unlabelled_raw <- TagR::unlabelled
rownames(labelled_raw) <- NULL
rownames(unlabelled_raw) <- NULL
The labelled and unlabelled datasets have the following variables:
names(unlabelled_raw)
The labelled dataset also has the variables
names(labelled_raw[,10:11])
which identify the type of author each tweet belongs to, using a binary 0 or 1 in each column.
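As a quick sanity check, the label columns can be inspected directly. Assuming they are named 'DEV' and 'ITPRO' (matching the topics argument used later in this notebook), something like the following shows the first few labels:

# Peek at the binary label columns (column names DEV and ITPRO are assumed here)
head(labelled_raw[, c('DEV', 'ITPRO')])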
The text data that I wish to include in the model is stored in the variable 'text'.
This text data is going to be pre-processed, which here includes tokenising and stemming the words.
The "preprocess_data" function also outputs a long-style dataframe that will be useful in performing exploratory data analysis.
preprocessed_data <- TagR::preprocess_data(labelled_raw = labelled_raw,
                                           unlabelled_raw = unlabelled_raw,
                                           topics = c('DEV', 'ITPRO'),
                                           text_vars = c('text'))
labelled <- preprocessed_data$labelled_data
unlabelled <- preprocessed_data$unlabelled_data
long_labelled <- preprocessed_data$long_labelled_data
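For context, this kind of text cleaning can be done by hand with tm. The sketch below is purely illustrative and is not the package's internal code; the exact steps "preprocess_data" applies are defined inside TagR.

# Illustrative only: a typical tm cleaning pipeline (lowercase, strip
# punctuation and numbers, drop stopwords, stem, tidy whitespace).
# stemDocument() needs the SnowballC package to be installed.
example_corpus <- tm::VCorpus(tm::VectorSource(labelled_raw$text)) %>%
  tm::tm_map(tm::content_transformer(tolower)) %>%
  tm::tm_map(tm::removePunctuation) %>%
  tm::tm_map(tm::removeNumbers) %>%
  tm::tm_map(tm::removeWords, tm::stopwords('english')) %>%
  tm::tm_map(tm::stemDocument) %>%
  tm::tm_map(tm::stripWhitespace)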
I would now like to explore the 10 most frequent terms for each author, how often each type of author appears in the dataset, and how IT professionals and developers differ in the words they mention.
Top 10 terms:
top_n_terms(long_labelled_data = long_labelled, text_var = text, top_n = 10)
Author frequency:
topic_frequency(labelled_data = labelled)
Important terms:
This function uses the weighted log odds ratio to identify words that are particularly important in identifying whether a post belongs to each category. The terms 'javascript' and 'cybersecurity' are key in identifying a developer and IT pro, respectively, even though they may not be the most frequent terms.
plot_important_terms(long_labelled_data = long_labelled, text_var = text, log_odds_n = 30)
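The weighted log odds can also be computed directly with the tidylo package if you want the underlying numbers rather than the plot. The sketch below is illustrative only: it assumes the long-format data has one row per token, with the token in a column called 'text' and the author category in a column called 'topic' (check names(long_labelled) for the actual column names).

# Illustrative only: count tokens per author category, then compute
# weighted log odds with tidylo and sort by the most distinctive terms.
long_labelled %>%
  dplyr::count(topic, text) %>%
  tidylo::bind_log_odds(set = topic, feature = text, n = n) %>%
  dplyr::arrange(dplyr::desc(log_odds_weighted))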
I now want to transform my data into sparse document-term matrices, where each column represents a word and each row represents a post. For example, the following sentences
sentences <- c('I love machine learning', 'I love to code')
can be transformed into this matrix
sample_dtm <- tm::VCorpus(tm::VectorSource(sentences)) %>%
  tm::DocumentTermMatrix() %>%
  as.matrix()
pander::pandoc.table(sample_dtm)
To create my document-term matrices I will set the maximum sparsity level to 1 as I want to include all terms. For much larger datasets this may need to be reduced to the default value of 0.999 to decrease the size of the matrix. The validation split is set to 0.2; this splits the labelled dataset into a training set and a validation set ready for hyper-parameter tuning.
dtms <- create_dt_matrices(labelled_data = labelled,
                           unlabelled_data = unlabelled,
                           text_vars = c('text'),
                           topics = c('DEV', 'ITPRO'),
                           max_sparsity = 1,
                           val_split = 0.2)
full_labelled_dtm <- dtms$labelled_dtm
train_labelled_dtm <- dtms$train_labelled_dtm
valid_labelled_dtm <- dtms$valid_labelled_dtm
unlabelled_dtm <- dtms$unlabelled_dtm
full_labels <- dtms$labels
train_labels <- dtms$train_labels
val_labels <- dtms$val_labels
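The max_sparsity argument appears to mirror the sparsity threshold used by tm. As a standalone illustration (not TagR's internal code), dropping sparse terms from a document-term matrix looks like this:

# Illustrative only: removeSparseTerms() drops any term that is absent from
# more than the given proportion of documents. With 0.999 only the very
# rarest terms are removed; values closer to 0 prune more aggressively.
dense_dtm <- tm::VCorpus(tm::VectorSource(sentences)) %>%
  tm::DocumentTermMatrix() %>%
  tm::removeSparseTerms(sparse = 0.999)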
Now that the text data is prepared, I want to add any numerical variables that I think could be useful: reach, favourites and retweets. The validation split needs to match the previous step, 0.2.
dtms <- bind_numerical_vars(labelled_raw = labelled_raw,
                            unlabelled_raw = unlabelled_raw,
                            full_labelled_dtm = full_labelled_dtm,
                            train_labelled_dtm = train_labelled_dtm,
                            valid_labelled_dtm = valid_labelled_dtm,
                            unlabelled_dtm = unlabelled_dtm,
                            numerical_vars = c('reach', 'favourites', 'retweets'),
                            val_split = 0.2)
full_labelled_dtm <- dtms$full_labelled_dtm
train_labelled_dtm <- dtms$train_labelled_dtm
valid_labelled_dtm <- dtms$valid_labelled_dtm
unlabelled_dtm <- dtms$unlabelled_dtm
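Conceptually, this step just appends the chosen numeric columns to the document-term matrix. A toy illustration of the idea, not the package's internal code (the 'likes' column below is a hypothetical numeric feature):

# Illustrative only: column-bind numeric features onto a document-term matrix.
toy_dtm <- tm::VCorpus(tm::VectorSource(sentences)) %>%
  tm::DocumentTermMatrix() %>%
  as.matrix()
toy_with_nums <- cbind(toy_dtm, likes = c(10, 3))  # hypothetical numeric feature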
Xgboost models have a range of hyper-parameters whose values can affect how well the model performs.
The parameters we are going to use are:
The "hyperparameter tuning" function goes through a number of random values for each parameter and finds the set of parameters that produced the highest accuracy score. The number of iterations should be kept high to increase the chance of finding better parameters, but for the purpose of this example I will use a lower number of 500.
parameters <- hyperparameter_tuning(train_labelled_dtm = train_labelled_dtm,
                                    valid_labelled_dtm = valid_labelled_dtm,
                                    train_labels = train_labels,
                                    val_labels = val_labels,
                                    topics = c("DEV", "ITPRO"),
                                    num_its = 500)
The best parameter sets for each author are:
datatable(parameters, rownames = FALSE, options = list(scrollX = T))
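For intuition, a single iteration of this kind of random search might look roughly like the sketch below: draw a parameter set at random, fit a binary Xgboost model on the training matrix, and score it on the validation matrix. This is illustrative only; the real search lives inside "hyperparameter_tuning", the parameter names and ranges here are standard xgboost ones rather than values taken from the package, and the 'DEV' column name in the label matrices is an assumption.

# Illustrative only: one random-search iteration for the 'DEV' label.
random_params <- list(
  objective = 'binary:logistic',
  eta = runif(1, 0.01, 0.3),           # learning rate
  max_depth = sample(3:10, 1),         # maximum tree depth
  subsample = runif(1, 0.5, 1),        # row sampling per tree
  colsample_bytree = runif(1, 0.5, 1)  # column sampling per tree
)
candidate <- xgboost::xgboost(data = as.matrix(train_labelled_dtm),
                              label = train_labels[, 'DEV'],  # assumed column name
                              params = random_params,
                              nrounds = 50,
                              verbose = 0)
val_probs <- predict(candidate, as.matrix(valid_labelled_dtm))
val_accuracy <- mean((val_probs > 0.5) == val_labels[, 'DEV'])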
The next step is to train an Xgboost model for each author using the best parameter set found in the previous step. After training, the model is then used to predict whether the unlabelled posts are written by either author, or neither.
This function also has the option to use some default parameter values if hyper-parameter tuning has not been performed.
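Conceptually, for each topic the function fits one binary Xgboost classifier on the full labelled matrix and then scores the unlabelled matrix. Stripped of the tuned parameters and packaging, that per-topic step looks roughly like the sketch below (illustrative only; it uses default parameters and again assumes a 'DEV' column in the label matrix). The actual call to predict_topics follows.

# Illustrative only: fit one binary model on the labelled data and score the
# unlabelled posts with the probability of belonging to the 'DEV' topic.
dev_fit <- xgboost::xgboost(data = as.matrix(full_labelled_dtm),
                            label = full_labels[, 'DEV'],  # assumed column name
                            objective = 'binary:logistic',
                            nrounds = 100,
                            verbose = 0)
dev_probs <- predict(dev_fit, as.matrix(unlabelled_dtm))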
prediction_output <- predict_topics(unlabelled_raw = unlabelled_raw,
                                    labelled_dtm = full_labelled_dtm,
                                    unlabelled_dtm = unlabelled_dtm,
                                    labels_matrix = full_labels,
                                    text_vars = c('text'),
                                    num_vars = c('reach', 'favourites', 'retweets'),
                                    topics = c('DEV', 'ITPRO'),
                                    parameters_df = parameters,
                                    nrounds = 1000)
models <- prediction_output$models
predictions <- prediction_output$predictions
devmodel <- models$DEV
itmodel <- models$ITPRO
The output then looks like the following:
# pander::pandoc.table(predictions[1:3,])
datatable(predictions[1:3,], rownames = FALSE, options = list(scrollX = T))
In this case, the model has predicted that none of the first 3 posts are likely to be written by a developer or IT professional.
It's all very well that we now have predictions, but it is difficult to tell how the model actually made those predictions, and it would be useful if we could visualise the thought process.
An Xgboost model uses decision trees to make predictions. That is, it will go through a series of nodes that will ask different questions, which branch off in different directions depending on the answer. The final decision is the leaf of the tree, which will be our prediction.
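Those trees can be inspected directly on the fitted model. Assuming the model objects returned by predict_topics are standard xgb.Booster objects, xgboost itself can list the trees as a table or plot an individual one (the plot needs the DiagrammeR package installed):

# Inspect the trees behind the developer model: a per-node table, plus a plot
# of the first tree (tree indices start at 0).
dev_trees <- xgboost::xgb.model.dt.tree(model = devmodel)
xgboost::xgb.plot.tree(model = devmodel, trees = 0)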
To better visualise the process, let's first build our explainer. I would like to understand how the algorithm decides the probability of a post being written by a developer.
model_explainer <- build_topic_explainer(full_labelled_dtm = full_labelled_dtm,
                                         full_labels = full_labels,
                                         topic = 'DEV',
                                         devmodel)
I would like to see how the model predicted a particular post, so let's see which posts were predicted to be written by a developer, using a probability boundary of 0.7.
devs <- subset(predictions, DEV > 0.7)
devs['468', 'text']
The comment explainer threshold is set to 0.02 so that the graph doesn't become messy and unreadable. The resulting graph shows that the inclusion of words such as 'web' and 'dev' (stemmed words) has contributed to the prediction that the post was written by a developer. The absence of words such as 'privaci' and 'secur' has also influenced the decision, as these are more common words for an IT professional.
comment_explainer <- explain_comment(xgbmodel = devmodel,
                                     explainer = model_explainer,
                                     unlabelled_dtm = unlabelled_dtm,
                                     comment_index = 55,
                                     threshold = 0.02)
comment_explainer