This package exists to explore the concept of creating a machine learning model as an R package, similar to the established concept of an analysis as an R package. The idea here is that, using vignettes, we can train the model by installing the package. The functions in the package then allow the user to score new data with the trained model. To demonstrate this I've created an extremely simple sentiment analysis model based on review data from the UCI Machine Learning Repository.

I thought this might work because of a few things:

However, I have my doubts:

No matter what, I think these sorts of projects have to be shared, even if this one isn't a major success!

A quick shout out to Hadley Wickham's excellent book on R packages. It's well worth keeping bookmarked.

library(dplyr)
library(ggplot2)
library(text2vec)
library(tidytext)
library(randomForest)

knitr::opts_chunk$set(echo = TRUE, cache = FALSE)

package_root <- here::here()
devtools::load_all(package_root)

Data load

I haven't kept the data in this git repository, opting instead to download it if it doesn't already exist. It's a fairly small data set though (3000 rows).

download_data is a package function that downloads and unzips the source data into the inst/extdata directory (creating it if necessary). When the package is installed, everything in the inst folder is moved up to the root directory of the package, and so we can find the extdata directory in the finished product.
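The function body isn't reproduced here, but a minimal sketch of what download_data plausibly does is below. The data set URL is an assumption (the UCI "Sentiment Labelled Sentences" archive); the packaged function is the source of truth.

download_data <- function(destination) {
  # Create the destination directory if it doesn't already exist
  if (!dir.exists(destination)) dir.create(destination, recursive = TRUE)
  # Assumed location of the UCI "Sentiment Labelled Sentences" data set
  url <- paste0(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00331/",
    "sentiment%20labelled%20sentences.zip"
  )
  zip_file <- file.path(destination, "sentiment.zip")
  download.file(url, zip_file)
  # junkpaths = TRUE flattens the zip's internal folder structure
  unzip(zip_file, exdir = destination, junkpaths = TRUE)
  invisible(destination)
}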

extdata <- file.path(package_root, "inst", "extdata")
data_files <- c("amazon_cells_labelled.txt",
                "imdb_labelled.txt",
                "yelp_labelled.txt") %>% file.path(extdata, .)
if (!all(file.exists(data_files))) {
  download_data(extdata)
}

Data is loaded in with another custom function, read_review_file. This is just readr::read_tsv with some special options to cover the peculiarities of the raw data. All of these custom functions are documented and stored in the R directory. Once the package is installed, function manuals can be called in the usual way (e.g. ?read_review_file).
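As a sketch, read_review_file probably amounts to something like the following; the quote handling and column specification are assumptions based on the raw files' format (one review and one 0/1 label per line, tab-separated, no header):

read_review_file <- function(file) {
  readr::read_tsv(
    file,
    col_names = c("review", "sentiment"),  # the raw files have no header
    col_types = readr::cols(
      review = readr::col_character(),
      sentiment = readr::col_integer()
    ),
    quote = ""  # reviews can contain unmatched quotation marks
  )
}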

This is a simple analysis, so let's just stick to discrete categories for sentiment: "good" and "bad". I don't care too much about how the model performs, as long as it functions.

reviews <- data_files %>% 
  purrr::map(read_review_file) %>%
  purrr::reduce(rbind) %>% 
  mutate(sentiment = ifelse(sentiment == 1, "good", "bad"))
reviews %>% head

Exploring data

We check for missing data using the naniar package:

reviews %>% naniar::miss_var_summary()

Let's take a look at which words are the most frequent. First we create a data frame such that each row is an occurrence of a word. Note that we remove stop words --- these are words like "the" that are common and usually provide little semantic content to the text.

words <- reviews %>% 
  tidytext::unnest_tokens(
    word, 
    review
  ) %>% 
  anti_join(
    tidytext::stop_words, 
    by = "word"
  )
words %>% head

Now we'll plot the most frequently occurring words, keeping a note of which words are "good" and which words are "bad".

words %>%
  count(word, sentiment, sort = TRUE) %>%
  head(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col() +
  xlab(NULL) +
  theme(text = element_text(size = 16)) +
  coord_flip() +
  ggtitle("Frequency of words")

There are no surprises here! "Bad" is universally bad and "love" is universally good. It's comforting to see. We'll note this and use these words in our unit tests.
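For example, a test along these lines could live in tests/testthat. The sentiment function here is hypothetical shorthand for scoring a single piece of text with the trained model:

# sentiment() is a hypothetical wrapper that scores one piece of text
testthat::test_that("unambiguous words score as expected", {
  testthat::expect_equal(sentiment("I love this"), "good")
  testthat::expect_equal(sentiment("this is bad"), "bad")
})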

I'm not sure what purpose word clouds serve, but they seem almost mandatory.

words %>%
  count(word) %>%
  with(
    wordcloud::wordcloud(
      word, 
      n, 
      max.words = 100
    )
  )

Preprocessing

We need to apply some preprocessing to our text before we can feed it into a model. The first round of preprocessing is simply ignoring case, punctuation and numbers:

text_preprocessor

I'm actually not sure that we should be removing numbers here. We're dealing with reviews, after all, and a review like "10/10" certainly tells us something about sentiment. But that's beyond the scope of this package.
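For reference, here is a sketch of a preprocessor matching that description (lower-casing and stripping punctuation and digits); the text_preprocessor printed above is the actual definition:

text_preprocessor <- function(text) {
  text <- tolower(text)                    # ignore case
  text <- gsub("[[:digit:]]+", "", text)   # remove numbers
  text <- gsub("[[:punct:]]+", " ", text)  # remove punctuation
  text
}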

The next round of processing involves tokenising our words and stemming them, which strips each word down to its base. Another custom function, stem_tokeniser, plays this role by calling on the Porter stemming algorithm:

stem_tokeniser("information informed informing informs")

Now we'll define our vocabulary. The vocabulary is the domain of the problem --- the words that will go into the model as features. Dimensionality is an issue here, so we'll prune our vocabulary to include only words that occur a minimum number of times. The vocabulary is subject to domain shift --- if an incoming piece of text contains a word that isn't in the vocabulary, the model will ignore it.

We're going to insist that every word in the vocabulary appears in at least 25 of the reviews.

vocabulary <- create_vocabulary(reviews$review,
                                doc_proportion_min = 25 / nrow(reviews))
vocabulary
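Note that create_vocabulary here is a package function that wraps text2vec. A plausible sketch of the wrapper, assuming it prunes with text2vec::prune_vocabulary:

create_vocabulary <- function(text, doc_proportion_min = 0) {
  tokens <- text2vec::itoken(
    stem_tokeniser(text_preprocessor(text)),
    progressbar = FALSE
  )
  vocabulary <- text2vec::create_vocabulary(tokens)
  text2vec::prune_vocabulary(
    vocabulary,
    doc_proportion_min = doc_proportion_min
  )
}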

Finally, we'll create a vectoriser using the text2vec package. This allows us to map new text onto the vocabulary we've just created.

Actually, much of what you see here uses the text2vec package. I'm fond of this package because it's designed with the idea that you may need to score new data that comes in after you've trained your model, so you always need to be able to process new text!

vectoriser <- vocabulary %>% text2vec::vocab_vectorizer()

One quick note, though: the itoken function in text2vec creates an iterator of tokens. It can be called like this:

tokens <- text2vec::itoken(
  unprocessed_text, 
  preprocessor = text_preprocessor, 
  tokenizer = stem_tokeniser,
  progressbar = FALSE
)

However, I had great trouble with this method: words weren't being properly tokenised before they made it into the vectoriser. So I've done this instead:

processed_text <- stem_tokeniser(text_preprocessor(unprocessed_text))

text2vec::itoken(
  processed_text, 
  progressbar = FALSE
)

I would have thought that the two pieces of code were equivalent, but my unit tests fail with the first example. I'm putting this here as an unknown!

Creating a document term matrix

The input for our model algorithm is a document term matrix (dtm). This is a matrix in which every row represents one of our 3000 reviews, and every column represents one of the terms in our vocabulary. We use the custom map_to_dtm function, which allows us to map raw text onto a new dtm.

map_to_dtm
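A sketch of what map_to_dtm plausibly does with the text2vec machinery (the printed definition above is authoritative):

map_to_dtm <- function(text, vectoriser, tfidf = NULL) {
  # Preprocess and stem up front, then iterate over the processed text
  tokens <- text2vec::itoken(
    stem_tokeniser(text_preprocessor(text)),
    progressbar = FALSE
  )
  dtm <- text2vec::create_dtm(tokens, vectoriser)
  # If a fitted tfidf is supplied, weight the matrix with it
  if (!is.null(tfidf)) dtm <- tfidf$transform(dtm)
  dtm
}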

You'll notice the tfidf argument. This stands for term frequency-inverse document frequency. Informally, we want to weight a word as more important if it occurs often within a given document, and less important if it occurs across many documents. This is exactly what tf-idf does.
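As a rough illustration, using the textbook definition (text2vec's TfIdf applies its own smoothing and normalisation, so treat the numbers as indicative only):

# Toy tf-idf weight for a single term in a single review
tf  <- 2 / 10           # the term makes up 2 of the review's 10 tokens
idf <- log(3000 / 30)   # the term appears in 30 of our 3000 reviews
tf * idf                # larger when frequent locally but rare globally

Let's start with an unweighted matrix so we can see the effect.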

dtm_unweighted <- map_to_dtm(reviews$review,
                             vectoriser = vectoriser,
                             tfidf = NULL)
paste0('> ', reviews$review[3000]) %>% cat
tail(as.matrix(dtm_unweighted)[3000,], 21)

Now we fit our tfidf. Be careful here: fit_transform is a method that says "use this data to define the tfidf, and then transform the input by that tfidf". This is distinct from transform, which says "use a tfidf that's already been fitted to transform this data". This terminology is more familiar to Python users than it is to R users, and I occasionally see it tripping people up (especially on Kaggle). The rule of thumb is that you use fit_transform for your training data and transform for your test data, or any new data that you encounter.

tfidf <- text2vec::TfIdf$new()
dtm_tf_idf <- tfidf$fit_transform(dtm_unweighted)

Let's take the same example as before, but now with a weighted document term matrix:

tail(as.matrix(dtm_tf_idf)[3000,], 21)
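At scoring time we would reuse the fitted tfidf through its transform method rather than refitting it. With map_to_dtm as sketched earlier, passing the fitted object through the tfidf argument does exactly that:

# Weighting new text with the already-fitted tfidf (no refitting)
new_dtm <- map_to_dtm(
  "What a waste of money",
  vectoriser = vectoriser,
  tfidf = tfidf
)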

Training a random forest

With all of the effort we put into preprocessing, the model training step is relatively straightforward! Document term matrices are stored as a special class of sparse matrix, because there are computational techniques for efficiently storing and using matrices in which the vast majority of entries are 0. However, this format isn't accepted by the randomForest algorithm, so we convert with as.matrix. Fortunately, with only 3000 rows and one column per vocabulary term, we don't have to worry too much about computational efficiency.

review_rf <- randomForest::randomForest(
  x = as.matrix(dtm_tf_idf),
  y = factor(reviews$sentiment),
  ntree = 500
)

While I don't want to invest much time in the model itself, we can at least look at how it performs, and which terms it considers the most important:

review_rf
randomForest::varImpPlot(review_rf)

Artefact output

Now we recall that we're actually creating a package. We want all of the work that we've done so far to be included in the final result. Fortunately, our custom functions are in the R directory, so they'll persist when the package is compiled. We need three other objects (all relating to our trained model) to be available as well: our random forest review_rf, our vectoriser, and the tfidf we used to weight our training data (and will reuse for new data). These are all (sparsely) documented with their own entries in the R directory.

withr::with_dir(package_root,
  usethis::use_data(
    review_rf,
    vectoriser,
    tfidf,
    overwrite = TRUE 
  )
)
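With those artefacts in place, scoring new text after installation only takes a few lines. This is a sketch (the package may well wrap it in a friendlier helper, and the example reviews are made up):

new_reviews <- c(
  "I love it, it exceeded my expectations",
  "Broke after two days, very poor quality"
)
new_dtm <- map_to_dtm(new_reviews, vectoriser = vectoriser, tfidf = tfidf)
predict(review_rf, newdata = as.matrix(new_dtm))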

