knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
This tutorial classifies movie reviews as positive or negative using the text of the review. This is an example of binary — or two-class — classification, an important and widely applicable kind of machine learning problem.
We'll use the IMDB dataset that contains the text of 50,000 movie reviews from the Internet Movie Database. These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.
Let's start and load Keras, as well as a few other required libraries.
library(keras)
library(dplyr)
library(ggplot2)
library(purrr)
The IMDB dataset comes packaged with Keras. It has already been preprocessed such that the reviews (sequences of words) have been converted to sequences of integers, where each integer represents a specific word in a dictionary.
The following code downloads the IMDB dataset to your machine (or uses a cached copy if you've already downloaded it):
imdb <- dataset_imdb(num_words = 10000)

c(train_data, train_labels) %<-% imdb$train
c(test_data, test_labels) %<-% imdb$test
The argument num_words = 10000 keeps the top 10,000 most frequently occurring words in the training data. The rarer words are discarded to keep the size of the data manageable.
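As a quick sanity check (an extra step, not in the original text), we can confirm that no word index exceeds this cap; with the Keras defaults, out-of-vocabulary words are replaced by a special index rather than kept:

# The largest word index in the training data should be below num_words.
max(sapply(train_data, max))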
Conveniently, the dataset comes with an index mapping words to integers, which has to be downloaded separately:
word_index <- dataset_imdb_word_index()
Let's take a moment to understand the format of the data. The dataset comes preprocessed: each example is an array of integers representing the words of the movie review. Each label is an integer value of either 0 or 1, where 0 is a negative review, and 1 is a positive review.
paste0("Training entries: ", length(train_data), ", labels: ", length(train_labels))
[1] "Training entries: 25000, labels: 25000"
The texts of the reviews have been converted to integers, where each integer represents a specific word in a dictionary. Here's what the first review looks like:
train_data[[1]]
[1] 1 14 22 16 43 530 973 1622 1385 65 458 4468 66 3941 4 173 [17] 36 256 5 25 100 43 838 112 50 670 2 9 35 480 284 5 [33] 150 4 172 112 167 2 336 385 39 4 172 4536 1111 17 546 38 [49] 13 447 4 192 50 16 6 147 2025 19 14 22 4 1920 4613 469 [65] 4 22 71 87 12 16 43 530 38 76 15 13 1247 4 22 17 [81] 515 17 12 16 626 18 2 5 62 386 12 8 316 8 106 5 [97] 4 2223 5244 16 480 66 3785 33 4 130 12 16 38 619 5 25 [113] 124 51 36 135 48 25 1415 33 6 22 12 215 28 77 52 5 [129] 14 407 16 82 2 8 4 107 117 5952 15 256 4 2 7 3766 [145] 5 723 36 71 43 530 476 26 400 317 46 7 4 2 1029 13 [161] 104 88 4 381 15 297 98 32 2071 56 26 141 6 194 7486 18 [177] 4 226 22 21 134 476 26 480 5 144 30 5535 18 51 36 28 [193] 224 92 25 104 4 226 65 16 38 1334 88 12 16 283 5 16 [209] 4472 113 103 32 15 16 5345 19 178 32
Movie reviews may be different lengths. The code below shows the number of words in the first and second reviews. Since inputs to a neural network must be the same length, we'll need to resolve this later.
length(train_data[[1]])
length(train_data[[2]])
[1] 218
[1] 189
It may be useful to know how to convert integers back to text.
We already have the word_index we downloaded above — a list with words as keys and integers as values. If we create a data frame from it, we can conveniently use it in both directions.
word_index_df <- data.frame(
  word = names(word_index),
  idx = unlist(word_index, use.names = FALSE),
  stringsAsFactors = FALSE
)

# The first indices are reserved
word_index_df <- word_index_df %>% mutate(idx = idx + 3)
word_index_df <- word_index_df %>%
  add_row(word = "<PAD>", idx = 0) %>%
  add_row(word = "<START>", idx = 1) %>%
  add_row(word = "<UNK>", idx = 2) %>%
  add_row(word = "<UNUSED>", idx = 3)

word_index_df <- word_index_df %>% arrange(idx)

decode_review <- function(text) {
  paste(
    map(text, function(number) word_index_df %>%
          filter(idx == number) %>%
          select(word) %>%
          pull()),
    collapse = " "
  )
}
Now we can use the decode_review function to display the text for the first review:
decode_review(train_data[[1]])
[1] "<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"
The reviews — the arrays of integers — must be converted to tensors before being fed into the neural network. This conversion can be done in a couple of ways:
One option is to one-hot-encode the arrays to convert them into vectors of 0s and 1s. For example, the sequence [3, 5] would become a 10,000-dimensional vector that is all zeros except for indices 3 and 5, which are ones. We could then make this the first layer in our network: a dense layer that can handle floating point vector data. This approach is memory intensive, though, requiring a matrix of size num_words * num_reviews (a minimal sketch of it appears just below).

Alternatively, we can pad the arrays so they all have the same length, then create an integer tensor of shape num_examples * max_length. We can use an embedding layer capable of handling this shape as the first layer in our network.
In this tutorial, we will use the second approach.
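For comparison only, here is a minimal sketch of the first (one-hot) approach described above; it is not used in the rest of this tutorial, and the helper name is illustrative:

# Turn a list of integer sequences into a (num_reviews x num_words) binary matrix.
multi_hot <- function(sequences, dimension = 10000) {
  results <- matrix(0, nrow = length(sequences), ncol = dimension)
  for (i in seq_along(sequences))
    results[i, sequences[[i]]] <- 1
  results
}

# x_train_onehot <- multi_hot(train_data)  # memory intensive: 25,000 x 10,000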
Since the movie reviews must be the same length, we will use the pad_sequences function to standardize the lengths:
train_data <- pad_sequences(
  train_data,
  value = word_index_df %>% filter(word == "<PAD>") %>% select(idx) %>% pull(),
  padding = "post",
  maxlen = 256
)

test_data <- pad_sequences(
  test_data,
  value = word_index_df %>% filter(word == "<PAD>") %>% select(idx) %>% pull(),
  padding = "post",
  maxlen = 256
)
Let's look at the length of the examples now:
length(train_data[1, ])
length(train_data[2, ])
[1] 256
[1] 256
And inspect the (now padded) first review:
train_data[1, ]
[1] 1 14 22 16 43 530 973 1622 1385 65 458 4468 66 3941 4 [16] 173 36 256 5 25 100 43 838 112 50 670 2 9 35 480 [31] 284 5 150 4 172 112 167 2 336 385 39 4 172 4536 1111 [46] 17 546 38 13 447 4 192 50 16 6 147 2025 19 14 22 [61] 4 1920 4613 469 4 22 71 87 12 16 43 530 38 76 15 [76] 13 1247 4 22 17 515 17 12 16 626 18 2 5 62 386 [91] 12 8 316 8 106 5 4 2223 5244 16 480 66 3785 33 4 [106] 130 12 16 38 619 5 25 124 51 36 135 48 25 1415 33 [121] 6 22 12 215 28 77 52 5 14 407 16 82 2 8 4 [136] 107 117 5952 15 256 4 2 7 3766 5 723 36 71 43 530 [151] 476 26 400 317 46 7 4 2 1029 13 104 88 4 381 15 [166] 297 98 32 2071 56 26 141 6 194 7486 18 4 226 22 21 [181] 134 476 26 480 5 144 30 5535 18 51 36 28 224 92 25 [196] 104 4 226 65 16 38 1334 88 12 16 283 5 16 4472 113 [211] 103 32 15 16 5345 19 178 32 0 0 0 0 0 0 0 [226] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 [241] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 [256] 0
The neural network is created by stacking layers — this requires two main architectural decisions: how many layers to use in the model, and how many hidden units to use for each layer.
In this example, the input data consists of an array of word-indices. The labels to predict are either 0 or 1. Let's build a model for this problem:
# input shape is the vocabulary count used for the movie reviews (10,000 words)
vocab_size <- 10000

model <- keras_model_sequential()
model %>%
  layer_embedding(input_dim = vocab_size, output_dim = 16) %>%
  layer_global_average_pooling_1d() %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% summary()
____________________________________________________________________________________
Layer (type)                             Output Shape               Param #
====================================================================================
embedding_1 (Embedding)                  (None, None, 16)           160000
____________________________________________________________________________________
global_average_pooling1d_1 (GlobalAv     (None, 16)                 0
____________________________________________________________________________________
dense_3 (Dense)                          (None, 16)                 272
____________________________________________________________________________________
dense_4 (Dense)                          (None, 1)                  17
====================================================================================
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
____________________________________________________________________________________
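As a quick aside (not in the original text), the parameter counts in the summary can be checked by hand: the embedding layer stores one 16-dimensional vector per vocabulary word, and each dense layer has one weight per input-output pair plus one bias per unit:

10000 * 16             # embedding_1: 160,000 parameters
16 * 16 + 16           # dense_3: 272 parameters
16 * 1 + 1             # dense_4: 17 parameters
10000 * 16 + 272 + 17  # total: 160,289 parameters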
The layers are stacked sequentially to build the classifier:

The first layer is an embedding layer. This layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: (batch, sequence, embedding).

Next, a global_average_pooling_1d layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length, in the simplest way possible.

This fixed-length output vector is piped through a fully-connected (dense) layer with 16 hidden units.

The last layer is densely connected with a single output node. Using the sigmoid activation function, this value is a float between 0 and 1, representing a probability, or confidence level.

The above model has two intermediate or "hidden" layers, between the input and output. The number of outputs (units, nodes, or neurons) is the dimension of the representational space for the layer. In other words, the amount of freedom the network is allowed when learning an internal representation.
If a model has more hidden units (a higher-dimensional representation space), and/or more layers, then the network can learn more complex representations. However, it makes the network more computationally expensive and may lead to learning unwanted patterns — patterns that improve performance on training data but not on the test data. This is called overfitting, and we'll explore it later.
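To make that trade-off concrete, here is a hedged sketch of a higher-capacity variant; the 64-unit width and the extra hidden layer are arbitrary choices for illustration, not a recommendation:

# A larger model: wider layers and one more hidden layer than the model above.
# Expect it to fit the training data faster and to overfit sooner.
bigger_model <- keras_model_sequential() %>%
  layer_embedding(input_dim = vocab_size, output_dim = 16) %>%
  layer_global_average_pooling_1d() %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")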
A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs a probability (a single-unit layer with a sigmoid activation), we'll use the binary_crossentropy loss function.
This isn't the only choice for a loss function; you could, for instance, choose mean_squared_error. But, generally, binary_crossentropy is better for dealing with probabilities — it measures the "distance" between probability distributions, or in our case, between the ground-truth distribution and the predictions.
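To make that concrete, here is a minimal hand-rolled version of binary cross-entropy for a batch of labels and predicted probabilities (an illustrative helper, not the Keras implementation):

# Mean binary cross-entropy: -mean(y * log(p) + (1 - y) * log(1 - p))
bce <- function(y_true, y_pred) {
  -mean(y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred))
}

bce(c(1, 0, 1), c(0.9, 0.2, 0.6))  # lower when predictions match the labels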
Later, when we are exploring regression problems (say, to predict the price of a house), we will see how to use another loss function called mean squared error.
Now, configure the model to use an optimizer and a loss function:
model %>% compile(
  optimizer = 'adam',
  loss = 'binary_crossentropy',
  metrics = list('accuracy')
)
When training, we want to check the accuracy of the model on data it hasn't seen before. Create a validation set by setting apart 10,000 examples from the original training data. (Why not use the testing set now? Our goal is to develop and tune our model using only the training data, then use the test data just once to evaluate our accuracy).
x_val <- train_data[1:10000, ]
partial_x_train <- train_data[10001:nrow(train_data), ]

y_val <- train_labels[1:10000]
partial_y_train <- train_labels[10001:length(train_labels)]
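The manual split above follows the original tutorial. As an aside, fit() also accepts a validation_split argument that holds out the last fraction of the (unshuffled) training data instead; a sketch of that alternative, left commented out so it doesn't replace the split above:

# Alternative: let fit() carve out the validation data itself.
# history <- model %>% fit(
#   train_data, train_labels,
#   epochs = 40, batch_size = 512,
#   validation_split = 0.4,  # hold out the last 40% (10,000 examples)
#   verbose = 1
# )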
Train the model for 40 epochs in mini-batches of 512 samples. This is 40 iterations over all samples in the partial_x_train and partial_y_train tensors. While training, monitor the model's loss and accuracy on the 10,000 samples from the validation set:
history <- model %>% fit(
  partial_x_train,
  partial_y_train,
  epochs = 40,
  batch_size = 512,
  validation_data = list(x_val, y_val),
  verbose = 1
)
Train on 15000 samples, validate on 10000 samples
Epoch 1/40
15000/15000 [==============================] - 1s 56us/step - loss: 0.6916 - acc: 0.6060 - val_loss: 0.6895 - val_acc: 0.5805
Epoch 2/40
15000/15000 [==============================] - 1s 34us/step - loss: 0.6851 - acc: 0.7123 - val_loss: 0.6804 - val_acc: 0.7446
Epoch 3/40
15000/15000 [==============================] - 0s 32us/step - loss: 0.6713 - acc: 0.7484 - val_loss: 0.6625 - val_acc: 0.7322
Epoch 4/40
15000/15000 [==============================] - 1s 33us/step - loss: 0.6454 - acc: 0.7679 - val_loss: 0.6329 - val_acc: 0.7712
Epoch 5/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.6068 - acc: 0.7939 - val_loss: 0.5929 - val_acc: 0.7876
Epoch 6/40
15000/15000 [==============================] - 0s 33us/step - loss: 0.5597 - acc: 0.8133 - val_loss: 0.5490 - val_acc: 0.8033
Epoch 7/40
15000/15000 [==============================] - 1s 34us/step - loss: 0.5087 - acc: 0.8345 - val_loss: 0.5020 - val_acc: 0.8210
Epoch 8/40
15000/15000 [==============================] - 1s 38us/step - loss: 0.4595 - acc: 0.8512 - val_loss: 0.4601 - val_acc: 0.8369
Epoch 9/40
15000/15000 [==============================] - 0s 33us/step - loss: 0.4154 - acc: 0.8650 - val_loss: 0.4242 - val_acc: 0.8476
Epoch 10/40
15000/15000 [==============================] - 1s 38us/step - loss: 0.3780 - acc: 0.8770 - val_loss: 0.3956 - val_acc: 0.8535
Epoch 11/40
15000/15000 [==============================] - 1s 34us/step - loss: 0.3467 - acc: 0.8862 - val_loss: 0.3725 - val_acc: 0.8612
Epoch 12/40
15000/15000 [==============================] - 0s 33us/step - loss: 0.3209 - acc: 0.8931 - val_loss: 0.3542 - val_acc: 0.8673
Epoch 13/40
15000/15000 [==============================] - 1s 37us/step - loss: 0.2989 - acc: 0.8989 - val_loss: 0.3414 - val_acc: 0.8662
Epoch 14/40
15000/15000 [==============================] - 0s 33us/step - loss: 0.2800 - acc: 0.9038 - val_loss: 0.3285 - val_acc: 0.8736
Epoch 15/40
15000/15000 [==============================] - 1s 34us/step - loss: 0.2633 - acc: 0.9090 - val_loss: 0.3189 - val_acc: 0.8759
Epoch 16/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.2485 - acc: 0.9148 - val_loss: 0.3119 - val_acc: 0.8778
Epoch 17/40
15000/15000 [==============================] - 1s 33us/step - loss: 0.2355 - acc: 0.9193 - val_loss: 0.3051 - val_acc: 0.8796
Epoch 18/40
15000/15000 [==============================] - 1s 34us/step - loss: 0.2234 - acc: 0.9233 - val_loss: 0.3004 - val_acc: 0.8801
Epoch 19/40
15000/15000 [==============================] - 1s 39us/step - loss: 0.2124 - acc: 0.9266 - val_loss: 0.2965 - val_acc: 0.8812
Epoch 20/40
15000/15000 [==============================] - 1s 35us/step - loss: 0.2021 - acc: 0.9300 - val_loss: 0.2931 - val_acc: 0.8823
Epoch 21/40
15000/15000 [==============================] - 1s 35us/step - loss: 0.1927 - acc: 0.9347 - val_loss: 0.2904 - val_acc: 0.8830
Epoch 22/40
15000/15000 [==============================] - 1s 39us/step - loss: 0.1839 - acc: 0.9395 - val_loss: 0.2886 - val_acc: 0.8840
Epoch 23/40
15000/15000 [==============================] - 1s 33us/step - loss: 0.1761 - acc: 0.9426 - val_loss: 0.2873 - val_acc: 0.8841
Epoch 24/40
15000/15000 [==============================] - 1s 35us/step - loss: 0.1690 - acc: 0.9459 - val_loss: 0.2866 - val_acc: 0.8850
Epoch 25/40
15000/15000 [==============================] - 0s 33us/step - loss: 0.1606 - acc: 0.9497 - val_loss: 0.2864 - val_acc: 0.8845
Epoch 26/40
15000/15000 [==============================] - 1s 35us/step - loss: 0.1540 - acc: 0.9519 - val_loss: 0.2871 - val_acc: 0.8839
Epoch 27/40
15000/15000 [==============================] - 1s 38us/step - loss: 0.1475 - acc: 0.9548 - val_loss: 0.2877 - val_acc: 0.8855
Epoch 28/40
15000/15000 [==============================] - 1s 39us/step - loss: 0.1412 - acc: 0.9570 - val_loss: 0.2878 - val_acc: 0.8855
Epoch 29/40
15000/15000 [==============================] - 1s 34us/step - loss: 0.1354 - acc: 0.9597 - val_loss: 0.2891 - val_acc: 0.8857
Epoch 30/40
15000/15000 [==============================] - 1s 39us/step - loss: 0.1299 - acc: 0.9617 - val_loss: 0.2911 - val_acc: 0.8862
Epoch 31/40
15000/15000 [==============================] - 1s 33us/step - loss: 0.1250 - acc: 0.9632 - val_loss: 0.2920 - val_acc: 0.8863
Epoch 32/40
15000/15000 [==============================] - 0s 33us/step - loss: 0.1197 - acc: 0.9653 - val_loss: 0.2939 - val_acc: 0.8868
Epoch 33/40
15000/15000 [==============================] - 1s 37us/step - loss: 0.1150 - acc: 0.9679 - val_loss: 0.2960 - val_acc: 0.8854
Epoch 34/40
15000/15000 [==============================] - 0s 33us/step - loss: 0.1101 - acc: 0.9697 - val_loss: 0.2989 - val_acc: 0.8847
Epoch 35/40
15000/15000 [==============================] - 1s 35us/step - loss: 0.1061 - acc: 0.9702 - val_loss: 0.3026 - val_acc: 0.8857
Epoch 36/40
15000/15000 [==============================] - 1s 34us/step - loss: 0.1021 - acc: 0.9722 - val_loss: 0.3057 - val_acc: 0.8830
Epoch 37/40
15000/15000 [==============================] - 0s 32us/step - loss: 0.0978 - acc: 0.9745 - val_loss: 0.3086 - val_acc: 0.8830
Epoch 38/40
15000/15000 [==============================] - 1s 37us/step - loss: 0.0939 - acc: 0.9762 - val_loss: 0.3110 - val_acc: 0.8825
Epoch 39/40
15000/15000 [==============================] - 1s 36us/step - loss: 0.0897 - acc: 0.9771 - val_loss: 0.3142 - val_acc: 0.8823
Epoch 40/40
15000/15000 [==============================] - 1s 34us/step - loss: 0.0861 - acc: 0.9786 - val_loss: 0.3181 - val_acc: 0.8818
And let's see how the model performs on the test data. Two values are returned: loss (a number representing our error; lower values are better) and accuracy.
results <- model %>% evaluate(test_data, test_labels)
results
25000/25000 [==============================] - 0s 15us/step
$loss
[1] 0.34057

$acc
[1] 0.8696
This fairly naive approach achieves an accuracy of about 87%. With more advanced approaches, the model should get closer to 95%.
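Because the final layer uses a sigmoid activation, each prediction is a probability. As an extra step (not in the original text), we can inspect a few raw predictions on the test set; values close to 1 indicate a predicted positive review:

# Predicted probability of a positive review for the first five test examples.
model %>% predict(test_data[1:5, ])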
fit returns a keras_training_history object whose metrics slot contains the loss and metrics values recorded during training.
You can conveniently plot the loss and metrics curves like so:
plot(history)
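If you want more control than plot(history) offers, the history object can be converted to a data frame and plotted with ggplot2; a sketch, assuming the as.data.frame() method provided by the keras package (column names may vary by version):

history_df <- as.data.frame(history)  # typically: epoch, value, metric, data
ggplot(history_df, aes(x = epoch, y = value, colour = data)) +
  geom_line() +
  facet_wrap(~ metric, scales = "free_y")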
The evolution of loss and metrics can also be seen during training in the RStudio Viewer pane.
Notice the training loss decreases with each epoch and the training accuracy increases with each epoch. This is expected when using gradient descent optimization — it should minimize the desired quantity on every iteration.
This isn't the case for the validation loss and accuracy — they seem to peak after about twenty epochs. This is an example of overfitting: the model performs better on the training data than it does on data it has never seen before. After this point, the model over-optimizes and learns representations specific to the training data that do not generalize to test data.
For this particular case, we could prevent overfitting by simply stopping the training after twenty or so epochs. Later, you'll see how to do this automatically with a callback.
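As a preview of that callback, here is a hedged sketch using callback_early_stopping(); the patience value is an arbitrary choice, and restore_best_weights requires a reasonably recent Keras version:

# Stop once the validation loss has not improved for 3 consecutive epochs.
early_stop <- callback_early_stopping(
  monitor = "val_loss",
  patience = 3,
  restore_best_weights = TRUE
)

# Pass it to fit() via the callbacks argument:
# history <- model %>% fit(
#   partial_x_train, partial_y_train,
#   epochs = 40, batch_size = 512,
#   validation_data = list(x_val, y_val),
#   callbacks = list(early_stop)
# )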
Check out these additional tutorials to learn more:
Basic Classification --- In this tutorial, we train a neural network model to classify images of clothing, like sneakers and shirts.
Basic Regression --- This tutorial builds a model to predict the median price of homes in a Boston suburb during the mid-1970s.
Overfitting and Underfitting --- In this tutorial, we explore two common regularization techniques (weight regularization and dropout) and use them to improve our movie review classification results.
Save and Restore Models --- This tutorial demonstrates various ways to save and share models (after as well as during training).