kms: foRmulas foR keRas

library(knitr)
opts_chunk$set(comment = "", message = FALSE, warning = FALSE)

The goal of this document is to introduce kms (as in keras_model_sequential()), a regression-style function that allows users to call keras neural nets with R formula objects (hence, library(kerasformula)). kms() makes it easy to cross-validate a neural net and eases the coding burden that comes with setting a potentially large number of advanced hyperparameters.

First, make sure that keras is properly configured:

install.packages("keras")
library(keras)
install_keras() # see https://keras.rstudio.com/ for details. 
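
One quick way to confirm that the backend is reachable before going further (is_keras_available() ships with library(keras)):

library(keras)
is_keras_available()   # TRUE once the Python backend is installed and found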

kms splits the data into training and test sets, which it stores as sparse matrices. kms also auto-detects whether the dependent variable is categorical, binary, or continuous. kms accepts the major parameters found in library(keras) as inputs (loss function, batch size, number of epochs, etc.) and allows users to customize basic neural nets (dense neural nets of various input shapes and dropout rates). The final example below also shows how to pass a compiled keras_model_sequential to kms (preferable for more complex models).
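
As a quick sketch of that interface (not run here; mtcars simply stands in for real data, and loss, batch_size, and Nepochs pass through to keras as described above):

library(kerasformula)

# mpg is continuous, so kms() auto-detects a regression-style outcome;
# a portion of the data is held out for out-of-sample evaluation by default.
fit <- kms(mpg ~ cyl + wt + hp, data = mtcars,
           Nepochs = 15, batch_size = 8,
           loss = "mean_squared_error")
fit$evaluations   # out-of-sample loss/metrics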

IMDB Movie Reviews

This example works with some of the IMDB movie review data that comes with library(keras). Specifically, it compares the default dense model that kms generates to the LSTM model described in the keras examples. To expedite package building and installation, the code below is not actually run, but it completes in under six minutes on a 2017 MacBook Pro with 16 GB of RAM (the majority of that time goes to the LSTM).

library(kerasformula)

max_features <- 5000 # 5,000 words (ranked by popularity) found in movie reviews
maxlen <- 50         # cut texts after 50 words (among top max_features most common words)
Nsample <- 1000      # reviews to sample for this demo

cat('Loading data...\n')
imdb <- keras::dataset_imdb(num_words = max_features)
imdb_df <- as.data.frame(cbind(c(imdb$train$y, imdb$test$y),
                               pad_sequences(c(imdb$train$x, imdb$test$x),
                                             maxlen = maxlen)))

set.seed(2017)   # can also set kms(..., seed = 2017)

demo_sample <- sample(nrow(imdb_df), Nsample)
P <- ncol(imdb_df) - 1
colnames(imdb_df) <- c("y", paste0("x", 1:P))

out_dense <- kms("y ~ .", data = imdb_df[demo_sample, ], Nepochs = 10,
                 scale_continuous = NULL) # scale_continuous = NULL leaves the data on its original scale


plot(out_dense$history)  # incredibly useful 
# choose Nepochs to maximize out-of-sample accuracy
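
The epoch-level metrics can also be pulled out of the history object directly; a small sketch, assuming validation accuracy is recorded under the (pre-TF2) name val_acc:

# epoch with the best validation accuracy
which.max(out_dense$history$metrics$val_acc)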

out_dense$confusion
    1
  0 107
  1 105
cat('Test accuracy:', out_dense$evaluations$acc, "\n")
Test accuracy: 0.495283 

Pretty bad--that's a 'broken clock' model: the confusion matrix shows it predicted class 1 for every test observation, so its accuracy is roughly the base rate. Suppose we want to add some more layers, say 6 total. The vector units is only length 5 since the size of the final layer is determined by the type of outcome (one unit for regression, two or more for classification). Inputs like dropout or the activation functions below are recycled so that each layer is specified. (Each layer will have a 40% dropout rate and the layers alternate between relu and softmax.)

out_dense <- kms("y ~ .", data = imdb_df[demo_sample, ], Nepochs = 10, seed = 123,
                 scale_continuous = NULL,
                 N_layers = 6,
                 units = c(1024, 512, 256, 128, 64),
                 activation = c("relu", "softmax"),
                 dropout = 0.4)
out_dense$confusion
     1
  0 92
  1 106
cat('Test accuracy:', out_dense$evaluations$acc, "\n")
Test accuracy: 0.4816514

No progress. Suppose we want to build an LSTM model and pass it to kms.

use_session_with_seed(12345)
k <- keras_model_sequential()
k %>%
  layer_embedding(input_dim = max_features, output_dim = 128) %>% 
  layer_lstm(units = 64, dropout = 0.2, recurrent_dropout = 0.2) %>% 
  layer_dense(units = 1, activation = 'sigmoid')

k %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy')
)
out_lstm <- kms("y ~ .", imdb_df[demo_sample, ], 
                keras_model_seq = k, Nepochs = 10, seed = 12345, scale_continuous = NULL)
out_lstm$confusion
     0  1
  0 74 23
  1 23 79
cat('Test accuracy:', out_lstm$evaluations$acc, "\n")
Test accuracy: 0.7688442 

76.8% out-of-sample accuracy. That's a marked improvement!
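
That figure can be double-checked against the confusion matrix itself (a quick sketch; $confusion prints as a square table with observed classes on the rows and predicted classes on the columns):

sum(diag(out_lstm$confusion)) / sum(out_lstm$confusion)
# (74 + 79) / (74 + 23 + 23 + 79) = 153/199 = 0.7688442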

If you're OK with -> (right assignment), the above is equivalent to:

use_session_with_seed(12345)

keras_model_sequential() %>%
  layer_embedding(input_dim = max_features, output_dim = 128) %>%
  layer_lstm(units = 64, dropout = 0.2, recurrent_dropout = 0.2) %>%
  layer_dense(units = 1, activation = 'sigmoid') %>%
  compile(loss = 'binary_crossentropy',
          optimizer = 'adam', metrics = c('accuracy')) %>%
  kms(input_formula = "y ~ .", data = imdb_df[demo_sample, ],
      Nepochs = 10, seed = 12345, scale_continuous = NULL) ->
  out_lstm

plot(out_lstm$history)
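
Fitted kms objects can also score fresh observations; a sketch on reviews outside demo_sample (this assumes the package's predict method for kms fits, with newdata shaped like the training data):

holdout <- imdb_df[-demo_sample, ][1:200, ]   # reviews not used above
predict(out_lstm, newdata = holdout)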

kerasformula is featured on RStudio's TensorFlow blog.


