In this section, we show how to fit the examples discussed in the text. We use the keras package, which interfaces to the
tensorflow package, which in turn links to efficient
python code. This code is
impressively fast, and the package is well-structured. A good companion
is the text \emph{Deep Learning with R} by F. Chollet and
J.J. Allaire (Manning Publications, 2018), and most of our code is adapted from
there.
Getting keras up and running on your computer can be a
challenge. The book website www.statlearning.com gives
step-by-step instructions on how to achieve this.
(Many thanks to Balasubramanian Narasimhan for preparing the keras
installation instructions.)
Guidance can also be found at keras.rstudio.com.
We start by fitting the models in Section 10.6. We set up the data, and separate out a training and test set.
```r
library(ISLR2)
Gitters <- na.omit(Hitters)
n <- nrow(Gitters)
set.seed(13)
ntest <- trunc(n / 3)
testid <- sample(1:n, ntest)
```
The linear model should be familiar, but we present it anyway.
```r
lfit <- lm(Salary ~ ., data = Gitters[-testid, ])
lpred <- predict(lfit, Gitters[testid, ])
with(Gitters[testid, ], mean(abs(lpred - Salary)))
```
Notice the use of the with() command: the first argument is a
dataframe, and the second an expression that can refer to elements of
the dataframe by name. In this instance the dataframe corresponds to
the test data and the expression computes the mean absolute prediction
error on this data.
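As a further illustration of with(), here is a toy example that is not part of the lab's analysis:
```r
df <- data.frame(a = 1:3, b = c(10, 20, 30))
with(df, mean(b - a))  # same as mean(df$b - df$a)
```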
Next we fit the lasso using glmnet. Since this package does
not use formulas, we create x and y first.
```r
x <- scale(model.matrix(Salary ~ . - 1, data = Gitters))
y <- Gitters$Salary
```
The first line makes a call
to model.matrix(), which produces the same matrix that was
used by lm() (the -1 omits the intercept).
This function automatically converts factors to dummy variables.
The scale() function standardizes the matrix so each column
has mean zero and variance one.
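If you wish, you can verify the standardization directly; this quick check is not part of the original lab:
```r
round(colMeans(x), 10)  # column means are (essentially) zero
apply(x, 2, sd)         # column standard deviations are one
```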
```r
library(glmnet)
cvfit <- cv.glmnet(x[-testid, ], y[-testid], type.measure = "mae")
cpred <- predict(cvfit, x[testid, ], s = "lambda.min")
mean(abs(y[testid] - cpred))
```
To fit the neural network, we first set up a model structure that describes the network.
```r
library(keras)
modnn <- keras_model_sequential() %>%
  layer_dense(units = 50, activation = "relu", input_shape = ncol(x)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 1)
```
We have created a vanilla model
object called modnn, and have added details about the successive
layers in a sequential manner, using the function keras_model_sequential().
The pipe operator %>%
passes the previous term as the first argument to the next
function, and returns the result.
It allows us to specify the layers of
a neural network in a readable form.
We illustrate the use of the pipe operator on a simple example. Earlier, we created x using the command
```r
x <- scale(model.matrix(Salary ~ . - 1, data = Gitters))
```
We first make a matrix, and then we standardize each of the variables. Compound expressions like this can be difficult to parse. We could have obtained the same result using the pipe operator:
```r
x <- model.matrix(Salary ~ . - 1, data = Gitters) %>% scale()
```
Using the pipe operator makes it easier to follow the sequence of operations.
We now return to our neural network. The object modnn has a single hidden layer with 50 hidden units, and
a ReLU activation function. It then has a dropout layer, in which a
random 40% of the 50 activations from the previous layer are set to zero
during each iteration of the stochastic gradient descent
algorithm. Finally, the output layer has just one unit with no
activation function, indicating that the model provides a single
quantitative output.
Next we add details to modnn that control the fitting
algorithm. Here we have simply followed the examples given in
the Keras book. We minimize squared-error loss as in
(10.22). The algorithm
tracks the mean absolute error on the training data, and
on validation data if it is supplied.
```r
modnn %>% compile(
    loss = "mse",
    optimizer = optimizer_rmsprop(),
    metrics = list("mean_absolute_error")
  )
```
In the previous line, the pipe operator passes modnn as the first argument to compile(). The compile() function does not actually change the R
object modnn, but it does communicate these specifications to
the corresponding python instance of this model that has been created
along the way.
Now we fit the model. We supply the training data and two
fitting parameters, epochs and batch_size. Using 32
for the latter means that at each step of SGD, the algorithm randomly
selects 32 training observations for the computation of the gradient. Recall from
Sections 10.4 and 10.7
that an epoch amounts to the number of SGD steps required to process $n$
observations. Since the training set has
$n=176$, an epoch is $176/32=5.5$ SGD steps. The fit() function has an argument
validation_data; these data are not used in the fitting,
but can be used to track the progress of the model (in this case reporting
mean absolute error). Here we
actually supply the test data so we can see mean absolute error of both the
training data and test data as the epochs proceed. To see more options
for fitting, use ?fit.keras.engine.training.Model.
```r
history <- modnn %>% fit(
#    x[-testid, ], y[-testid], epochs = 1500, batch_size = 32,
    x[-testid, ], y[-testid], epochs = 600, batch_size = 32,
    validation_data = list(x[testid, ], y[testid])
  )
```
(Here and elsewhere we have reduced the number of epochs to make runtimes manageable; users can of course change back.)
We can plot the history to display the mean absolute error for the training and test data. For the best aesthetics, install the ggplot2 package before calling the plot() function. If you have not installed ggplot2, then the code below will still run, but the plot will be less attractive.
```r
plot(history)
```
It is worth noting that if you run the fit() command a second time in the same R session, then the fitting process will pick up where it left off. Try re-running the fit() command, and then the plot() command, to see!
Finally, we predict from the final model, and
evaluate its performance on the test data. Due to the use of SGD, the results vary slightly with each
fit. Unfortunately the set.seed() function does not ensure
identical results (since the fitting is done in python), so your results will differ slightly.
```r
npred <- predict(modnn, x[testid, ])
mean(abs(y[testid] - npred))
```
The keras package comes with a number of example datasets,
including the MNIST digit data. Our first step is to load the
MNIST data. The dataset_mnist() function is provided for this purpose.
```r
mnist <- dataset_mnist()
x_train <- mnist$train$x
g_train <- mnist$train$y
x_test <- mnist$test$x
g_test <- mnist$test$y
dim(x_train)
dim(x_test)
```
There are 60,000 images in the training data and 10,000 in the test data. The images are $28\times 28$, and stored as a three-dimensional array,
so we need to reshape them into a matrix. Also, we need to one-hot
encode the class label. Luckily keras has a lot of built-in
functions that do this for us.
```r
x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))
y_train <- to_categorical(g_train, 10)
y_test <- to_categorical(g_test, 10)
```
Neural networks are somewhat sensitive to the scale of the inputs. For example, ridge and lasso regularization are affected by scaling. Here the inputs are eight-bit (Note: Eight bits means $2^8$, which equals 256. Since the convention is to start at $0$, the possible values range from $0$ to $255$.) grayscale values between 0 and 255, so we rescale to the unit interval.
```r
x_train <- x_train / 255
x_test <- x_test / 255
```
Now we are ready to fit our neural network.
```r
modelnn <- keras_model_sequential()
modelnn %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")
```
The first layer goes from $28\times28=784$ input units to a hidden
layer of $256$ units, which uses the ReLU activation function.
This is specified by a call to layer_dense(), which takes as
input a modelnn object, and returns a modified modelnn
object.
This is then piped through layer_dropout() to perform
dropout regularization. The second hidden layer comes
next, with $128$ hidden units, followed by a dropout layer.
The final layer is the output layer, with activation
"softmax" (10.12) for the 10-class classification
problem, which defines the map from the second hidden layer to class
probabilities.
Finally, we use summary() to summarize the model, and to make sure we got it
all right.
```r
summary(modelnn)
```
The parameters for each layer include a bias term, which results in a parameter count of 235,146. For example, the first hidden layer involves $(784+1)\times 256=200{,}960$ parameters.
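We can reproduce this parameter count with a little arithmetic; each layer contributes one weight per input unit plus one bias for each of its units:
```r
(784 + 1) * 256 +    # input layer to first hidden layer
  (256 + 1) * 128 +  # first hidden layer to second hidden layer
  (128 + 1) * 10     # second hidden layer to output layer
# 235146
```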
Notice that the layer names such as dropout_1 and
dense_2 have subscripts. These may appear somewhat random; in
fact, if you fit the same model again, these will change. They are of
no consequence: they vary because the model
specification code is run in python, and these subscripts are incremented every time
keras_model_sequential() is called.
Next, we add details to the model to specify the fitting algorithm. We fit the model by minimizing the cross-entropy function given by (10.13).
```r
modelnn %>% compile(
    loss = "categorical_crossentropy",
    optimizer = optimizer_rmsprop(),
    metrics = c("accuracy")
  )
```
Now we are ready to go. The final step is to supply training data, and fit the model.
```r
system.time(
  history <- modelnn %>%
#      fit(x_train, y_train, epochs = 30, batch_size = 128,
      fit(x_train, y_train, epochs = 15, batch_size = 128,
          validation_split = 0.2)
)
plot(history, smooth = FALSE)
```
We have suppressed the output here, which is a progress report on the
fitting of the model, grouped by epoch. This is very useful, since on
large datasets fitting can take time. Fitting this model took 144
seconds on a 2.9GHz MacBook Pro with 4 cores and 32 GB of RAM.
Here we specified a
validation split of 20%, so the training is actually performed on
80% of the 60,000 observations in the training set. This is an
alternative to actually supplying validation data, like we did in
Section 10.9.1. See
?fit.keras.engine.training.Model for all the optional fitting arguments. SGD uses batches
of 128 observations in computing the gradient, and doing the
arithmetic, we see that an epoch corresponds to 375 gradient steps.
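The arithmetic behind this figure:
```r
0.8 * 60000 / 128  # SGD steps per epoch
# 375
```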
The last plot() command produces a figure similar to Figure 10.18.
To obtain the test error in Table 10.1, we first write
a simple function accuracy() that compares predicted and true
class labels, and then use it to evaluate our predictions.
```r
accuracy <- function(pred, truth) {
  mean(drop(pred) == drop(truth))
}
modelnn %>% predict_classes(x_test) %>% accuracy(g_test)
```
The table also reports LDA (Chapter 4) and multiclass logistic
regression. Although packages such as glmnet can handle
multiclass logistic regression, they are quite slow on this large
dataset. It is much faster and quite easy to fit such a model
using the keras software. We just have an input layer and output layer, and omit the hidden layers!
```r
modellr <- keras_model_sequential() %>%
  layer_dense(input_shape = 784, units = 10, activation = "softmax")
summary(modellr)
```
We fit the model just as before.
```r
modellr %>% compile(loss = "categorical_crossentropy",
    optimizer = optimizer_rmsprop(), metrics = c("accuracy"))
modellr %>% fit(x_train, y_train, epochs = 30,
    batch_size = 128, validation_split = 0.2)
modellr %>% predict_classes(x_test) %>% accuracy(g_test)
```
In this section we fit a CNN to the CIFAR100 data, which is available in the keras
package. It is arranged in a fashion similar to the MNIST data.
```r
cifar100 <- dataset_cifar100()
names(cifar100)
x_train <- cifar100$train$x
g_train <- cifar100$train$y
x_test <- cifar100$test$x
g_test <- cifar100$test$y
dim(x_train)
range(x_train[1,,, 1])
```
The array of 50,000 training images has four dimensions: each three-color image is represented as a set of three channels, each of which consists of $32\times 32$ eight-bit pixels. We standardize as we did for the digits, but keep the array structure. We one-hot encode the response factors to produce a 100-column binary matrix.
```r
x_train <- x_train / 255
x_test <- x_test / 255
y_train <- to_categorical(g_train, 100)
dim(y_train)
```
Before we start, we look at some of the training images using the jpeg package; similar code produced
Figure 10.5 on page 411.
```r
library(jpeg)
par(mar = c(0, 0, 0, 0), mfrow = c(5, 5))
index <- sample(seq(50000), 25)
for (i in index) plot(as.raster(x_train[i,,, ]))
```
The as.raster() function converts the feature map so that it can be plotted as a color image.
Here we specify a moderately-sized CNN for demonstration purposes, similar in structure to Figure 10.8.
```r
model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3),
      padding = "same", activation = "relu",
      input_shape = c(32, 32, 3)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3),
      padding = "same", activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 128, kernel_size = c(3, 3),
      padding = "same", activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 256, kernel_size = c(3, 3),
      padding = "same", activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 512, activation = "relu") %>%
  layer_dense(units = 100, activation = "softmax")
summary(model)
```
Notice that we used the padding = "same" argument to
layer_conv_2d(), which ensures that the output channels have the
same dimension as the input channels. There are 32 channels in the first
hidden layer, in contrast to the three channels in the input layer. We
use a $3\times 3$ convolution filter for each channel in all the layers. Each
convolution is followed by a max-pooling layer over $2\times2$
blocks. By studying the summary, we can see that the channels halve in both
dimensions
after each of these max-pooling operations. After the last of these we
have a layer with 256 channels of dimension $2\times 2$. These are then
flattened to a dense layer of size 1,024:
in other words, each of the $2\times 2$ matrices is turned into a
$4$-vector, and put side-by-side in one layer. This is followed by a
dropout regularization layer, then
another dense layer of size 512, which finally reaches the softmax
output layer.
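As a rough check on these dimensions (a back-of-the-envelope calculation, not part of the model-fitting code), the $32\times 32$ spatial grid is halved by each of the four pooling layers, and the final $2\times 2\times 256$ volume flattens to 1,024 units:
```r
32 / 2^(1:4)  # spatial dimension after each 2 x 2 max-pooling layer
# 16  8  4  2
2 * 2 * 256   # size of the flattened layer
# 1024
```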
Finally, we specify the fitting algorithm, and fit the model.
```r
model %>% compile(loss = "categorical_crossentropy",
    optimizer = optimizer_rmsprop(), metrics = c("accuracy"))
# history <- model %>% fit(x_train, y_train, epochs = 30,
history <- model %>% fit(x_train, y_train, epochs = 10,
    batch_size = 128, validation_split = 0.2)
model %>% predict_classes(x_test) %>% accuracy(g_test)
```
This model takes 10 minutes to run and achieves 46% accuracy on the test data. Although this is not terrible for 100-class data (a random classifier gets 1% accuracy), searching the web we see results around 75%. Typically it takes a lot of architecture carpentry, fiddling with regularization, and time to achieve such results.
We now show how to use a CNN pretrained on the imagenet database to classify natural
images, and demonstrate how we produced Figure 10.10.
We copied six jpeg images from a digital photo album into the
directory book_images. (These images are available
from the data section of www.statlearning.com, the ISL book website. Download book_images.zip; when
unzipped, it creates the book_images directory.) We first read in the images, and
convert them into the array format expected by the keras
software to match the specifications in imagenet. Make sure that your working directory in R is set to the folder in which the images are stored.
```r
img_dir <- "book_images"
image_names <- list.files(img_dir)
num_images <- length(image_names)
x <- array(dim = c(num_images, 224, 224, 3))
for (i in 1:num_images) {
  img_path <- paste(img_dir, image_names[i], sep = "/")
  img <- image_load(img_path, target_size = c(224, 224))
  x[i,,, ] <- image_to_array(img)
}
x <- imagenet_preprocess_input(x)
```
We then load the trained network. The model has 50 layers, with a fair bit of complexity.
```r
model <- application_resnet50(weights = "imagenet")
summary(model)
```
Finally, we classify our six images, and return the top three class choices in terms of predicted probability for each.
```r
pred6 <- model %>% predict(x) %>%
  imagenet_decode_predictions(top = 3)
names(pred6) <- image_names
print(pred6)
```
Now we perform document classification (Section 10.4) on the IMDB dataset, which is available as part of the keras
package. We limit the dictionary size to the
10,000 most frequently used words and tokens.
```r
max_features <- 10000
imdb <- dataset_imdb(num_words = max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb
```
The third line is a shortcut for unpacking the list of lists.
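The %<-% multiple-assignment operator comes from the zeallot package, which keras makes available. If you prefer, the same unpacking can be done with ordinary assignments; the sketch below assumes the train/test list structure returned by dataset_imdb():
```r
# equivalent to the %<-% line above
x_train <- imdb$train$x
y_train <- imdb$train$y
x_test <- imdb$test$x
y_test <- imdb$test$y
```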
Each element of x_train is a vector of numbers between 0 and
9999 (the document), referring to the words found in the dictionary.
For example, the first training document is the positive review on
page 419. The indices of the first 12 words are given below.
```r
x_train[[1]][1:12]
```
To see the words, we create a function, decode_review(), that provides a simple interface to the dictionary.
```r
word_index <- dataset_imdb_word_index()
decode_review <- function(text, word_index) {
  word <- names(word_index)
  idx <- unlist(word_index, use.names = FALSE)
  word <- c("<PAD>", "<START>", "<UNK>", "<UNUSED>", word)
  idx <- c(0:3, idx + 3)
  words <- word[match(text, idx, 2)]
  paste(words, collapse = " ")
}
decode_review(x_train[[1]][1:12], word_index)
```
Next we write a function to one-hot encode each document in a list of documents, and return a binary matrix in sparse-matrix format.
```r
library(Matrix)
one_hot <- function(sequences, dimension) {
  seqlen <- sapply(sequences, length)
  n <- length(seqlen)
  rowind <- rep(1:n, seqlen)
  colind <- unlist(sequences)
  sparseMatrix(i = rowind, j = colind, dims = c(n, dimension))
}
```
To construct the sparse matrix, one supplies just the entries that are
nonzero. In the last line we call the function sparseMatrix() and supply the
row indices corresponding to each document and the column indices
corresponding to the words in each document. Since we omit the
values, they are taken to be all ones.
Words that appear more than once in any given document still get
recorded as a one.
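A tiny example, not part of the lab, illustrates this behavior of sparseMatrix():
```r
# two "documents": the first contains words 2 and 5, the second word 3
m <- sparseMatrix(i = c(1, 1, 2), j = c(2, 5, 3), dims = c(2, 6))
m  # with no values supplied, the stored entries are logical TRUE, i.e. ones
```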
```r
x_train_1h <- one_hot(x_train, 10000)
x_test_1h <- one_hot(x_test, 10000)
dim(x_train_1h)
nnzero(x_train_1h) / (25000 * 10000)
```
Only 1.3% of the entries are nonzero, so this amounts to considerable savings in memory. We create a validation set of size 2,000, leaving 23,000 for training.
```r
set.seed(3)
ival <- sample(seq(along = y_train), 2000)
```
First we fit a lasso logistic regression model using glmnet()
on the training data, and evaluate its performance on the validation
data. Finally, we plot the accuracy, acclmv, as a function of
the shrinkage parameter, $\lambda$. Similar expressions compute the
performance on the test data, and were used to produce the left plot
in Figure 10.11.
The code takes advantage of the sparse-matrix format of
x_train_1h, and runs in about 5 seconds; in the usual
dense format it would take about 5 minutes.
```r
library(glmnet)
fitlm <- glmnet(x_train_1h[-ival, ], y_train[-ival],
    family = "binomial", standardize = FALSE)
classlmv <- predict(fitlm, x_train_1h[ival, ]) > 0
acclmv <- apply(classlmv, 2, accuracy, y_train[ival] > 0)
```
We applied the accuracy() function that we wrote in Lab 10.9.2
to every column of the prediction matrix
classlmv, and since this is a logical matrix of TRUE/FALSE values, we
supply the second argument truth as a logical vector as well.
Before making a plot, we adjust the plotting window.
```r
par(mar = c(4, 4, 4, 4), mfrow = c(1, 1))
plot(-log(fitlm$lambda), acclmv)
```
Next we fit a fully-connected neural network with two hidden layers, each with 16 units and ReLU activation.
```r
model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")
model %>% compile(optimizer = "rmsprop",
    loss = "binary_crossentropy", metrics = c("accuracy"))
history <- model %>% fit(x_train_1h[-ival, ], y_train[-ival],
    epochs = 20, batch_size = 512,
    validation_data = list(x_train_1h[ival, ], y_train[ival]))
```
The `history` object has a `metrics` component that records both the training and validation accuracy at each epoch. Figure 10.11 includes test accuracy at each epoch as well. To compute the test accuracy, we rerun the entire sequence above, replacing the last line with
```r
history <- model %>% fit(
    x_train_1h[-ival, ], y_train[-ival], epochs = 20,
    batch_size = 512, validation_data = list(x_test_1h, y_test)
  )
```
In this lab we fit the models illustrated in Section 10.5.
Here we fit a simple LSTM RNN for sentiment analysis with
the IMDB movie-review data, as discussed in Section 10.5.1. We showed how to input the data in
Section 10.9.5, so we will not repeat that here.
We first calculate the lengths of the documents.
```r
wc <- sapply(x_train, length)
median(wc)
sum(wc <= 500) / length(wc)
```
We see that over 91% of the documents have fewer than 500 words. Our RNN requires all the document sequences to have the same length. We hence restrict the document lengths to the last $L=500$ words, and pad the beginning of the shorter ones with blanks.
```r
maxlen <- 500
x_train <- pad_sequences(x_train, maxlen = maxlen)
x_test <- pad_sequences(x_test, maxlen = maxlen)
dim(x_train)
dim(x_test)
x_train[1, 490:500]
```
The last expression shows the last few words in the first document. At this stage, each of the 500 words in the document is represented using an integer corresponding to the location of that word in the 10,000-word dictionary. The first layer of the RNN is an embedding layer of size 32, which will be learned during training. This layer one-hot encodes each document as a matrix of dimension $500 \times 10,000$, and then maps these $10,000$ dimensions down to $32$.
```r
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 10000, output_dim = 32) %>%
  layer_lstm(units = 32) %>%
  layer_dense(units = 1, activation = "sigmoid")
```
The second layer is an LSTM with 32 units, and the output layer is a single sigmoid for the binary classification task.
The rest is now similar to other networks we have fit. We track the test performance as the network is fit, and see that it attains 87\% accuracy.
```r
model %>% compile(optimizer = "rmsprop",
    loss = "binary_crossentropy", metrics = c("acc"))
# history <- model %>% fit(x_train, y_train, epochs = 10,
history <- model %>% fit(x_train, y_train, epochs = 3,
    batch_size = 128, validation_data = list(x_test, y_test))
plot(history)
predy <- predict(model, x_test) > 0.5
mean(abs(y_test == as.numeric(predy)))
```
We now show how to fit the models in Section 10.5.2 for time series prediction. We first set up the data, and standardize each of the variables.
```r
library(ISLR2)
xdata <- data.matrix(
    NYSE[, c("DJ_return", "log_volume", "log_volatility")]
  )
istrain <- NYSE[, "train"]
xdata <- scale(xdata)
```
The variable istrain contains a TRUE for each observation (trading day) that is in the training set, and a FALSE for each observation
in the test set.
We first write functions to create lagged versions of the three time series. We start with a function that takes as input a data
matrix and a lag $L$, and returns a lagged version of the matrix. It
simply inserts $L$ rows of NA at the top, and truncates the
bottom.
```r
lagm <- function(x, k = 1) {
  n <- nrow(x)
  pad <- matrix(NA, k, ncol(x))
  rbind(pad, x[1:(n - k), ])
}
```
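To see what lagm() does, here is a small demonstration on a toy matrix, not part of the lab's analysis:
```r
toy <- matrix(1:8, ncol = 2)  # a 4 x 2 matrix
lagm(toy, 2)  # first two rows are NA; row 3 holds the original row 1, etc.
```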
We now use this function to create a data frame with all the required lags, as well as the response variable.
```r
arframe <- data.frame(log_volume = xdata[, "log_volume"],
    L1 = lagm(xdata, 1), L2 = lagm(xdata, 2),
    L3 = lagm(xdata, 3), L4 = lagm(xdata, 4),
    L5 = lagm(xdata, 5)
  )
```
If we look at the first five rows of this frame, we will see some
missing values in the lagged variables (due to the construction above). We remove these rows, and adjust istrain
accordingly.
```r
arframe <- arframe[-(1:5), ]
istrain <- istrain[-(1:5)]
```
We now fit the linear AR model to the training data using lm(), and predict on the
test data.
```r
arfit <- lm(log_volume ~ ., data = arframe[istrain, ])
arpred <- predict(arfit, arframe[!istrain, ])
V0 <- var(arframe[!istrain, "log_volume"])
1 - mean((arpred - arframe[!istrain, "log_volume"])^2) / V0
```
The last two lines compute the $R^2$ on the test data, as defined in (3.17).
We refit this model, including the factor variable day_of_week.
```r
arframed <- data.frame(day = NYSE[-(1:5), "day_of_week"], arframe)
arfitd <- lm(log_volume ~ ., data = arframed[istrain, ])
arpredd <- predict(arfitd, arframed[!istrain, ])
1 - mean((arpredd - arframe[!istrain, "log_volume"])^2) / V0
```
To fit the RNN, we need to reshape these data, since it expects a sequence of $L=5$ feature vectors $X=\{X_\ell\}_1^L$ for each observation, as in (10.20) on page 428. These are lagged versions of the time series going back $L$ time points.
```r
n <- nrow(arframe)
xrnn <- data.matrix(arframe[, -1])
xrnn <- array(xrnn, c(n, 3, 5))
xrnn <- xrnn[,, 5:1]
xrnn <- aperm(xrnn, c(1, 3, 2))
dim(xrnn)
```
We have done this in four steps. The first simply extracts the
$n\times 15$ matrix of lagged versions of the three predictor
variables from arframe. The second converts this matrix to an
$n\times 3\times 5$ array. We can do this by simply changing the
dimension attribute, since the new array is filled column-wise. The
third step reverses the order of lagged variables, so that index $1$
is furthest back in time, and index $5$ closest. The
final step rearranges the coordinates of the array (like a partial
transpose) into the format that the RNN module in keras
expects.
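The following toy example, again not part of the lab itself, may help to see how array() fills an array column-wise and how aperm() rearranges dimensions; it mimics the steps above with one observation, two variables, and three lags:
```r
m <- matrix(1:6, nrow = 1)   # one observation with six lagged values
a <- array(m, c(1, 2, 3))    # filled column-wise
a[1, , ]                     # 2 x 3: rows index variables, columns index lags
aperm(a, c(1, 3, 2))[1, , ]  # 3 x 2: rows now index lags, columns variables
```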
Now we are ready to proceed with the RNN, which uses 12 hidden units.
```r
model <- keras_model_sequential() %>%
  layer_simple_rnn(units = 12,
      input_shape = list(5, 3),
      dropout = 0.1, recurrent_dropout = 0.1) %>%
  layer_dense(units = 1)
model %>% compile(optimizer = optimizer_rmsprop(), loss = "mse")
```
We specify two forms of dropout for the units feeding into the hidden layer. The first is for the input sequence feeding into this layer, and the second is for the previous hidden units feeding into the layer. The output layer has a single unit for the response.
We fit the model in a similar fashion to previous networks. We
supply the fit function with test data as validation data, so that when
we monitor its progress and plot the history function we can see the
progress on the test data. Of course we should not use this as a basis for
early stopping, since then the test performance would be biased.
```r
history <- model %>% fit(
    xrnn[istrain,, ], arframe[istrain, "log_volume"],
#    batch_size = 64, epochs = 200,
    batch_size = 64, epochs = 75,
    validation_data =
      list(xrnn[!istrain,, ], arframe[!istrain, "log_volume"])
  )
kpred <- predict(model, xrnn[!istrain,, ])
1 - mean((kpred - arframe[!istrain, "log_volume"])^2) / V0
```
This model takes about one minute to train.
We could replace the keras_model_sequential() command above with the following command:
```r
model <- keras_model_sequential() %>%
  layer_flatten(input_shape = c(5, 3)) %>%
  layer_dense(units = 1)
```
Here, layer_flatten() simply takes the input sequence and
turns it into a long vector of predictors. This results in a linear AR model.
To fit a nonlinear AR model, we could add in a hidden layer.
However, since we already have the matrix of lagged variables from the AR
model that we fit earlier using the lm() command, we can actually fit a nonlinear AR model without needing to perform flattening.
We extract the model matrix x from arframed, which
includes the day_of_week variable.
```r
x <- model.matrix(log_volume ~ . - 1, data = arframed)
colnames(x)
```
The -1 in the formula avoids the creation of a column of ones for
the intercept. The variable day_of_week is a five-level
factor (there are five trading days), and the
-1 results in five rather than four dummy variables.
The rest of the steps to fit a nonlinear AR model should by now be familiar.
```r
arnnd <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu",
      input_shape = ncol(x)) %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1)
arnnd %>% compile(loss = "mse",
    optimizer = optimizer_rmsprop())
history <- arnnd %>% fit(
#    x[istrain, ], arframe[istrain, "log_volume"], epochs = 100,
    x[istrain, ], arframe[istrain, "log_volume"], epochs = 30,
    batch_size = 32,
    validation_data =
      list(x[!istrain, ], arframe[!istrain, "log_volume"])
  )
plot(history)
npred <- predict(arnnd, x[!istrain, ])
1 - mean((arframe[!istrain, "log_volume"] - npred)^2) / V0
```