knitr::opts_chunk$set(echo = TRUE, eval = FALSE)

In a regression problem, we aim to predict a continuous value, like a price or a probability. Contrast this with a classification problem, where we aim to predict a discrete label (for example, whether a picture contains an apple or an orange).

This notebook builds a model to predict the median price of homes in a Boston suburb during the mid-1970s. To do this, we'll provide the model with some data points about the suburb, such as the crime rate and the local property tax rate.

library(keras)

The Boston Housing Prices dataset

The Boston Housing Prices dataset is accessible directly from keras.

boston_housing <- dataset_boston_housing()

c(train_data, train_labels) %<-% boston_housing$train
c(test_data, test_labels) %<-% boston_housing$test
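
The %<-% operator comes from the zeallot package (re-exported by keras) and simply destructures the list. If you prefer plain assignments, here is an equivalent sketch, assuming the list components are named x and y as in the keras datasets API:

# Equivalent to the %<-% destructuring above
train_data   <- boston_housing$train$x
train_labels <- boston_housing$train$y
test_data    <- boston_housing$test$x
test_labels  <- boston_housing$test$y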

Examples and features

This dataset is much smaller than the others we've worked with so far: it has 506 total examples that are split between 404 training examples and 102 test examples:

paste0("Training entries: ", length(train_data), ", labels: ", length(train_labels))
[1] "Training entries: 5252, labels: 404"

The dataset contains 13 different features:

- CRIM: per capita crime rate.
- ZN: proportion of residential land zoned for lots over 25,000 square feet.
- INDUS: proportion of non-retail business acres per town.
- CHAS: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise).
- NOX: nitric oxides concentration (parts per 10 million).
- RM: average number of rooms per dwelling.
- AGE: proportion of owner-occupied units built prior to 1940.
- DIS: weighted distances to five Boston employment centers.
- RAD: index of accessibility to radial highways.
- TAX: full-value property-tax rate per $10,000.
- PTRATIO: pupil-teacher ratio by town.
- B: 1000 * (Bk - 0.63)^2, where Bk is the proportion of Black residents by town.
- LSTAT: percentage of the population with lower socioeconomic status.

Each of these features is stored on a different scale. Some are proportions between 0 and 1, others are values ranging between 1 and 12, some range between 0 and 100, and so on.

train_data[1, ] # Display sample features, notice the different scales
[1]   1.23247   0.00000   8.14000   0.00000   0.53800   6.14200  91.70000   3.97690
[9]   4.00000 307.00000  21.00000 396.90000  18.72000

Let's add column names for better data inspection.

library(tibble)

column_names <- c('CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 
                  'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT')
train_df <- as_tibble(train_data)
colnames(train_df) <- column_names

train_df
# A tibble: 404 x 13
     CRIM    ZN INDUS  CHAS   NOX    RM   AGE   DIS   RAD   TAX PTRATIO     B LSTAT
    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
 1 1.23     0    8.14     0 0.538  6.14  91.7  3.98     4   307    21    397. 18.7 
 2 0.0218  82.5  2.03     0 0.415  7.61  15.7  6.27     2   348    14.7  395.  3.11
 3 4.90     0   18.1      0 0.631  4.97 100    1.33    24   666    20.2  376.  3.26
 4 0.0396   0    5.19     0 0.515  6.04  34.5  5.99     5   224    20.2  397.  8.01
 5 3.69     0   18.1      0 0.713  6.38  88.4  2.57    24   666    20.2  391. 14.6 
 6 0.284    0    7.38     0 0.493  5.71  74.3  4.72     5   287    19.6  391. 11.7 
 7 9.19     0   18.1      0 0.7    5.54 100    1.58    24   666    20.2  397. 23.6 
 8 4.10     0   19.6      0 0.871  5.47 100    1.41     5   403    14.7  397. 26.4 
 9 2.16     0   19.6      0 0.871  5.63 100    1.52     5   403    14.7  169. 16.6 
10 1.63     0   21.9      0 0.624  5.02 100    1.44     4   437    21.2  397. 34.4 
# ... with 394 more rows
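
To see the differing scales as numbers rather than by eye, a quick per-column check of the minimum and maximum values works; a minimal sketch using base R on the tibble we just built:

# Min and max of each feature, highlighting the very different ranges
sapply(train_df, range)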

Labels

The labels are the house prices in thousands of dollars. (You may notice the mid-1970s prices.)

train_labels[1:10] # Display first 10 entries
[1] 15.2 42.3 50.0 21.1 17.7 18.5 11.3 15.6 15.6 14.4
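
A quick five-number summary confirms the labels are house prices in the tens of thousands of dollars (a small sketch):

# Distribution of the training labels (prices in $1000s)
summary(train_labels)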

Normalize features

It's recommended to normalize features that use different scales and ranges. Although the model might converge without feature normalization, skipping it makes training more difficult and makes the resulting model more dependent on the choice of units used in the input.

# Test data is *not* used when calculating the mean and std.

# Normalize training data
train_data <- scale(train_data) 

# Use means and standard deviations from training set to normalize test set
col_means_train <- attr(train_data, "scaled:center") 
col_stddevs_train <- attr(train_data, "scaled:scale")
test_data <- scale(test_data, center = col_means_train, scale = col_stddevs_train)

train_data[1, ] # First training sample, normalized
[1] -0.2719092 -0.4830166 -0.4352220 -0.2565147 -0.1650220 -0.1762241  0.8120550  0.1165538 -0.6254735
[10] -0.5944330  1.1470781  0.4475222  0.8241983
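
As a quick sanity check (a sketch, not part of the original workflow), the normalized training columns should now have mean roughly 0 and standard deviation roughly 1:

# Means should be ~0 and standard deviations ~1 after scaling
round(colMeans(train_data), 6)
round(apply(train_data, 2, sd), 6)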

Create the model

Let's build our model. Here, we'll use a sequential model with two densely connected hidden layers and an output layer that returns a single, continuous value. The model-building steps are wrapped in a function, build_model, since we'll create a second model later on.

build_model <- function() {

  model <- keras_model_sequential() %>%
    layer_dense(units = 64, activation = "relu",
                input_shape = dim(train_data)[2]) %>%
    layer_dense(units = 64, activation = "relu") %>%
    layer_dense(units = 1)

  model %>% compile(
    loss = "mse",
    optimizer = optimizer_rmsprop(),
    metrics = list("mean_absolute_error")
  )

  model
}

model <- build_model()
model %>% summary()
_____________________________________________________________________________________
Layer (type)                          Output Shape                      Param #      
=====================================================================================
dense_5 (Dense)                       (None, 64)                        896          
_____________________________________________________________________________________
dense_6 (Dense)                       (None, 64)                        4160         
_____________________________________________________________________________________
dense_7 (Dense)                       (None, 1)                         65           
=====================================================================================
Total params: 5,121
Trainable params: 5,121
Non-trainable params: 0
_____________________________________________________________________________________
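
The parameter counts in the summary follow from (inputs x units) + biases for each dense layer; as a quick check:

13 * 64 + 64  # first hidden layer: 896 parameters
64 * 64 + 64  # second hidden layer: 4160 parameters
64 * 1 + 1    # output layer: 65 parameters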

Train the model

The model is trained for 500 epochs, recording training and validation metrics in a keras_training_history object. We also show how to use a custom callback, replacing the default training output with a single dot per epoch.

# Display training progress by printing a single dot for each completed epoch.
print_dot_callback <- callback_lambda(
  on_epoch_end = function(epoch, logs) {
    if (epoch %% 80 == 0) cat("\n")
    cat(".")
  }
)    

epochs <- 500

# Fit the model and store training stats
history <- model %>% fit(
  train_data,
  train_labels,
  epochs = epochs,
  validation_split = 0.2,
  verbose = 0,
  callbacks = list(print_dot_callback)
)

Now, we visualize the model's training progress using the metrics stored in the history variable. We want to use this data to determine how long to train before the model stops making progress.

library(ggplot2)

plot(history, metrics = "mean_absolute_error", smooth = FALSE) +
  coord_cartesian(ylim = c(0, 5))
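
If you want more control over the plot, the history object can also be converted to a data frame and plotted with ggplot2 directly; a sketch, assuming as.data.frame() support for keras_training_history objects and that the metric is recorded under the same name used above:

# Columns: epoch, value, metric, data ("training" / "validation")
history_df <- as.data.frame(history)
ggplot(subset(history_df, metric == "mean_absolute_error"),
       aes(x = epoch, y = value, color = data)) +
  geom_line() +
  coord_cartesian(ylim = c(0, 5)) +
  labs(y = "Mean absolute error")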

This graph shows little improvement in the model after about 200 epochs. Let's update the fit call to automatically stop training when the validation score doesn't improve. We'll use a callback that tests a training condition after every epoch. If a set number of epochs elapses without improvement, training stops automatically.

# The patience parameter is the number of epochs to check for improvement.
early_stop <- callback_early_stopping(monitor = "val_loss", patience = 20)

model <- build_model()
history <- model %>% fit(
  train_data,
  train_labels,
  epochs = epochs,
  validation_split = 0.2,
  verbose = 0,
  callbacks = list(early_stop, print_dot_callback)
)
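
To confirm when early stopping actually halted training, count the epochs recorded in the history object (a sketch, assuming the metrics are stored as numeric vectors):

# Number of epochs that ran before early stopping triggered
length(history$metrics$val_loss)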

plot(history, metrics = "mean_absolute_error", smooth = FALSE) +
  coord_cartesian(xlim = c(0, 150), ylim = c(0, 5))

The graph shows the average error is about $2,500. Is this good? Well, $2,500 is not an insignificant amount when some of the labels are only $15,000.

Let's see how the model performs on the test set:

c(loss, mae) %<-% (model %>% evaluate(test_data, test_labels, verbose = 0))

paste0("Mean absolute error on test set: $", sprintf("%.2f", mae * 1000))
[1] "Mean absolute error on test set: $3297.68"

Predict

Finally, predict some housing prices using data in the testing set:

test_predictions <- model %>% predict(test_data)
test_predictions[ , 1]
 [1]  9.159314 17.955666 20.573296 31.487156 25.231384 18.803967 27.139153
 [8] 21.052799 18.904579 22.056618 19.140137 17.342262 15.233129 42.001091
[15] 19.280727 19.559774 27.276485 20.737257 19.391312 38.450863 12.431134
[22] 16.025173 19.910103 14.362184 20.846870 24.595688 31.234753 30.109112
[29] 11.271907 21.081585 18.724422 14.542423 33.109241 25.842684 18.071476
[36]  9.046785 14.701134 17.113651 21.169674 27.008324 29.676132 28.280304
[43] 15.355518 41.252007 29.731274 24.258526 25.582203 16.032135 24.014944
[50] 22.071520 35.658638 19.342590 13.662583 15.854269 34.375328 28.051319
[57] 13.002036 47.801872 33.513954 23.775620 25.214602 17.864346 14.284246
[64] 17.458893 22.757492 22.424841 13.578171 22.530212 15.667303  7.438343
[71] 38.318726 29.219141 25.282124 15.476329 24.670732 17.125381 20.079552
[78] 23.601147 34.540359 12.151771 19.177418 37.980789 15.576267 14.904464
[85] 17.581717 17.851192 20.480953 19.700697 21.921551 31.415789 19.116734
[92] 21.192280 24.934101 41.778465 35.113403 19.307007 35.754066 53.983509
[99] 26.797831 44.472233 32.520882 19.591730
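
As a final check, we can compare the predictions with the true test labels directly; a minimal sketch using base R graphics:

# Recompute the mean absolute error by hand (should match evaluate() above)
mean(abs(test_predictions[, 1] - test_labels))

# Predicted vs. actual prices; points on the blue line would be perfect predictions
plot(test_labels, test_predictions[, 1],
     xlab = "True prices ($1000s)", ylab = "Predicted prices ($1000s)")
abline(0, 1, col = "blue")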

Conclusion

This notebook introduced a few techniques for handling a regression problem: mean squared error (MSE) is a common loss function for regression, mean absolute error (MAE) is a common evaluation metric, features with different scales should be normalized, and early stopping lets us halt training once the validation score stops improving.



