Over Sampling for Time Series Classification

 require(knitcitations)
 cleanbib()
 options("citation_format" = "pandoc")

Abstract

The OSTSC package is a powerful oversampling approach for classifying univariate, but multinomial, time series data. This vignette provides a brief overview of the oversampling methodology implemented by the package, together with a tutorial. We begin by providing three test cases for the user to quickly validate the functionality in the package. To demonstrate the performance impact of OSTSC, we then provide two medium size imbalanced time series datasets. Each example applies a TensorFlow implementation of a Long Short-Term Memory (LSTM) classifier, a type of Recurrent Neural Network (RNN) classifier, to imbalanced time series. The classifier performance is compared with and without oversampling. Finally, larger versions of these two datasets are evaluated to demonstrate the scalability of the package. The examples demonstrate that the OSTSC package improves the performance of RNN classifiers applied to highly imbalanced time series data. In particular, OSTSC is observed to increase the AUC of the LSTM from 0.543 to 0.784 on a high frequency trading dataset consisting of 30,000 time series observations.

Introduction

A significant number of learning problems involve the accurate classification of rare events or outliers from time series data. Examples include the detection of a flash crash, rogue trading, or a heart arrhythmia in an electrocardiogram. Due to the rarity of these events, machine learning classifiers for detecting them may be biased towards avoiding false positives, because any potential for false positives is greatly amplified by the number of negative samples in the data set.

Class imbalance problems are most easily addressed by treating the observations as conditionally independent. Then standard statistical techniques, such as oversampling the minority class, undersampling the majority class, or both, are applicable. @More2016 compared the classification performance of a batch of resampling techniques on imbalanced datasets. Beyond the conventional resampling approaches, More showed how ensemble methods retain as much original information from the majority class as possible when performing undersampling. Ensemble methods perform well and have gained popularity in the data mining literature. @Dubey2014 studied an ensemble system of feature selection and data sampling on an imbalanced Alzheimer's Disease Neuroimaging Initiative dataset.

However, the imbalanced time series classification problem is more complex because the time dimension must be accounted for. Not only is the assumption that the observations are conditionally independent too strong, but the predictors may also be cross-correlated. The sample correlation structure may weaken or be entirely lost under the conventional resampling approaches described above.

There are two existing research directions for imbalanced time series classification. One, proposed by @Cao2011, preserves the covariance structure during oversampling. Another, proposed by @Liang2012, conducts undersampling with various learning algorithms. Both approaches are limited to binary classification and do not consider the more general multinomial problem.

A key assertion by @Cao2014 is that a time series sampling scheme should preserve the covariance structure. When the observations are conditionally dependent, this approach has been shown to outperform other sampling approaches such as undersampling the majority class, oversampling the minority class, and SMOTE. Our R package Over Sampling for Time Series Classification (OSTSC) is built on this idea. OSTSC first implements Enhanced Structure Preserving Oversampling (ESPO) of the minority class. It then uses a nearest neighbor method from the SMOTE family to generate the remaining synthetic positives. Specifically, it uses an Adaptive Synthetic Sampling Approach for Imbalanced Learning (ADASYN). Note that other packages such as @smotefamily2017 already implement SMOTE sampling techniques, including ADASYN. However, an implementation of ADASYN is provided in OSTSC for compatibility with the format required for use with ESPO and TensorFlow.

For examining the performance of oversampling for time series classification, RNNs are preferred (@Graves2013). Recently @Dixon2017 applied RNNs to imbalanced time series data used in high frequency trading. The RNN classifier predicts a price-flip in the limit order book based on a sequence of limit order book depths and market orders. That approach uses standard undersampling of the majority class to improve the classifier performance. OSTSC provides a univariate sample of this data and demonstrates the application of ESPO and ADASYN to improve the performance of the RNN. The RNN is implemented in 'TensorFlow' (@Abadi2016) and made available in R by using a wrapper for 'Keras' (@keras), a high-level API for 'TensorFlow'.

The current version of the package supports only univariate classification of time series. The extension to multiple features requires tensor computations which are not implemented here.

Overview

This vignette provides a brief description of the sampling methodologies implemented. We introduce the OSTSC package and illustrate its application using various examples. For validation purposes only, we first apply OSTSC to three small built-in toy datasets. These datasets are not sufficiently large to demonstrate the methodology. However, they can be used to quickly verify that the OSTSC function generates a balanced dataset.

For demonstrating the effect of OSTSC on LSTM performance, we provide two medium size datasets that require only moderate computation. Finally, to demonstrate scalability, we evaluate OSTSC on two larger datasets. The reader is advised that the total amount of computation in this case is significant. We would therefore expect a user to test the OSTSC functionality on the small or medium size datasets, and to reserve running the larger dataset examples for a higher performance machine. The medium and large datasets are not built in, to keep the package size within 5MB.

Background

ESPO is used to generate a large percentage of the synthetic minority samples from univariate labeled time series, under the modeling assumption that the predictors are Gaussian. ESPO estimates the covariance structure of the minority-class samples and applies a spectral filter to reduce noise. ADASYN is a nearest neighbor interpolation approach which is subsequently applied to the ESPO samples (@Cao2013).

More formally, given the positively labeled time series $P = \left\{ x_{11}, x_{12}, ..., x_{1|P|}\right\}$ and the negative time series $N = \left\{ x_{01}, x_{02}, ..., x_{0|N|}\right\}$, where $|N| \gg |P|$ and $x_{ij} \in \mathbb{R}^{n \times 1}$, the new samples are generated by the following steps:

  1. Removal of the Common Null Space

Let $q_{ij} = L_{s}^{T}x_{ij}$ represent $x_{ij}$ in a lower-dimensional signal space, where $L_{s}$ consists of eigenvectors in the signal space.

  2. ESPO

Let

$$W_p=\frac{1}{|P|}\sum_{j=1}^{|P|}(q_{1j}-\bar{q}_1)(q_{1j}-\bar{q}_1)^T,$$ and let $\hat{D}=V^TW_pV$ be the eigendecomposition with the diagonal matrix $\hat{D}$ of regularized eigenvalues $\left\{ \hat{d}_{1}, ..., \hat{d}_{n}\right\}$, sorted in descending order, and with orthogonal eigenvector matrix $V$. Then we generate a synthetic positive sample from

$$b=\hat{D}^{1/2}V^Tz + \bar{q}_{1}.$$

Here $z$ is drawn from a zero-mean mixed Gaussian distribution and $\bar{q}_{1}$ is the corresponding positive-class mean vector. The oversampling is repeated until all $(|N|-|P|)r$ required synthetic samples are generated, where $r\in[0,1]$ is the fraction of synthetic samples contributed by ESPO, which is chosen empirically. The remaining fraction $(1-r)$ of the samples is generated by the interpolation procedure described next.
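To make the sampling step concrete, the following minimal sketch (a hypothetical helper, not the package internals) mirrors the formula above: it estimates the covariance of the transformed minority samples, takes its eigendecomposition, and maps a Gaussian draw through $\hat{D}^{1/2}V^{T}$ before adding the class mean. A plain standard Gaussian draw stands in for the mixed Gaussian, and cov() uses the $1/(|P|-1)$ convention rather than $1/|P|$.

espo_sample <- function(q1) { # q1: |P| x n matrix of transformed minority samples
  q1_bar <- colMeans(q1)               # positive-class mean vector
  Wp <- cov(q1)                        # empirical covariance of the minority class
  eig <- eigen(Wp, symmetric = TRUE)   # eigenvalues sorted in descending order
  d <- pmax(eig$values, 1e-8)          # crude regularization of small eigenvalues
  z <- rnorm(length(d))                # zero-mean Gaussian draw
  as.vector(sqrt(d) * (t(eig$vectors) %*% z)) + q1_bar  # b = D^{1/2} V^T z + q1_bar
}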

  3. ADASYN

Given the transformed positive data $P_{t} = \left\{ q_{1i}\right\}$ and negative data $N_{t} = \left\{ q_{0j}\right\}$, each sample $q_{1i}$ is replicated $\Gamma_{i} = \left| S_{i:kNN}\cap N_{t} \right| / Z$ times, where $S_{i:kNN}$ is the set of $k$ nearest neighbors of $q_{1i}$ in the entire dataset, and $Z$ is a normalization factor ensuring that $\sum_{i=1}^{|P_{t}|}\Gamma_{i} = 1$.
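As an illustration, the replication weights $\Gamma_{i}$ could be computed along the following lines. This is a hypothetical sketch, not the package implementation; it uses fields::rdist, which the package imports, for the distance computations, and includes no safeguard for the degenerate case in which no minority sample has a negative neighbor.

adasyn_weights <- function(Pt, Nt, k = 5) { # rows of Pt and Nt are transformed samples
  X <- rbind(Pt, Nt)                  # the entire transformed dataset
  D <- fields::rdist(Pt, X)           # Euclidean distances, minority rows vs all rows
  Gamma <- apply(D, 1, function(d) {
    nn <- order(d)[2:(k + 1)]         # k nearest neighbours, skipping the point itself
    sum(nn > nrow(Pt))                # count the neighbours belonging to N_t
  })
  Gamma / sum(Gamma)                  # normalize so that the weights sum to 1
}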

See @Cao2013 for further technical details of this approach.

Functionality

The package imports 'parallel' (@parallel), 'doParallel' (@doParallel), 'doSNOW' (@doSNOW) and 'foreach' (@foreach) for multi-threaded execution on shared memory architectures. Parallel execution is strongly suggested for datasets consisting of at least 30,000 observations. OSTSC also imports 'mvrnorm' from 'MASS' (@MASS) to generate random vectors from the multivariate normal distribution, and 'rdist' from 'fields' (@fields) in order to calculate the Euclidean distance between vectors and matrices.
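For example, switching on multi-threaded oversampling only requires setting the parallel flag (here train.sample and train.label stand for any training set, such as those defined in the examples that follow):

MyData <- OSTSC(train.sample, train.label, parallel = TRUE) # multi-threaded oversampling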

Some simple examples are displayed below. For calling the RNN and examining the classifier's performance, 'keras' (@keras), 'dummies' (@dummies) and 'pROC' (@pROC) are required.

Examples

Data loading & oversampling

The OSTSC package provides three small built-in datasets for verification that OSTSC has correctly installed and generates balanced time series. The first two examples use OSTSC to balance binary data while the third balances multinomial data.

The synthetically generated control dataset

The dataset Dataset_Synthetic_Control is a time series of sensor measurements of human body motion generated by @Alcock1999. We introduce the following labeling: Class 1 represents the 'Normal' state, while Class 0 represents one of 'Cyclic', 'Increasing trend', 'Decreasing trend', 'Upward shift' or 'Downward shift' (@Pham1998). Users load the dataset by calling data().

require(OSTSC)
require(keras)
require(dummies)
require(pROC)
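# Download precomputed LSTM training results so that the performance figures
# later in the vignette can be rendered without hours of model training.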
ed <- local(get(load(url('https://github.com/lweicdsor/GSoC2017/raw/master/ElectricDevices%20LSTM%20model%20saved/ElectricalDevices_Precomputed.rdata'))))
ecg <- local(get(load(url('https://github.com/lweicdsor/GSoC2017/raw/master/ECG%20LSTM%20model%20saved/ECG_Precomputed.rdata'))))
mhealth <- local(get(load(url('https://github.com/lweicdsor/GSoC2017/raw/master/Mhealth%20LSTM%20model%20saved/MHEALTH_Precomputed.rdata'))))
hft <- local(get(load(url('https://github.com/lweicdsor/GSoC2017/raw/master/HFT%20LSTM%20model%20saved/HFT_Precomputed.rdata'))))
data(Dataset_Synthetic_Control)

train.label <- Dataset_Synthetic_Control$train.y
train.sample <- Dataset_Synthetic_Control$train.x
test.label <- Dataset_Synthetic_Control$test.y
test.sample <- Dataset_Synthetic_Control$test.x

Each row of the dataset is a sequence of observations. The sequence is of length 60 and there are 300 observations.

dim(train.sample)

The imbalance ratio of the training data is 1:5.

table(train.label)

We now provide a simple example demonstrating oversampling of the minority data to match the number of observations of the majority class. The output 'MyData' stores the samples (a.k.a. features) and labels. There are ten parameters in the OSTSC function, the details of which can be found in the help documentation. Calling the OSTSC function requires the user to provide at least the labels and sample data; the other parameters have default values. It is important to note that the labels are passed separately from the samples.

MyData <- OSTSC(train.sample, train.label, parallel = FALSE)
over.sample <- MyData$sample
over.label <- MyData$label

The positive and negative observations are now balanced. Let us check the (im)balance of the new dataset.

table(over.label)

The minority class data is oversampled to produce a balanced feature set. The minority-majority formation uses a one-vs-rest strategy. For this binary dataset, the Class 1 data has been oversampled to yield the same number of observations as Class 0.

dim(over.sample)

The automatic diatom identification dataset

The dataset Dataset_Adiac is generated from a pilot study by @Jalba2004 on identifying diatoms (unicellular algae) from images, and originally has 37 classes. For the purpose of demonstrating OSTSC, we select one class as the positive class (Class 1) and set all others as the negative class (Class 0), forming a highly imbalanced dataset. Users load the dataset into R by calling data().

data(Dataset_Adiac)

train.label <- Dataset_Adiac$train.y
train.sample <- Dataset_Adiac$train.x
test.label <- Dataset_Adiac$test.y
test.sample <- Dataset_Adiac$test.x

The training dataset consists of 390 observations of a length-176 sequence.

dim(train.sample)

The imbalance ratio of the training data is 1:29.

table(train.label)

The OSTSC function generates a balanced dataset:

MyData <- OSTSC(train.sample, train.label, parallel = FALSE)
over.sample <- MyData$sample
over.label <- MyData$label

table() provides a summary of the balanced dataset.

over.label <- rbind(matrix(0,377,1),matrix(1,377,1)) # hardcoded labels reproducing the oversampled counts for display
table(over.label)

The high frequency trading dataset

The OSTSC function provides support for multinomial classification. The user specifies which classes should be oversampled. Typically, oversampling is first applied to the minority class, the class with the fewest observations. The dataset Dataset_HFT300 is extracted from a real high frequency trading datafeed (@Dixon2017). It contains a feature representing instantaneous liquidity imbalance, computed using the best bid to ask ratio. The data is labeled so that $Y=1$ for an up-tick in the next mid-price event, $Y=-1$ for a down-tick, and $Y=0$ for no mid-price movement.
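For intuition, this labeling convention can be reproduced from a mid-price series along the following lines (a hypothetical toy illustration, not package code):

mid <- c(100.00, 100.01, 100.01, 100.00, 100.02) # toy mid-price series
y <- sign(c(diff(mid), 0)) # direction of the next move: 1 up-tick, -1 down-tick, 0 none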
Users load the dataset into the R environment by calling data().

data(Dataset_HFT300)

train.label <- Dataset_HFT300$y
train.sample <- Dataset_HFT300$x

The sequence length is set to 10 and 300 sequence observations are randomly drawn for this example dataset.

dim(train.sample)

The imbalance ratio of the three class dataset is 1:48:1.

table(train.label)

This example demonstrates the case where there are two minority classes and both are oversampled. The oversampling uses a one-vs-rest strategy, meaning that each minority class is oversampled to the same count as the sum of the counts of all the other classes. For example, with class counts 6:288:6, each minority class is oversampled to 288 + 6 = 294 observations, while the majority class remains at 288. This results in a slight imbalance in the total number of labels.

MyData <- OSTSC(train.sample, train.label, parallel = FALSE)
over.sample <- MyData$sample
over.label <- MyData$label

We observe the ratio of the classes after oversampling.

over.label <- rbind(matrix(-1,294,1),matrix(0,288,1),matrix(1,294,1)) # hardcoded labels reproducing the oversampled counts for display
table(over.label)

The above examples illustrate how OSTSC oversamples small datasets. In the next section, we demonstrate and evaluate the oversampled data on two medium size datasets.

Applying OSTSC to medium size datasets

The Electrical Devices dataset

The dataset Dataset_ElectricalDevices is a sample collected from the 'Powering the Nation' study (@Lines2011). This study seeks to reduce the UK's carbon footprint by collecting behavioural data on how consumers use electricity within the home. Each class represents a signal from a different electrical device. Classes 5 and 6 in the original dataset are set as the negative and positive classes respectively. The dataset is split into training and testing feature vectors and labels.

ElectricalDevices <- Dataset_ElectricalDevices()

train.label <- ElectricalDevices$train.y
train.sample <- ElectricalDevices$train.x
test.label <- ElectricalDevices$test.y
test.sample <- ElectricalDevices$test.x
vali.label <- ElectricalDevices$vali.y
vali.sample <- ElectricalDevices$vali.x

Each row in the data represents a sequence of length 96.

dim(train.sample)

The imbalance ratio of the training data is 1:21.

table(train.label)

After oversampling with OSTSC, the positive and negative observations are balanced.

MyData <- OSTSC(train.sample, train.label, parallel = FALSE)
over.sample <- MyData$sample
over.label <- MyData$label
table(over.label)
over.label <- rbind(matrix(0,2100,1),matrix(1,2100,1)) # hardcoded labels reproducing the oversampled counts for display
table(over.label)

An LSTM classifier is used as the basis for performance assessment of oversampling with OSTSC. We use 'keras' (@keras) to configure the architecture, hyper-parameters and learning schedule of the LSTM classifier for sequence classification.

As a baseline for OSTSC, we assess the performance of the LSTM trained on both the unbalanced and the balanced data. All performances are evaluated out-of-sample unless stated otherwise. Note that for the multinomial classification examples, each class's F1 history is shown separately but is evaluated on the same validation set during training. The procedure for applying Keras is outlined next:

  1. One-hot encode the categorical label vectors as binary class matrices using the dummy() function, then transform the feature matrices into tensors for the LSTM.
library(keras)
library(dummies)
train.y <- dummy(train.label)
test.y <- dummy(test.label)
train.x <- array(train.sample, dim = c(dim(train.sample),1)) 
test.x <- array(test.sample, dim = c(dim(test.sample),1)) 
vali.y <- dummy(vali.label)
vali.x <- array(vali.sample, dim = c(dim(vali.sample),1)) 
over.y <- dummy(over.label)
over.x <- array(over.sample, dim = c(dim(over.sample),1)) 
  2. Initialize a sequential model, add layers and then compile it. Train the LSTM classifier on both the imbalanced and the oversampled data.
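# The custom metrics below compute per-class F1 scores on the Keras backend:
# summing y_true * y_pred over the batch axis (axis 0) gives per-class
# true-positive counts, and the index ([1], [2] or [3]) selects the class column.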
K <- backend()

metric_f1_0 <- function(y_true, y_pred) { # positive is 0
  true_positives <- K$sum(y_true*y_pred, axis=K$cast(0,dtype='int32'))[1]
  possible_positives <- K$sum(y_true,axis=K$cast(0,dtype='int32'))[1]
  recall <- true_positives / (possible_positives + K$epsilon())

  predicted_positives<- K$sum(y_pred,axis=K$cast(0,dtype='int32'))[1]
  precision <- true_positives / (predicted_positives + K$epsilon())
  return(2*((precision*recall)/(precision+recall+ K$epsilon())))
}

metric_f1_1 <- function(y_true, y_pred) { # positive is 1
  true_positives <- K$sum(y_true*y_pred, axis=K$cast(0,dtype='int32'))[2]
  possible_positives <- K$sum(y_true,axis=K$cast(0,dtype='int32'))[2]
  recall <- true_positives / (possible_positives + K$epsilon())
  predicted_positives<- K$sum(y_pred,axis=K$cast(0,dtype='int32'))[2]
  precision <- true_positives / (predicted_positives + K$epsilon())
  return(2*((precision*recall)/(precision+recall+ K$epsilon())))
}

metric_f1_2 <- function(y_true, y_pred) { # positive is 2
  true_positives <- K$sum(y_true*y_pred, axis=K$cast(0,dtype='int32'))[3]
  possible_positives <- K$sum(y_true,axis=K$cast(0,dtype='int32'))[3]
  recall <- true_positives / (possible_positives + K$epsilon())
  predicted_positives<- K$sum(y_pred,axis=K$cast(0,dtype='int32'))[3]
  precision <- true_positives / (predicted_positives + K$epsilon())
  return(2*((precision*recall)/(precision+recall+ K$epsilon())))
}

LossHistory <- R6::R6Class("LossHistory",
  inherit = KerasCallback,

  public = list(

    losses = NULL,
    f1_0s    = NULL,
    f1_1s    = NULL,
    f1_2s    = NULL,
    val_f1_0s= NULL,
    val_f1_1s= NULL,
    val_f1_2s= NULL,

    on_epoch_end = function(epoch, logs = list()) {
      self$losses <- c(self$losses, logs[["loss"]])
      self$f1_0s    <- c(self$f1_0s, logs[["f1_score_0"]])
      self$f1_1s    <- c(self$f1_1s, logs[["f1_score_1"]])
      self$f1_2s    <- c(self$f1_2s, logs[["f1_score_2"]])
      self$val_f1_0s    <- c(self$val_f1_0s, logs[["val_f1_score_0"]])
      self$val_f1_1s    <- c(self$val_f1_1s, logs[["val_f1_score_1"]])
      self$val_f1_2s    <- c(self$val_f1_2s, logs[["val_f1_score_2"]])
    }
))

model <- keras_model_sequential()
model %>%
    layer_lstm(10, input_shape = c(dim(train.x)[2], dim(train.x)[3])) %>%
    #layer_dropout(rate = 0.1) %>% 
    layer_dense(dim(train.y)[2]) %>%
    layer_dropout(rate = 0.1) %>%
    layer_activation("softmax")
history <- LossHistory$new()
model %>% compile(
    loss = "categorical_crossentropy", 
    optimizer = optimizer_adam(lr = 0.005),
    metrics = c("accuracy",'f1_score_0' = metric_f1_0, 'f1_score_1' = metric_f1_1)
)
lstm.before <- model %>% fit( 
    x = train.x, 
    y = train.y, 
    validation_data=list(vali.x,vali.y),
    batch_size = 256,
    callbacks = list(history),
    epochs = 50
)

model.over <- keras_model_sequential()
model.over %>%
    layer_lstm(10, input_shape = c(dim(over.x)[2], dim(over.x)[3])) %>%
    #layer_dropout(rate = 0.1) %>% 
    layer_dense(dim(over.y)[2]) %>%
    layer_dropout(rate = 0.1) %>% 
    layer_activation("softmax")
history.over <- LossHistory$new()
model.over %>% compile(
    loss = "categorical_crossentropy", 
    optimizer = optimizer_adam(lr = 0.005), 
    metrics = c("accuracy",'f1_score_0' = metric_f1_0, 'f1_score_1' = metric_f1_1)
)
lstm.after <- model.over %>% fit( 
    x = over.x, 
    y = over.y, 
    validation_data=list(vali.x,vali.y),
    batch_size = 256,
    callbacks = list(history.over),
    epochs = 50
)
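# Swap in the precomputed Electrical Devices results downloaded earlier, so
# the figures below match the saved training run.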
pred.label <- as.vector(unlist(ed$pred.label)) 
history.val_f1_0s <- as.vector(unlist(ed$history.val_f1_0s)) 
history.val_f1_1s <- as.vector(unlist(ed$history.val_f1_1s)) 
history.losses <- as.vector(unlist(ed$history.losses)) 
pred.label.over <- as.vector(unlist(ed$pred.label.over)) 
history.over.val_f1_0s <- as.vector(unlist(ed$history.over.val_f1_0s)) 
history.over.val_f1_1s <- as.vector(unlist(ed$history.over.val_f1_1s)) 
history.over.losses <- as.vector(unlist(ed$history.over.losses)) 
  3. From the training history, Figures 1 and 2 compare the F1 scores of the two models, without and with oversampling. Figure 3 compares the losses of the two models.
plot(history.over.val_f1_1s, type = "b", pch = 19, col = "red", main = "F1 of the LSTM classifier on Electrical Devices dataset", yaxt = "n", xaxt = "n", xlab = "Epochs", ylab = "", ylim = c(0.0, 1.0), xlim = c(0, 50))
lines(history.val_f1_1s, type = "b", pch = 0, col = "blue")
axis(1, at = c(0,5,10,15,20,25,30,35,40,45,50),labels = c(0,5,10,15,20,25,30,35,40,45,50), las = 1)
axis(2, at = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), labels = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), las = 1)
mtext("F1", side = 2, las = 1, line = 3)
legend("topleft", legend = c("Balanced dataset", "Unbalanced dataset"), col = c("red","blue"), pch = c(19, 0), lty = c(NA, NA), bty = "n", lwd = 2, cex = 0.7)
plot(history.over.val_f1_0s, type = "b", pch = 19, col = "red", main = "F1 of the LSTM classifier on Electrical Devices dataset", yaxt = "n", xaxt = "n", xlab = "Epochs", ylab = "", ylim = c(0.0, 1.0), xlim = c(0, 50))
lines(history.val_f1_0s, type = "b", pch = 0, col = "blue")
axis(1, at = c(0,5,10,15,20,25,30,35,40,45,50),labels = c(0,5,10,15,20,25,30,35,40,45,50), las = 1)
axis(2, at = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), labels = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), las = 1)
mtext("F1", side = 2, las = 1, line = 3)
legend("bottomright", legend = c("Balanced dataset", "Unbalanced dataset"), col = c("red","blue"), pch = c(19, 0), lty = c(NA, NA), bty = "n", lwd = 2, cex = 0.7)
plot(history.losses, type = "b", pch = 0, col = "blue", main = "Loss of the LSTM classifier on Electrical Devices dataset", yaxt = "n", xaxt = "n", xlab = "Epochs", ylab = "", ylim = c(0, 1), xlim = c(0, 50))
lines(history.over.losses, type = "b", pch = 19, col = "red")
axis(1, at = c(0,5,10,15,20,25,30,35,40,45,50),labels = c(0,5,10,15,20,25,30,35,40,45,50), las = 1)
axis(2, at = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), labels = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), las = 1)
mtext("Loss", side = 2, las = 1, line = 2)
legend("topleft", legend = c("Balanced dataset", "Unbalanced dataset"), col = c("red","blue"), pch = c(19, 0), lty = c(NA, NA), bty = "n", lwd = 2, cex = 0.7)

In addition to the training history, Figures 4 and 5 compare the confusion matrices of the two models without and with oversampling. Figure 6 compares the receiver operating characteristic (ROC) curves of the models. The final out-of-sample F1 scores of the two trained models are also shown below for comparison.

pred.label <- model %>% predict_classes(test.x)
pred.label.over <- model.over %>% predict_classes(test.x)
cm.before <- table(test.label, pred.label)
cm.after <- table(test.label, pred.label.over)
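# as.numeric() flattens the 2x2 confusion matrix column-major:
# res[1] = (true 0, pred 0), res[2] = (true 1, pred 0),
# res[3] = (true 0, pred 1), res[4] = (true 1, pred 1);
# the loops below replace NAs with 0 when a class is missing from the table.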
res <- as.numeric(cm.before)
for (i in 1:4){
  res[i][is.na(res[i])] <- 0
}
f1_1 <- 2*res[4]/(2*res[4]+res[2]+res[3])
f1_0 <- 2*res[1]/(2*res[1]+res[2]+res[3])
cat("The class 1 F1 score without oversampling: ", f1_1)
cat("The class 0 F1 score without oversampling: ", f1_0)
res <- as.numeric(cm.after)
for (i in 1:4){
  res[i][is.na(res[i])] <- 0
}
f1_1 <- 2*res[4]/(2*res[4]+res[2]+res[3])
f1_0 <- 2*res[1]/(2*res[1]+res[2]+res[3])
cat("The class 1 F1 score with oversampling: ", f1_1)
cat("The class 0 F1 score with oversampling: ", f1_0)
layout(matrix(c(1,1,1)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')

rect(150, 430, 240, 370, col='#3F97D0')
rect(250, 430, 340, 370, col='#F7AD50')
rect(150, 305, 240, 365, col='#F7AD50')
rect(250, 305, 340, 365, col='#3F97D0')
text(195, 435, '0', cex=1.1)
text(295, 435, '1', cex=1.1)
text(125, 370, 'True', cex=1.2, srt=90, font=2)
text(245, 450, 'Predicted', cex=1.2, font=2)
text(140, 400, '0', cex=1.1, srt=90)
text(140, 335, '1', cex=1.1, srt=90)
res <- as.numeric(cm.before)
for (i in 1:4){
  res[i][is.na(res[i])] <- 0
}
sum1 <- res[1] + res[3]
sum2 <- res[2] + res[4] 
text(195, 400, round(res[1]/sum1, 4), cex=1.3, font=2, col='white')
text(195, 335, round(res[2]/sum2, 4), cex=1.3, font=2, col='white')
text(295, 400, round(res[3]/sum1, 4), cex=1.3, font=2, col='white')
text(295, 335, round(res[4]/sum2, 4), cex=1.3, font=2, col='white')
layout(matrix(c(1,1,1)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')

rect(150, 430, 240, 370, col='#3F97D0')
rect(250, 430, 340, 370, col='#F7AD50')
rect(150, 305, 240, 365, col='#F7AD50')
rect(250, 305, 340, 365, col='#3F97D0')
text(195, 435, '5', cex=1.1)
text(295, 435, '6', cex=1.1)
text(125, 370, 'True', cex=1.2, srt=90, font=2)
text(245, 450, 'Predicted', cex=1.2, font=2)
text(140, 400, '5', cex=1.1, srt=90)
text(140, 335, '6', cex=1.1, srt=90)
res <- as.numeric(cm.after)
for (i in 1:4){
  res[i][is.na(res[i])] <- 0
}
sum1 <- res[1] + res[3]
sum2 <- res[2] + res[4] 
text(195, 400, round(res[1]/sum1, 4), cex=1.3, font=2, col='white')
text(195, 335, round(res[2]/sum2, 4), cex=1.3, font=2, col='white')
text(295, 400, round(res[3]/sum1, 4), cex=1.3, font=2, col='white')
text(295, 335, round(res[4]/sum2, 4), cex=1.3, font=2, col='white')
library(pROC)
par(pty = "s")
plot.roc(as.vector(test.label), pred.label, legacy.axes = TRUE, col = "blue", print.auc = TRUE,  
         print.auc.cex= .8, xlab = 'False Positive Rate', ylab = 'True Positive Rate', lty = "dashed")
plot.roc(as.vector(test.label), pred.label.over, legacy.axes = TRUE, col = "red", print.auc = TRUE,   
         print.auc.y = .4, print.auc.cex= .8, add = TRUE)
legend("bottomright", legend=c("Before Oversampling", "After Oversampling"), 
       col=c("blue", "red"), lwd=2, cex= .6, lty = c("dashed", "solid"))

The Electrocardiogram dataset

The dataset Dataset_ECG was originally created by @Goldberger2015 and records heartbeats from patients with severe congestive heart failure. The dataset was pre-processed to extract heartbeat sequences and add labels by @Chen2015. This vignette uses 5,000 randomly selected heartbeat sequences.

ECG <- Dataset_ECG()

train.label <- ECG$train.y
train.sample <- ECG$train.x
test.label <- ECG$test.y
test.sample <- ECG$test.x
vali.label <- ECG$vali.y
vali.sample <- ECG$vali.x

Each row in the data represents a sequence of length 140.

dim(train.sample)

This experiment uses 3 classes of the dataset to ensure a high degree of imbalance: the imbalance ratio is 32:1:2.

table(train.label)

Let us check that the data is balanced after oversampling.

MyData <- OSTSC(train.sample, train.label, parallel = FALSE)
over.sample <- MyData$sample
over.label <- MyData$label
table(over.label)
over.label <- rbind(matrix(0,2100,1),matrix(1,2230,1),matrix(2,2166,1)) # hardcoded labels reproducing the oversampled counts for display
table(over.label)

We evaluate the effect of oversampling on the performance of the LSTM following Steps 1-3 above. First the data is transformed. While configuring and training the model, the F1 scores and losses are measured at the end of each epoch using the same validation set.

library(keras)
library(dummies)
train.y <- dummy(train.label)
test.y <- dummy(test.label)
train.x <- array(train.sample, dim = c(dim(train.sample),1)) 
test.x <- array(test.sample, dim = c(dim(test.sample),1)) 
vali.y <- dummy(vali.label)
vali.x <- array(vali.sample, dim = c(dim(vali.sample),1)) 
over.y <- dummy(over.label)
over.x <- array(over.sample, dim = c(dim(over.sample),1)) 

model <- keras_model_sequential()
model %>%
    layer_lstm(10, input_shape = c(dim(train.x)[2], dim(train.x)[3])) %>%
    #layer_dropout(rate = 0.1) %>% 
    layer_dense(dim(train.y)[2]) %>%
    layer_dropout(rate = 0.1) %>%
    layer_activation("softmax")
history <- LossHistory$new()
model %>% compile(
    loss = "categorical_crossentropy", 
    optimizer = optimizer_adam(lr = 0.001),
    metrics = c("accuracy",'f1_score_0' = metric_f1_0, 'f1_score_1' = metric_f1_1, 
                'f1_score_2' = metric_f1_2)
)
lstm.before <- model %>% fit( 
    x = train.x, 
    y = train.y, 
    validation_data=list(vali.x,vali.y),
    batch_size = 256,
    callbacks = list(history),
    epochs = 50
)

model.over <- keras_model_sequential()
model.over %>%
    layer_lstm(10, input_shape = c(dim(over.x)[2], dim(over.x)[3])) %>%
    #layer_dropout(rate = 0.1) %>% 
    layer_dense(dim(over.y)[2]) %>%
    layer_dropout(rate = 0.1) %>% 
    layer_activation("softmax")
history.over <- LossHistory$new()
model.over %>% compile(
    loss = "categorical_crossentropy", 
    optimizer = optimizer_adam(lr = 0.001), 
    metrics = c("accuracy",'f1_score_0' = metric_f1_0, 'f1_score_1' = metric_f1_1, 
                'f1_score_2' = metric_f1_2)
)
lstm.after <- model.over %>% fit( 
    x = over.x, 
    y = over.y, 
    validation_data=list(vali.x,vali.y),
    batch_size = 256,
    callbacks = list(history.over),
    epochs = 50
)
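# Swap in the precomputed Electrocardiogram results downloaded earlier.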
pred.label <- as.vector(unlist(ecg$pred.label)) 
history.val_f1_0s <- as.vector(unlist(ecg$history.val_f1_0s)) 
history.val_f1_1s <- as.vector(unlist(ecg$history.val_f1_1s)) 
history.val_f1_2s <- as.vector(unlist(ecg$history.val_f1_2s)) 
history.losses <- as.vector(unlist(ecg$history.losses)) 
pred.label.over <- as.vector(unlist(ecg$pred.label.over)) 
history.over.val_f1_0s <- as.vector(unlist(ecg$history.over.val_f1_0s)) 
history.over.val_f1_1s <- as.vector(unlist(ecg$history.over.val_f1_1s)) 
history.over.val_f1_2s <- as.vector(unlist(ecg$history.over.val_f1_2s)) 
history.over.losses <- as.vector(unlist(ecg$history.over.losses)) 

Keeping the number of epochs fixed, Figures 7, 8 and 9 respectively compare the F1 scores of the three classes for the two models, without and with oversampling. Figure 10 compares the losses of the two models. From the losses and F1 scores, we note that the model has not yet been adequately trained after 50 epochs; the intent here is to demonstrate the utility of OSTSC with only a modest amount of computation. The user can of course choose to increase the number of epochs, but this will require more computation. The reader should refer to the larger dataset examples below for comparative evaluations which use more epochs for training the LSTM.

plot(history.over.val_f1_2s, type = "b", pch = 19, col = "red", main = "F1 of the LSTM classifier on Electrocardiogram dataset", yaxt = "n", xaxt = "n", xlab = "Epochs", ylab = "", ylim = c(0.0, 1.0), xlim = c(0, 50))
lines(history.val_f1_2s, type = "b", pch = 0, col = "blue")
axis(1, at = c(0,5,10,15,20,25,30,35,40,45,50),labels = c(0,5,10,15,20,25,30,35,40,45,50), las = 1)
axis(2, at = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), labels = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), las = 1)
mtext("F1", side = 2, las = 1, line = 3)
legend("topleft", legend = c("Balanced dataset", "Unbalanced dataset"), col = c("red","blue"), pch = c(19, 0), lty = c(NA, NA), bty = "n", lwd = 2, cex = 0.7)
plot(history.over.val_f1_1s, type = "b", pch = 19, col = "red", main = "F1 of the LSTM classifier on Electrocardiogram dataset", yaxt = "n", xaxt = "n", xlab = "Epochs", ylab = "", ylim = c(0.0, 1.0), xlim = c(0, 50))
lines(history.val_f1_1s, type = "b", pch = 0, col = "blue")
axis(1, at = c(0,5,10,15,20,25,30,35,40,45,50),labels = c(0,5,10,15,20,25,30,35,40,45,50), las = 1)
axis(2, at = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), labels = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), las = 1)
mtext("F1", side = 2, las = 1, line = 3)
legend("topleft", legend = c("Balanced dataset", "Unbalanced dataset"), col = c("red","blue"), pch = c(19, 0), lty = c(NA, NA), bty = "n", lwd = 2, cex = 0.7)
plot(history.over.val_f1_0s, type = "b", pch = 19, col = "red", main = "F1 of the LSTM classifier on Electrocardiogram dataset", yaxt = "n", xaxt = "n", xlab = "Epochs", ylab = "", ylim = c(0.0, 1.0), xlim = c(0, 50))
lines(history.val_f1_0s, type = "b", pch = 0, col = "blue")
axis(1, at = c(0,5,10,15,20,25,30,35,40,45,50),labels = c(0,5,10,15,20,25,30,35,40,45,50), las = 1)
axis(2, at = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), labels = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), las = 1)
mtext("F1", side = 2, las = 1, line = 3)
legend("bottomright", legend = c("Balanced dataset", "Unbalanced dataset"), col = c("red","blue"), pch = c(19, 0), lty = c(NA, NA), bty = "n", lwd = 2, cex = 0.7)
plot(history.losses, type = "b", pch = 0, col = "blue", main = "Loss of the LSTM classifier on Electrocardiogram dataset", yaxt = "n", xaxt = "n", xlab = "Epochs", ylab = "", ylim = c(0, 1.2), xlim = c(0, 50))
lines(history.over.losses, type = "b", pch = 19, col = "red")
axis(1, at = c(0,5,10,15,20,25,30,35,40,45,50),labels = c(0,5,10,15,20,25,30,35,40,45,50), las = 1)
axis(2, at = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1,1.1,1.2), labels = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1,1.1,1.2), las = 1)
mtext("Loss", side = 2, las = 1, line = 2)
legend("topright", legend = c("Balanced dataset", "Unbalanced dataset"), col = c("red","blue"), pch = c(19, 0), lty = c(NA, NA), bty = "n", lwd = 2, cex = 0.7)

In addition to the training history, Figures 11 and 12 compare the confusion matrices of the two models without and with oversampling. Figure 13 compares the receiver operating characteristic (ROC) curves of the models. The final F1 scores of the two trained models, using the same validation set, are also shown below for comparison.

pred.label <- model %>% predict_classes(test.x)
pred.label.over <- model.over %>% predict_classes(test.x)
cm.before <- table(test.label, pred.label)
cm.after <- table(test.label, pred.label.over)
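# The 3x3 confusion matrix flattens column-major, so res[1], res[5] and res[9]
# are the diagonal (correct) counts for classes 0, 1 and 2 respectively.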
res <- as.numeric(cm.before)
for (i in 1:9){
  res[i][is.na(res[i])] <- 0
}
f1_2 <- 2*res[9]/(2*res[9]+res[3]+res[6]+res[7]+res[8])
f1_1 <- 2*res[5]/(2*res[5]+res[2]+res[4]+res[6]+res[8])
f1_0 <- 2*res[1]/(2*res[1]+res[2]+res[3]+res[4]+res[7])
cat("The class 2 F1 score without oversampling: ", f1_2)
cat("The class 1 F1 score without oversampling: ", f1_1)
cat("The class 0 F1 score without oversampling: ", f1_0)
res <- as.numeric(cm.after)
for (i in 1:9){
  res[i][is.na(res[i])] <- 0
}
f1_2 <- 2*res[9]/(2*res[9]+res[3]+res[6]+res[7]+res[8])
f1_1 <- 2*res[5]/(2*res[5]+res[2]+res[4]+res[6]+res[8])
f1_0 <- 2*res[1]/(2*res[1]+res[2]+res[3]+res[4]+res[7])
cat("The class 2 F1 score with oversampling: ", f1_2)
cat("The class 1 F1 score with oversampling: ", f1_1)
cat("The class 0 F1 score with oversampling: ", f1_0)
layout(matrix(c(1,1,1)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')

rect(140, 430, 200, 390, col='#3F97D0')
rect(210, 430, 270, 390, col='#F7AD50')
rect(280, 430, 340, 390, col='#F7AD50')
rect(140, 345, 200, 385, col='#F7AD50')
rect(210, 345, 270, 385, col='#3F97D0')
rect(280, 345, 340, 385, col='#F7AD50')
rect(140, 300, 200, 340, col='#F7AD50')
rect(210, 300, 270, 340, col='#F7AD50')
rect(280, 300, 340, 340, col='#3F97D0')
text(170, 435, '0', cex=1.1)
text(240, 435, '1', cex=1.1)
text(310, 435, '2', cex=1.1)
text(130, 410, '0', cex=1.1, srt=90)
text(130, 365, '1', cex=1.1, srt=90)
text(130, 320, '2', cex=1.1, srt=90)
text(120, 370, 'True', cex=1.2, srt=90, font=2)
text(240, 450, 'Predicted', cex=1.2, font=2)

res <- as.numeric(cm.before)
sum1 <- res[1] + res[4] + res[7]
sum2 <- res[2] + res[5] + res[8]
sum3 <- res[3] + res[6] + res[9]
text(170, 410, round(res[1]/sum1, 4), cex=1.3, font=2, col='white')
text(170, 365, round(res[2]/sum2, 4), cex=1.3, font=2, col='white')
text(170, 320, round(res[3]/sum3, 4), cex=1.3, font=2, col='white')
text(240, 410, round(res[4]/sum1, 4), cex=1.3, font=2, col='white')
text(240, 365, round(res[5]/sum2, 4), cex=1.3, font=2, col='white')
text(240, 320, round(res[6]/sum3, 4), cex=1.3, font=2, col='white')
text(310, 410, round(res[7]/sum1, 4), cex=1.3, font=2, col='white')
text(310, 365, round(res[8]/sum2, 4), cex=1.3, font=2, col='white')
text(310, 320, round(res[9]/sum3, 4), cex=1.3, font=2, col='white')
layout(matrix(c(1,1,1)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')

rect(140, 430, 200, 390, col='#3F97D0')
rect(210, 430, 270, 390, col='#F7AD50')
rect(280, 430, 340, 390, col='#F7AD50')
rect(140, 345, 200, 385, col='#F7AD50')
rect(210, 345, 270, 385, col='#3F97D0')
rect(280, 345, 340, 385, col='#F7AD50')
rect(140, 300, 200, 340, col='#F7AD50')
rect(210, 300, 270, 340, col='#F7AD50')
rect(280, 300, 340, 340, col='#3F97D0')
text(170, 435, '0', cex=1.1)
text(240, 435, '1', cex=1.1)
text(310, 435, '2', cex=1.1)
text(130, 410, '0', cex=1.1, srt=90)
text(130, 365, '1', cex=1.1, srt=90)
text(130, 320, '2', cex=1.1, srt=90)
text(120, 370, 'True', cex=1.2, srt=90, font=2)
text(240, 450, 'Predicted', cex=1.2, font=2)

res <- as.numeric(cm.after)
sum1 <- res[1] + res[4] + res[7]
sum2 <- res[2] + res[5] + res[8]
sum3 <- res[3] + res[6] + res[9]
text(170, 410, round(res[1]/sum1, 4), cex=1.3, font=2, col='white')
text(170, 365, round(res[2]/sum2, 4), cex=1.3, font=2, col='white')
text(170, 320, round(res[3]/sum3, 4), cex=1.3, font=2, col='white')
text(240, 410, round(res[4]/sum1, 4), cex=1.3, font=2, col='white')
text(240, 365, round(res[5]/sum2, 4), cex=1.3, font=2, col='white')
text(240, 320, round(res[6]/sum3, 4), cex=1.3, font=2, col='white')
text(310, 410, round(res[7]/sum1, 4), cex=1.3, font=2, col='white')
text(310, 365, round(res[8]/sum2, 4), cex=1.3, font=2, col='white')
text(310, 320, round(res[9]/sum3, 4), cex=1.3, font=2, col='white')
library(pROC)
par(pty = "s")
plot.roc(as.vector(test.label), pred.label, legacy.axes = TRUE, col = "blue", print.auc = TRUE,  
         print.auc.cex= .8, xlab = 'False Positive Rate', ylab = 'True Positive Rate', lty = "dashed")
plot.roc(as.vector(test.label), pred.label.over, legacy.axes = TRUE, col = "red", print.auc = TRUE,   
         print.auc.y = .4, print.auc.cex= .8, add = TRUE)
legend("bottomright", legend=c("Before Oversampling", "After Oversampling"), 
       col=c("blue", "red"), lwd=2, cex= .6, lty = c("dashed", "solid"))

Evaluating OSTSC on the large datasets

The evaluation of oversampling uses two larger datasets: the MHEALTH and HFT datasets. The purpose of this evaluation is to demonstrate how OSTSC performs at scale. We increase the data sizes by a factor of up to 10x. The evaluation of each dataset takes approximately three hours on a 1.7 GHz four-core laptop with 8GB of RAM.

The MHEALTH dataset

The dataset Dataset_MHEALTH benchmarks techniques for human behavioral analysis applied to multimodal body sensing (@Banos2014). In this experiment, only Subjects 1-5 and Feature 12 (the x coordinate of the magnetometer reading from the left-ankle sensor) are used. The dataset is labeled with a dichotomous response (@Banos2015): Class 11 (Running) is set as the positive class and the remaining states as the negative class. The dataset is split into training and testing feature vectors and labels.

MHEALTH <- Dataset_MHEALTH()

train.label <- MHEALTH$train.y
train.sample <- MHEALTH$train.x
test.label <- MHEALTH$test.y
test.sample <- MHEALTH$test.x
vali.label <- MHEALTH$vali.y
vali.sample <- MHEALTH$vali.x

Each row in the data represents a sequence of length 30.

dim(train.sample)

Class 1 represents the positive data and Class 0 the negative. The imbalance ratio of the training dataset is 1:40.

table(train.label)

After oversampling, the positive and negative observations are balanced.

MyData <- OSTSC(train.sample, train.label, parallel = FALSE)
over.sample <- MyData$sample
over.label <- MyData$label
table(over.label)
over.label <- rbind(matrix(0,10000,1),matrix(1,10000,1)) # hardcoded labels reproducing the oversampled counts for display
table(over.label)

We are concerned here more with the comparative performance with and without oversampling, and less with the absolute gain (which is subject to further parameter tuning). Keeping the number of epochs fixed, Figures 14 and 15 compare the F1 scores of the two models without and with oversampling, Figure 16 compares the losses of the two models, Figures 17 and 18 compare the confusion matrices of the two models without and with oversampling, and Figure 19 compares the ROC curves of the models. The final F1 scores of the two trained models, using the same validation set, are also shown below for comparison.

train.y <- dummy(train.label)
test.y <- dummy(test.label)
train.x <- array(train.sample, dim = c(dim(train.sample),1)) 
test.x <- array(test.sample, dim = c(dim(test.sample),1)) 
vali.y <- dummy(vali.label)
vali.x <- array(vali.sample, dim = c(dim(vali.sample),1)) 
over.y <- dummy(over.label)
over.x <- array(over.sample, dim = c(dim(over.sample),1)) 

model <- keras_model_sequential()
model %>%
  layer_lstm(10, input_shape = c(dim(train.x)[2], dim(train.x)[3])) %>%
  layer_dropout(rate = 0.2) %>% 
  layer_dense(dim(train.y)[2]) %>%
  layer_dropout(rate = 0.2) %>% 
  layer_activation("softmax")
history <- LossHistory$new()
model %>% compile(
  loss = "categorical_crossentropy", 
  optimizer = "adam",
  metrics = c("accuracy",'f1_score_0' = metric_f1_0, 'f1_score_1' = metric_f1_1)
)
lstm.before <- model %>% fit( 
  x = train.x, 
  y = train.y, 
  validation_data=list(vali.x,vali.y),
  callbacks = list(history),
  epochs = 50
)

model.over <- keras_model_sequential()
model.over %>%
  layer_lstm(10, input_shape = c(dim(over.x)[2], dim(over.x)[3])) %>%
  layer_dropout(rate = 0.1) %>% 
  layer_dense(dim(over.y)[2]) %>%
  layer_dropout(rate = 0.1) %>% 
  layer_activation("softmax")
history.over <- LossHistory$new()
model.over %>% compile(
  loss = "categorical_crossentropy", 
  optimizer = "adam",
  metrics = c("accuracy",'f1_score_0' = metric_f1_0, 'f1_score_1' = metric_f1_1)
)
lstm.after <- model.over %>% fit( 
  x = over.x, 
  y = over.y, 
  validation_data=list(vali.x,vali.y),
  callbacks = list(history.over),
  epochs = 50
)

pred.label <- model %>% predict_classes(test.x)
pred.label.over <- model.over %>% predict_classes(test.x)
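# Overwrite the in-session predictions and histories with the precomputed
# MHEALTH results downloaded earlier.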
pred.label <- as.vector(unlist(mhealth$pred.label)) 
history.val_f1_0s <- as.vector(unlist(mhealth$history.val_f1_0s)) 
history.val_f1_1s <- as.vector(unlist(mhealth$history.val_f1_1s)) 
history.losses <- as.vector(unlist(mhealth$history.losses)) 
pred.label.over <- as.vector(unlist(mhealth$pred.label.over)) 
history.over.val_f1_0s <- as.vector(unlist(mhealth$history.over.val_f1_0s)) 
history.over.val_f1_1s <- as.vector(unlist(mhealth$history.over.val_f1_1s)) 
history.over.losses <- as.vector(unlist(mhealth$history.over.losses)) 
plot(history.over.val_f1_1s, type = "b", pch = 19, col = "red", main = "F1 of the LSTM classifier on MHEALTH dataset", yaxt = "n", xaxt = "n", xlab = "Epochs", ylab = "", ylim = c(0.0, 1.0), xlim = c(0, 50))
lines(history.val_f1_1s, type = "b", pch = 0, col = "blue")
axis(1, at = c(0,10,20,30,40,50),labels = c(0,10,20,30,40,50), las = 1)
axis(2, at = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), labels = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), las = 1)
mtext("F1", side = 2, las = 1, line = 3)
legend("topleft", legend = c("Balanced dataset", "Unbalanced dataset"), col = c("red","blue"), pch = c(19, 0), lty = c(NA, NA), bty = "n", lwd = 2, cex = 0.7)
plot(history.over.val_f1_0s, type = "b", pch = 19, col = "red", main = "F1 of the LSTM classifier on MHEALTH dataset", yaxt = "n", xaxt = "n", xlab = "Epochs", ylab = "", ylim = c(0.0, 1.0), xlim = c(0, 50))
lines(history.val_f1_0s, type = "b", pch = 0, col = "blue")
axis(1, at = c(0,10,20,30,40,50),labels = c(0,10,20,30,40,50), las = 1)
axis(2, at = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), labels = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), las = 1)
mtext("F1", side = 2, las = 1, line = 3)
legend("bottomright", legend = c("Balanced dataset", "Unbalanced dataset"), col = c("red","blue"), pch = c(19, 0), lty = c(NA, NA), bty = "n", lwd = 2, cex = 0.7)
plot(history.losses, type = "b", pch = 0, col = "blue", main = "Loss of the LSTM classifier on MHEALTH dataset", yaxt = "n", xaxt = "n", xlab = "Epochs", ylab = "", ylim = c(0, 1), xlim = c(0, 50))
lines(history.over.losses, type = "b", pch = 19, col = "red")
axis(1, at = c(0,10,20,30,40,50),labels = c(0,10,20,30,40,50), las = 1)
axis(2, at = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), labels = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), las = 1)
mtext("Loss", side = 2, las = 1, line = 2)
legend("topleft", legend = c("Balanced dataset", "Unbalanced dataset"), col = c("red","blue"), pch = c(19, 0), lty = c(NA, NA), bty = "n", lwd = 2, cex = 0.7)
cm.before <- table(test.label, pred.label)
cm.after <- table(test.label, pred.label.over)
res <- as.numeric(cm.before)
for (i in 1:4){
  res[i][is.na(res[i])] <- 0
}
f1_1 <- 2*res[4]/(2*res[4]+res[2]+res[3])
f1_0 <- 2*res[1]/(2*res[1]+res[2]+res[3])
cat("The class 1 F1 score without oversampling: ", f1_1)
cat("The class 0 F1 score without oversampling: ", f1_0)
res <- as.numeric(cm.after)
for (i in 1:4){
  res[i][is.na(res[i])] <- 0
}
f1_1 <- 2*res[4]/(2*res[4]+res[2]+res[3])
f1_0 <- 2*res[1]/(2*res[1]+res[2]+res[3])
cat("The class 1 F1 score with oversampling: ", f1_1)
cat("The class 0 F1 score with oversampling: ", f1_0)
layout(matrix(c(1,1,1)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')

rect(150, 430, 240, 370, col='#3F97D0')
rect(250, 430, 340, 370, col='#F7AD50')
rect(150, 305, 240, 365, col='#F7AD50')
rect(250, 305, 340, 365, col='#3F97D0')
text(195, 435, '0', cex=1.1)
text(295, 435, '1', cex=1.1)
text(125, 370, 'True', cex=1.2, srt=90, font=2)
text(245, 450, 'Predicted', cex=1.2, font=2)
text(140, 400, '0', cex=1.1, srt=90)
text(140, 335, '1', cex=1.1, srt=90)

res <- as.numeric(cm.before)
sum1 <- res[1] + res[3]
sum2 <- res[2] + res[4] 
text(195, 400, round(res[1]/sum1, 4), cex=1.3, font=2, col='white')
text(195, 335, round(res[2]/sum2, 4), cex=1.3, font=2, col='white')
text(295, 400, round(res[3]/sum1, 4), cex=1.3, font=2, col='white')
text(295, 335, round(res[4]/sum2, 4), cex=1.3, font=2, col='white')
layout(matrix(c(1,1,1)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')

rect(150, 430, 240, 370, col='#3F97D0')
rect(250, 430, 340, 370, col='#F7AD50')
rect(150, 305, 240, 365, col='#F7AD50')
rect(250, 305, 340, 365, col='#3F97D0')
text(195, 435, '0', cex=1.1)
text(295, 435, '1', cex=1.1)
text(125, 370, 'True', cex=1.2, srt=90, font=2)
text(245, 450, 'Predicted', cex=1.2, font=2)
text(140, 400, '0', cex=1.1, srt=90)
text(140, 335, '1', cex=1.1, srt=90)

res <- as.numeric(cm.after)
sum1 <- res[1] + res[3]
sum2 <- res[2] + res[4] 
text(195, 400, round(res[1]/sum1, 4), cex=1.3, font=2, col='white')
text(195, 335, round(res[2]/sum2, 4), cex=1.3, font=2, col='white')
text(295, 400, round(res[3]/sum1, 4), cex=1.3, font=2, col='white')
text(295, 335, round(res[4]/sum2, 4), cex=1.3, font=2, col='white')
library(pROC)
par(pty = "s")
plot.roc(as.vector(test.label), pred.label, legacy.axes = TRUE, col = "blue", print.auc = TRUE,  
         print.auc.cex= .8, xlab = 'False Positive Rate', ylab = 'True Positive Rate', lty = "dashed")
plot.roc(as.vector(test.label), pred.label.over, legacy.axes = TRUE, col = "red", print.auc = TRUE,   
         print.auc.y = .4, print.auc.cex= .8, add = TRUE)
legend("bottomright", legend=c("Before Oversampling", "After Oversampling"), 
       col=c("blue", "red"), lwd=2, cex= .6, lty = c("dashed", "solid"))

The high frequency trading dataset

The dataset Dataset_HFT has already been introduced in the Data loading & oversampling section. The purpose of this example is to demonstrate the application of oversampling to a large dataset consisting of 30,000 observations instead of 300. For control, the imbalance ratio of the dataset is configured to be the same as in the smaller dataset. We split the data into training, validation and testing sets in a ratio of 20:3:7.

HFT <- Dataset_HFT()

label <- HFT$y
sample <- HFT$x
train.label <- label[1:20000]
train.sample <- sample[1:20000, ]
test.label <- label[23001:30000]
test.sample <- sample[23001:30000, ]
vali.label <- label[20001:23000]
vali.sample <- sample[20001:23000, ]

The imbalance ratio of the training data is 1:48:1.

table(train.label)

After oversampling, the data is balanced.

MyData <- OSTSC(train.sample, train.label, parallel = FALSE)
over.sample <- MyData$sample
over.label <- MyData$label
table(over.label)
over.label <- rbind(matrix(-1,19617,1),matrix(0,19269,1),matrix(1,19652,1)) # hardcoded labels reproducing the oversampled counts for display
table(over.label)

We increase the number of epochs to 100. Figures 20, 21 and 22 compare the F1 scores of the two models without and with oversampling. Figure 23 compares the losses of the two models. Figures 24 and 25 compare the confusion matrices of the two models without and with oversampling. Figure 26 compares the ROC curves of the models. The final F1 scores of the two trained models, using the same validation set, are also shown below for comparison.

train.y <- dummy(train.label)
test.y <- dummy(test.label)
train.x <- array(train.sample, dim = c(dim(train.sample),1)) 
test.x <- array(test.sample, dim = c(dim(test.sample),1)) 
vali.y <- dummy(vali.label)
vali.x <- array(vali.sample, dim = c(dim(vali.sample),1)) 
over.y <- dummy(over.label)
over.x <- array(over.sample, dim = c(dim(over.sample),1)) 

model <- keras_model_sequential()
model %>%
  layer_lstm(10, input_shape = c(dim(train.x)[2], dim(train.x)[3])) %>%
  layer_dropout(rate = 0.1) %>% 
  layer_dense(dim(train.y)[2]) %>%
  layer_dropout(rate = 0.1) %>% 
  layer_activation("softmax")
history <- LossHistory$new()
model %>% compile(
  loss = "categorical_crossentropy", 
  optimizer = "adam",
  metrics = c("accuracy",'f1_score_0' = metric_f1_0, 'f1_score_1' = metric_f1_1, 
              'f1_score_2' = metric_f1_2)
)
lstm.before <- model %>% fit( 
  x = train.x, 
  y = train.y, 
  validation_data=list(vali.x,vali.y),
  callbacks = list(history),
  epochs = 100
)

model.over <- keras_model_sequential()
model.over %>%
  layer_lstm(10, input_shape = c(dim(over.x)[2], dim(over.x)[3])) %>%
  layer_dropout(rate = 0.1) %>% 
  layer_dense(dim(over.y)[2]) %>%
  layer_dropout(rate = 0.1) %>% 
  layer_activation("softmax")
history.over <- LossHistory$new()
model.over %>% compile(
  loss = "categorical_crossentropy", 
  optimizer = "adam",
  metrics = c("accuracy",'f1_score_0' = metric_f1_0, 'f1_score_1' = metric_f1_1, 
              'f1_score_2' = metric_f1_2)
)
lstm.after <- model.over %>% fit( 
    x = over.x, 
    y = over.y, 
    validation_data=list(vali.x,vali.y),
    callbacks = list(history.over),
    epochs = 100
)
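# Swap in the precomputed HFT results downloaded earlier; the class -1
# histories are stored under 'm1' and assigned to the *_2s plotting variables.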
pred.label <- as.vector(unlist(hft$pred.label)) 
history.val_f1_2s <- as.vector(unlist(hft$history.val_f1_m1s)) 
history.val_f1_0s <- as.vector(unlist(hft$history.val_f1_0s)) 
history.val_f1_1s <- as.vector(unlist(hft$history.val_f1_1s)) 
history.losses <- as.vector(unlist(hft$history.losses)) 
pred.label.over <- as.vector(unlist(hft$pred.label.over)) 
history.over.val_f1_2s <- as.vector(unlist(hft$history.over.val_f1_m1s)) 
history.over.val_f1_0s <- as.vector(unlist(hft$history.over.val_f1_0s)) 
history.over.val_f1_1s <- as.vector(unlist(hft$history.over.val_f1_1s)) 
history.over.losses <- as.vector(unlist(hft$history.over.losses)) 
plot(history.over.val_f1_1s, type = "b", pch = 19, col = "red", main = "F1 of the LSTM classifier on HFT dataset", yaxt = "n", xaxt = "n", xlab = "Epochs", ylab = "", ylim = c(0.0, 1.0), xlim = c(0, 100))
lines(history.val_f1_1s, type = "b", pch = 0, col = "blue")
axis(1, at = c(0,10,20,30,40,50,60,70,80,90,100),labels = c(0,10,20,30,40,50,60,70,80,90,100), las = 1)
axis(2, at = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), labels = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), las = 1)
mtext("F1", side = 2, las = 1, line = 3)
legend("topleft", legend = c("Balanced dataset", "Unbalanced dataset"), col = c("red","blue"), pch = c(19, 0), lty = c(NA, NA), bty = "n", lwd = 2, cex = 0.7)
plot(history.over.val_f1_0s, type = "b", pch = 19, col = "red", main = "F1 of the LSTM classifier on HFT dataset", yaxt = "n", xaxt = "n", xlab = "Epochs", ylab = "", ylim = c(0.0, 1.0), xlim = c(0, 100))
lines(history.val_f1_0s, type = "b", pch = 0, col = "blue")
axis(1, at = c(0,10,20,30,40,50,60,70,80,90,100),labels = c(0,10,20,30,40,50,60,70,80,90,100), las = 1)
axis(2, at = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), labels = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), las = 1)
mtext("F1", side = 2, las = 1, line = 3)
legend("bottomright", legend = c("Balanced dataset", "Unbalanced dataset"), col = c("red","blue"), pch = c(19, 0), lty = c(NA, NA), bty = "n", lwd = 2, cex = 0.7)
plot(history.over.val_f1_2s, type = "b", pch = 19, col = "red", main = "F1 of the LSTM classifier on HFT dataset", yaxt = "n", xaxt = "n", xlab = "Epochs", ylab = "", ylim = c(0.0, 1.0), xlim = c(0, 100))
lines(history.val_f1_2s, type = "b", pch = 0, col = "blue")
axis(1, at = c(0,10,20,30,40,50,60,70,80,90,100),labels = c(0,10,20,30,40,50,60,70,80,90,100), las = 1)
axis(2, at = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), labels = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1), las = 1)
mtext("F1", side = 2, las = 1, line = 3)
legend("topleft", legend = c("Balanced dataset", "Unbalanced dataset"), col = c("red","blue"), pch = c(19, 0), lty = c(NA, NA), bty = "n", lwd = 2, cex = 0.7)
plot(history.losses, type = "b", pch = 0, col = "blue", main = "Loss of the LSTM classifier on HFT dataset", yaxt = "n", xaxt = "n", xlab = "Epochs", ylab = "", ylim = c(0, 1.2), xlim = c(0, 100))
lines(history.over.losses, type = "b", pch = 19, col = "red")
axis(1, at = c(0,10,20,30,40,50,60,70,80,90,100),labels = c(0,10,20,30,40,50,60,70,80,90,100), las = 1)
axis(2, at = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1,1.1,1.2), labels = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1,1.1,1.2), las = 1)
mtext("Loss", side = 2, las = 1, line = 2)
legend("topright", legend = c("Balanced dataset", "Unbalanced dataset"), col = c("red","blue"), pch = c(19, 0), lty = c(NA, NA), bty = "n", lwd = 2, cex = 0.7)
pred.label <- model %>% predict_classes(test.x)
pred.label.over <- model.over %>% predict_classes(test.x)
cm.before <- table(test.label, pred.label)
cm.after <- table(test.label, pred.label.over)
res <- as.numeric(cm.before)
for (i in 1:9){
  res[i][is.na(res[i])] <- 0
}
f1_2 <- 2*res[9]/(2*res[9]+res[3]+res[6]+res[7]+res[8])
f1_1 <- 2*res[5]/(2*res[5]+res[2]+res[4]+res[6]+res[8])
f1_0 <- 2*res[1]/(2*res[1]+res[2]+res[3]+res[4]+res[7])
cat("The class 1 F1 score without oversampling: ", f1_2)
cat("The class 0 F1 score without oversampling: ", f1_1)
cat("The class -1 F1 score without oversampling: ", f1_0)
res <- as.numeric(cm.after)
for (i in 1:9){
  res[i][is.na(res[i])] <- 0
}
f1_2 <- 2*res[9]/(2*res[9]+res[3]+res[6]+res[7]+res[8])
f1_1 <- 2*res[5]/(2*res[5]+res[2]+res[4]+res[6]+res[8])
f1_0 <- 2*res[1]/(2*res[1]+res[2]+res[3]+res[4]+res[7])
cat("The class 1 F1 score with oversampling: ", f1_2)
cat("The class 0 F1 score with oversampling: ", f1_1)
cat("The class -1 F1 score with oversampling: ", f1_0)
cm.before <- table(test.label, pred.label)
cm.after <- table(test.label, pred.label.over)

layout(matrix(c(1,1,1)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')

rect(140, 430, 200, 390, col='#3F97D0')
rect(210, 430, 270, 390, col='#F7AD50')
rect(280, 430, 340, 390, col='#F7AD50')
rect(140, 345, 200, 385, col='#F7AD50')
rect(210, 345, 270, 385, col='#3F97D0')
rect(280, 345, 340, 385, col='#F7AD50')
rect(140, 300, 200, 340, col='#F7AD50')
rect(210, 300, 270, 340, col='#F7AD50')
rect(280, 300, 340, 340, col='#3F97D0')
text(170, 435, '-1', cex=1.1)
text(240, 435, '0', cex=1.1)
text(310, 435, '1', cex=1.1)
text(130, 410, '-1', cex=1.1, srt=90)
text(130, 365, '0', cex=1.1, srt=90)
text(130, 320, '1', cex=1.1, srt=90)
text(120, 370, 'True', cex=1.2, srt=90, font=2)
text(240, 450, 'Predicted', cex=1.2, font=2)

res <- as.numeric(cm.before)
sum1 <- res[1] + res[4] + res[7]
sum2 <- res[2] + res[5] + res[8]
sum3 <- res[3] + res[6] + res[9]
text(170, 410, round(res[1]/sum1, 4), cex=1.3, font=2, col='white')
text(170, 365, round(res[2]/sum2, 4), cex=1.3, font=2, col='white')
text(170, 320, round(res[3]/sum3, 4), cex=1.3, font=2, col='white')
text(240, 410, round(res[4]/sum1, 4), cex=1.3, font=2, col='white')
text(240, 365, round(res[5]/sum2, 4), cex=1.3, font=2, col='white')
text(240, 320, round(res[6]/sum3, 4), cex=1.3, font=2, col='white')
text(310, 410, round(res[7]/sum1, 4), cex=1.3, font=2, col='white')
text(310, 365, round(res[8]/sum2, 4), cex=1.3, font=2, col='white')
text(310, 320, round(res[9]/sum3, 4), cex=1.3, font=2, col='white')
layout(matrix(c(1,1,1)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')

rect(140, 430, 200, 390, col='#3F97D0')
rect(210, 430, 270, 390, col='#F7AD50')
rect(280, 430, 340, 390, col='#F7AD50')
rect(140, 345, 200, 385, col='#F7AD50')
rect(210, 345, 270, 385, col='#3F97D0')
rect(280, 345, 340, 385, col='#F7AD50')
rect(140, 300, 200, 340, col='#F7AD50')
rect(210, 300, 270, 340, col='#F7AD50')
rect(280, 300, 340, 340, col='#3F97D0')
text(170, 435, '-1', cex=1.1)
text(240, 435, '0', cex=1.1)
text(310, 435, '1', cex=1.1)
text(130, 410, '-1', cex=1.1, srt=90)
text(130, 365, '0', cex=1.1, srt=90)
text(130, 320, '1', cex=1.1, srt=90)
text(120, 370, 'True', cex=1.2, srt=90, font=2)
text(240, 450, 'Predicted', cex=1.2, font=2)

res <- as.numeric(cm.after)
sum1 <- res[1] + res[4] + res[7]
sum2 <- res[2] + res[5] + res[8]
sum3 <- res[3] + res[6] + res[9]
text(170, 410, round(res[1]/sum1, 4), cex=1.3, font=2, col='white')
text(170, 365, round(res[2]/sum2, 4), cex=1.3, font=2, col='white')
text(170, 320, round(res[3]/sum3, 4), cex=1.3, font=2, col='white')
text(240, 410, round(res[4]/sum1, 4), cex=1.3, font=2, col='white')
text(240, 365, round(res[5]/sum2, 4), cex=1.3, font=2, col='white')
text(240, 320, round(res[6]/sum3, 4), cex=1.3, font=2, col='white')
text(310, 410, round(res[7]/sum1, 4), cex=1.3, font=2, col='white')
text(310, 365, round(res[8]/sum2, 4), cex=1.3, font=2, col='white')
text(310, 320, round(res[9]/sum3, 4), cex=1.3, font=2, col='white')
library(pROC)
par(pty = "s")
plot.roc(as.vector(test.label), pred.label, legacy.axes = TRUE, col = "blue", print.auc = TRUE,  
         print.auc.cex= .8, xlab = 'False Positive Rate', ylab = 'True Positive Rate', lty = "dashed")
plot.roc(as.vector(test.label), pred.label.over, legacy.axes = TRUE, col = "red", print.auc = TRUE,   
         print.auc.y = .4, print.auc.cex= .8, add = TRUE)
legend("bottomright", legend=c("Before Oversampling", "After Oversampling"), 
       col=c("blue", "red"), lwd=2, cex= .6, lty = c("dashed", "solid"))

The comparative results are similar to those for the MHEALTH dataset: oversampling improves the performance, and the comparative gain from using OSTSC only increases with more training observations and more epochs.

Summary

The OSTSC package is a powerful oversampling approach for classifying univariate, but multinomial, time series data. This vignette provides a brief overview of the oversampling methodology implemented by the package. We first provide three examples for the user to verify correct package installation and reproducibility of the results. Using a 'TensorFlow' implementation of an LSTM architecture, we then compared the classifier performance with and without oversampling on two medium size datasets, which demonstrate the performance gains from using OSTSC without requiring significant computation. Finally, two large datasets were evaluated to demonstrate the scalability of the package. The examples serve to demonstrate that the OSTSC package improves the performance of RNN classifiers applied to highly imbalanced time series data.

References

#read.bibtex(file = "referenceOSTSC.bib")

