In krzjoa/torchts: Time series Models with torch

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(dplyr, warn.conflicts = FALSE)
library(torch)
library(torchts)

A look at `tiny_m5`

If you practice time series modeling, you have probably heard about the M5 challenge by Spyros Makridakis, hosted on Kaggle. It's an excellent dataset for experimenting with demand time series. The full dataset is very large, so we'll use a subset to demonstrate how to work with such data.

unique(tiny_m5$store_id)

ca_1_data <-
  tiny_m5 %>% 
  filter(store_id == "CA_1") %>% 
  select(item_id, store_id, date, value, wday,
         month, year, snap, sell_price) %>% 
  arrange(item_id, date)

ca_1_data %>% 
  group_by(item_id) %>% 
  summarise(n = n()) 

skimr::skim(ca_1_data)

For deep learning models for time series, we typically want to create a three-dimensional tensor. In the case analyzed here, the dimensions may represent:

item
time steps
features

The first dimension is "free": we can add an arbitrary number of items. The second one, the series length, may vary as well. However, for convenience, we'll keep all time series the same length. Otherwise, we'd have to use masking or split the dataset into multiple tensors. The size of the last dimension is guaranteed by the data.frame structure itself (each row has the same number of columns).
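
To make the (items, time steps, features) layout concrete, here is a minimal base R sketch that folds a long-format data.frame into a 3D array. The `toy` data and its values are hypothetical, and this is only an illustration of the idea, not how torchts implements the transformation.

```r
# Toy long-format data: 2 items x 3 time steps, 2 features (hypothetical values)
toy <- data.frame(
  item_id    = rep(c("A", "B"), each = 3),
  date       = rep(as.Date("2020-01-01") + 0:2, times = 2),
  value      = c(1, 0, 2, 0, 3, 1),
  sell_price = c(2.5, 2.5, 2.7, 1.0, 1.0, 1.2)
)

# Sort rows by item first, then by time
toy <- toy[order(toy$item_id, toy$date), ]

features <- c("value", "sell_price")
n_items  <- length(unique(toy$item_id))
n_steps  <- length(unique(toy$date))

# The feature matrix has (items * steps) rows in item-major order.
# Fill an array as (steps, items, features), then swap the first two dimensions.
arr <- array(as.matrix(toy[, features]),
             dim = c(n_steps, n_items, length(features)))
arr <- aperm(arr, c(2, 1, 3))

dim(arr)     # items, time steps, features
arr[1, , ]   # all time steps and features of item "A"
```

This only works because every item has the same number of rows and the same dates. An array like this can then be passed to `torch_tensor()`.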

As we can observe in the output of the skim function, some time series have missing data.

As mentioned before, for simplicity's sake, we can just select a subset of items with the same series length and the same first and last dates. In that case, we can be sure that our data are properly aligned in the tensor. Later, in a separate vignette, we'll dive into methods for handling missing or non-aligned multiple time series when training a deep learning model.
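
One possible way to pick such a subset is sketched below with dplyr. The `toy` data, the column name `len` and the rule "keep the largest group of items sharing the same length and date range" are assumptions made for illustration; the same pipeline could be applied to `ca_1_data`.

```r
library(dplyr, warn.conflicts = FALSE)

# Toy long-format data (hypothetical): item "C" starts one day late
toy <- data.frame(
  item_id = c(rep("A", 4), rep("B", 4), rep("C", 3)),
  date    = as.Date("2020-01-01") + c(0:3, 0:3, 1:3),
  value   = c(1, 0, 2, 1, 0, 0, 3, 1, 2, 2, 0)
)

# Length and date range of every series
series_info <- toy %>%
  group_by(item_id) %>%
  summarise(len = n(), first_date = min(date), last_date = max(date))

# Keep the items that share the most common (length, first date, last date) combination
aligned_ids <- series_info %>%
  add_count(len, first_date, last_date, name = "n_items") %>%
  filter(n_items == max(n_items)) %>%
  pull(item_id)

aligned <- toy %>% filter(item_id %in% aligned_ids)
```

Note that equal length and equal end points do not rule out gaps that happen to be offset by duplicated dates, so a stricter check would compare the full set of dates per item.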

as_tensor

head(ca_1_data)

The first column, item_id, identifies the item, and the second one (date) the current time step. These two columns will be used to create a data "fold", i.e. to form a 3D tensor. As mentioned above, the completeness of the time steps is crucial to obtaining a correct result from this transformation.

As for the rest of the columns:

value is a target we want to predict
wday is a categorical variable
month is a categorical variable
year can be treated as categorical, but in this case we may remove this variable and introduce a counter instead
snap is a categorical variable
sell_price is a real-valued variable
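
The idea of replacing year with a counter can be sketched as follows. The column name `time_idx` and the `toy` data are hypothetical; the counter simply numbers the time steps within each item.

```r
library(dplyr, warn.conflicts = FALSE)

# Toy data spanning a year boundary (hypothetical values)
toy <- data.frame(
  item_id = rep(c("A", "B"), each = 3),
  date    = rep(as.Date(c("2019-12-30", "2019-12-31", "2020-01-01")), times = 2),
  year    = rep(c(2019L, 2019L, 2020L), times = 2)
)

with_counter <- toy %>%
  arrange(item_id, date) %>%
  group_by(item_id) %>%
  mutate(time_idx = row_number()) %>%   # 1, 2, 3, ... within each item
  ungroup() %>%
  select(-year)

with_counter
```

Unlike year, such a counter keeps increasing for dates outside the training range, so it does not produce category levels the model has never seen.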

To summarize, we have three categorical variables, which need to be represented in some way. An efficient way to represent categorical variables in a neural network is an embedding layer.
In fact, it works similarly to a linear (dense) layer. The difference is that instead of performing a resource-consuming dot product between the weight matrix and a sparse one-hot encoded input matrix, we simply use an index to select the right row of the weight matrix.
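
The equivalence between the one-hot matrix product and a row lookup can be checked with a few lines of base R. The weight matrix `W` below is random and the sizes (7 levels, 3 embedding dimensions) are arbitrary choices for illustration.

```r
set.seed(42)

n_levels <- 7   # e.g. days of the week
emb_dim  <- 3   # embedding size (arbitrary)

# Weight matrix: one row per category level
W <- matrix(rnorm(n_levels * emb_dim), nrow = n_levels)

# A few categorical observations encoded as integer indices
wday <- c(1L, 3L, 7L, 3L)

# Dense-layer view: one-hot encode, then multiply by the weights
one_hot    <- diag(n_levels)[wday, , drop = FALSE]
via_matmul <- one_hot %*% W

# Embedding view: just pick the matching rows
via_lookup <- W[wday, , drop = FALSE]

all.equal(via_matmul, via_lookup)
```

Both paths give the same result, but the lookup skips building the one-hot matrix and the multiplication. In torch, a layer of this kind is available as `nn_embedding()`.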

Let's transform the tabular data into a tensor. A convenient way to do this is the as_tensor function. Its first argument (named .data) is the data.frame object we want to transform into a torch_tensor.

colnames(ca_1_data)

First, we'll select only item_id, date and the integer variables.

ca_1 <- 
  ca_1_data %>% 
  select(item_id, date, wday, month, year, snap)

ca_1_tensor <- 
  ca_1 %>% 
  as_tensor(item_id, date)

dim(ca_1_tensor)
class(ca_1_tensor)

ca_1_tensor <- 
  ca_1_data %>% 
  select(item_id, date, wday, month, year, snap, sell_price) %>% 
  mutate(across(where(is.integer), as.numeric)) %>% 
  as_tensor(item_id, date)

dim(ca_1_tensor)
class(ca_1_tensor)

as_ts_dataset

To speed up the development of torch models, the torchts package provides an easy-to-use as_ts_dataset method, a shortcut for creating a torch dataset from a data.frame. For now, keys such as item_id are not supported; this feature will be implemented in the near future. We'll demonstrate the function using the weather_pl dataset.

library(rsample)

suwalki_temp <-
  weather_pl %>%
  filter(station == "SWK") %>%
  select(date, tmax_daily, rr_type)

# Split into training and test sets
data_split <- initial_time_split(suwalki_temp)

train_ds <-
  training(data_split) %>%
  as_ts_dataset(tmax_daily ~ date + rr_type, timesteps = 20, horizon = 1)

train_ds[1]
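
To give an intuition for the timesteps and horizon arguments, here is a base R sketch of a sliding-window split. The helper `make_windows` is hypothetical, and the windowing rule (each sample uses `timesteps` consecutive observations as input and the next `horizon` observations as the target) is an assumption about the general technique, not a description of the torchts internals.

```r
# Hypothetical helper: cut a series into (input, target) windows
make_windows <- function(x, timesteps, horizon) {
  n_samples <- length(x) - timesteps - horizon + 1
  stopifnot(n_samples >= 1)
  lapply(seq_len(n_samples), function(i) {
    list(
      x = x[i:(i + timesteps - 1)],                          # input window
      y = x[(i + timesteps):(i + timesteps + horizon - 1)]   # values to predict
    )
  })
}

series  <- 1:25
windows <- make_windows(series, timesteps = 20, horizon = 1)

length(windows)   # 5 samples from a series of length 25
windows[[1]]$x    # 1, 2, ..., 20
windows[[1]]$y    # 21
```

Under this assumption, a series of length n yields n - timesteps - horizon + 1 samples.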

as_ts_dataloader

The quickest way to get the needed data-providing object is to call the as_ts_dataloader function. It can be used as follows.

train_dl <-
  training(data_split) %>%
  as_ts_dataloader(tmax_daily ~ date, timesteps = 20, horizon = 1, batch_size = 32)

train_dl

dataloader_next(dataloader_make_iter(train_dl))