In vivienroussez/autoTS: Automatic Model Selection and Prediction for Univariate Time Series

knitr::opts_chunk$set(warning = F,message = F,fig.width = 8,fig.height = 5)
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(lubridate))
library(autoTS)

Introduction

What does this package do ?

The autoTS package provides a high level interface for univariate time series predictions. It implements many algorithms, most of them provided by the forecast package. The main goals of the package are :

Simplify the preparation of the time series ;
Train the algorithms and compare their results, to chose the best one ;
Gather the results in a final tidy dataframe

What are the inputs ?

The package is designed to work on one time series at a time. Parallel calculations can be put on top of it (see example below). The user has to provide 2 simple vectors :

One with the dates (s.t. the lubridate package can parse them)
The second with the corresponding values

Warnings

This package implements each algorithm with a unique parametrization, meaning that the user cannot tweak the algorithms (eg modify SARIMA specfic parameters).

Exemple on real-world data

For this example, we will use the GDP quarterly data of the european countries provided by eurostat. The database can be downloaded from this page and then chose "GDP and main components (output, expenditure and income) (namq_10_gdp)" and then adjust the time dimension to select all available data and download as a csv file with the correct formatting (1 234.56). The csv is in the "Data" folder of this notebook.

tmp_dir <- tempdir() %>% normalizePath()
  unzip(zipfile = "../inst/extdata/namq_10_gdp.zip",exdir = tmp_dir)
dat <- read.csv(paste0(tmp_dir,"/namq_10_gdp_1_Data.csv"))
file.remove(paste0(tmp_dir,"/namq_10_gdp_1_Data.csv"),paste0(tmp_dir,"/namq_10_gdp_Label.csv"))
str(dat)
head(dat)

Data preparation

First, we have to clean the data (not too ugly though). First thing is to convert the TIME column into a well known date format that lubridate can handle. In this example, the yq function can parse the date without modification of the column. Then, we have to remove the blank in the values that separates thousands... Finally, we only keep data since 2000 and the unadjusted series in current prices.

After that, we should get one time series per country

dat <- mutate(dat,dates=yq(as.character(TIME)),
              values = as.numeric(stringr::str_remove(Value," "))) %>% 
  filter(year(dates)>=2000 & 
           S_ADJ=="Unadjusted data (i.e. neither seasonally adjusted nor calendar adjusted data)" &
           UNIT == "Current prices, million euro")

filter(dat,GEO %in% c("France","Austria")) %>% 
  ggplot(aes(dates,values,color=GEO)) + geom_line() + theme_minimal() +
  labs(title="GDP of (completely) random countries")

Now we're good to go !

Prediction on a random country

Let's see how to use the package on one time series :

Extract dates and values of the time series you want to work on
Create the object containing all you need afterwards
Train algo and determine which one is the best (over the last known year)
Implement the best algorithm on full data

ex1 <- filter(dat,GEO=="France") 
preparedTS <- prepare.ts(ex1$dates,ex1$values,"quarter")

## What is in this new object ?
str(preparedTS)
plot.ts(preparedTS$obj.ts)
ggplot(preparedTS$obj.df,aes(dates,val)) + geom_line() + theme_minimal()

Get the best algorithm for this time series :

## What is the best model for prediction ?
best.algo <- getBestModel(ex1$dates,ex1$values,"quarter",graph = F)
names(best.algo)
print(paste("The best algorithm is",best.algo$best))
best.algo$graph.train

You find in the result of this function :

The name of the best model
The errors of each algorithm on the test set
The graphic of the train step
The prepared time series
The list of used algorithm (that you can customize)

The result of this function can be used as direct input of the my.prediction function

## Build the predictions
final.pred <- my.predictions(bestmod = best.algo)
tail(final.pred,24)
ggplot(final.pred) + geom_line(aes(dates,actual.value),color="black") + 
  geom_line(aes_string("dates",stringr::str_remove(best.algo$best,"my."),linetype="type"),color="red") +
  theme_minimal()

Not too bad, right ?

Scaling predictions

Let's say we want to make a prediction for each country in the same time and be the fastest possible $\rightarrow$ let's combine the package's functions with parallel computing. We have to reshape the data to get one column per country and then iterate over the columns of the data frame.

Prepare data

suppressPackageStartupMessages(library(tidyr))
dat.wide <- select(dat,GEO,dates,values) %>% 
  group_by(dates) %>% 
  spread(key = "GEO",value = "values")
head(dat.wide)

Compute bulk predictions

Note : The following code is not executed for this vignette but does work (you can try it at home)

library(doParallel)
pipeline <- function(dates,values)
{
  pred <- getBestModel(dates,values,"quarter",graph = F)  %>%
    my.predictions()
  return(pred)
}
doMC::registerDoMC(parallel::detectCores()-1) # parallel backend (for UNIX)

system.time({
  res <- foreach(ii=2:ncol(dat.wide),.packages = c("dplyr","autoTS")) %dopar%
  pipeline(dat.wide$dates,pull(dat.wide,ii))
})
names(res) <- colnames(dat.wide)[-1]
str(res)

There is no free lunch...

There is no best algorithm in general $\Rightarrow$ depends on the data ! Likewise, this is not executed in this vignette, but works if you want to replicate it.

sapply(res,function(xx) colnames(select(xx,-dates,-type,-actual.value)) ) %>% table()
sapply(res,function(xx) colnames(select(xx,-dates,-type,-actual.value)) )