    knitr::opts_chunk$set(
      message = FALSE,
      warning = FALSE,
      fig.width = 8,
      fig.height = 4.5,
      fig.align = 'center',
      out.width = '95%',
      dpi = 100
    )
    # devtools::load_all() # Travis CI fails on load_all()

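Time series data wrangling is an essential skill for any forecaster. timetk includes the essential data wrangling tools. In this tutorial, we'll cover:

- summarise_by_time() for time-based aggregations
- filter_by_time() for time-based filtering
- pad_by_time() for filling in gaps and going from low to high frequency
- slidify() for turning any function into a sliding (rolling) function

Additional concepts covered:
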
- The %+time% infix operation (see "Padding Data: Low to High Frequency")
- plot_time_series() for all visualizations

Load the following libraries.

    library(dplyr)
    library(tidyr)
    library(timetk)

This tutorial will use the FANG dataset:

    FANG

The adjusted column contains the adjusted closing prices for each day.

    FANG %>%
      group_by(symbol) %>%
      plot_time_series(date, adjusted, .facet_ncol = 2, .interactive = FALSE)

The volume column contains the trade volume (number of shares traded) for the day.

    FANG %>%
      group_by(symbol) %>%
      plot_time_series(date, volume, .facet_ncol = 2, .interactive = FALSE)

summarise_by_time() aggregates by a period. It's great for:

- Period aggregation with sum()
- Period smoothing with mean(), first(), last()

Objective: Get the total trade volume by quarter.

- Use sum() to total the volume.
- Aggregate using .by = "quarter".

    FANG %>%
      group_by(symbol) %>%
      summarise_by_time(
        date,
        .by    = "quarter",
        volume = sum(volume)
      ) %>%
      plot_time_series(date, volume, .facet_ncol = 2, .interactive = FALSE, .y_intercept = 0)

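To see what the aggregation does on data small enough to check by hand, here is a minimal sketch on a made-up series. The `toy` tibble and its `value` column are hypothetical and only serve the illustration; the same idea applies to `.by = "quarter"`.

```r
library(dplyr)
library(timetk)

# Hypothetical toy data: one observation per day for Jan-Mar 2020, each equal to 1
toy <- tibble(
  date  = seq(as.Date("2020-01-01"), as.Date("2020-03-31"), by = "day"),
  value = 1
)

# Sum the daily values within each calendar month
monthly <- toy %>%
  summarise_by_time(date, .by = "month", total = sum(value))

monthly
```

Because every day contributes 1, each monthly total is simply the number of days in that month, which makes the grouping easy to verify.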
Objective: Get the first value in each month.

- We can use first() to get the first value, which has the effect of reducing the data (i.e. smoothing). We could also use mean() or median().
- Use .by = "month" to aggregate by month.

    FANG %>%
      group_by(symbol) %>%
      summarise_by_time(
        date,
        .by      = "month",
        adjusted = first(adjusted)
      ) %>%
      plot_time_series(date, adjusted, .facet_ncol = 2, .interactive = FALSE)

filter_by_time() quickly filters a continuous time range.

Objective: Get the adjusted stock prices from September 2013 through the end of 2013.

- .start_date = "2013-09": Converts to "2013-09-01".
- .end_date = "2013": Converts to "2013-12-31".
- A more advanced example of filtering using %+time% and %-time% is shown in "Padding Data: Low to High Frequency".

    FANG %>%
      group_by(symbol) %>%
      filter_by_time(date, "2013-09", "2013") %>%
      plot_time_series(date, adjusted, .facet_ncol = 2, .interactive = FALSE)

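The date shorthand can be checked on a small made-up series. This is a sketch only: the `toy` tibble is hypothetical, and the point is simply to inspect which dates survive the filter.

```r
library(dplyr)
library(timetk)

# Hypothetical toy data: every calendar day in 2013 and 2014
toy <- tibble(
  date = seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by = "day")
)

# "2013-09" is read as the start of September 2013,
# "2013" as the end of the year 2013
kept <- toy %>%
  filter_by_time(date, "2013-09", "2013")

range(kept$date)
```

Inspecting the range of the remaining dates is a quick way to confirm that a partial date string expanded the way we intended.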
pad_by_time() fills in (pads) gaps and goes from low frequency to high frequency. This function uses the awesome padr library for filling and expanding timestamps.
Objective: Make an irregular series regular.

- We will leave padded values as NA.
- We can add a value using .pad_value, or we can impute using a function like ts_impute_vec() (shown next).

    FANG %>%
      group_by(symbol) %>%
      pad_by_time(date, .by = "auto") # Guesses .by = "day"

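The two padding options above can be compared on a tiny made-up series with a visible gap. The `gappy` tibble is hypothetical, and `.by = "day"` is set explicitly here rather than relying on `"auto"`.

```r
library(dplyr)
library(timetk)

# Hypothetical toy data: Jan 3 and Jan 4 are missing
gappy <- tibble(
  date  = as.Date(c("2021-01-01", "2021-01-02", "2021-01-05")),
  value = c(10, 20, 50)
)

# Default: the inserted rows get NA
padded_na <- gappy %>%
  pad_by_time(date, .by = "day")

# Alternative: fill the inserted rows with a constant
padded_zero <- gappy %>%
  pad_by_time(date, .by = "day", .pad_value = 0)

padded_na
padded_zero
```

Leaving the padded rows as NA keeps the gaps visible for later imputation, while a constant such as 0 only makes sense when a missing row really means "nothing happened".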
Objective: Go from daily to hourly timestamp intervals for 1 month from the start date. Impute the missing values.

- .by = "hour" pads from daily to hourly.
- Imputation of the hourly data uses ts_impute_vec(), which performs linear interpolation when period = 1.
- Filtering uses "start", a keyword for the start of the series, and first(date) %+time% "1 month": selecting the first date in the sequence, then using a special infix operation, %+time%, called "add time". In this case I add "1 month".

    FANG %>%
      group_by(symbol) %>%
      pad_by_time(date, .by = "hour") %>%
      mutate_at(vars(open:adjusted), .funs = ts_impute_vec, period = 1) %>%
      filter_by_time(date, "start", first(date) %+time% "1 month") %>%
      plot_time_series(date, adjusted, .facet_ncol = 2, .interactive = FALSE)

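The two helpers used in this pipeline can also be tried in isolation. The sketch below uses made-up inputs (`start` and `x` are hypothetical) to show "add time" on a single date and imputation on a short vector with gaps.

```r
library(timetk)

# "Add time": shift a date forward by a period written as text
start           <- as.Date("2013-01-02")
one_month_later <- start %+time% "1 month"
one_month_later

# Imputation with period = 1 on a hypothetical vector with missing values
x      <- c(1, 2, NA, 4, 5, NA, NA, 8)
filled <- ts_impute_vec(x, period = 1)
filled
```

Since the known values of `x` lie on a straight line, linear interpolation should place the filled values on that same line, which makes the result easy to eyeball.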
We have a new function, slidify(), that turns any function into a sliding (rolling) window function. It takes concepts from tibbletime::rollify() and improves them with the R package slider.
Objective: Calculate a "centered" simple rolling average with partial windows at the start and end of the series.

- slidify() turns the mean() function into a rolling average.

    # Make the rolling function
    roll_avg_30 <- slidify(.f = mean, .period = 30, .align = "center", .partial = TRUE)

    # Apply the rolling function
    FANG %>%
      select(symbol, date, adjusted) %>%
      group_by(symbol) %>%
      # Apply Sliding Function
      mutate(rolling_avg_30 = roll_avg_30(adjusted)) %>%
      tidyr::pivot_longer(cols = c(adjusted, rolling_avg_30)) %>%
      plot_time_series(date, value, .color_var = name,
                       .facet_ncol = 2, .smooth = FALSE,
                       .interactive = FALSE)

For simple rolling calculations (such as a rolling average), we can accomplish this operation faster with slidify_vec(), a vectorized rolling function for simple summary rolls (e.g. mean(), sd(), sum(), etc.).

    FANG %>%
      select(symbol, date, adjusted) %>%
      group_by(symbol) %>%
      # Apply roll apply Function
      mutate(rolling_avg_30 = slidify_vec(adjusted, ~ mean(.), .period = 30, .partial = TRUE))

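The effect of partial windows is easiest to see on a vector short enough to compute by hand. This is a sketch on a made-up vector `x`, using a centered window of 3 rather than the 30 used above.

```r
library(timetk)

# Hypothetical toy vector
x <- c(1, 2, 3, 4, 5)

# Complete windows only: the first and last positions lack a full
# 3-value window, so they come back as NA
strict <- slidify_vec(x, mean, .period = 3, .align = "center", .partial = FALSE)

# Partial windows allowed: the edges are averaged over the values available
partial <- slidify_vec(x, mean, .period = 3, .align = "center", .partial = TRUE)

strict
partial
```

With `.partial = TRUE` the output has no leading or trailing NA values, at the cost of edge values that are averaged over fewer observations than the rest.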
Objective: Calculate a rolling regression.

- slidify() is built for this.
- Use the purrr-style ..1, ..2, ..3, etc. notation to set up a function.

    # Rolling regressions are easy to implement using `.unlist = FALSE`
    lm_roll <- slidify(~ lm(..1 ~ ..2 + ..3), .period = 90,
                       .unlist = FALSE, .align = "right")

    FANG %>%
      select(symbol, date, adjusted, volume) %>%
      group_by(symbol) %>%
      mutate(numeric_date = as.numeric(date)) %>%
      # Apply rolling regression
      mutate(rolling_lm = lm_roll(adjusted, volume, numeric_date)) %>%
      filter(!is.na(rolling_lm))

My Talk on High-Performance Time Series Forecasting
Time series is changing. Businesses now need 10,000+ time series forecasts every day.
High-Performance Forecasting Systems will save companies MILLIONS of dollars. Imagine what will happen to your career if you can provide your organization a "High-Performance Time Series Forecasting System" (HPTSF System).
I teach how to build an HPTSF System in my High-Performance Time Series Forecasting Course. If you are interested in learning scalable high-performance forecasting strategies, take my course. You will learn:

- Modeltime - 30+ Models (Prophet, ARIMA, XGBoost, Random Forest, & many more)
- GluonTS (Competition Winners)

Unlock the High-Performance Time Series Forecasting Course