Time series data wrangling is an essential skill for any forecaster. timetk includes the essential data wrangling tools. In this tutorial, we'll cover:

- summarise_by_time() - time-based aggregation
- filter_by_time() - filtering a continuous time range
- pad_by_time() - filling gaps and going from low to high frequency
- slidify() - turning any function into a sliding (rolling) window function
Additional concepts covered:

- The %+time% infix operation (see Padding Data: Low to High Frequency)
- plot_time_series() for all visualizations

Load the following libraries.
```r
library(dplyr)
library(tidyr)
library(timetk)
```
This tutorial will use the FANG dataset:

```r
FANG
```
The adjusted column contains the adjusted closing prices for each day.
```r
FANG %>%
  group_by(symbol) %>%
  plot_time_series(date, adjusted, .facet_ncol = 2, .interactive = FALSE)
```
The volume column contains the trade volume (number of times the stock was transacted) for the day.
```r
FANG %>%
  group_by(symbol) %>%
  plot_time_series(date, volume, .facet_ncol = 2, .interactive = FALSE)
```
summarise_by_time() aggregates by a period. It's great for:

- Period aggregation, e.g. sum()
- Period smoothing, e.g. mean(), first(), last()
Objective: Get the total trade volume by quarter.

- Use sum() as the aggregating function
- Aggregate using .by = "quarter"

```r
FANG %>%
  group_by(symbol) %>%
  summarise_by_time(
    date, .by = "quarter",
    volume = sum(volume)
  ) %>%
  plot_time_series(date, volume, .facet_ncol = 2, .interactive = FALSE, .y_intercept = 0)
```
Objective: Get the first value in each month.

- We can use first() to get the first value, which has the effect of reducing the data (i.e. smoothing). We could use mean() or median() instead.
- Use .by = "month" to aggregate by month.

```r
FANG %>%
  group_by(symbol) %>%
  summarise_by_time(
    date, .by = "month",
    adjusted = first(adjusted)
  ) %>%
  plot_time_series(date, adjusted, .facet_ncol = 2, .interactive = FALSE)
```
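If smoothing rather than taking the first value is preferred, the same call works with mean(). This is a minimal sketch (not part of the original walkthrough) of a monthly average of the adjusted prices:

```r
# Sketch: monthly smoothing with mean() instead of first()
FANG %>%
  group_by(symbol) %>%
  summarise_by_time(
    date, .by = "month",
    adjusted = mean(adjusted)
  ) %>%
  plot_time_series(date, adjusted, .facet_ncol = 2, .interactive = FALSE)
```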
filter_by_time() is used to quickly filter a continuous time range.
Objective: Get the adjusted stock prices from September 2013 through the end of 2013.

- .start_date = "2013-09": Converts to "2013-09-01"
- .end_date = "2013": Converts to "2013-12-31"
- More advanced filtering using %+time% and %-time% is shown in "Padding Data: Low to High Frequency".

```r
FANG %>%
  group_by(symbol) %>%
  filter_by_time(date, "2013-09", "2013") %>%
  plot_time_series(date, adjusted, .facet_ncol = 2, .interactive = FALSE)
```
pad_by_time() is used to fill in (pad) gaps and to go from low frequency to high frequency. This function uses the awesome padr library for filling and expanding timestamps.
Objective: Make an irregular series regular.
- The padded timestamps are filled with NA values by default.
- We can supply a fill value with .pad_value, or we can impute the missing values using a function like ts_impute_vec() (shown next).

```r
FANG %>%
  group_by(symbol) %>%
  pad_by_time(date, .by = "auto") # Guesses .by = "day"
```
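As a hedged alternative to leaving NA values, .pad_value can fill the padded rows directly. A minimal sketch that pads the volume column with zeros (zero trades on non-trading days); selecting only volume avoids zero-filling the price columns:

```r
# Sketch: fill padded rows with 0 instead of NA using .pad_value
FANG %>%
  select(symbol, date, volume) %>%
  group_by(symbol) %>%
  pad_by_time(date, .by = "day", .pad_value = 0)
```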
Objective: Go from Daily to Hourly timestamp intervals for 1 month from the start date. Impute the missing values.
.by = "hour"
pads from daily to hourlyts_impute_vec()
, which performs linear interpolation when period = 1
.FIRST(date) %+time% "1 month"
: Selecting the first date in the sequence then using a special infix operation, %+time%
, called "add time". In this case I add "1 month". FANG %>% group_by(symbol) %>% pad_by_time(date, .by = "hour") %>% mutate_at(vars(open:adjusted), .funs = ts_impute_vec, period = 1) %>% filter_by_time(date, "start", first(date) %+time% "1 month") %>% plot_time_series(date, adjusted, .facet_ncol = 2, .interactive = FALSE)
We have a new function, slidify(), that turns any function into a sliding (rolling) window function. It takes concepts from tibbletime::rollify() and improves them with the R package slider.
Objective: Calculate a "centered" simple rolling average with partial window rolling and the start and end windows.
slidify() turns the mean() function into a rolling average.

```r
# Make the rolling function
roll_avg_30 <- slidify(.f = mean, .period = 30, .align = "center", .partial = TRUE)

# Apply the rolling function
FANG %>%
  select(symbol, date, adjusted) %>%
  group_by(symbol) %>%
  # Apply Sliding Function
  mutate(rolling_avg_30 = roll_avg_30(adjusted)) %>%
  pivot_longer(cols = c(adjusted, rolling_avg_30)) %>%
  plot_time_series(date, value, .color_var = name,
                   .facet_ncol = 2, .smooth = FALSE, .interactive = FALSE)
```
For simple rolling calculations (e.g. a rolling average), we can accomplish this operation faster with slidify_vec(), a vectorized rolling function for simple summary rolls (e.g. mean(), sd(), sum(), etc.).

```r
FANG %>%
  select(symbol, date, adjusted) %>%
  group_by(symbol) %>%
  # Apply vectorized roll apply function
  mutate(rolling_avg_30 = slidify_vec(adjusted, ~ mean(.), .period = 30, .partial = TRUE))
```
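The same vectorized pattern extends to other summary functions. A minimal sketch of a rolling 30-day standard deviation (a rough volatility measure, not from the original tutorial):

```r
# Sketch: rolling 30-day standard deviation with slidify_vec()
FANG %>%
  select(symbol, date, adjusted) %>%
  group_by(symbol) %>%
  mutate(rolling_sd_30 = slidify_vec(adjusted, ~ sd(.), .period = 30, .partial = TRUE))
```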
Objective: Calculate a rolling regression.
- slidify() is built for this kind of multi-column rolling calculation.
- Use the multi-variable purrr ..1, ..2, ..3, etc. notation to set up the function.

```r
# Rolling regressions are easy to implement using `.unlist = FALSE`
lm_roll <- slidify(~ lm(..1 ~ ..2 + ..3), .period = 90, .unlist = FALSE, .align = "right")

FANG %>%
  select(symbol, date, adjusted, volume) %>%
  group_by(symbol) %>%
  mutate(numeric_date = as.numeric(date)) %>%
  # Apply rolling regression
  mutate(rolling_lm = lm_roll(adjusted, volume, numeric_date)) %>%
  filter(!is.na(rolling_lm))
```
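Because .unlist = FALSE returns a list-column of fitted lm objects, individual coefficients can be pulled out afterward. A sketch building on lm_roll defined above, assuming purrr is available; the second coefficient is the slope on the second argument passed to lm_roll (volume):

```r
# Sketch: extract the volume coefficient from each rolling regression
library(purrr)

FANG %>%
  select(symbol, date, adjusted, volume) %>%
  group_by(symbol) %>%
  mutate(numeric_date = as.numeric(date)) %>%
  mutate(rolling_lm = lm_roll(adjusted, volume, numeric_date)) %>%
  filter(!is.na(rolling_lm)) %>%
  # coef(.x)[2] is the coefficient on the ..2 term (volume)
  mutate(volume_coef = map_dbl(rolling_lm, ~ coef(.x)[2]))
```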
My Talk on High-Performance Time Series Forecasting
Time series is changing. Businesses now need 10,000+ time series forecasts every day.
High-Performance Forecasting Systems will save companies MILLIONS of dollars. Imagine what will happen to your career if you can provide your organization a "High-Performance Time Series Forecasting System" (HPTSF System).
I teach how to build a HPTSF System in my High-Performance Time Series Forecasting Course. If you're interested in learning Scalable High-Performance Forecasting Strategies, then take my course. You will learn:
- Modeltime - 30+ models (Prophet, ARIMA, XGBoost, Random Forest, & many more)
- GluonTS (Competition Winners)

Unlock the High-Performance Time Series Forecasting Course