Time Series Data Wrangling"

    message = FALSE,
    warning = FALSE,
    fig.width = 8, 
    fig.height = 4.5,
    fig.align = 'center',
    dpi = 100

# devtools::load_all() # Travis CI fails on load_all()

Time series data wrangling is an essential skill for any forecaster. timetk includes the essential data wrangling tools. In this tutorial, we'll cover:

Additional concepts covered:


Load the following libraries.



This tutorial will use the FANG dataset:


The adjusted column contains the adjusted closing prices for each day.

FANG %>%
  group_by(symbol) %>%
  plot_time_series(date, adjusted, .facet_ncol = 2, .interactive = FALSE)

The volume column contains the trade volume (number of times the stock was transacted) for the day.

FANG %>%
  group_by(symbol) %>%
  plot_time_series(date, volume, .facet_ncol = 2, .interactive = FALSE)

Summarize by Time

summarise_by_time() aggregates by a period. It's great for:

Period Summarization

Objective: Get the total trade volume by quarter

FANG %>%
  group_by(symbol) %>%
    date, .by = "quarter",
    volume = SUM(volume)
  ) %>%
  plot_time_series(date, volume, .facet_ncol = 2, .interactive = FALSE, .y_intercept = 0)

Period Smoothing

Objective: Get the first value in each month

FANG %>%
  group_by(symbol) %>%
    date, .by = "month",
    adjusted = FIRST(adjusted)
  ) %>%
  plot_time_series(date, adjusted, .facet_ncol = 2, .interactive = FALSE)

Filter By Time

Used to quickly filter a continuous time range.

Time Range Filtering

Objective: Get the adjusted stock prices in the 3rd quarter of 2013.

FANG %>%
  group_by(symbol) %>%
  filter_by_time(date, "2013-09", "2013") %>%
  plot_time_series(date, adjusted, .facet_ncol = 2, .interactive = FALSE)

Padding Data

Used to fill in (pad) gaps and to go from from low frequency to high frequency. This function uses the awesome padr library for filling and expanding timestamps.

Fill in Gaps

Objective: Make an irregular series regular.

FANG %>%
  group_by(symbol) %>%
  pad_by_time(date, .by = "auto") # Guesses .by = "day"

Low to High Frequency

Objective: Go from Daily to Hourly timestamp intervals for 1 month from the start date. Impute the missing values.

FANG %>%
  group_by(symbol) %>%
  pad_by_time(date, .by = "hour") %>%
  mutate_at(vars(open:adjusted), .funs = ts_impute_vec, period = 1) %>%
  filter_by_time(date, "start", FIRST(date) %+time% "1 month") %>%
  plot_time_series(date, adjusted, .facet_ncol = 2, .interactive = FALSE) 

Sliding (Rolling) Calculations

We have a new function, slidify() that turns any function into a sliding (rolling) window function. It takes concepts from tibbletime::rollify() and it improves them with the R package slider.

Rolling Mean

Objective: Calculate a "centered" simple rolling average with partial window rolling and the start and end windows.

# Make the rolling function
roll_avg_30 <- slidify(.f = AVERAGE, .period = 30, .align = "center", .partial = TRUE)

# Apply the rolling function
FANG %>%
  select(symbol, date, adjusted) %>%
  group_by(symbol) %>%
  # Apply Sliding Function
  mutate(rolling_avg_30 = roll_avg_30(adjusted)) %>%
  pivot_longer(cols = c(adjusted, rolling_avg_30)) %>%
  plot_time_series(date, value, .color_var = name,
                   .facet_ncol = 2, .smooth = FALSE, 
                   .interactive = FALSE)

For simple rolling calculations (rolling average), we can accomplish this operation faster with slidify_vec() - A vectorized rolling function for simple summary rolls (e.g. mean(), sd(), sum(), etc)

FANG %>%
  select(symbol, date, adjusted) %>%
  group_by(symbol) %>%
  # Apply roll apply Function
  mutate(rolling_avg_30 = slidify_vec(adjusted,  ~ AVERAGE(.), 
                                      .period = 30, .partial = TRUE))

Rolling Regression

Objective: Calculate a rolling regression.

# Rolling regressions are easy to implement using `.unlist = FALSE`
lm_roll <- slidify(~ lm(..1 ~ ..2 + ..3), .period = 90, 
                   .unlist = FALSE, .align = "right")

FANG %>%
  select(symbol, date, adjusted, volume) %>%
  group_by(symbol) %>%
  mutate(numeric_date = as.numeric(date)) %>%
  # Apply rolling regression
  mutate(rolling_lm = lm_roll(adjusted, volume, numeric_date)) %>%

Learning More

My Talk on High-Performance Time Series Forecasting

Time series is changing. Businesses now need 10,000+ time series forecasts every day.

High-Performance Forecasting Systems will save companies MILLIONS of dollars. Imagine what will happen to your career if you can provide your organization a "High-Performance Time Series Forecasting System" (HPTSF System).

I teach how to build a HPTFS System in my High-Performance Time Series Forecasting Course. If interested in learning Scalable High-Performance Forecasting Strategies then take my course. You will learn:

Unlock the High-Performance Time Series Forecasting Course

Try the timetk package in your browser

Any scripts or data that you put into this service are public.

timetk documentation built on April 7, 2022, 5:08 p.m.