knitr::opts_chunk$set( message = FALSE, warning = FALSE, fig.width = 8, fig.height = 4.5, fig.align = 'center', out.width='95%', dpi = 100 ) # devtools::load_all() # Travis CI fails on load_all()

**Time series data wrangling** is an essential skill for any forecaster. `timetk`

includes the essential data wrangling tools. In this tutorial, we'll cover:

**Summarise by Time**- For time-based aggregations**Filter by Time**- For complex time-based filtering**Pad by Time**- For filling in gaps and going from*low to high frequency***Slidify**- For turning any function into a sliding (rolling) function

Additional concepts covered:

**Imputation**- Needed for Padding (See Low to High Frequency)**Advanced Filtering**- Using the new add time`%+time`

infix operation (See*Padding Data: Low to High Frequency*)**Visualization**-`plot_time_series()`

for all visualizations

Load the following libraries.

library(dplyr) library(tidyr) library(timetk)

This tutorial will use the `FANG`

dataset:

- Daily
- Irregular (missing business holidays and weekends)
- 4 groups (FB, AMZN, NFLX, and GOOG).

FANG

The adjusted column contains the adjusted closing prices for each day.

FANG %>% group_by(symbol) %>% plot_time_series(date, adjusted, .facet_ncol = 2, .interactive = FALSE)

The volume column contains the trade volume (number of times the stock was transacted) for the day.

FANG %>% group_by(symbol) %>% plot_time_series(date, volume, .facet_ncol = 2, .interactive = FALSE)

`summarise_by_time()`

aggregates by a period. It's great for:

- Period Aggregation -
`sum()`

- Period Smoothing -
`mean()`

,`first()`

,`last()`

Objective: Get the total trade volume by quarter

- Use
`sum()`

- Aggregate using
`.by = "quarter"`

FANG %>% group_by(symbol) %>% summarise_by_time( date, .by = "quarter", volume = sum(volume) ) %>% plot_time_series(date, volume, .facet_ncol = 2, .interactive = FALSE, .y_intercept = 0)

Objective: Get the first value in each month

- We can use
`first()`

to get the first value, which has the effect of reducing the data (i.e. smoothing). We could use`mean()`

or`median()`

. - Use the summarization by time:
`.by = "month"`

to aggregate by month.

FANG %>% group_by(symbol) %>% summarise_by_time( date, .by = "month", adjusted = first(adjusted) ) %>% plot_time_series(date, adjusted, .facet_ncol = 2, .interactive = FALSE)

Used to quickly filter a continuous time range.

Objective: Get the adjusted stock prices in the 3rd quarter of 2013.

`.start_date = "2013-09"`

: Converts to "2013-09-01`.end_date = "2013"`

: Converts to "2013-12-31- A more advanced example of filtering using
`%+time`

and`%-time`

is shown in*"Padding Data: Low to High Frequency"*.

FANG %>% group_by(symbol) %>% filter_by_time(date, "2013-09", "2013") %>% plot_time_series(date, adjusted, .facet_ncol = 2, .interactive = FALSE)

Used to fill in (pad) gaps and to go from from low frequency to high frequency. This function uses the awesome `padr`

library for filling and expanding timestamps.

Objective: Make an irregular series regular.

- We will leave padded values as
`NA`

. - We can add a value using
`.pad_value`

or we can impute using a function like`ts_impute_vec()`

(shown next).

FANG %>% group_by(symbol) %>% pad_by_time(date, .by = "auto") # Guesses .by = "day"

Objective: Go from Daily to Hourly timestamp intervals for 1 month from the start date. Impute the missing values.

`.by = "hour"`

pads from daily to hourly- Imputation of hourly data is accomplished with
`ts_impute_vec()`

, which performs linear interpolation when`period = 1`

. - Filtering is accomplished using:
- "start": A special keyword that signals the start of a series
`FIRST(date) %+time% "1 month"`

: Selecting the first date in the sequence then using a special infix operation,`%+time%`

, called "add time". In this case I add "1 month".

FANG %>% group_by(symbol) %>% pad_by_time(date, .by = "hour") %>% mutate_at(vars(open:adjusted), .funs = ts_impute_vec, period = 1) %>% filter_by_time(date, "start", first(date) %+time% "1 month") %>% plot_time_series(date, adjusted, .facet_ncol = 2, .interactive = FALSE)

We have a new function, `slidify()`

that turns any function into a sliding (rolling) window function. It takes concepts from `tibbletime::rollify()`

and it improves them with the R package `slider`

.

Objective: Calculate a "centered" simple rolling average with partial window rolling and the start and end windows.

`slidify()`

turns the`mean()`

function into a rolling average.

# Make the rolling function roll_avg_30 <- slidify(.f = mean, .period = 30, .align = "center", .partial = TRUE) # Apply the rolling function FANG %>% select(symbol, date, adjusted) %>% group_by(symbol) %>% # Apply Sliding Function mutate(rolling_avg_30 = roll_avg_30(adjusted)) %>% tidyr::pivot_longer(cols = c(adjusted, rolling_avg_30)) %>% plot_time_series(date, value, .color_var = name, .facet_ncol = 2, .smooth = FALSE, .interactive = FALSE)

For simple rolling calculations (rolling average), we can accomplish this operation faster with `slidify_vec()`

- A vectorized rolling function for simple summary rolls (e.g. `mean()`

, `sd()`

, `sum()`

, etc)

FANG %>% select(symbol, date, adjusted) %>% group_by(symbol) %>% # Apply roll apply Function mutate(rolling_avg_30 = slidify_vec(adjusted, ~ mean(.), .period = 30, .partial = TRUE))

Objective: Calculate a rolling regression.

- This is a complex sliding (rolling) calculation that requires multiple columns to be involved.
`slidify()`

is built for this.- Use the multi-variable
`purrr`

`..1`

,`..2`

,`..3`

, etc notation to setup a function

# Rolling regressions are easy to implement using `.unlist = FALSE` lm_roll <- slidify(~ lm(..1 ~ ..2 + ..3), .period = 90, .unlist = FALSE, .align = "right") FANG %>% select(symbol, date, adjusted, volume) %>% group_by(symbol) %>% mutate(numeric_date = as.numeric(date)) %>% # Apply rolling regression mutate(rolling_lm = lm_roll(adjusted, volume, numeric_date)) %>% filter(!is.na(rolling_lm))

*My Talk on High-Performance Time Series Forecasting*

Time series is changing. **Businesses now need 10,000+ time series forecasts every day.**

**High-Performance Forecasting Systems will save companies MILLIONS of dollars.** Imagine what will happen to your career if you can provide your organization a "High-Performance Time Series Forecasting System" (HPTSF System).

I teach how to build a HPTFS System in my **High-Performance Time Series Forecasting Course**. If interested in learning Scalable High-Performance Forecasting Strategies then take my course. You will learn:

- Time Series Machine Learning (cutting-edge) with
`Modeltime`

- 30+ Models (Prophet, ARIMA, XGBoost, Random Forest, & many more) - NEW - Deep Learning with
`GluonTS`

(Competition Winners) - Time Series Preprocessing, Noise Reduction, & Anomaly Detection
- Feature engineering using lagged variables & external regressors
- Hyperparameter Tuning
- Time series cross-validation
- Ensembling Multiple Machine Learning & Univariate Modeling Techniques (Competition Winner)
- Scalable Forecasting - Forecast 1000+ time series in parallel
- and more.

Unlock the High-Performance Time Series Forecasting Course

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.