source("_common.R")
library(dplyr)
library(tidyr)
library(parsnip)
library(recipes)
library(epiprocess)
library(epipredict)
library(epidatasets)
library(ggplot2)
theme_set(theme_bw())
Panel data, or longitudinal data, contain cross-sectional measurements of subjects over time. The epipredict package is most suitable for running forecasters on epidemiological panel data. An example of this is the covid_case_death_rates dataset, which contains daily state-wise measures of case_rate and death_rate for COVID-19 in 2021:
head(covid_case_death_rates, 3)
epipredict functions work with data in epi_df format. Despite the stated goal and name of the package, other panel datasets are also valid candidates for epipredict functionality, as long as they are in epi_df format.
year_start <- min(grad_employ_subset$time_value)
year_end <- max(grad_employ_subset$time_value)
In this vignette, we will demonstrate using epipredict with employment panel data from Statistics Canada. We will be using Table 37-10-0115-01: Characteristics and median employment income of longitudinal cohorts of postsecondary graduates two and five years after graduation, by educational qualification and field of study (primary groupings).
The full dataset contains yearly median employment income two and five years after graduation, and the number of graduates. The data is stratified by variables such as geographic region (Canadian province), education, and age group. The year range of the dataset is r year_start to r year_end, inclusive. The full dataset also contains metadata that describes the quality of the data collected. For demonstration purposes, we make several modifications to get a subset of the full dataset, including dropping the STATUS column.
To use this data with epipredict, we need to convert it into epi_df format using epiprocess::as_epi_df() with additional keys. In our case, the additional keys are age_group and edu_qual. Note that in the above modifications, we encoded time_value as type integer. This lets us set time_type = "year", and ensures that lag and ahead modifications later on use the correct time units. See the epiprocess::epi_df documentation for a list of all the available time_types.
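The exact cleaning code is omitted here, but a minimal sketch of the conversion step looks something like the following. The name grad_employ_raw is hypothetical (standing in for the cleaned tibble), and the other_keys argument follows recent versions of epiprocess::as_epi_df(); argument names may differ slightly across epiprocess versions.

# A sketch only: `grad_employ_raw` is a hypothetical cleaned tibble with
# columns geo_value, time_value (integer year), age_group, edu_qual, and the
# numeric measures. `other_keys` declares the panel keys beyond geo_value and
# time_value.
grad_employ_subset <- grad_employ_raw %>%
  as_epi_df(other_keys = c("age_group", "edu_qual"))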
employ_rowcount <- format(nrow(grad_employ_subset), big.mark = ",")
employ_colcount <- length(names(grad_employ_subset))
Now, we are ready to use grad_employ_subset with epipredict. Our epi_df contains r employ_rowcount rows and r employ_colcount columns. Here is a quick summary of the columns in our epi_df:
- time_value (time value): year in date format
- geo_value (geo value): province in Canada
- num_graduates (raw, time series value): number of graduates
- med_income_2y (raw, time series value): median employment income 2 years after graduation
- med_income_5y (raw, time series value): median employment income 5 years after graduation
- age_group (key): one of two age groups, either 15 to 34 years or 35 to 64 years
- edu_qual (key): one of 32 unique educational qualifications, e.g., "Master's diploma"

# Rename for simplicity
employ <- grad_employ_subset

sample_n(employ, 6)
In the following sections, we will go over pre-processing the data in the epi_recipe framework, fitting a model and making predictions within the epipredict framework, and using the package's canned forecasters.
As a simple example, let's work with the num_graduates column for now. We will first pre-process by standardizing each numeric column by the total within each group of keys. We do this because the raw numeric values vary greatly from province to province, since there are large differences in population.
employ_small <- employ %>%
  group_by(geo_value, age_group, edu_qual) %>%
  # Select groups where there are complete time series values
  filter(n() >= 6) %>%
  mutate(
    num_graduates_prop = num_graduates / sum(num_graduates),
    med_income_2y_prop = med_income_2y / sum(med_income_2y),
    med_income_5y_prop = med_income_5y / sum(med_income_5y)
  ) %>%
  ungroup()

head(employ_small)
Below is a visualization of a sample of the standardized data (employ_small) for British Columbia and Ontario. Note that some groups do not have any time series information, since we filtered out all time series with incomplete dates.
employ_small %>%
  filter(geo_value %in% c("British Columbia", "Ontario")) %>%
  filter(grepl("degree", edu_qual, fixed = TRUE)) %>%
  group_by(geo_value, time_value, edu_qual, age_group) %>%
  summarise(num_graduates_prop = sum(num_graduates_prop), .groups = "drop") %>%
  ggplot(aes(x = time_value, y = num_graduates_prop, color = geo_value)) +
  geom_line() +
  scale_colour_manual(values = c("Cornflowerblue", "Orange"), name = "") +
  facet_grid(rows = vars(edu_qual), cols = vars(age_group)) +
  xlab("Year") +
  ylab("Percentage of graduates") +
  theme(legend.position = "bottom")
We will predict the standardized number of graduates (a proportion) in the next year (time $t+1$) using an autoregressive model with three lags (i.e., an AR(3) model). Such a model is represented algebraically like this:
\[ y_{t+1,ijk} = \alpha_0 + \alpha_1 y_{tijk} + \alpha_2 y_{t-1,ijk} + \alpha_3 y_{t-2,ijk} + \epsilon_{tijk} \]
where $y_{tijk}$ is the proportion of graduates at time $t$ in location $i$, age group $j$, with educational qualification $k$.
In the pre-processing step, we need to create additional columns in employ for each of $y_{t+1,ijk}$, $y_{tijk}$, $y_{t-1,ijk}$, and $y_{t-2,ijk}$. We do this via an epi_recipe. Note that creating an epi_recipe alone doesn't add these outcome and predictor columns; the recipe just stores the instructions for adding them.
Our epi_recipe should add one ahead column representing $y_{t+1,ijk}$ and 3 lag columns representing $y_{tijk}$, $y_{t-1,ijk}$, and $y_{t-2,ijk}$ (strictly speaking, the 0th "lag" is the current value, so we really have the current value plus 2 lags, but that's not quite how the processing works). Also note that since we specified our time_type to be year, our lag and lead values are both in years.
r <- epi_recipe(employ_small) %>%
  step_epi_ahead(num_graduates_prop, ahead = 1) %>%
  step_epi_lag(num_graduates_prop, lag = 0:2) %>%
  step_epi_naomit()

r
Let's apply this recipe using prep and bake to generate and view the lag and ahead columns.
# Display a sample of the pre-processed data
bake_and_show_sample <- function(recipe, data, n = 5) {
  recipe %>%
    prep(data) %>%
    bake(new_data = data) %>%
    sample_n(n)
}

r %>% bake_and_show_sample(employ_small)
We can see that the prep and bake steps created new columns according to our epi_recipe:

- ahead_1_num_graduates_prop corresponds to $y_{t+1,ijk}$
- lag_0_num_graduates_prop, lag_1_num_graduates_prop, and lag_2_num_graduates_prop correspond to $y_{tijk}$, $y_{t-1,ijk}$, and $y_{t-2,ijk}$, respectively.

Since our goal for now is to fit a simple autoregressive model, we can use parsnip::linear_reg() with the default engine lm, which fits a linear regression using ordinary least squares.
We will use epi_workflow with the epi_recipe we defined in the pre-processing section along with the parsnip::linear_reg() model. Note that epi_workflow is a container and doesn't actually do the fitting. We have to pass the workflow into fit() to get our estimated model coefficients $\widehat{\alpha}_i,\ i=0,...,3$.
wf_linreg <- epi_workflow(r, linear_reg()) %>%
  fit(employ_small)

summary(extract_fit_engine(wf_linreg))
This output tells us the coefficients of the fitted model; for instance, the estimated intercept is $\widehat{\alpha}_0 =$ r round(coef(extract_fit_engine(wf_linreg))[1], 3) and the coefficient for $y_{tijk}$ is $\widehat{\alpha}_1 =$ r round(coef(extract_fit_engine(wf_linreg))[2], 3).
The summary also tells us that all estimated coefficients are significantly different from zero. Extracting the 95% confidence intervals for the coefficients leads us to the same conclusion: all the coefficient estimates are significantly different from 0.
confint(extract_fit_engine(wf_linreg))
Now that we have our workflow, we can generate predictions from a subset of our data. For this demo, we will predict the number of graduates using the last 2 years of our dataset.
latest <- get_test_data(recipe = r, x = employ_small)

preds <- stats::predict(wf_linreg, latest) %>%
  filter(!is.na(.pred))

# Display a sample of the prediction values, excluding NAs
preds %>% sample_n(5)
We can do this using the augment function too. Note that predict and augment both still return an epiprocess::epi_df with all of the keys that were present in the original dataset.
augment(wf_linreg, latest) %>% sample_n(5)
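As a quick sanity check (a sketch only; key_colnames() is an epiprocess helper that we assume is available in your installed version), we can confirm that the keys survived prediction:

# Predictions keep the epi_df class and the original keys
# (geo_value, age_group, edu_qual, time_value).
class(preds)
key_colnames(preds)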
First, we'll plot the residuals (that is, $y_{tijk} - \widehat{y}_{tijk}$) against the fitted values ($\widehat{y}_{tijk}$).
par(mfrow = c(2, 2), mar = c(5, 3, 1.2, 0))
plot(extract_fit_engine(wf_linreg))
The fitted values vs. residuals plot shows us that the residuals are mostly clustered around zero, but do not form an even band around the zero line, indicating that the variance of the residuals is not constant. The scale-location plot (fitted values vs. the square root of standardized residuals) makes this even more obvious: the spread of the square root of standardized residuals varies with the fitted values.
The Q-Q plot shows us that the residuals have heavier tails than a Normal distribution. So the normality of residuals assumption doesn't hold either.
Finally, the residuals vs. leverage plot shows us that we have a few influential points based on the Cook's distance (those outside the red dotted line).
Since we appear to be violating the linear model assumptions, we might consider transforming the data differently, trying a non-linear model, or something else entirely.
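For instance, a quick, untuned sketch of the last option is to swap the linear regression for a boosted tree in the same workflow, reusing the recipe r and the test data latest from above (parsnip::boost_tree() with its default xgboost engine is assumed to be installed):

# Same recipe, different model: a small boosted tree ensemble instead of OLS.
wf_boost <- epi_workflow(r, boost_tree(mode = "regression", trees = 20)) %>%
  fit(employ_small)

# Predictions from the boosted-tree workflow on the same test data.
predict(wf_boost, latest) %>%
  filter(!is.na(.pred)) %>%
  sample_n(5)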
Now suppose we want to model the 1-step-ahead 5-year employment income using current and two previous values, while also incorporating information from the other two time-series in our dataset: the 2-year employment income and the number of graduates in the previous 2 years. We would do this using an autoregressive model with exogenous inputs, defined as follows:
\[
\begin{aligned}
y_{t+1,ijk} &= \alpha_0 + \alpha_1 y_{tijk} + \alpha_2 y_{t-1,ijk} + \alpha_3 y_{t-2,ijk} \\
&\quad + \beta_1 x_{tijk} + \beta_2 x_{t-1,ijk} \\
&\quad + \gamma_1 z_{tijk} + \gamma_2 z_{t-1,ijk} + \epsilon_{tijk}
\end{aligned}
\]
where $y_{tijk}$ is the 5-year median income (proportion) at time $t$ (in location $i$, age group $j$, with educational qualification $k$), $x_{tijk}$ is the 2-year median income (proportion) at time $t$, and $z_{tijk}$ is the number of graduates (proportion) at time $t$.
Again, we construct an epi_recipe detailing the pre-processing steps.
rx <- epi_recipe(employ_small) %>%
  step_epi_ahead(med_income_5y_prop, ahead = 1) %>%
  # 5-year median income has the current value and two lags c(0, 1, 2)
  step_epi_lag(med_income_5y_prop, lag = 0:2) %>%
  # But the two exogenous variables have current values and 1 lag c(0, 1)
  step_epi_lag(med_income_2y_prop, lag = c(0, 1)) %>%
  step_epi_lag(num_graduates_prop, lag = c(0, 1)) %>%
  step_epi_naomit()

bake_and_show_sample(rx, employ_small)
Before fitting our model and making predictions, let's add some post-processing steps using a few frosting layers to do a few things:

- generate the predictions (layer_predict())
- drop rows with missing predictions (layer_naomit())
- threshold the predictions at 0, since proportions cannot be negative (layer_threshold())
- attach 90% prediction intervals based on residual quantiles (layer_residual_quantiles())
- undo the standardization we applied in the pre-processing step, using layer_population_scaling().

# Create dataframe of the sums we used for standardizing
# Only have to include med_income_5y since that is our outcome
totals <- employ_small %>%
  group_by(geo_value, age_group, edu_qual) %>%
  summarise(med_income_5y_tot = sum(med_income_5y), .groups = "drop")

# Define post-processing steps
f <- frosting() %>%
  layer_predict() %>%
  layer_naomit(.pred) %>%
  layer_threshold(.pred, lower = 0) %>%
  # 90% prediction interval
  layer_residual_quantiles(
    symmetrize = FALSE
  ) %>%
  layer_population_scaling(
    .pred, .pred_distn,
    df = totals,
    df_pop_col = "med_income_5y_tot"
  )

wfx_linreg <- epi_workflow(rx, parsnip::linear_reg()) %>%
  fit(employ_small) %>%
  add_frosting(f)

summary(extract_fit_engine(wfx_linreg))
Based on the summary output for this model, we can examine confidence intervals and perform hypothesis tests as usual.
Let's take a look at the predictions along with their 90% prediction intervals.
latest <- get_test_data(recipe = rx, x = employ_small)

predsx <- predict(wfx_linreg, latest)

# Display predictions along with prediction intervals
predsx %>%
  select(
    geo_value, time_value, edu_qual, age_group,
    .pred_scaled, .pred_distn_scaled
  ) %>%
  head() %>%
  pivot_quantiles_wider(.pred_distn_scaled)
We've seen what we can do with non-epidemiological panel data using the recipes framework, with epi_recipe for pre-processing, epi_workflow for model fitting, and frosting for post-processing.

epipredict also comes with canned forecasters that do all of those steps behind the scenes for some simple models. Even though we aren't working with epidemiological data, the canned forecasters still work as expected, out of the box. We will demonstrate this with the simple flatline_forecaster and the direct autoregressive (AR) forecaster arx_forecaster.

For both illustrations, we will continue to use the employ_small dataset with the transformed numeric columns, which are proportions within each group defined by the keys in our epi_df.
In this first example, we'll use flatline_forecaster to make a simple prediction of the 2-year median income for the next year, based on one previous time point. This model is represented algebraically as:

\[ y_{t+1,ijk} = y_{tijk} + \epsilon_{tijk} \]

where $y_{tijk}$ is the 2-year median income (proportion) at time $t$.
out_fl <- flatline_forecaster(employ_small, "med_income_2y_prop",
  args_list = flatline_args_list(ahead = 1)
)

out_fl
In this second example, we'll use arx_forecaster to make a prediction of the 5-year median income using two lags of itself and two lags of each of two exogenous variables: the 2-year median income and the number of graduates.

The canned forecaster gives us a simple way of making this forecast, since it defines the recipe, workflow, and post-processing steps behind the scenes. This is very similar to the model we introduced in the "Autoregressive Linear Model with Exogenous Inputs" section of this article, except that here all inputs have the same number of lags.
arx_args <- arx_args_list(lags = c(0L, 1L), ahead = 1L)

out_arx_lr <- arx_forecaster(employ_small, "med_income_5y_prop",
  c("med_income_5y_prop", "med_income_2y_prop", "num_graduates_prop"),
  args_list = arx_args
)

out_arx_lr
Other changes to the direct AR forecaster, like changing the engine, also work as expected. Below we use a boosted tree model instead of a linear regression.
out_arx_rf <- arx_forecaster(
  employ_small, "med_income_5y_prop",
  c("med_income_5y_prop", "med_income_2y_prop", "num_graduates_prop"),
  trainer = parsnip::boost_tree(mode = "regression", trees = 20),
  args_list = arx_args
)

out_arx_rf
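If you want to work with these forecasts directly, the object returned by the canned forecaster bundles the predictions with the fitted workflow; the element names below are assumed from the printed output above:

# Pull out the forecasts themselves (keyed like the input epi_df)...
out_arx_rf$predictions %>% head()

# ...and the underlying epi_workflow, which can be inspected or reused.
out_arx_rf$epi_workflow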
While the purpose of {epipredict} is to allow {tidymodels} to operate on epidemiology data, it can be easily adapted (both the workflows and the canned forecasters) to work for generic panel data modelling.