knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

While we most often want to directly build models on our original dataset to generate predicted values, we might instead want to generate average trends across larger groups instead, and then apply this to our original data. For instance, generating trends by region, and then applying those regional trends back to the country level. The predict_...avg_trend() functions in augury allow us to do just that, applying any of the models we are used to a grouped set of columns.

These work across specific groups, specified by average_cols, and averaging numeric values specified as the response variable or variables extracted from a formula. The specified model is then fit to this averaged data, and the predicted values are joined back up to the original data frame. Let's look at an example using blood pressure data, which has a comprehensive time series.

library(augury)

df <- ghost::gho_data("BP_04", query = "$filter=Dim1 eq 'MLE' and Dim2 eq 'YEARS18-PLUS'") %>%
  billionaiRe::wrangle_gho_data() %>%
  dplyr::right_join(covariates_df) %>%
  dplyr::select(iso3, year, year_n, value) %>%
  dplyr::filter(whoville::is_who_member(iso3),                # keep WHO member states
                year >= 2000, year <= 2023) %>%               # get relevant years  
  dplyr::mutate(who_region = whoville::iso3_to_regions(iso3)) # get WHO regions

ur <- unique(df$who_region)
ur

Alright, so, here we have r length(ur) WHO regions. We will use these regions to fit a model to and use INLA to predict out to 2023, then apply these trends to input countries.

pred_df <- df %>%
  predict_inla_avg_trend(formula = value ~ f(year_n, model = "rw2"),
                         average_cols = c("who_region", "year_n"),
                         group_models = TRUE,
                         group_col = "iso3",
                         sort_col = "year_n")

pred_df %>%
  dplyr::filter(iso3 == "AFG", year >= 2013)

Above, we can see we have a generated a model using 2nd order random walk with INLA, however, the model was generated by averaging data to WHO regions first, fitting the random walk to each reach (since group_models = TRUE) and then fitting those trends to the original data. Note some specifics of what had to be set, as the predict_..._avg_trend() functions are slightly more complex than others:

To highlight this point, in the above example, what's actually happening is we are actually fitting the model on the summarized data:

df %>%
  dplyr::group_by(who_region, year_n) %>%
  dplyr::summarize(value = mean(value, na.rm = T)) %>%
  head()

Since average_cols = c("who_region", "year_n"), we took the mean of all values in formula not in average_cols, in this case just value. If for instance, we tried to specify a model using iso3 in the formula:

predict_inla_avg_trend(df,
                       formula = value ~ iso3 + f(year_n, model = "rw2"),
                       average_cols = c("who_region", "year_n"),
                       group_models = TRUE,
                       group_col = "iso3",
                       sort_col = "year_n")

We get an error message indicating that iso3 must be numeric or included in average_cols for grouping. This is because without it being numeric or in the average_cols, there's no way dplyr::group_by() %>% dplyr::summarize() a non-numeric column automatically (how would we reduce country-level ISO3 codes to the regional level?).

While slightly complex, ensuring you follow the above means you should easily and successfully get out meaningful trend predictions for your data frames using trends generated on grouped data.



caldwellst/augury documentation built on Oct. 10, 2024, 8:20 a.m.