forecast_regression: Run forecast regression


View source: R/forecasting_main.R

Description

Run forecast regression

Usage

forecast_regression(
  epi_lag,
  quo_groupfield,
  fc_model_family,
  report_settings,
  groupings,
  env_variables_used,
  report_dates,
  req_date,
  valid_run,
  naive
)

Arguments

epi_lag

Epidemiological dataset with basis spline summaries of the lagged environmental data (or anomalies), as output by lag_environ_to_epi().

quo_groupfield

Quosure of the user-given geographic grouping field passed to run_epidemia().

fc_model_family

The family parameter passed to mgcv::bam; the extended families in family.mgcv can also be used. This sets the type of generalized additive model (GAM) to run: it specifies the distribution and link to use in model fitting. E.g. for a Poisson regression, the user would input "poisson()". If a cached model is being used, set the parameter to "cached".
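
As a minimal base-R sketch of what the family string describes: evaluating "poisson()" yields a family object that carries the distribution and the link. The eval(parse()) step is an assumption, used here only to turn the documented string form into an object for inspection; it is not confirmed to be how the package handles the string internally.

```r
# Family given in the documented string form.
fc_model_family <- "poisson()"

# Assumption: convert the string into a family object for inspection only.
fam <- eval(parse(text = fc_model_family))

fam$family  # name of the distribution
fam$link    # name of the link function

# The link and its inverse are stored as functions on the family object.
eta <- fam$linkfun(10)   # value on the link scale
mu  <- fam$linkinv(eta)  # back on the response scale
```

Swapping in another family string, such as "gaussian()", changes the distribution and link in the same way.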

report_settings

This is a named list of all the report, forecasting, event detection and other settings. All of these have defaults, but they are not likely the defaults needed for your system, so each of these should be reviewed:

  • report_period = 26: The number of weeks that the entire report will cover. The report_period minus fc_future_period is the number of weeks of past (known) data that will be included. Default is 26 weeks.

  • report_value_type = "cases": How to report the results, either in terms of "cases" (default) or "incidence".

  • report_inc_per = 1000: If reporting incidence, what should the denominator be? Default is per 1000 persons.

  • epi_date_type = "weekISO": String indicating the standard (WHO ISO-8601 or CDC epi weeks) that the weeks of the year in epidemiological and environmental reference data use ("weekISO" or "weekCDC"). Required: the epidemiological observation dates listed must be the LAST day of the week.

  • epi_interpolate = FALSE: TRUE/FALSE flag for whether the given epidemiological data should be linearly interpolated for any explicitly missing values before modeling.

  • epi_transform = "none" (default if not set): Should the case counts be transformed just before regression modeling and back-transformed directly after prediction/forecast creation? The only transformation currently supported is "log_plus_one", where log(cases + 1) is modeled and back-transformed by exp(pred) - 1 (though pmax(exp(pred) - 1, 0) is used in case of small predicted values).

  • model_run = FALSE: TRUE/FALSE flag for whether to only generate the model regression object plus metadata. This model can be cached and used later on its own, skipping a large portion of the slow calculations for future runs.

  • model_cached = NULL: The output of a previous model_run = TRUE run of run_epidemia() that produces a model (regression object) and metadata. The metadata will be used for input checking and validation. Using a prebuilt model saves on processing time, but will need to be updated periodically. If using a cached model, also set 'fc_model_family = "cached"'.

  • env_var: List of environmental variables to actually use in the modeling. (You can therefore have extra variables or data in the environmental dataset.) Input should be a one-column tibble, with header row 'obsfield' and one row per variable (entries must match what is in env_data, env_ref_data, and env_info). Default is to use all environmental variables that are present in all three of env_data, env_ref_data, and env_info.

  • env_lag_length = 181: The number of days of past environmental data to include for the lagged effects. The distributed lags are summarized using a thin plate basis function. Default is 181 days.

  • env_anomalies = FALSE: TRUE/FALSE indicating if the environmental variables should be replaced with their anomalies. The variables are transformed by taking the residuals from a GAM with geographic unit and cyclical cubic regression spline on day of year per geographic group.

  • fc_start_date: The date to start the forecasting, also the start of the early warning period. Epidemiological data does not have to exist just before the start date, though higher accuracy will be obtained with more recent data. The default is the week following the last known observation in epi_data.

  • fc_future_period = 8: Number of future weeks from the end of the epi_data to produce forecasts, or if fc_start_date is set, the number of weeks from and including the start date to create forecasts. Synonymous with early warning period. Default is 8 weeks.

  • fc_clusters: Dataframe/tibble of geographic units and a cluster id. This clusters, or groups, certain geographic locations together, to better model spatial non-stationarity in the relationship between environmental variables and cases. See the overview and data & modeling vignettes for more discussion. Default is a global model, with all geographic units in one cluster.

  • fc_cyclicals = FALSE: TRUE/FALSE flag on whether to include a smooth term based on day of year in the modeling (as one way of accounting for seasonality).

  • fc_cyclicals_by: Unit to run the 'fc_cyclicals' terms by. Either by 'cluster' (default; clusters given by 'fc_clusters') or by 'group' (per geogroup in 'groupfield').

  • fc_splines: The type of splines that will be used to handle long-term trends and lagged environmental variables. If the supplemental package 'clusterapply' is not installed, the default (and only choice) is modified b-splines ('modbs'). If the package is installed, then 'tp' becomes both an option and the default, and thin plate splines are used instead.

  • fc_ncores: The number of physical CPU cores available. Will be used to determine the multi-threading (or not) for use in modeling and predicting.

  • ed_summary_period = 4: The number of weeks that will be considered the "early detection period". It will count back from the week of last known epidemiological data. Default is 4 weeks.

  • ed_method = 'none': Which method for early detection should be used ("farrington" is currently the only option, besides "none").

  • ed_control: Controls passed along to the event detection method. E.g. for ed_method = 'farrington', these are passed to surveillance::farringtonFlexible(). Currently, these parameters are supported for Farrington: 'b', 'w', 'reweight', 'weightsThreshold', 'trend', 'pThresholdTrend', 'populationOffset', 'noPeriods', 'pastWeeksNotIncluded', 'thresholdMethod'. Any control not included will use surveillance package defaults, with the exception of 'b', the number of past years to include: the epidemiar default is to use as many years as are available in the data.
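
The settings above can be sketched as a named list. This is a minimal illustration using only the documented default values, assuming that any setting left out falls back to its default. The environmental variable names are hypothetical, and a plain data frame stands in for the one-column tibble to keep the example in base R. The last lines illustrate the documented "log_plus_one" round trip on made-up numbers.

```r
# Partial settings list built from the documented defaults.
report_settings <- list(
  report_period     = 26,
  report_value_type = "cases",
  report_inc_per    = 1000,
  epi_date_type     = "weekISO",
  epi_interpolate   = FALSE,
  epi_transform     = "log_plus_one",
  env_lag_length    = 181,
  env_anomalies     = FALSE,
  fc_future_period  = 8,
  fc_cyclicals      = FALSE,
  ed_summary_period = 4,
  ed_method         = "none",
  # Hypothetical variable names; header must be 'obsfield'.
  env_var = data.frame(obsfield = c("rainfall", "temperature"))
)

# Weeks of past (known) data covered by the report:
# report_period minus fc_future_period.
known_weeks <- report_settings$report_period -
  report_settings$fc_future_period

# "log_plus_one": log(cases + 1) is modeled, then predictions are
# back-transformed and clamped at zero for small predicted values.
cases <- c(0, 3, 25)               # made-up case counts
pred  <- c(log(cases + 1), -0.2)   # -0.2 is a made-up small prediction
back  <- pmax(exp(pred) - 1, 0)
```

The clamp matters only for the last value: exp(-0.2) - 1 is negative, so pmax() returns 0 rather than a negative case count.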

groupings

A unique list of the geographic groupings (from groupfield).

env_variables_used

List of environmental variables that were used in the modeling.

report_dates

Internally generated set of report date information: min, max, list of dates for full report, known epidemiological data period, forecast period, and early detection period.

req_date

The end date of requested forecast regression. When fit_freq == "once", this is the last date of the full report, the end date of the forecast period.

valid_run

Internal TRUE/FALSE for whether this is part of a validation run.

naive

Internal TRUE/FALSE flag for whether this is a naive-model run.

Value

Named list containing:

  • date_preds: Full forecasted resulting dataset.

  • reg_obj: The regression object from modeling.

If model_run is TRUE, only the regression object is returned.


EcoGRAPH/epidemiar documentation built on Nov. 13, 2020, 5:31 p.m.