run_epidemia: Run EPIDEMIA forecast models and early detection algorithm.


View source: R/run_epidemia.R

Description

The Epidemic Prognosis Incorporating Disease and Environmental Monitoring for Integrated Assessment (EPIDEMIA) Forecasting System is a set of tools, coded in free, open-access software, that integrates surveillance and environmental data to model and create short-term forecasts for environmentally-mediated diseases. This function, epidemiar::run_epidemia(), is the central function for modeling and forecasting a wide range of environmentally-mediated diseases.

Usage

run_epidemia(
  epi_data = NULL,
  env_data = NULL,
  env_ref_data = NULL,
  env_info = NULL,
  casefield = NULL,
  groupfield = NULL,
  populationfield = NULL,
  obsfield = NULL,
  valuefield = NULL,
  fc_model_family = NULL,
  report_settings = NULL
)

Arguments

epi_data

Epidemiological data with case numbers per week, with date field "obs_date".

env_data

Daily environmental data for the same groupfields and date range as the epidemiological data. It may contain extra data (other districts or date ranges). The data must be in long format (one row for each date and environmental variable combination), and must start at an absolute minimum of report_settings$env_lag_length days (default 181) before the start of epi_data for forecasting.

env_ref_data

Historical averages by week of year for environmental variables. Used to extend environmental data into the future for long forecast periods, to calculate anomalies in the early detection period, and to display on time series in reports.

env_info

Lookup table for environmental data - reference creation method (e.g. sum or mean), report labels, etc.

casefield

The column name of the field that contains disease case counts (unquoted field name).

groupfield

The column name of the field for district or geographic area unit division names of epidemiological AND environmental data (unquoted field name). If there are no groupings (all one area), the user should supply a field that contains the same value throughout.

populationfield

Column name of the optional population field to give population numbers over time (unquoted field name). Used to calculate incidence if report_settings$report_value_type = "incidence". Also optionally used in the Farrington method for populationOffset.

obsfield

Field name of the environmental data variables (unquoted field name).

valuefield

Field name of the value of the environmental data variable observations (unquoted field name).

fc_model_family

The family parameter passed to mgcv::bam; the extended families in family.mgcv can also be used. This sets the type of generalized additive model (GAM) to run: it specifies the distribution and link to use in model fitting. E.g. for a Poisson regression, the user would input "poisson()". If a cached model is being used, set the parameter to "cached".

report_settings

This is a named list of all the report, forecasting, event detection and other settings. All of these have defaults, but they are not likely the defaults needed for your system, so each of these should be reviewed:

  • report_period = 26: The number of weeks that the entire report will cover. The report_period minus fc_future_period is the number of weeks of past (known) data that will be included. Default is 26 weeks.

  • report_value_type = "cases": How to report the results, either in terms of "cases" (default) or "incidence".

  • report_inc_per = 1000: If reporting incidence, what should the denominator be? Default is per 1000 persons.

  • epi_date_type = "weekISO": String indicating the standard (WHO ISO-8601 or CDC epi weeks) that the weeks of the year in epidemiological and environmental reference data use ("weekISO" or "weekCDC"). Required: epidemiological observation dates listed are LAST day of week.

  • epi_interpolate = FALSE: TRUE/FALSE flag for whether the given epidemiological data should be linearly interpolated for any explicitly missing values before modeling.

  • epi_transform = "none" (default if not set): Should the case counts be transformed just before regression modeling and back-transformed directly after prediction/forecast creation? The only transformation currently supported is "log_plus_one", where log(cases + 1) is modeled and back-transformed by exp(pred) - 1 (though pmax(exp(pred) - 1, 0) is used in case of small predicted values).

  • model_run = FALSE: TRUE/FALSE flag for whether to only generate the model regression object plus metadata. This model can be cached and used later on its own, skipping a large portion of the slow calculations for future runs.

  • model_cached = NULL: The output of a previous model_run = TRUE run of run_epidemia() that produces a model (regression object) and metadata. The metadata will be used for input checking and validation. Using a prebuilt model saves on processing time, but will need to be updated periodically. If using a cached model, also set 'fc_model_family = "cached"'.

  • env_var: List of environmental variables to actually use in the modeling. (You can therefore have extra variables or data in the environmental dataset.) Input should be a one-column tibble, with header row 'obsfield' and one row per variable (entries must match what is in env_data, env_ref_data, and env_info). Default is to use all environmental variables that are present in all three of env_data, env_ref_data, and env_info.

  • env_lag_length = 181: The number of days of past environmental data to include for the lagged effects. The distributed lags are summarized using a thin plate basis function. Default is 181 days.

  • env_anomalies = FALSE: TRUE/FALSE indicating if the environmental variables should be replaced with their anomalies. The variables are transformed by taking the residuals from a GAM with geographic unit and a cyclical cubic regression spline on day of year per geographic group.

  • fc_start_date: The date to start the forecasting, also the start of the early warning period. Epidemiological data does not have to exist just before the start date, though higher accuracy will be obtained with more recent data. The default is the week following the last known observation in epi_data.

  • fc_future_period = 8: Number of future weeks from the end of the epi_data to produce forecasts, or if fc_start_date is set, the number of weeks from and including the start date to create forecasts. Synonymous with early warning period. Default is 8 weeks.

  • fc_clusters: Dataframe/tibble of geographic units and a cluster id. This clusters, or groups, certain geographic locations together, to better model spatial non-stationarity in the relationship between environmental variables and cases. See the overview and data & modeling vignettes for more discussion. Default is a global model, with all geographic units in one cluster.

  • fc_cyclicals = FALSE: TRUE/FALSE flag on whether to include a smooth term based on day of year in the modeling (as one way of accounting for seasonality).

  • fc_cyclicals_by: Unit to run the 'fc_cyclicals' terms by. Either by 'cluster' (default; clusters given by 'fc_clusters') or by 'group' (per geogroup in 'groupfield').

  • fc_splines: The type of splines that will be used to handle long-term trends and lagged environmental variables. If the supplemental package 'clusterapply' is not installed, the default (and only choice) is modified b-splines ('modbs'). If the package is installed, 'tp' (thin plate splines) becomes available and is the default.

  • fc_ncores: The number of physical CPU cores available. Will be used to determine the multi-threading (or not) for use in modeling and predicting.

  • ed_summary_period = 4: The number of weeks that will be considered the "early detection period". It will count back from the week of last known epidemiological data. Default is 4 weeks.

  • ed_method = 'none': Which method for early detection should be used ("farrington" is currently the only option, or "none").

  • ed_control: Controls passed along to the event detection method. E.g. for ed_method = 'farrington', these are passed to surveillance::farringtonFlexible(). Currently, these parameters are supported for Farrington: 'b', 'w', 'reweight', 'weightsThreshold', 'trend', 'pThresholdTrend', 'populationOffset', 'noPeriods', 'pastWeeksNotIncluded', 'thresholdMethod'. Any control not included will use surveillance package defaults, with the exception of 'b', the number of past years to include: the epidemiar default is to use as many years as are available in the data.
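As a rough illustration of how such a list might be assembled, the sketch below builds a report_settings list using only setting names documented above. The specific values are illustrative assumptions, not recommendations, and any setting left out falls back to its default.

```r
# Illustrative values only; review each setting for your own system.
report_settings <- list(
  report_period = 26,
  report_value_type = "incidence",
  report_inc_per = 1000,
  epi_date_type = "weekISO",
  epi_interpolate = TRUE,
  epi_transform = "log_plus_one",
  env_lag_length = 181,
  fc_future_period = 8,
  fc_cyclicals = TRUE,
  ed_summary_period = 4,
  ed_method = "farrington",
  # Controls not listed here use the surveillance package defaults.
  ed_control = list(w = 4, reweight = TRUE)
)

# Weeks of past (known) data shown in the report:
# report_period minus fc_future_period.
past_weeks <- report_settings$report_period - report_settings$fc_future_period
```

With these values, the report would cover 18 weeks of known data followed by 8 forecast weeks.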

Details

For a longer description of the package, run the following command to see the overview vignette: vignette("overview-epidemiar", package = "epidemiar")

For more details run the following command to see the vignette on input data and modeling parameters: vignette("data-modeling", package = "epidemiar")

Value

Returns a suite of summary and report data.

1. summary_data: Early detection and early warning alert levels for each woreda. Early detection alerts (ed_alert_count) are alerts that are triggered during the early detection period, which is defined as the 4 most recent weeks of known epidemiology data. Similarly, early warning alerts (ew_alert_count) are alerts in the future forecast estimates. “High” level indicates two or more weeks in this period had incidences greater than the alert threshold, “Medium” means that one week was in alert status, and “Low” means no weeks had alerts (ed_sum_level and ew_level, respectively).

2. epi_summary: Mean disease incidence per geographic group during the early detection period.

3. modeling_results_data: These are multiple timeseries values for observed, forecast, and alert thresholds of disease incidence, over the report period, for each geographic unit. These data can be used in creating the individual geographic unit control charts.

4. environ_timeseries: These are multiple timeseries for the environmental variables during the report period for each geographic unit.

5. environ_anomalies: These data are the recent (during the early detection period) differences (anomalies) of the environmental variable values from the climatology/reference mean.

6. params_meta: This lists dates, settings, and parameters that run_epidemia() was called with.

7. regression_object: This is the regression object from the generalized additive model (GAM, fitted in parallel with mgcv::bam) regression. This is only for statistical investigation of the model, and is usually not saved (very large object).
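The High/Medium/Low levels described for summary_data follow a simple counting rule over the weeks in alert status. The sketch below restates that rule in code; alert_level() is a hypothetical helper written for illustration and is not a function in epidemiar.

```r
# Hypothetical helper (not part of epidemiar): map the number of weeks
# in alert status to the level names described for summary_data.
alert_level <- function(alert_count) {
  ifelse(alert_count >= 2, "High",
         ifelse(alert_count == 1, "Medium", "Low"))
}

# Example: 0, 1, 2, and 4 alert weeks in the period
alert_level(c(0, 1, 2, 4))
```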

For more details see the vignette on the output data: vignette("output-report-data", package = "epidemiar")

However, if model_run = TRUE, the function returns a list of two objects. The first, model_obj is the regression object from whichever model is being run. There is also model_info which has details on the parameters used to create the model, similar to params_meta in a full run.
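The two-step cached-model workflow implied by model_run and model_cached might be set up along these lines. This is a sketch of the settings only: base_settings and its values are assumptions, and cached_model is a placeholder standing in for the list that a model_run = TRUE call would return.

```r
# Hypothetical starting settings; only the caching-related entries matter here.
base_settings <- list(fc_future_period = 8, ed_method = "farrington")

# Step 1: settings for a run that only builds the model.
build_settings <- base_settings
build_settings$model_run <- TRUE

# Placeholder for the output of step 1 (model_obj plus model_info).
# In practice this would be the value returned by run_epidemia().
cached_model <- list(model_obj = NULL, model_info = list())

# Step 2: settings for a later run that reuses the saved model.
forecast_settings <- base_settings
forecast_settings$model_run <- FALSE
forecast_settings$model_cached <- cached_model

# Value to pass as fc_model_family when a cached model is supplied.
forecast_family <- "cached"
```

Because a cached model needs to be updated periodically, step 1 would be repeated on whatever schedule suits the surveillance system.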

Examples

"See model_forecast_script in epidemiar-demo for full example:
https://github.com/EcoGRAPH/epidemiar-demo"
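For orientation before opening the demo repository, a call might look roughly like the sketch below. The wrapper run_forecast_sketch() and the column names (cases, district, population, obs_name, obs_value) are hypothetical and would need to match your own data; only the argument names come from the usage shown above.

```r
# Sketch only: the wrapper, data objects, and column names are hypothetical.
# Requires the epidemiar package when the function is actually called.
run_forecast_sketch <- function(epi, env, env_ref, env_info, settings) {
  epidemiar::run_epidemia(
    epi_data = epi,
    env_data = env,
    env_ref_data = env_ref,
    env_info = env_info,
    casefield = cases,             # unquoted field names, as documented
    groupfield = district,
    populationfield = population,
    obsfield = obs_name,
    valuefield = obs_value,
    fc_model_family = "poisson()",
    report_settings = settings
  )
}
```

Defining the wrapper does not run anything; it would be called with the four input datasets and a report_settings list once those are prepared.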

EcoGRAPH/epidemiar documentation built on Nov. 13, 2020, 5:31 p.m.