
Input Data Formats, Model Specifications, and Event Detection Parameters

Data & Data Formats

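The epidemiar modeling and code requires three main sets of data: epidemiology data (epi_data), daily environmental data (env_data), and environmental reference data (env_ref_data), plus several information/reference/specification inputs.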

Epidemiology Data, epi_data

For the epidemiology data, you will need weekly case counts of the disease/illness for each geographic unit (group), along with population values (used to calculate incidence).
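As a rough illustration of this layout, the sketch below builds a small weekly table in base R. The district names, counts, and the column names other than obs_date are hypothetical, the date column is assumed to be named obs_date as in the environmental data, and the per-1000 scaling is an arbitrary choice for the example.

```r
# Hypothetical weekly case data for two districts (one row per week and district).
epi_data <- data.frame(
  obs_date   = rep(seq(as.Date("2020-01-05"), by = "week", length.out = 4), times = 2),
  district   = rep(c("A", "B"), each = 4),
  cases      = c(12, 15, NA, 9, 3, 4, 6, 5),   # NA marks an explicit missing week
  population = rep(c(50000, 20000), each = 4)
)

# Incidence per 1000 people: the quantity the population column makes possible.
epi_data$incidence_per_1000 <- epi_data$cases / epi_data$population * 1000
```

The missing week stays in the table as a row with NA cases, so its incidence is NA as well.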

When calling the epidemiar function:

In the report_settings there are additional parameters for epidemiological settings:

Missing Data

There should be a row for each week and geographic grouping, even when the data is missing (i.e. missing data must be explicit). Missing values can optionally be filled in by linear interpolation inside the epidemiar modeling function by setting report_settings$epi_interpolate = TRUE (default is FALSE).
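A minimal base R sketch of both ideas follows: turning implicit gaps into explicit NA rows, and then filling them by linear interpolation within each group. The data and column names are hypothetical, and this is only an illustration of the concept, not the code that epidemiar runs internally.

```r
# Hypothetical data with gaps: district A has an NA week, district B lacks a row entirely.
epi_raw <- data.frame(
  obs_date = as.Date(c("2020-01-05", "2020-01-12", "2020-01-19",
                       "2020-01-05", "2020-01-19")),
  district = c("A", "A", "A", "B", "B"),
  cases    = c(10, NA, 14, 4, 8)
)

# Build every week-by-district combination, then left-join the observations so
# that absent weeks become rows with NA cases (explicit missing data).
full_grid <- expand.grid(
  obs_date = seq(min(epi_raw$obs_date), max(epi_raw$obs_date), by = "week"),
  district = unique(epi_raw$district),
  stringsAsFactors = FALSE
)
epi_explicit <- merge(full_grid, epi_raw, by = c("obs_date", "district"), all.x = TRUE)
epi_explicit <- epi_explicit[order(epi_explicit$district, epi_explicit$obs_date), ]

# Linear interpolation within each district, in the spirit of epi_interpolate = TRUE.
epi_explicit$cases_filled <- ave(
  epi_explicit$cases, epi_explicit$district,
  FUN = function(x) approx(seq_along(x), x, xout = seq_along(x))$y
)
```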

Environmental Data, env_data

For the environmental data, daily data is expected for each environmental variable for each geographic unit. Based on the lag length chosen (report_settings$env_lag_length, default 180 days), you must have at least that many days of environmental data before the first epidemiology data date.
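The lag requirement can be checked with simple date arithmetic before running anything. The sketch below uses hypothetical start dates and the default lag of 180 days stated above.

```r
env_lag_length <- 180                       # default lag length, in days
first_epi_date <- as.Date("2019-01-06")     # hypothetical first epidemiology date
first_env_date <- as.Date("2018-06-01")     # hypothetical first environmental date

# Environmental data must start on or before this date.
earliest_needed <- first_epi_date - env_lag_length

# TRUE when the environmental record reaches far enough back.
env_covers_lag <- first_env_date <= earliest_needed
```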

When calling the epidemiar function:

Non-daily or Missing Data

If you do not have daily data (e.g. weekly, or irregular data), or have implicit missing data, you can use the data_to_daily() function to add any missing rows. This function will also use linear interpolation to fill in values if possible (default 'interpolate = TRUE'). It is not recommended if you have a lot of missing/non-daily data. It will group on every field in the dataset that is not obs_date, or the user-given {valuefield}. Note: this will not fill out ragged data (different end dates of environmental variable data), but that will be handled inside of epidemiar.
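To show what expanding non-daily data to daily rows with linear interpolation amounts to, here is a base R sketch for a single variable in a single geographic unit. It is a conceptual stand-in only: it does not call data_to_daily(), it does no grouping, and the dates and values are hypothetical.

```r
# Hypothetical weekly observations of one environmental variable.
env_weekly <- data.frame(
  obs_date = as.Date(c("2020-01-01", "2020-01-08", "2020-01-15")),
  val      = c(0, 14, 7)
)

# Every calendar day between the first and last observation.
all_days <- seq(min(env_weekly$obs_date), max(env_weekly$obs_date), by = "day")

# Linearly interpolate the value for each day from the surrounding observations.
env_daily <- data.frame(
  obs_date = all_days,
  val      = approx(x = as.numeric(env_weekly$obs_date),
                    y = env_weekly$val,
                    xout = as.numeric(all_days))$y
)
```

With long gaps, every interpolated day is a straight-line guess between two distant observations, which is one reason this approach suits only small amounts of missing data.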

Environmental Reference / Weekly Climate Data, env_ref_data

The environmental reference / climate data should contain a reference value (column "ref_value") of the environmental variable per geographic group for each week of the year. For example, this could be the historical mean for that week of the year.

If you have env_data, but do not yet have a reference/climatology built from it, you can use the env_daily_to_ref() function to create one in the format accepted by run_epidemiar() for env_ref_data. Because of processing time (especially for long histories), it is recommended that you run this infrequently to generate a reference dataset that is then saved to be read in later, rather than regenerated each time. The week_type of this function defaults to "ISO" for ISO8601/WHO standard week of year. This function also requires the env_info data, see below.
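One way to picture such a reference is as a two-step aggregation: summarize the daily values within each week, then average those weekly values across years. The base R sketch below does this for hypothetical rainfall data using ISO week numbers. It is an assumption-laden illustration, not the algorithm used by env_daily_to_ref(), and every column name except ref_value is hypothetical.

```r
# Hypothetical daily rainfall for one district over two full ISO years.
set.seed(1)
obs_date <- seq(as.Date("2018-01-01"), as.Date("2019-12-29"), by = "day")
env_daily <- data.frame(
  district = "A",
  obs_date = obs_date,
  obsfield = "rain",
  val      = runif(length(obs_date), min = 0, max = 10)
)

# ISO 8601 week-based year and week number for each day.
env_daily$iso_year     <- format(env_daily$obs_date, "%G")
env_daily$week_of_year <- as.integer(format(env_daily$obs_date, "%V"))

# Step 1: one value per district, variable, year, and week (sum for rainfall).
weekly <- aggregate(val ~ district + obsfield + iso_year + week_of_year,
                    data = env_daily, FUN = sum)

# Step 2: average the weekly values across years to get the reference value.
env_ref <- aggregate(val ~ district + obsfield + week_of_year,
                     data = weekly, FUN = mean)
names(env_ref)[names(env_ref) == "val"] <- "ref_value"
```

Saving the result once, for example with saveRDS(), and reading it back in later avoids repeating the aggregation on every run.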

Reference Data

  1. Environmental variables, env_info: This file lists the environmental variables and the aggregation method used to create weekly environmental data from daily data, e.g. rainfall could be the 'sum' of the daily values while LST would be the 'mean' value. It has the following fields:

     - {obsfield}: The field that names the environmental variables; it should match the environmental and environmental reference data.
     - reference_method: 'sum' or 'mean', the aggregation method used to create weekly environmental data from daily data.
     - report_label: Label to be used in creating the formatted report graphs. This column is not used until the formatting Rnw script, so depending on your setup and how you format reports after the report data is generated, you may not need this column.

  2. Shapefiles: In order to create summaries from Google Earth Engine (GEE), you will need to upload assets of the shapefile of your study area. If you are not using GEE and have some other way of obtaining environmental data, you may not need this. If you are creating a formatted report later and wish to have maps of the results, you may need shapefiles for this.
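For the env_info table described in item 1, a small sketch may help show how the aggregation method drives the daily-to-weekly step. The variable names, labels, and values below are hypothetical, and the column standing in for {obsfield} is simply called obsfield here.

```r
# Hypothetical env_info table: one row per environmental variable.
env_info <- data.frame(
  obsfield         = c("rain", "lst"),
  reference_method = c("sum", "mean"),
  report_label     = c("Rainfall (mm)", "Land surface temperature (C)"),
  stringsAsFactors = FALSE
)

# One week of hypothetical daily values for each variable.
week_daily <- data.frame(
  obsfield = rep(c("rain", "lst"), each = 7),
  val      = c(0, 2, 5, 0, 0, 1, 6,
               30, 31, 29, 28, 30, 32, 30)
)

# Apply each variable's own aggregation method to its daily values.
weekly <- sapply(env_info$obsfield, function(v) {
  method <- match.fun(env_info$reference_method[env_info$obsfield == v])
  method(week_daily$val[week_daily$obsfield == v])
})
```

Here rainfall is totalled over the week while land surface temperature is averaged, matching the 'sum' versus 'mean' distinction above.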

Setting up the Report and Model

Report level and epidemiological settings

Many of the settings are bundled into the named list argument report_settings. All of them have defaults, but the defaults are unlikely to be correct for your dataset and modeling.
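As a sketch of how a named list of settings interacts with defaults, the base R example below overrides one value and leaves the other at its default. The two settings and their defaults are the ones stated in this document; the merging with modifyList() is only an illustration of the pattern, not a description of what the package does internally.

```r
# Defaults for two settings mentioned in this document.
defaults <- list(
  epi_interpolate = FALSE,
  env_lag_length  = 180
)

# The user supplies only the settings that should differ from the defaults.
report_settings <- list(
  epi_interpolate = TRUE
)

# Illustrative merge: user-supplied values replace defaults, the rest are kept.
effective_settings <- modifyList(defaults, report_settings)
```

Any setting left out of report_settings silently keeps its default, which is why it is worth reviewing each one against your own data.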

Setting up for Forecasting

fc_model_family: The modeling uses mgcv::bam(), so the model form can be any that it accepts: any quadratically penalized GLM, with the extended families in family.mgcv also available. This is set by the user through the fc_model_family parameter. For example, you can run a regression with a Poisson distribution (fc_model_family = "poisson()") or a Gaussian distribution (fc_model_family = "gaussian()"); with the latter, note that you may also want to set epi_transform = "log_plus_one". This parameter is required and has no default.
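To make the two family choices concrete, the sketch below fits both kinds of model directly with mgcv::bam() on simulated data. The data, the single smooth term, and the variable names are hypothetical, and the code assumes that "log_plus_one" means modeling log(cases + 1). It illustrates the families only, not the model formula that epidemiar builds.

```r
library(mgcv)

# Hypothetical toy data: weekly counts driven by one environmental covariate.
set.seed(42)
toy <- data.frame(rain = runif(200, min = 0, max = 10))
toy$cases <- rpois(200, lambda = exp(0.5 + 0.2 * toy$rain))

# Poisson family: counts are modeled directly.
fit_pois  <- bam(cases ~ s(rain), family = poisson(), data = toy)
pred_pois <- predict(fit_pois, newdata = toy, type = "response")

# Gaussian family on log(cases + 1) (assumed meaning of "log_plus_one").
toy$cases_log <- log1p(toy$cases)
fit_gaus <- bam(cases_log ~ s(rain), family = gaussian(), data = toy)

# Back-transform Gaussian predictions to the case scale with expm1().
pred_gaus <- expm1(predict(fit_gaus, newdata = toy))
```

The transformation matters for the Gaussian family because raw counts are skewed and bounded at zero, which a Gaussian model on the untransformed scale does not respect.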

Besides fc_model_family, the rest of the forecasting controls (along with other settings) are bundled into the named list report_settings:

Environmental data-related forecasting settings:

Setting up for Event Detection

The event detection settings are also bundled into the named list report_settings:

Setting up Model Input (Optional)

Pre-generating a model can save substantial processing time, so report data is generated faster. The trade-off is that an older model may be less accurate for the time range of the requested predictions; how much this matters varies with the situation and datasets and should be examined.
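The general fit-once, reuse-later pattern can be sketched in plain R with saveRDS() and readRDS(). This is a generic illustration using a stand-in glm() on simulated data and a temporary file; it does not show the epidemiar arguments for supplying a pre-generated model.

```r
# Hypothetical training data and a stand-in Poisson model.
set.seed(7)
train <- data.frame(x = runif(100))
train$y <- rpois(100, lambda = exp(1 + train$x))
fit <- glm(y ~ x, family = poisson(), data = train)

# Fit once and save the model object to disk.
model_path <- tempfile(fileext = ".rds")
saveRDS(fit, model_path)

# Later: load the cached model and predict without refitting.
cached_fit <- readRDS(model_path)
new_weeks  <- data.frame(x = c(0.2, 0.8))
preds <- predict(cached_fit, newdata = new_weeks, type = "response")
```

The cached model reflects only the data available when it was fitted, so it should be regenerated once newer data would change the predictions meaningfully.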



EcoGRAPH/epidemiar documentation built on Nov. 13, 2020, 5:31 p.m.