ed_process: Standard post-processing of EIDITH exports

View source: R/process.R

ed_process    R Documentation

Standard post-processing of EIDITH exports

Description

This function takes raw data downloaded from EIDITH and runs it through a series of preprocessing and cleaning steps. In general, there is no need to call this function directly: it is called by both ed_db_download() and the direct download functions.

Usage

ed_process(dat, endpt)

Arguments

dat

The data as exported from EIDITH and imported via the ed_get() functions (without preprocessing).

endpt

The name of the API URL endpoint: one of "Event", "Animal", "Specimen", "Test", or "TestIDSpecimenID" (for test-specimen cross-referencing). Note that these differ from the names of the tables stored locally (which are lowercase and plural).
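
As a minimal sketch of how an endpt value could be checked against the names listed above (an illustration in base R, not the package's own code; valid_endpts and check_endpt are hypothetical names):

```r
# Hypothetical helper, not part of eidith: checks that `endpt` is one of
# the endpoint names documented above. Matching is exact and case-sensitive.
valid_endpts <- c("Event", "Animal", "Specimen", "Test", "TestIDSpecimenID")

check_endpt <- function(endpt) {
  ok <- is.character(endpt) && length(endpt) == 1 && endpt %in% valid_endpts
  if (!ok) {
    stop("`endpt` must be one of: ", paste(valid_endpts, collapse = ", "))
  }
  endpt
}

check_endpt("Animal")
```

Under this sketch a lowercase plural such as "animals" is rejected, which mirrors the distinction between endpoint names and local table names.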

Details

Steps taken to clean the data include:

  • Converting variable names from camelCase to snake_case to make it easy to distinguish between raw and cleaned data.

  • Converting some variable names to clearer ones: all _id variables are numeric primary keys, and other identifiers now go by _id_name.

  • Dropping all but one of any set of _id_name-type columns that are nearly identical except for a small set of cases, for ease of use. The dropped columns can be retrieved from the raw data if needed.

  • Dropping columns that are entirely blank.

  • Dropping redundant columns.

  • Cleaning up whitespace and capitalization variability.

  • Re-arranging table order to put the most pertinent information first.

  • Normalizing all animal taxonomic information to match the ITIS database.

  • Coercing some free-form entries (e.g. specimen_type) to a standard set of categories.

  • Converting yes/no fields to TRUE/FALSE.

  • Fixing spelling errors.

  • Extracting common TRUE/FALSE variables from the free-form viral interpretation text (GenBank numbers and whether the virus is known).
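
A few of the steps above can be sketched in base R. This is a rough illustration on invented toy data, not the code in R/process.R: the helper names (to_snake, is_blank_col, squish, yes_no_to_logical) and the column names are assumptions made for the example.

```r
# Toy data standing in for a raw export; names and values are invented.
raw <- data.frame(
  eventName    = c("  Site A ", "site  b"),
  isWildAnimal = c("yes", "No"),
  emptyNotes   = c(NA, ""),
  stringsAsFactors = FALSE
)

# camelCase -> snake_case: put "_" between a lowercase letter or digit
# and the uppercase letter that follows it, then lowercase everything.
to_snake <- function(x) {
  tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", x))
}

# TRUE when every value in a column is NA or an empty/whitespace-only string.
is_blank_col <- function(col) {
  all(is.na(col) | trimws(as.character(col)) == "")
}

# Trim leading/trailing whitespace and collapse internal runs to one space.
squish <- function(x) {
  gsub("\\s+", " ", trimws(x))
}

# Map "yes"/"no" (any case, surrounding whitespace ignored) to TRUE/FALSE;
# any other value becomes NA.
yes_no_to_logical <- function(x) {
  x <- tolower(trimws(x))
  ifelse(x == "yes", TRUE, ifelse(x == "no", FALSE, NA))
}

clean <- raw
names(clean) <- to_snake(names(clean))
clean <- clean[, !vapply(clean, is_blank_col, logical(1)), drop = FALSE]
clean$event_name <- squish(clean$event_name)
clean$is_wild_animal <- yes_no_to_logical(clean$is_wild_animal)

str(clean)
```

The regular expression in to_snake is deliberately simple and does not split runs of capitals (acronyms such as "ID" followed by another word); handling those would need additional rules.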


ecohealthalliance/eidith documentation built on Aug. 30, 2022, 7:45 a.m.