knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(AKaerial)
Data quality and integrity are properties of a data set that make it accurate, usable, and useful for its intended purpose. Data are considered high quality and high integrity if they are treated consistently and accurately throughout the data life cycle, stored in a format that facilitates use, and collected logically and appropriately for their intended purpose. In contrast, low-quality or low-integrity data generally reflect little to no effort toward consistency or accuracy, exist in an unusable (untidy or not machine-readable) state, and were collected under variable interpretations of study design or protocol. Such data are often inappropriate for any decision making because their information content is questionable and dynamic.
Long-term (> 10 years) data sets often exist in a low-quality, low-integrity state for a variety of reasons:
AKaerial attempts to standardize aerial survey data files and to begin a consistent treatment of data fields moving forward, but doing so inevitably uncovers errors, inconsistencies, and unusable information in past data. The current workflow is to (1) archive the raw historic data file, (2) generate a report that details where and how the data do not conform to defined standards, (3) fix any errors that require no interpretation and cause no loss of information, documenting every change, and (4) create an archivable data file that is explicitly usable in AKaerial for generating index estimates and visualizing results.
In order to be processed correctly using GreenLight, an input data file must be in the correct file format and contain the minimum set of columns (named correctly).
If these 3 conditions are minimally satisfied (additional columns are allowed, but won't be tested), the file can be checked against the predefined standards using GreenLight.
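Before calling GreenLight, the file format and column conditions can be approximated with a quick check in base R. The sketch below is only an illustration: precheck_file is a hypothetical helper, not part of AKaerial, it assumes the expected format is .csv (as in the examples in this document), and the required column names are supplied by the caller. Num and Obs_Type are columns mentioned in this document; Species and Observer are placeholder names used for the demonstration.

```r
# Hypothetical pre-check; not an AKaerial function.
precheck_file <- function(path, required) {
  # Assumed format condition: the file name ends in .csv
  is_csv <- grepl("\\.csv$", path, ignore.case = TRUE)
  if (!is_csv) {
    return(list(is_csv = FALSE, missing = required))
  }
  # Read the header plus one row; only the column names are used
  cols <- names(read.csv(path, nrows = 1, check.names = FALSE))
  # Column names must match exactly (case-sensitive)
  list(is_csv = TRUE, missing = setdiff(required, cols))
}

# Demonstration on a small temporary file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(Species = "SPEI", Num = 1, Obs_Type = "single"),
          tmp, row.names = FALSE)

# "Observer" is a placeholder column that the demo file lacks
res <- precheck_file(tmp, required = c("Species", "Num", "Obs_Type", "Observer"))
res$missing
```

An empty res$missing only means the named columns are present; it says nothing about whether their contents meet the standards that GreenLight tests.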
The GreenLight function runs with only 5 arguments.
path.name - The directory location of the data file to be checked.
area - The abbreviated name for the spatial location of the project. This is important because GreenLight needs to pull the appropriate species list from the object sppntable. There are minor regional differences in the species list by project, including omission of non-focal species. Current acceptable values are:
* ACP - Arctic Coastal Plain
* BLSC - Black Scoter
* CRD - Copper River Delta
* VIS - Aircraft Visibility
* WBPHS - Waterfowl Breeding Population Habitat Survey ("North American")
* YKD - Yukon Kuskokwim Delta MBM duck stratification
* YKG - Yukon Kuskokwim Delta MBM goose stratification
report - Should a report be generated? This will generally be TRUE until the file returns "green," at which point no further checks need to be run.
raw2analysis - Should GreenLight attempt to write the "archive" copy of the data? This will fail if the file returns a "red" status. A file receives a "red" status if there are errors in the data that require interpretation that potentially change the information content that was intended by the observer. For example:
* Misspelling a species code as SDEI will trigger a "red" status since it can be interpreted as STEI or SPEI. This would require a re-transcription of the .wav file.
* Entering a non-numeric value in the Num column (or any other numeric columns). The only way to trace the correct number is through re-transcription.
* Entering any value other than single, pair, flkdrake, or open in the Obs_Type column. These are the only 4 entries with known treatments in the analysis step.
archive.dir - Where should the archived data file be written to (if raw2analysis == TRUE)? Defaults to 3 levels above the current file location due to MBM archive structure.
A typical use of GreenLight would be to first check the file for errors.
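# Path to the observer's transcribed data file
file <- "C:/Raw_Survey_Data/Observer_Transcribed_Data/YKG_2020_RawObs_CFrost.csv"
# First pass: generate the QAQC report only; do not write the archive copy
GreenLight(path.name = file, area = "YKG", report = TRUE, raw2analysis = FALSE)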
In this case, a report is generated one directory level above the data file. It details the status of the file (green, yellow, or red) and the location(s) of any errors. The observer or data collector can fix the errors by re-transcribing the data, then re-run the command until a "green" status is attained.
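To make the idea of a "red" error concrete, the three kinds of problems that require re-transcription (an unrecognized species code, a non-numeric Num, an unknown Obs_Type) can be located with a few lines of base R. This is a rough sketch, not GreenLight's own logic: find_red_rows is a hypothetical helper, the Species column name is an assumption, and the short species vector stands in for the project list that GreenLight pulls from sppntable.

```r
# Hypothetical helper; GreenLight's actual checks may differ.
find_red_rows <- function(obs, valid_species) {
  # The 4 Obs_Type entries with known treatments in the analysis step
  valid_types <- c("single", "pair", "flkdrake", "open")

  bad_species <- !(obs$Species %in% valid_species)
  # as.numeric() returns NA for text that cannot be read as a number
  bad_num <- is.na(suppressWarnings(as.numeric(as.character(obs$Num))))
  bad_type <- !(obs$Obs_Type %in% valid_types)

  flags <- data.frame(row = seq_len(nrow(obs)), bad_species, bad_num, bad_type)
  # Keep only the rows with at least one problem
  flags[bad_species | bad_num | bad_type, ]
}

# Toy observations: row 2 has an ambiguous species code,
# row 3 has a non-numeric count and an unknown observation type
obs <- data.frame(
  Species  = c("SPEI", "SDEI", "STEI"),
  Num      = c("2", "1", "two"),
  Obs_Type = c("pair", "single", "flock"),
  stringsAsFactors = FALSE
)

# Stand-in species list for illustration only
flagged <- find_red_rows(obs, valid_species = c("SPEI", "STEI"))
flagged
```

The sketch matches values exactly, so it would also flag case differences that do not require interpretation; it is meant only to show where re-transcription is likely to be needed.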
The naming convention for the input file is particularly important here. The function will take the name of any newly generated files from the name of the input file, so it should be in the format PROJECT_YEAR_TYPE_OBSERVER.csv. In this case, it will append _QAQC_ and the system date on the end of the name of the new report.
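Because output names are derived from the input name, it can help to confirm that a file name splits cleanly into the four expected parts before running the check. The following is a minimal sketch under the assumption that the parts are separated by single underscores and contain none themselves; parse_obs_filename is a hypothetical helper, not an AKaerial function.

```r
# Hypothetical helper; not part of AKaerial.
parse_obs_filename <- function(path) {
  # Drop the directory and the .csv extension
  stem <- sub("\\.csv$", "", basename(path), ignore.case = TRUE)
  parts <- strsplit(stem, "_", fixed = TRUE)[[1]]
  if (length(parts) != 4) {
    stop("Expected PROJECT_YEAR_TYPE_OBSERVER.csv, got: ", basename(path))
  }
  setNames(as.list(parts), c("project", "year", "type", "observer"))
}

p <- parse_obs_filename(
  "C:/Raw_Survey_Data/Observer_Transcribed_Data/YKG_2020_RawObs_CFrost.csv"
)
p$project
p$type
```

A name with extra underscores (for example, in the observer part) stops with an error here, which is a prompt to rename the file rather than guess how the output names would be built.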
Once the file is "green," report is changed to FALSE and raw2analysis is changed to TRUE. Optionally, archive.dir can be set.
# Second pass: skip the QAQC report and write the archive copy to a chosen directory
GreenLight(path.name = file, area = "YKG", report = FALSE, raw2analysis = TRUE, archive.dir = "C:/MyArchive")
This will generate the archive data file and the associated QCLog report that details any changes made between raw and archive files. The naming convention for the input file is particularly important here as well. The new data file will replace the TYPE with QCObs. The new report will replace TYPE with QCLog.
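The TYPE substitution can be previewed with a small helper. This is a sketch of the naming rule as described here, not GreenLight's internal code: derive_name is hypothetical, it assumes a four-part input name separated by underscores, and the .csv and .html extensions are taken from the descriptions of the archive file and QCLog report in this document. The names GreenLight actually writes may differ in detail.

```r
# Hypothetical helper illustrating the TYPE substitution; not an AKaerial function.
derive_name <- function(path, new_type, ext) {
  # Drop the directory and the .csv extension
  stem <- sub("\\.csv$", "", basename(path), ignore.case = TRUE)
  parts <- strsplit(stem, "_", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 4)  # PROJECT, YEAR, TYPE, OBSERVER
  parts[3] <- new_type           # swap the TYPE element
  paste0(paste(parts, collapse = "_"), ".", ext)
}

raw_file <- "C:/Raw_Survey_Data/Observer_Transcribed_Data/YKG_2020_RawObs_CFrost.csv"
archive_name <- derive_name(raw_file, "QCObs", "csv")
log_name <- derive_name(raw_file, "QCLog", "html")
archive_name
log_name
```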
A QAQC .html report will be generated once a file has successfully passed through the GreenLight function. It will have content in the following headings:
A QCLog .html report is generated when an archive .csv is requested. It will contain the same headings as a QAQC report, but only those under which CommonFix made changes while generating the file. The most common changes are lowercase-to-uppercase corrections and updates to outdated species codes.