knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This document assumes you understand the ecocomDP model. To learn about the model and its scope, refer to the Model Overview Vignette and the Ecological Informatics Article.
This document delineates the practices that all creators of ecocomDP formatted datasets should adopt in order for the community to build a cohesive and interoperable collection. It contains detailed descriptions of practices, definitions of concepts, and solutions to many common issues.
Early sections address important considerations for anyone thinking about converting source datasets to the ecocomDP model; the focus then shifts to an examination of the model components in greater detail. These shared practices are written to help simplify the conversion process.
If you are new to the conversion process, we recommend reading the Getting Started and Concepts sections, reviewing the Create and Model Overview vignettes, and referring back to this document as questions arise. A thorough understanding of the ecocomDP model and some foundational concepts will greatly simplify the conversion process.
Each ecocomDP dataset (Level-1; L1) is created from a raw source dataset (Level-0; L0) by a unique conversion script. Inputs are typically from the APIs of data repositories and monitoring networks, and outputs are a set of archivable files. The derived ecocomDP dataset is delivered to users in a consistent format by read_data(), and the conversion script provides a fully reproducible and automated routine for updating the derived dataset whenever a new version of the source data is released.
knitr::include_graphics('./workflow.png')
Not all source datasets are good candidates for ecocomDP. A good candidate is a dataset that:
A thorough understanding of the L0 dataset is required before actually performing any transformations. To gain understanding of an L0 dataset we recommend:
Major issues identified at this point may indicate that the work required to convert the L0 dataset to the ecocomDP model is not worthwhile.
After gaining a sufficient understanding of the L0 dataset, you are ready to assess, and hopefully resolve, any issues that are obvious from the start. To help draw out these issues, you may want to create a high-level plan for combining the L0 tables (e.g. are they bound row-wise, or joined on shared keys?) and mapping their columns to the L1 ecocomDP model. Here are some solutions (ordered by priority) for resolving issues at this stage of the creation process:
Work with the L0 author/manager to fix the issues - Fixing the issue here both communicates best practices for future data curation and immediately improves data quality.
Use the message() function to alert ecocomDP script maintainers to sections of code that could be improved in future L0 dataset updates.
Modify L0 components - Modifying L0 components is only permitted in rare cases. This list highlights the L0 components and the specific scenarios in which you may modify them:
Omit L0 data - ecocomDP is a flexible model but can’t handle everything. Convert as much as possible and drop the remainder. If content is dropped, then describe what and why using comments in the conversion script. Some guidelines for when you should drop content:
If the above options don’t solve the issue, then don’t convert it. There are many more datasets out there in the world to convert!
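As a minimal sketch of the two combination strategies from the planning step above (all table and column names here are hypothetical, not from any real L0 dataset):

```r
# Two hypothetical L0 tables of the same survey repeated in different years:
# bind them row-wise because they share identical columns
counts_2019 <- data.frame(site = c("A", "B"), taxon = "Carex", count = c(3, 5))
counts_2020 <- data.frame(site = c("A", "B"), taxon = "Carex", count = c(4, 1))
counts <- rbind(counts_2019, counts_2020)

# A hypothetical site table with different columns: join it on the shared key
sites <- data.frame(site = c("A", "B"), latitude = c(45.1, 45.3))
flat <- merge(counts, sites, by = "site")
```

Row-wise binding suits tables that repeat the same structure across time or space, while key joins suit tables that describe different facets (sites, taxa) of the same observations.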
When an L0 dataset is really valuable, but issues with the dataset (e.g. changes in temporal or spatial resolution across observations; edi.251.2) prevent conversion, the best option may be to convert a subset of the observations to the ecocomDP format. Follow these steps for omitting rows from the L0 dataset:
Add message(paste0("This L1 dataset is derived from a version of ", source_id, " with omitted rows.")) below the create_eml() function call in the create_ecocomDP() function definition.

You may decide that only a subset of the data tables within an L0 dataset are well-suited to the ecocomDP format. In this case you have the option to omit entire data tables and convert only those that fit the model.
When determining which tables to convert and which to omit, first identify which table(s) contain the “core” observation information. This will be the backbone of the intermediate "flat" table. Once the flat table is instantiated around the core observation, determine the shared keys by which to join the other L0 tables. The following types of variables are examples of common keys shared between data tables:
If you encounter tables that can’t be joined to the core observation information, possibly because they focus on a different time/location/taxon entirely, omit these problematic tables. Apply the following changes to the conversion script to highlight the table omission:
Add a comment in the conversion script explaining the omission (e.g. # atmospheric_gas_concentrations table did not share a key with the bird_count table and was omitted). Add message(paste0("This L1 dataset is derived from a trimmed version of ", source_id, " with omitted tables.")) below the create_eml() function call in the create_ecocomDP() function definition.

Write a conversion script that creates an ecocomDP dataset from a standard set of minimal inputs (i.e. the arguments to the create_ecocomDP() function). The conversion script should have some tolerance (i.e. error handling) for being re-run at a later time on a changed source dataset. The script should either handle a revised source dataset or alert the ecocomDP script maintainers and provide them with enough information to quickly diagnose and fix any problems.
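One way to build in that tolerance is a lightweight structural check on the source tables. The helper and column names below are illustrative assumptions, not part of the ecocomDP API:

```r
# Hypothetical sketch: stop with an informative message when a revised
# source dataset no longer has the columns the script expects
check_source_columns <- function(raw, expected) {
  missing <- setdiff(expected, names(raw))
  if (length(missing) > 0) {
    stop(
      "Source dataset structure changed; missing columns: ",
      paste(missing, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(raw)
}

raw <- data.frame(site = "A", taxon = "Carex", count = 3)
check_source_columns(raw, expected = c("site", "taxon", "count"))
```

A check like this fails fast with a diagnostic message in the maintenance logfile, rather than producing a silently malformed L1 dataset.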
Currently, only the EDI Data Repository is recognized by the ecocomDP project. If you would like support extended to your data repository, then please place a request in the ecocomDP project issue tracker and we will add the supporting code to index and read ecocomDP datasets from your repository.
To convert an L0 dataset, implement the following processes in your conversion script:
- Use the create_*() functions to parse the relevant derived L1 tables out of the flattened source data.
- Write the tables to file with write_tables().
- Validate the tables with validate_data().
- Create EML metadata for the derived dataset with create_eml().
For details on the processes within the conversion script, see the Create vignette.
Some general guidelines for the conversion script:

- The script should contain nothing other than library() calls, the main function definition, and supporting function definitions; define supporting functions outside of the create_ecocomDP() function.
- List the script's dependencies at the top with library() calls for each (e.g. library(dplyr)).
- Define the main function as create_ecocomDP <- function(...) {...} and only use the allowed set of arguments:
  - path - Where the ecocomDP tables will be written
  - source_id - Identifier of the source dataset
  - derived_id - Identifier of the derived dataset
  - url - The URL by which the derived tables and metadata can be accessed by a data repository. This argument is used when automating the repository publication step, but not when publishing manually.
- Add messages (i.e. message()) and section headers (ctrl-shift-R in RStudio) at the beginning of each major code block to help maintainers with debugging, should it be needed.
- Index columns by name (e.g. data$temp, not data[[3]]). The order of columns may change in revised L0 datasets, which will cause problems if columns are indexed by position but not if indexed by name.
- Don't use rm(list = ls()) in the script. This will remove the global environment needed by automated maintenance routines.

Refer to this section to resolve specific issues while creating the L1 tables. For more in-depth descriptions of these tables and their columns, see the Model Overview vignette.
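A minimal skeleton that follows these scripting guidelines; the function body is a placeholder, not real conversion logic:

```r
# Dependencies would be listed at the top, e.g.:
# library(dplyr)

# Main function using only the allowed set of arguments
create_ecocomDP <- function(path, source_id, derived_id, url = NULL) {
  message("Reading source dataset ", source_id)  # progress message for maintainers
  # ... read the source data, flatten it, create the L1 tables with the
  # create_*() functions, then write_tables(), validate_data(), create_eml() ...
  invisible(NULL)
}

# Supporting functions are defined outside the main function, e.g.:
# flatten_source <- function(raw) {...}
```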
Store the core observations being analyzed in this table. Observations must be linked to a taxon and to a location. Linking to ancillary observations is optional.
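For orientation, a single observation row might look like the sketch below. The column set follows the ecocomDP observation table, while all values are invented:

```r
# Hypothetical observation row linking a measured value to a taxon and a
# location (ids and values are illustrative)
observation <- data.frame(
  observation_id = "obs_1",
  event_id = "ev_1",
  package_id = "edi.193.5",   # illustrative identifier
  location_id = "loc_1",
  datetime = "2019-06-01",
  taxon_id = "tx_1",
  variable_name = "abundance",
  value = 3,
  unit = "count"
)
```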
Pivot wide-format taxa columns (e.g. with tidyr::pivot_longer()) into columns of values and taxon_names. Manually add the variable_name and unit columns to describe the measurement. Use the column description and units from the L0 metadata, if applicable.

Store identifying information about a place (longitude, latitude, elevation) in this table. This table is self-referencing so that sites can be nested.
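Pivoting wide taxa columns into the long format required by the observation table, as described above, might look like this (the wide table and its column names are hypothetical):

```r
library(tidyr)

# Hypothetical wide L0 table: one column per taxon
wide <- data.frame(
  site = c("A", "B"),
  Carex = c(3, 0),
  Salix = c(1, 2)
)

# Pivot taxa columns into long format, then label the measurement
long <- pivot_longer(
  wide,
  cols = c("Carex", "Salix"),
  names_to = "taxon_name",
  values_to = "value"
)
long$variable_name <- "abundance"
long$unit <- "count"
```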
If location information raises issues during validation (i.e. with validate_data()), then see the observation_ancillary section.
knitr::include_graphics('./coords_final_wide2.png')
Store identifying information about a taxon in this table.
Store summary info about the L1 dataset in this table.
If L0 dates are in YYYY format, you may need to construct a temporary vector of full dates (with arbitrary month and day values) so the calc_*() functions can be used. Do not use this vector of modified and arbitrary dates in any of the datetime columns of the other tables; those should remain YYYY formatted.

Store ancillary information about an observational event for context (e.g. plot or transect identifiers, measurement times, environmental conditions, field notes, etc.).
Store additional information about a place that does not change frequently (e.g. lake area or depth, experimental treatment, habitat). Features that change frequently are more closely related to the observational event, and are thus kept in the observation_ancillary table. Ancillary observations are linked through the location_id, and one location_id may have many ancillary observations about it.
Store additional information about an organism that does not change frequently (e.g. trophic level). Features that change frequently are probably observations. Ancillary observations are linked through the taxon_id, and one taxon_id may have many ancillary observations about it.
Link information from the variable_name columns of the observation, observation_ancillary, location_ancillary, and taxon_ancillary tables to external definitions in this table. See Ontologies and Vocabularies section for details.
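For illustration, one variable_mapping row linking the variable_name "abundance" to a Darwin Core term might look like this; the mapped system and term are example choices, not prescribed by the model:

```r
# Hypothetical variable_mapping row: the column set follows the ecocomDP
# variable_mapping table, while the mapped system and term are illustrative
variable_mapping <- data.frame(
  variable_mapping_id = "vm_1",
  table_name = "observation",
  variable_name = "abundance",
  mapped_system = "Darwin Core",
  mapped_id = "http://rs.tdwg.org/dwc/terms/individualCount",
  mapped_label = "individualCount"
)
```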
Other unforeseen issues with the L0 dataset may manifest themselves as you begin to create an ecocomDP (L1) formatted dataset. See the following suggestions for handling some common issues:
Add a warning(), with a description of the issue, in the create_ecocomDP() function so that the next time the function runs as part of the maintenance routine, the logfile will prompt the maintainer to check whether the L0 fix has been implemented and to adjust the create_ecocomDP() function accordingly.

A dataset often has multiple levels of observation (i.e. spatial scales of observation). Only one of those levels is considered meaningful. We use the term “meaningful” to denote the finest spatial scale at which the observations are meant to be interpreted.
We use the concept of meaningful level of observation to help assign location_ids:
knitr::include_graphics('./MLO_final_long2.png')
The Frequency of Survey refers to the temporal frequency of events over the course of a study.
knitr::include_graphics('./LevelOfSurvey_final_long2.png')
Value-added information is anything that an ecocomDP creator intentionally adds to the L1 dataset to improve its findability, accessibility, interoperability, or reusability (see FAIR Principles). This information does not come from the L0 dataset directly, but instead is derived from it. Since the rule of thumb regarding L1 dataset creation is to rearrange but not alter the L0 dataset, supplementing a dataset with value-added information must be done carefully and within the constraints of the following rules:
Use annotations found in the annotation dictionary (i.e. via the annotation_dictionary() function). Users commonly filter on these. If a study involves a human-induced experiment/manipulation (i.e. not a natural experiment/manipulation, e.g. a hurricane), then add "Manipulative experiment" as an annotation using the is_about argument of the create_eml() function, e.g.:
create_eml(..., is_about = c(`Manipulative experiment` = "http://purl.dataone.org/odo/ECSO_00000506"))
Sometimes an L1 dataset needs to be removed from circulation. To do this for an archived L1 dataset, publish an identical revision of the dataset with the following changes added manually to the metadata:
knitr::include_graphics('./deprecated.jpg')