knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, eval = FALSE)
This article demonstrates how to create an EML document for a data package containing multiple data entities. To follow along with this example, please download the Banet-Example.
We will create a nested list from our metadata templates and then use the EML R package's write_eml
function to convert our list into a valid EML document.
Our EML file will contain the following elements:
- eml
  - access
  - dataset
    - creator
    - contact
    - associated parties
    - title
    - abstract
    - keyword set
    - license
    - methods
    - maintenance
    - project
    - coverage
    - data table
We have two main sections within our EML document: access and dataset.
For a more in-depth description of EML, please see the EML Specification.
The following libraries are needed to create a working EML document.
library(EMLaide)
library(tidyverse)
library(readxl)
library(EML)
To follow along with this example please download the files from the Banet-Example. Within the directory, there are two subdirectories ("data" and "metadata") which contain all the necessary data and metadata to create a valid EML document using EMLaide.
At minimum, four files are needed to use our tools:
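Because our dataset is composed of multiple data entities, we create a dataframe in which each row represents a different data entity and contains the following information:
- filepath: the path to the data file.
- attribute_info: the path to the Excel file that describes the attributes of the data file.
- datatable_description: a short description of the datatable.
- datatable_url: a publicly accessible URL for the data file. This is needed if you use EMLaide::evaluate_edi_package() or EMLaide::upload_edi_package() to evaluate or upload your EML document to EDI from R.
This dataframe is the input to the add_datatable function, which generates attribute metadata from the attribute_info files and physical information describing the datatable from the filepath and datatable_url information.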
Example Dataframe Structure for datatable_metadata
datatable_metadata <- dplyr::tibble(
  filepath = c(
    "data/enclosure-study-growth-rate-data.csv",
    "data/enclosure-study-gut-contents-data.csv",
    "data/microhabitat-use-data-2018-2020.csv",
    "data/seining-weight-lengths-2018-2020.csv",
    "data/snorkel-index-data-2015-2020.csv"
  ),
  attribute_info = c(
    "metadata/enclosure-study-growth-rates-metadata.xlsx",
    "metadata/enclosure-study-gut-contents-metadata.xlsx",
    "metadata/microhabitat-use-metadata.xlsx",
    "metadata/seining-weight-length-metadata.xlsx",
    "metadata/snorkel-index-metadata.xlsx"
  ),
  datatable_description = c(
    "Growth Rates - Enclosure Study",
    "Gut Contents - Enclosure Study",
    "Microhabitat Data",
    "Seining Weight Lengths Data",
    "Snorkel Survey Data"
  ),
  datatable_url = paste0(
    "https://raw.githubusercontent.com/FlowWest/edi.749.1/main/data/",
    c(
      "enclosure-study-growth-rate-data.csv",
      "enclosure-study-gut-contents-data.csv",
      "microhabitat-use-data-2018-2020.csv",
      "seining-weight-lengths-2018-2020.csv",
      "snorkel-index-data-2015-2020.csv"
    )
  )
)
Each row contains all the information needed to add one data entity to the dataset element of a data package. If you only have one datatable, keep this structure or use a named list with the same information.
knitr::kable(datatable_metadata)
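Before passing a dataframe like this to add_datatable, it can help to confirm that the expected columns are present and that the local files exist. The following is a minimal sketch in base R; check_datatable_metadata is a hypothetical helper written for this illustration and is not part of EMLaide, and the example rows are made up.

```r
# Hypothetical helper (not part of EMLaide): report problems in a
# datatable_metadata-style data frame before it is used.
check_datatable_metadata <- function(datatable_metadata) {
  required <- c("filepath", "attribute_info",
                "datatable_description", "datatable_url")
  problems <- character(0)

  # Columns named in the article that are absent from the input
  missing_cols <- setdiff(required, names(datatable_metadata))
  if (length(missing_cols) > 0) {
    problems <- c(problems, paste("missing column:", missing_cols))
  }

  # Only check the file columns that are actually present
  file_cols <- intersect(c("filepath", "attribute_info"),
                         names(datatable_metadata))
  for (col in file_cols) {
    paths <- datatable_metadata[[col]]
    not_found <- paths[!file.exists(paths)]
    if (length(not_found) > 0) {
      problems <- c(problems, paste0(col, " not found: ", not_found))
    }
  }
  problems
}

# Usage with a throwaway file so the example runs anywhere
tmp_csv <- tempfile(fileext = ".csv")
writeLines("a,b", tmp_csv)
example_metadata <- data.frame(
  filepath = tmp_csv,
  attribute_info = "metadata/does-not-exist.xlsx",  # deliberately missing
  datatable_description = "Example table",
  datatable_url = "https://example.org/example.csv",
  stringsAsFactors = FALSE
)
problems <- check_datatable_metadata(example_metadata)
print(problems)
```

An empty result means the structure and paths look usable; it does not check the contents of the files.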
# Same dataframe, built from the example files bundled with the installed EMLaide package
banet_file <- function(...) {
  system.file("extdata", "Banet-Example", ..., package = "EMLaide", mustWork = TRUE)
}
data_files <- c(
  "enclosure-study-growth-rate-data.csv",
  "enclosure-study-gut-contents-data.csv",
  "microhabitat-use-data-2018-2020.csv",
  "seining-weight-lengths-2018-2020.csv",
  "snorkel-index-data-2015-2020.csv"
)
metadata_files <- c(
  "enclosure-study-growth-rates-metadata.xlsx",
  "enclosure-study-gut-contents-metadata.xlsx",
  "microhabitat-use-metadata.xlsx",
  "seining-weight-length-metadata.xlsx",
  "snorkel-index-metadata.xlsx"
)
datatable_metadata <- dplyr::tibble(
  filepath = vapply(data_files, function(f) banet_file("data", f),
                    character(1), USE.NAMES = FALSE),
  attribute_info = vapply(metadata_files, function(f) banet_file("metadata", f),
                          character(1), USE.NAMES = FALSE),
  datatable_description = c(
    "Growth Rates - Enclosure Study",
    "Gut Contents - Enclosure Study",
    "Microhabitat Data",
    "Seining Weight Lengths Data",
    "Snorkel Survey Data"
  ),
  datatable_url = paste0(
    "https://raw.githubusercontent.com/FlowWest/edi.749.1/main/data/",
    data_files
  )
)
excel_path <- system.file("extdata", "Banet-Example", "metadata", "data-package-metadata.xlsx", package = "EMLaide", mustWork = TRUE)
sheets <- readxl::excel_sheets(excel_path)
metadata <- purrr::map(sheets, function(x) readxl::read_excel(excel_path, sheet = x))
names(metadata) <- sheets
abstract_docx <- system.file("extdata", "Banet-Example", "metadata", "abstract.docx", package = "EMLaide", mustWork = TRUE)
methods_docx <- system.file("extdata", "Banet-Example", "metadata", "methods.docx", package = "EMLaide", mustWork = TRUE)
The following code reads every sheet of "data-package-metadata.xlsx" into a named list and stores the paths to "abstract.docx" and "methods.docx". Each sheet of the Excel workbook pertains to a different metadata element and will be the input to the add_[blank]
functions used throughout this example.
# Paths are relative to the downloaded Banet-Example directory
excel_path <- "metadata/data-package-metadata.xlsx"
sheets <- readxl::excel_sheets(excel_path)
metadata <- lapply(sheets, function(x) readxl::read_excel(excel_path, sheet = x))
names(metadata) <- sheets
abstract_docx <- "metadata/abstract.docx"
methods_docx <- "metadata/methods.docx"
In addition to these files, we will need a unique EDI data package identifier. We use the function reserve_edi_id
to generate an EDI ID. You must already have an EDI account to do this.
edi_number <- reserve_edi_id(user_id = "your user id", password = "your user password")
You can also reserve this data package identifier on the EDI data repository under tools.
For this example, we will use the following identifier.
edi_number <- "edi.750.1"
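The identifier follows the pattern scope.identifier.revision, and the same string is reused later as the name of the XML file. A minimal sketch of checking that pattern and deriving the file name is shown below; parse_package_id is a hypothetical helper for illustration only, not an EMLaide function, and the regular expression is an assumption about which characters a scope may contain.

```r
# Hypothetical helper (not part of EMLaide): split a package identifier of
# the form "scope.identifier.revision" into its parts.
parse_package_id <- function(package_id) {
  # Assumed pattern: lowercase scope (hyphens allowed), then two integers
  if (!grepl("^[a-z-]+\\.[0-9]+\\.[0-9]+$", package_id)) {
    stop("Expected an identifier like 'edi.750.1', got: ", package_id)
  }
  parts <- strsplit(package_id, ".", fixed = TRUE)[[1]]
  list(
    scope = parts[1],
    identifier = as.integer(parts[2]),
    revision = as.integer(parts[3])
  )
}

edi_number <- "edi.750.1"
id_parts <- parse_package_id(edi_number)

# File name used when the EML document is written out later
eml_file_name <- paste0(edi_number, ".xml")
```

Deriving the file name from edi_number avoids retyping the identifier when the document is written and validated.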
We will use magrittr::%>%
with our add_[blank]
functions to append each EML element to a list. The %>%
is a pipe operator that passes the left-hand side as the first argument to the function on the right-hand side.
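To make the pipe behaviour concrete, the sketch below compares a piped call with the equivalent nested call. add_example_title is a hypothetical stand-in written for this illustration; it only mimics the general shape of an add_[blank] function (parent list in, parent list out) and is not EMLaide code.

```r
library(magrittr)

# Hypothetical stand-in for an add_[blank] function: takes the parent list
# as its first argument and returns it with one more element.
add_example_title <- function(parent_element, title) {
  parent_element$title <- title
  parent_element
}

# With the pipe, list() becomes the first argument of add_example_title()
piped <- list() %>% add_example_title("My title")

# The same call written without the pipe
nested <- add_example_title(list(), "My title")

identical(piped, nested)
```

Because each function returns the updated list, calls can be chained one after another without intermediate variables.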
For details on appropriate inputs to the functions see documentation at ?add_[blank]
.
The add_method()
and add_abstract()
functions take in the methods_docx
and the abstract_docx
. The add_datatable()
function takes in the datatable_metadata
defined and described above. Every other function takes in one or more sheets from the metadata
object. For template items with multiple rows, the add_[blank]
functions map through each row and add a named nested list for each row to the dataset element.
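As a rough illustration of this row-wise mapping, the sketch below turns each row of a small sheet into a named list and appends the collection to a parent list. add_keywords_sketch, the column names, and the example rows are all assumptions made for this illustration; EMLaide's own functions may structure their output differently.

```r
# Hypothetical stand-in for an add_[blank] function (not EMLaide's code):
# turn every row of a template sheet into a named list and append the
# collection to the parent element.
add_keywords_sketch <- function(parent_element, keyword_sheet) {
  rows <- lapply(seq_len(nrow(keyword_sheet)), function(i) {
    list(
      keyword = keyword_sheet$keyword[i],
      keywordThesaurus = keyword_sheet$keywordThesaurus[i]
    )
  })
  parent_element$keywordSet <- rows
  parent_element
}

# Made-up sheet with two rows
keyword_sheet <- data.frame(
  keyword = c("salmon", "growth"),
  keywordThesaurus = c("LTER", "LTER"),
  stringsAsFactors = FALSE
)

dataset_sketch <- add_keywords_sketch(list(), keyword_sheet)
str(dataset_sketch)
```

Each row becomes one nested list, so a sheet with two rows yields a keywordSet of length two.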
The code below adds all dataset elements.
dataset <- list() %>%
  add_pub_date() %>%
  add_title(metadata$title) %>%
  add_personnel(metadata$personnel) %>%
  add_keyword_set(metadata$keyword_set) %>%
  add_abstract(abstract_docx) %>%
  add_license(metadata$license) %>%
  add_method(methods_docx) %>%
  add_maintenance(metadata$maintenance) %>%
  add_project(metadata$funding) %>%
  add_coverage(metadata$coverage, metadata$taxonomic_coverage) %>%
  add_datatable(datatable_metadata)
When a unit is not standard, add_datatable()
will give a message like the following: "We identified the following custom unit: fishPerSchool , please make sure to add information on this custom unit in additional metadata information:"
. We must formally define each of these custom units and add them to the EML document as an additional metadata section.
The code below defines 4 custom units and uses the EML::set_unitList()
function to format them into a unitList that can be added to our EML document.
custom_units <- data.frame(
  id = c("fishPerEnclosure", "thermal unit", "day", "fishPerSchool"),
  unitType = c("density", "temperature", "dimensionless", "density"),
  parentSI = c(NA, NA, NA, NA),
  multiplierToSI = c(NA, NA, NA, NA),
  description = c(
    "Fish density in the enclosure, number of fish in total enclosure space",
    "thermal unit of energy given off of fish",
    "count of number of days that go by",
    "Number of fish counted per school"
  )
)
unitList <- EML::set_unitList(custom_units)
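Since every flagged unit needs a definition, a quick comparison can show which ones are still missing. The sketch below uses base R only; flagged_units is assumed to be copied by hand from the add_datatable() messages, and custom_unit_ids is a deliberately incomplete stand-in for the id column of custom_units.

```r
# Units reported in add_datatable() messages (assumed, copied in by hand)
flagged_units <- c("fishPerEnclosure", "thermal unit", "day", "fishPerSchool")

# Stand-in for custom_units$id, with one unit left out on purpose
custom_unit_ids <- c("fishPerEnclosure", "thermal unit", "day")

# Flagged units that do not yet have a custom definition
undefined_units <- setdiff(flagged_units, custom_unit_ids)
if (length(undefined_units) > 0) {
  message("Still undefined: ", paste(undefined_units, collapse = ", "))
}
```

With the full custom_units dataframe from above, undefined_units would be empty.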
The code below combines all of the elements we generated above, plus an access element, into an eml list.
- add_access adds an access section at the beginning of our EML document. By default, add_access grants read permission to the public principal.
- The dataset list from above contains all elements of the dataset section of the EML. This includes the datatables, abstract, methods, and all the other metadata sections appended above.
- additionalMetadata contains the unitList that we generated to hold our custom units.
eml <- list(
  packageId = edi_number,
  system = "EDI",
  access = add_access(),
  dataset = dataset,
  additionalMetadata = list(metadata = list(unitList = unitList))
)
Once all of our information is appended to our eml list we can use the write_eml
and eml_validate
functions from the EML package to convert our list to EML and check validity.
EML::write_eml(eml, "edi.750.1.xml")
EML::eml_validate("edi.750.1.xml")
To evaluate your document in R using EDI's EML Congruence Checker, you can use evaluate_edi_package(). To use this function, each data entity file must be publicly accessible by URL, and that URL must be given in the datatable_url column of the datatable_metadata dataframe above. If you do not have a URL available, you can instead upload the EML document and the data files through the EDI data portal.
evaluate_edi_package(user_id = "Your User Id", password = "Your password", eml_file_path = "edi.750.1.xml")