pkEML

An R package that converts Ecological Metadata Language (EML) documents to tables and, optionally, normalizes them into tables suited for import into a relational database system, such as the LTER-core-metabase schema. Data managers can use pkEML to aid migration of their metadata archives, while researchers working on meta-analyses can use pkEML to quickly gather metadata details from a large set of datasets. While pkEML was developed with the LTER network in mind, any EML user may find its functionality useful.

How to say "pkEML": spell out each letter. "pk" is meant to stand for primary key, but one can also interpret it as peak-EML or pack-EML.

Installation

# Requires the remotes package
install.packages("remotes")
remotes::install_github("atn38/pkEML")

How to use pkEML

library(pkEML)

Step 1: Assemble a "corpus" of EML documents

A corpus of EML documents can correspond to a research program's metadata archive, or to any set of assembled metadata documents. A corpus is the unit of input data that pkEML operates on: when getting geographic coverage, or any other metadata element, the default is to gather it from the whole corpus in one go. You can of course use pkEML to grab metadata from a single EML document, but other tools may suit that case better, such as the metajam package from NCEAS or the function purrr::pluck().

pkEML has a function to quickly and conveniently download all EML documents from an Environmental Data Initiative (EDI) repository "scope":

pkEML::download_corpus(scope = "knb-lter-ble", path = getwd())

This downloads into the specified directory the EML document for the most recent revision of every dataset under the "knb-lter-ble" scope in EDI, which belongs to the Beaufort Lagoon Ecosystems LTER program.

If working with a more heterogeneous set, use your favorite method to download the EML documents into a directory.

Step 2: Import the EML corpus into R

emls <- import_corpus(path = getwd())

import_corpus outputs a nested list of EML documents represented in the emld format. Each list item is an EML document and is named after the full packageId in the metadata body (not the .xml file name in the directory).
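To make the shape of that output concrete, here is a minimal sketch built on a hand-made stand-in for an imported corpus. The object mock_corpus, the packageIds, and the titles are all made up for illustration, and the dataset$title layout is an assumption about how emld-style lists nest. Real output from import_corpus will contain far more per document.

```r
# Hypothetical stand-in for the list returned by import_corpus().
# Each item mimics one EML document as a nested list, named by its packageId.
mock_corpus <- list(
  "knb-lter-ble.1.7" = list(
    packageId = "knb-lter-ble.1.7",
    dataset = list(title = "Example dataset A (made up)")
  ),
  "knb-lter-ble.2.3" = list(
    packageId = "knb-lter-ble.2.3",
    dataset = list(title = "Example dataset B (made up)")
  )
)

# The list names are the packageIds, so documents can be looked up directly.
package_ids <- names(mock_corpus)

# Pull one field out of every document into a named character vector.
titles <- vapply(mock_corpus, function(doc) doc$dataset$title, character(1))

print(package_ids)
print(titles)
```

The same names() and vapply() pattern should carry over to a real corpus, as long as the element you ask for is present in every document.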

Step 3: Convert EML corpus to tables

dfs <- EML2df(corpus = emls)

dfs is a nested list of data.frames. Each data.frame assembles information from all your datasets for one key metadata group, such as dataset-level information, entity-level, attribute-level, attribute codes (enumeration and missing), geographic/temporal/taxonomic coverage, and so on. These data.frames are de-normalized, meaning every occurrence in EML is recorded and there may be a lot of repeated information. For example, key personnel from your research program will be listed as contributors on many datasets, core sampling locations will be listed many times, and so on.

lapply(dfs, colnames)
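As an illustration of what "de-normalized" means here, the sketch below takes a made-up table with one row per dataset-person occurrence and splits it by hand into a table of distinct people plus a link table. The column names and values are invented, and this is not how normalize_tables works internally or what the LTER-core-metabase schema looks like. It only shows the general idea in base R.

```r
# Hypothetical de-normalized table: one row per (dataset, person) occurrence.
people_denorm <- data.frame(
  packageId = c("pkg.1.1", "pkg.2.1", "pkg.3.1", "pkg.3.1"),
  surname   = c("Rivera", "Rivera", "Rivera", "Okafor"),
  email     = c("r@example.org", "r@example.org", "r@example.org", "o@example.org"),
  stringsAsFactors = FALSE
)

# Entity table: each distinct person appears once and gets a surrogate key.
people <- unique(people_denorm[, c("surname", "email")])
people$person_id <- seq_len(nrow(people))
rownames(people) <- NULL

# Link table: which person is attached to which dataset.
dataset_people <- merge(people_denorm, people, by = c("surname", "email"))
dataset_people <- dataset_people[, c("packageId", "person_id")]
dataset_people <- dataset_people[
  order(dataset_people$packageId, dataset_people$person_id),
]
rownames(dataset_people) <- NULL

print(people)
print(dataset_people)
```

The repeated surname and email values collapse into a single row per person, while the link table keeps the record of which datasets each person appears on.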

Step 4: Normalize tables

# tbls <- normalize_tables(dfs)

Customized usage of pkEML

Stop when you have what you want

One can stop at any point in the above sequence, of course. A logical place to stop would be after running EML2df on your EML corpus. At this point, you've got a set of rich tables to do a lot with.

Getting specific metadata elements

EML2df simply wraps around a set of more granular get_ functions. These are all exported functions and can be used to get specific metadata elements in table form:

datasets <- get_dataset(corpus = emls)
taxonomy <- get_coverage_tax(corpus = emls)

Getting a particular metadata element (that I didn't write a get_ function for)

Potentially even more powerful are the adaptable get_multilevel_element and get_datasetlevel_element functions. These take an EML corpus, an EML element name, and a parse function as arguments. For example, get_coverage_geo is actually just a wrapper around get_multilevel_element:

# get_coverage_geo(corpus = emls) 

# is equivalent to 

# get_multilevel_element(corpus = emls, element_names = c("coverage", "geographicCoverage"), parse_function = parse_geocov) 

geographicCoverage is an element that can be used to describe any combination of datasets, entities, and attributes in EML. get_multilevel_element grabs all occurrences of the geographicCoverage node at each level, then runs them through the parse_geocov function, while preserving the context of where each occurrence was found: which dataset, which entity, which attribute.
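The idea of gathering one element from several levels while keeping its context can be sketched in base R as follows. Everything here is a toy: toy_corpus, pluck_path, parse_geocov_toy, and toy_get_multilevel are hypothetical names written for this example, not pkEML functions, and the sketch only covers the dataset and entity levels with a single coverage node each. pkEML's own implementation may differ.

```r
# Hypothetical parse function: one geographicCoverage node -> one-row data.frame.
parse_geocov_toy <- function(node) {
  data.frame(
    description = node$geographicDescription,
    north = node$boundingCoordinates$northBoundingCoordinate,
    stringsAsFactors = FALSE
  )
}

# Walk a nested list along a vector of element names; NULL if the path is absent.
pluck_path <- function(x, path) {
  for (nm in path) {
    if (!is.list(x)) return(NULL)
    x <- x[[nm]]
  }
  x
}

# Toy stand-in for get_multilevel_element(), dataset and entity levels only.
toy_get_multilevel <- function(corpus, element_names, parse_function) {
  rows <- list()
  for (id in names(corpus)) {
    ds <- corpus[[id]]$dataset
    # Dataset level: no entity context, so entity is NA.
    levels <- c(list(list(entity = NA_character_, node = ds)),
                lapply(ds$dataTable, function(e) list(entity = e$entityName, node = e)))
    for (lv in levels) {
      found <- pluck_path(lv$node, element_names)
      if (!is.null(found)) {
        context <- data.frame(packageId = id, entity = lv$entity,
                              stringsAsFactors = FALSE)
        rows[[length(rows) + 1]] <- cbind(context, parse_function(found))
      }
    }
  }
  do.call(rbind, rows)
}

# Made-up corpus: coverage on the dataset and on one of two entities.
toy_corpus <- list("pkg.1.1" = list(dataset = list(
  coverage = list(geographicCoverage = list(
    geographicDescription = "Whole study area",
    boundingCoordinates = list(northBoundingCoordinate = 71.5))),
  dataTable = list(
    list(entityName = "stations.csv",
         coverage = list(geographicCoverage = list(
           geographicDescription = "Station A",
           boundingCoordinates = list(northBoundingCoordinate = 70.2)))),
    list(entityName = "no_coverage.csv"))
)))

geo <- toy_get_multilevel(toy_corpus, c("coverage", "geographicCoverage"),
                          parse_geocov_toy)
print(geo)
```

Each row of the result carries the packageId and, where applicable, the entity name alongside the parsed values, which is the "context" being preserved.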

The same goes for get_datasetlevel_element, a very similar function that works only at the dataset level, since many EML elements are unique to this level.

To grab an EML element without a ready-made get_ function, just write a custom parse function and pass it to these generic functions.
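Here is a rough sketch of what a custom parse function could look like, using keywordSet as the example element. The exact contract pkEML expects of a parse function is not spelled out in this README, so the sketch assumes the simplest one: take one node as a nested list, return a data.frame. parse_keywords_toy and toy_get_datasetlevel are hypothetical names, and the latter is a base R stand-in used only so the example runs on its own.

```r
# Hypothetical custom parse function: one keywordSet node -> data.frame,
# one row per keyword.
parse_keywords_toy <- function(node) {
  thesaurus <- node$keywordThesaurus
  data.frame(
    keyword = unlist(node$keyword),
    thesaurus = if (is.null(thesaurus)) NA_character_ else thesaurus,
    stringsAsFactors = FALSE
  )
}

# Toy stand-in for a dataset-level getter (not pkEML's implementation).
toy_get_datasetlevel <- function(corpus, element_name, parse_function) {
  tables <- lapply(names(corpus), function(id) {
    node <- corpus[[id]]$dataset[[element_name]]
    if (is.null(node)) return(NULL)
    # Prepend the packageId so each row keeps track of its source document.
    data.frame(packageId = id, parse_function(node), stringsAsFactors = FALSE)
  })
  # Drop documents where the element was absent, then stack the rest.
  do.call(rbind, Filter(Negate(is.null), tables))
}

# Made-up corpus: one document with keywords, one without.
toy_corpus <- list(
  "pkg.1.1" = list(dataset = list(keywordSet = list(
    keyword = list("sea ice", "salinity"),
    keywordThesaurus = "Made-up thesaurus"
  ))),
  "pkg.2.1" = list(dataset = list(title = "No keywords here"))
)

keywords <- toy_get_datasetlevel(toy_corpus, "keywordSet", parse_keywords_toy)
print(keywords)
```

With the real package, the equivalent step would be passing your own parse function to get_datasetlevel_element or get_multilevel_element. Check the function documentation for the argument names and the return shape they expect.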

Getting help

Report an issue and I'll try my best to get back to you.



atn38/MetaInbase documentation built on Feb. 26, 2023, 9:56 a.m.