listviewer

If the listviewer package is installed, it provides a convenient way to view and edit EML:

f <- system.file("tests/eml.xml", package = "emld")
eml <- read_eml(f)
listviewer::jsonedit(eml)

Parsing an EML file

f <- system.file("xsd/test/eml-i18n.xml", package = "EML")
eml <- read_eml(f)

Here we request all temporalCoverage elements occurring anywhere in the EML document:

temporalCoverage <- eml_get(eml, "temporalCoverage")
temporalCoverage

Any EML element can be extracted in this way. Let's try an example metadata file for a dataset that documents 11 separate dataTables:

hf001 <- system.file("examples/hf001.xml", package="EML") 
eml_HARV <- read_eml(hf001)

How many dataTable entities are there in this dataset?

dt <- eml_get(eml_HARV, "dataTable")
length(dt)

We can iterate over our list of dataTable elements to extract relevant metadata, such as the entityName or the download URL:

entities <- eml_get(eml_HARV, "dataTable.entityName")
urls <- sapply(dt, eml_get, "url")

Note that the latter example is equivalent to the more verbose call below, which specifies exactly where the url of interest is located:

urls <- sapply(dt, function(x) x@physical[[1]]@distribution[[1]]@online@url)

This verbose syntax is useful when each dataTable contains multiple url elements and we want only certain ones. Specifying the exact path can also make the command faster. For these reasons, programmatic use should prefer this format, while the much simpler eml_get example shown above is practical for most interactive applications.
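The trade-off between searching by name and giving an exact path can be illustrated without the EML classes at all. The following is a minimal sketch on a plain nested list; `dt_toy`, `find_all`, and the example.org URLs are hypothetical names invented for illustration, and `find_all` is not how eml_get is implemented.

```r
# Toy stand-in for one dataTable's metadata: a plain nested list, NOT an EML object.
dt_toy <- list(
  entityName = "hf001-01",
  physical = list(
    distribution = list(online = list(url = "https://example.org/data.csv"))
  ),
  methods = list(
    protocol = list(online = list(url = "https://example.org/protocol.html"))
  )
)

# Hypothetical helper: collect every element called `name`, at any depth.
find_all <- function(x, name) {
  if (!is.list(x)) return(list())
  hits <- unname(x[names(x) == name])                 # matches at this level
  nested <- lapply(unname(x), find_all, name = name)  # matches further down
  c(hits, do.call(c, nested))
}

# Searching by name visits the whole structure and returns both url elements.
all_urls <- unlist(find_all(dt_toy, "url"))

# The exact path touches only the elements named and returns just the one we want.
exact_url <- dt_toy$physical$distribution$online$url
```

In this toy case the name-based search returns two URLs while the exact path returns only the download location, which is the situation where the verbose form is worth the extra typing.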

Although the default return type for eml_get is just the S4 object (whose print method displays the corresponding XML structure used to represent that metadata), for a few commonly accessed complex elements eml_get returns a more convenient data.frame. For instance, the attributeList describing the metadata for every column in an EML document is returned as a pair of data.frames: one for all the attributes, and a second, optional data.frame defining the levels for the factors, if any are used. Let's take a look:

Here we get the attributeList for each dataTable in the dataset. We check the length to confirm that we get one attributeList for each dataTable:

attrs <- eml_get(dt, "attributeList") 
length(attrs)
attrs[[1]]

(Note that we could have passed the original eml_HARV as the first argument here instead of dt, since we know all attributeList elements are inside dataTable elements, but this is a bit more explicit and a bit faster.)

This returns the data.frame object containing the attribute metadata for the first table (hence the [[1]]; attrs itself now contains this metadata for all 11 tables). This is the same result we would have gotten using the more explicit call to the helper function get_attributes():

get_attributes(eml_HARV@dataset@dataTable[[1]]@attributeList)


ropensci/EML documentation built on June 11, 2022, 10:32 a.m.