knitr::opts_chunk$set(echo = TRUE)
Darwin Core can best be thought of as a set of terms for describing information about biodiversity data that are given special significance so that their meaning is comparable between different data sets. Every data set in the ecological science is unique and every one will have some features about it that will require some explanation when providing this data to other parties. The Darwin Core standard is a way of describing features of your data set using a controlled vocabulary such that when you talk about feature 'x' in your data set, that this 'x' can be understood to mean a particular well-defined and standardised thing. Darwin Core is not the only standard used for the description of biological data, there are plenty of others and many, including Darwin Core itself, are maintained by working groups associated with TDWG a Biodiversity Information Standards voluntary organization (formally called the Taxonomic Databases Working Group). Darwin Core is the standard that forms the basis for the Global Biodiversity Information Facility and, for many researchers, GBIF will represent the most common way to make their data findable and sharable with the wider community.
The Darwin Core standard itself should not be confused with Darwin Core archive files. Darwin Core archive files are a file format type that bundles together data and the metadata describing said data into one package. The data in these archive files are described according to Darwin Core standards (hence the name) and, as a result, represent a convenient standards-compliant format for sharing biodiversity data according to FAIR principles. Darwin Core archive files are the preferred method of publishing data in GBIF and a full how-to guide to producing Darwin core files can be found on the GBIF Integrated Publishing Toolkit website. However, this document will demonstrate ways to both generate your own Darwin Core archive files an existing data set or importing data from an existing Darwin core archive file using an alternative solution provided by the Living Norway R package.
Darwin core archive files are simply a ZIP folder with a specific set of files inside:
Data files These are a collection of files that contain the data. The format of this data is expected to be in a text-based delimited format such as comma-seperated values (CSV) or tab-seperated values. Information such as the encoding of the text and the type of delimiter used in the data files is often stored in the metafile. The first line of the data files may contain column headings.
Core data file Every Darwin Core archive needs exactly one core data file. This is the main data table of the dataset and serves as the definition of the reference sampling unit for the data set. One column of the core data file must be an ID column with each row of that column containing a unique ID code for the record (unique to the dataset): this ID code will serve as a reference code for any extension data files included in the archive. Data being published to GBIF can have core data files that belong to one the following types:
Sample event data corresponds to data that have been collected according to a defined protocol or experimental design at particular times or locations. Here the data set is built around a series of sampling events at which measurements are recorded.
Species checklist data corresponds to a list of named species (or other taxon level) that represent a catolgue for some purpose. This could for example be a list of species of particular conservation for a particular area or a list of potential invasive species.
Extension data files A Darwin Core archive file may optionally have any number of data tables that may contain additional information to extend the information in the core data file. For example, in a Darwin Core archive with an event-based core data file, we might also have an accompanying data table that describes the number of individuals of each species found at each of the sampling events. Like the core data file, each extension data table must contain one column that represents an ID column. Unlike the core data table type however, the extension ID columns refer to the IDs in the core data table that the extension data table rows are providing information about. The ID values in the ID column of the extension data tables do not therefore need to be unique because multiple rows in an extension table may be referring to a single entry in the core data table. In the example described above, multiple species may have been recorded at the same sampling event, and the extension data table will therefore contain multiple rows (one for each of the species recorded) each containig the ID for the relevant sampling event in the ID column of the extension data table. For data being submitted to publication to GBIF, extension data tables must belong to one of the supported extension types.
To make it easier to both create and manipulate the contents of Darwin Core archive files, the Living Norway package provides a number of functions and tools for data managers and researchers. The aim of the package and this documentation is to provide a more approachable interface to producing Darwin Core archives without requiring extensive knowledge of both the Darwin Core standard and EML such that the dissemination of data to meet FAIR standards can be more easily integrated into a researcher's workflow.
Before we can use any of the functions contained within the Living Norway package we must first install it. At the current time the easiest way to install the Living Norway package is to import the devtools package and use the "install_github" function to install the package directly from the project's GitHub repository. We hope to distribute future releases of the Living Norway package over CRAN and, once this is achieved, then installation of the Living Norway package will follow that standard R package installation procedure. In the meantime, the following code will install the necessary packages:
# Install the Living Norway package from the Git repository #devtools::install_github("https://github.com/LivingNorway/LivingNorwayR") # Import the tools into R library(LivingNorwayR)
Once the Living Norway package is installed and loaded, a number of classes are added to R that allow for the easier manipulation of Darwin Core archive data and the terms associated with them. What do we mean by 'classes'? All variables in R belong to a particular class of object. You will already be familiar with some R's base classes that can used for the handling of information such as data frames, lists, and vectors. The Living Norway package simply contains more class definitions that allow for easier manipulation of the information in Darwin Core archive files:
The event core table type is supported by the 'GBIFEvent' class in the Living Norway package. Similarly the the occurrence core table type is supported by the 'GBIFOccurrence' class and the species checklist core table type is supported by the 'GBIFTaxon' class. The 'getGBIFCoreClasses()' returns a full list of the GBIF core classes handled by the Living Norway R package along with their definition information (represented as 'DwCTerms' objects):
getGBIFCoreClasses()
The Living Norway packages also supports a large number of classes to handle the broad types of data tables that can be used as extension tables in Darwin Core archive files submitted to GBIF. The full list of extensions possible to use for GBIF-compliant Darwin Core archive files can be found on the Darwin Core Archive Validator website. The names of the Living Norway classes that can handle these extensions can be found by calling the 'getGBIFExtensionClasses()' function.
names(getGBIFExtensionClasses())
In order to import a Darwin Core archive file we need to first get hold of a Darwin Core archive file for a dataset that we wish to import. Typically one can download these files manually from using biodiversity database indexing facilities such as GBIF. For this example we will use an example dataset from the "Extensive monitoring of breeding birds" program (TOV-E), for which the Darwin Core archive file is housed on the IPT server at the Norwegian Institute for Nature Research. The overview for this dataset can be found on the GBIF dataset portal. To minimise the number of files needed to be distributed as part of this exercise, we can use the R code below to download and store the Darwin Core archive file in a temporary location.
# Create a temporary directory to store intermediate files used in this workshop tempDirLoc <- tempdir() # The URL where the Darwin Core file for the TOV-E bird survey data is housed datasetURL <- "https://ipt.nina.no/archive.do?r=tove_birdsampling" # Download the Darwin Core file to the temporary directory localDataLoc <- file.path(tempDirLoc, "TOVEData.zip") download.file(datasetURL, localDataLoc, mode = "wb")
Now that we have the Darwin Core archive file stored locally, we can now import it using the "initializeDwCArchive" function in the Living Norway package. We do this by calling the 'initializeDwCArchive' function. This function can be called in one of two different ways: it's first argument can be a location of a Darwin Core archive file (with an optional second argument being a default file encoding for importing the data) and is the easiest way for importing data that already exist as Darwin Core archive, or it can be called using Darwin Core tables constructed using the Living Norway package for the times when you are using the package to construct your own Darwin Core archives. This latter way to call the function will be covered in the later section on archive file creation.
# Create a DwCArchive object from the downloaded Darwin Core archive file TOVEArchive <- initializeDwCArchive(localDataLoc, "UTF-8")
In the code block above we specify the file UTF-8 file encoding. By default the encoding will be set to your system and, in most cases will not need to be changed from these defaults. In this example the metadata files in the archive contain a number of Norwegian characters that may be incorrectly imported if we use the default values of your system.
Now that the archive has been imported into a DwCArchive object then we can have a look at a summary of the contents:
# Print a summary of the data in the archive object
TOVEArchive
The top section of the output gives a brief summary of the metadata associated with the project. What follows the metadata is a summary of each of the data tables contained in the archive: firstly the core data table followed by each extension data table in the archive. The first line of the summary of each data table gives the name of the file in the archive (in this case the core file was called 'event'), followed by the column that serves as an ID column for the data table, and finally the class of data table (in the case of this example the core data table is of type 'Event' which corresponds to the GBIF event core category of archive files). Below this line is a list of each of the columns in the data table that have been linked to a standardised definition such as Darwin Core. These are columns that have a meaning that corresponds to a definition determined by a standardisation committee and for which the definition is publically accessible. Each row of this definition summary table starts with the location where the term definition can be found (in most cases an Internationalized Resource Identifier) followed by the column number and, if the data table has column names, the name of the column corresponding to this column number. Below the definition list is a snippet of the data contained in the data table (usually just the first six rows).
We can select out the core data table from the Darwin Core archive and even retrieve it as a data frame. Extracting information about objects can be done by using the R6 'method functions' defined for the classes defining those objects. Under this object model method functions are used by using the '$' notation after the object you want to call the method function for, followed by the name of the method function you want to call. For example, the DwCArchive class has the method function 'getCoreTable' that allows the user to extract the core table from the archive object. Once a particular data table is extracted then the user can use the method functions defined by the DwCGeneric class that allows for further manipulation of the individual data tables.
# Retrieve the core table from the archive object TOVEEventTable <- TOVEArchive$getCoreTable() # TOVEEvent table is an object of type GBIFEvent which is derived from DwCGeneric class(TOVEEventTable) # Export the contents of the event table to a data frame TOVEEventTableDF <- TOVEEventTable$exportAsDataFrame() # Lets look at the top few rows of the data frame extracted from the core data table head(TOVEEventTableDF)
Similarly, extension tables can be extracted from the archive by using the 'getExtensionTables' method function of the DwCArchive class. This function has one argument that is either an integer vector containing the indeces of the extension tables (in the order that they are displayed in the summary of the archive file) or a character vector giving the names of the tables in the archive file (those names are found in the first summary line of each data table when displaying the archive summary). This function returns a list of the data tables requested. Therefore to extract just the one we want, we'll need to further index the list with the '[[1]]' notation to extract the first element of the returned list.
# Retrieve the extension table from the archive object: two ways to do this # 1. Retrieve the extension table by using its index TOVEOccTable <- TOVEArchive$getExtensionTables(1)[[1]] # 2. Retrieve the extension table by using its name TOVEOccTable <- TOVEArchive$getExtensionTables("occurrence")[[1]] # The getExtensionTables functions returns a list of the data tables that are requested. Therefore to extract just the first element of this list we need # to use the extra '[[1]]' list extraction notation. # TOVEOccTable is an object of type GBIFOccurrence which is dervied from DwCGeneric class(TOVEOccTable) # Export the contents of the occurrence table to a data frame TOVEOccTableDF <- TOVEOccTable$exportAsDataFrame() # Lets look at the top few rows of the data frame extracted from the extension data table head(TOVEOccTableDF)
It is possible to extract elements from the archive metadata. The 'getMetadata' method of the DwCArchive class returns a DwCMetadata object. From this object it is possible to access elements of the metadata.
# Retrieve the metadata from the archive object TOVEMetadata <- TOVEArchive$getMetadata() # Retrieve the title of the data set TOVEMetadata$getTitle() # Retrieve the abstract/summary of the data set TOVEMetadata$getAbstract() # Retrieve the information on the data set creators TOVEMetadata$getCreatorInfo()
So far we've talked about how to import data being provided as a Darwin Core archive file but a key feature of the Living Norway package is to make it easier for researchers to go from their data (stored as data frames) to a complete Darwin Core archive. To demonstrate this, we use the data frames created that we extracted in the code blocks above and use this as a starting point for creating a Darwin core archive file. For most researchers this is a common starting point as their data sets are often represented by a collection of data tables that are either already data frame or that can be easily read into data frames. So, from this data frame starting point, we have two tables: 'TOVEEventTableDF' and 'TOVEOccTableDF'. Our first decision, is to designate one of the tables as the core data table. In this instance we already know that it is 'TOVEEventDF' but, in the case where you are processing your own data sets, the core data table is always the one that the sampling unit is based around. For most ecological experimental designs, the table denoting the sampling events (handled in the Living Norway package by the GBIFEvent class) is a natural core table. Under this format, each sampling event and its details, such as location and time of sampling, is described in the core table. Often what was recorded at each sampling event is described in the extension files. In situations where the only thing measured at each sampling event is the occurrence of a species then the table describing these occurrences would be designated as the core data table (corresponding to the GBIFOccurrence class in the Living Norway package). In some situations the dataset may not be based around occurrences or sampling events but simply a check list of species (or other taxonomic groups=. This latter case can be handled by using a core data table that lists these taxa and is handled in the Living Norway package with the 'GBIFTaxon' class.
So we've decided that the core data table should be the 'TOVEEventTableDF' data frame and that this data table is an event-based core. To initialise a GBIF-complaint event table we can use the 'initializeGBIFEvent' function. This function requires two arguments: the data frame making up the table and a column (given as either a column name or column number) to represent the ID information. If the table is going to be used as a core table then this ID column must contain unique values for each row and will serve as a key to link extension data tables to the core table. After these two mandatory arguments are given then the user must specify how the columns (if any) relate to definitions in the Darwin Core standard for that data table type. This can be done in one of two ways: either the data frame can have column names that correspond to Darwin Core terms relevant to the data table type and then the user can simply add the argument 'nameAutoMap = TRUE' to the initialisation function and it will look for any column names that correspond to Darwin Core terms, or the user can add arguments with names corresponding to each Darwin Core term and set that argument either as a column name or column number. The 'getGBIFEventMembers' function returns a list of all the Darwin Core terms associated with GBIF-compliant event data tables.
# Look at the Darwin Core terms associated with event data tables (here we've shortened it to the first 6 entries so that the output is not too long) getGBIFEventMembers()[1:6] # Call the initialisation function using the two different methods: # 1. Using the automatic mapping method newTOVEEventTable <- initializeGBIFEvent(TOVEEventTableDF, "id", nameAutoMap = TRUE) # 2. Using the manual mapping method newTOVEEventTable <- initializeGBIFEvent(TOVEEventTableDF, "id", # What follows is a list of arguments giving Darwin Core terms followed by the column name (or number) # in the data frame that corresponds to those terms. In this example it doesn't make much sense to # call the initialisation function in this manner because all the column names correspond to # Darwin terms anyway (so much easier to use the first method). However, this alternative method # can be useful if the column names are different from the Darwin Core terms that they represent or # for data frame that don't have column names (in which case column numbers can be given as the # argument values instead). type = "type", modified = "modified", datasetName = "datasetName", ownerInstitutionCode = "ownerInstitutionCode", informationWithheld = "informationWithheld", dataGeneralizations = "dataGeneralizations", eventID = "eventID", samplingProtocol = "samplingProtocol", sampleSizeValue = "sampleSizeValue", sampleSizeUnit = "sampleSizeUnit", samplingEffort = "samplingEffort", eventDate = "eventDate", eventTime = "eventTime", year = "year", month = "month", day = "day", locationID = "locationID", country = "country", countryCode = "countryCode", stateProvince = "stateProvince", municipality = "municipality", locality = "locality", minimumElevationInMeters = "minimumElevationInMeters", maximumElevationInMeters = "maximumElevationInMeters", decimalLatitude = "decimalLatitude", decimalLongitude = "decimalLongitude", geodeticDatum = "geodeticDatum", coordinateUncertaintyInMeters = "coordinateUncertaintyInMeters") # We can then check to see if the terms are mapped correctly # Terms with NA in the column index represent Darwin Core terms associated with the event data table that are not mapped # Not all terms need to be mapped to make a valid Darwin Core archive newTOVEEventTable$getTermMapping() # We can also check that the correct ID column is being used # Returns the index of the ID column newTOVEEventTable$getIDIndex() # Returns the name of the ID column (if the data table has column names) newTOVEEventTable$getIDName()
The next step is to perform the same term mapping for each of the extension data tables. In this instance we know that our extension data table should be of 'GBIFOccurrence'. However, how do you find the relevant extension data type for your extension data? GBIF supports a large number of extension data types and to go into them all would be outside the scope of this document. However, the 'getGBIFExtensionClasses' function will retrive the definition information associated with each extension class with a small description. More detailed descriptions of the extension classes can be found on the Darwin Core Validator website but the output from the 'getGBIFExtensionClasses' function will at least give some hints as to which extension class may be suitable for your data type.
# Look at some of the supported GBIF extension classes (here we've shorted it to the first six entries so that the output is not too long) getGBIFExtensionClasses()[1:6]
We can initialise the occurrence extension table for our data set in a similar manner to how we initialised the event core data type.
# Look at the Darwin Core terms associated with occurrence data tables (here we've shortened it to the first 6 entries so that the output is not too long) getGBIFOccurrenceMembers()[1:6] # Call the initialisation function newTOVEOccTable <- initializeGBIFOccurrence(TOVEOccTableDF, "id", nameAutoMap = TRUE) # We can then check to see if the terms are mapped correctly # Terms with NA in the column index represent Darwin Core terms associated with the event data table that are not mapped # Not all terms need to be mapped to make a valid Darwin Core archive newTOVEOccTable$getTermMapping()
In Darwin Core archive files the standard format for handling metadata is EML, a flavour of XML specifically designed for the handling of ecological metadata. We can extract the EML file for the TOV-E data from the Darwin Core archive file using the following code:
# Extract the EML file from the Darwin Core archive file unzip(localDataLoc, "eml.xml", exdir = tempDirLoc) # Print the first few lines of the EML file to get an idea of the structure of the file cat(readLines(con = file.path(tempDirLoc, "eml.xml"), n = 20, encoding = "UTF-8"), sep = "\n")
It is also possible to look at the entire EML file in a seperate browser by running the following code:
browseURL(file.path(tempDirLoc, "eml.xml"))
From the EML file we can generate a DwCMetadata object by calling the 'initializeDwCMetadata' function. This function takes one argument, which is the location of the file to import the metadata information from. This can be an EML file or a Darwin Core archive file.
# Initialise the metadata from the EML file extracted from the Darwin Core archive newTOVEMetadata <- initializeDwCMetadata(file.path(tempDirLoc, "eml.xml"), fileType = "eml" # This line is not required if the file has the ".xml" file extension ) # Alternatively the metadata object can be imported directly from the Darwin Core archive file newTOVEMetadata <- initializeDwCMetadata(localDataLoc, fileType = "darwincore" # This line is not required if the file has the ".zip" file extension )
However, these methods assume that the researcher already has an EML file that can be used for initialisation of the metadata object. In many instances, the researcher will not have the EML readily available, and in situations where the researcher is unfamiliar with the standard, the process to create an EML can be rather laborious. To help alleviate this, the Living Norway adds a number of functions to aid in the creation of EML files. Instead of formatting the metadata according to the EML standard, the researcher can instead use R markdown to create a text document describing the dataset. In this text document the researcher can simply add tagging functions to sections of text that they wish to be exported to the EML file. (It is important to note here that there are a few different R packages that one can use to make valid EML (see: EML with R) all of which have different challenges associated with them.)
R markdown is a simple text-based documentation language and can be thought of as an extension of simple text files with some extra support for formatting and display of text elements. In addition R markdown allows for the embedding of R code within the document which can be used to draw figures or make tables from data. This can be very useful for describing or displaying aspects of data sets.
A very simple minimal example of the use of markdown to generate create metadata documentation can be found at the Living Norway Git Repository.
# Initialise the metadata from the R markdown file hosted at TODO download.file("https://raw.githubusercontent.com/LivingNorway/LivingNorwayR/master/vignettes/LNWorkshopExample_TOV-E_Metadata.rmd", file.path(tempDirLoc, "LNWorkshopExample_Metadata.rmd")) createdTOVEMetadata <- initializeDwCMetadata(file.path(tempDirLoc, "LNWorkshopExample_Metadata.rmd"), fileType = "rmarkdown" # This line is not required if the file has the ".rmd" or ".md" file extension ) # Export the newly created metadata as an EML file createdTOVEMetadata$exportToEML(file.path(tempDirLoc, "newMetadata.xml"))
and again the created metadata can be viewed in a browser using the following command:
browseURL(file.path(tempDirLoc, "newMetadata.xml"))
Now we have all the components that we need to package the data tables and the metadata together into one Darwin Core archive object. This can be done through calling the initializeDwCArchive
function giving the core table as the first argument, a list of all the extension tables as the second argument, and a DwCMetadata
object as the third argument.
newTOVEArchive <- initializeDwCArchive(newTOVEEventTable, list(newTOVEOccTable), newTOVEMetadata)
Finally, we can then export it as a to whichever location we wish to store it to.
newTOVEArchive$exportAsDwCArchive(file.path(tempDirLoc, "newDwCArchive.zip"))
This Darwin Core archive file can now serve as a useful interchange format that ensures that all the data and metadata are packaged together and that they are both described used known biodiversity standards. Thus satisfying the basic tenants of FAIR data sharing and allowing your data to be indexed by biodiversity databases such as GBIF.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.