library(knitr)
hook_output <- knit_hooks$get("output")
knit_hooks$set(output = function(x, options) {
    lines <- options$output.lines
    if (is.null(lines)) {
        return(hook_output(x, options))  # pass to default hook
    }
    x <- unlist(strsplit(x, "\n"))
    more <- "..."
    if (length(lines) == 1) {  # first n lines
        if (length(x) > lines) {
            # truncate the output, but add ...
            x <- c(head(x, lines), more)
        }
    } else {
        x <- c(more, x[lines], more)
    }
    # paste these lines together
    x <- paste(c(x, ""), collapse = "\n")
    hook_output(x, options)
})

The safedata package makes it easy to search for and use datasets collected at the SAFE Project. It provides an interface to download data files and packaged record metadata and then functions to load data worksheets and add taxonomic and spatial data where available.

For further information on the publication and structure of data through the SAFE Project and within the safedata package, see the Overview vignette: vignette("overview", package = "safedata").

Installing safedata

The safedata package is available from CRAN:

install.packages("safedata")

The development version can also be installed from GitHub:

devtools::install_github("ImperialCollegeLondon/safedata")

Package dependencies

The safedata package requires the following packages:

The SAFE data directory

The safedata package makes use of a local directory to store downloaded data, index and metadata files (see vignette("overview", package = "safedata") for details). These files are needed for the safedata functions to work correctly, so the first step in using safedata is to set the location of this directory; the package will remind you to do this when it is loaded.

library(safedata)

Initialising a SAFE data directory

If this is the first time you are loading safedata -- or if you simply want to set up a second, separate SAFE data directory -- then you need to create a new, empty directory.

set_safe_dir('my_safe_directory', create=TRUE)
## Safe data directory created

This will create the directory and download the current index files. You cannot use an existing directory: the package wants to start with a fresh, empty directory. Note that the directory path is stored in options():

options('safedata.dir')
## $safedata.dir
## [1] "my_safe_directory"

Using an existing SAFE data directory

Once you have a SAFE data directory, the same function is used to tell the safedata package where to look for index and data files:

set_safe_dir('my_safe_directory')
## Checking for updates
##  - Index up to date
##  - Gazetteer up to date
##  - Location aliases up to date
## Validating directory

You will see that this function checks with the SAFE Project website for updates to the key index files. This can be turned off for offline use (set_safe_dir('~/my_safe_directory', update=FALSE)). The function also validates local data files: it checks the MD5 hash of local data file copies against the MD5 of the published file. You can suppress validation (set_safe_dir('~/my_safe_directory', validate=FALSE)), but this is not advised: altering the contents of a published data file undermines reproducible research.
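For example, a minimal sketch of offline use that simply combines the two inline calls above into a single call (assuming the directory has already been created):

# Point safedata at an existing directory without checking for index updates
# or validating local file hashes (not recommended for routine use)
set_safe_dir('my_safe_directory', update = FALSE, validate = FALSE)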

safe_dir <- system.file('example_data_dir', package='safedata')
set_safe_dir(safe_dir)

Finding data

You can browse the datasets published by the SAFE Project either on the SAFE Project website or through the SAFE community on Zenodo.

If you have done this, or have a dataset DOI from another source, then you can look up the dataset directly.

However, if you want to search the dataset metadata or the taxa and locations covered by datasets, then the safedata package provides a set of built-in search functions.

Search functions

The safedata package contains a set of search functions to explore datasets. These functions make use of a metadata index stored on the SAFE Project website and so need an internet connection to work. They provide structured access to the same metadata shown in the dataset descriptions, and also support extended taxonomic and spatial searches.

The functions include search_text(), search_taxa() and search_spatial(), each of which is described in more detail below.

All of these functions return a safe_record_set object, which is just a data frame containing validated record ids and access information, so you can use the normal data frame indexing (e.g. recs[1, ]) to select particular records.

soil_datasets <- search_text('soil')
print(soil_datasets)
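As noted above, a safe_record_set behaves like a data frame, so ordinary indexing can be used to pull out individual records; a small sketch using the search results from the chunk above:

# Select the first matching record using standard data frame indexing
first_soil <- soil_datasets[1, ]
print(first_soil)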

Taxon search details

Published datasets contain a taxonomic index of any organisms referred to within the data, stored in the dataset's Taxa worksheet.

The taxa in this index, along with all of the parent taxa in the taxonomic hierarchy leading up to them, are added to a taxonomic database on the SAFE Project website. The search_taxa() function searches that index to identify all the datasets that contain a particular taxon.

print(ants <- search_taxa('Formicidae'))

The taxonomic index is built around the GBIF backbone taxonomy and includes the following core taxonomic levels: kingdom, phylum, class, order, family, genus, species and subspecies. It is also possible to search by GBIF ID.

ants <- search_taxa(gbif_id=4342)

Spatial search details

Datasets also have to provide a full index of the sampling locations used in the data. Sampling locations are either linked to existing locations in the SAFE gazetteer, or data providers can define new sampling locations and supply location data where possible.

The search_spatial() function allows users to search for datasets by sampling location. Accepted location names from the gazetteer can be used to search for datasets, but users can also provide their own search geometries in Well Known Text (WKT) format. The search includes simple GIS capabilities to look for sampling within a given distance of the query location.

# Datasets that include sampling within experimental block A
within_a <- search_spatial(location='BL_A')
# Datasets that sampled within 2 km of the Maliau Basin Field Study Centre
near_maliau <- search_spatial(wkt='POINT(116.97394 4.73481)', distance=2000)

Note that WKT coordinates should be supplied as WGS84 longitude and latitude - typically the output of GPS receivers - but the database uses the local UTM 50N projected coordinate system for all distance calculations and GIS operations.
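Because the paragraph above mixes two coordinate systems, a minimal sketch outside of safedata itself may help: the sf package can show what the WGS84 query point looks like in the UTM 50N system (EPSG codes 4326 and 32650 respectively).

library(sf)
# The Maliau Basin query point above, as a WGS84 (EPSG:4326) point geometry
maliau <- st_sfc(st_point(c(116.97394, 4.73481)), crs = 4326)
# Reproject to UTM zone 50N (EPSG:32650), the system used for distance calculations
st_transform(maliau, 32650)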

Look up a specific dataset

Datasets are identified by their record number, which is the number included in both the dataset DOI and the Zenodo URL. All of the following point to the same dataset: the record number on its own, the full dataset DOI, and the Zenodo URL for the record.

Note that all metadata is available for all records, regardless of whether they are open, embargoed or restricted. This includes field descriptions and taxon and location sampling so that users can assess whether a dataset is going to be useful even if it is not yet openly available.

Once you have the details of a dataset you are interested in, you can validate a dataset reference to access metadata and available data using the validate_record_ids() function. This function does the following:

  1. checks that the record is valid,
  2. checks whether the record number is a record id, referring to a specific version of a dataset, or a concept id, which identifies all the versions of a dataset. In the example code below, two of the values are record ids, so the appropriate concept id is located and printed, and one is a concept id, so no specific version number is given.
  3. checks whether the data are currently available, and
  4. provides an interface to download and import the related data files.

Like the search functions, the output is a safe_record_set object. Note that you can validate multiple references at once.

recs <- validate_record_ids(c('https://doi.org/10.5281/zenodo.3247631',
                              '10.5281/zenodo.3266827',
                              'https://zenodo.org/record/3266821'))
print(recs)

In addition, all of the main functions in safedata that expect to be passed a dataset id will run validate_record_ids() on their inputs, so you can use those DOIs and URLs directly with those functions without needing to create a safe_record_set yourself.
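For example, the show_record() function described in the next section accepts either form; a minimal sketch, assuming the records validated above:

# These two calls refer to the same record: the URL is validated internally
show_record('https://zenodo.org/record/3266821')
show_record(recs[3, ])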

Displaying dataset metadata

Printing a safe_record_set object displays a deliberately compact summary of a set of record ids. There are three functions that show the detailed metadata for records at three levels:

show_concepts()

The show_concepts() function displays concept level metadata about a set of record ids. This includes the (most recent) dataset title and a short summary of the versions available under the dataset concept. Note that the output is not restricted just to the set of record ids given to the function: it shows metadata for all versions of each of the concept ids included.

show_concepts(recs)

show_record()

This function shows metadata for a specific version of a dataset: if you give it a concept ID then it will display the available versions for that concept. Otherwise, the function prints out information about the dataset with that record id: it includes the dataset title, status and other dataset level metadata and then a summary of the data worksheets contained in the dataset.

Note that - because a safe_record_set is just a data frame with some extra information attached - you can use the usual data frame indexing to select a row to pass to other functions. Running show_record() also requires an internet connection: the package downloads a JSON file of the record metadata and stores it in the SAFE data directory.

show_record(recs[3,])

show_worksheet()

This function shows metadata for a named worksheet within a specific record. The default is to show a compact table of field names, field types and truncated descriptions:

show_worksheet(recs[3,], 'Data')

There is also an extended display (extended_fields=TRUE) that will print out a list of all the available metadata for each field.

show_worksheet(recs[3,], 'Data', extended_fields=TRUE)

Downloading data

Once you have found records whose data you want to explore, you first need to download the data files for those datasets from Zenodo. This uses the download_safe_files() function, which can be given a URL or record number for a dataset or an existing safe_record_set. The function will check which datasets are currently available and download them to the SAFE data directory. The default behaviour is to present a brief report on the number and size of available files to be downloaded before actually doing anything:

download_safe_files(within_a)
## 26 files requested from 26 records
##  - 0 local (0 bytes)
##  - 4 embargoed or restricted (2.2 Mb)
##  - 22 to download (43.6 Mb) 
## 
## 1: Yes
## 2: No
## 
## Selection:

By default, the download_safe_files() function downloads all of the files associated with a record. This will include external data files, which may contain primary data not suited to the Excel format or additional supporting information. Although many external files are likely to be readable in R, the safedata package does not currently provide a mechanism to load them automatically. The function will also download the JSON metadata for the specified datasets.

The function will warn you if the local copies of data files have been altered and the refresh=TRUE argument can be used to restore data files to the version of record. Note that this will delete local changes.
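As a minimal sketch, using the search results from earlier, restoring altered files would look like this; remember that this overwrites any local changes:

# Re-download any altered local copies of the data files for these records
download_safe_files(within_a, refresh = TRUE)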

Loading data

The load_safe_data() function is used to load a named data worksheet from a dataset into a safedata object. This is just a data frame with some additional attribute data and it will in general behave just like any other data frame - the additional attributes are used for further data processing and adding brief metadata to the str and print methods.

Some data formatting takes place based on field types: categorical variables are converted to factors; dates and datetimes are converted to POSIXct and times are converted to chron::time objects.

beetle_abund <- load_safe_data(1400562, 'Ant-Psel')
str(beetle_abund)
print(beetle_abund)

The display of safedata objects is kept deliberately simple to avoid cluttering the screen with metadata. You can always view additional metadata for a loaded worksheet by using the show functions directly on the loaded safedata object:

show_concepts(beetle_abund)
show_record(beetle_abund)
show_worksheet(beetle_abund)

Dataset taxa

There are three functions that can be used to work with the taxa in a dataset:

  1. get_taxa(): This function loads a dataframe containing all of the taxa used within a dataset, with fields including the core GBIF taxonomic levels, the taxonomic label used within the dataset and the taxonomic status of each taxon. You can load a taxonomic dataframe from a safe_record_set row or from an existing loaded safedata object.

    beetle_taxa <- get_taxa(beetle_abund)
    str(beetle_taxa)

  2. add_taxa(): This function adds taxonomic details to an already loaded data worksheet.

    beetle_morph <- load_safe_data(1400562, 'MorphFunctTraits')
    beetle_morph <- add_taxa(beetle_morph)
    str(beetle_morph)

  3. get_phylogeny(): This function creates a phylogeny from the taxonomic data in a dataset, returning an object of class phylo (see the ape package).

    library(ape)
    beetle_phylo <- get_phylogeny(1400562)
    plot(beetle_phylo, show.node.label=TRUE, font=1, no.margin=TRUE)

In addition, the get_taxon_coverage() function can be used to get a taxon table of all taxa currently referenced in datasets.

all_taxa <- get_taxon_coverage()
str(all_taxa)

Dataset locations

Nearly all SAFE datasets include observations at spatial locations, and these datasets must include a Locations worksheet used as a spatial index for research activities. There are three functions that can be used to work with locations in a dataset. All of these functions use the sf package, which provides an extensive toolset for further spatial analysis, to represent the GIS geometry of locations.

  1. load_gazetteer(): The gazetteer is one of the three key index files saved in the SAFE data directory and updated when set_safe_dir() is run. It includes sampling locations drawn from across a wide range of projects running at SAFE and is intended to hold all locations that are likely to see repeated sampling. Locations included in the gazetteer can be used directly as known locations in datasets, although data providers can also include new locations.

    gazetteer <- load_gazetteer()
    print(gazetteer)

  2. get_locations(): This function returns an sf object containing the locations used within a dataset. For known locations, the GIS data for the location are taken directly from the gazetteer. If the locations are new sampling sites, then the GIS data provided in the dataset are used. Note that it is possible for dataset providers to create new locations without including GIS data - these will be represented using empty GIS geometries.

    By default, the returned sf object will only include the location name used in the dataset, the gazetteer name for known sampling sites and an indication of whether the location is new or known, but gazetteer_info=TRUE can be used to include the gazetteer attributes for known locations (see the sketch after this list).

    library(sf)
    beetle_locs <- get_locations(1400562)
    print(beetle_locs)
    fragments <- subset(gazetteer, type=='SAFE forest fragment')
    par(mar=c(3,3,1,1))
    plot(st_geometry(fragments), col='khaki', graticule=TRUE)
    plot(st_geometry(beetle_locs), add=TRUE, col='red', pch=4)

  3. add_locations(): This function adds location data to an already loaded worksheet. The result is a safedata object that is also an sf object.

    beetle_env <- load_safe_data(1400562, 'EnvironVariables')
    beetle_env <- add_locations(beetle_env)
    print(beetle_env)
    plot(beetle_env['Cover'], key.pos=4, breaks=seq(0,100, by=5))
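As noted for get_locations() above, the gazetteer attributes for known locations can also be returned. A minimal sketch, assuming the same dataset and the gazetteer_info argument described earlier:

beetle_locs_full <- get_locations(1400562, gazetteer_info = TRUE)
str(beetle_locs_full)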


