library(knitr) # Knit hook to truncate output. hook_output <- knit_hooks$get("output") knit_hooks$set(output = function(x, options) { lines <- options$output.lines if (is.null(lines)) { return(hook_output(x, options)) # pass to default hook } x <- unlist(strsplit(x, "\n")) more <- "..." if (length(lines) == 1) { # first n lines if (length(x) > lines) { # truncate the output, but add .... x <- c(head(x, lines), more) } } else { x <- c(more, x[lines], more) } # paste these lines together x <- paste(c(x, ""), collapse = "\n") hook_output(x, options) }) # Comments as #> not ## knitr::opts_chunk$set( comment = "#>" )
# All the API and download calls for remote data used in building this # vignette is mocked and served up locally from the inst/api_data directory. safedata:::mock_api(on = TRUE)
The safedata
package makes it easy to search for and use datasets collected at
the SAFE Project. It provides an interface to download data files and packaged
record metadata and then functions to load data worksheets and add taxonomic and
spatial data where available.
For further information on the publication and structure of data through the
SAFE Project and within the safedata
package, see the Overview vignette:
vignette("overview", package = "safedata")
.
safedata
The safedata
package is available from CRAN:
install.packages("safedata")
The development version can also be installed from GitHub:
devtools::install_github("ImperialCollegeLondon/safedata@develop")
The safedata
package requires the following packages:
readxl
, to read data directly from Excel datasets,jsonlite
, httr
and curl
, to communicate with the SAFE API and to read
downloaded JSON data,chron
, to represent time of day data, igraph
, to handle taxonomic data structures, andsf
, to handle spatial data about sampling locations.The safedata
package makes use of a local directory to store downloaded data,
index and metadata files (vignette("overview", package = "safedata")
for
details) . These files are needed for the safedata
functions to work
correctly, so the first step in using safedata
is to set the location of the
directory and the package will remind you to do this when it is loaded.
library(safedata)
If this is the first time you are loading safedata
-- or if you simply want to
have two separate SAFE data directories -- then you need to create a new, empty
safedata directory. You will need the URL of the server that holds the dataset
metadata for your data community.
# Create a temporary location for use in the vignette safe_dir_path <- file.path(tempdir(), "my_safe_directory") # Create a new safedata directory create_safedata_dir( safe_dir_path, url = "http://example.safedata.server" )
This will create the directory and download the current index files. You cannot
use an existing directory: the package wants to start with a fresh, empty
directory. Note that the directory path is stored in options()
:
getOption("safedata.dir")
Once you have a SAFE data directory, the set_safedata_dir
function is used to tell the
safedata
package where to look for index and data files:
set_safedata_dir(safe_dir_path)
You will see that this function checks with the metadata server for updates
to the key index files. This can be turned off for offline use
(set_safedata_dir('~/my_safe_directory', update=FALSE)
). The function also
validates local data files: it checks the MD5 hash of local data file copies
against the MD5 of the published file.
You can browse datasets published using the safedata
system either at the
project Zenodo site or from its metadata server. For example, for the SAFE Project
these links are:
If you have done this, or have a dataset DOI from another source, then you can look up the dataset directly.
However, if you want to search the dataset metadata or the taxa and locations
covered by datasets then there are a set of search functions built into the
safedata
package.
The safedata
package contains a set of search functions to explore datasets.
These functions make use of a metadata index stored on the SAFE Project website
and so need an internet connection to work. These search functions provide
structured access to the same metadata shown in project description text but
also provide extended taxonomic and spatial searches.
The functions are:
search_text()
: free text search of dataset and worksheet titles and descriptions.search_fields()
: searches worksheet field names and descriptions and field types.search_authors()
: searches dataset authorssearch_dates()
: searches for overlap with the start and end date of a dataset.search_taxa()
: searches for datasets containing particular taxa.search_spatial()
: searches for datasets sampling at or near a particular location.All of these functions return a safe_record_set
objects, which is just a data
frame containing validated record ids and access information and so you can use
the normal data frame indices (e.g. recs[1,]
) to select particular records.
soil_datasets <- search_text("soil")
print(soil_datasets)
Published datasets contain a taxonomic index of any organisms referred to within the data - see here for details of the Taxa worksheet containing the index.
The taxa in this index, along with all of the parent taxa in the taxonomic
hierarchy leading up to that those taxa, are added to a taxonomic database on
the SAFE Project website. The search_taxa()
function searches that index to
identify all the datasets that contain a particular taxon.
ants <- search_taxa("Formicidae")
print(ants)
The taxonomic index is built around the GBIF and NCBI taxonomic databases. The GBIF database only uses the following core taxonomic levels: kingdom, phylum, class, order, family, genus, species and subspecies. Although the NCBI uses a much wider range of taxonomic levels, we only include these core backbone ranks for NCBI taxa, although we add superkingdom, which is used for bacteria and viruses.
It is also possible to search the taxon index using the taxon index number from one of GBIF or NCBI. For example, to search for the GBIF ID for Formicidae (4234) .
ants <- search_taxa(taxon_id = 4342, taxon_auth = "GBIF")
Datasets also have to provide a full index of sampling locations used in the data. Sampling locations are either linked to existing sampling locations included in the SAFE gazetteer or users can identify new sampling locations and provide location data if possible.
The search_spatial()
function allows users to search for datasets by sampling
locations. Accepted location names from the gazetteer can be used to search for
datasets but users can also provide their own search geometries using the Well
Known Text format. The search includes simple GIS capabilities to look for
sampling within a given distance of the query location.
# Datasets that include sampling within the location of experimental block A within_a <- search_spatial(location = "BL_A") # Datasets that sampled within 2 km of the Maliau Basin Field Study Centre near_maliau <- search_spatial(wkt = "POINT(116.97394 4.73481)", distance = 2000)
Note that WKT coordinates should be supplied as WGS84 longitude and latitude - typically the output of GPS receivers - but the database uses the local projected coordinate system for all distance calculations and GIS operations.
It is possible to combine searches using logical operators (&
and |
), which
simply find the intersection and unions of sets of SAFE records.
# Three searches fish <- search_taxa("Actinopterygii") odonates <- search_taxa("Odonata") ewers <- search_authors("Ewers") # Logical combinations aquatic <- fish | odonates aquatic_ewers <- aquatic & ewers all_in_one <- (fish | odonates) & ewers
Another approach is to restrict the records that will be searched in the online database. You can pass a search result into a second search result to only search within those records. This cuts down on the amount of information that has to be retrieved from the server, so might be faster on poor networks. However, this obviously only works for repeated narrowing of a search, so has less functionality.
fish <- search_taxa("Actinopterygii") ewers <- search_authors("Ewers", ids = fish)
Datasets are identified by their record number, which is the number included in both the dataset DOI and Zenodo URL. All of the following point to the same dataset:
Note that all metadata is available for all records, regardless of whether they are open, embargoed or restricted. This includes field descriptions and taxon and location sampling so that users can assess whether a dataset is going to be useful even if it is not yet openly available.
Once you have the details of a dataset you are interestested in then you can
validate a dataset reference to access metadata and available data, using the
validate_record_ids()
function. This function does the following:
Like the search
functions, the output is safe_record_set
object. Note that
you can validate multiple references at once.
recs <- validate_record_ids(c( "https://doi.org/10.5281/zenodo.3247631", "10.5281/zenodo.3266827", "https://zenodo.org/record/3266821" )) print(recs)
In addition, all of the main functions in safedata
that expect to be passed a
dataset id will run validate_record_ids()
on their inputs, so you can simply
use those URLs directly with those functions without needing to specifically
create a safe_record_set
yourself.
Printing a safe_record_set
object displays a deliberately compact summary of a
set of record ids. There are three function that show the detailed metadata for
records at three levels:
show_concepts()
The show_concepts()
function displays concept level metadata about a set of
record ids. This includes the (most recent) dataset title and a short summary of
the versions available under the dataset concept. Note that the output is not
restricted just to the set of record ids given to the function: it shows
metadata for all versions for each of the concept ids included.
show_concepts(recs)
show_record()
This function shows metadata for a specific version of a dataset: if you give it a concept ID then it will display the available versions for that concept. Otherwise, the function prints out information about the dataset with that record id: it includes the dataset title, status and other dataset level metadata and then a summary of the data worksheets contained in the dataset.
Note that - because a safe_record_set
is just a data frame with some extra
information attached - you can use the usual data frame indexing to select a row
to pass to other functions. Running show_record()
also requires an internet
connection when a record is first examined: the package downloads a JSON file
of the record metadata and stores it in the SAFE data directory.
show_record(1400562)
show_worksheet()
This function shows metadata for a named worksheet within a specific record. The default is to show a compact table of field names, field types and truncated descriptions:
show_worksheet(1400562, "EnvironVariables")
There is also an extended display (extended_fields=TRUE
) that will print out a
list of all the available metadata for each field.
show_worksheet(1400562, "EnvironVariables", extended_fields = TRUE)
Once you have found records for which you want to explore the actual data, then you need to download the data files for the dataset from Zenodo to work with it. There are two ways to do this
The download_safe_files()
function can be used to download the data files for the
data records in a safe_record_set
, such as those created using
a search. Alternatively, you can use the URL of a safedata record or simply use the
record id for the dataset.
The function will check which datasets are currently available and download them to
the safedata
directory. The default behaviour is to present a brief report on the
number and size of available files to be downloaded and ask for confirmation before
actually doing anything. The confirm=FALSE
argument can be used to download
without confirmation, which can be useful in non-interactive use. The function will
also download the JSON metadata for the specified datasets.
download_safe_files(within_a) ## 26 files requested from 26 records ## - 0 local (0 bytes) ## - 4 embargoed or restricted (2.2 Mb) ## - 22 to download (43.6 Mb) ## ## 1: Yes ## 2: No ## ## Selection:
By default, the download_safe_files()
function downloads all of the files
associated with the record. In addition to the core Excel file, this will also
download any additional data files which may contain primary data that is
either not suited to the Excel format or which provides additional
information, such as PDFs. Although some additional file formats are likely to
be readable in R, thesafedata
package does not currently provide a mechanism
to load them automatically.
download_safe_files(3697804, xlsx_only = FALSE) # 2 files requested from 1 records # - 1 local (11.3 Kb) # - 0 embargoed or restricted (0 bytes) # - 1 to download (74.1 Kb) # # 1: Yes # 2: No # # Selection: 1 # 2 files for record 3697804: 1 to download # - Downloaded: Sampling_area_borders.xlsx
The function will warn you if the local copies of data files have been altered
and the refresh=TRUE
argument can be used to restore data files to the version
of record. Note that this will delete local changes.
The load_safe_data
function is used to load a named data worksheet into R.
If the datafile containing the worksheet is not currently downloaded, then
load_safe_data
will automatically download the file if available. In this
example, confirm=FALSE
is passed onto download_safe_files
to remove the
interactive confirmation before downloading.
beetle_abund <- load_safe_data(1400562, "Ant-Psel", confirm = FALSE)
safedata
objectOnce a worksheet is loaded, it is stored as an object of class safedata
.
This is just a data frame with some additional attribute data and it will in
general behave just like any other data frame - the additional attributes are
used for further data processing and adding brief metadata to the str
and
print
methods.
Some data formatting takes place based on field
types:
categorical variables are converted to factors; dates and datetimes are
converted to POSIXct
and times are converted to chron::time
objects.
str(beetle_abund) print(beetle_abund)
The display of safedata
objects is kept deliberately simple to avoid
cluttering the screen with metadata. You can always view additional metadata for
a loaded worksheet by using the show
functions directly on the loaded
safedata
object:
show_concepts(beetle_abund) show_record(beetle_abund) show_worksheet(beetle_abund)
There are a number of functions that can be used to work with the taxa in a dataset. Some generate simple tables of taxa:
get_taxa()
: This function loads a dataframe containing all of the taxa
used within a dataset, with fields including the core GBIF taxonomic levels,
the taxonomic label used within the dataset and the taxonomic status of the
each taxon. You can load a taxonomic dataframe from a safe_record_set
row
or using an existing loaded safedata
object.
r
beetle_taxa <- get_taxa(beetle_abund)
str(beetle_taxa)
add_taxa()
: This function adds taxonomic details to an already loaded data worksheet.
r
beetle_morph <- load_safe_data(1400562, "MorphFunctTraits")
beetle_morph <- add_taxa(beetle_morph)
str(beetle_morph)
The get_taxon_coverage
function can be used to get a taxon table of all
taxa currently referenced in all datasets.
r
all_taxa <- get_taxon_coverage()
The other functions convert taxon tables into graphs (vertices and edges) and phylogenetic trees using the GBIF taxonomic backbone to represent phylogeny. The data validation should ensure that the taxa in a dataset can be connected as a phylogenetic tree, but this is not always the case. For this reason, these functions use a more general graph conversion and then checks whether the result can be converted to a phlyogenetic tree. Technically, that is checking that the graph is a connected, simple, directed acyclic graph.
The get_taxon_graph
function converts a taxon table to a graph. The
function also does some extra validation, and will give warnings when the
taxon table has to be adjusted to represent the taxonomic hierarchy and
worksheet taxa properly. The result is an object of class igraph
and the
vertices have attributes containing the original table data.
The igraph_to_phylo
function tests whether a taxon graph can be converted
to a phylogeny and returns a phylo
object (package ape
). The tips and
internal nodes are labelled with taxon names but the phylo
structure does
not store other node information.
get_phylogeny()
: This is simply a wrapper that runs both functions above
in turn to go straight to a phylogeny.
r
library(ape)
beetle_phylo <- get_phylogeny(1400562)
plot(beetle_phylo, show.node.label = TRUE, font = 1, no.margin = TRUE)
Nearly all SAFE datasets will include observations at spatial locations, and
these datasets must include a Locations worksheet used as an spatial index for
research activities. There are three functions that can be used to work with
locations in a dataset. All of these functions use the sf
package to represent
the GIS geometry of locations and which provides an extensive toolset for
further spatial analysis.
load_gazetteer()
: The gazetteer is one of the three key index files saved
in the SAFE data directory and updated when set_safedata_dir()
is run. It
includes sampling locations drawn from across a wide range of projects
running at SAFE and is intended to hold all locations that are likely to see
repeated sampling. Locations included the gazetteer can be used directly as
known locations in datasets, although data providers can also include
new locations.
r
gazetteer <- load_gazetteer()
print(gazetteer)
get_locations()
: This function returns an sf
object containing the
locations used within a dataset. For known locations, the GIS data for the
location are taken directly from the gazetteer. If the locations are new
sampling sites, then GIS data provided in the dataset is used. Note that it
is possible for dataset providers to create a new locations without including
GIS data - these will be represented using empty GIS geometries.
By default, the returned sf
object will only include the location name
used in the dataset, the gazetteer name for known sampling sites and an
indication of whether the location is new or known, but
gazetteer_info=TRUE
can be used to include the gazetteer attributes for
known locations.
r
library(sf)
beetle_locs <- get_locations(1400562)
print(beetle_locs)
fragments <- subset(gazetteer, type == "SAFE forest fragment")
par(mar = c(3, 3, 1, 1))
plot(st_geometry(fragments), col = "khaki", graticule = TRUE)
plot(st_geometry(beetle_locs), add = TRUE, col = "red", pch = 4)
add_locations()
: This functions adds location data to an already loaded
worksheet. The result is a safedata
object that is also an sf
object.
r
beetle_env <- load_safe_data(1400562, "EnvironVariables")
beetle_env <- add_locations(beetle_env)
print(beetle_env)
plot(beetle_env["Cover"], key.pos = 4, breaks = seq(0, 100, by = 5))
If you want access to a dataset that is currently embargoed or restricted, then
you can approach the authors listed on the Zenodo record to ask for permission
to use the data. If you are then provided with the files for the dataset, then
they need to be inserted into your local SAFE data directory so that they can be
accessed by the safedata
functions.
The insert_dataset
function does this: it will check to see if a set of files
are part of specified Zenodo record and then copy them into the correct places
in the current SAFE data directory.
files <- system.file("api_data", "template_ClareWfunctiondata.xlsx", package = "safedata" ) insert_dataset(1237719, files) dat <- load_safe_data(1237719, "Data") str(dat)
At the moment, the safedata
package only handles loading data from the core
Excel files. We do intend to add functions and recipes to access other files
stored within a dataset in the future. At the moment, if you have downloaded all
the files associated with a dataset then the get_file_details
function allows
you to quickly see a list of the files in a dataset, whether you have downloaded
a local copy and the absolute file path of those files.
get_file_details(1237719)
# Turn off URL mocking and tidy up safedata:::mock_api(on = TRUE) unlink(safe_dir_path, recursive = TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.