Services examples

Specimen occurrence services {.tabset}

Specimen records constitute the core of data served by the NBA. Museum specimens can represent a whole variety of different objects such as plants, animals or single parts thereof, DNA samples, fossils, rocks or meteorites. For detailed information of the data model, please refer to the official documentation in the NBA.

All specimen occurrence services are accessible using the methods within the SpecimenClient class. For a list of available endpoints, please refer to the class documentation (?Specimen). Below, we will give details about all services, grouped by category.

Query services

query

Querying for specimens is accomplished with the query method in the class SpecimenClient. For simple queries, query parameters of type list can be passed via the parameter queryParams, for example we can query specimens of the Family Ebenaceae that were collected in Europe:

library('nbaR')

# instantiate specimen client
sc <- SpecimenClient$new()

# specify query params in named list
qp <-
  list(
    identifications.defaultClassification.family = "Ebenaceae",
    gatheringEvent.continent = "Europe"
  )

# query
res <- sc$query(queryParams = qp)

If we now want to know all the countries that the specimens were collected in, we can access the Specimen objects in res$content$resultSet as follows:

sapply(res$content$resultSet, function(x)x$item$gatheringEvent$country)

Note that passing query parameters as a named list only allows for limited queries; the logical conjunction between parameters is for example always AND. More complex queries can be accomplished using the QuerySpec object:

# get all specimens with genus name starting with 'Hydro'
qc <-
  QueryCondition$new(field = "identifications.defaultClassification.genus",
                     operator = "STARTS_WITH",
                     value = "Hydro")
qs <- QuerySpec$new(conditions = list(qc))
res <- sc$query(qs)

download_query

The query function is limited to retrieve 50000 specimen at once (this is determined in the parameter index.max_result_window, the value is retrievable using the getSettings method in the metadata section). In order to provide access for a larger amount of data, the query_download takes the same arguments as query, but download the data as a gzip stream under the hood. Unlike query, query_download returns a list of specimen objects instead of a ResultSet. Example:

## get the first 100000 specimen objects (not possible with query method)
res <- sc$download_query(QuerySpec$new(size=100000))

Data access services

Several access methods offer the convenient retrieval of specimens matching a certain identifier or being part of a certain collection. Below we give examples of how to use the currently implemented data access services for specimen records:

Access to collections

Some of the specimens available via the NBA are categorised thematically into special collections, such as the Siebold-, Dubois- or Jongmans collection. The function get_named_collection lists all available special collections and the identifiers of the specimens within a collection can be queries with get_ids_in_collection

sc$get_named_collections()$content
sc$get_ids_in_collection("siebold")$content

count

For any given query (with QuerySpec or not), returns the count of matches instead of specimen objects:

# Example with QuerySpec:
# how many specimens are there in the 'Botany' collection?
qc <- QueryCondition$new(field='collectionType', operator='EQUALS', value='Botany')
qs <- QuerySpec$new(conditions=list(qc))

# get the number of specimens
sc$count(qs)$content

Note that the count of matches for a given query is also returned by the query function. However, count is more lightweight as it returns an integer instead of a ResultSet containing Specimen objects.

exists

Check if a record exists, based on its unitID:

# use SpecimenClient instantiated above
res <- sc$exists('ZMA.INS.1255440')                                        

# content is boolean
res$content

find

Return a single specimen given its identifier (Note: the identifier of a specimen is different from the unitID, see also here):

id <- "RMNH.MAM.17209.B@CRS"
res <- sc$find(id)

# content is single specimen object
res$content

find_by_ids

Same as find, but takes multiple IDs:

ids <- "RMNH.INS.657083@CRS,L.1589244@BRAHMS"
res <- sc$find_by_ids(ids)

find_by_unit_id

Find a specimen by its unitID:

unitID <- "RMNH.MAM.1513"
res <- sc$find_by_unit_id(unitID)

Aggregation services

Aggregation services group available data according to different criteria.

get_distinct_values

This method takes a specific field as an argument and returns all possible values and the frequency for that field in the data. Below we get all possible values for the country in which a specimen was collected.

sc$get_distinct_values("gatheringEvent.country")

Note: By default, get_distinct_values lists only the first 10 hits. The above query thus does not reflect the distinct values in the hole dataset. This number can be increased with e.g. setting the size parameter in a QuerySpec object passed to the method.

sc$get_distinct_values("gatheringEvent.country",
                       querySpec = QuerySpec$new(size = 10000))

count_distinct_values

Instead of returning all different values for a given field, this method does a mere count:

sc$count_distinct_values("gatheringEvent.country")

get_distinct_values_per_group

Suppose you want to add another filter to the above query of retrieving all distinct values for a given field, such as splitting the results by their respective source system:

sc$get_distinct_values_per_group("sourceSystem.name", "gatheringEvent.country")

count_distinct_values_per_group

If no return of the actual values is needed, a simple count per group is done as follows:

sc$count_distinct_values_per_group("sourceSystem.name",
                                   "gatheringEvent.country")$content

Metadata services

Specimen Metadata services include the same standard metadata services as for the other data types:

# get all paths for the Specimen datatype 
sc$get_paths()$content

# get info e.g. for field collectionType
sc$get_field_info()$content$collectionType

# get all settings
sc$get_settings()

# get specific setting
sc$get_setting("index.max_result_window")$content

# check if operator is allowed
sc$is_operator_allowed("gatheringEvent.continent", "STARTS_WITH")$content

Available paths and fields

All fields can be retrieved with get_paths and specific information on the fields, such as allowed operators etc. with get_field_info

sc$get_paths()$content
sc$get_field_info()$content

Settings

Multimedia-specific settings can be retrieved with get_settings and a specific setting with get_setting:

sc$get_settings()$content
sc$get_setting("index.max_result_window")$content

Operators

To test if a certain operator can be used for a multimedia query:

sc$is_operator_allowed("identifications.defaultClassification.genus",
                       "STARTS_WITH")$content

DwCA download services

In addition to query services that return JSON formatted data, the NBA also offers the export of Darwin Core Archive (DwCA) files. These files are by default zip archives, please refer to our official API documentation for more information.

Static download

Static download services offer the download of predefined datasets. The sets that are available for download can be queried with dwca_get_data_set_names:

sc$dwca_get_data_set_names()$content

A dataset can then be downloaded using dwca_get_data_set. A filename can be given as argument, if none is given, the DwCA archive is written to download-YYYY-MM-DDThh:mm.zip in the current working directory.

# download dataset 'porifera' to temporary file
filename <- tempfile(fileext=".zip")
sc$dwca_get_data_set('porifera', filename=filename)                                                                               

Dynamic download

The dynamic download function dwca_query allows for download of arbitrary sets, defined by the user's query. The arguments to this methods are similar to query, plus the filename:

# download all specimen of genus 'Hydrochoerus'
filename <- tempfile(fileext = ".zip")
qs <-
  QuerySpec$new(conditions = list(
    QueryCondition$new(
      field = "identifications.defaultClassification.genus",
      operator = "EQUALS",
      value = "Hydrochoerus"
    )
  ))
sc$dwca_query(querySpec = qs, filename = filename)

Taxonomic data services {.tabset}

Query Services

query

Query for taxon document with given search criteria. Example:

# query for taxa of genus 'Sedum' that are in the Netherlands Soortenregister (NSR)
tc <- TaxonClient$new()
qc <-
  QueryCondition$new(field = "acceptedName.genusOrMonomial",
                     operator = "EQUALS",
                     value = "Sedum")
qc2 <-
  QueryCondition$new(field = "sourceSystem.code",
                     operator = "EQUALS",
                     value = "NSR")
qs <- QuerySpec$new(conditions = list(qc, qc2))
tc$query(qs)

download_query

The query function is limited to retrieve 50000 taxa at once (this is determined in the parameter index.max_result_window, the value is retrievable using the getSettings method in the metadata section). In order to provide access for a larger amount of data, the query_download takes the same arguments as query, but download the data as a gzip stream under the hood. Unlike query, query_download returns a list of taxon objects instead of a ResultSet.

Data access services

count

For a given query, do not return Taxon objects but the mere count. Example

# get counts for taxa of genus 'Sedum' that are in the
#  Netherlands Soortenregister (NSR)
qc <-
  QueryCondition$new(field = "acceptedName.genusOrMonomial",
                     operator = "EQUALS",
                     value = "Sedum")
qc2 <-
  QueryCondition$new(field = "sourceSystem.code",
                     operator = "EQUALS",
                     value = "NSR")
qs <- QuerySpec$new(conditions = list(qc, qc2))
tc$count(qs)

find

Returns a taxon object given its identifier:

tc$find("27706109@COL")$content

find_by_ids

Given a string with comma-separated identifiers, returns a list of taxon objects:

ids <- "27706109@COL,27704140@COL,27706110@COL,27706111@COL,27706108@COL"
res <- tc$find_by_ids(ids)

Aggregation services

get_distinct_values

This method takes a specific field as an argument and returns all possible values and the frequency for that field in the data. Example: get all data source systems for taxon objects:

tc$get_distinct_values("sourceSystem.name")$content

Metadata services

Taxon Metadata services include the same standard metadata services as for the other data types:

# get all paths for the Taxon datatype 
tc$get_paths()$content

# get info e.g. for field collectionType
tc$get_field_info()$content$synonyms.author

# get all settings
tc$get_settings()

# get specific setting
tc$get_setting("index.max_result_window")$content

# check if operator is allowed
tc$is_operator_allowed("synonyms.author", "EQUALS")$content

Available paths and fields

All fields can be retrieved with get_paths and specific information on the fields, such as allowed operators etc. with get_field_info

tc$get_paths()$content
tc$get_field_info()$content

Settings

Multimedia-specific settings can be retrieved with get_settings and a specific setting with get_setting:

tc$get_settings()$content
tc$get_setting("index.max_result_window")$content

Operators

To test if a certain operator can be used for a mutimedia query:

tc$is_operator_allowed("identifications.defaultClassification.genus",
                       "STARTS_WITH")$content

DwCA download services

The taxonomic information in the NBA is also available as Darwin-Core archive files.

Static download

Static download services offer the download of predefined datasets. The sets that are available for download can be queried with dwca_get_data_set_names:

tc$dwca_get_data_set_names()$content

A dataset can then be downloaded using dwca_get_data_set. A filename can be given as argument, if none is given, the DwCA archive is written to download-YYYY-MM-DDThh:mm.zip in the current working directory.

# download dataset 'nsr' to temporary file
filename <- tempfile(fileext=".zip")
tc$dwca_get_data_set('nsr', filename=filename)                                                                               

Dynamic download

The dynamic download function dwca_query allows for download of arbitrary sets, defined by the user's query. The arguments to this methods are similar to query, plus the filename:

# download all taxa for genus 'Clematis'
filename <- tempfile(fileext = ".zip")
qs <-
  QuerySpec$new(conditions = list(
    QueryCondition$new(
      field = "defaultClassification.genus",
      operator = "EQUALS",
      value = "Clematis"
    )
  ))
tc$dwca_query(querySpec = qs, filename = filename)

Geographic data services {.tabset}

Query Services

The GeoArea query service allows for detailed search within the fields of a GeoArea object. As for the other data types, query parameters can be either given as a list or as a QuerySpec object. Below, we make a simple query to get the GeoArea object for the Netherlands:

# instantiate client for geo areas
gc <- GeoClient$new()

# query for GeoArea of the Netherlands
qc <-
  QueryCondition$new(field = "locality",
                     operator = "EQUALS",
                     value = "Netherlands")
qs <- QuerySpec$new(conditions = list(qc))
res <- gc$query(qs)

# get item
res$content$resultSet[[1]]$item

Data access services

get_geo_json_for_locality

This is a convenience function to directly extract the a GeoJSON object for a specific locality. GeoJSON is a popular format for storing geographical point- and polygon data. To e.g. extract the GeoJSON polygon representation for the Netherlands:

loc <- "Netherlands"
res <- gc$get_geo_json_for_locality(loc)

Results are returned as a list by default, but can be easily converted to a JSON string, e.g.

jsonlite::toJSON(res$content)

count

For any given query (with QuerySpec or not), returns the count of matches instead of GeoArea objects:

# return count of all GeoAreas
res <- gc$count()
res$content

Aggregation services

get_distinct_values

This function returns all values present for a certain field, and their counts:

gc$get_distinct_values("areaType")$content

Metadata services

Geo Metadata services include the same standard metadata services as for the other data types:

# get all paths for the GeoArea datatype 
gc$get_paths()$content

# get info e.g. for field 'areaType'
gc$get_field_info()$content$areaType

# get all settings
gc$get_settings()

# get specific setting
gc$get_setting("index.max_result_window")$content

# check if operator is allowed
gc$is_operator_allowed("locality", "STARTS_WITH")$content

Multimedia services {.tabset}

Multimedia services are accessible with a MultimediaClient, instantiated as follows:

mc <- MultimediaClient$new()

Query Services

As for the other data types, the query method enables simple and complex queries using a list or a QuerySpec object to specify query parameters.

# example of multimedia query passing parameters as a list
mc$query(queryParams = list(collectionType = 'Cnidaria'))$content

# example of multimedia query using QuerySpec: get the first 100
# multimedia items associated with a specimen with name starting with "Ba"
qc <-
  QueryCondition$new(field =
                       "identifications.scientificName.fullScientificName",
                     operator =
                       "STARTS_WITH",
                     value =
                       "Qu")
qs <- QuerySpec$new(conditions = list(qc), size = 100)
res <- mc$query(qs)

# check if scientific names indeed start with 'Qu'
sapply(res$content$resultSet, function(x)
  x$item$identifications[[1]]$scientificName$fullScientificName)

Data access services

count

As for the other data types, a count function returns counts instead of the actual objects:

# count all multimedia documents
mc$count()$content

Aggregation services

get_distinct_values

This function returns all values present for a certain field, and their counts. Example: retrieve all different licenses and their counts"

mc$get_distinct_values("license")$content

Metadata services

Available paths and fields

All fields can be retrieved with get_paths and specific information on the fields, such as allowed operators etc. with get_field_info

mc$get_paths()$content
mc$get_field_info()$content

Settings

Multimedia-specific settings can be retrieved with get_settings and a specific setting with get_setting:

mc$get_settings()$content
mc$get_setting("index.max_result_window")$content

Operators

To test if a certain operator can be used for a mutimedia query:

mc$is_operator_allowed("identifications.scientificName.fullScientificName",
                       "STARTS_WITH")$content

Metadata services {.tabset}

Metadata services provide miscellaneous information about the data available via the NBA. Note that there is also type-specific metadata for each data type (e.g. Specimen) which can be retrieved with the specific client of that class. Here we show the available methods for the MetadataClient which gives general, non-type specific metadata. The client is instantiated in the standard way:

mc <- MetadataClient$new()

Controlled vocabularies

The vocabularies for some fields are controlled by dictionaries with allowed values. For the sex of a museum specimen, for instance, only the terms male, female, mixed and hermaphrodite are allowed to be assigned to the specimen. The fields for which controlled lists are available can be retrieved as follows:

mc$get_controlled_lists()$content

and for each field that has a controlled vocabulary, there is a separate function to retrieve the allowed values:

mc$get_controlled_list_taxonomic_status()$content
mc$get_controlled_list_specimen_type_status()$content
mc$get_controlled_list_sex()$content
mc$get_controlled_list_phase_or_stage()$content

Miscellaneous

Dates

To maintain data integrity, dates have to be coded in specific formats in our systems. For instance, yyyy-MM-dd is a valid format. Allowed formats can be retrieved as follows:

mc$get_allowed_date_formats()$content
Services list

The method get_rest_services returns a list of all services in the NBA as objects of type RestService.

# get the endPoint of the first rest service in the services list
mc$get_rest_services()$content[[1]]
Settings

Similar to the document-specific metadata services, we can get general settings with the MetaDataClient:

#get all settings
mc$get_settings()$content

# get value for specific setting
mc$get_setting("operator.contains.min_term_length")$content


naturalis/nbaR documentation built on Nov. 12, 2023, 4:47 p.m.