It can sometimes be useful to get information about datasets rather than the data which they contain.
In this article, I will explain the various methods available to access dataset meta-data stored within the GBIF registry.
dataset_search()
: Use this function if you want meta-data, counts, facets, and not necessarily all of the results.
dataset_export()
: Use this function if you only want meta-data and all of the results as a table.
If you want just a (non-random) sample of datasets from the registry, you can run dataset_search() with no arguments, which will return 100 datasets of various types. However, running dataset_export() with no filters will download all of the dataset meta-data in the registry.
dataset_search()
# dataset_export() # beware this will download 93K datasets!
GBIF supports a few types of datasets. The best known is the occurrence dataset. You can search for occurrence type datasets using the type filter.
dataset_search(type = "OCCURRENCE")
# dataset_export(type = "OCCURRENCE") # download all of the meta-data
Checklists are another common type of dataset mediated by GBIF. You can use the multiple-value separator ";" to get both checklist and occurrence types.
dataset_search(type = "OCCURRENCE;CHECKLIST")
# dataset_export(type = "OCCURRENCE;CHECKLIST") # download both types
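If the values you want to combine already live in an R vector, the ";"-separated string can be built rather than typed by hand. The following is a minimal sketch: multi_value() and search_by_types() are hypothetical helpers written for this illustration and are not part of rgbif. The search itself is wrapped in a function, so nothing is requested from GBIF until you call it.

```r
# Hypothetical helper (not part of rgbif): collapse a character vector
# into the ";"-separated form used for multi-value filters.
multi_value <- function(values) {
  paste(unique(values), collapse = ";")
}

wanted_types <- c("OCCURRENCE", "CHECKLIST")
type_filter <- multi_value(wanted_types)
print(type_filter)

# Hypothetical wrapper: only contacts GBIF when it is called.
search_by_types <- function(values) {
  rgbif::dataset_search(type = multi_value(values))
}
```

Calling search_by_types(wanted_types) should then be equivalent to passing the combined string to the type filter directly.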
You might be wondering what the other possible types of datasets are called. With the facets interface, it is possible to get group-by counts for most of the dataset_search() filters.
dataset_search(facet="type",limit=0)$facets
$type
            name count
1      CHECKLIST 49628
2     OCCURRENCE 38523
3 SAMPLING_EVENT  3036
4       METADATA   386
Use facetLimit to control the number of results returned with the facets interface.
dataset_search(facet="publishingCountry",facetLimit=200,limit=0)$facets
    name count
1     CH 47107
2     FR 15559
3     DE  9012
4     CO  2812
<...>
136   RS     1
137   SO     1
138   SS     1
139   TR     1
140   WF     1
Facets can also be used with other filters. For example, to get the top countries publishing occurrence datasets:
dataset_search(facet="publishingCountry",type="OCCURRENCE",facetLimit=200,limit=0)$facets
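Once a facet table is in hand, picking out the largest groups is ordinary data-frame work. The sketch below assumes that $facets is a named list of data frames with name and count columns, as the printed output in this article suggests. top_facet() is a hypothetical helper, not an rgbif function, and the example runs on a small stand-in table so that it does not need a network connection.

```r
# Hypothetical helper (not part of rgbif): return the n largest rows of one facet.
# Assumes `facets` is a named list of data frames with `name` and `count` columns.
top_facet <- function(facets, facet_name, n = 3) {
  tab <- facets[[facet_name]]
  tab$count <- as.numeric(tab$count)  # counts may arrive as text; coerce to be safe
  tab <- tab[order(tab$count, decreasing = TRUE), , drop = FALSE]
  head(tab, n)
}

# Stand-in for dataset_search(facet = "type", limit = 0)$facets,
# reusing the type counts printed in this article (deliberately unsorted).
example_facets <- list(
  type = data.frame(
    name = c("METADATA", "OCCURRENCE", "CHECKLIST", "SAMPLING_EVENT"),
    count = c(386, 38523, 49628, 3036),
    stringsAsFactors = FALSE
  )
)

top_types <- top_facet(example_facets, "type", n = 2)
print(top_types)
```

With a live result you would pass the $facets element of a dataset_search() call in place of example_facets.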
Here are some more examples of using dataset_search() filters:
# datasets published by Ukraine
dataset_search(publishingCountry = "UA")
# checklist datasets with a CC0 license
dataset_export(type="CHECKLIST", license = "CC0_1_0")
# Be aware that not all publishers fill in a subType
dataset_search(subType="TAXONOMIC_AUTHORITY")
# Get datasets hosted by Norway
dataset_search(hostingCountry = "NO")
# counts of datasets hosted by Norway but published by other countries
dataset_search(facet="publishingCountry",hostingCountry = "NO",limit=0,facetLimit=100)$facets
# get all datasets within the GRIIS project
dataset_export(projectId = "GRIIS")
# keywords used by the GRIIS project
dataset_search(facet="keyword",projectId="GRIIS",limit=0,facetLimit=100)$facets
# datasets with data collected between 1600 and 1800
dataset_search(decade = "1600,1800")
# group-by license counts of occurrence type datasets
dataset_search(facet="license",type="OCCURRENCE",limit=0,facetLimit=10)$facets
# search for dataset by doi of the dataset
dataset_search(doi="10.15468/aomfnb")
# datasets hosted by Scandinavia
dataset_search(hostingCountry = "IS;FI;DK;NO;SE")
dataset_search(facet="hostingCountry",hostingCountry = "IS;FI;DK;NO;SE",limit=0,facetLimit=5)$facets
# all datasets in the VertNet network
dataset_export(networkKey = "99d66b6c-9087-452f-a9d4-f15f2c2d0e7e")
# all datasets with the keyword "DEPOBIO" hosted by France
dataset_export(keyword="DEPOBIO",hostingCountry="FR")
# number of occurrences
dataset_export(keyword="DEPOBIO",hostingCountry="FR")$occurrenceRecordsCount |> sum()
# datasets published by Cornell Lab of Ornithology
dataset_search(publishingOrg = "e2e717bf-551a-4917-bdc9-4fa0f342c530")
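When summing occurrenceRecordsCount over an exported table, a single missing value would turn the whole total into NA. I have not checked whether dataset_export() ever returns NA in that column, so treat the following as a defensive sketch under that assumption. total_occurrences() is a hypothetical helper, and the stand-in table uses made-up values purely for illustration.

```r
# Hypothetical helper (not part of rgbif): total occurrence records across
# datasets, ignoring rows where the count is missing.
total_occurrences <- function(datasets) {
  sum(as.numeric(datasets$occurrenceRecordsCount), na.rm = TRUE)
}

# Stand-in for a dataset_export() result; titles and counts are invented.
example_export <- data.frame(
  title = c("Dataset A", "Dataset B", "Dataset C"),
  occurrenceRecordsCount = c(120, NA, 30),
  stringsAsFactors = FALSE
)

total <- total_occurrences(example_export)
print(total)
```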
I haven't yet mentioned dataset_suggest(), which returns less data than dataset_search() but is practically the same function. Most of the time you will want to use dataset_search() and dataset_export() rather than dataset_suggest(). The endpoint for dataset_suggest() was designed to let the GBIF website function efficiently and isn't really that interesting for rgbif users, but there might be some edge cases where it is useful.
dataset() returns other meta-data not necessarily found with dataset_search(). Most of the time you will want to use dataset_search(), but there are times when you have to use dataset(), for example when you want to search by machine tag.
# return all datasets tagged as "citizen science"
dataset(machineTagNamespace="citizenScience.gbif.org")
There are various other dataset functions that you might find useful. In particular, if you know the datasetKey UUID, there is a group of functions you can use.
# get details of a single dataset
dataset_get("38b4c89f-584c-41bb-bd8f-cd1def33e92f")
# get the details of how the dataset is being ingested by GBIF
dataset_process("38b4c89f-584c-41bb-bd8f-cd1def33e92f",limit=3)
# what networks does the dataset belong to?
dataset_networks("3dab037f-a520-4bc3-b888-508755c2eb52")
# what datasets compose the dataset? Not many datasets have constituents.
dataset_constituents("7ddf754f-d193-4cc9-b351-99906754a03b",limit=3)
# what contacts did the publishers give for the dataset?
dataset_contact("7ddf754f-d193-4cc9-b351-99906754a03b")
# only works for CHECKLIST type datasets
dataset_metrics("7ddf754f-d193-4cc9-b351-99906754a03b")
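Because these functions expect a datasetKey rather than a DOI or a project identifier, it can help to check the shape of a key before sending a request. This is a minimal sketch: is_dataset_key() and safe_dataset_get() are hypothetical helpers, not rgbif functions, and the check only confirms the usual 8-4-4-4-12 hexadecimal layout of a UUID, not that the dataset exists in the registry.

```r
# Hypothetical helper (not part of rgbif): TRUE if `key` is a single string
# laid out like a UUID (8-4-4-4-12 hexadecimal characters).
is_dataset_key <- function(key) {
  pattern <- "^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
  is.character(key) && length(key) == 1 && grepl(pattern, key)
}

# Hypothetical wrapper: fail early on malformed keys; only contacts GBIF
# when it is called with a well-formed key.
safe_dataset_get <- function(key) {
  if (!is_dataset_key(key)) {
    stop("not a valid datasetKey: ", key)
  }
  rgbif::dataset_get(key)
}

print(is_dataset_key("38b4c89f-584c-41bb-bd8f-cd1def33e92f"))
print(is_dataset_key("10.15468/aomfnb"))
```

A DOI such as the one used with the doi filter fails the check, which is a reminder to look up the key with dataset_search() first.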