GBIF's occurrence search is a powerful and versatile tool for accessing GBIF-mediated data. This vignette gives an overview of the occ_search() function, with examples and advice on how to use it effectively and on when not to use it.
The function occ_search() (and the related legacy function occ_data()) should not be used for serious research. Users sometimes find it easier to use occ_search() rather than occ_download() because they do not need to supply a username or password, and do not need to wait for a download to finish. However, any serious research project should always use occ_download() instead.
occ_search() is a quick way to get a non-random sample of occurrences from the GBIF-mediated data. It is useful for quickly exploring the data, but it is not suitable for serious research because users are limited to 100,000 records per search combination.
Even if your search returns fewer than 100,000 records, it is still not recommended to use occ_search() to retrieve all the records for a serious research project, because there is no easy way to cite data obtained this way.
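As a rough illustration of the recommended alternative, the sketch below wraps a download request in a function. The function name `download_occurrences` and its arguments are made up for this example, and the function is only defined, not called, because running it needs the rgbif package, network access, and GBIF account credentials. Treat it as one possible shape for the workflow under those assumptions rather than the definitive way to do it.

```r
# Sketch only: defined here but not called, because running it needs the
# rgbif package, network access, and GBIF account credentials.
download_occurrences <- function(taxon_key, user, pwd, email) {
  # Submit the download request; predicates take the place of the
  # occ_search() filter arguments.
  request <- rgbif::occ_download(
    rgbif::pred("taxonKey", taxon_key),
    rgbif::pred("hasCoordinate", TRUE),
    format = "SIMPLE_CSV",
    user = user, pwd = pwd, email = email
  )
  # Block until GBIF has finished preparing the file.
  rgbif::occ_download_wait(request)
  # Fetch the zip file and read it into a data frame.
  archive <- rgbif::occ_download_get(request)
  records <- rgbif::occ_download_import(archive)
  # Keep the citation next to the data so the download can be cited.
  citation <- rgbif::gbif_citation(rgbif::occ_download_meta(request))
  list(records = records, citation = citation)
}
```

A call might look like `download_occurrences(1427067, user, pwd, email)`, where the three credentials are your own GBIF account details.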
Here are some examples of good usages of occ_search():
occ_count() and article here)
And here are some examples of bad usages of occ_search():
occ_search() data for citable research
One of the more useful fields to search on is basisOfRecord, which gives roughly the origin of the occurrence record. Most records on GBIF are either PRESERVED_SPECIMEN (museum/herbarium records) or HUMAN_OBSERVATION (usually citizen science, but sometimes research observations).
Other interesting basisOfRecord values are FOSSIL_SPECIMEN and LIVING_SPECIMEN (zoos or botanical gardens), because people typically want to exclude these from their downloads.
Keep in mind that the basisOfRecord values are not guaranteed to be filled in accurately by the publisher. Sometimes records are misclassified, are given a basisOfRecord that you would not expect, or have a complicated provenance.
occ_search(basisOfRecord="PRESERVED_SPECIMEN") # museum and herbarium records
occ_search(basisOfRecord="HUMAN_OBSERVATION") # citizen science and research observations
occ_search(basisOfRecord="FOSSIL_SPECIMEN") # fossil records
occ_search(basisOfRecord="LIVING_SPECIMEN") # zoo and botanical garden records
occ_search(basisOfRecord="PRESERVED_SPECIMEN;HUMAN_OBSERVATION") # museum/herbarium and citizen science/research observations
occ_search(basisOfRecord="MACHINE_OBSERVATION") # machine observations (e.g. camera traps, acoustic recorders, etc.)
Users are sometimes attracted to occ_search() because it is possible to supply a scientificName rather than a taxonKey. Note that in the background a call is made to the species match service (similar to name_backbone()) in order to retrieve a GBIF taxonKey. Because of this, a user can occasionally receive back poorly matched occurrences, particularly if authorship is not supplied.
occ_search(scientificName="Calopteryx splendens")
# Or better
occ_search(scientificName="Calopteryx splendens (Harris, 1780)")
Is equivalent to doing the following:
occ_search(taxonKey=name_backbone("Calopteryx splendens")$usageKey)
# OR
occ_search(taxonKey=1427067)
If your name happens to be a homotypic synonym of another name, you may get back occurrences for the other name, no results, or results matched at a higher rank. Therefore, it is usually safer to use the GBIF taxonKey.
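One way to act on this advice is to inspect the name match yourself before searching. The sketch below is a hypothetical helper (`search_by_checked_name` is not part of rgbif) that stops unless the backbone match is reported as exact. It assumes the match result carries `matchType` and `usageKey` columns, and it is only defined here, not called, because it needs rgbif and network access.

```r
# Sketch only: defined but not called, since it needs rgbif and network access.
search_by_checked_name <- function(name, limit = 10) {
  match <- rgbif::name_backbone(name)
  # Anything other than an exact match (for example a fuzzy or higher-rank
  # match) deserves a manual look before occurrences are requested.
  if (!identical(match$matchType, "EXACT")) {
    stop("'", name, "' matched as ", match$matchType,
         "; inspect the match before searching")
  }
  # Search with the resolved key rather than the name string.
  rgbif::occ_search(taxonKey = match$usageKey, limit = limit)
}
```

Stopping on a non-exact match is a deliberately strict choice; a gentler version could issue a warning and continue.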
Some fields in the GBIF-mediated data are "interpreted" by GBIF, meaning that they are standardized in some way. For example, the field basisOfRecord is standardized to a controlled vocabulary. Therefore, only a few values are returned no matter what the publisher has supplied. For instance, "pinned insect", "fish specimen", and "herbarium sheet" will all get mapped to PRESERVED_SPECIMEN by GBIF.
Other fields are "non-interpreted", meaning that they are not standardized in any way. For example, the field recordedBy is a free text field. If you search for recordedBy="John Smith", you may not get back occurrences where the recordedBy field is some variant such as J. Smith, Smith, J., Smith, John, etc.
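For free text fields, one workaround is to search several spellings at once. The helper below is hypothetical (it is not part of rgbif) and only builds the search string; it assumes that recordedBy accepts several values separated by ";" in the same way as the other multi-value examples in this vignette. The list of variants is illustrative and far from exhaustive.

```r
# Hypothetical helper: builds a few common spellings of a collector name and
# joins them with ";" so they can be passed as one multi-value search term.
collector_variants <- function(first, last) {
  initial <- paste0(substr(first, 1, 1), ".")
  variants <- c(
    paste(first, last),           # John Smith
    paste(initial, last),         # J. Smith
    paste0(last, ", ", initial),  # Smith, J.
    paste0(last, ", ", first)     # Smith, John
  )
  paste(variants, collapse = ";")
}

query <- collector_variants("John", "Smith")
# With rgbif loaded, the result could then be used as:
# occ_search(recordedBy = query)
```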
One strategy for determining whether a search term is free text is to use occ_count(facet="<search term>"). See the article on occ_count() here.
occ_count(facet="recordedBy")
occ_count(facet="basisOfRecord")
If many unique values are returned, then it is likely that the field is free text.
Some search parameters are often NULL, or not supplied by the publisher. In general, occ_search() terms that are neither required fields nor filled by GBIF during interpretation are often NULL. For example, even though coordinateUncertaintyInMeters theoretically applies to all occurrences with coordinates, it is often NULL because publishers choose not to supply this information or it is unknown. Similarly, sex is left NULL more often than one might naively expect.
Other columns with more NULLs than one might expect:
stateProvince
elevation
establishmentMeans
coordinateUncertaintyInMeters
Keep in mind that specifying any filter will remove all records with NULL in the filtered field.
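The effect is easy to underestimate, so here is a small offline illustration. The records are made up for this example, with NA standing in for NULL; nothing is fetched from GBIF. It shows how many rows a filter on a sparsely filled field silently drops.

```r
# Made-up records for illustration only; NA stands in for NULL.
records <- data.frame(
  key = 1:5,
  sex = c("FEMALE", NA, "MALE", NA, NA),
  coordinateUncertaintyInMeters = c(30, NA, NA, 1000, NA)
)

# Share of records with no value in each field.
null_share <- colMeans(is.na(records))

# Filtering on a field behaves like these subsets: rows with NA never match.
females <- records[which(records$sex == "FEMALE"), ]
with_uncertainty <- records[!is.na(records$coordinateUncertaintyInMeters), ]
```

In this toy table only one of five rows survives a filter on sex == "FEMALE", even though the sex of three rows is simply unknown.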
Location searching can sometimes be challenging for new users. In particular, searching for stateProvince can be tricky because the field is free text when one might expect it to be from a controlled vocabulary. stateProvince="California" will not return occurrences where the publisher supplied values such as CA, Calif., or Cal. Additionally, records with coordinates falling within California may not have been supplied with a stateProvince value by the publisher.
occ_search(stateProvince="California")
occ_search(stateProvince="CA") # will return a different number of records
occ_search(stateProvince="CA;California") # search both variants at the same time
A usually better choice than searching by stateProvince is to search by gadmGid. The term gadmGid is a GBIF-interpreted field that is filled by GBIF when coordinates are available. The gadmGid values can be looked up on the GBIF occurrence search page.
occ_search(gadmGid="USA.5_1") # search for California
occ_search(gadmGid="JPN.12_1") # search for Hokkaido Japan
occ_search(gadmGid="USA.5_1;USA.6_1") # search for California and Colorado
occ_search(gadmGid="PHL.10_1") # Bataan Philippines
occ_search(gadmGid="USA") # United States "just land" without EEZ area
Searching by country is typically straightforward because the field is standardized and filled in by GBIF when coordinates are available. Two-letter country codes are used when searching occurrences. These codes can be looked up using enumeration_country().
occ_search(country="US") # search for United States
occ_search(country="JP") # search for Japan
occ_search(country="PH") # search for Philippines
occ_search(country="SE") # search for Sweden
occ_search(country="US;JP") # search for United States and Japan
Searching by continent is also possible, but unlike country, this value is not filled in when coordinates are available, and instead relies on the publisher filling in this field. So if the publisher has not filled in a value, then this field will be NULL, even if the record obviously lies on a continent.
The field is, however, standardized by GBIF, so that supplied values are all mapped to a controlled vocabulary (e.g. "Europa", "Euroopa", "EUR", "Eu" -> EUROPE; "Afrique", "Afr.", "AF" -> AFRICA).
occ_search(continent="EUROPE") # search for Europe
occ_search(continent="AFRICA") # search for Africa
occ_search(continent="EUROPE;AFRICA") # search for Europe and Africa
If you need to get all occurrences from a certain continent, I would use the gadmGid filter or supply a bounding box or WKT polygon to geometry. When using geometry, make sure that your polygon is wound in the correct order (anti-clockwise). When in doubt, using the GBIF web UI to draw and debug the polygon can be a good option. Only POLYGON and MULTIPOLYGON are accepted WKT types.
occ_search(geometry="POLYGON((13.42436 69.86167,4.6469 67.01976,-8.26114 67.2205,-19.62021 67.81281,-28.39768 64.25374,-27.88135 53.09437,-17.55493 44.99691,-16.52228 30.81969,3.61426 32.57676,19.62021 30.37524,38.72411 32.14062,54.21375 33.87246,66.60546 43.14228,72.80133 50.54193,70.21972 62.16009,38.20778 72.6752,23.23447 73.42765,13.42436 69.86167))") # rough polygon around Europe
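If you build polygons in code, the winding order can also be checked before the string is sent to geometry. The helpers below are hypothetical and written in base R only (they are not rgbif functions). They assume a simple single-ring POLYGON whose first and last points coincide, and they use the planar shoelace formula, which is an approximation that is normally good enough to tell clockwise from anti-clockwise for shapes that do not cross the antimeridian.

```r
# Parse the ring of a simple WKT POLYGON into a two-column matrix.
# Assumes a single ring with "x y" pairs separated by commas.
wkt_ring <- function(wkt) {
  body <- gsub("^\\s*POLYGON\\s*\\(\\(|\\)\\)\\s*$", "", wkt)
  pairs <- strsplit(trimws(strsplit(body, ",")[[1]]), "\\s+")
  matrix(as.numeric(unlist(pairs)), ncol = 2, byrow = TRUE,
         dimnames = list(NULL, c("x", "y")))
}

# Shoelace formula: a positive signed area means anti-clockwise winding.
is_anticlockwise <- function(ring) {
  n <- nrow(ring)
  x <- ring[, "x"]
  y <- ring[, "y"]
  signed_area <- sum(x[-n] * y[-1] - x[-1] * y[-n]) / 2
  signed_area > 0
}

# Reverse the vertex order and write the ring back out as WKT.
rewind_wkt <- function(ring) {
  ring <- ring[nrow(ring):1, , drop = FALSE]
  paste0("POLYGON((", paste(ring[, "x"], ring[, "y"], collapse = ","), "))")
}

# A square that was deliberately written clockwise.
clockwise <- "POLYGON((0 0,0 10,10 10,10 0,0 0))"
ring <- wkt_ring(clockwise)
fixed <- if (is_anticlockwise(ring)) clockwise else rewind_wkt(ring)
```

Here `fixed` holds the same square with its vertices in reverse order, which could then be passed to geometry.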
Sometimes it can be useful to select everything but a certain region, also known as a "polygon with hole in it". This can be done by formatting your WKT with enough interpolated points.
POLYGON( (-180 -90,-90 -90,0 -90,90 -90,180 -90,180 90,90 90,0 90,-90 90,-180 90,-180 -90), (-5 -5,-5 5,5 5,5 -5,-5 -5) )
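Writing such a string by hand is error-prone, so a small generator can help. The function below is a hypothetical base R helper (not part of rgbif) that reproduces the polygon above for any bounding box: the outer ring covers the whole world with a point every `step` degrees of longitude and is wound anti-clockwise, while the hole is wound clockwise. The default of 90 degrees simply mirrors the example; whether that many interpolated points is enough for a given query is an assumption you should check.

```r
# Hypothetical helper: WKT for "the whole world except a bounding box".
world_except_box <- function(xmin, ymin, xmax, ymax, step = 90) {
  lon <- seq(-180, 180, by = step)
  outer <- rbind(
    cbind(lon, -90),      # bottom edge, west to east
    cbind(rev(lon), 90),  # top edge, east to west
    c(-180, -90)          # close the ring at the starting point
  )
  hole <- rbind(
    c(xmin, ymin), c(xmin, ymax), c(xmax, ymax), c(xmax, ymin), c(xmin, ymin)
  )
  # Format one ring as "(x y,x y,...)".
  ring <- function(m) paste0("(", paste(m[, 1], m[, 2], collapse = ","), ")")
  paste0("POLYGON(", ring(outer), ",", ring(hole), ")")
}

wkt <- world_except_box(-5, -5, 5, 5)
# With rgbif loaded, the result could then be used as:
# occ_search(geometry = wkt)
```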
Some records on GBIF can be quite old (1600s), so it is sometimes useful to filter by year to remove these records. Year is typically the collection event or the observation event of the record. Almost all occurrences on GBIF supply a year value. Therefore, filtering by year is typically safe from unintentional mass data filtering due to NULL values.
occ_search(year=1998) # search for occurrences from 1998
occ_search(year="1998,2024") # search for occurrences from 1998-2024
occ_search(year="1900;2000") # search for occurrences from 1900 and 2000
occ_search(year="1950,2024") # search for somewhat modern records
Sometimes users come to GBIF looking for a specific museum record, but they don't know the gbifid of the record. In these cases, searching by occurrenceId, catalogNumber, recordNumber or institutionCode can be useful. Keep in mind that many of these fields may not be unique across all of GBIF. For example, a few institutions might use the same institutionCode but actually be different institutions. Usually, combining a few of these values can get you close to the record you are looking for.
occ_search(institutionCode="KU")
occ_search(catalogNumber="KU 110")
New users might not be aware that some data publishers supply additional data beyond simple "when-what-where" data. Richer extra data usually comes in the form of dwcaExtensions. While occ_search() does not return the values from these extensions, it is possible to filter by extension type to see which dataset publishers have published extensions of interest.
occ_search(dwcaExtension="http://rs.gbif.org/terms/1.0/Multimedia")
occ_search(dwcaExtension="http://rs.tdwg.org/dwc/terms/MeasurementOrFact")
occ_search(dwcaExtension="http://rs.gbif.org/terms/1.0/DNADerivedData")