spocc_duplicates: A note about duplicate occurrence records

Description

BEWARE: spocc provides you a nice interface to many data providers for species occurrence data. However, in cases where you request data from GBIF in addition to other data sources, there could be duplicate records. This is because GBIF is, to use an ecology analogy, a top predator, and pulls in data from lower nodes in the food chain. For example, iNaturalist provides data to GBIF, so if you search for occurrence records for Pinus contorta from both iNaturalist and GBIF, you could get, say, 20 of the same records from each.

We are working on a way to programmatically flag and/or remove these duplicate records. As you could imagine, this is rather difficult, as data are often lost in translation, significant digits could change from provider to provider for the same data, etc.
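
To make the difficulty concrete, here is a minimal sketch of one possible way to flag likely duplicates yourself. It is not the approach spocc uses or will use. The helper flag_duplicates and the toy data are hypothetical, and the column names (name, longitude, latitude, prov, date) are an assumption about how a combined occurrence table might be laid out. Coordinates are rounded before comparison so that records differing only in significant digits can still match.

```r
# Toy occurrence records (all values made up for illustration).
# Column layout is an assumption, not a guaranteed spocc output format.
occs <- data.frame(
  name      = rep("Pinus contorta", 4),
  longitude = c(-120.12345, -120.1234, -121.5, -120.12345),
  latitude  = c(38.98765, 38.9877, 40.25, 38.98765),
  prov      = c("inat", "gbif", "gbif", "gbif"),
  date      = as.Date(c("2014-06-01", "2014-06-01", "2013-07-15", "2015-08-20")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: flag a record as a duplicate when an earlier row has
# the same name, the same date, and the same coordinates after rounding to
# `digits` decimal places.
flag_duplicates <- function(df, digits = 3) {
  key <- paste(
    tolower(df$name),
    format(df$date),
    round(df$longitude, digits),
    round(df$latitude, digits),
    sep = "|"
  )
  df$duplicate <- duplicated(key)
  df
}

flagged <- flag_duplicates(occs)

# Keep only the first record of each matching group
deduped <- flagged[!flagged$duplicate, ]
```

Here the second row (from gbif) is flagged because it matches the first row (from inat) once coordinates are rounded. Which copy is kept depends only on row order, and two copies of one record can still round to different values when a coordinate sits near a rounding boundary, so treat the flags as candidates to review rather than a definitive answer.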

Still, we think a single R interface to many occurrence record providers will provide a consistent way to work with occurrence data, making analyses and visualizations more repeatable across providers.

We are working on a set of tools for cleaning data, as well as for removing duplicates, in the spocc_clean function, so keep an eye on that.

Do get in touch with us if you have concerns, ideas for eliminating duplicates, etc., at support@ropensci.org, or on the issue tracker for the spocc package: https://github.com/ropensci/spocc/issues/new


spoccutils documentation built on Sept. 12, 2016, 10:35 a.m.