knitr::opts_chunk$set(echo = TRUE) # for development
devtools::load_all() # for development
The databases of biological collections are becoming increasingly available online, providing an unprecedented amount of species records for biodiversity-related studies. Managing the information associated with species records is an important but difficult task. The notation of collectors' names, numbers, and dates varies between collections, and sometimes within them. In addition, it is often difficult to validate the localities, geographical coordinates and identifications associated with individual species records, especially when working with thousands or millions of them. Thus, having tools to process and validate large amounts of records can be quite handy.
plantR is an R package that was developed to manage, standardize, and validate the information associated with species records from biological collections (e.g., herbaria). It can be used for data coming from a single collection or different biodiversity databases, such as GBIF. Moreover, plantR can be used by collection curators to manage their databases and by final users of species records (e.g., taxonomists, ecologists, and conservationists), allowing the comparison of data across collections.
The package plantR provides tools to standardize the information from typical fields associated with species records, such as collectors' and species names. In addition, plantR proposes a comprehensive and reproducible workflow to apply those tools while handling records from biological collections, which includes the following steps:
import or download of species records for a list of species names, collections codes or other search fields;
batch standardization of typical fields (e.g., collector name);
validation of the locality and geographical coordinates of the records, based on maps and gazetteers;
spell-checking and validation of botanical families and species names using different taxonomic backbones (e.g. Flora do Brasil);
assessment of the confidence level of species identifications, based on a global list of plant taxonomists;
retrieval of duplicated specimens across collections, including the homogenization of the information within duplicates;
summary of species data and validation steps, and (fast) export of the validated records by groups (e.g. families or countries).
The tools provided by plantR do not edit the columns with the original information. All the outputs of each editing or validation step are stored in new, separated columns added to the original data. This is important for the collection curation process because it allows the comparison between the original and edited information. However, it increases the number of columns in the dataset, which may become a problem while managing and saving big datasets.
plantR was initially developed to manage plant records from herbaria. Therefore, some of the tools offered by the package are exclusive to plants, particularly the checking of species names. However, some of its main features are expected to work for other groups of organisms as well, as long as the data structure is similar.
Currently, the download of records is available for the Global Biodiversity Information Facility (GBIF), and speciesLink, but the user can also provide their own dataset as an input. Future versions of the package may include the download from data stored in JABOT.
Name editing and standardization cover most of the typical variation in the notation of people's names, trying to provide standardized outputs in the TDWG format. The same applies to collection codes, collector numbers and dates. However, plantR does not handle all possibilities of notation. So, some double-checking and corrections may be needed depending on the user's goals.
Regarding the validation of geographical coordinates. In the case of invalid or missing coordinates, we assume that the locality information associated with the record (e.g. country, state, county) is correct (i.e. locality prevails over coordinates), and so working coordinates are taken from a gazetteer. It is important to note that if the locality information is indeed mistaken (e.g., wrong county name), then even if the original coordinates are good, they will not be validated (record locality and coordinate locality don't match) and may be replaced by coordinates taken from the gazetteer.
Currently, geographical validation can be performed at the county level for Latin American countries and at the country level for the rest of the world. We provide a gazetteer to retrieve and check localities and geographical coordinates, which is currently biased towards Latin American countries, particularly Brazil. Therefore, the validation of geographical coordinates provided by other R packages (e.g. CoordinateCleaner) may be more appropriate for studies extending outside Latin America.
Taxonomic validation is performed based on (i) the correction of plant family and species names (i.e. synonyms, typos) and (ii) the confidence level on the species identification, based on a dictionary of plant taxonomist names from all over the world. For (i), names are currently checked against the Flora do Brasil project using the R packages flora. Previous version used also The Plant List, via the package Taxonstand. But since The Plant List was superseeded, this option is no longer a default. Future versions may include comparisons against other backbones, e.g., the World Flora Online or Tropicos.
During the assessment of the taxonomic confidence level of the identifications, we did not attempt to set priorities for different specialists within a given family. That is, all species names determined by a specialist within their family of expertise are taken as being correct. Although we recognize that there are specialists for genera within a family, the validation process is currently performed only at the family level. In the case of conflicting species identification among family specialists for duplicates across collections, we assume the most recent identification as being the valid one.
plantR provide tools for searching for duplicated records across collections. This search makes more sense when data from different collections are combined and it performs well even when using relatively large datasets (i.e., millions of records).
However, the retrieval of duplicates greatly depends on the completeness of the input information, the notation standards and if plantR is able to handle those differences in notation across collections. In addition, true duplicates may not be found due to typos and false duplicates may be returned if the duplicate search fields are too flexible.
The package can be installed and loaded from GitHub with:
install.packages("remotes") library("remotes") install_github("LimaRAF/plantR") library("plantR")
Users can provide their own dataset, import it from a GBIF DwC-A zip file
(function readData()
) or download data directly from R using one of
plantR download functions. They include the function rspeciesLink()
:
occs_splink <- rspeciesLink(species = "Euterpe edulis")
This function can also be used to search from records based on localities,
collections, and other options (see ?rspeciesLink
for details).
plantR also provides the function rgbif2()
, which is a wrapper of the
function rgbif()
of the rgbif package, with a standardized output:
occs_gbif <- rgbif2(species = "Euterpe edulis")
It is important to make sure that the field names of the input data follow the
DarwinCore format. In plantR this is performed using the function
formatDwc()
, which joins data from different sources (e.g. GBIF and
speciesLink) and standardizes their field names:
occs <- formatDwc(splink_data = occs_splink, gbif_data = occs_gbif)
The names of the collections, collectors, and identifiers, as well as the
collection numbers and dates, can be edited using the function formatOcc()
:
occs <- formatOcc(occs)
The locality information associated with the occurrence data (e.g., country or
city names) can be standardized using the function formatLoc()
:
occs <- formatLoc(occs)
The geographical coordinates are prepared using function formatCoord()
, which
guarantees that they are in a good format for validation (i.e., decimal
degrees). This function also retrieves missing coordinates from a gazetteer
based on the locality information:
occs <- formatCoord(occs)
In this example, although we have downloaded data for a single species (i.e.,
Euterpe edulis Mart.), there are differences in the notation of botanical
family and species names, some of them being synonyms. To obtain only valid
names, we use the function formatTax()
:
occs <- formatTax(occs)
Once the new columns with the edited and standardized information are available,
the records can be validated. The first validation step regards the locality
information, which is done using the function validateLoc()
:
occs <- validateLoc(occs)
The second validation step regards the geographical coordinates of the records,
which is performed using the function validateCoord()
:
occs <- validateCoord(occs)
The next validation step regards the confidence level in the species
identification, which is one of the main plantR features and executed by
function validateTax()
:
occs <- validateTax(occs)
Note that the function returns up to 10 names of determiners not taken as
specialists of the family. The argument miss.taxonomist
can be used to include
missing names of taxonomists (e.g.,
miss.taxonomist = c("Arecaceae_Reis, A.")
).
Another main feature of plantR is the search for duplicates across
herbaria (i.e., same biological specimen with accession numbers in two or more
collections). It uses different combinations of search strings to find direct
and indirect links between records. Besides the search itself, the user can also
homogenize information within groups of duplicates, such as species names or
geographical coordinates. This tool is performed using the function
validateDup()
:
occs <- validateDup(occs)
Once the editing and validation steps are finished, plantR provides tools
for summarizing the occurrence data, using the function summaryData()
. In this
example, the taxonomic summary is quite uninformative, since we have only one
species.
summ <- summaryData(occs)
plantR also provides an overview of the validation results (function
summaryFlags()
):
flags <- summaryFlags(occs)
The package plantR can also build species checklists with vouchers using
the function checkList()
:
checkList(occs, n.vouch = 3, type = "short")
Finally, plantR exports data into a local folder, using function
saveData()
, which can be used to save compressed '.csv' files based on
different grouping fields (e.g., botanical family, country, biological
collection). The export is performed using function fwrite()
from package
data.table which is quite fast even for large datasets.
If you use this package, please cite it as:
Lima, R.A.F., Sánchez-Tapia, A., Mortara, S.R., ter Steege, H., Siqueira, M.F. (2021). plantR: An R package and workflow for managing species records from biological collections. bioRxiv: 2021.04.06.437754. https://doi.org/10.1101/2021.04.06.437754
If you use the function prepSpecies()
, please also cite the following packages
(depending on the database used):
Carvalho, G. (2020) flora: Tools for Interacting with the Brazilian Flora 2020. R package version 0.3.4. https://CRAN.R-project.org/package=flora
If you use the function rgbif2()
, please also cite the following package:
Chamberlain, S. et al. (2021) rgbif: Interface to the Global Biodiversity Information Facility API. R package version 3.5.2. https://CRAN.R-project.org/package=rgbif>.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.