taxonSortPBDBocc | R Documentation |
A function for sorting unique taxa out of Paleobiology Database occurrence downloads. It should accept several different data formats, resulting from different versions of the PBDB API and the different vocabularies available from the API.
taxonSortPBDBocc(
  data,
  rank,
  onlyFormal = TRUE,
  cleanUncertain = TRUE,
  cleanResoValues = c(NA, "\"", "", "n. sp.", "n. gen.", " ", " ")
)
data | A table of occurrence data collected from the Paleobiology Database.
rank | The selected taxon rank; must be one of 'species', 'genus', 'family', 'order', 'class' or 'phylum'.
onlyFormal | If TRUE (the default), only taxa formally accepted by the Paleobiology Database are returned. If FALSE, the identified name fields are also searched for any additional 'informal' taxa of the sought taxonomic rank. If an informal taxon's name happens to match a formal taxon, its occurrences are merged with those of the formal taxon. This argument generally only has an appreciable effect when rank = "species".
cleanUncertain | If TRUE (the default), any occurrences with an entry in the respective 'resolution' field that is *not* found in the argument cleanResoValues are removed.
cleanResoValues | The set of values that can be found in a 'resolution' field that do not cause a taxon to be removed, as they do not seem to indicate taxonomic uncertainty.
Data input for taxonSortPBDBocc are expected to be from version 1.2 of the PBDB API with the 'pbdb' vocabulary. However, datasets are passed to the internal function translatePBDBocc, which attempts to correct any necessary field names and field contents used by taxonSortPBDBocc.
This function can pull either just the 'formally' identified and synonymized taxa in a given table of occurrence data, or additionally pull occurrences listed under informal taxa of the sought taxonomic rank. Only formal taxa are sorted by default; this is controlled by argument onlyFormal.
Pulling the informally-listed taxonomic
occurrences is often necessary in some groups that have received
little focused taxonomic effort, such that many
species are linked to their generic taxon ID and never received
a species-level taxonomic ID in the PBDB.
Pulling both formal and informally listed taxonomic occurrences
is a hierarchical process and performed in
stages: formal taxa are identified first, informal taxa are
identified from the occurrences that are
'leftover', and informal occurrences with name labels
that match a previously sorted formally listed
taxon are concatenated to the 'formal' occurrences for that same taxon,
rather than being listed under separate elements
of the list as if they were separate taxa.
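The staged process described above can be illustrated with a minimal base R sketch. This is not the code used inside taxonSortPBDBocc; the toy table, its column names (accepted_name, accepted_rank, identified_name, identified_rank) and the helper sortStagesSketch are assumptions made purely for illustration.

```r
# Toy occurrence table; column names and taxon names are illustrative
# assumptions, not a guarantee of what translatePBDBocc() produces.
occ <- data.frame(
  occurrence_no   = 1:6,
  accepted_name   = c("Aus bus", "Aus bus", "Aus", "Aus", "Aus", "Cus dus"),
  accepted_rank   = c("species", "species", "genus", "genus", "genus", "species"),
  identified_name = c("Aus bus", "Aus cus", "Aus bus", "Aus eus", "Aus eus", "Cus dus"),
  identified_rank = rep("species", 6),
  stringsAsFactors = FALSE
)

# Hypothetical helper sketching the three stages
sortStagesSketch <- function(occ, rank = "species") {
  # Stage 1: formal taxa, sorted by their accepted (synonymized) name
  isFormal <- occ$accepted_rank == rank
  sorted <- split(occ[isFormal, ], occ$accepted_name[isFormal])
  # Stage 2: informal taxa, identified only among the 'leftover' occurrences
  leftover <- occ[!isFormal & occ$identified_rank == rank, ]
  informal <- split(leftover, leftover$identified_name)
  # Stage 3: informal occurrences whose name matches a formal taxon are
  # appended to that taxon; all others become new list elements
  for (taxonName in names(informal)) {
    if (taxonName %in% names(sorted)) {
      sorted[[taxonName]] <- rbind(sorted[[taxonName]], informal[[taxonName]])
    } else {
      sorted[[taxonName]] <- informal[[taxonName]]
    }
  }
  sorted
}

sortedOcc <- sortStagesSketch(occ)
names(sortedOcc)
sapply(sortedOcc, nrow)
```

In this toy case, occurrence 2 is kept under 'Aus bus' because of its accepted name, and occurrence 3 (informally listed as 'Aus bus') is appended to the same element rather than forming a separate taxon.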
This function is simpler than the similar functions that inspired it, because it uses the input rank both to filter occurrences and to directly reference a taxon's accepted taxonomic placement, rather than relying on a series of specific if() checks. Unlike some similar functions in other packages, such as pbdb_temp_range in version 0.3 of paleobioDB, taxonSortPBDBocc does not check if sorted taxa have a single 'taxon_no' ID number. This makes the blanket assumption that if a taxon's listed name in the relevant fields is identical, the taxon is identical, with the important caveat that occurrences with accepted formal synonymies are sorted first based on their accepted names, followed by taxa without formal taxon IDs. This should avoid linking the same occurrences to multiple taxa by mistake, or assigning occurrences listed under separate formal taxa to the same taxon based on their 'identified' taxon name, as long as all formal taxa have unique names (note: this is an untested assumption).
In some cases, this procedure is helpful, such as when taxa with identical generic and species names are listed under separate taxon ID numbers because of a difference in the listed subgenus for some occurrences (for example, 'Pseudoclimacograptus (Metaclimacograptus) hughesi' and 'Pseudoclimacograptus hughesi' in the PBDB as of 03/01/2015). Presumably, the amount of data affected by differences in this procedure is very minor.
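The difference between grouping by name and grouping by ID number can be seen in a small sketch. The table below is a toy: the column names and especially the taxon_no values are made up for illustration, and 'Aus bus' is a placeholder name.

```r
# Toy table: the 'taxon_no' values are invented for illustration only
occ <- data.frame(
  taxon_name = c("Pseudoclimacograptus hughesi",
                 "Pseudoclimacograptus hughesi",
                 "Aus bus"),
  taxon_no = c(101, 102, 103),
  stringsAsFactors = FALSE
)

# Grouping by name treats the first two occurrences as a single taxon
byName <- split(occ, occ$taxon_name)
# Grouping by ID number would keep them apart
byID <- split(occ, occ$taxon_no)
length(byName)
length(byID)

# A possible check for names that span more than one ID number
idsPerName <- vapply(
  split(occ$taxon_no, occ$taxon_name),
  function(x) length(unique(x)),
  integer(1)
)
multiID <- names(idsPerName)[idsPerName > 1]
multiID
```

A check like the last one could be run on a real download to see how many taxa are affected by the name-based assumption.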
Occurrences with taxonomic uncertainty indicators in the listed identified taxon name are removed by default, as controlled by argument cleanUncertain. This is done by removing any occurrences that have an entry in primary_reso (was 'genus_reso' in the v1.1 API) when rank is a supraspecific level, and in species_reso when rank = species, if that entry is not found in cleanResoValues. In some rare cases, when onlyFormal = FALSE, supraspecific taxon names may be returned in the output that have various 'cruft' attached, like 'n.sp.'.
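As a rough sketch of the filtering rule just described, the following base R code keeps only occurrences whose resolution entry is found in cleanResoValues. The helper dropUncertainSketch and the toy table are hypothetical; only the field names primary_reso and species_reso come from the text above.

```r
# Values treated as 'not uncertain' (a shortened version of the default)
cleanResoValues <- c(NA, "\"", "", "n. sp.", "n. gen.", " ")

# Toy occurrence table with placeholder taxon names
occ <- data.frame(
  primary_name = c("Aus", "Aus", "Bus", "Bus", "Cus"),
  primary_reso = c(NA, "cf.", "n. gen.", "?", ""),
  species_name = c("bus", "bus", "dus", "dus", "eus"),
  species_reso = c("aff.", NA, "n. sp.", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pick the resolution field based on rank, then keep
# rows whose entry is in cleanResoValues (NA matches NA under %in%)
dropUncertainSketch <- function(occ, rank, cleanResoValues) {
  resoField <- if (rank == "species") "species_reso" else "primary_reso"
  isClean <- occ[[resoField]] %in% cleanResoValues
  occ[isClean, , drop = FALSE]
}

genusClean <- dropUncertainSketch(occ, "genus", cleanResoValues)
speciesClean <- dropUncertainSketch(occ, "species", cleanResoValues)
nrow(genusClean)
nrow(speciesClean)
```

Note that including NA in cleanResoValues is what allows occurrences with no resolution entry at all to be kept.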
Empty values in the input data table ("") are converted to NAs, as they may be due to issues with using read.csv to convert API-downloaded data.
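The conversion of empty strings to NA might look something like the sketch below. The CSV text and the helper blanksToNA are assumptions for illustration, not the function's internal code.

```r
# Toy CSV text standing in for an API download; blank fields in
# character columns are read by read.csv() as "" rather than NA
csvText <- "occurrence_no,accepted_name,species_reso\n1,Aus bus,\n2,,cf.\n3,Cus dus,\n"
raw <- read.csv(text = csvText, stringsAsFactors = FALSE)

# Hypothetical helper: replace "" with NA in every character column
blanksToNA <- function(x) {
  x[] <- lapply(x, function(col) {
    if (is.character(col)) {
      col[!is.na(col) & col == ""] <- NA
    }
    col
  })
  x
}

cleaned <- blanksToNA(raw)
sum(is.na(raw))
sum(is.na(cleaned))
```

When reading a file directly, a similar effect can be had by passing na.strings = c("NA", "") to read.csv.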
Returns a list where each element is a different unique taxon obtained by the sorting function, named with that taxon's name. Each element is a table containing all the same occurrence data fields as the input (potentially with some fields renamed and some field contents changed, due to vocabulary translation).
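Because the output is a named list of tables, it can be summarized with ordinary list tools. The sketch below uses a hand-built stand-in for the output; the taxon names, the max_ma column and its values are toy assumptions.

```r
# Toy stand-in for the list returned by taxonSortPBDBocc
sortedOcc <- list(
  "Aus bus" = data.frame(occurrence_no = c(1, 2, 3), max_ma = c(460, 455, 458)),
  "Cus dus" = data.frame(occurrence_no = 4, max_ma = 450)
)

# Number of occurrences per taxon
nOcc <- vapply(sortedOcc, nrow, integer(1))

# A per-taxon summary of one field
oldest <- vapply(sortedOcc, function(x) max(x$max_ma), numeric(1))

# Stack the list back into one table, labelled by taxon name
stacked <- do.call(
  rbind,
  Map(function(x, nm) cbind(taxon = nm, x), sortedOcc, names(sortedOcc))
)
nOcc
oldest
nrow(stacked)
```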
David W. Bapst, but partly inspired by Matthew Clapham's cleanTaxon (found at this location on github) and the R package paleobioDB's pbdb_temp_range function (found at this location on github).
Peters, S. E., and M. McClennen. 2015. The Paleobiology Database application programming interface. Paleobiology 42(1):1-7.
Occurrence data as commonly used with paleotree functions can be obtained with getPBDBocc. Occurrence data sorted by this function might be used with functions occData2timeList and plotOccData. Also, see the example graptolite dataset at graptPBDB.
# taxonSortPBDBocc and getPBDBocc are provided by paleotree
library(paleotree)

# Note that most examples here using getPBDBocc()
# use the argument 'failIfNoInternet = FALSE'
# so that functions do not error out
# but simply return NULL if internet
# connection is not available, and thus
# fail gracefully rather than error out (required by CRAN).
# Remove this argument or set to TRUE so functions DO fail
# when internet resources (paleobiodb) are not available.

# getting occurrence data for a genus, sorting it
# first example: Dicellograptus
dicelloData <- getPBDBocc("Dicellograptus",
    failIfNoInternet = FALSE)

if(!is.null(dicelloData)){
    dicelloOcc2 <- taxonSortPBDBocc(
        data = dicelloData,
        rank = "species",
        onlyFormal = FALSE
        )
    names(dicelloOcc2)
    }

# try a PBDB API download with lots of synonymization
# this should have only 1 species
# *old* way, using v1.1 of PBDB API:
# acoData <- read.csv(paste0(
#     "https://paleobiodb.org/data1.1/occs/list.txt?",
#     "base_name=Acosarina%20minuta&show=ident,phylo"))
#
# *new* method - with getPBDBocc, using v1.2 of PBDB API:
acoData <- getPBDBocc("Acosarina minuta",
    failIfNoInternet = FALSE)

if(!is.null(acoData)){
    acoOcc <- taxonSortPBDBocc(
        data = acoData,
        rank = "species",
        onlyFormal = FALSE
        )
    names(acoOcc)
    }

###########################################

# load example graptolite PBDB occ dataset
data(graptPBDB)

# get formal genera
occGenus <- taxonSortPBDBocc(
    data = graptOccPBDB,
    rank = "genus"
    )
length(occGenus)

# get formal species
occSpeciesFormal <- taxonSortPBDBocc(
    data = graptOccPBDB,
    rank = "species"
    )
length(occSpeciesFormal)

# yes, there are fewer 'formal'
# graptolite species in the PBDB than genera

# get formal and informal species
occSpeciesInformal <- taxonSortPBDBocc(
    data = graptOccPBDB,
    rank = "species",
    onlyFormal = FALSE
    )
length(occSpeciesInformal)

# way more graptolite species are 'informal' in the PBDB

# get formal and informal species
# including from occurrences with uncertain taxonomy
# basically everything and the kitchen sink
occSpeciesEverything <- taxonSortPBDBocc(
    data = graptOccPBDB,
    rank = "species",
    onlyFormal = FALSE,
    cleanUncertain = FALSE
    )
length(occSpeciesEverything)