taxonSortPBDBocc: Sorting Unique Taxa of a Given Rank from Paleobiology...
In paleotree: Paleontological and Phylogenetic Analyses of Evolution

taxonSortPBDBocc

R Documentation

Sorting Unique Taxa of a Given Rank from Paleobiology Database Occurrence Data

Description

Functions for sorting out unique taxa from Paleobiology Database occurrence downloads, which should accept several different formats resulting from different versions of the PBDB API and different vocabularies available from the API.

Usage

taxonSortPBDBocc(
  data,
  rank,
  onlyFormal = TRUE,
  cleanUncertain = TRUE,
  cleanResoValues = c(NA, "\"", "", "n. sp.", "n. gen.", " ", "  ")
)

Arguments

`data`	A table of occurrence data collected from the Paleobiology Database.
`rank`	The selected taxon rank; must be one of 'species', 'genus', 'family', 'order', 'class' or 'phylum'.
`onlyFormal`	If TRUE (the default) only taxa formally accepted by the Paleobiology Database are returned. If FALSE, then the identified name fields are searched for any additional 'informal' taxa with the proper taxon. If their taxon name happens to match any formal taxa, their occurrences are merged onto the formal taxa. This argument generally has any appreciable effect when rank = species.
`cleanUncertain`	If TRUE (the default) any occurrences with an entry in the respective 'resolution' field that is not found in the argument `cleanResoValue` will be removed from the dataset. These are assumed to be values indicating taxonomic uncertainty, i.e. 'cf.' or '?'.
`cleanResoValues`	The set of values that can be found in a 'resolution' field that do not cause a taxon to be removed, as they do not seem to indicate taxonomic uncertainty.

Details

Data input for taxonSortPBDBocc are expected to be from version 1.2 API with the 'pbdb' vocabulary. However, datasets are passed to internal function translatePBDBocc, which attempts to correct any necessary field names and field contents used by taxonSortPBDBocc.

This function can pull either just the 'formally' identified and synonymized taxa in a given table of occurrence data or pull in addition occurrences listed under informal taxa of the sought taxonomic rank. Only formal taxa are sorted by default; this is controlled by argument onlyFormal. Pulling the informally-listed taxonomic occurrences is often necessary in some groups that have received little focused taxonomic effort, such that many species are linked to their generic taxon ID and never received a species-level taxonomic ID in the PBDB. Pulling both formal and informally listed taxonomic occurrences is a hierarchical process and performed in stages: formal taxa are identified first, informal taxa are identified from the occurrences that are 'leftover', and informal occurrences with name labels that match a previously sorted formally listed taxon are concatenated to the 'formal' occurrences for that same taxon, rather than being listed under separate elements of the list as if they were separate taxa. This function is simpler than similar functions that inspired it by using the input"rank" to both filter occurrences and directly reference a taxon's accepted taxonomic placement, rather than a series of specific if() checks. Unlike some similar functions in other packages, such as version 0.3 paleobioDB's pbdb_temp_range, taxonSortPBDBocc does not check if sorted taxa have a single 'taxon_no' ID number. This makes the blanket assumption that if a taxon's listed name in relevant fields is identical, the taxon is identical, with the important caveat that occurrences with accepted formal synonymies are sorted first based on their accepted names, followed by taxa without formal taxon IDs. This should avoid linking the same occurrences to multiple taxa by mistake, or assigning occurrences listed under separate formal taxa to the same taxon based on their 'identified' taxon name, as long as all formal taxa have unique names (note: this is an untested assumption). In some cases, this procedure is helpful, such as when taxa with identical generic and species names are listed under separate taxon ID numbers because of a difference in the listed subgenus for some occurrences (example, "Pseudoclimacograptus (Metaclimacograptus) hughesi' and 'Pseudoclimacograptus hughesi' in the PBDB as of 03/01/2015). Presumably any data that would be affected by differences in this procedure is very minor.

Occurrences with taxonomic uncertainty indicators in the listed identified taxon name are removed by default, as controlled by argument cleanUncertain. This is done by removing any occurrences that have an entry in primary_reso (was "genus_reso" in v1.1 API) when rank is a supraspecific level, and species_reso when rank = species, if that entry is not found in cleanResoValues. In some rare cases, when onlyFormal = FALSE, supraspecific taxon names may be returned in the output that have various 'cruft' attached, like 'n.sp.'.

Empty values in the input data table ("") are converted to NAs, as they may be due to issues with using read.csv to convert API-downloaded data.

Value

Returns a list where each element is different unique taxon obtained by the sorting function, and named with that taxon name. Each element is composed of a table containing all the same occurrence data fields as the input (potentially with some fields renamed and some field contents change, due to vocabulary translation).

Author(s)

David W. Bapst, but partly inspired by Matthew Clapham's cleanTaxon (found at this location on github) and R package paleobioDB's pbdb_temp_range function (found at this location on github.

References

Peters, S. E., and M. McClennen. 2015. The Paleobiology Database application programming interface. Paleobiology 42(1):1-7.

Examples

# Note that most examples here using getPBDBocc()
    # use the  argument 'failIfNoInternet = FALSE'
    # so that functions do not error out 
    # but simply return NULL if internet
    # connection is not available, and thus
    # fail gracefully rather than error out (required by CRAN).
# Remove this argument or set to TRUE so functions DO fail
    # when internet resources (paleobiodb) is not available.



# getting occurrence data for a genus, sorting it
# firest example: Dicellograptus

dicelloData <- getPBDBocc("Dicellograptus", 
    failIfNoInternet = FALSE)
    
if(!is.null(dicelloData)){  

dicelloOcc2 <- taxonSortPBDBocc(
   data = dicelloData, 
   rank = "species",
   onlyFormal = FALSE
   )
names(dicelloOcc2)

}

# try a PBDB API download with lots of synonymization
	#this should have only 1 species
# *old* way, using v1.1 of PBDB API:
# acoData <- read.csv(paste0(
#	"https://paleobiodb.org/data1.1/occs/list.txt?",
#	"base_name = Acosarina%20minuta&show=ident,phylo"))
#
# *new* method - with getPBDBocc, using v1.2 of PBDB API:
acoData <- getPBDBocc("Acosarina minuta", 
    failIfNoInternet = FALSE)
    
if(!is.null(acoData)){  

acoOcc <- taxonSortPBDBocc(
   data = acoData, 
   rank = "species", 
   onlyFormal = FALSE
   )
names(acoOcc)

}



###########################################

#load example graptolite PBDB occ dataset
data(graptPBDB)

#get formal genera
occGenus <- taxonSortPBDBocc(
   data = graptOccPBDB,
   rank = "genus"
   )
length(occGenus)

#get formal species
occSpeciesFormal <- taxonSortPBDBocc(
   data = graptOccPBDB,
   rank = "species")
length(occSpeciesFormal)

#yes, there are fewer 'formal'
   # graptolite species in the PBDB then genera

#get formal and informal species
occSpeciesInformal <- taxonSortPBDBocc(
   data = graptOccPBDB, 
   rank = "species",
   onlyFormal = FALSE
   )
length(occSpeciesInformal)

#way more graptolite species are 'informal' in the PBDB

#get formal and informal species 
	#including from occurrences with uncertain taxonomy
	#basically everything and the kitchen sink
occSpeciesEverything <- taxonSortPBDBocc(
   data = graptOccPBDB, 
   rank = "species",
   onlyFormal = FALSE, 
   cleanUncertain = FALSE)
length(occSpeciesEverything)

paleotree documentation built on Aug. 22, 2022, 9:09 a.m.