Nothing
#' Sorting Unique Taxa of a Given Rank from Paleobiology Database Occurrence Data
#'
#' Functions for sorting out unique taxa from Paleobiology Database occurrence downloads,
#' which should accept several different formats resulting from different versions of the
#' PBDB API and different vocabularies available from the API.
#' @details
#' Data input for \code{taxonSortPBDBocc} are expected to be from version 1.2 API
#' with the 'pbdb' vocabulary. However, datasets are
#' passed to internal function \code{translatePBDBocc},
#' which attempts to correct any necessary field names and field contents used by
#' \code{taxonSortPBDBocc}.
#'
#' This function can pull either \emph{just} the 'formally' identified
#' and synonymized taxa in a given table of occurrence
#' data or pull \emph{in addition} occurrences listed under informal
#' taxa of the sought taxonomic rank. Only formal taxa
#' are sorted by default; this is controlled by argument \code{onlyFormal}.
#' Pulling the informally-listed taxonomic
#' occurrences is often necessary in some groups that have received
#' little focused taxonomic effort, such that many
#' species are linked to their generic taxon ID and never received
#' a species-level taxonomic ID in the PBDB.
#' Pulling both formal and informally listed taxonomic occurrences
#' is a hierarchical process and performed in
#' stages: formal taxa are identified first, informal taxa are
#' identified from the occurrences that are
#' 'leftover', and informal occurrences with name labels
#' that match a previously sorted formally listed
#' taxon are concatenated to the 'formal' occurrences for that same taxon,
#' rather than being listed under separate elements
#' of the list as if they were separate taxa.
#' This function is simpler than similar functions that inspired it
#' by using the input"rank" to both filter occurrences and directly
#' reference a taxon's accepted taxonomic placement, rather than a
#' series of specific \code{if()} checks. Unlike some similar functions
#' in other packages, such as version 0.3 \code{paleobioDB}'s
#' \code{pbdb_temp_range}, \code{taxonSortPBDBocc} does not check
#' if sorted taxa have a single 'taxon_no' ID number. This makes the blanket
#' assumption that if a taxon's listed name in relevant fields is identical,
#' the taxon is identical, with the important caveat that occurrences with
#' accepted formal synonymies are sorted first based on their accepted names, followed by
#' taxa without formal taxon IDs. This should avoid
#' linking the same occurrences to multiple taxa by mistake, or assigning
#' occurrences listed under separate formal taxa to the same taxon
#' based on their 'identified' taxon name, as long as all
#' formal taxa have unique names (note: this is an untested assumption).
#' In some cases, this procedure is helpful, such as when
#' taxa with identical generic and species names are listed under
#' separate taxon ID numbers because of a difference in the
#' listed subgenus for some occurrences (example,
#' "Pseudoclimacograptus (Metaclimacograptus) hughesi' and
#' 'Pseudoclimacograptus hughesi' in the PBDB as of 03/01/2015).
#' Presumably any data that would be affected by differences
#' in this procedure is very minor.
#'
#' Occurrences with taxonomic uncertainty indicators in
#' the listed identified taxon name are removed
#' by default, as controlled by argument \code{cleanUncertain}.
#' This is done by removing any occurrences that
#' have an entry in \code{primary_reso} (was
#' "\code{genus_reso}" in v1.1 API) when \code{rank} is a
#' supraspecific level, and \code{species_reso} when \code{rank = species},
#' if that entry is not found in
#' \code{cleanResoValues}. In some rare cases, when
#' \code{onlyFormal = FALSE}, supraspecific taxon names may be
#' returned in the output that have various 'cruft' attached, like 'n.sp.'.
#'
#' Empty values in the input data table ("") are converted
#' to NAs, as they may be due to issues
#' with using read.csv to convert API-downloaded data.
#' @param data A table of occurrence data collected from the Paleobiology Database.
#' @param rank The selected taxon rank; must be
#' one of 'species', 'genus', 'family', 'order',
#' 'class' or 'phylum'.
#' @param onlyFormal If TRUE (the default) only taxa formally accepted by the Paleobiology
#' Database are returned. If FALSE, then the identified name fields are searched for any
#' additional 'informal' taxa with the proper taxon. If their taxon name happens to match
#' any formal taxa, their occurrences are merged onto the formal taxa. This argument generally
#' has any appreciable effect when rank = species.
#' @param cleanUncertain If TRUE (the default) any
#' occurrences with an entry in the respective
#' 'resolution' field that is *not* found in the
#' argument \code{cleanResoValue} will be removed from
#' the dataset. These are assumed to be values
#' indicating taxonomic uncertainty, i.e. 'cf.' or '?'.
#' @param cleanResoValues The set of values that can
#' be found in a 'resolution' field that do not
#' cause a taxon to be removed, as they do not
#' seem to indicate taxonomic uncertainty.
#' @return
#' Returns a list where each element is different unique taxon obtained by the sorting function,
#' and named with that taxon name. Each element is composed of a table containing all the same
#' occurrence data fields as the input (potentially with some fields renamed and some field
#' contents change, due to vocabulary translation).
#' @seealso
#' Occurrence data as commonly used with \code{paleotree} functions can
#' be obtained with \code{link{getPBDBocc}}. Occurrence data sorted by
#' this function might be used with functions \code{\link{occData2timeList}}
#' and \code{\link{plotOccData}}. Also, see the example graptolite dataset
#' at \code{\link{graptPBDB}}
#' @author
#' David W. Bapst, but partly inspired by Matthew Clapham's \code{cleanTaxon} (found at
#' \href{https://github.com/mclapham/PBDB-R-scripts/blob/master/taxonClean.R}{this location}
#' on github) and R package \code{paleobioDB}'s \code{pbdb_temp_range} function (found at
#' \href{https://github.com/ropensci/paleobioDB/blob/master/R/pbdb_temporal_functions.R#L64-178 }{this location}
#' on github.
#' @references
#' Peters, S. E., and M. McClennen. 2015. The Paleobiology Database
#' application programming interface. \emph{Paleobiology} 42(1):1-7.
#' @examples
#' # Note that most examples here using getPBDBocc()
#' # use the argument 'failIfNoInternet = FALSE'
#' # so that functions do not error out
#' # but simply return NULL if internet
#' # connection is not available, and thus
#' # fail gracefully rather than error out (required by CRAN).
#' # Remove this argument or set to TRUE so functions DO fail
#' # when internet resources (paleobiodb) is not available.
#'
#' \donttest{
#'
#' # getting occurrence data for a genus, sorting it
#' # firest example: Dicellograptus
#'
#' dicelloData <- getPBDBocc("Dicellograptus",
#' failIfNoInternet = FALSE)
#'
#' if(!is.null(dicelloData)){
#'
#' dicelloOcc2 <- taxonSortPBDBocc(
#' data = dicelloData,
#' rank = "species",
#' onlyFormal = FALSE
#' )
#' names(dicelloOcc2)
#'
#' }
#'
#' # try a PBDB API download with lots of synonymization
#' #this should have only 1 species
#' # *old* way, using v1.1 of PBDB API:
#' # acoData <- read.csv(paste0(
#' # "https://paleobiodb.org/data1.1/occs/list.txt?",
#' # "base_name = Acosarina%20minuta&show=ident,phylo"))
#' #
#' # *new* method - with getPBDBocc, using v1.2 of PBDB API:
#' acoData <- getPBDBocc("Acosarina minuta",
#' failIfNoInternet = FALSE)
#'
#' if(!is.null(acoData)){
#'
#' acoOcc <- taxonSortPBDBocc(
#' data = acoData,
#' rank = "species",
#' onlyFormal = FALSE
#' )
#' names(acoOcc)
#'
#' }
#'
#' }
#'
#' ###########################################
#'
#' #load example graptolite PBDB occ dataset
#' data(graptPBDB)
#'
#' #get formal genera
#' occGenus <- taxonSortPBDBocc(
#' data = graptOccPBDB,
#' rank = "genus"
#' )
#' length(occGenus)
#'
#' #get formal species
#' occSpeciesFormal <- taxonSortPBDBocc(
#' data = graptOccPBDB,
#' rank = "species")
#' length(occSpeciesFormal)
#'
#' #yes, there are fewer 'formal'
#' # graptolite species in the PBDB then genera
#'
#' #get formal and informal species
#' occSpeciesInformal <- taxonSortPBDBocc(
#' data = graptOccPBDB,
#' rank = "species",
#' onlyFormal = FALSE
#' )
#' length(occSpeciesInformal)
#'
#' #way more graptolite species are 'informal' in the PBDB
#'
#' #get formal and informal species
#' #including from occurrences with uncertain taxonomy
#' #basically everything and the kitchen sink
#' occSpeciesEverything <- taxonSortPBDBocc(
#' data = graptOccPBDB,
#' rank = "species",
#' onlyFormal = FALSE,
#' cleanUncertain = FALSE)
#' length(occSpeciesEverything)
#'
#'
#'
# old examples involving compact vocabulary
#
#make sure works with compact vocab v1.2
#dicelloData <- read.csv(paste0("https://paleobiodb.org",
# "/data1.2/occs/list.txt?base_name=Dicellograptus",
# "&show=ident,phylo&vocab=com"))
#dicelloOccCom1 <- taxonSortPBDBocc(dicelloData,
# rank = "species", onlyFormal = FALSE)
#names(dicelloOccCom1)
#head(dicelloOccCom1[[1]])[,1:7]
#make sure works with compact vocab v1.2
#dicelloData <- read.csv(paste0("https://paleobiodb.org",
# "/data1.2/occs/list.txt?base_name=Dicellograptus",
# "&show=ident,phylo&vocab=com"))
#dicelloOccCom1 <- taxonSortPBDBocc(dicelloData,
# rank = "species", onlyFormal = FALSE)
#names(dicelloOccCom1)
#head(dicelloOccCom1[[1]])[,1:7]
#' @name taxonSortPBDBocc
#' @rdname taxonSortPBDBocc
#' @export
taxonSortPBDBocc <- function(data,rank, onlyFormal = TRUE,
cleanUncertain = TRUE,
cleanResoValues = c(NA, '"', "", "n. sp.", "n. gen."," "," ")
){
########################################
#this function inspired by Matt Clapham's taxonClean and paleobioDB's pbdb_temp_range
#onlyFormal = FALSE;rank = "species"
#onlyFormal = FALSE;rank = "genus"
#pull occurrences out of a data table and sort by unique taxa into a list structure
#translated vocabs!
data <- translatePBDBocc(data)
#Second, some warning checks to see if occurrences downloaded correctly:
# need phylo and ident data
#the following is taken with minor modification from code in paleobioDB package
if (!any("primary_name" == colnames(data)) | !any("genus" == colnames(data))){
stop(paste0("need to add 'show = c('phylo', 'ident')' to",
" pbdb_occurrences query\n *or*\n 'show = ident,phylo' to PBDB API query"))
}
#additional checks for rank
if(length(rank) != 1){stop("length of rank must be 1")}
if(!any(sapply(
c("species","genus","family","order","class","phylum"),
function(x) x == rank))
){
stop("rank must be one of 'species', 'genus', 'family', 'order', 'class' or 'phylum'")
}
#for inconsistencies between rank and onlyFormal - NOT TRUE, IGNORE THIS CHECK
#if(!onlyFormal & (rank != "species" | rank != "genus")){
# stop("Informal taxon does not exist for above genus level, please use 'onlyFormal = TRUE'")}
#need to replace any empty string values with NAs (due perhaps to use of read.csv with the API)
data[data == ""] <- NA
#now, pull taxa using the relevant variable from 'phylo' for formal ID
#this matches what paleobioDB now does
#sort occurrences by unique taxa in each level and then append valid names
if(rank == "species"){
#remove uncertain species taxonomy
if(cleanUncertain){
data <- data[data[,"species_reso"] %in% cleanResoValues,,drop = FALSE]
}
#first formal taxa
#get species names for taxa formally recognized as species level
whichFormal <- (data[,"accepted_rank"] == rank)
taxonVar <- as.character(data[,"accepted_name"])
taxaNames <- as.character(unique(taxonVar[whichFormal])) #get unique taxa
#drop empty entries (occurrences listed at a higher taxonomic level, formally)
taxaNames <- taxaNames[!is.na(taxaNames)]
#take rows these taxa are in and drop them into elements of a list
sortedOcc <- lapply(taxaNames,function(x) data[which(x == taxonVar),,drop = FALSE])
names(sortedOcc) <- taxaNames
if(!onlyFormal){
#now use taxon_rank to identify useful informal occurrence
taxonVar2 <- data[,c("primary_name","species_name")]
taxonVar2 <- apply(taxonVar2,1,paste,collapse = " ")
#which of these occs are useful as informal occs
stillUseful <- which(!whichFormal & (data[,"identified_rank"] == rank))
taxaNames2 <- as.character(unique(taxonVar2[stillUseful]))
#drop any weird empties
taxaNames2 <- taxaNames2[taxaNames2 != " " & !is.na(taxaNames2)]
sortedOcc2 <- lapply(taxaNames2,function(x) data[which(x == taxonVar2),,drop = FALSE])
names(sortedOcc2) <- taxaNames2
# merge sortedocc2 with sortedOcc
share1 <- sapply(taxaNames,function(x) any(x == taxaNames2))
share2 <- sapply(taxaNames2,function(x) any(x == taxaNames))
if(sum(!share1)>0){sortedOccU <- sortedOcc[!share1]}else{sortedOccU <- list()}
if(sum(!share2)>0){sortedOcc2U <- sortedOcc2[!share2]}else{sortedOcc2U <- list()}
#and for shared names
if(sum(share1)>0){
shared <- lapply(taxaNames[share1],function(x)
cbind(sortedOcc[[taxaNames == x]],sortedOcc2[[taxaNames2 == x]]))
names(shared) <- names(taxaNames[share1])
}else{shared <- list()}
sortedOcc <- c(sortedOccU,sortedOcc2U,shared)
}
}else{ #if not at the species rank
if(cleanUncertain){ #removing uncertain taxonomy if appended to primary_name
data <- data[data[,"primary_reso"] %in% cleanResoValues,,drop = FALSE]
}
taxonVar <- data[,rank] #then our taxonomic variable of interest is just so!
taxaNames <- as.character(unique(taxonVar)) #get unique taxa
taxaNames <- taxaNames[!is.na(taxaNames)]
#take rows these taxa are in and drop them into elements of a list
sortedOcc <- lapply(taxaNames,function(x) data[which(x == taxonVar),,drop = FALSE])
names(sortedOcc) <- taxaNames
if(!onlyFormal){
#now use taxon_rank to identify useful informal occurrence
taxonVar2 <- data[,"primary_name"]
#which of these occs are useful as informal occs
stillUseful <- which(is.na(taxonVar) & (data[,"identified_rank"] == rank))
taxaNames2 <- as.character(unique(taxonVar2[stillUseful]))
taxaNames2 <- taxaNames2[!is.na(taxaNames2)]
sortedOcc2 <- lapply(taxaNames2,function(x) data[which(x == taxonVar2),,drop = FALSE])
names(sortedOcc2) <- taxaNames2
# merge sortedocc2 with sortedOcc
share1 <- sapply(taxaNames,function(x) any(x == taxaNames2))
share2 <- sapply(taxaNames2,function(x) any(x == taxaNames))
if(sum(!share1)>0){sortedOccU <- sortedOcc[!share1]}else{sortedOccU <- list()}
if(sum(!share2)>0){sortedOcc2U <- sortedOcc2[!share2]}else{sortedOcc2U <- list()}
#and for shared names
if(sum(share1)>0){
shared <- lapply(taxaNames[share1],function(x)
cbind(sortedOcc[[taxaNames == x]],sortedOcc2[[taxaNames2 == x]]))
names(shared) <- names(taxaNames[share1])
}else{shared <- list()}
sortedOcc <- c(sortedOccU,sortedOcc2U,shared)
}
}
# sort occurrences by taxon name
sortedOcc <- sortedOcc[order(names(sortedOcc))]
return(sortedOcc)
}
translatePBDBocc <- function(data){
#translate PBDB occ data
if(any(colnames(data) == "taxon_name")){
#from PBDB API version 1.1 to 1.2
colnames(data)[colnames(data) == "taxon_name"] <- "identified_name"
colnames(data)[colnames(data) == "taxon_no"] <- "identified_no"
colnames(data)[colnames(data) == "taxon_rank"] <- "identified_rank"
colnames(data)[colnames(data) == "matched_name"] <- "accepted_name"
colnames(data)[colnames(data) == "matched_no"] <- "accepted_no"
colnames(data)[colnames(data) == "matched_rank"] <- "accepted_rank"
colnames(data)[colnames(data) == "genus_name"] <- "primary_name"
colnames(data)[colnames(data) == "genus_reso"] <- "primary_reso"
}
if(any("tna" == colnames(data))){
#need to translate data from 1 vocab to another
#do on by-function basis - this is for taxonSort
#for this just translate colname of relevant taxon variables
#also will need to translate 'taxon_rank' and 'matched_rank' contents
if(any("mna" == colnames(data))){
#then compact vocab v1.1
if(!all(c("tna","rnk","tid","mna","mra","mid","idt","ids",
"gnl","fml","odl","cll","phl","rst","rss") %in% colnames(data))){
stop("Not all fields founds for compact vocab v1.1")}
colnames(data)[colnames(data) == "tna"] <- "identified_name"
colnames(data)[colnames(data) == "tid"] <- "identified_no"
colnames(data)[colnames(data) == "rnk"] <- "identified_rank"
colnames(data)[colnames(data) == "mna"] <- "accepted_name"
colnames(data)[colnames(data) == "mid"] <- "accepted_no"
colnames(data)[colnames(data) == "mra"] <- "accepted_rank"
colnames(data)[colnames(data) == "idt"] <- "primary_name"
colnames(data)[colnames(data) == "ids"] <- "species_name"
colnames(data)[colnames(data) == "rss"] <- "primary_reso"
colnames(data)[colnames(data) == "rst"] <- "species_reso"
colnames(data)[colnames(data) == "gnl"] <- "genus"
colnames(data)[colnames(data) == "fml"] <- "family"
colnames(data)[colnames(data) == "odl"] <- "order"
colnames(data)[colnames(data) == "cll"] <- "class"
colnames(data)[colnames(data) == "phl"] <- "phylum"
}
if(any("idn" == colnames(data))){
#then compact vocab v1.2
if(!all(c( "idn","idr","iid","tna","rnk","tid","idg","ids",
"gnl","fml","odl","cll","phl","rsg","rss") %in% colnames(data))){
stop("Not all fields founds for compact vocab v1.1")}
colnames(data)[colnames(data) == "idn"] <- "identified_name"
colnames(data)[colnames(data) == "iid"] <- "identified_no"
colnames(data)[colnames(data) == "idr"] <- "identified_rank"
colnames(data)[colnames(data) == "tna"] <- "accepted_name"
colnames(data)[colnames(data) == "tid"] <- "accepted_no"
colnames(data)[colnames(data) == "rnk"] <- "accepted_rank"
colnames(data)[colnames(data) == "idg"] <- "primary_name"
colnames(data)[colnames(data) == "ids"] <- "species_name"
colnames(data)[colnames(data) == "rss"] <- "primary_reso"
colnames(data)[colnames(data) == "rsg"] <- "species_reso"
colnames(data)[colnames(data) == "gnl"] <- "genus"
colnames(data)[colnames(data) == "fml"] <- "family"
colnames(data)[colnames(data) == "odl"] <- "order"
colnames(data)[colnames(data) == "cll"] <- "class"
colnames(data)[colnames(data) == "phl"] <- "phylum"
#stop(paste0("need to add 'vocab = 'pbdbd'' to pbdb_occurrences",
# " query\n *or*\n 'vocab = pbdb' to your PBDB API query"))
}
# taxon rank translation vectors for compact vocab
taxRankPBDB <- c("subspecies","species","subgenus",
"genus","subtribe","tribe","subfamily",
"family","superfamily","infraorder","suborder",
"order","superorder","infraclass",
"subclass","class","superclass","subphylum",
"phylum","superphylum","subkingdom",
"kingdom","unranked clade","informal")
#
taxRankCOM <- 2:26
#change contents of "identified_rank" and "accepted_rank"
data$identified_rank <- sapply(data$identified_rank,function(x) taxRankPBDB[x == taxRankCOM])
data$accepted_rank <- sapply(data$accepted_rank,function(x) taxRankPBDB[x == taxRankCOM])
}
return(data)
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.