validateTax: Confidence on Species Identification

View source: R/validateTax.R

validateTax R Documentation

Confidence on Species Identification

Description

This function assigns different categories of confidence level (i.e. high, medium, low or unknown) to the identification of species records, based on the name of the person who provided the species identification and on type specimens.

Usage

validateTax(
  x,
  col.names = c(family = "family.new", det.name = "identifiedBy.new", col.name =
    "recordedBy.new", types = "typeStatus", rec.ID = "numTombo", rec.type =
    "basisOfRecord"),
  special.collector = TRUE,
  generalist = FALSE,
  generalist.class = "medium",
  other.records = NULL,
  miss.taxonomist = NULL,
  taxonomist.list = "plantR",
  voucher.list = NULL,
  noName = c("semdeterminador", "anonymus", "anonymous", "anonimo", "incognito",
    "unknown", "s.d.", "s.n."),
  top.det = 10,
  print = TRUE
)

Arguments

x

a data frame with the species records.

col.names

Vector. A named vector containing the names of the columns in the input data frame that hold each piece of information needed to assign confidence levels to species identifications. Defaults to the plantR output column names.

special.collector

Logical. Should specimens collected by a family specialist but with an empty determiner field be classified as having a high confidence level? Defaults to TRUE.

generalist

Logical. Should family generalists be considered for taxonomic validation? Default to FALSE.

generalist.class

Character. Confidence level to be assigned to family generalists. Default to "medium".

other.records

Character or Integer. The confidence level (if character) or the number of downgrading steps (if integer) to be assigned to records that are not preserved specimens. Defaults to NULL (all record types are treated the same).

miss.taxonomist

Vector. Any missing combination of family x taxonomist that should be added to the validation?

taxonomist.list

a data.frame containing the list of taxonomist names. The default is "plantR", the internal plantR global database of plant taxonomists (see Details).

voucher.list

Vector. One or more unique record identifiers (i.e. combination of collection code and number) that should be flagged with a high confidence level. Defaults to NULL.

noName

Vector. One or more character strings (in lower case) with the standard notation for missing data in the field 'det.name'. Defaults to some typical notations found in herbarium data.

top.det

Numerical. How many of the top missing identifiers should be printed? Default to 10.

print

logical. Should the table of missing identifiers be printed? Default to TRUE.

rec.ID

Character. The name of the column containing the unique record identifier (see function getTombo()), supplied as the 'rec.ID' element of col.names. Defaults to 'numTombo'.

Details

The input data frame x must contain at least the columns with the information on the record family and the name of the person who provided the species identification. Preferably, this data frame should also contain information on type specimens and collectors' names. If the user provides a list of records to be flagged as having a high confidence level in the identification, the user must also provide the column where the unique record identifiers are stored. The names of these columns should be provided as a named vector to the argument col.names, as follows:

  • 'family': the botanical family (default: 'family.new')

  • 'det.name': the identifier name (default: 'identifiedBy.new')

  • 'col.name': the collector name (default: 'recordedBy.new')

  • 'types': type specimens (default: 'typeStatus')

  • 'rec.ID': the unique record identifier (default: 'numTombo')

  • 'rec.type': the type of record (default: 'basisOfRecord')
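
As a rough sketch of how such a named vector could be built when the input columns do not follow the plantR defaults: the column names on the right-hand side below ('family', 'identifiedBy', 'catalogNumber', etc.) are hypothetical and only illustrate the mapping. The final call is commented out because it assumes plantR is installed and that a data frame df with those columns exists.

```r
# Hypothetical column names of a user data frame (illustrative only)
my.cols <- c(family = "family",
             det.name = "identifiedBy",
             col.name = "recordedBy",
             types = "typeStatus",
             rec.ID = "catalogNumber",
             rec.type = "basisOfRecord")

# The names of the vector are the keys listed above for 'col.names';
# the values are the matching column names in the input data frame
expected.keys <- c("family", "det.name", "col.name",
                   "types", "rec.ID", "rec.type")
all(expected.keys %in% names(my.cols))

# Assumed usage, not run here:
# validateTax(df, col.names = my.cols)
```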

As for other functions in plantR, using a data frame x that has already passed through the editing steps of the plantR workflow should result in more accurate outputs.

The function assigns a high confidence level to all records whose species identifications were made by a family specialist and to all type specimens (isotypes, paratypes, etc.). By default, the names of family specialists are obtained from a global list of about 8,500 plant taxonomists' names constructed by Lima et al. (2020) and provided with plantR. This list was built based on information from the Harvard University Herbaria, the Brazilian Herbaria Network and the American Society of Plant Taxonomists. The dictionary was manually complemented with missing names of taxonomists and it includes common variants of taxonomists' names (e.g., missing initials, typos, married or maiden names).

If a column containing the Darwin Core field 'basisOfRecord' or equivalent is provided ('rec.type' in argument col.names), occurrences that are not preserved specimens (i.e. human/machine observations, photos, living specimens, etc.) can be assigned a different confidence level using the argument other.records.

Some specimens are collected by a specialist of the family, but the identifier information is missing. By default, we assume the same confidence level for these specimens as that assigned for specimens where the identifier is the family specialist. But users can choose otherwise by setting the argument special.collector to FALSE.

The arguments generalist and generalist.class define whether taxonomists who provide identifications for many different families outside their specialty, often referred to as generalists, should be considered in the validation and, if so, under which confidence level. There are some names of generalists in the plantR default taxonomist database; however, this list of generalist names is currently biased towards plant collectors in South America, particularly in Brazil.

The argument other.records controls what to do with types of records that are not preserved specimens (Darwin Core field basisOfRecord). If the argument is NULL (default), all record types are treated the same. Users can set the argument to one of the confidence levels (i.e. 'unknown', 'low', 'medium' or 'high') to assign the same class to all non-preserved specimens, or to a number (i.e., 1 or 2), which corresponds to the number of downgrading steps among levels. For instance, if other.records is 1, the 'high' level becomes 'medium' and the 'medium' level becomes 'low' ('unknown' and 'low' levels remain the same).
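
The downgrading rule described above can be illustrated with a small base R sketch. The helper downgrade() is hypothetical and is not part of plantR; it only mimics the stated rule, and its behaviour for two steps ('high' becoming 'low') is an assumption based on that description.

```r
# Illustrative helper (not part of plantR): downgrade confidence levels
# by a number of steps, leaving 'unknown' and 'low' untouched
downgrade <- function(level, steps = 1) {
  ordered.levels <- c("low", "medium", "high")
  pos <- match(level, ordered.levels)   # NA for 'unknown'
  new.pos <- pmax(pos - steps, 1)       # never go below 'low'
  ifelse(is.na(pos), level, ordered.levels[new.pos])
}

# One downgrading step applied to each of the four levels
downgrade(c("high", "medium", "low", "unknown"), steps = 1)
```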

If one or more taxonomists are missing from the validation, you can include them using the argument miss.taxonomist. The format should be: the name of the family of specialty followed by an underscore and then the taxonomist name in the TDWG format (e.g. "Bignoniaceae_Gentry, A.H.").
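
A minimal sketch of how strings in this format could be assembled with base R. The second family and name pair is made up for illustration, and the final call is commented out because it assumes plantR is installed and that a suitable data frame df exists.

```r
# Family of specialty and TDWG-formatted taxonomist names
# (the second pair is a made-up example)
fam <- c("Bignoniaceae", "Myrtaceae")
tax <- c("Gentry, A.H.", "Silva, J.")

# Family, underscore, then the name in the TDWG format
miss <- paste(fam, tax, sep = "_")
miss

# Assumed usage, not run here:
# validateTax(df, miss.taxonomist = miss)
```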

A database of taxonomists other than the plantR default can be used. This database must be provided using the argument taxonomist.list and it must contain the columns 'family' and 'tdwg.name'. The first column is the family of specialty of the taxonomist and the second one is their name in the TDWG format. See the plantR function prepName() on how to get names in the TDWG format from a list of people's names.
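
For illustration, a custom database with the two required columns might look like the sketch below. The rows are examples only, and any additional columns that plantR may accept or require are not covered here. The final call is commented out because it assumes plantR is installed and that a suitable data frame df exists.

```r
# Minimal custom taxonomist database with the two required columns
my.taxonomists <- data.frame(
  family = c("Bignoniaceae", "Bignoniaceae"),
  tdwg.name = c("Gentry, A.H.", "Lohmann, L.G."),
  stringsAsFactors = FALSE
)

# Check that the required columns are present before passing it on
all(c("family", "tdwg.name") %in% names(my.taxonomists))

# Assumed usage, not run here:
# validateTax(df, taxonomist.list = my.taxonomists)
```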

Finally, the user can provide a list of records that should be flagged as having a high confidence level in their identification. This list should be provided using the argument voucher.list and it should contain the unique record identifiers (i.e. combination of collection code and number). It is important that the way in which the list of unique identifiers was generated matches the one used to construct the identifiers in the input data frame x (see help of function getTombo()). If a list of records is provided, the user must also provide, in col.names, a valid column name in x containing the unique record identifiers.
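
Because the identifiers must match exactly, it can help to check the voucher list against the identifier column before running the validation. The sketch below uses plain base R with made-up identifiers; the second voucher deliberately uses a different format and therefore would not be found.

```r
# Identifiers as stored in the record identifier column (e.g. 'numTombo')
rec.ids <- c("a_1", "b_3", "c_7", "d_5", "e_3", "f_4")

# Candidate vouchers: the second one was built with a different format
vouchers <- c("f_4", "F-4")

# Which vouchers have an exact match in the data?
found <- vouchers %in% rec.ids
vouchers[!found]   # identifiers that would not be matched
```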

Value

The input data frame x, plus a new column 'tax.check' containing the classes of confidence in species identifications.

References

Lima, R.A.F. et al. 2020. Defining endemism levels for biodiversity conservation: Tree species in the Atlantic Forest hotspot. Biological Conservation, 252: 108825.

See Also

prepName and getTombo.

Examples

library(plantR)

# Toy data set: six Bignoniaceae records with varying identification metadata
(df <- data.frame(
  family.new = rep("Bignoniaceae", 6),
  identifiedBy.new = c("Gentry, A.H.", "Hatschbach, G.", NA, NA, NA, "Hatschbach, G."),
  recordedBy.new = c(NA, NA, NA, "Gentry, A.H.", NA, NA),
  typeStatus = c(NA, NA, "isotype", NA, NA, NA),
  numTombo = c("a_1", "b_3", "c_7", "d_5", "e_3", "f_4"),
  stringsAsFactors = FALSE))

# Default settings
validateTax(df)
# Also consider family generalists in the validation
validateTax(df, generalist = TRUE)
# Flag record 'f_4' as having a high confidence level
validateTax(df, voucher.list = "f_4")


LimaRAF/plantR documentation built on Jan. 1, 2023, 10:18 a.m.