validateDup: Prepare, Search and Merge Duplicate Specimens
In LimaRAF/plantR: Managing Species Records from Biological Collections

validateDup

R Documentation

Prepare, Search and Merge Duplicate Specimens

Description

This function searches for duplicated specimens within and across collections. It can be used to homogenize the information of different groups of fields and to remove duplicates, leaving only one occurrence for each group of duplicates.

Usage

validateDup(
  occ.df,
  cat.code = "collectionCode.new",
  cat.numb = "catalogNumber",
  merge = TRUE,
  remove = FALSE,
  noYear = "s.d.",
  noName = "s.n.",
  noNumb = "s.n.",
  comb.fields = list(
    c("family", "col.last.name", "col.number", "col.loc"),
    c("family", "col.year", "col.number", "col.loc"),
    c("species", "col.last.name", "col.number", "col.year"),
    c("col.year", "col.last.name", "col.number", "col.loc")
  ),
  ignore.miss = TRUE,
  dup.name = "dup.ID",
  prop.name = "dup.prop",
  prop = 0.75,
  rec.ID = "numTombo",
  info2merge = c("tax", "geo", "loc"),
  tax.names = c(family = "family.new", species = "scientificName.new",
    det.name = "identifiedBy.new", det.year = "yearIdentified.new",
    tax.check = "tax.check", status = "scientificNameStatus"),
  geo.names = c(lat = "decimalLatitude.new", lon = "decimalLongitude.new",
    org.coord = "origin.coord", prec.coord = "precision.coord",
    geo.check = "geo.check"),
  loc.names = c(loc.str = "loc.correct", res.gazet = "resolution.gazetteer",
    res.orig = "resol.orig", loc.check = "loc.check"),
  tax.level = "high",
  overwrite = FALSE,
  print.rm = TRUE
)

Arguments

`occ.df`	a data frame containing typical fields of occurrence records from herbarium specimens.
`cat.code`	character. The name of the column containing the code of the collection. Defaults to the plantR output column "collectionCode.new".
`cat.numb`	character. The name of the column containing the catalog number (a.k.a. accession number) of the record. Defaults to "catalogNumber".
`merge`	logical. Should duplicates be merged? Defaults to TRUE.
`remove`	logical. Should all duplicates be removed or only the duplicated entries from the same collection? Defaults to FALSE.
`noYear`	character. Standard for missing data in Year. Defaults to "s.d.".
`noName`	character. Standard for missing data in collector name. Defaults to "s.n.".
`noNumb`	character. Standard for missing data in collector number. Defaults to "s.n.".
`comb.fields`	list. A list containing one or more vectors with the information that should be used to create the duplicate search strings. Defaults to four vectors of information to be combined.
`ignore.miss`	logical. Should the duplicate search strings with missing/unknown information (e.g. 's.d.', 's.n.', NA) be excluded from the duplicate search? Defaults to TRUE.
`dup.name`	character. The name of the column in the input data frame with the duplicate group ID. Defaults to the plantR output 'dup.ID'.
`prop.name`	character. The name of the column in the input data frame with the proportion of duplicates found within the group ID. Defaults to the plantR output 'dup.prop'.
`prop`	numeric. The threshold value of the proportion of duplicated values retrieved (i.e. dup.prop) for a group to enter the merging routine. Should be between zero and one. Defaults to 0.75.
`rec.ID`	character. The name of the column containing the unique record identifier (see function `getTombo()`). Defaults to 'numTombo'.
`info2merge`	vector. The groups of information (i.e. fields) to be merged. Currently, only taxonomic ('tax'), geographic ('geo') and/or locality ('loc') information can be merged. Defaults to merging all of them.
`tax.names`	vector. A named vector containing the names of the columns in the input data frame with the taxonomic information to be merged.
`geo.names`	vector. A named vector containing the names of the columns in the input data frame with the geographical information to be merged.
`loc.names`	vector. A named vector containing the names of the columns in the input data frame with the locality information to be merged.
`tax.level`	character. A vector with the confidence level of the identification that should be considered when merging taxonomic information. Defaults to "high".
`overwrite`	logical. Should the merged information overwrite the original columns or be stored in separate, new columns? Defaults to FALSE (new columns are created).
`print.rm`	logical. Should the number of records removed be printed? Defaults to TRUE.
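
To make the roles of `comb.fields` and `ignore.miss` more concrete, the following is a minimal base R sketch of how one combination of fields could be pasted into a duplicate search string. It is an illustration only and not plantR's implementation: the helper `make_search_string()`, the toy data and the `"_"` separator are all assumptions made for this example.

```r
# Toy records (invented values); column names follow the short names in `comb.fields`.
occ <- data.frame(
  family = c("Myrtaceae", "Myrtaceae", "Lauraceae"),
  col.last.name = c("Silva", "Silva", "s.n."),
  col.number = c("101", "101", "s.n."),
  col.loc = c("brazil_sao paulo", "brazil_sao paulo", "brazil_bahia"),
  stringsAsFactors = FALSE
)

# One combination of fields, as in the first element of the default `comb.fields`.
fields <- c("family", "col.last.name", "col.number", "col.loc")

# Hypothetical helper (not part of plantR): paste the fields into one string and,
# when `ignore.miss` is TRUE, set strings with any missing-data marker to NA.
make_search_string <- function(df, fields, miss = c("s.d.", "s.n."),
                               ignore.miss = TRUE) {
  vals <- df[, fields, drop = FALSE]
  str <- do.call(paste, c(vals, sep = "_"))
  if (ignore.miss) {
    has.miss <- apply(vals, 1, function(x) any(is.na(x) | x %in% miss))
    str[has.miss] <- NA_character_
  }
  str
}

dup.str <- make_search_string(occ, fields)

# Records sharing the same non-missing string are candidate duplicates.
is.dup <- !is.na(dup.str) &
  (duplicated(dup.str) | duplicated(dup.str, fromLast = TRUE))
```

In this sketch the first two records share a search string and are flagged as candidate duplicates, while the third is left out of the search because its collector name and number are missing.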

Details

The function works similarly to a wrapper function, where the individual steps of the proposed plantR workflow for preparing, searching, merging and removing duplicates are performed together (see the plantR tutorial for details).
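
As a rough illustration of the merging step, the base R sketch below shows how `prop`, `tax.level` and `overwrite = FALSE` could interact for the taxonomic information. It is a simplified sketch under stated assumptions, not the routine used by plantR: the toy data are invented, the output column name `species.merged` is hypothetical, the threshold is assumed to be inclusive (`>=`), and the rule "use the first record with a high-confidence identification" is a simplification.

```r
# Toy records (invented values) with duplicate group IDs and proportions.
occ <- data.frame(
  dup.ID = c("g1", "g1", "g2", "g2", NA),
  dup.prop = c(1, 1, 0.5, 0.5, NA),
  scientificName.new = c("Eugenia uniflora", "Eugenia sp.",
                         "Ocotea odorifera", "Ocotea sp.", "Inga edulis"),
  tax.check = c("high", "low", "high", "low", "low"),
  stringsAsFactors = FALSE
)

prop <- 0.75         # threshold on dup.prop (assumed inclusive here)
tax.level <- "high"  # confidence level accepted for the merge

# Work on a copy so the original column is kept, mimicking overwrite = FALSE.
# `species.merged` is a hypothetical column name chosen for this example.
occ$species.merged <- occ$scientificName.new

# Only duplicate groups at or above the threshold enter the merge.
ids <- unique(occ$dup.ID[!is.na(occ$dup.ID) & occ$dup.prop >= prop])

for (id in ids) {
  in.group <- !is.na(occ$dup.ID) & occ$dup.ID == id
  best <- which(in.group & occ$tax.check %in% tax.level)
  if (length(best) > 0) {
    # Propagate the name of the first high-confidence record to the whole group.
    occ$species.merged[in.group] <- occ$scientificName.new[best[1]]
  }
}
```

Here only group `g1` reaches the threshold, so its two records end up with the same name in `species.merged`; group `g2` and the ungrouped record keep their original values, and `scientificName.new` is left unchanged.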

Value

The input data frame, plus the new columns with the formatted fields.

Author(s)

Renato A. F. de Lima
