validateDup: Prepare, Search and Merge Duplicate Specimens

Description

This function searches for duplicated specimens within and across collections. It can be used to homogenize the information of different groups of fields among duplicates and to remove duplicates, leaving only one occurrence for each group of duplicates.

Usage

validateDup(
  occ.df,
  cat.code = "collectionCode.new",
  cat.numb = "catalogNumber",
  merge = TRUE,
  remove = FALSE,
  noYear = "s.d.",
  noName = "s.n.",
  noNumb = "s.n.",
  comb.fields = list(
    c("family", "col.last.name", "col.number", "col.loc"),
    c("family", "col.year", "col.number", "col.loc"),
    c("species", "col.last.name", "col.number", "col.year"),
    c("col.year", "col.last.name", "col.number", "col.loc")
  ),
  ignore.miss = TRUE,
  dup.name = "dup.ID",
  prop.name = "dup.prop",
  prop = 0.75,
  rec.ID = "numTombo",
  info2merge = c("tax", "geo", "loc"),
  tax.names = c(family = "family.new", species = "scientificName.new",
    det.name = "identifiedBy.new", det.year = "yearIdentified.new",
    tax.check = "tax.check", status = "scientificNameStatus"),
  geo.names = c(lat = "decimalLatitude.new", lon = "decimalLongitude.new",
    org.coord = "origin.coord", prec.coord = "precision.coord",
    geo.check = "geo.check"),
  loc.names = c(loc.str = "loc.correct", res.gazet = "resolution.gazetteer",
    res.orig = "resol.orig", loc.check = "loc.check"),
  tax.level = "high",
  overwrite = FALSE,
  print.rm = TRUE
)

Arguments

occ.df

a data frame containing the typical fields of occurrence records from herbarium specimens

cat.code

character. The name of the column containing the code of the collection. Default to the plantR output column "collectionCode.new".

cat.numb

character. The name of the column containing the catalog number (a.k.a. accession number) of the record. Default to "catalogNumber".

merge

logical. Should duplicates be merged? Default to TRUE.

remove

logical. Should all duplicates be removed or only the duplicated entries from the same collection? Default to FALSE.

noYear

character. Standard for missing data in the collection year. Default to "s.d.".

noName

character. Standard for missing data in collector name. Default to "s.n.".

noNumb

character. Standard for missing data in collector number. Default to "s.n.".

comb.fields

list. A list containing one or more vectors with the information that should be used to create the duplicate search strings. Default to four vectors of information to be combined.

ignore.miss

logical. Should the duplicate search strings with missing/unknown information (e.g. 's.d.', 's.n.', NA) be excluded from the duplicate search? Default to TRUE.

dup.name

character. The name of the column in the input data frame with the duplicate group ID. Default to the plantR output 'dup.ID'.

prop.name

character. The name of the column in the input data frame with the proportion of duplicates found within the group ID. Default to the plantR output 'dup.prop'.

prop

numeric. The threshold value for the proportion of duplicated values retrieved (i.e. dup.prop) that is used to decide which records enter the merging routine. Should be between zero and one. Default to 0.75.

rec.ID

character. The name of the column containing the unique record identifier (see function getTombo()). Default to 'numTombo'.

info2merge

Vector. The groups of information (i.e. fields) to be merged. Currently, only taxonomic ('tax'), geographic ('geo') and/or locality ('loc') information can be merged. Default to merge all of them.

tax.names

Vector. A named vector containing the names of columns in the input data frame with the taxonomic information to be merged.

geo.names

Vector. A named vector containing the names of columns in the input data frame with the geographical information to be merged.

loc.names

Vector. A named vector containing the names of columns in the input data frame with the locality information to be merged.

tax.level

character. A vector with the confidence level of the identification that should be considered in the merge of taxonomic information. Default to "high".

overwrite

logical. Should the merged information overwrite the original columns or be stored in separate, new columns? Default to FALSE (new columns are created).

print.rm

logical. Should the number of records removed be printed? Default to TRUE.
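
The way comb.fields and ignore.miss interact can be illustrated with a small base R sketch. This is not the plantR implementation: the toy data frame, the helper make_string(), the "_" separator and the vector missing.codes are all assumptions made for illustration. It only shows the general idea of pasting fields into search strings and dropping the strings that contain missing-data codes.

```r
# Toy occurrence records; values and column contents are illustrative only
occ <- data.frame(
  family        = c("Lauraceae", "Lauraceae", "Myrtaceae"),
  col.last.name = c("Silva", "Silva", "s.n."),
  col.number    = c("123", "123", "45"),
  col.loc       = c("brazil_sp_campinas", "brazil_sp_campinas", "brazil_rj_paraty"),
  stringsAsFactors = FALSE
)

# One combination of fields, written as in the comb.fields argument
comb.fields <- list(c("family", "col.last.name", "col.number", "col.loc"))

# Assumed codes for missing information (mirroring noYear, noName, noNumb)
missing.codes <- c("s.d.", "s.n.")

# Hypothetical helper: build one search string per record
make_string <- function(df, fields, ignore.miss = TRUE) {
  str <- do.call(paste, c(unname(as.list(df[fields])), sep = "_"))
  if (ignore.miss) {
    # Flag records where any field is NA or holds a missing-data code
    has.miss <- apply(df[fields], 1,
                      function(x) any(is.na(x) | x %in% missing.codes))
    str[has.miss] <- NA_character_
  }
  str
}

dup.str <- make_string(occ, comb.fields[[1]])

# Records sharing a non-missing search string are candidate duplicates
is.dup <- !is.na(dup.str) &
  (duplicated(dup.str) | duplicated(dup.str, fromLast = TRUE))
```

In this sketch the first two records share the same string and are flagged as candidate duplicates, while the third is left out of the search because its collector name is coded as missing.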

Details

The function works as a wrapper: the individual steps of the proposed plantR workflow for preparing, searching, merging and removing duplicates are performed together in a single call (see the plantR tutorial for details).
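
As a rough illustration of how the prop threshold could gate the merging step, the base R sketch below selects the records that would be passed on to the merge. The toy values are invented, and the use of ">=" for the comparison is an assumption; the actual rule applied by plantR may differ.

```r
# Toy output of a duplicate search; identifiers and proportions are invented
occ <- data.frame(
  numTombo = c("A_1", "B_7", "C_3", "D_9"),
  dup.ID   = c("A_1|B_7", "A_1|B_7", "C_3|D_9", "C_3|D_9"),
  dup.prop = c(1, 1, 0.5, 0.5),
  stringsAsFactors = FALSE
)

prop <- 0.75

# Assumption: records at or above the threshold enter the merging routine
to.merge <- !is.na(occ$dup.ID) & occ$dup.prop >= prop

merge.ids <- occ$numTombo[to.merge]
```

Here only the first duplicate group, found for all search strings, would be merged; the second group, found for half of them, would be left untouched.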

Value

The input data frame, plus the new columns with the formatted fields.

Author(s)

Renato A. F. de Lima

See Also

prepDup, getDup, mergeDup, rmDup


LimaRAF/plantR documentation built on Jan. 1, 2023, 10:18 a.m.