mergeDup: Merge Duplicate Information
In LimaRAF/plantR: Managing Species Records from Biological Collections

Description

This function homogenizes the information of different groups of fields (e.g. taxonomic, geographic or locality) for groups of duplicate specimens.

Usage

mergeDup(
  dups,
  dup.name = "dup.ID",
  prop.name = "dup.prop",
  prop = 0.75,
  rec.ID = "numTombo",
  info2merge = c("tax", "geo", "loc"),
  tax.names = c(family = "family.new", species = "scientificName.new", det.name =
    "identifiedBy.new", det.year = "yearIdentified.new", tax.check = "tax.check", status
    = "scientificNameStatus"),
  geo.names = c(lat = "decimalLatitude.new", lon = "decimalLongitude.new", org.coord =
    "origin.coord", prec.coord = "precision.coord", geo.check = "geo.check"),
  loc.names = c(loc.str = "loc.correct", res.gazet = "resolution.gazetteer", res.orig =
    "resol.orig", loc.check = "loc.check"),
  tax.level = "high",
  overwrite = FALSE
)

Arguments

`dups`	the input data frame.
`dup.name`	character. The name of the column in the input data frame with the duplicate group ID. Defaults to the plantR output 'dup.ID'.
`prop.name`	character. The name of the column in the input data frame with the proportion of duplicates found within the group ID. Defaults to the plantR output 'dup.prop'.
`prop`	numerical. The threshold value of the proportion of duplicated values retrieved (i.e. dup.prop) for a record to enter the merging routine. Should be between zero and one. Defaults to 0.75.
`rec.ID`	character. The name of the column containing the unique record identifier (see function `getTombo()`). Defaults to 'numTombo'.
`info2merge`	Vector. The groups of information (i.e. fields) to be merged. Currently, only taxonomic ('tax'), geographic ('geo') and/or locality ('loc') information can be merged. Defaults to merging all of them.
`tax.names`	Vector. A named vector containing the names of columns in the input data frame with the taxonomic information to be merged.
`geo.names`	Vector. A named vector containing the names of columns in the input data frame with the geographical information to be merged.
`loc.names`	Vector. A named vector containing the names of columns in the input data frame with the locality information to be merged.
`tax.level`	character. A vector with the confidence level of the identification that should be considered in the merge of taxonomic information. Defaults to "high".
`overwrite`	logical. Should the merged information overwrite the input columns or be stored in separate, new columns? Defaults to FALSE (new columns are created).

Details

The homogenization of the information within groups of duplicates depends on the existence of some fields in the input data frame, which result from the plantR workflow. The first essential field holds the duplicate group identifiers, which are used to aggregate the records (see functions prepDup() and getDup()). The name of the column with these identifiers must be provided to the argument dup.name (defaults to 'dup.ID'). The other essential fields depend on the type of merge desired (argument info2merge): each type requires a different set of column names. These names should be provided to the arguments tax.names, geo.names, and loc.names.

For the merge of taxonomic information, the fields required by tax.names are:

'family': the botanical family (default: 'family.new')
'species': the scientific name (default: 'scientificName.new')
'det.name': the identifier name (default: 'identifiedBy.new')
'det.year': the identification year (default: 'yearIdentified.new')
'tax.check': the confidence level of the taxonomic identification (default: 'tax.check')
'status': the status of the taxon name (default: 'scientificNameStatus')

For the merge of geographical information, the fields required by geo.names are:

'lat': latitude in decimal degrees (default: 'decimalLatitude.new')
'lon': longitude in decimal degrees (default: 'decimalLongitude.new')
'org.coord': the origin of the coordinates (default: 'origin.coord')
'prec.coord': the precision of the coordinates (default: 'precision.coord')
'geo.check': the result of the geographical coordinate validation (default: 'geo.check')

For the merge of locality information, the fields required by loc.names are:

'loc.str': the locality search string (default: 'loc.correct')
'res.gazet': the resolution of the gazetteer coordinates (default: 'resolution.gazetteer')
'res.orig': the resolution of the source coordinates (default: 'resol.orig')
'loc.check': the result of the locality validation (default: 'loc.check')
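
Before calling mergeDup() on data whose columns do not follow the plantR defaults, it can help to confirm that every column named in the mapping actually exists. The following is a minimal base R sketch, not part of plantR: the data frame `df`, its column `fam` and the object `missing.cols` are hypothetical and only illustrate how the named vector maps field roles to column names.

```r
# Default mapping for the taxonomic fields, copied from the Usage section
tax.names <- c(family = "family.new", species = "scientificName.new",
               det.name = "identifiedBy.new", det.year = "yearIdentified.new",
               tax.check = "tax.check", status = "scientificNameStatus")

# Hypothetical input in which the family column is called "fam"
tax.names["family"] <- "fam"
df <- data.frame(fam = "AA", scientificName.new = "a a",
                 stringsAsFactors = FALSE)

# Columns named in the mapping that are absent from the input
cols <- unname(tax.names)
missing.cols <- cols[!cols %in% names(df)]
missing.cols
```

Any name listed in `missing.cols` would have to be created or remapped before a taxonomic merge is attempted.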

For all groups of information (i.e. taxonomic, geographic and locality), the merging process consists of ordering the information available for each group of duplicates from the best to the worst quality/resolution available. The best information available is then expanded to all records of the group of duplicates. The argument prop defines the duplicated proportion (given by prop.name) that should be used as a threshold. Only records with duplicated proportions above this threshold will be merged. For all other records, the output will be the same as the input. If no column prop.name is found in the input data, the merge is performed for all records, with a warning.
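
The order-then-expand logic described above can be sketched in base R. This is an illustration under stated assumptions, not the plantR implementation: the columns `rank` and `value` are hypothetical, a lower `rank` is taken to mean better quality, and the threshold is applied as a strict "greater than" comparison.

```r
# Toy data: two duplicate groups, only "g1" passes the threshold
df <- data.frame(
  dup.ID   = c("g1", "g1", "g2", "g2"),
  dup.prop = c(1, 1, 0.5, 0.5),
  rank     = c(2, 1, 1, 2),              # hypothetical quality rank, 1 = best
  value    = c("poor", "good", "best2", "worse2"),
  stringsAsFactors = FALSE
)
prop <- 0.75

merged <- df$value
for (g in unique(df$dup.ID)) {
  # Records of this group whose duplicated proportion is above the threshold
  idx <- which(df$dup.ID == g & df$dup.prop > prop)
  if (length(idx) > 1) {
    best <- idx[which.min(df$rank[idx])]  # best-ranked record in the group
    merged[idx] <- df$value[best]         # expand its value to the group
  }
}
df$value1 <- merged                       # stored in a new column
df
```

Records in group "g2" fall below the threshold, so their values are returned unchanged.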

For the merge of taxonomic information, the specimen(s) with the highest confidence level of the identification is used as the standard, from which the taxonomic information is expanded to other specimens within the same group of duplicates. By default, mergeDup() uses specimens flagged as having a 'high' confidence level.

In the case of conflicting species identifications among specialists for the same group of duplicates, the most recent identification is assumed to be the most up-to-date one. Note that if the year of identification is missing from one or more records, the identifications of these records are not taken into account when assigning the most up-to-date identification for a group of duplicates.
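
The tie-breaking rule for conflicting identifications can be illustrated with a small base R sketch. It is not the package's code: the data frame `ids` and the objects `candidates` and `latest` are hypothetical, and the sketch assumes all records shown belong to one group of duplicates and share the same confidence level.

```r
# Three identifications for one group of duplicates; one has no year
ids <- data.frame(
  sp = c("c c", "c d", "e e"),
  yr = c("2019", "2000", NA),
  stringsAsFactors = FALSE
)

# Records without an identification year are left out of the comparison
candidates <- ids[!is.na(ids$yr), ]

# Among the remaining records, keep the most recent identification
latest <- candidates$sp[which.max(as.integer(candidates$yr))]
latest
```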

For the merge of geographical information, specimens are ordered according to the result of their geographical validation (i.e. field 'geo.check') and the resolution of the original geographical coordinates. Thus, for each group of duplicates, the specimen whose coordinates were validated at the best level (e.g. 'ok_county') is used to expand the information to the specimens validated at lower levels (e.g. state or country levels).

A similar procedure is performed to merge the information regarding the locality description. Specimens are ordered according to the result of their locality validation (i.e. field 'loc.check'), and the one ranked best within the group of duplicates (e.g. 'ok_municip.2locality') is the one used as the standard.
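
The ranking by validation level can be sketched as follows for the geographical case; the same idea would apply to 'loc.check'. This is a simplified illustration, not the plantR implementation: only 'ok_county' is taken from the text above, the labels "ok_state" and "ok_country" are assumed for the example, the coordinates are made up, and the resolution of the original coordinates is ignored.

```r
# Validation levels ordered from best to worst (labels partly assumed)
geo.levels <- c("ok_county", "ok_state", "ok_country")

# One group of duplicates with three records (made-up coordinates)
geo <- data.frame(
  ID        = c("r1", "r2", "r3"),
  geo.check = c("ok_state", "ok_county", "ok_country"),
  lat       = c(-23.10, -23.55, -20.00),
  lon       = c(-46.00, -46.63, -45.00),
  stringsAsFactors = FALSE
)

# Position of each record's validation level in the ordered vector
rk <- match(geo$geo.check, geo.levels)
ref <- which.min(rk)               # record validated at the best level

# Expand the reference coordinates to all records of the group
geo$lat1 <- geo$lat[ref]
geo$lon1 <- geo$lon[ref]
geo$ref.spec.geo <- geo$ID[ref]    # record used as the reference
geo
```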

For the merge of taxonomic, geographic and locality information, the specimens used as references of the best information available for each group of duplicates are stored in the columns 'ref.spec.tax', 'ref.spec.geo' and 'ref.spec.loc', respectively. The merge of collector information (i.e. collector name, number and year) is planned, but not yet implemented in the current version of the package.

Value

If overwrite == FALSE, the function returns the input data frame dups plus new columns containing the homogenized information. The names of these columns are the same as those of the original columns, with the suffix '1' added. If overwrite == TRUE, the homogenized information is saved in the same columns of the input data and the names of the columns remain the same.
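
The naming convention of the output can be illustrated with a short base R sketch. The helper `store_merged()` is hypothetical and only mimics the behaviour described above; it is not a plantR function.

```r
# Hypothetical helper: write merged values to the original column
# (overwrite = TRUE) or to a new column with the suffix "1"
store_merged <- function(df, col, merged, overwrite = FALSE) {
  target <- if (overwrite) col else paste0(col, "1")
  df[[target]] <- merged
  df
}

df <- data.frame(fam = c("AA", "Aa"), stringsAsFactors = FALSE)
merged <- c("AA", "AA")

new.cols  <- store_merged(df, "fam", merged)                    # adds 'fam1'
same.cols <- store_merged(df, "fam", merged, overwrite = TRUE)  # replaces 'fam'

names(new.cols)
names(same.cols)
```

Keeping the default (overwrite = FALSE) leaves the original values available for comparison with the merged ones.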

Author(s)

Renato A. F. de Lima

Examples

library(plantR)

# An example of merging taxonomic information only
(df = data.frame(
  ID = c("a7","b2","c4","d1","e9","f3","g2","h8","i6","j5"),
  dup.ID = c("a7|b2","a7|b2","c4|d1|e9","c4|d1|e9","c4|d1|e9",
             "f3|g2","f3|g2","h8|i6|j5","h8|i6|j5","h8|i6|j5"),
  fam = c("AA","AA","BB","BB","Bb","CC","DD","EE","Ee","ee"),
  sp = c("a a","a b","c c","c d","c d","e e","f f","h h","h h","h h"),
  det = c("spec","n_spec","spec1","spec2","n_spec1",
          "spec3","spec4","n_spec2","n_spec3","n_spec4"),
  yr = c("2010","2019","2019","2000","2020",NA,"1812","2020","2020","2020"),
  check = c("high","low","high","high","low","high","high","low","low","low"),
  stat = rep("possibly_ok", 10)))

mergeDup(df, info2merge = "tax",
        rec.ID = "ID",
        tax.names = c(family = "fam",
                      species = "sp",
                      det.name = "det",
                      det.year = "yr",
                      tax.check = "check",
                      status = "stat"))

LimaRAF/plantR documentation built on Jan. 1, 2023, 10:18 a.m.