mergeDup | R Documentation |
This function homogenizes the information of different groups of fields (e.g. taxonomic, geographic or locality) for groups of duplicate specimens.
mergeDup(
  dups,
  dup.name = "dup.ID",
  prop.name = "dup.prop",
  prop = 0.75,
  rec.ID = "numTombo",
  info2merge = c("tax", "geo", "loc"),
  tax.names = c(family = "family.new", species = "scientificName.new",
                det.name = "identifiedBy.new", det.year = "yearIdentified.new",
                tax.check = "tax.check", status = "scientificNameStatus"),
  geo.names = c(lat = "decimalLatitude.new", lon = "decimalLongitude.new",
                org.coord = "origin.coord", prec.coord = "precision.coord",
                geo.check = "geo.check"),
  loc.names = c(loc.str = "loc.correct", res.gazet = "resolution.gazetteer",
                res.orig = "resol.orig", loc.check = "loc.check"),
  tax.level = "high",
  overwrite = FALSE
)
dups |
the input data frame. |
dup.name |
character. The name of the column in the input data frame with the duplicate group ID. Defaults to the plantR output 'dup.ID'. |
prop.name |
character. The name of the column in the input data frame with the proportion of duplicates found within the group ID. Defaults to the plantR output 'dup.prop'. |
prop |
numerical. The threshold proportion of duplicated values retrieved (i.e. dup.prop) that a record must reach to enter the merging routine. Should be between zero and one. Defaults to 0.75. |
rec.ID |
character. The name of the column containing the unique record
identifier (see function |
info2merge |
Vector. The groups of information (i.e. fields) to be merged. Currently, only taxonomic ('tax'), geographic ('geo') and/or locality ('loc') information can be merged. Defaults to merging all of them. |
tax.names |
Vector. A named vector containing the names of columns in the input data frame with the taxonomic information to be merged. |
geo.names |
Vector. A named vector containing the names of columns in the input data frame with the geographical information to be merged. |
loc.names |
Vector. A named vector containing the names of columns in the input data frame with the locality information to be merged. |
tax.level |
character. A vector with the confidence level of the identification that should be considered in the merge of taxonomic information. Defaults to "high". |
overwrite |
logical. Should the merged information overwrite the original columns or be stored in separate, new columns? Defaults to FALSE (new columns are created). |
The homogenization of the information within groups of duplicates depends
on the existence of some fields in the input data frame, which result from
the plantR workflow. The first essential field is the duplicate group
identifier, which is used to aggregate the records (see functions prepDup()
and getDup()). The name of the column with these identifiers must be
provided to the argument dup.name (defaults to 'dup.ID'). The other
essential fields depend on the type of merge desired (argument info2merge):
each type requires a different set of column names. These names should be
provided to the arguments tax.names, geo.names, and loc.names.
For the merge of taxonomic information, the fields required by tax.names are:
'family': the botanical family (default: 'family.new')
'species': the scientific name (default: 'scientificName.new')
'det.name': the identifier name (default: 'identifiedBy.new')
'det.year': the identification year (default: 'yearIdentified.new')
'tax.check': the confidence level of the taxonomic identification (default: 'tax.check')
'status': the status of the taxon name (default: 'scientificNameStatus')
For the merge of geographical information, the fields required by geo.names are:
'lat': latitude in decimal degrees (default: 'decimalLatitude.new')
'lon': longitude in decimal degrees (default: 'decimalLongitude.new')
'org.coord': the origin of the coordinates (default: 'origin.coord')
'prec.coord': the precision of the coordinates (default: 'precision.coord')
'geo.check': the result of the geographical coordinate validation (default: 'geo.check')
For the merge of locality information, the fields required by loc.names are:
'loc.str': the locality search string (default: 'loc.correct')
'res.gazet': the resolution of the gazetteer coordinates (default: 'resolution.gazetteer')
'res.orig': the resolution of the source coordinates (default: 'resol.orig')
'loc.check': the result of the locality validation (default: 'loc.check')
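Because each type of merge depends on the columns named in its vector, it can help to confirm that they exist before calling mergeDup(). The following is a minimal sketch in base R and is not part of plantR; the toy data frame and the object missing_cols are hypothetical and used only for illustration.

```r
# Hypothetical pre-check (not a plantR function): list the columns named in
# tax.names that are absent from the input data frame.
tax.names <- c(family = "family.new", species = "scientificName.new",
               det.name = "identifiedBy.new", det.year = "yearIdentified.new",
               tax.check = "tax.check", status = "scientificNameStatus")

# Toy input that lacks the identification year and the name status columns
dups <- data.frame(dup.ID = c("a|b", "a|b"),
                   family.new = c("AA", "AA"),
                   scientificName.new = c("a a", "a b"),
                   identifiedBy.new = c("spec", "n_spec"),
                   tax.check = c("high", "low"),
                   stringsAsFactors = FALSE)

# Keep only the expected column names that are not found in the data
missing_cols <- tax.names[!tax.names %in% names(dups)]
missing_cols
```

The same check can be repeated with geo.names and loc.names when geographic or locality information is to be merged.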
For all groups of information (i.e. taxonomic, geographic and locality), the
merging process consists of ordering the information available for each
group of duplicates from the best to the worst quality/resolution available.
The best information available is then expanded to all records of the group
of duplicates. The argument prop defines the threshold for the duplicated
proportion (given by prop.name). Only records with duplicated proportions
above this threshold will be merged. For all other records, the output will
be the same as the input. If no column prop.name is found in the input data,
the merge is performed for all records, with a warning.
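The effect of the threshold can be sketched in base R as follows. This is an illustration only, not the package's internal code: the toy data and the object to_merge are hypothetical, and the use of a strict comparison (>) is an assumption based on the word "above" in the description.

```r
# Toy input: two groups of duplicates with different duplicated proportions
dups <- data.frame(ID = c("r1", "r2", "r3", "r4"),
                   dup.ID = c("r1|r2", "r1|r2", "r3|r4", "r3|r4"),
                   dup.prop = c(1, 1, 0.5, 0.5),
                   stringsAsFactors = FALSE)
prop <- 0.75

# Assumption: records enter the merge when their proportion exceeds 'prop';
# records with a missing proportion are left out of the merge
to_merge <- !is.na(dups$dup.prop) & dups$dup.prop > prop

# Records that would be merged; the others would be returned unchanged
dups$ID[to_merge]
```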
For the merge of taxonomic information, the specimen(s) with the highest
confidence level of identification are used as the standard, from which
the taxonomic information is expanded to other specimens within the same
group of duplicates. By default, mergeDup() uses specimens flagged as
having a 'high' confidence level.
In the case of conflicting species identification among specialists for the same group of duplicates, the most recent identification is assumed as the most up-to-date one. Note that if the year of identification is missing from one or more records, the corresponding identifications of these records are not taken into account while trying to assign the most up-to-date identification for a group of duplicates.
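The rule described above (high-confidence identifications only, most recent year wins, missing years ignored) can be sketched in base R. This is a simplified illustration under those stated assumptions and not the implementation used by mergeDup(); the helper pick_reference and the output column sp1 are hypothetical names.

```r
# Toy groups of duplicates with conflicting identifications
df <- data.frame(
  ID     = c("c4", "d1", "e9", "f3", "g2"),
  dup.ID = c("c4|d1|e9", "c4|d1|e9", "c4|d1|e9", "f3|g2", "f3|g2"),
  sp     = c("c c", "c d", "c d", "e e", "f f"),
  yr     = c(2019, 2000, 2020, NA, 1812),
  check  = c("high", "high", "low", "high", "high"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: within one group, keep high-confidence records that
# have an identification year and return the ID of the most recent one
pick_reference <- function(grp) {
  cand <- grp[grp$check == "high" & !is.na(grp$yr), ]
  if (nrow(cand) == 0) return(NA_character_)
  cand$ID[which.max(cand$yr)]
}

# One reference specimen per group of duplicates
ref <- vapply(split(df, df$dup.ID), pick_reference, character(1))

# Expand the name of the reference specimen to all records of its group
df$sp1 <- df$sp[match(ref[df$dup.ID], df$ID)]
df
```

In this toy case, record e9 has the most recent year but a 'low' confidence level, so it is not used as the reference, and record f3 is skipped because its identification year is missing.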
For the merge of geographical information, specimens are ordered according to the result of their geographical validation (i.e. field 'geo.check') and the resolutions of the original geographical coordinates. Thus, for each group of duplicates the specimen whose coordinates were validated at the best level (e.g. 'ok_county') is used to expand the information for the specimens validated at lower levels (e.g. state or country levels).
A similar procedure is performed to merge the information regarding the locality description. Specimens are ordered according to the result of their locality validation (i.e. field 'loc.check'), and the one ranked best within the group of duplicates (e.g. 'ok_municip.2locality') is the one used as the standard.
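The ranking step shared by the geographical and locality merges can be sketched in base R. This is a minimal illustration, not the package's code: only 'ok_county' is named in the text, so the labels 'ok_state' and 'ok_country' and the vector rank_levels are assumptions, as are the output column names lat1 and lon1.

```r
# Assumed ranking of validation results, from best to worst
rank_levels <- c("ok_county", "ok_state", "ok_country")

# Toy group of three duplicates validated at different levels
df <- data.frame(
  ID        = c("a1", "a2", "a3"),
  dup.ID    = rep("a1|a2|a3", 3),
  lat       = c(-23.5, -23.0, -10.0),
  lon       = c(-46.6, -47.0, -55.0),
  geo.check = c("ok_state", "ok_county", "ok_country"),
  stringsAsFactors = FALSE
)

# Order records within each group from the best to the worst validation level
ord <- order(df$dup.ID, match(df$geo.check, rank_levels))

# The first record of each group after ordering is the best-ranked one
best <- df[ord, ][!duplicated(df$dup.ID[ord]), ]

# Store the reference specimen and expand its coordinates to the whole group
idx <- match(df$dup.ID, best$dup.ID)
df$ref.spec.geo <- best$ID[idx]
df$lat1 <- best$lat[idx]
df$lon1 <- best$lon[idx]
df
```

The same pattern would apply to the locality merge by ranking on 'loc.check' instead of 'geo.check'.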
For the merge of taxonomic, geographic and locality information, the specimens used as references of the best information available for each group of duplicates are stored in the columns 'ref.spec.tax', 'ref.spec.geo' and 'ref.spec.loc', respectively. The merge of collector information (i.e. collector name, number and year) is planned, but not yet implemented in the current version of the package.
If overwrite == FALSE, the function returns the input data frame dups plus
new columns containing the homogenized information. The names of these
columns are the same as the original ones, but with an added suffix '1'.
If overwrite == TRUE, the homogenized information is saved in the same
columns of the input data and the names of the columns remain the same.
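The naming convention for the output columns can be sketched in base R as follows. This only illustrates the suffix rule stated above; the object out.names is hypothetical and the snippet does not call mergeDup().

```r
# Two of the default taxonomic column names
tax.names <- c(family = "family.new", species = "scientificName.new")
overwrite <- FALSE

# With overwrite = FALSE the merged values go to new columns named with an
# added suffix '1'; with overwrite = TRUE the original names are reused
out.names <- if (overwrite) unname(tax.names) else paste0(tax.names, "1")
out.names
```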
Renato A. F. de Lima
prepDup and getDup.
# An example for the merge of taxonomic information only
(df = data.frame(
  ID = c("a7","b2","c4","d1","e9","f3","g2","h8","i6","j5"),
  dup.ID = c("a7|b2","a7|b2","c4|d1|e9","c4|d1|e9","c4|d1|e9",
             "f3|g2","f3|g2","h8|i6|j5","h8|i6|j5","h8|i6|j5"),
  fam = c("AA","AA","BB","BB","Bb","CC","DD","EE","Ee","ee"),
  sp = c("a a","a b","c c","c d","c d","e e","f f","h h","h h","h h"),
  det = c("spec","n_spec","spec1","spec2","n_spec1",
          "spec3","spec4","n_spec2","n_spec3","n_spec4"),
  yr = c("2010","2019","2019","2000","2020",NA,"1812","2020","2020","2020"),
  check = c("high","low","high","high","low","high","high","low","low","low"),
  stat = rep("possibly_ok", 10)))

mergeDup(df, info2merge = "tax",
         rec.ID = "ID",
         tax.names = c(family = "fam",
                       species = "sp",
                       det.name = "det",
                       det.year = "yr",
                       tax.check = "check",
                       status = "stat"))