#' @title Merge Duplicate Information
#'
#' @description This function homogenizes the information of different groups of
#' fields (e.g. taxonomic, geographic or locality) for groups of duplicate
#' specimens.
#'
#' @param dups the input data frame.
#' @param dup.name character. The name of the column in the input data frame
#' with the duplicate group ID. Defaults to the __plantR__ output 'dup.ID'.
#' @param prop.name character. The name of the column in the input data frame
#' with the proportion of duplicates found within the group ID. Defaults to
#' the __plantR__ output 'dup.prop'.
#' @param prop numerical. The minimum proportion of duplicated values
#' retrieved (i.e. 'dup.prop') that a record must have to enter the merging
#' routine. Should be between zero and one. Defaults to 0.75.
#' @param rec.ID character. The name of the column containing the unique
#' record identifier (see function `getTombo()`). Defaults to 'numTombo'.
#' @param info2merge Vector. The groups of information (i.e. fields) to be
#' merged. Currently, only taxonomic ('tax'), geographic ('geo') and/or
#' locality ('loc') information can be merged. Defaults to merging all three.
#' @param tax.names Vector. A named vector containing the names of columns in
#' the input data frame with the taxonomic information to be merged.
#' @param geo.names Vector. A named vector containing the names of columns in
#' the input data frame with the geographical information to be merged.
#' @param loc.names Vector. A named vector containing the names of columns in
#' the input data frame with the locality information to be merged.
#' @param tax.level character. A vector with the confidence level(s) of the
#' identification that should be considered when merging taxonomic
#' information. Defaults to "high".
#' @param overwrite logical. Should the merged information overwrite the
#' original columns or be stored in separate, new columns? Defaults to FALSE
#' (new columns are created).
#'
#' @author Renato A. F. de Lima
#'
#' @details
#' The homogenization of the information within groups of duplicates depends
#' on the existence of some fields in the input data frame, which result from
#' the __plantR__ workflow. The first essential field is the duplicate group
#' identifier, which is used to aggregate the records (see functions
#' `prepDup()` and `getDup()`). The name of the column with these identifiers
#' must be provided to the argument `dup.name` (defaults to 'dup.ID'). The
#' other essential fields depend on the type of merge desired (argument
#' `info2merge`): each type of merge requires a different set of column names,
#' which should be provided to the arguments `tax.names`, `geo.names`, and
#' `loc.names`.
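#'
#' As a minimal sketch (the data frame `my.df` and the column 'my.dup.id' are
#' hypothetical names, not __plantR__ defaults), custom column names are
#' passed directly to these arguments:
#'
#' ```r
#' merged <- mergeDup(my.df, dup.name = "my.dup.id", info2merge = "tax")
#' ```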
#'
#' For the merge of taxonomic information, the fields required by
#' `tax.names` are:
#' - 'family': the botanical family (default: 'family.new')
#' - 'species': the scientific name (default: 'scientificName.new')
#' - 'det.name': the identifier name (default: 'identifiedBy.new')
#' - 'det.year': the identification year (default: 'yearIdentified.new')
#' - 'tax.check': the confidence level of the taxonomic identification
#' (default: 'tax.check')
#' - 'status': the status of the taxon name (default: 'scientificNameStatus')
#'
#' For the merge of geographical information, the fields required by
#' `geo.names` are:
#' - 'lat': latitude in decimal degrees (default: 'decimalLatitude.new')
#' - 'lon': longitude in decimal degrees (default: 'decimalLongitude.new')
#' - 'org.coord': the origin of the coordinates (default: 'origin.coord')
#' - 'prec.coord': the precision of the coordinates (default:
#' 'precision.coord')
#' - 'geo.check': the result of the geo. coordinate validation (default:
#' 'geo.check')
#'
#' For the merge of locality information, the fields required by `loc.names`
#' are:
#' - 'loc.str': the locality search string (default: 'loc.correct')
#' - 'res.gazet': the resolution of the gazetteer coordinates (default:
#' 'resolution.gazetteer')
#' - 'res.orig': the resolution of the source coordinates (default:
#' 'resol.orig')
#' - 'loc.check': the result of the locality validation (default: 'loc.check')
#'
#' For all groups of information (i.e. taxonomic, geographic and locality), the
#' merging process consists of ordering the information available for each
#' group of duplicates from the best to the worst quality/resolution available.
#' The best information available is then expanded to all records of the group
#' of duplicates. The argument `prop` defines the duplicated proportion (given
#' by `prop.name`) that should be used as a threshold: only records with
#' duplicated proportions at or above this threshold are merged. For all other
#' records, the output is the same as the input. If no column `prop.name` is
#' found in the input data, the merge is performed for all records, with a
#' warning.
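#'
#' As a schematic illustration of this gating (plain R, not a __plantR__
#' call), with the default `prop = 0.75` only records whose duplicated
#' proportion reaches the threshold enter the merge:
#'
#' ```r
#' dup.prop <- c(1, 0.5, 0.8)
#' dup.prop >= 0.75  # TRUE FALSE TRUE: the 2nd record is returned unchanged
#' ```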
#'
#' For the merge of taxonomic information, the specimen(s) with the highest
#' confidence level of the identification is used as the standard, from which
#' the taxonomic information is expanded to other specimens within the same
#' group of duplicates. By default, `mergeDup()` uses specimens flagged as
#' having a 'high' confidence level.
#'
#' In the case of conflicting species identification among specialists for the
#' same group of duplicates, the most recent identification is assumed as the
#' most up-to-date one. Note that if the year of identification is missing from
#' one or more records, the corresponding identifications of these records are
#' not taken into account while trying to assign the most up-to-date
#' identification for a group of duplicates.
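#'
#' This tie-breaking relies on base R semantics, as in the following sketch:
#'
#' ```r
#' det.year <- c(2000, NA, 2020)
#' which.max(det.year)  # 3: the NA is ignored and the 2020 name is kept
#' ```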
#'
#' For the merge of geographical information, specimens are ordered according
#' to the result of their geographical validation (i.e. field 'geo.check') and
#' the resolutions of the original geographical coordinates. Thus, for each
#' group of duplicates the specimen whose coordinates were validated at the
#' best level (e.g. 'ok_county') is used to expand the information for the
#' specimens validated at lower levels (e.g. state or country levels).
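#'
#' Internally this is a priority ordering; the sketch below mimics it with a
#' shortened version of the ranking of 'geo.check' levels used by the function:
#'
#' ```r
#' lvs <- c("ok_county", "ok_state", "ok_country")  # shortened ranking
#' geo.check <- c("ok_state", "ok_county", "ok_country")
#' geo.check[which.min(match(geo.check, lvs))]  # "ok_county" is the reference
#' ```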
#'
#' A similar procedure is performed to merge the information regarding the
#' locality description. Specimens are ordered according to the result of their
#' locality validation (i.e. field 'loc.check'), and the one ranked best within
#' the group of duplicates (e.g. 'ok_municip.2locality') is the one used as the
#' standard.
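#'
#' The same ordering logic applies, now over the 'loc.check' levels (sketch):
#'
#' ```r
#' lvs <- c("ok_municip.2locality", "ok_state2locality", "ok_country2state")
#' loc.check <- c("ok_country2state", "ok_municip.2locality")
#' loc.check[which.min(match(loc.check, lvs))]  # "ok_municip.2locality" wins
#' ```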
#'
#' For the merge of taxonomic, geographic and locality information, the
#' specimens used as references of the best information available for each
#' group of duplicates are stored in the columns 'ref.spec.tax', 'ref.spec.geo'
#' and 'ref.spec.loc', respectively. The merge of collector information (i.e.
#' collector name, number and year) is planned, but not yet implemented in
#' the current version of the package.
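#'
#' A quick way to inspect which record served as the taxonomic reference of
#' each group (a sketch; `res` is assumed to be a `mergeDup()` output):
#'
#' ```r
#' unique(res[, c("dup.ID", "ref.spec.tax")])
#' ```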
#'
#' @return If `overwrite == FALSE`, the function returns the input data frame
#' \code{dups} plus the new columns containing the homogenized information.
#' The names of these columns are the same as the original ones, with the
#' suffix '1' appended. If `overwrite == TRUE`, the homogenized information is
#' saved in the same columns of the input data and the column names remain
#' the same.
#'
#' @examples
#' # An example for the merge of taxonomic information only
#' (df <- data.frame(
#'   ID = c("a7", "b2", "c4", "d1", "e9", "f3", "g2", "h8", "i6", "j5"),
#'   dup.ID = c("a7|b2", "a7|b2", "c4|d1|e9", "c4|d1|e9", "c4|d1|e9",
#'              "f3|g2", "f3|g2", "h8|i6|j5", "h8|i6|j5", "h8|i6|j5"),
#'   fam = c("AA", "AA", "BB", "BB", "Bb", "CC", "DD", "EE", "Ee", "ee"),
#'   sp = c("a a", "a b", "c c", "c d", "c d",
#'          "e e", "f f", "h h", "h h", "h h"),
#'   det = c("spec", "n_spec", "spec1", "spec2", "n_spec1",
#'           "spec3", "spec4", "n_spec2", "n_spec3", "n_spec4"),
#'   yr = c("2010", "2019", "2019", "2000", "2020",
#'          NA, "1812", "2020", "2020", "2020"),
#'   check = c("high", "low", "high", "high", "low",
#'             "high", "high", "low", "low", "low"),
#'   stat = rep("possibly_ok", 10)))
#'
#' mergeDup(df, info2merge = "tax",
#'          rec.ID = "ID",
#'          tax.names = c(family = "fam",
#'                        species = "sp",
#'                        det.name = "det",
#'                        det.year = "yr",
#'                        tax.check = "check",
#'                        status = "stat"))
#'
#'
#' @seealso
#' \link[plantR]{prepDup} and \link[plantR]{getDup}.
#'
#' @import data.table
#'
#' @export mergeDup
#'
#'
mergeDup <- function(dups, dup.name = "dup.ID", prop.name = "dup.prop",
                     prop = 0.75, rec.ID = "numTombo",
                     info2merge = c("tax", "geo", "loc"),
                     tax.names = c(family = "family.new",
                                   species = "scientificName.new",
                                   det.name = "identifiedBy.new",
                                   det.year = "yearIdentified.new",
                                   tax.check = "tax.check",
                                   status = "scientificNameStatus"),
                     geo.names = c(lat = "decimalLatitude.new",
                                   lon = "decimalLongitude.new",
                                   org.coord = "origin.coord",
                                   prec.coord = "precision.coord",
                                   geo.check = "geo.check"),
                     loc.names = c(loc.str = "loc.correct",
                                   res.gazet = "resolution.gazetteer",
                                   res.orig = "resol.orig",
                                   loc.check = "loc.check"),
                     tax.level = "high", overwrite = FALSE) {
## check input
if (!inherits(dups, "data.frame"))
stop("Input object needs to be a data frame!")
if (dim(dups)[1] == 0)
stop("Input data frame is empty!")
#Escaping R CMD check notes from using data.table syntax
dup.IDs <- dup.merge <- dup.prop <- same_spp <- tax.check.wk <- status.wk <- NULL
species.wk <- det.year.wk <- tax.check.wk1 <- status.wk1 <- geo.check.wk <- NULL
valor1 <- i.valor <- valor2 <- prioridade <- valor <- res.orig.wk <- NULL
decimalLatitude.new <- decimalLongitude.new <- prioridade.tax <- NULL
ref.spec.tax <- ref.spec.geo <- ref.spec.loc <- NULL
temp.rec.ID <- tmp.ordem <- tmp.prop.name <- NULL
#Checking essential columns
if (!dup.name %in% names(dups))
stop(paste0("Merge is only possible if the input data contain a column with the duplicate IDs"))
if ("tax" %in% info2merge & !all(tax.names %in% names(dups))) {
warning(paste0("To merge taxonomic info, the input data must contain the following column(s): ",
paste0(tax.names[!tax.names %in% names(dups)], collapse = ", "),
". Skipping 'tax' from info to merge."), call. = FALSE)
info2merge <- info2merge[!info2merge %in% 'tax']
}
if ("geo" %in% info2merge & !all(geo.names %in% names(dups))) {
warning(paste0("To merge geographic info, the input data must contain the following column(s): ",
paste0(geo.names[!geo.names %in% names(dups)], collapse = ", "),
". Skipping 'geo' from info to merge."), call. = FALSE)
info2merge <- info2merge[!info2merge %in% 'geo']
}
if ("loc" %in% info2merge & !all(loc.names %in% names(dups))) {
warning(paste0("To merge locality info, the input data must contain the following column(s): ",
paste0(loc.names[!loc.names %in% names(dups)], collapse = ", "),
". Skipping 'loc' from info to merge."), call. = FALSE)
info2merge <- info2merge[!info2merge %in% 'loc']
}
if (length(info2merge) == 0)
stop(paste0("No information left to merge!"))
#Making sure there are no factors
fatores <- sapply(dups, class) == "factor"
if (any(fatores)) {
for (i in which(fatores))
dups[[i]] <- as.character(dups[[i]])
}
#Vector to keep the original data order
dups$tmp.ordem <- 1:dim(dups)[1]
#Subsetting the data with any indication of duplicates
dups1 <- dups[!is.na(dups[, dup.name]),]
dt <- data.table::data.table(dups1) # transforming the df into a data.table
dt[ , dup.IDs := .SD, .SDcols = c(dup.name)]
if (!rec.ID %in% names(dups)) {
warning("No unique record ID provided. Creating one",
call. = FALSE)
dt[, temp.rec.ID := tmp.ordem]
} else {
dt[ , temp.rec.ID := .SD, .SDcols = c(rec.ID)]
}
data.table::setkey(dt, dup.IDs) # setting 'dup.ID' as key to the data.table
#Creating the duplicate categories
if (!prop.name %in% names(dt)) {
warning("The input data has no column with the proportion of duplicates. Assuming to be 1",
call. = FALSE)
dt[, c(prop.name) := 1]
}
# comparing the values of the column named by 'prop.name' (not the string itself) to the threshold
dt[, dup.merge := as.numeric(get(prop.name)) >= prop]
if ("tax" %in% info2merge) {
# creating the new columns for taxonomic check
wk.cols <- paste0(names(tax.names), ".wk")
dt[, c(wk.cols) := .SD, .SDcols = tax.names]
## Finding the most recent, taxonomically valid name for each duplicate ID
#Are all taxonomically-validated species names equal within a group?
dt[, same_spp := "vazio"] # initializing the flag of within-group name agreement
#If all names of taxonomically-validated specimens are the same, mark the group as 'sim' (yes)
dt[dup.merge & tax.check.wk %in% tax.level,
same_spp := if (data.table::uniqueN(species.wk) == 1) "sim"
else same_spp, by = dup.IDs]
dt[, same_spp := if (any(same_spp == "sim") &
data.table::uniqueN(species.wk) == 1) "sim"
else same_spp, by = dup.IDs]
dt[, same_spp := if (any(same_spp == "sim") &
data.table::uniqueN(species.wk) > 1) "no"
else same_spp, by = dup.IDs]
#If not all names of taxonomically-validated specimens are the same, mark the group as 'no'
dt[dup.merge & tax.check.wk %in% tax.level,
same_spp := if (data.table::uniqueN(species.wk) > 1) "no"
else same_spp, by = dup.IDs]
dt[, same_spp := if (any(same_spp == "no")) "no"
else same_spp, by = dup.IDs]
## Defining the most up-to-date species name for the case of >=1 taxonomically validated names (new column species.correct1)
# converting det.year to numerical
data.table::setkey(dt, det.year.wk)
# dt[, det.year.wk := suppressWarnings(as.double(det.year.wk)),
# by = det.year.wk]
dt[, det.year.wk := suppressWarnings(lapply(.SD, as.numeric)),
.SDcols = c("det.year.wk")]
# creating the new columns for taxonomic check
new.cols <- paste0(wk.cols, "1")
dt[, c(new.cols) := .SD, .SDcols = tax.names]
# creating the order to organize names according to their status
dt[ , status.wk := gsub("\\|name_w_authors", "", status.wk, perl = TRUE)]
status.lvs <- data.table::data.table(
status.wk = c("forma", "subspecies", "variety", "not_Genus_epithet_format",
"hybrid_species","species_nova", "name_w_authors",
"name_w_non_ascii", "name_w_wrong_case", "not_name_has_digits",
"possibly_ok", "affinis", "conferre", "incertae_sedis",
"abbreviated_genus", "indet", "subfamily_as_genus",
"family_as_genus", "order_as_genus"),
valor = c(1,1,1,1.5, 2,2,3, 3,3,3, 3,4,4,5, 6,6,7, 8,9))
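# note: a lower 'valor' means a better name status, i.e. higher priority when ordering the groups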
# add priority values of status to the data
data.table::setDT(dt)[status.lvs, valor1 := i.valor,
on = c(status.wk = "status.wk")]
#For each duplicate group ID, saving the most recent identification in case of disagreement among taxonomically-validated specimens
dt[, tmp.prop.name := .SD, .SDcols = c(prop.name)]
dt[, tmp.prop.name := suppressWarnings(lapply(.SD, as.numeric)),
.SDcols = c("tmp.prop.name")]
dt[ , prioridade.tax := valor1 + (1 - tmp.prop.name)]
# dt[ , prioridade.tax := valor1 + (1 - as.numeric(dup.prop))]
dt[ , c("valor1", "tmp.prop.name") := NULL]
data.table::setkey(dt, dup.IDs, prioridade.tax)
dt[same_spp == "no",
c(new.cols) := .SD[tax.check.wk %in%
tax.level][which.max(det.year.wk[tax.check.wk %in% tax.level])],
by = dup.IDs, .SDcols = c(wk.cols)]
dt[same_spp == "sim",
c("det.name.wk1", "det.year.wk1") := .SD[tax.check.wk %in% tax.level &
which.max(det.year.wk)][1],
by = dup.IDs, .SDcols = c("det.name.wk1", "det.year.wk1")]
dt[same_spp == "sim",
tax.check.wk1 := if (any(tax.check.wk1 %in% tax.level)) tax.level
else tax.check.wk1, by = dup.IDs]
#Defining the reference specimen for each duplicate group
# dt[, ref.spec.tax := temp.rec.ID[1L], by = dup.IDs]
dt[, ref.spec.tax := NA_character_]
dt[same_spp == "no",
ref.spec.tax := temp.rec.ID[tax.check.wk %in%
tax.level][which.max(det.year.wk[tax.check.wk %in% tax.level])],
by = dup.IDs]
dt[same_spp == "sim",
ref.spec.tax := temp.rec.ID[tax.check.wk %in% tax.level &
which.max(det.year.wk)][1],
by = dup.IDs]
dt[same_spp == "vazio", ref.spec.tax := temp.rec.ID[1L], by = dup.IDs]
#Removing unnecessary columns
dt[, c("same_spp", "prioridade.tax", wk.cols) := NULL]
#Renaming the new columns
new.cols1 <- paste0(tax.names, "1")
data.table::setnames(dt, c(new.cols), c(new.cols1))
}
if ("geo" %in% info2merge) {
# creating the new columns for geographic check
wk.cols <- paste0(names(geo.names),".wk")
dt[, c(wk.cols) := .SD, .SDcols = geo.names]
# replacing NAs
dt[, geo.check.wk := ifelse(is.na(geo.check.wk),
"no_cannot_check", geo.check.wk), ]
cols <- c('prec.coord.wk',"lat.wk", "lon.wk")
dt[, (cols) := replace(.SD, is.na(.SD), "no_coord"), .SDcols = cols]
# creating the new columns for the geographical check and merge
new.cols <- paste0(wk.cols, "1")
dt[, c(new.cols) := .SD, .SDcols = geo.names]
# converting lat and long to character
lat.long.tmp <- new.cols[which(names(geo.names) %in% c("lat","lon"))]
dt[, (lat.long.tmp) := lapply(.SD, as.character), .SDcols = lat.long.tmp]
# creating the order/priority for geo info replacement
patt <- c('invert_lon|invert_lat|invert_both|transposed|transp_inv_lon|transp_inv_lat|transp_inv_both')
dt[, geo.check.wk := gsub(patt, "", geo.check.wk, perl = TRUE), ]
dt[, geo.check.wk := gsub('\\[\\]', "", geo.check.wk, perl = TRUE), ]
geo.check.lvs <- data.table::data.table(
geo.check.wk = c("ok_county", "ok_county_close","ok_locality_gazet","ok_county_gazet",
"shore", "ok_state", "ok_state_gazet", "ok_country", "ok_country_gazet",
"ok_country[border]",
"no_cannot_check", "check_gazetteer", "bad_country", "open_sea"),
valor = c(1,3,4,5,6,8,9,10,11,12,20,20,25,25))
coord.prec.lvs <- data.table::data.table(
prec.coord.wk = c("miliseconds", "seconds","seconds_centroid",
"minutes", "degrees", "no_coord"),
valor = c(0,0,0,1,6,20))
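# note: in both tables lower values mean better-validated coordinates and finer precision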
# add values of geo.check and coord. resolution to the original data
data.table::setDT(dt)[geo.check.lvs, valor1 := i.valor, on = c(geo.check.wk = "geo.check.wk")]
data.table::setDT(dt)[coord.prec.lvs, valor2 := i.valor, on = c(prec.coord.wk = "prec.coord.wk")]
# creating the priority index and re-organizing the original data
dt[ , prioridade := valor1 + valor2]
dt[ , c("valor1", "valor2") := NULL]
data.table::setkey(dt, dup.IDs, prioridade)
#Replacing the columns with the first (i.e. best-ranked) row within each group of duplicates
dt[dup.merge == TRUE, (new.cols) := .SD[1L],
by = dup.IDs, .SDcols = wk.cols]
#Defining the reference specimen for each duplicate group
dt[, ref.spec.geo := temp.rec.ID[1L], by = dup.IDs]
#Removing unnecessary columns
dt[, c("prioridade", wk.cols) := NULL]
#Renaming the new columns
new.cols1 <- paste0(geo.names, "1")
data.table::setnames(dt, c(new.cols), c(new.cols1))
}
if ("loc" %in% info2merge) {
# creating the new columns for locality check
wk.cols <- paste0(names(loc.names),".wk")
dt[, c(wk.cols) := .SD, .SDcols = loc.names]
# removing NAs and creating the new columns for locality and geographical check
cols <- c('loc.str.wk', 'loc.check.wk')
dt[, (cols) := replace(.SD, is.na(.SD), "no_loc"), .SDcols = cols]
# creating the new columns for the locality check and merge
new.cols <- paste0(wk.cols[!wk.cols %in% "res.orig.wk"], "1")
dt[, c(new.cols) := .SD, .SDcols = c(loc.names[!names(loc.names) %in% "res.orig"])]
# creating the order/priority for locality info replacement
loc.check.lvs <- data.table::data.table(
loc.check.wk = c("check_local.2no.info","check_local.2country","check_local.2state","check_local.2municip.",
"check_municip.2no.info","check_municip.2country","check_municip.2state",
"check_state2no.info","check_state2country",
"check_country2no.info",
"ok_country2state","ok_country2municip.","ok_country2locality",
"ok_state2municip.","ok_state2locality",
"ok_municip.2locality",
"ok_same_resolution"),
valor = c(5,4,3,2,5,4,3,5,4,5,3,2,1,2,1,1,NA_integer_))
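# note: lower 'valor' = finer locality resolution; the NA for 'ok_same_resolution'
# is filled below from the resolution of the original record ('res.orig')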
# add values of order/priority to the original data
data.table::setDT(dt)[loc.check.lvs, valor := i.valor,
on = c(loc.check.wk = "loc.check.wk")]
# add values of order/priority to the original data without loc.check info
dt[is.na(valor) & res.orig.wk %in% "no_info", valor := 5]
dt[is.na(valor) & res.orig.wk %in% "country", valor := 4]
dt[is.na(valor) & res.orig.wk %in% "stateProvince", valor := 3]
dt[is.na(valor) & res.orig.wk %in% "municipality", valor := 2]
dt[is.na(valor) & res.orig.wk %in% "locality", valor := 1]
# re-organizing the original data
data.table::setkey(dt, dup.IDs, valor)
#Replacing the columns with the first (i.e. best-ranked) row within each group of duplicates
## include an extra step: paste(unique(loc.correct[which.min(valor)][1]), collapse="|")??
dt[dup.merge == TRUE, (new.cols) := .SD[1L],
by = dup.IDs, .SDcols = c(wk.cols[!wk.cols %in% "res.orig.wk"])]
#Defining the reference specimen for each duplicate group
dt[, ref.spec.loc := temp.rec.ID[1L], by = dup.IDs]
#Removing unnecessary columns
dt[, c("valor", wk.cols) := NULL]
#Renaming the new columns
new.cols1 <- paste0(loc.names[!names(loc.names) %in% "res.orig"], "1")
data.table::setnames(dt, c(new.cols), c(new.cols1))
}
if (!prop.name %in% names(dups1))
dt[, c(prop.name) := NULL]
if (overwrite) {
## Removing columns that are no longer necessary
dt[, c("dup.IDs", "dup.merge", "temp.rec.ID") := NULL]
data.table::setkeyv(dt, c("tmp.ordem"))
## Detecting the target columns to overwrite
colunas <- data.table::copy(names(dt))
colunas <- colunas[!colunas %in% names(dups)]
colunas0 <- gsub("1", "", colunas, perl = TRUE)
# subsetting and editing the new columns
tmp.dt <- data.table::data.table(dt[, .SD, .SDcols = colunas])
names(tmp.dt) <- gsub("1", "", names(tmp.dt))
tmp.dt[, decimalLatitude.new := as.double(decimalLatitude.new)]
tmp.dt[, decimalLongitude.new := as.double(decimalLongitude.new)]
# transforming the original data frame and overwriting the new columns
dt1 <- data.table::data.table(dups)
ref.cols <- c("ref.spec.tax", "ref.spec.geo", "ref.spec.loc")
dt1[, (ref.cols) := NA_character_]
colunas2 <- names(dt1)[names(dt1) %in% colunas0]
data.table::setcolorder(tmp.dt, colunas2)
data.table::setkeyv(dt1, c("tmp.ordem"))
replace_these <- as.vector(dt1[ , !is.na(.SD), .SDcols = c(dup.name)])
dt1[replace_these, (colunas2) := tmp.dt]
} else {
## Removing columns that are no longer necessary
dt[, c("dup.IDs", "dup.merge", "temp.rec.ID") := NULL]
## Creating the missing new columns in the unicate subset of the data
ref.cols <- c("ref.spec.tax", "ref.spec.geo", "ref.spec.loc")
colunas <- names(dt)[!names(dt) %in% c(names(dups), ref.cols)]
colunas0 <- unique(gsub("1", "", colunas))
# data with no indication of duplicates
dt0 <- data.table::data.table(dups[is.na(dups[, dup.name]), ])
dt0[, (colunas) := .SD, .SDcols = colunas0]
ref.cols1 <- ref.cols[ref.cols %in% names(dt)]
dt0[, (ref.cols1) := NA_character_]
data.table::setcolorder(dt0, names(dt))
# merging the unicate and duplicated data tables
dt1 <- data.table::rbindlist(list(dt0, dt))
# re-ordering the resulting data.table to the input order
data.table::setkeyv(dt1, c("tmp.ordem"))
}
#Removing unnecessary columns and returning
dt1[, c("tmp.ordem") := NULL]
dups1 <- data.frame(dt1)
return(dups1)
}