rmDup: Remove Duplicates

rmDup R Documentation

Remove Duplicates

Description

This function keeps one specimen and removes all the others within each group of duplicate specimens.

Usage

rmDup(
  df,
  dup.name = "dup.ID",
  prop.name = "dup.prop",
  rec.ID = "numTombo",
  order.by = NULL,
  rm.all = FALSE,
  print.rm = TRUE
)

Arguments

df

the input data frame.

dup.name

character. The name of the column in the input data frame containing the duplicate group ID. Defaults to the plantR output 'dup.ID'.

prop.name

character. The name of the column in the input data frame containing the proportion of duplicates found within the group ID. Defaults to the plantR output 'dup.prop'.

rec.ID

character. The name of the column containing the unique record identifier (see function getTombo()). Defaults to 'numTombo'.

order.by

character. Column name(s) used to order records within groups of duplicates. Defaults to NULL.

rm.all

logical. Should all duplicates be removed, or only the duplicated entries from the same collection but different sources? Defaults to FALSE.

print.rm

logical. Should the number of records removed be printed? Defaults to TRUE.

Details

The input data frame df must contain the typical columns resulting from the plantR workflow and functions. Otherwise, the names of these columns should be provided using the arguments dup.name (i.e. the characters used to aggregate records into groups of duplicates) and prop.name (i.e. the proportion of duplicated records).

Since only one record is kept per group of duplicates, this procedure should preferably be carried out after the homogenization of the specimen information (see function mergeDup()). Otherwise, important information in the removed records may be lost.

In addition, since not all columns are merged within duplicates (only the columns related to the taxonomic, geographical and location validation procedures), all other information contained in the removed records is lost. Therefore, make sure that this information is unnecessary for your specific purposes before using this function.

By default, the record retained for each group of duplicates is determined by the proportion of duplicates the record has within the group (argument prop.name) and by the original order of the input data frame df. So, the first record with the highest proportion of duplicates is the one retained. However, the user can use the argument order.by if the data should be ordered by any of the columns in the input data. The column(s) given are used to set the 'key', in data.table parlance, and the data are ordered accordingly.
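The default retention rule described above can be illustrated with a small base R sketch. This is not the plantR implementation: keep_best() is a hypothetical helper written only for illustration, it ignores order.by and rm.all, and it assumes that records with a missing duplicate group ID are kept as they are.

```r
# Toy data using the default column names from the Usage section
df <- data.frame(
  numTombo = c("a1", "b2", "d5", "d5", "e7", "g9"),
  dup.ID   = c("a1|b2", "a1|b2", "d5|d5|e7", "d5|d5|e7", "d5|d5|e7", NA),
  dup.prop = c(1, 1, 0.5, 0.5, 1, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper (NOT part of plantR): for each group of duplicates,
# keep the first record with the highest proportion of duplicates.
keep_best <- function(df, dup.name = "dup.ID", prop.name = "dup.prop") {
  rows <- seq_len(nrow(df))
  grouped <- !is.na(df[[dup.name]])   # records assigned to a duplicate group
  prop <- df[[prop.name]]
  best <- vapply(
    split(rows[grouped], df[[dup.name]][grouped]),
    # Highest proportion first; ties are broken by the original row order
    function(i) i[order(-prop[i], i)[1]],
    integer(1)
  )
  # Assumption: ungrouped records (NA group ID) are left untouched
  df[sort(c(rows[!grouped], unname(best))), , drop = FALSE]
}

res <- keep_best(df)
res$numTombo
```

In this toy data set the records kept are "a1" (first of two tied records), "e7" (highest proportion in its group) and "g9" (no duplicate group).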

Finally, users can choose between removing all but one record within each group of duplicates, or removing only those records with duplicated entries from the same collection in different sources (i.e. virtual duplicates), using the argument rm.all. This option can be useful if the same collection has its database in two or more repositories (e.g. speciesLink and GBIF). It is important to note that this removal depends on the duplicate group ID found for each record. So, if the information was entered differently in the different sources, the records are not guaranteed to be grouped under the same duplicate group ID, and thus may not be excluded from the data.
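As a rough illustration of what a virtual duplicate looks like in the data, the sketch below flags records whose identifier is repeated within the same duplicate group. This is only an assumption about how such records might be recognized (a repeated 'numTombo' inside one 'dup.ID'); it is not taken from the plantR source, and the actual behaviour of rm.all = FALSE may differ.

```r
# Toy data: the same record identifier appears twice in two of the groups,
# as would happen if one collection was downloaded from two repositories
df <- data.frame(
  numTombo = c("a1", "b2", "c3", "c3", "d5", "d5", "e7"),
  dup.ID   = c("a1|b2", "a1|b2", "c3|c3", "c3|c3",
               "d5|d5|e7", "d5|d5|e7", "d5|d5|e7"),
  stringsAsFactors = FALSE
)

# Assumption (illustration only): a virtual duplicate is a record whose
# identifier was already seen within the same duplicate group
virtual <- !is.na(df$dup.ID) & duplicated(df[c("dup.ID", "numTombo")])

# Drop only the virtual duplicates; "a1" and "b2" are both kept because
# they are different records within the same group
df[!virtual, ]
```

Under this assumption only the second "c3" and the second "d5" are flagged, which also shows why records entered differently in each source, and hence assigned to different groups, would go undetected.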

Author(s)

Renato A. F. de Lima

Examples

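# Toy data: 'dup.ID' gives the group of duplicates of each record and
# 'dup.prop' the proportion of duplicates found within that group
(df <- data.frame(numTombo = c("a1", "b2", "c3", "c3", "d5", "d5", "e7", "f4", "g9"),
                  dup.ID = c("a1|b2", "a1|b2", "c3|c3", "c3|c3", "d5|d5|e7",
                             "d5|d5|e7", "d5|d5|e7", "f4", NA),
                  dup.prop = c(1, 1, 1, 1, 0.5, 0.5, 1, 1, NA),
                  stringsAsFactors = FALSE))
# Default: remove only duplicated entries from the same collection
rmDup(df)
# Remove all but one record within each group of duplicates
rmDup(df, rm.all = TRUE)
# Same as above, without printing the number of records removed
rmDup(df, rm.all = TRUE, print.rm = FALSE)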


LimaRAF/plantR documentation built on Jan. 1, 2023, 10:18 a.m.