rmDup | R Documentation |
This function keeps one and removes all other specimens within groups of duplicate specimens.
rmDup( df, dup.name = "dup.ID", prop.name = "dup.prop", rec.ID = "numTombo", order.by = NULL, rm.all = FALSE, print.rm = TRUE )
df |
the input data frame. |
dup.name |
character. The name of the column in the input data frame with the duplicate group ID. Defaults to the plantR output 'dup.ID'. |
prop.name |
character. The name of the column in the input data frame with the proportion of duplicates found within the group ID. Defaults to the plantR output 'dup.prop'. |
rec.ID |
character. The name of the columns containing the unique record
identifier (see function |
order.by |
character. Column name(s) used to order records within groups of duplicates. |
rm.all |
logical. Should all duplicates be removed, or only the duplicated entries from the same collection but different sources? Defaults to FALSE. |
print.rm |
logical. Should the number of records removed be printed? Defaults to TRUE. |
The input data frame df must contain the typical columns resulting from the plantR workflow and functions. Otherwise, the names of these columns should be provided using the arguments dup.name (i.e. characters used to aggregate records into groups of duplicates) and prop.name (i.e. proportion of duplicated records).
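A quick way to see whether the column-name arguments are needed is to compare the default names against the columns of the input. The sketch below uses only base R; the data frame and its column names my.group and my.prop are hypothetical and serve only as an illustration.

```r
# Hypothetical input whose duplicate columns do not use the default plantR names.
df <- data.frame(
  numTombo = c("a1", "b2"),
  my.group = c("a1|b2", "a1|b2"),
  my.prop  = c(1, 1),
  stringsAsFactors = FALSE
)

# Default column names taken from the Usage section of this page.
expected <- c("dup.ID", "dup.prop", "numTombo")

# Names that are absent from the input and must therefore be supplied by hand.
missing.cols <- setdiff(expected, names(df))
missing.cols

# With plantR loaded, the custom names would then be passed explicitly:
# rmDup(df, dup.name = "my.group", prop.name = "my.prop")
```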
Since only one record is kept per group of duplicates, this procedure should preferably be carried out after the homogenization of the specimen information (see function mergeDup()). Otherwise, important information on the removed records may be lost.
In addition, since not all columns are merged within duplicates (only the columns related to the taxonomic, geographical and location validation procedures), all other information contained in the removed records is lost. Therefore, make sure that this information is unnecessary for your specific purposes before using this function.
By default, the record retained for each group of duplicates is determined by the proportion of duplicates the record has within the group (argument prop.name) and by the original order of the input data frame df. So, the first record with the highest proportion of duplicates will be the record retained. The user can instead supply the argument order.by if the data should be ordered by any of the columns in the input data. This column will be used to create the 'key', in data.table parlance, and to order the data accordingly.
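The default retention rule described above can be sketched in base R. This is an illustration of the rule as stated on this page, not the plantR implementation (which, according to the text, relies on data.table keys); the toy data and the object names grp, sorted and kept are assumptions made for the example, and records without a group ID are left out for simplicity.

```r
# Toy data using the default column names (values are illustrative only).
df <- data.frame(
  numTombo = c("a1", "b2", "d5", "d5", "e7", "f4"),
  dup.ID   = c("a1|b2", "a1|b2", "d5|d5|e7", "d5|d5|e7", "d5|d5|e7", "f4"),
  dup.prop = c(1, 1, 0.5, 0.5, 1, 1),
  stringsAsFactors = FALSE
)

# Integer group index in order of first appearance, so sorting is locale-independent.
grp <- match(df$dup.ID, unique(df$dup.ID))

# Within each group, put the highest proportion first; order() leaves
# remaining ties in their original row order.
sorted <- df[order(grp, -df$dup.prop), ]

# Keep the first record of each group.
kept <- sorted[!duplicated(sorted$dup.ID), ]
kept$numTombo
```

In the group "a1|b2" both records have the same proportion, so the one that comes first in the input is retained; in "d5|d5|e7" the record with the highest proportion is retained regardless of its position.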
Finally, users can choose between removing all but one record within each group of duplicates, or removing only those records with duplicated entries from the same collection in different sources (i.e. virtual duplicates), using the argument rm.all. The latter option can be useful if the same collection has its database in two or more repositories (e.g. speciesLink and GBIF). It is important to note that this removal depends on the duplicate group ID found for each record. So, if the information was entered differently in the different sources, it is not guaranteed that the records will be grouped under the same duplicate group ID, and thus that they will be excluded from the data.
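As a rough illustration of what a virtual duplicate looks like in the data, the base R sketch below flags rows whose record identifier repeats within the same duplicate group. Treating a repeated identifier within a group as the signal of a virtual duplicate is an assumption made for this example; the criterion actually used by rmDup() may differ.

```r
# Toy data: "c3" and "d5" each appear twice within their own duplicate group.
df <- data.frame(
  numTombo = c("c3", "c3", "d5", "d5", "e7"),
  dup.ID   = c("c3|c3", "c3|c3", "d5|d5|e7", "d5|d5|e7", "d5|d5|e7"),
  stringsAsFactors = FALSE
)

# Flag a row when the same (group ID, record ID) pair has already appeared above it.
virtual <- duplicated(df[, c("dup.ID", "numTombo")])

# Dropping only the flagged rows keeps the true duplicate "e7" alongside "d5".
no.virtual <- df[!virtual, ]
no.virtual$numTombo
```

Under this assumption, a record entered differently in a second source would receive a different identifier or group ID and would not be flagged, which is the caveat stated above.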
Renato A. F. de Lima
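# Toy data frame with record IDs, duplicate group IDs and duplicate proportions
(df <- data.frame(
  numTombo = c("a1", "b2", "c3", "c3", "d5", "d5", "e7", "f4", "g9"),
  dup.ID = c("a1|b2", "a1|b2", "c3|c3", "c3|c3", "d5|d5|e7", "d5|d5|e7",
             "d5|d5|e7", "f4", NA),
  dup.prop = c(1, 1, 1, 1, 0.5, 0.5, 1, 1, NA),
  stringsAsFactors = FALSE
))

# Default: rm.all = FALSE
rmDup(df)
# Remove all duplicates
rmDup(df, rm.all = TRUE)
# Same, without printing the number of records removed
rmDup(df, rm.all = TRUE, print.rm = FALSE)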