prepDup | R Documentation |
This function creates the duplicate search strings by concatenating the information on the taxonomy, collection and locality of the records.
prepDup(
  x,
  col.names = c(family = "family.new", species = "scientificName.new",
                col.name = "recordedBy.new", col.last.name = "last.name",
                col.number = "recordNumber.new", col.year = "year.new",
                col.loc = "municipality.new", loc.str = "loc.correct"),
  comb.fields = list(c("family", "col.last.name", "col.number", "col.loc"),
                     c("family", "col.year", "col.number", "col.loc"),
                     c("species", "col.last.name", "col.number", "col.year"),
                     c("col.year", "col.last.name", "col.number", "col.loc")),
  rec.ID = "numTombo",
  noYear = "s.d.",
  noName = "s.n.",
  noNumb = "s.n.",
  ignore.miss = TRUE
)
x |
a data frame with the species records. |
col.names |
vector. A named vector containing the names of the columns in the input data frame that hold each piece of information used to create the duplicate search string(s). Defaults to the plantR output column names. |
comb.fields |
list. A list containing one or more vectors naming the fields to be combined into each duplicate search string. Defaults to four combinations of fields. |
rec.ID |
character. The name of the column containing the unique record
identifier (see function |
noYear |
character. Standard for missing data in the collection year. Defaults to "s.d.". |
noName |
character. Standard for missing data in the collector name. Defaults to "s.n.". |
noNumb |
character. Standard for missing data in the collector number. Defaults to "s.n.". |
ignore.miss |
logical. Should the duplicate search strings with missing/unknown information (e.g. 's.d.', 's.n.', NA) be excluded from the duplicate search? Defaults to TRUE. |
The fields available to produce the duplicate search string fall into
three groups, related to the taxonomy, collection and locality of the
specimen. These fields should be provided to the argument col.names, and
they are:
'family': the botanical family (default: 'family.new')
'species': the scientific name (default: 'scientificName.new')
'col.name': the collector name (default: 'recordedBy.new')
'col.last.name': the collector last name (default: 'last.name')
'col.number': the collector serial number (default: 'recordNumber.new')
'col.year': the collection year (default: 'year.new')
'col.loc': the collection locality (default: 'municipality.new')
The corresponding columns that should be used to retrieve these fields in
the input data frame must be provided as a named vector in the argument
col.names, in which the fields listed above are the names and each element
is the corresponding column name in the input data frame.
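As a minimal sketch of this mapping, assume a hypothetical input data frame whose columns do not follow the plantR naming (the object `occs` and all of its column names below are illustrative assumptions, not plantR outputs). The named vector could then be built and checked like this:

```r
# Hypothetical input with non-default column names (illustrative only)
occs <- data.frame(
  fam = c("Myrtaceae", "Myrtaceae"),
  sp = c("Eugenia uniflora", "Eugenia uniflora"),
  collector = c("Lima, R.A.F.", "Lima, R.A.F."),
  surname = c("lima", "lima"),
  number = c("1234", "1234"),
  yr = c("2010", "2010"),
  munic = c("sao paulo", "sao paulo"),
  stringsAsFactors = FALSE
)

# Names are the fields listed above; values are the matching columns in 'occs'
my.cols <- c(family = "fam", species = "sp", col.name = "collector",
             col.last.name = "surname", col.number = "number",
             col.year = "yr", col.loc = "munic")

# Check that every mapped column exists in the input before using the vector
missing.cols <- setdiff(my.cols, names(occs))
stopifnot(length(missing.cols) == 0)

# Retrieve one field through the mapping
col.number.values <- occs[[my.cols[["col.number"]]]]
```

A vector such as `my.cols` is what the argument col.names expects in place of its default.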
If an element named 'loc.str' containing the column name of the plantR locality string (i.e. 'loc.correct') is also provided, it can be used to fill in missing information on the locality of the collection (i.e. 'col.loc') with locality information that may have been retrieved during data processing within the plantR workflow.
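The idea of complementing a missing locality can be sketched in base R. This is not the package's code: it assumes, purely for illustration, that the locality string has the underscore-separated form "country_state_municipality", and the records below are made up.

```r
# Assumption: the locality string has the form "country_state_municipality"
recs <- data.frame(
  municipality.new = c("campinas", NA, NA),
  loc.correct = c("brazil_sao paulo_campinas", "brazil_parana_curitiba",
                  "brazil_bahia"),
  stringsAsFactors = FALSE
)

# Third element of the locality string; NA when the string stops before it
from.str <- vapply(strsplit(recs$loc.correct, "_", fixed = TRUE),
                   function(p) if (length(p) >= 3) p[3] else NA_character_,
                   character(1))

# Fill only the records whose collection locality is missing
col.loc <- ifelse(is.na(recs$municipality.new), from.str, recs$municipality.new)
```

In this sketch the second record gains a locality from the string, while the third stays missing because its string carries no municipality.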
The duplicate search strings are created by combining the fields listed
above. Each combination of those fields (e.g. 'col.name' and 'col.number')
should be provided to the argument comb.fields as a vector within a list.
The number of strings to be generated will correspond to the number of
vectors in this list. The order of the fields within vectors does not
change the duplicate search process.
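The combination step can be illustrated with a small base R sketch. It is not the function's implementation: the "_" separator, the helper `make.string()` and the way incomplete strings are set to NA are assumptions made for illustration, and the records are made up.

```r
# Made-up records using the default plantR column names
recs <- data.frame(
  numTombo = c("sp_1", "rb_2", "ny_3"),
  family.new = c("Myrtaceae", "Myrtaceae", "Lauraceae"),
  last.name = c("lima", "lima", "silva"),
  recordNumber.new = c("1234", "1234", "s.n."),
  year.new = c("2010", "2010", "s.d."),
  municipality.new = c("campinas", "campinas", "curitiba"),
  stringsAsFactors = FALSE
)

col.names <- c(family = "family.new", col.last.name = "last.name",
               col.number = "recordNumber.new", col.year = "year.new",
               col.loc = "municipality.new")
comb.fields <- list(c("family", "col.last.name", "col.number", "col.loc"),
                    c("col.year", "col.last.name", "col.number", "col.loc"))
missing.codes <- c("s.d.", "s.n.")

# Hypothetical helper: paste the mapped columns row by row (separator assumed)
make.string <- function(fields, ignore.miss = TRUE) {
  cols <- unname(as.list(recs[col.names[fields]]))
  out <- do.call(paste, c(cols, sep = "_"))
  if (ignore.miss) {
    # Flag rows where any combined field is NA or a missing-data code
    incomplete <- Reduce(`|`, lapply(cols, function(v) is.na(v) | v %in% missing.codes))
    out[incomplete] <- NA_character_
  }
  out
}

# One vector of search strings per combination in the list
dup.strs <- lapply(comb.fields, make.string)
```

Here the list holds two combinations, so two sets of strings are produced; the first two records share the same string in both sets, and the third is left out because its collector number and year are missing.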
The argument rec.ID should indicate the column name in the input data
containing the unique record identifier, which in the plantR workflow
is obtained using the function getTombo(). If only GBIF data is used,
this column could be the field 'gbifID'. This identifier is used to
indicate the groups of duplicated records, which is one of the outputs of
function getDup() and is used to homogenize information within the groups
of duplicates (function mergeDup()).
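To show why a unique identifier is needed, the sketch below groups made-up record identifiers by a shared search string using base R. The column `dup.str` and its values are hypothetical, and the actual output format of getDup() is not reproduced here.

```r
# Made-up records: a unique identifier plus an already-built search string
recs <- data.frame(
  gbifID = c("101", "102", "103", "104"),
  dup.str = c("a_1", "a_1", "b_2", NA),
  stringsAsFactors = FALSE
)

# Collect the identifiers sharing each string (records with NA strings are dropped)
groups <- split(recs$gbifID, recs$dup.str)

# Keep only the strings shared by two or more records
groups <- groups[lengths(groups) > 1]
```

Each remaining element lists the identifiers of one candidate group of duplicates, which only works if the identifier is unique per record.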
Please note that the retrieval of duplicates greatly depends on the completeness of the input information and on the number of differences in notation standards among collections. In addition, the fewer the fields combined to create the duplicate search strings, the higher the number of (true and false) duplicates retrieved.
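The effect of the number of combined fields can be seen in a small base R sketch with made-up records (the "_" separator is an assumption for illustration). The third record shares collector and number with the others but differs in year and locality:

```r
# Made-up records: same collector and number, but one differs in year and place
recs <- data.frame(
  last.name = c("lima", "lima", "lima"),
  recordNumber.new = c("100", "100", "100"),
  year.new = c("2010", "2010", "1998"),
  municipality.new = c("campinas", "campinas", "manaus"),
  stringsAsFactors = FALSE
)

# Short string: collector last name and number only
short.str <- paste(recs$last.name, recs$recordNumber.new, sep = "_")

# Long string: year, collector last name, number and locality
long.str <- paste(recs$year.new, recs$last.name, recs$recordNumber.new,
                  recs$municipality.new, sep = "_")

# Count the records sharing their string with at least one other record
n.short <- sum(duplicated(short.str) | duplicated(short.str, fromLast = TRUE))
n.long <- sum(duplicated(long.str) | duplicated(long.str, fromLast = TRUE))
```

With the short string all three records are flagged as candidate duplicates, whereas the long string flags only the first two.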
Renato A. F. de Lima
getTombo, getDup and mergeDup.