DataClean | R Documentation |
DataClean cleans the data in a character vector according to the conditions in the arguments.
DataClean(
x,
fix.comma = TRUE,
fix.semcol = TRUE,
fix.col = TRUE,
fix.bracket = TRUE,
fix.punct = TRUE,
fix.space = TRUE,
fix.sep = TRUE,
fix.leadzero = TRUE
)
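x | A character vector. If not, it is coerced to character.
fix.comma | logical. If TRUE, all commas are replaced by a space.
fix.semcol | logical. If TRUE, all semicolons are replaced by a space.
fix.col | logical. If TRUE, all colons are replaced by a space.
fix.bracket | logical. If TRUE, all brackets are replaced by a space.
fix.punct | logical. If TRUE, all punctuation is removed.
fix.space | logical. If TRUE, all space characters are converted to spaces and multiple spaces are collapsed to a single space.
fix.sep | logical. If TRUE, alphabetic characters separated from a series of digits by a space are merged together.
fix.leadzero | logical. If TRUE, leading zeros are removed.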
This function aids in standardization and preparation of the PGR passport data for creation of a KWIC index with the KWIC function and for the identification of probable duplicate accessions by the ProbDup function. It cleans the character strings in the passport data fields (columns) specified as the input character vector x according to the conditions in the arguments, applied in the same order as the arguments. If the input vector x is not of type character, it is coerced to a character vector.
This function is designed particularly for use with fields corresponding to accession names such as accession ids, collection numbers, accession names etc. It is essentially a wrapper around the gsub base function with regex arguments. It also converts all strings to upper case and removes leading and trailing spaces.
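The case conversion and trimming can be sketched in base R as follows. This is a minimal illustration, not the package's own code; the helper name to_upper_trimmed is hypothetical.

```r
# Hypothetical helper (not part of the package): upper-case and trim a vector.
to_upper_trimmed <- function(x) {
  x <- as.character(x)          # coerce non-character input
  x <- toupper(x)               # convert to upper case
  gsub("^\\s+|\\s+$", "", x)    # strip leading and trailing whitespace
}

to_upper_trimmed(c("  icg 3410 ", "Ah 6481"))
```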
Commas, semicolons and colons, which are sometimes used to separate multiple strings or names within the same field, can be replaced with a single space using the logical arguments fix.comma, fix.semcol and fix.col respectively.
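A rough sketch of this kind of separator replacement with gsub is shown below. The function replace_separators is a hypothetical stand-in that borrows the argument names for readability; the patterns used inside DataClean itself may differ.

```r
# Hypothetical helper: replace each enabled separator with a single space.
replace_separators <- function(x, fix.comma = TRUE, fix.semcol = TRUE,
                               fix.col = TRUE) {
  x <- as.character(x)
  if (fix.comma)  x <- gsub(",", " ", x, fixed = TRUE)  # commas
  if (fix.semcol) x <- gsub(";", " ", x, fixed = TRUE)  # semicolons
  if (fix.col)    x <- gsub(":", " ", x, fixed = TRUE)  # colons
  x
}

replace_separators(c("TG4;U/4/47/13", "A:B,C"))
replace_separators("A:B,C", fix.col = FALSE)  # the colon is left untouched
```

Each separator becomes exactly one space, so a comma followed by a space leaves two spaces behind; collapsing those is a separate whitespace step.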
Similarly, the logical argument fix.bracket can be used to replace all brackets, including parentheses, square brackets and curly brackets, with a space.
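One way to express the bracket replacement is a single character class, sketched here with perl = TRUE so the square brackets can be escaped inside the class. The name replace_brackets is hypothetical and the pattern is an assumption, not taken from the package.

```r
# Hypothetical helper: replace (, ), [, ], { and } with a space each.
replace_brackets <- function(x) {
  gsub("[\\[\\](){}]", " ", as.character(x), perl = TRUE)
}

replace_brackets(c("2-5 (NRCG-4053)", "A[1]{2}(3)"))
```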
The logical argument fix.punct can be used to remove all punctuation from the data.
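Punctuation removal can be approximated with the POSIX class [[:punct:]], as in the sketch below. The helper remove_punct is hypothetical; it only illustrates the effect on identifiers that use hyphens or slashes.

```r
# Hypothetical helper: delete every punctuation character.
remove_punct <- function(x) {
  gsub("[[:punct:]]", "", as.character(x))
}

remove_punct(c("S7-12-6", "U/4/47/13", "#648-4 (Gwalior)"))
```

Because the characters are deleted rather than replaced by a space, digit groups that were separated only by hyphens or slashes are joined together.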
fix.space can be used to convert all space characters such as tab, newline, vertical tab, form feed and carriage return to spaces, and finally to convert multiple spaces to a single space.
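Both steps can be combined in one substitution, since a run of any whitespace characters can be replaced by a single space. The sketch below assumes the POSIX class [[:space:]]; squeeze_space is a hypothetical name.

```r
# Hypothetical helper: turn any run of whitespace into one space.
squeeze_space <- function(x) {
  gsub("[[:space:]]+", " ", as.character(x))
}

squeeze_space(c("A\tB\n\nC  D", "EC\r\n21127"))
```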
fix.sep can be used to merge together accession identifiers composed of alphabetic characters separated from a series of digits by a space character. For example, IR 64 and PUSA 256 become IR64 and PUSA256.
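A possible regular expression for this merge uses two capture groups and drops the whitespace between them. This is a sketch under that assumption; merge_alpha_digit is a hypothetical helper and may not match the package's pattern exactly.

```r
# Hypothetical helper: join letters and the digits that follow them after
# a space, e.g. "IR 64" becomes "IR64".
merge_alpha_digit <- function(x) {
  gsub("([[:alpha:]]+)\\s+([[:digit:]]+)", "\\1\\2", as.character(x))
}

merge_alpha_digit(c("IR 64", "PUSA 256", "T78 MWITUNDE"))
```

Strings in which a space separates digits from following letters, such as the last element, are left unchanged.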
fix.leadzero can be used to remove leading zeros from accession name fields to facilitate matching to identify probable duplicates, for example IR0064 -> IR64.
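Leading zeros can be stripped with a Perl-style pattern that matches zeros only at the start of a digit run and only when another digit follows. The sketch below is an assumption about how such a rule could be written; strip_leading_zeros is a hypothetical name.

```r
# Hypothetical helper: drop zeros that begin a run of digits, keeping at
# least one digit so that a lone "0" survives.
strip_leading_zeros <- function(x) {
  gsub("(?<![0-9])0+(?=[0-9])", "", as.character(x), perl = TRUE)
}

strip_leading_zeros(c("IR0064", "EC0021003", "ICG3410", "A000"))
```

The lookbehind keeps zeros inside a number (as in 21003) intact, and the lookahead prevents an all-zero run from disappearing entirely.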
A character vector with the cleaned data converted to upper case. NAs, if any, are converted to blank strings.
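The NA handling described here corresponds to a simple replacement in base R. The helper blank_na below is hypothetical and only illustrates the behaviour of the returned value.

```r
# Hypothetical helper: coerce to character and replace NA with "".
blank_na <- function(x) {
  x <- as.character(x)
  x[is.na(x)] <- ""
  x
}

blank_na(c("IR64", NA))
```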
gsub, regex, MergeKW, KWIC, ProbDup
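# Accession names with mixed separators, brackets, punctuation and leading zeros
names <- c("S7-12-6", "ICG-3505", "U 4-47-18;EC 21127", "AH 6481", "RS 1",
           "AK 12-24", "2-5 (NRCG-4053)", "T78, Mwitunde", "ICG 3410",
           "#648-4 (Gwalior)", "TG4;U/4/47/13", "EC0021003")
# Clean with all default options
DataClean(names)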