View source: R/subNonStandardCharacters.R
subNonStandardCharacters | R Documentation |
First convert to ASCII, stripping standard accents and special characters. Then find the first and last character not in standardCharacters and replace everything from the first through the last such character with replacement. For example, a string like "Ruben" where "e" carries an accent and is mangled by some software would become something like "Rub_n" using the default values for standardCharacters and replacement.
subNonStandardCharacters(x,
standardCharacters=c(letters, LETTERS,
' ','.', '?', '!', ',', 0:9, '/', '*',
'$', '%', '\"', "\'", '-', '+', '&',
'_', ';', '(', ')', '[', ']', '\n'),
replacement='_',
gsubList=list(list(pattern =
'\\\\\\\\|\\\\',
replacement='\"')), ... )
x | character vector in which it is desired to find the first and last character not in standardCharacters |
standardCharacters | a character vector of acceptable characters to keep. |
replacement | a character to replace the substring starting and ending with characters not in standardCharacters |
gsubList | list of lists of pattern and replacement arguments, applied in succession with gsub |
... | optional arguments passed to strsplit |
1. for(il in 1:length(gsubList))
     x <- gsub(gsubList[[il]][["pattern"]], gsubList[[il]][["replacement"]], x)
2. x <- stringi::stri_trans_general(x, "Latin-ASCII")
3. nx <- length(x)
4. x. <- strsplit(x, "", ...)
5. for(ix in 1:nx) find the first and last characters not in standardCharacters in x.[[ix]] and substitute replacement for everything from the first through the last.
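Steps 3 to 5 can be sketched in base R as follows. This is a minimal illustration, not the package's actual source: the function name replaceNonStandardSpan is hypothetical, steps 1 and 2 (gsubList and stri_trans_general) are omitted, and leaving NA and empty strings untouched is an assumption based on the examples in this page. It follows the rule stated in the description: locate the first and last character not in standardCharacters and replace that whole span with replacement.

```r
# Hypothetical helper illustrating steps 3-5 only; NOT the Ecfun source.
replaceNonStandardSpan <- function(x,
    standardCharacters = c(letters, LETTERS, ' ', '.', '?', '!', ',', 0:9,
                           '/', '*', '$', '%', '"', "'", '-', '+', '&',
                           '_', ';', '(', ')', '[', ']', '\n'),
    replacement = '_') {
  chars <- strsplit(x, "")                # step 4: split into single characters
  for (ix in seq_along(x)) {              # step 5: process one string at a time
    # Assumption: NA and empty strings are returned unchanged.
    if (is.na(x[ix]) || !nzchar(x[ix])) next
    bad <- which(!(chars[[ix]] %in% standardCharacters))
    if (length(bad) == 0) next            # nothing to replace
    first <- bad[1]
    last <- bad[length(bad)]
    # Keep the text before the first and after the last non-standard
    # character; everything from first through last becomes `replacement`.
    x[ix] <- paste0(substr(x[ix], 1, first - 1), replacement,
                    substr(x[ix], last + 1, nchar(x[ix])))
  }
  x
}

sketchOut <- replaceNonStandardSpan(c('Ra`l', 'Ra`', '`l', 'Torres, Raul',
                                      NA, '', 'a`b`c'))
print(sketchOut)
```

Note that a string with several non-standard characters, such as 'a`b`c', collapses to a single replacement ('a_c'), because the whole span between the first and last offending character is substituted.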
NOTES:
** To find the elements of x that have changed, use either subNonStandardCharacters(x) != x or grep(replacement, subNonStandardCharacters(x)), where replacement is the replacement argument ("_" by default).
** On 13 May 2013 Jeff Newmiller at the University of California, Davis, wrote, 'I think it is a fools errand to think that you can automatically "normalize" arbitrary Unicode characters to an ASCII form that everyone will agree on.' (This was a reply on r-help@r-project.org, subject: "Re: [R] Matching names with non-English characters".)
** On 2014-12-15 Ista Zahn suggested stri_trans_general. (This was a reply on r-help@r-project.org, subject: "[R] Comparing Latin characters with and without accents?".)
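The two ways of finding changed elements mentioned in the first note behave a little differently, which the following sketch tries to show. To keep it runnable without the package, the vectors before and after are written by hand: after is what the documented behavior would give for before with the default arguments (the first three values come from the examples in this page; 'snake_case' is an added assumption, unchanged because all its characters are in the default standardCharacters).

```r
# Hand-written stand-ins for x and subNonStandardCharacters(x);
# `after` is assumed, not computed by the package here.
before <- c('Ra`l', 'Torres, Raul', NA, 'snake_case')
after  <- c('Ra_l', 'Torres, Raul', NA, 'snake_case')

# Comparison: NA != NA is NA, so wrap in which() to get clean indices.
changed <- which(after != before)
print(changed)

# grep on the replacement: also flags strings that already contained "_",
# because "_" is itself one of the default standardCharacters.
flagged <- grep('_', after, fixed = TRUE)
print(flagged)
```

The comparison reports only element 1, while grep also reports element 4, so the comparison is probably the safer choice whenever the input may legitimately contain the replacement character.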
a character vector with everything between the first and last character not in standardCharacters replaced by replacement.
Spencer Graves with thanks to Jeff Newmiller, who described this as a "fool's errand", Milan Bouchet-Valat, who directed me to iconv, and Ista Zahn, who suggested stri_trans_general.
sub, strsplit, grepNonStandardCharacters, subNonStandardNames
iconv in the base package does some conversion, but is not consistent across platforms, at least using R 3.1.2 on 2015-01-25. stri_trans_general seems better.
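The difference between the two transliteration routes can be checked with a short comparison like the one below. It is only a sketch: the iconv result depends on the platform (it may contain "?" or NA where no transliteration is available), so no expected output is given for it, and the stringi call runs only if that package is installed. The variable names are chosen for illustration.

```r
# Accented names taken from the iconv-based example in this page.
x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher")

# Base R: the result varies across platforms and iconv implementations.
viaIconv <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")
print(viaIconv)

# stringi: ICU-based transliteration, independent of the system iconv.
if (requireNamespace("stringi", quietly = TRUE)) {
  viaStringi <- stringi::stri_trans_general(x, "Latin-ASCII")
  print(viaStringi)
}
```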
##
## 1. Consider Names = Ruben, Avila and Jose, where
## "e" and "A" in these examples carry an accent.
## With the default values for standardCharacters and
## replacement, these might be converted to something
## like Rub_n, _vila, and Jos_, with different software
## possibly mangling the names differently. (The
## standard checks for R packages in an English locale
## complain about non-ASCII characters, because they
## are not portable.)
##
nonstdNames <- c('Ra`l', 'Ra`', '`l', 'Torres, Raul',
"Robert C. \\Bobby\\\\", NA, '', ' ',
'$12', '12%')
# confusion in character sets can create
# names like Names[2]
Name2 <- subNonStandardCharacters(nonstdNames)
str(Name2)
# check
Name2. <- c('Ra_l', 'Ra_', '_l', nonstdNames[4],
'Robert C. "Bobby"', NA, '', ' ',
'$12', '12%')
str(Name2.)
all.equal(Name2, Name2.)
##
## 2. Example from iconv
##
icx <- c("Ekstr\u{f8}m", "J\u{f6}reskog",
"bi\u{df}chen Z\u{fc}rcher")
icx2 <- subNonStandardCharacters(icx)
# check
icx. <- c('Ekstrom', 'Joreskog', 'bisschen Zurcher')
all.equal(icx2, icx.)