subNonStandardCharacters: sub nonstandard characters with replacement
In Ecfun: Functions for Ecdat

First convert to ASCII, stripping standard accents and special characters. Then find the first and last character not in standardCharacters and replace all between them with replacement. For example, a string like "Ruben" where "e" carries an accent and is mangled by some software would become something like "Rub_n" using the default values for standardCharacters and replacement.

subNonStandardCharacters(x,
   standardCharacters=c(letters, LETTERS, 
      ' ','.', '?', '!', ',', 0:9, '/', '*', 
      '$', '%', '\"', "\'", '-', '+', '&', 
      '_', ';', '(', ')', '[', ']', '\n'),
   replacement='_',
   gsubList=list(list(pattern = 
      '\\\\\\\\|\\\\',
      replacement='\"')), ... )

`x`	character vector in which to find the first and last character not in `standardCharacters` and replace the substring they delimit with `replacement`.
`standardCharacters`	a character vector of acceptable characters to keep.
`replacement`	a character string that replaces the substring starting and ending with characters not in `standardCharacters`.
`gsubList`	list of lists of `pattern` and `replacement` arguments, passed to `gsub` in succession before looking for nonstandard characters.
`...`	optional arguments passed to `strsplit`

1. for(il in seq_along(gsubList)) x <- gsub(gsubList[[il]][["pattern"]], gsubList[[il]][["replacement"]], x)

2. x <- stringi::stri_trans_general(x, "Latin-ASCII")

3. nx <- length(x)

4. x. <- strsplit(x, "", ...)

5. for(ix in seq_len(nx)) find the first and last characters in x.[[ix]] that are not in standardCharacters and substitute replacement for everything from the first through the last.
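Steps 3 to 5 above can be sketched in base R as follows. This is a minimal illustration, not the Ecfun source: the function name `sub_between_nonstandard` is hypothetical, the `standardCharacters` default is shortened, and the `gsub` and `stri_trans_general` steps are left out so the block has no dependencies.

```r
# Hypothetical helper illustrating steps 3-5 only; NOT the Ecfun implementation.
sub_between_nonstandard <- function(x,
    standardCharacters = c(letters, LETTERS, ' ', '.', ',', 0:9, '-', '_'),
    replacement = '_') {
  chars <- strsplit(x, "")                 # one vector of single characters per element
  for (ix in seq_along(x)) {               # seq_along() is safe when length(x) == 0
    if (is.na(x[ix]) || !nzchar(x[ix])) next   # leave NA and "" untouched
    bad <- which(!(chars[[ix]] %in% standardCharacters))
    if (length(bad) > 0) {
      first <- bad[1]
      last  <- bad[length(bad)]
      n     <- length(chars[[ix]])
      head_part <- paste(chars[[ix]][seq_len(first - 1)], collapse = "")
      tail_part <- paste(chars[[ix]][seq_len(n - last) + last], collapse = "")
      # everything from the first through the last nonstandard character
      # collapses to a single replacement
      x[ix] <- paste0(head_part, replacement, tail_part)
    }
  }
  x
}

sketch_out <- sub_between_nonstandard(c("Ra`l", "Ra`", "`l", "a~b~c", NA, ""))
print(sketch_out)
```

Note that "a~b~c" becomes "a_c" under this sketch: the standard character "b" is also removed because it lies between the first and last nonstandard characters.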

NOTES:

** To find the elements of x that have changed, use either subNonStandardCharacters(x) != x or grep(replacement, subNonStandardCharacters(x)), where replacement is the value of the replacement argument ("_" by default).

** On 13 May 2013 Jeff Newmiller at the University of California, Davis, wrote, 'I think it is a fools errand to think that you can automatically "normalize" arbitrary Unicode characters to an ASCII form that everyone will agree on.' (This was a reply on r-help@r-project.org, subject: "Re: [R] Matching names with non-English characters".)

** On 2014-12-15 Ista Zahn suggested stri_trans_general. (This was a reply on r-help@r-project.org, subject: "[R] Comparing Latin characters with and without accents?".)
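The two ways of finding changed elements mentioned in the first note do not always agree. The sketch below uses hand-written `before` and `after` vectors rather than calling the function; the `after` values are an assumption about what subNonStandardCharacters would return with its defaults.

```r
# 'after' is an ASSUMED result of subNonStandardCharacters(before),
# written by hand so this block runs without Ecfun installed.
before <- c('Ra`l', 'Torres, Raul', 'snake_case', NA)
after  <- c('Ra_l', 'Torres, Raul', 'snake_case', NA)

# Comparison: NA != NA is NA, and which() silently drops it
changed <- which(after != before)
print(changed)

# Pattern search: also flags elements that already contained "_",
# because "_" is itself one of the default standardCharacters
flagged <- grep('_', after, fixed = TRUE)
print(flagged)
```

The comparison reports only element 1, while the `grep` approach also reports element 3, which was never changed. The comparison is therefore the safer choice whenever the input may already contain the replacement character.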

a character vector with everything between the first and last character not in standardCharacters replaced by replacement.

Spencer Graves with thanks to Jeff Newmiller, who described this as a "fool's errand", Milan Bouchet-Valat, who directed me to iconv, and Ista Zahn, who suggested stri_trans_general.

sub, strsplit, grepNonStandardCharacters, subNonStandardNames. iconv in the base package does some conversion, but its results are not consistent across platforms, at least using R 3.1.2 on 2015-01-25. stri_trans_general seems better.
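To compare the two conversion routes on your own platform, something like the following can be run. It is only a sketch: the `requireNamespace` guard is an addition so the block still runs without stringi, and no output is shown for `iconv` because that output is exactly what varies between platforms.

```r
# Same names as in the iconv example, written with Unicode escapes
x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher")

# Base R: the //TRANSLIT result depends on the platform's iconv implementation
print(iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT"))

# stringi: the ICU "Latin-ASCII" transliterator used in step 2 of the details
if (requireNamespace("stringi", quietly = TRUE)) {
  ascii <- stringi::stri_trans_general(x, "Latin-ASCII")
  print(ascii)
}
```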

##
## 1. Consider Names = Ruben, Avila and Jose, where "e" and "A" in
##    these examples carry an accent.  With the default values
##    for standardCharacters and replacement, these might be 
##    converted to something like Rub_n, _vila, and Jos_, with 
##    different software possibly mangling the names differently.  
##    (The standard checks for R packages in an English locale 
##    complain about non-ASCII characters, because they are 
##    not portable.)
##
nonstdNames <- c('Ra`l', 'Ra`', '`l', 'Torres, Raul',
           "Robert C. \\Bobby\\\\", NA, '', '  ', 
           '$12', '12%')




#  confusion in character sets can create
#  names like nonstdNames[2]
Name2 <- subNonStandardCharacters(nonstdNames)
str(Name2)

# check 
Name2. <- c('Ra_l', 'Ra_', '_l', nonstdNames[4],
            'Robert C. "Bobby"', NA, '', '  ', 
            '$12', '12%')
str(Name2.)

all.equal(Name2, Name2.)

##
## 2.  Example from iconv
##
icx <- c("Ekstr\xf8m", "J\xf6reskog", 
         "bi\xdfchen Z\xfcrcher")
icx2 <- subNonStandardCharacters(icx)

# check 
icx. <- c('Ekstrom', 'Joreskog', 'bisschen Zurcher')

all.equal(icx2, icx.)