| validUTF8 | R Documentation |
Check if each element of a character vector is valid in its implied encoding.
validUTF8(x)
validEnc(x)
| x | a character vector. |
These use similar checks to those used by functions such as
grep.
validUTF8 ignores any marked encoding (see
Encoding) and so checks directly whether the bytes in each
string are valid UTF-8. (For the validity of ‘noncharacters’
see the help for intToUtf8.)
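As a minimal sketch of what ignoring the marked encoding means in practice, the snippet below compares a Latin-1 string with its UTF-8 counterpart. The variable names and the sample word are illustrative assumptions, not part of the original help page.

```r
## "facile" with a c-cedilla, as Latin-1 bytes: the cedilla is the single byte 0xE7.
latin1 <- "fa\xe7ile"
Encoding(latin1) <- "latin1"   # the mark is correct, but validUTF8() does not consult it

## The same word as UTF-8 bytes: the cedilla is the two bytes 0xC3 0xA7.
utf8 <- "fa\xc3\xa7ile"

## Only the raw bytes are inspected: 0xE7 followed by "i" is not a valid
## UTF-8 sequence, so the first element is reported as invalid.
res <- validUTF8(c(latin1, utf8))
print(res)
```

A FALSE result here therefore says only that the bytes are not UTF-8, not that the string is wrongly encoded for its declared encoding.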
validEnc regards character strings as validly encoded unless
their encodings are marked as UTF-8 or they are unmarked and the R
session is in a UTF-8 or other multi-byte locale. (The checks in
other multi-byte locales depend on the OS and, as with
iconv, not all invalid inputs may be detected.)
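The rule above can be sketched by giving the same bytes two different encoding marks. This is an illustration under the stated rule rather than a definitive test; the names bad, as_utf8 and as_latin1 are hypothetical.

```r
## Bytes that are not valid UTF-8 (taken from the Examples on this page).
bad <- "\xfa\xb4\xbf\xbf\x9f"

as_utf8 <- bad
Encoding(as_utf8) <- "UTF-8"     # marked UTF-8: the bytes get checked

as_latin1 <- bad
Encoding(as_latin1) <- "latin1"  # marked Latin-1: regarded as validly encoded

marked <- validEnc(c(as_utf8, as_latin1))
print(marked)

## An unmarked copy is checked only in a UTF-8 or other multi-byte locale,
## so this result depends on the session.
print(validEnc(bad))
```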
A logical vector of the same length as x. NA elements
are regarded as validly encoded.
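A short sketch of the return value, assuming a small hypothetical input vector: the result has one logical per element, and the NA element counts as valid.

```r
## Hypothetical input: a plain ASCII string, a missing value, and a lone 0xFF byte.
x <- c("abc", NA_character_, "\xff")

valid <- validUTF8(x)
print(valid)

## The result lines up with the input element by element,
## so it can be used directly as an index.
print(which(!valid))
```

Because NA elements are reported as valid, missing values need a separate is.na() check if they matter to the caller.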
It would be possible to check for the validity of character strings in a Latin-1 encoding, but extensions such as CP1252 are widely accepted as ‘Latin-1’ and 8-bit encodings rarely need to be checked for validity.
x <-
    ## from example(text)
    c("Jetz", "no", "chli", "z\xc3\xbcrit\xc3\xbc\xc3\xbctsch:",
      "(noch", "ein", "bi\xc3\x9fchen", "Z\xc3\xbc", "deutsch)",
      ## from a CRAN check log
      "\xfa\xb4\xbf\xbf\x9f")
validUTF8(x)
validEnc(x) # depends on the locale
Encoding(x) <- "UTF-8"
validEnc(x) # typically the last, x[10], is invalid
## Maybe advantageous to declare it "unknown":
G <- x ; Encoding(G[!validEnc(G)]) <- "unknown"
try( substr(x, 1,1) ) # gives 'invalid multibyte string' error in a UTF-8 locale
try( substr(G, 1,1) ) # works in a UTF-8 locale
nchar(G) # fine, too
## but it is not "more valid" typically:
all.equal(validEnc(x),
validEnc(G)) # typically TRUE