utf8 is an R package for manipulating and printing UTF-8 text that fixes multiple bugs in R's UTF-8 handling.
utf8 is available on CRAN. To install the latest released version, run the following command in R:
install.packages("utf8")
To install the latest development version, run the following:
devtools::install_github("patperry/r-utf8")
library(utf8)
Use as_utf8() to validate input text and convert to UTF-8 encoding. The function alerts you if the input text has the wrong declared encoding:
# second entry is encoded in latin-1, but declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")
as_utf8(x) # fails

# mark the correct encoding
Encoding(x[2]) <- "latin1"
as_utf8(x) # succeeds
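When the input comes from an untrusted source, it can help to locate the offending entries before converting. The following is a minimal sketch, assuming the package's utf8_valid() function is used alongside as_utf8(); the variable names ok and result are illustrative only.

```r
library(utf8)

# same input as in the example above: the second entry holds latin-1 bytes
# but is declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")

# utf8_valid() checks each element without raising an error
ok <- utf8_valid(x)

# as_utf8() signals an error on invalid input, so wrap the call
result <- tryCatch(as_utf8(x), error = function(e) NULL)

if (is.null(result)) {
  message("invalid entries at positions: ", paste(which(!ok), collapse = ", "))
}
```

Once the positions are known, the declared encoding of those entries can be corrected with Encoding() and the conversion retried.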
Use utf8_normalize() to convert to Unicode composed normal form (NFC). Optionally apply compatibility maps for NFKC normal form or case-fold.
# three ways to encode an angstrom character
(angstrom <- c("\u00c5", "\u0041\u030a", "\u212b"))
utf8_normalize(angstrom) == "\u00c5"

# perform full Unicode case-folding
utf8_normalize("Größe", map_case = TRUE)

# apply compatibility maps to NFKC normal form
# (example from https://twitter.com/aprilarcus/status/367557195186970624)
utf8_normalize("𝖸𝗈 𝐔𝐧𝐢𝐜𝐨𝐝𝐞 𝗅 𝗁𝖾𝗋𝖽 𝕌 𝗅𝗂𝗄𝖾 𝑡𝑦𝑝𝑒𝑓𝑎𝑐𝑒𝑠 𝗌𝗈 𝗐𝖾 𝗉𝗎𝗍 𝗌𝗈𝗆𝖾 𝚌𝚘𝚍𝚎𝚙𝚘𝚒𝚗𝚝𝚜 𝗂𝗇 𝗒𝗈𝗎𝗋 𝔖𝔲𝔭𝔭𝔩𝔢𝔪𝔢𝔫𝔱𝔞𝔯𝔶 𝔚𝔲𝔩𝔱𝔦𝔩𝔦𝔫𝔤𝔳𝔞𝔩 𝔓𝔩𝔞𝔫𝔢 𝗌𝗈 𝗒𝗈𝗎 𝖼𝖺𝗇 𝓮𝓷𝓬𝓸𝓭𝓮 𝕗𝕠𝕟𝕥𝕤 𝗂𝗇 𝗒𝗈𝗎𝗋 𝒇𝒐𝒏𝒕𝒔.", map_compat = TRUE)
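Normalization and case-folding are typically combined when comparing text that may differ only in encoding form or letter case. The sketch below assumes a hypothetical helper, same_text(), which is not part of the utf8 package; it only wraps utf8_normalize() with map_case = TRUE.

```r
library(utf8)

# Hypothetical helper (not part of the utf8 package): compare two strings
# after NFC normalization and full case-folding.
same_text <- function(a, b) {
  utf8_normalize(a, map_case = TRUE) == utf8_normalize(b, map_case = TRUE)
}

# composed vs. decomposed spellings of the same word
same_text("\u00c5ngstr\u00f6m", "\u0041\u030angstro\u0308m")

# full case-folding maps the sharp s to "ss", so these should compare equal
same_text("Gr\u00f6\u00dfe", "GR\u00d6SSE")
```

Comparing the raw strings with == would treat each of these pairs as different, because the underlying code point sequences differ.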
On some platforms (including macOS), R's implementation of print() uses an outdated version of the Unicode standard to determine which characters are printable. Use utf8_print() for an updated print function:
print(intToUtf8(0x1F600 + 0:79)) # with default R print function
utf8_print(intToUtf8(0x1F600 + 0:79)) # with utf8_print, truncates line
utf8_print(intToUtf8(0x1F600 + 0:79), chars = 1000) # higher character limit
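The calls above print a single long string. As a complementary sketch, the same code points can be split into a character vector with one element per emoji, so that each is printed as a separate entry. This assumes utf8_print() accepts a quote argument in the same way as base print(); the variable name emoji is illustrative.

```r
library(utf8)

# one string per code point instead of a single 10-character string
emoji <- intToUtf8(0x1F600 + 0:9, multiple = TRUE)

# print each element as its own entry, without surrounding quotes
utf8_print(emoji, quote = FALSE)
```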
Cite utf8 with the following BibTeX entry:
print(suppressWarnings(citation("utf8")), "Bibtex")
The project maintainer welcomes contributions in the form of feature requests, bug reports, comments, unit tests, vignettes, or other code. If you'd like to contribute, either fork the repository and submit a pull request or contact the maintainer via e-mail.
This project is released with a Contributor Code of Conduct, and if you choose to contribute, you must adhere to its terms.