utf8

utf8 is an R package for manipulating and printing UTF-8 text that fixes multiple bugs in R's UTF-8 handling.

Installation

Stable version

utf8 is available on CRAN. To install the latest released version, run the following command in R:

install.packages("utf8")

Development version

To install the latest development version, run the following:

devtools::install_github("patperry/r-utf8")

Usage

library(utf8)

Validate character data and convert to UTF-8

Use as_utf8() to validate input text and convert it to UTF-8 encoding. The function signals an error if any entry's declared encoding does not match its actual bytes:

# second entry is encoded in latin-1, but declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")
as_utf8(x) # fails

# mark the correct encoding
Encoding(x[2]) <- "latin1"
as_utf8(x) # succeeds
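
When a script should not stop at the first bad entry, it can help to locate the offending elements before converting. The following is a minimal sketch, assuming utf8_valid() returns one logical value per element indicating whether that element can be translated to valid UTF-8; the variable names valid, bad, and result are illustrative only.

```r
library(utf8)

# same input as above: the second entry is latin-1 but declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")

# check each element without raising an error
valid <- utf8_valid(x)
bad <- which(!valid)

# alternatively, capture the error message from as_utf8() instead of halting
result <- tryCatch(as_utf8(x), error = function(e) conditionMessage(e))

# repair the declared encoding of the flagged entries, then convert;
# latin-1 is an assumption here, since in practice the true encoding
# has to be known from the data source
Encoding(x[bad]) <- "latin1"
y <- as_utf8(x)
```

Checking validity first keeps the repair step explicit: the declared encoding is changed only for the entries that were actually flagged.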

Normalize data

Use utf8_normalize() to convert to Unicode composed normal form (NFC). Optionally, apply compatibility maps to get NFKC normal form (map_compat = TRUE) or perform full case folding (map_case = TRUE).

# three ways to encode an angstrom character
(angstrom <- c("\u00c5", "\u0041\u030a", "\u212b"))
utf8_normalize(angstrom) == "\u00c5"

# perform full Unicode case-folding
utf8_normalize("Größe", map_case = TRUE)

# apply compatibility maps to NFKC normal form
# (example from https://twitter.com/aprilarcus/status/367557195186970624)
utf8_normalize("𝖸𝗈 𝐔𝐧𝐢𝐜𝐨𝐝𝐞 𝗅 𝗁𝖾𝗋𝖽 𝕌 𝗅𝗂𝗄𝖾 𝑡𝑦𝑝𝑒𝑓𝑎𝑐𝑒𝑠 𝗌𝗈 𝗐𝖾 𝗉𝗎𝗍 𝗌𝗈𝗆𝖾 𝚌𝚘𝚍𝚎𝚙𝚘𝚒𝚗𝚝𝚜 𝗂𝗇 𝗒𝗈𝗎𝗋 𝔖𝔲𝔭𝔭𝔩𝔢𝔪𝔢𝔫𝔱𝔞𝔯𝔶 𝔚𝔲𝔩𝔱𝔦𝔩𝔦𝔫𝔤𝔳𝔞𝔩 𝔓𝔩𝔞𝔫𝔢 𝗌𝗈 𝗒𝗈𝗎 𝖼𝖺𝗇 𝓮𝓷𝓬𝓸𝓭𝓮 𝕗𝕠𝕟𝕥𝕤 𝗂𝗇 𝗒𝗈𝗎𝗋 𝒇𝒐𝒏𝒕𝒔.",
               map_compat = TRUE)
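
One practical use of normalization is making visually identical strings compare equal, for example before deduplicating or matching keys. The sketch below is illustrative only: the variable names are made up for this example, and it relies solely on utf8_normalize() with the map_case argument shown above.

```r
library(utf8)

# three encodings of the same visible character
angstrom <- c("\u00c5", "\u0041\u030a", "\u212b")

# base R compares the underlying code points, so these count as distinct
n_raw <- length(unique(angstrom))

# after NFC normalization they collapse to a single value
nfc <- utf8_normalize(angstrom)
n_nfc <- length(unique(nfc))

# case folding maps sharp s to "ss", so these two spellings match
words <- c("Stra\u00dfe", "STRASSE")
folded <- utf8_normalize(words, map_case = TRUE)
same_word <- folded[1] == folded[2]
```

Normalizing both sides of a comparison, rather than only one, avoids mismatches when the inputs come from different sources.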

Print emoji

On some platforms (including macOS), the R implementation of print() uses an outdated version of the Unicode standard to determine which characters are printable. Use utf8_print() for an updated print function:

print(intToUtf8(0x1F600 + 0:79)) # with default R print function

utf8_print(intToUtf8(0x1F600 + 0:79)) # with utf8_print, truncates line

utf8_print(intToUtf8(0x1F600 + 0:79), chars = 1000) # higher character limit

Citation

Cite utf8 with the following BibTeX entry:

print(suppressWarnings(citation("utf8")), "Bibtex")

Contributing

The project maintainer welcomes contributions in the form of feature requests, bug reports, comments, unit tests, vignettes, or other code. If you'd like to contribute, either

This project is released with a Contributor Code of Conduct, and if you choose to contribute, you must adhere to its terms.



patperry/r-utf8 documentation built on Jan. 26, 2024, 12:59 a.m.