library(DT)

Introduction

This package helps you read and use data from Vietnamese sources in R. Such data often use Vietnamese legacy character encodings such as TCVN (Vietnam Standards / Tieu chuan Viet Nam) and are still in use today. Such data are not read correctly in R (which doesn't seem to support these Vietnamese encodings). To correct this problem and make such data available in R, this package converts data in legacy Vietnamese encodings to the correct Unicode characters. The main function is decodeVN.

The package currently supports conversion between the following encodings:

It converts between any of these encodings, but most commonly one would want to convert legacy encodings to their correct Unicode characters. The

package supports character vectors and data frames (the character columns thereof).

Background

For those interested here is some background on character sets and encodings.

A character set is a collection of characters used in languages.

A coded character set assigns a unique number to each character in a character set. The unique values in a coded character sets are known as code points (e.g. \"U+0041 \" corresponcs to capital letter A in Unicode).

Unicode (also known as Universal Coded Character Set) is a modern coded character set that contains over 100,000 characters from most writing systems of the world.

Character encodings are the method by which a coded character set is converted to binary. UTF-8 is a commonly used encoding for Unicode characters (UTF stands for Unicode Transformation Format).

Historically, the Vietnamese language could not be represented well using 128 characters in the ASCII standard. Hence, several character encodings were developed for the Vietnamese language with the aim of representing Vietnamese characters while fitting into 1 byte (=8-bit, allowing up to 255 characters). To achieve that, some some code points were reassigned and differ from today's standards like Unicode. Thus, when reading data that use those Vietnamese encodings on systems that assume e.g. UTF-8 encoding (Unicode), we get gibberish text (also known as mojibake).

Let's take a Vietnamese string that is supposed to be:

\"Quảng Trị, An Đôn, Thừa Thiên Huế\"

If it is encoded using a legacy Vietnamese encoding, it might displayed as something like:

\"Qu¶ng TrÞ, An §«n, Thõa Thiªn HuÕ\"

\"Quäng TrÎ, An Đôn, ThØa Thiên Hu‰\"

\"Quäng Tr¸, An Đôn, Th×a Thiên Huª"\

when it is read in R or other software (depending on the specific encoding of the data).

These days, almost all Vietnamese computer systems use Unicode and there is no need to use the legacy encodings anymore. Nevertheless, historic data may still be encoded in these legacy encodings and require conversion.

This package fixes this problem by converting the garbled strings back to their original (with or without the diacritcs). When saving the output in R, it will use standard encodings and data will read correctly in the future.

You can check the output from this package using e.g. this website \url{http://www.enderminh.com/minh/vnconversions.aspx}.

If you want to know more about the technicalities of encodings and character sets, see \url{https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/}.

Installation

The package can be installed from CRAN:

install.packages("vietnameseConverter")

and loaded into R via:

library(vietnameseConverter)

Example applications

Disclaimer: Printing Unicode characters from R function output in package vignettes is difficult. Often R will print the Unicode escape characters (the code point) instead of the actual characters, e.g. instead of "ế".

This is a limitation in vignettes. I'll try to circumvent it in this vignette as much as possible, e.g. by using DT::data.table() and by copy/pasting the correct console output below the output with Unicode characters. Even though it doesn't print nicely in this vignette, the converted data are correct (as you can see if you use data.table(), View(), or print vectors in the R console.

Run the code from this vignette in your R console to see the actual output, or try the examples in ?decodeVNN

Vectors

Let's consider a simple vector with garbled characters:

string_garbled <- c("Qu¶ng TrÞ", "An §«n", "Thõa Thiªn HuÕ")
string_garbled

It can be fixed using the decodeVN function:

tmp <- decodeVN(string_garbled)
tmp

In your R console, the Unicode characters will be printed correctly as:

## [1] "Quảng Trị"      "An Ðôn"         "Thừa Thiên Huế"

By default, the output contains character with diacritics (accents). We can also get output without the diacritics (plain ASCII letters) by setting argument diacritics = FALSE:

decodeVN(string_garbled, diacritics = FALSE)

By default, decodeVN() attempts to convert TCVN3 to Unicode characte (as shown in this example)r. If your data are in different encodings, set argument from.

For example, a string in VPS encoding is converted via:

string_garbled_vps <- c("Quäng TrÎ",  "An ñôn", "ThØa Thiên Hu‰")
decodeVN(string_garbled_vps, from = "VPS")

Which in your R cosole will be

## [1] "Quảng Trị"      "An Ðôn"         "Thừa Thiên Huế"

There is also an argument to, which by default is set to to = "Unicode". If you want output that simulates data encoded in one of the supported Vietnamese encodings, set argument to accordingly, e.g.

string_tcvn_to_vps <- decodeVN(x = string_garbled, from = "TCVN3", to = "VPS")
string_tcvn_to_vps

This string is now garbled the VPS-way. It can still be converted to Unicode:

decodeVN(string_tcvn_to_vps, from = "VPS")

which is:

## [1] "Quảng Trị"      "An Ðôn"         "Thừa Thiên Huế"

Data frames

We can also convert entire data frames at once. It will only convert character columns and leave the other columns untouched. If you have factor columns, convert them to character first.

The package contains a list containing several data frames which simulate the problems with different encodings. These data frames show what the problems with different encodings look like when the data are loaded in an environment that assumes UTF-8 / native encoding). The data frame is based on the table of provinces in \url{https://en.wikipedia.org/wiki/Provinces_of_Vietnam}.

Loading the list of sample data.

data(vn_samples)

vn_samples is a list of four data frames. See ?vn_samples for details.

The first item $Unicode shows the correct characters.

head(vn_samples$Unicode)

Note: when printing a data frame in the R colsole, the Unicode characters are not displayed properly, which is a limitation of the print method for data frames in R (it's the same issue as mentioned in the disclaimer above). The characters are correct though and show correctly when printing the columns as vectors:

head(vn_samples$Unicode$Province_city)

Which again is not formatted nicely in the vignette (but would be formatted correctly in your console). The string is:

## [1] "Bắc Giang" "Bắc Kạn"   "Cao Bằng"  "Hà Giang"  "Lạng Sơn"  "Phú Thọ"

It also shows correctly when using View (not shown here, it works in interactive sessions):

View(vn_samples$Unicode)

It is also shown correctly when using DT::datatable:

DT::datatable(vn_samples$Unicode)

The other data frames in vn_samples show garbled text from several Vietnamese encodings. Here's the example in TCVN3 encoding:

DT::datatable(vn_samples$TCVN3)
DT::datatable(vn_samples$TCVN3)

One can easily convert the entire data frame:

# take data frames out of list for easier readability
df_unicode <- vn_samples$Unicode
df_tcvn3 <- vn_samples$TCVN3

# conversion from TCVN3 to Unicode (default)
df_tcvn3_converted <- decodeVN(df_tcvn3)

# print output
DT::datatable(df_tcvn3_converted)

After conversion it is identical to the original with Unicode characters

all.equal(df_unicode, df_tcvn3_converted)

Again, we can choose to return characters without accents by setting diacritics = FALSE:

df_tcvn3_converted2 <- decodeVN(df_tcvn3, diacritics = FALSE)
DT::datatable(df_tcvn3_converted2)

We can also use the from and to arguments for conversions between other encodings.

For example, to convert VISCII-encoded data to Unicode, use:

df_viscii <- vn_samples$VISCII
DT::datatable(decodeVN(df_viscii, from = "VISCII"))

The same thing without diacritics:

DT::datatable(decodeVN(df_viscii, from = "VISCII", diacritics = FALSE))

Spatial data

Since version 0.4.0, decodeVN() works directly on sf objects and preserves the geometry column. The workflow is identical to data frames shown above.

library(sf)
sf_object <- st_read(...)
sf_object_decoded <- decodeVN(sf_object)


jniedballa/vietnameseConverter documentation built on Dec. 21, 2021, 1:15 a.m.