ascii_replace: Replace invalid UTF-8 bytes.

Description

Replace invalid bytes detected by check_column_encoding with valid ASCII or UTF-8 characters. This requires manually constructing a replacement matrix (see rep_str).

Usage

ascii_replace(dset, enc_check_results, column_name, rep_str)

Arguments

dset

A data.frame or data.table.

enc_check_results

A list returned by calling check_column_encoding.

column_name

The name of an element in the list returned by check_column_encoding, corresponding to a column header in dset where encoding issues were detected (or possibly false positives).

rep_str

A character matrix. The number of rows should equal the length of enc_check_results[[column_name]]. The number of columns should equal the maximum number of invalid byte sequences observed in any element of enc_check_results[[column_name]]. Each element of the matrix is either a single replacement character for one invalid byte sequence in the corresponding string of enc_check_results[[column_name]], or an arbitrary filler character or string. Because some elements of enc_check_results[[column_name]] may have more invalid byte sequences than others, and because the number of columns in the matrix equals the maximum number of invalid sequences, some rows may need filler strings in order to be completely filled.

This matrix must be constructed manually, as there is no method for guessing the proper ASCII or UTF-8 character to replace an invalid byte sequence.

The function can only replace one column at a time. To replace additional columns, feed the data.table returned by ascii_replace back into the function as the value of dset, most likely with a different value for rep_str.

While this may seem like an error-prone approach, remember that you can script your manual construction. A good idea is to use the same filler word throughout your matrix. Then call grep with that filler word as the value of the pattern argument and the result of calling ascii_replace as the value of the x argument. If the filler word is matched, it probably means that you missed a second, third, or later invalid byte sequence in one of your observations. (It's happened to me at least once!)
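The filler-word approach described above can be sketched roughly as follows. This is only an illustration under assumptions: the three flagged strings, the column name "name", the replacement characters, and the fixed_col stand-in are all hypothetical, and the ascii_replace call is left commented out because its inputs are not defined here.

```r
# Hypothetical case: three flagged strings, at most two invalid sequences each.
filler <- "FILLER"  # one filler word used throughout, so it is easy to search for

# One row per flagged string, one column per invalid byte sequence.
rep_str <- matrix(
  c("e", filler,   # row 1: one invalid sequence, second cell padded
    "n", "a",      # row 2: two invalid sequences
    "u", filler),  # row 3: one invalid sequence, second cell padded
  ncol = 2, byrow = TRUE
)

# The call follows the Usage signature (not run; dset and
# enc_check_results are assumed to exist):
# fixed <- ascii_replace(dset, enc_check_results, "name", rep_str)

# Stand-in for the repaired column, used only to illustrate the check.
fixed_col <- c("cafe", "manana", "FILLERber")

# Any match means a filler string was substituted into the data.
leaked <- grep(filler, fixed_col, fixed = TRUE)
```

Here leaked holds the positions of any strings that still contain the filler word, which are the observations whose row of rep_str should be rechecked.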

Value

A data.table with the same structure as dset but with valid UTF-8 bytes.


jkroes/FixEncoding documentation built on May 19, 2019, 12:44 p.m.