testCharSystem: testCharSystem function


View source: R/testCharSystem.R

Description

This function detects characters that originate from different Unicode blocks within the same text, for example a Latin letter embedded in an otherwise Cyrillic word.
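The idea can be illustrated with a minimal sketch that is independent of the package. It assumes the stringi package is available and uses ICU script properties (`\p{Script=...}`) as a stand-in for Unicode blocks; `has_mixed_scripts` is a hypothetical helper, not a HooverArchives function.

```r
library(stringi)

# Hypothetical helper (not part of HooverArchives): TRUE for strings that
# contain at least one Latin and at least one Cyrillic character.
has_mixed_scripts <- function(x) {
  stri_detect_regex(x, "\\p{Script=Latin}") &
    stri_detect_regex(x, "\\p{Script=Cyrillic}")
}

words <- c(
  "\u0410\u0440\u0078\u0438\u0432\u044B",  # Cyrillic word with a Latin "x"
  "\u0410\u0440\u0445\u0438\u0432\u044B",  # the same word, all Cyrillic
  "Archives"                               # all Latin
)
has_mixed_scripts(words)
```

Only the first element mixes scripts, so only it would be a candidate for marking.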

Usage

testCharSystem(dat, addCharSys = NULL, markword = TRUE, autochange = FALSE)

Arguments

dat

the data vector to check (a character vector in the example below).

addCharSys

a vector of character block names to test against. If NULL (default), c("Latin", "Cyrillic") is used.

markword

if TRUE (default), detect each word that contains an anomalous character and mark the whole word. If FALSE, detect each anomalous character within a word and mark only that character.

autochange

if TRUE, replace anomalous characters according to the coding rules proposed in "charcodescheme.csv". Where a rule offers more than one character, the alternatives are separated by "|"; an unknown character is replaced with "?". Defaults to FALSE.
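A rough sketch of what rule-based replacement can look like is given here. The lookup table is an illustrative assumption and does not reproduce the contents of "charcodescheme.csv"; `fix_homoglyphs` is a hypothetical helper that relies on `stri_trans_char` from stringi.

```r
library(stringi)

# Illustrative mapping only: a few Latin letters and their Cyrillic
# look-alikes. The package's actual rules live in "charcodescheme.csv".
latin_lookalikes    <- "aceopxy"
cyrillic_lookalikes <- "\u0430\u0441\u0435\u043E\u0440\u0445\u0443"

# Hypothetical helper: translate each listed Latin letter to its
# Cyrillic counterpart, position by position.
fix_homoglyphs <- function(x) {
  stri_trans_char(x, latin_lookalikes, cyrillic_lookalikes)
}

# Cyrillic word with a Latin "x" (U+0078) becomes fully Cyrillic (U+0445)
fix_homoglyphs("\u0410\u0440\u0078\u0438\u0432\u044B")
```

This sketch translates every listed Latin letter it sees, so it should only be applied to words already known to be Cyrillic with stray Latin characters.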

Value

Returns an altered data vector with anomalous words/characters/replacements surrounded by asterisks (*).
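The marking convention can be sketched as follows. This is not the package's implementation: `mark_anomalies` is a hypothetical helper that assumes each vector element is a single word, treats Latin characters inside a mixed Latin/Cyrillic word as the anomalous ones, and requires the stringi package.

```r
library(stringi)

# Hypothetical sketch of asterisk marking (not the HooverArchives code).
mark_anomalies <- function(x, markword = TRUE) {
  # An element is "mixed" if it has both Latin and Cyrillic characters
  mixed <- stri_detect_regex(x, "\\p{Script=Latin}") &
    stri_detect_regex(x, "\\p{Script=Cyrillic}")
  mixed[is.na(mixed)] <- FALSE  # leave missing values untouched

  if (markword) {
    # Surround the whole word with asterisks
    x[mixed] <- paste0("*", x[mixed], "*")
  } else {
    # Surround only the Latin characters with asterisks
    x[mixed] <- stri_replace_all_regex(x[mixed], "(\\p{Script=Latin})", "*$1*")
  }
  x
}

word <- "\u0410\u0440\u0078\u0438\u0432\u044B"  # Cyrillic with a Latin "x"
mark_anomalies(word, markword = TRUE)
mark_anomalies(word, markword = FALSE)
```

With `markword = TRUE` the whole word is wrapped in asterisks; with `markword = FALSE` only the Latin "x" is wrapped.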

Examples

library(HooverArchives)
library(stringi)

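# Two Cyrillic words, each with one Latin look-alike: "\u0061" (Latin "a") and "\u0078" (Latin "x")
dat_vectorR <- c("\u0418\u043D\u0444\u043E\u0440\u043C\u0061\u0446\u0438\u044F", "\u0410\u0440\u0078\u0438\u0432\u044B")
# Convert any remaining escape sequences to actual characters
dat_vector <- stri_unescape_unicode(dat_vectorR)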

# Mark the word
testCharSystem(dat_vector, addCharSys=c("Latin", "Cyrillic"), autochange=FALSE, markword=TRUE)

# Mark the anomalous character
testCharSystem(dat_vector, addCharSys=c("Latin", "Cyrillic"), autochange=FALSE, markword=FALSE)

# Replace the anomalous character with the correct character and mark it
testCharSystem(dat_vector, addCharSys=c("Latin", "Cyrillic"), autochange=TRUE, markword=FALSE)

kkalininMI/HooverArchives documentation built on Oct. 28, 2020, 10:16 a.m.