In mdlincoln/salty: Turn Clean Data into Messy Data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)

salty

When teaching students how to clean data, it helps to have data that isn't too clean already. salty offers functions for "salting" clean data with problems often found in datasets in the wild, such as:

pseudo-OCR errors
inconsistent capitalization and spelling
invalid dates
unpredictable punctuation in numeric fields
missing values or empty strings

Installation

Install salty from CRAN with:

install.packages("salty")

You may install the development version of salty from github with:

# install.packages("devtools")
devtools::install_github("mdlincoln/salty")

Basic usage

library(salty)
set.seed(10)

# We'll use charlatan to create some sample data

sample_names <- charlatan::ch_name(10)
sample_names

sample_numbers <- charlatan::ch_double(10)
sample_numbers

salty offers several easy-to-use functions for adding common problems to your data.

# Add in erroneous letters or punctuation
salt_letters(sample_names)
salt_punctuation(sample_names)

# Flip capitals
salt_capitalization(sample_names)

# Introduce OCR errors. You can specify the proportion of values in the vector
# that should be salted, and the proportion of possible matches within a single
# value that should be changed.
salt_ocr(sample_names, p = 1, rep_p = 1)

salt_delete will simply drop characters from randomly selected values in a vector, while salt_empty and salt_na will replace entire values.

salt_delete(sample_names, p = 0.5, n = 6)

salt_empty(sample_names, p = 0.5)

salt_na(sample_names, p = 0.5)

Advanced usage

For more fine-grained control over the salting process, and for access to a wider range of salting types, you can use the underlying functions provided for: inserting, substituting, replacing.

The set of insertions and replacements are specified via shakers, pre-filled character sets and pattern/replacement pairs that the salt verbs then call.

available_shakers()

salt_insert keeps all the characters in the original while adding new ones, while salt_substitute overwrites those characters.

# Use p to specify the percent of values that you would like to salt
salt_insert(sample_names, shaker$punctuation, p = 0.5)

# Use n to specify how many new insertions/substitutions you want to make to selected values
salt_substitute(sample_names, shaker$punctuation, p = 0.5, n = 3)

Different flavors of salt are available using shaker, however you can also supply your own character vector of possible replacements if you like.

salt_insert(sample_names, shaker$mixed_letters, p = 0.5)

salt_insert(sample_numbers, shaker$digits, p = 0.5)

salt_insert(sample_names, c("foo", "bar", "baz"), p = 0.5)

salt_replace is a bit more targeted: it works with pairs of patterns and replacements, either contained in replacement_shaker or user-specified. Use rep_p to set a probability of how many possible replacements should actually get swapped out for any given value.

salt_replace(sample_names, replacement_shaker$ocr_errors, p = 1, rep_p = 1)

salt_replace(sample_names, replacement_shaker$capitalization, p = 0.5, rep_p = 0.2)

salt_replace(sample_numbers, replacement_shaker$decimal_commas, p = 0.5, rep_p = 1)

You may also specify your own arbitrary character vector of possible insertions.

salt_insert(sample_names, insertions = c("X", "Z"))

Possible future work

Modifying date strings to introduce subtle errors like invalid dates (e.g. February 30th)
Simulating character encoding problems

Related work

salty should not be used for anonymizing data; that's not its purpose. However, it does draw some inspiration from anonymizer.

To create sample data for salting, take a look at charlatan.

Acknowledgements

The common OCR replacement errors are partially derived from the sed replacements specified in the Royal Society Corpus project: Knappen, Jörg, Fischer, Stefan, Kermes, Hannah, Teich, Elke, and Fankhauser, Peter. 2017. "The Making of the Royal Society Corpus." In Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language. Göteborg, Sweden. Linköping University Electronic Press. https://aclanthology.org/W17-0503.pdf.

mdlincoln/salty documentation built on Sept. 3, 2024, 6:50 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com