clean_variable_spelling: Check and clean spelling or codes of multiple variables in a...


View source: R/clean_variable_spelling.R

Description

This function cleans your data according to pre-defined rules held in either a data frame or a list of data frames. It is useful for correcting misspellings and recoding variables (e.g. from electronic survey data).

Usage

clean_variable_spelling(
  x = data.frame(),
  wordlists = list(),
  from = 1,
  to = 2,
  spelling_vars = 3,
  sort_by = NULL,
  classes = NULL,
  warn = FALSE
)

Arguments

x

a data.frame

wordlists

a data frame or named list of data frames with at least two columns defining the word list to be used. If this is a data frame, a third column must be present to split the wordlists by column in x (see spelling_vars).

from

a column name or position defining words or keys to be replaced

to

a column name or position defining replacement values

spelling_vars

character or integer. If wordlists is a data frame, then this column of wordlists defines the columns in x corresponding to each section of the wordlists data frame. This defaults to 3, indicating that the third column is to be used.

sort_by

a character string naming the column to be used for sorting the values in each data frame. If the incoming variables are factors, this determines how the levels of the resulting factors will be sorted.

classes

a vector of class definitions for each of the columns. If this is not provided, the classes will be read from the columns themselves. Practically, this is used in clean_data() to mark columns as protected.

warn

if TRUE, warnings and errors from clean_spelling() will be shown as a single warning. Defaults to FALSE, which shows nothing.
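To make the argument layout concrete, here is a minimal base R sketch of a wordlist in data frame form. The column names options, values, grp and orders mirror those used in the Examples section; the rows themselves are invented for illustration. The split() call shows one plausible way to obtain the equivalent named-list form, with one data frame per column of x.

```r
# Hypothetical wordlist: one row per (old value -> new value) rule.
wordlist <- data.frame(
  options = c("y", "n", "m", "f"),                 # `from`: values to replace
  values  = c("Yes", "No", "Male", "Female"),      # `to`: replacement values
  grp     = c("treated", "treated", "sex", "sex"), # `spelling_vars`: target column in x
  orders  = c(1, 2, 1, 2),                         # optional `sort_by` column
  stringsAsFactors = FALSE
)

# The same rules as a named list of data frames, one element per column of x.
wordlist_list <- split(wordlist[c("options", "values", "orders")], wordlist$grp)
names(wordlist_list)
```

With the data frame form, from, to and spelling_vars can be given either as column names or as positions (1, 2 and 3 by default).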

Details

By default, this applies the function clean_spelling() to every column named in spelling_vars. If a global dictionary is used, it is applied to all character and factor columns as well.
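The per-column behaviour described above can be sketched in base R. This is only an illustration of the idea, not the package's implementation: recode_column() is a hypothetical helper, and the data and wordlist are invented.

```r
# Illustrative stand-in for the per-column step; NOT the package's implementation.
recode_column <- function(values, rules, from = "options", to = "values") {
  out <- as.character(values)
  hit <- match(out, rules[[from]])                   # position of each value in the rules
  out[!is.na(hit)] <- rules[[to]][hit[!is.na(hit)]]  # replace only the matched values
  out
}

dat <- data.frame(treated = c("y", "n", "y"), sex = c("m", "f", "x"),
                  stringsAsFactors = FALSE)
wordlist <- data.frame(
  options = c("y", "n", "m", "f"),
  values  = c("Yes", "No", "Male", "Female"),
  grp     = c("treated", "treated", "sex", "sex"),
  stringsAsFactors = FALSE
)

# Apply each group's rules to the column of x with the same name.
for (col in intersect(names(dat), unique(wordlist$grp))) {
  dat[[col]] <- recode_column(dat[[col]], wordlist[wordlist$grp == col, ])
}
dat
```

Values with no matching rule (here "x" in sex) are left unchanged in this sketch.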

spelling_vars

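Spelling variables within wordlists represent keys that you want to match to column names in x (the data set). These are expected to match exactly, with the exception of two reserved keywords that start with a full stop: .regex [pattern], which matches any column whose name is matched by the regular expression [pattern], and .global, which matches any column (see Global wordlists).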

Global wordlists

A global wordlist is a set of definitions applied to all valid columns of x indiscriminately.
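As a rough sketch of how global and column-specific definitions might coexist, the base R code below builds a wordlist that mixes the two. Two things here are assumptions rather than statements from this page: that global rules are keyed with .global in the spelling_vars column, and that column-specific rules take precedence over global ones. rules_for() is a hypothetical helper, not a package function.

```r
# Hypothetical wordlist mixing column-specific rules with global ones.
# ".global" is assumed to be the reserved key for global rules.
wordlist <- data.frame(
  options = c("unk", "N", "N"),
  values  = c("unknown", "no", "North"),
  grp     = c(".global", ".global", "direction"),
  stringsAsFactors = FALSE
)

# Illustrative helper (not part of the package): collect the rules for one column,
# listing column-specific rules first so they win over global rules with the same key.
rules_for <- function(wordlist, col) {
  rules <- rbind(wordlist[wordlist$grp == col, ],
                 wordlist[wordlist$grp == ".global", ])
  rules[!duplicated(rules$options), ]
}

rules_for(wordlist, "direction")  # the column-specific rule for "N" is kept
rules_for(wordlist, "treated")    # only the global rules apply
```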

Value

a data frame with re-defined data based on the dictionary

Note

This function will only parse character and factor columns to protect numeric and Date columns from conversion to character.
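The column filter described in this note can be approximated in base R as follows. The data frame is invented, and the check is a sketch of the rule as stated, not the package's own code.

```r
dat <- data.frame(
  id     = 1:3,
  onset  = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03")),
  sex    = c("m", "f", "m"),
  result = factor(c("pos", "neg", "pos")),
  stringsAsFactors = FALSE
)

# TRUE for columns that would be eligible for recoding under this rule;
# integer and Date columns are skipped, so their classes are preserved.
eligible <- vapply(dat, function(col) is.character(col) || is.factor(col), logical(1))
names(dat)[eligible]
```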

Author(s)

Zhian N. Kamvar

Patrick Barks

See Also

matchmaker::match_df(), which this function wraps.

Examples

library(linelist)  # provides clean_variable_spelling() and linelist_example()

# Read in dictionary and coded data examples --------------------

wordlist <- read.csv(linelist_example("spelling-dictionary.csv"), 
                     stringsAsFactors = FALSE)
dat      <- read.csv(linelist_example("coded-data.csv"), 
                     stringsAsFactors = FALSE)
dat$date <- as.Date(dat$date)

# Clean spelling based on wordlist ------------------------------

wordlist # show the wordlist
head(dat) # show the data

res1 <- clean_variable_spelling(dat,
                                wordlists = wordlist,
                                from = "options",
                                to = "values",
                                spelling_vars = "grp")
head(res1)

# You can ensure that the factor levels are in the correct order by
# specifying a column that defines the order.

dat[] <- lapply(dat, as.factor)
as.list(head(dat))
res2 <- clean_variable_spelling(dat, 
                                wordlists = wordlist, 
                                from = "options",
                                to = "values",
                                spelling_vars = "grp", 
                                sort_by = "orders")
head(res2)
as.list(head(res2))

reconhub/linelist documentation built on March 5, 2020, 2:41 p.m.