clean_numeric: Clean numeric variables within a dataset based on a...


clean_numeric R Documentation

Clean numeric variables within a dataset based on a dictionary of value-replacement pairs

Description

Applies a dictionary of value-replacement pairs and a conversion function (as.numeric by default) to clean and standardize the values of numeric variables. To use this approach, the numeric columns of the original dataset should generally be imported as type "text" or "character", so that non-valid values are not automatically coerced to missing values on import.
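The reason for importing as character can be illustrated with a minimal base R sketch. The vector raw_age below is made up for illustration and is not part of the package's example data. Converting directly turns every non-valid entry into NA, after which a typo can no longer be told apart from a value that was missing all along; keeping the raw strings makes it possible to list the entries that need a correction.

```r
# Hypothetical raw values, as they might look when imported as character
raw_age <- c("23", "45", "3O", "unknown", NA)

# Direct conversion: non-valid entries become NA (with a coercion warning,
# suppressed here)
converted <- suppressWarnings(as.numeric(raw_age))

# Entries that were present in the raw data but could not be converted
non_valid <- raw_age[is.na(converted) & !is.na(raw_age)]
print(non_valid)
```

Values such as these are the kind of entries a cleaning dictionary would be expected to cover.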

Usage

clean_numeric(
  x,
  vars,
  vars_id = NULL,
  dict_clean = NULL,
  fn = as.numeric,
  na = ".na"
)

Arguments

x

A data frame with one or more columns to clean

vars

Names of columns within x to clean

vars_id

Optional vector of one or more ID columns within x on which corrections should be conditional.

If not specified, the cleaning dictionary contains one entry for each unique combination of variable and non-valid value. If specified, the cleaning dictionary contains one entry for each unique combination of variable, non-valid value, and ID variable(s).
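As a rough illustration of the two dictionary shapes, the sketch below builds both by hand in base R. The column names "variable", "value", and "replacement" follow the dict_clean argument described on this page; the ID column name "id", the record identifiers, and all values are hypothetical, and the column order of a real dictionary may differ.

```r
# Without vars_id: one entry per unique (variable, non-valid value) pair
dict_no_id <- data.frame(
  variable    = c("age", "contacts"),
  value       = c("3O", "unknown"),
  replacement = c("30", ".na"),
  stringsAsFactors = FALSE
)

# With vars_id = "id" (hypothetical ID column): one entry per
# (variable, non-valid value, id) combination, so the same raw value
# can receive a different correction for each record
dict_with_id <- data.frame(
  id          = c("P001", "P002"),
  variable    = c("age", "age"),
  value       = c("3O", "3O"),
  replacement = c("30", "3"),
  stringsAsFactors = FALSE
)
```

The ID-conditional form is longer, but it allows a correction to depend on the record in which the non-valid value occurs.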

dict_clean

Optional dictionary of value-replacement pairs (e.g. produced by check_numeric). If provided, it must include columns "variable", "value", and "replacement", as well as all columns named in vars_id if that argument is specified.

If no dictionary is provided, the conversion function is simply applied to all columns specified in vars.

fn

Function to convert values to numeric. Defaults to as.numeric.
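A custom conversion function might look like the sketch below. The helper as_numeric_comma is hypothetical (it is not part of the package) and assumes that fn only needs to take a character vector and return a numeric vector of the same length.

```r
# Hypothetical converter: accept a comma as the decimal separator
as_numeric_comma <- function(x) {
  as.numeric(gsub(",", ".", x, fixed = TRUE))
}

as_numeric_comma(c("1,5", "2.25", "10"))
```

Under that assumption, such a function could presumably be supplied as fn = as_numeric_comma in place of the default.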

na

Keyword to use within column "replacement" for values that should be converted to NA. Defaults to ".na". The keyword is used to distinguish between "replacement" values that are missing because they have yet to be manually verified, and values that have been verified and really should be converted to NA.
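The distinction drawn by the keyword can be sketched in base R as follows. The dictionary below is made up for illustration, and the sketch only classifies its entries; it does not claim to reproduce how the package handles each class internally.

```r
# Hypothetical dictionary (not the package's example data)
dict <- data.frame(
  variable    = c("age", "age", "contacts"),
  value       = c("3O", "unknown", "n/a"),
  replacement = c("30", ".na", NA),
  stringsAsFactors = FALSE
)

na_keyword <- ".na"

# Missing replacement: the entry has not yet been manually verified
is_unverified <- is.na(dict$replacement)

# Keyword replacement: verified, and the value really should become NA
is_verified_na <- !is_unverified & dict$replacement == na_keyword

dict$value[is_unverified]
dict$value[is_verified_na]
```

Without a dedicated keyword, both kinds of entry would carry a missing replacement and could not be told apart.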

Value

The original data frame x but with cleaned versions of columns vars

Examples

# load example dataset and dictionary of value-replacement pairs
data(ll1)
data(clean_num1)

# dictionary-based corrections to numeric vars 'age' and 'contacts'
clean_numeric(
  ll1,
  vars = c("age", "contacts"),
  dict_clean = clean_num1
)

# apply standardization with as.integer() rather than default as.numeric()
clean_numeric(
  ll1,
  vars = c("age", "contacts"),
  dict_clean = clean_num1,
  fn = as.integer
)

# apply standardization but no dictionary-based cleaning
clean_numeric(
  ll1,
  vars = c("age", "contacts")
)


epicentre-msf/dbc documentation built on Oct. 24, 2023, 9:25 p.m.