string_clean: Clean string & factor columns

string_clean    R Documentation

Clean string & factor columns

Description

string_clean cleans and preprocesses string and factor columns in a data.frame or data.table after import from SQL, text files, CSVs, etc. It encodes text as UTF-8, trims whitespace and collapses repeated whitespace, converts blank strings to true NA values, and optionally converts strings to factors. The function preserves the original column order and leaves numeric and logical columns unchanged.
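The cleaning steps described above can be approximated in base R for a single character vector. This is a minimal sketch, not the package's implementation: the helper name clean_chr is hypothetical, and both the order of operations and the choice to collapse repeated whitespace to one space are assumptions.

```r
# Hypothetical helper (not part of rads): approximates the cleaning
# steps for one character vector using base R only.
clean_chr <- function(x) {
  x <- enc2utf8(x)                          # convert strings to UTF-8
  x <- gsub("\\s+", " ", x)                 # collapse runs of whitespace to one space
  x <- trimws(x)                            # drop leading and trailing whitespace
  x[!is.na(x) & x == ""] <- NA_character_   # blank strings become true NA
  x
}

clean_chr(c(" King  County ", "   ", "Pierce County", NA))
# "King County" NA "Pierce County" NA
```

Applying a helper like this to every character column, and to the levels of every factor column, would give roughly the behavior described, though string_clean itself may differ in detail.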

Usage

string_clean(dat = NULL, stringsAsFactors = FALSE)

Arguments

dat

A data.frame or data.table to clean.

stringsAsFactors

logical. Specifies whether to convert strings to factors (TRUE) or not (FALSE). Note that columns that were originally factors will always be returned as factors.

Details

Depending on the size of the data.frame/data.table, the cleaning process can take a long time.

Because string_clean uses data.table's by-reference assignment (e.g., :=), it modifies the object in place. There is no need to assign the output; just call string_clean(myTable).
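The by-reference behavior noted above is a general property of data.table and can be seen in isolation. The sketch below assumes the data.table package is installed; trim_in_place is a hypothetical helper written for illustration and does not reflect how string_clean is implemented internally.

```r
library(data.table)

# Hypothetical helper: trims every character column of a data.table.
# set() updates the column by reference, so the caller's object changes.
trim_in_place <- function(dt) {
  chr_cols <- names(dt)[vapply(dt, is.character, logical(1))]
  for (col in chr_cols) {
    set(dt, j = col, value = trimws(dt[[col]]))
  }
  invisible(dt)
}

dt <- data.table(id = 1:2, county = c(" King ", "Pierce  "))
trim_in_place(dt)   # return value is not assigned
dt$county           # "King" "Pierce"
```

If the original table must be kept unchanged, data.table::copy() can be used to make an explicit copy before calling a function that assigns by reference.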

Value

data.table

Examples


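# Build a small table with untidy strings and a factor copy of the same column
myTable <- data.table::data.table(
  intcol = as.integer(c(1, 2, 3)),
  county = c(' King  County ', 'Pierce County', '  Snohomish  county '))
myTable[, county_factor := factor(county)]
# Cleans myTable in place, so the result does not need to be assigned
string_clean(myTable, stringsAsFactors = TRUE)
print(myTable)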


PHSKC-APDE/rads documentation built on April 14, 2025, 10:47 a.m.