ternP: Preprocess a raw data frame for use with ternG or ternD

View source: R/ternP.R

ternP R Documentation

Preprocess a raw data frame for use with ternG or ternD

Description

ternP() cleans a raw data frame loaded from a CSV or XLSX file, applying a standardized set of transformations and performing validation checks before the data is passed to ternG or ternD.

Usage

ternP(data)

Arguments

data

A data frame or tibble as loaded from a CSV or XLSX file (e.g. via readr::read_csv() or readxl::read_excel()). All character columns are processed; numeric and logical columns are passed through unchanged by the string-cleaning steps.

Value

A named list with three elements:

clean_data

A tibble containing the fully cleaned dataset, ready to pass to ternG() or ternD().

sparse_rows

A tibble of rows from clean_data where more than 50% of values are NA. These rows remain in clean_data; they are copied here for optional review or download. The tibble is empty if no sparse rows exist.

feedback

A named list of feedback items. Each element is NULL if the corresponding transformation was not triggered, or a value describing what changed:

string_na_converted

A named list with elements total (integer count of values converted) and cols (character vector of affected column names), or NULL if no string NA values were found.

blank_rows_removed

A named list with elements count (integer) and row_indices (integer vector of original row positions removed), or NULL if none.

sparse_rows_flagged

A named list with elements count (integer) and row_indices (integer vector of row positions in clean_data with >50% missingness), or NULL if none.

case_normalized_vars

A named list with elements cols (character vector of affected column names) and detail (a named list per column, each with changed_from and changed_to character vectors showing the exact value changes), or NULL if none.

dropped_empty_cols

Character vector of column names (or "" for unnamed columns) that were dropped because they were 100% empty, or NULL if none.
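Because untriggered feedback items are NULL, calling code can test each element with is.null() before reading its fields. The sketch below shows one way to do that. The feedback list is built by hand to mirror the structure documented above; its values (column names, counts) are hypothetical and are not output from ternP().

```r
# Hypothetical feedback list mirroring the documented structure.
# In real use this would be result$feedback from ternP().
feedback <- list(
  string_na_converted  = list(total = 3L, cols = c("stage", "grade")),
  blank_rows_removed   = NULL,
  sparse_rows_flagged  = NULL,
  case_normalized_vars = list(
    cols   = "sex",
    detail = list(sex = list(changed_from = c("MAle", "male"),
                             changed_to   = c("Male", "Male")))
  ),
  dropped_empty_cols   = NULL
)

# Names of the transformations that were actually triggered (non-NULL items).
triggered <- names(feedback)[!vapply(feedback, is.null, logical(1))]
print(triggered)

# Report the exact value changes for each case-normalized column.
if (!is.null(feedback$case_normalized_vars)) {
  for (col in feedback$case_normalized_vars$cols) {
    d <- feedback$case_normalized_vars$detail[[col]]
    message(col, ": ",
            paste(d$changed_from, "->", d$changed_to, collapse = "; "))
  }
}
```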

Cleaning pipeline (in order)

  1. String NA values ("NA", "na", "Na", "unk") are converted to NA.

  2. Leading and trailing whitespace is trimmed from all character columns.

  3. Columns that are 100% empty (all NA) are dropped without raising an error; their names are reported in feedback$dropped_empty_cols.

  4. Rows where every cell is NA are removed.

  5. Character columns where values differ only by capitalization (e.g. "Male" vs "MAle") are standardized to title case.
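The five steps above can be approximated in base R as follows. This is an illustrative sketch only, not the TernTables source: the toy data frame and the to_title() helper are hypothetical, and the helper capitalizes only the first character of each value, which may differ from how ternP() handles multi-word values. The final line applies the >50% missingness criterion described for sparse_rows.

```r
# Toy input (hypothetical), with string NAs, stray whitespace, an empty
# column, a blank row, and inconsistent capitalization.
toy <- data.frame(
  sex   = c("Male", "MAle", " male ", NA, "Female"),
  stage = c("II", "unk", "NA", NA, "III"),
  empty = c(NA, NA, NA, NA, NA),
  age   = c(61, 58, NA, NA, 70),
  stringsAsFactors = FALSE
)
is_chr <- vapply(toy, is.character, logical(1))

# Step 1: convert string NA values to real NA (character columns only).
na_strings <- c("NA", "na", "Na", "unk")
toy[is_chr] <- lapply(toy[is_chr], function(x) {
  x[x %in% na_strings] <- NA
  x
})

# Step 2: trim leading and trailing whitespace.
toy[is_chr] <- lapply(toy[is_chr], trimws)

# Step 3: drop columns that are entirely NA.
all_na_col <- vapply(toy, function(x) all(is.na(x)), logical(1))
toy <- toy[, !all_na_col, drop = FALSE]

# Step 4: remove rows where every cell is NA.
toy <- toy[rowSums(!is.na(toy)) > 0, , drop = FALSE]

# Step 5: where values differ only by capitalization, standardize them.
to_title <- function(x) {
  ok <- !is.na(x)
  x[ok] <- paste0(toupper(substring(x[ok], 1, 1)),
                  tolower(substring(x[ok], 2)))
  x
}
for (col in names(toy)[vapply(toy, is.character, logical(1))]) {
  vals <- unique(na.omit(toy[[col]]))
  if (anyDuplicated(tolower(vals)) > 0) {
    toy[[col]] <- to_title(toy[[col]])
  }
}

# Sparse rows: more than 50% of values missing (kept in toy, copied here).
sparse <- toy[rowMeans(is.na(toy)) > 0.5, , drop = FALSE]
```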

Validation hard stops

ternP() stops with a descriptive error if:

  • Any column name matches a protected health information (PHI) pattern (e.g. MRN, DOB, FirstName). De-identified research identifiers such as patient_id, subject_id, and participant_id are explicitly excluded, as are clinical-event dates (admission date, discharge date, visit date, etc.). Only personal-identity dates such as DOB and DOD are flagged.

  • Any column with a blank or whitespace-only header contains data. Completely empty unnamed columns are silently dropped and do not trigger this error.
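A minimal sketch of what these two header checks might look like is given below. The function check_headers() and its regular expression are hypothetical stand-ins; the actual PHI patterns used by ternP() are not listed in this page, so the pattern here covers only the examples named above (MRN, DOB, DOD, first and last name).

```r
# Hypothetical header validator; NOT the TernTables implementation.
check_headers <- function(df) {
  nms <- names(df)

  # Blank or whitespace-only headers are an error only if the column has data.
  blank    <- is.na(nms) | trimws(nms) == ""
  has_data <- vapply(df, function(x) any(!is.na(x)), logical(1))
  if (any(blank & has_data)) {
    stop("Column(s) with a blank header contain data: position(s) ",
         paste(which(blank & has_data), collapse = ", "))
  }

  # Assumed PHI pattern, anchored so that names such as patient_id or
  # admission_date do not match.
  phi_pattern <- "^(mrn|dob|dod|first[_ .]?name|last[_ .]?name)$"
  hits <- nms[grepl(phi_pattern, nms, ignore.case = TRUE)]
  if (length(hits) > 0) {
    stop("Possible PHI column(s): ", paste(hits, collapse = ", "))
  }
  invisible(TRUE)
}

# De-identified IDs and clinical-event dates pass.
ok <- data.frame(patient_id = 1:2,
                 admission_date = c("2020-01-01", "2020-01-02"),
                 sex = c("M", "F"))
check_headers(ok)

# A PHI-like column name triggers a hard stop; capture the message.
bad <- data.frame(MRN = 1:2, sex = c("M", "F"))
err_phi <- tryCatch(check_headers(bad),
                    error = function(e) conditionMessage(e))

# A blank header on a column that holds data also stops.
unnamed <- ok
names(unnamed)[3] <- " "
err_blank <- tryCatch(check_headers(unnamed),
                      error = function(e) conditionMessage(e))
```

Wrapping the call in tryCatch(), as in the last two cases, lets a script report the problem and continue rather than halt.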

See Also

ternG for grouped comparisons, ternD for descriptive statistics.

Examples


# Load a messy CSV and preprocess it
path   <- system.file("extdata/csv", "tern_colon_messy.csv",
                      package = "TernTables")
raw    <- read.csv(path, stringsAsFactors = FALSE)
result <- ternP(raw)

# Access cleaned data
result$clean_data

# Review preprocessing feedback
result$feedback

# Sparse rows flagged (>50% missing); these rows also remain in clean_data
result$sparse_rows



TernTables documentation built on March 26, 2026, 5:09 p.m.