In FakeDataR: Privacy-Preserving Synthetic Data for 'LLM' Workflows

prepare_input_data

R Documentation

Prepare Input Data: Coerce to data.frame and (optionally) normalize values

Description

Converts common tabular objects to a base data.frame and, if normalize = TRUE, applies light, conservative value normalization:

Converts common date/time strings to POSIXct (best-effort across several formats)
Converts percent-like character columns (e.g. "85%") to numeric (85)
Maps a configurable set of "NA-like" strings to NA, while keeping common survey responses like "not applicable" or "prefer not to answer" as real levels
Normalizes yes/no character columns to an ordered factor with levels c("no", "yes")
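
The NA-like mapping and yes/no steps listed above can be sketched in base R. This is an illustration of the described behavior, not FakeDataR's actual code: the helper names map_na_like() and normalize_yes_no() are hypothetical, and the exact matching rules (whitespace trimming, case-sensitive matching of na_strings, converting only when every non-missing value is yes or no) are assumptions.

```r
# Illustrative sketch only; not FakeDataR's implementation.
# map_na_like() and normalize_yes_no() are hypothetical helper names.

map_na_like <- function(x,
                        na_strings = c("", "NA", "N/A", "na", "No data", "no data"),
                        keep_as_levels = c("not applicable", "prefer not to answer",
                                           "unsure")) {
  stopifnot(is.character(x))
  trimmed <- trimws(x)
  # Protected values are matched case-insensitively (documented for keep_as_levels).
  protected <- tolower(trimmed) %in% tolower(keep_as_levels)
  # Assumption: na_strings are matched exactly (case-sensitive) after trimming.
  x[!protected & trimmed %in% na_strings] <- NA_character_
  x
}

normalize_yes_no <- function(x) {
  stopifnot(is.character(x))
  lowered <- tolower(trimws(x))
  observed <- lowered[!is.na(lowered)]
  # Assumption: convert only when every non-missing value is "yes" or "no".
  if (length(observed) == 0 || !all(observed %in% c("no", "yes"))) {
    return(x)
  }
  factor(lowered, levels = c("no", "yes"), ordered = TRUE)
}

answers <- c("Yes", "no", "N/A", "YES", "")
cleaned <- map_na_like(answers)      # "N/A" and "" become NA
yn      <- normalize_yes_no(cleaned) # ordered factor with levels no < yes

# Survey-style answers survive the NA mapping, whatever their case.
survey <- map_na_like(c("Not applicable", "NA", "unsure", "agree"))
```

Applying the NA mapping before the yes/no check means placeholder strings such as "N/A" do not block a column from being recognized as yes/no.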

Usage

prepare_input_data(
  data,
  normalize = TRUE,
  na_strings = c("", "NA", "N/A", "na", "No data", "no data"),
  keep_as_levels = c("not applicable", "prefer not to answer", "unsure"),
  percent_detect_threshold = 0.6,
  datetime_formats = c("%m/%d/%Y %H:%M:%S", "%m/%d/%Y %H:%M",
    "%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M", "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%dT%H:%M", "%m/%d/%Y", "%Y-%m-%d")
)
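
A call that follows the signature above might look like the sketch below. The data frame and its column names are made up for illustration, and no particular output is claimed here. The call is guarded so the snippet does nothing if FakeDataR is not installed.

```r
# Hypothetical example data; column names are invented for illustration.
raw <- data.frame(
  visited   = c("2024-03-01 14:30:00", "2024-03-02 09:15:00", "N/A"),
  score     = c("85%", "40%", "NA"),
  consented = c("yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Guarded so the snippet is a no-op when FakeDataR is not installed.
if (requireNamespace("FakeDataR", quietly = TRUE)) {
  prepared <- FakeDataR::prepare_input_data(
    raw,
    normalize = TRUE,
    percent_detect_threshold = 0.6
  )
  str(prepared)  # inspect the resulting column types
}
```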

Arguments

`data`	An object coercible to `data.frame` (data.frame/tibble/data.table/matrix/list, etc.)
`normalize`	Logical; whether to run the value normalization step (default `TRUE`).
`na_strings`	Character vector of strings that should become `NA` (default: `c("", "NA", "N/A", "na", "No data", "no data")`).
`keep_as_levels`	Character vector of strings that should be kept as real values (not converted to `NA`), e.g., survey choices (default: `c("not applicable", "prefer not to answer", "unsure")`). Matching is case-insensitive.
`percent_detect_threshold`	Proportion of non-missing values that must contain `%` before converting a character column to numeric (default `0.6`).
`datetime_formats`	Candidate formats tried (in order) when parsing date-time strings. The best-fitting format (the one with the most successful parses) is used. Defaults cover `mm/dd/yyyy HH:MM(:SS)?`, ISO-8601, and date-only.
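
The roles of `percent_detect_threshold` and `datetime_formats` can be illustrated with a base-R sketch. This is not the package's implementation: convert_percent() and pick_datetime_format() are hypothetical helpers, and the inclusive threshold comparison, the first-format-wins tie-breaking, and the `tz = "UTC"` default are all assumptions.

```r
# Illustrative sketch only; not FakeDataR's implementation.
# convert_percent() and pick_datetime_format() are hypothetical helper names.

convert_percent <- function(x, percent_detect_threshold = 0.6) {
  stopifnot(is.character(x))
  observed <- x[!is.na(x)]
  if (length(observed) == 0) return(x)
  # Share of non-missing values that contain a percent sign.
  share <- mean(grepl("%", observed, fixed = TRUE))
  # Assumption: the threshold is inclusive (share >= threshold converts).
  if (share < percent_detect_threshold) return(x)
  # Strip percent signs and whitespace; unparseable values become NA.
  suppressWarnings(as.numeric(gsub("[%[:space:]]", "", x)))
}

pick_datetime_format <- function(x, datetime_formats, tz = "UTC") {
  stopifnot(is.character(x), length(datetime_formats) > 0)
  # Count successful parses for each candidate format.
  successes <- vapply(datetime_formats, function(fmt) {
    sum(!is.na(as.POSIXct(strptime(x, format = fmt, tz = tz))))
  }, integer(1), USE.NAMES = FALSE)
  # which.max() returns the first maximum, so earlier formats win ties.
  # That matters because strptime() ignores trailing input, which lets
  # a date-only format "succeed" on full date-time strings too.
  best <- datetime_formats[[which.max(successes)]]
  list(format = best,
       parsed = as.POSIXct(strptime(x, format = best, tz = tz)))
}

default_formats <- c("%m/%d/%Y %H:%M:%S", "%m/%d/%Y %H:%M",
                     "%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M",
                     "%Y-%m-%dT%H:%M:%S", "%Y-%m-%dT%H:%M",
                     "%m/%d/%Y", "%Y-%m-%d")

pct   <- convert_percent(c("85%", "12.5 %", NA, "40%"))  # all observed values have "%"
mixed <- convert_percent(c("85%", "high", "low"))        # 1/3 < 0.6, left as character

stamps <- c("2024-03-01 14:30:00", "2024-03-02 09:15:00", "not a date")
picked <- pick_datetime_format(stamps, default_formats)
```

Listing the most specific formats first, as the defaults do, keeps a shorter format from being chosen when it merely matches a prefix of each string.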