con_inadmissible_vocabulary: Detects variable levels not specified in standardized...
In dataquieR: Data Quality in Epidemiological Research

con_inadmissible_vocabulary

R Documentation

Detects variable levels not specified in standardized vocabulary

Description

For each categorical variable, value lists should be defined in the metadata. This implementation checks whether all levels observed in the study data are valid.

Indicator

Usage

con_inadmissible_vocabulary(
  resp_vars = NULL,
  study_data,
  label_col,
  item_level = "item_level",
  threshold_value = 0,
  meta_data = item_level,
  meta_data_v2
)

Arguments

`resp_vars`	variable list. The names of the measurement variables.
`study_data`	data.frame. The data frame that contains the measurements.
`label_col`	variable attribute. The name of the column in the metadata with labels of variables.
`item_level`	data.frame. The data frame that contains metadata attributes of study data.
`threshold_value`	numeric. A numerical value ranging from 0 to 100.
`meta_data`	data.frame. Old name for `item_level`.
`meta_data_v2`	character. Path to a workbook-like metadata file, see `prep_load_workbook_like_file` for details. ALL LOADED DATAFRAMES WILL BE PURGED, using `prep_purge_data_frame_cache`, if you specify `meta_data_v2`.

Details

Algorithm of this implementation:

Remove missing codes from the study data (if defined in the metadata).
Interpret the variable-specific VALUE_LABELS as supplied in the metadata.
Identify measurements that do not correspond to the expected categories. Two output data frames are generated for this purpose:
- one on the level of observations, flagging each undefined category, and
- a summary table for each variable.
Remove values that do not correspond to defined categories, yielding a data frame of modified study data.
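
The steps above can be illustrated with a minimal base R sketch. It is not the dataquieR implementation: the vocabulary, the missing code, and all object names (`vocabulary`, `missing_codes`, `flagged_data`, `summary_data`, `modified`) are assumptions made for illustration, and only a single variable is handled.

```r
# Hypothetical vocabulary; in dataquieR it would be derived from the metadata.
vocabulary <- c("B050", "B051", "B052")
# Hypothetical missing code; in dataquieR missing codes come from the metadata.
missing_codes <- c("99999")

# Toy study data: one inadmissible value ("B999") and one missing code.
study <- data.frame(DIAG = c("B050", "B999", "B051", "99999"),
                    stringsAsFactors = FALSE)

# Remove missing codes by replacing them with NA.
x <- study$DIAG
x[x %in% missing_codes] <- NA

# Identify non-missing measurements that are not in the vocabulary.
flagged <- !is.na(x) & !(x %in% vocabulary)

# Observation-level output: one flag per row.
flagged_data <- data.frame(DIAG = x, DIAG_INADMISSIBLE = flagged,
                           stringsAsFactors = FALSE)

# Variable-level summary, using column names borrowed from SummaryData.
summary_data <- data.frame(
  Variables = "DIAG",
  NON_MATCHING = paste(sort(unique(x[flagged])), collapse = ", "),
  NON_MATCHING_N = sum(flagged),
  stringsAsFactors = FALSE
)

# Modified data: inadmissible values are set to NA.
modified <- x
modified[flagged] <- NA
```

Treating `NA` as "not flagged" keeps missing values out of the count of inadmissible categories; whether dataquieR handles every edge case the same way is not stated here.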

Value

a list with:

SummaryData: data frame summarizing inadmissible categories with the columns:
- Variables: variable name/label
- OBSERVED_CATEGORIES: the categories observed in the study data
- DEFINED_CATEGORIES: the categories defined in the metadata
- NON_MATCHING: the categories observed but not defined
- NON_MATCHING_N: the number of observations with categories not defined
- NON_MATCHING_N_PER_CATEGORY: the number of observations for each of the unexpected categories
- GRADING: indicator TRUE/FALSE if inadmissible categorical values were observed (more than indicated by the threshold_value)
SummaryTable: data frame for the dataquieR pipeline reporting the number and percentage of inadmissible categorical values
ModifiedStudyData: study data having inadmissible categories removed
FlaggedStudyData: study data having cases with inadmissible categories flagged
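
One plausible reading of how `threshold_value` relates to `GRADING` is sketched below. The helper `grade_inadmissible` is hypothetical and not part of dataquieR, and using the number of non-missing observations as the denominator is an assumption; the package may compute the percentage differently.

```r
# Hypothetical helper: TRUE if the percentage of inadmissible values
# exceeds the threshold (threshold_value is given on a 0-100 scale).
grade_inadmissible <- function(n_non_matching, n_observed, threshold_value = 0) {
  stopifnot(n_observed > 0, threshold_value >= 0, threshold_value <= 100)
  pct <- 100 * n_non_matching / n_observed  # assumed denominator: non-missing observations
  pct > threshold_value
}

grade_inadmissible(1, 4)        # 25 % exceeds the default threshold of 0
grade_inadmissible(1, 4, 30)    # 25 % does not exceed a threshold of 30
```

With the default `threshold_value = 0`, a single inadmissible value is enough to set the flag under this reading.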

Examples

## Not run: 
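# Toy study data: diagnosis codes and medication codes
sdt <- data.frame(DIAG = c("B050", "B051", "B052", "B999"),
                  MED0 = c("S01XA28", "N07XX18", "ABC", NA), stringsAsFactors = FALSE)
# Item-level metadata: each variable refers to a standardized vocabulary
mdt <- tibble::tribble(
~ VAR_NAMES, ~ DATA_TYPE, ~ STANDARDIZED_VOCABULARY_TABLE, ~ SCALE_LEVEL, ~ LABEL,
"DIAG", "string", "<ICD10>", "nominal", "Diagnosis",
"MED0", "string", "<ATC>", "nominal", "Medication"
)
# resp_vars = NULL, study_data = sdt, item_level = mdt
con_inadmissible_vocabulary(NULL, sdt, mdt, label_col = LABEL)
# Load metadata from a workbook-like file and fetch its item-level sheet
prep_load_workbook_like_file("meta_data_v2")
il <- prep_get_data_frame("item_level")
# Assign a standardized vocabulary to the variable in row 11
il$STANDARDIZED_VOCABULARY_TABLE[[11]] <- "<ICD10GM>"
il$DATA_TYPE[[11]] <- DATA_TYPES$INTEGER
il$SCALE_LEVEL[[11]] <- SCALE_LEVELS$NOMINAL
# Register the modified item-level metadata
prep_add_data_frames(item_level = il)
# Report on the consistency dimension
r <- dq_report2("study_data", dimensions = "con")
# Same report with the dataquieR.non_disclosure option set
r <- dq_report2("study_data", dimensions = "con",
     advanced_options = list(dataquieR.non_disclosure = TRUE))
r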

## End(Not run)

dataquieR documentation built on May 12, 2026, 1:06 a.m.