check_missing_itemwise: Check Missing Data Item-wise with Dependency Logic
In DQA: Data Quality Assessment Tools

check_missing_itemwise

R Documentation

Description

Analyzes missing data ('NA' values) for each variable (item-wise) while taking dependencies between variables into account. Rather than simply counting 'NA' values, the function classifies each value as Completed, Missing, Jump, or Unexpected according to rules defined in the metadata.

Usage

check_missing_itemwise(
  S_data,
  M_data,
  var_select = 1:nrow(M_data),
  Show_Plot = FALSE
)

Arguments

`S_data`	A data frame containing the source data to be checked.
`M_data`	A metadata data frame containing the validation rules.
`var_select`	A numeric or character vector specifying which variables to process: either row indices of 'M_data' or names from its 'VARIABLE' column. Defaults to all variables.
`Show_Plot`	A logical value. If 'TRUE', a ggplot bar chart showing the missingness percentage for each variable is displayed. Defaults to 'FALSE'.

Details

This function classifies each row for a given variable into one of four states:

**Completed:** The value is present where it is expected.
**Missing:** The value is 'NA' where it is expected (based on a parent condition).
**Jump:** The value is 'NA' because the parent condition is not met (i.e., the question was correctly skipped).
**Unexpected:** The value is present where it was *not* expected (a data quality issue).
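
The four states can be illustrated with a minimal sketch. The helper `classify_item()` below is hypothetical and is not part of the DQA package; it assumes that a logical vector `applicable` already says whether a value is expected in each row.

```r
# Hypothetical helper (not from the DQA package): classify one variable's rows
# given whether a value is expected ('applicable') in each row.
classify_item <- function(value, applicable) {
  present <- !is.na(value)
  state <- character(length(value))
  state[applicable & present]   <- "Completed"   # expected and present
  state[applicable & !present]  <- "Missing"     # expected but NA
  state[!applicable & !present] <- "Jump"        # correctly skipped
  state[!applicable & present]  <- "Unexpected"  # present but not expected
  state
}

has_job   <- c("Yes", "No", "Yes", "No")
job_title <- c("Manager", NA, NA, "Student")

# Assumption for this sketch: Job_Title is expected only when Has_Job is "Yes".
applicable <- !is.na(has_job) & has_job == "Yes"
states <- classify_item(job_title, applicable)
print(states)
```

The four rows cover one case each: a present expected value, a correct skip, a missing expected value, and a value that should not have been entered.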

The metadata ('M_data') must contain the following columns to define the rules:

**VARIABLE:** The name of the variable in the source data ('S_data') to be checked for missingness.
**VARIABLE_Code:** A unique numeric or character code assigned to each variable for identification and dependency mapping.
**Dependency:** Specifies the dependency of the variable on another variable. A value of '0' indicates no dependency, while other values indicate the 'VARIABLE_Code' of the parent variable.
**Dep_Value:** The specific value or condition of the parent variable (as referenced in 'Dependency') that must be met for the current variable to be applicable. Use '"ANY"' if the value of the parent variable can be any non-missing value.
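
As a rough sketch of how these columns could be combined, the hypothetical helper `is_applicable()` below (not part of the DQA package) resolves a variable's parent through 'VARIABLE_Code' and tests the parent's value against 'Dep_Value'. It assumes that a missing parent value counts as "condition not met" and checks only one level of dependency; the package's actual handling of these cases may differ.

```r
# Hypothetical helper (not from the DQA package): for each row, is a value
# expected for 'var_name' according to the metadata rules?
is_applicable <- function(var_name, S_data, M_data) {
  rule <- M_data[M_data$VARIABLE == var_name, ]
  # Dependency 0: no parent, so the value is expected in every row.
  if (rule$Dependency == 0) {
    return(rep(TRUE, nrow(S_data)))
  }
  # Otherwise Dependency holds the VARIABLE_Code of the parent variable.
  parent_name <- M_data$VARIABLE[M_data$VARIABLE_Code == rule$Dependency]
  parent <- S_data[[parent_name]]
  if (rule$Dep_Value == "ANY") {
    !is.na(parent)  # any non-missing parent value satisfies the rule
  } else {
    # Assumption: a missing parent value means the condition is not met.
    !is.na(parent) & as.character(parent) == rule$Dep_Value
  }
}

M <- data.frame(
  VARIABLE = c("Has_Job", "Job_Title", "Job_Satisfaction"),
  VARIABLE_Code = 1:3,
  Dependency = c(0, 1, 2),
  Dep_Value = c("0", "Yes", "ANY"),
  stringsAsFactors = FALSE
)
S <- data.frame(
  Has_Job = c("Yes", "No", NA),
  Job_Title = c("Manager", NA, NA),
  Job_Satisfaction = c(9, NA, NA),
  stringsAsFactors = FALSE
)

app_title <- is_applicable("Job_Title", S, M)         # depends on Has_Job == "Yes"
app_sat   <- is_applicable("Job_Satisfaction", S, M)  # depends on any Job_Title
```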

Value

A 'data.table' summarizing the missing data analysis for each variable, with columns such as 'VARIABLE', 'Missing_Count', 'Jump_Count', 'Unexpected_Count', 'Total_Applicable' (the number of rows in which the variable's value was expected according to the metadata rules), 'Percent_Complete', and 'Percent_Missing'.
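
The relationship between the counts and the percentages is not spelled out here. The sketch below shows one plausible reading, assuming that 'Total_Applicable' is the number of Completed plus Missing rows and that both percentages are taken relative to it; the package's actual formulas may differ.

```r
# Per-row states for a single variable (illustrative values only).
states <- c("Completed", "Completed", "Missing", "Jump", "Unexpected", "Completed")

completed_count  <- sum(states == "Completed")
missing_count    <- sum(states == "Missing")
jump_count       <- sum(states == "Jump")
unexpected_count <- sum(states == "Unexpected")

# Assumption: only rows where a value was expected enter the denominator.
total_applicable <- completed_count + missing_count
percent_complete <- 100 * completed_count / total_applicable
percent_missing  <- 100 * missing_count / total_applicable
```

Under this reading, Jump and Unexpected rows do not affect the percentages, so a correctly skipped question does not lower a variable's completeness.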

Examples

library(DQA)

# 1. Define comprehensive sample data and metadata
Meta_data <- data.frame(
  stringsAsFactors = FALSE,
  VARIABLE = c(
    "ID", "Gender", "Age", "Has_Job", "Job_Title",
    "Job_Satisfaction", "Last_Promotion_Year", "Has_Insurance",
    "Insurance_Provider", "Annual_Checkup"
  ),
  VARIABLE_Code = 1:10,
  Var_order = 1:10,
  Segment_Names = c(
    "Demographic", "Demographic", "Demographic", "Employment", "Employment",
    "Employment", "Employment", "Health", "Health", "Health"
  ),
  Dependency = c(0, 0, 0, 0, 4, 5, 5, 0, 8, 8),
  Dep_Value = c(
    "0", "0", "0", "0", "Yes", "ANY", "ANY", "0", "Yes", "Yes"
  )
)

Source_data <- data.frame(
  ID = 1:10,
  Gender = c("Male", "Female", "Male", "Female", "Male",
             "Female", "Male", "Female", "Male", "Female"),
  Age = c(25, 42, 31, 55, 29, 38, 45, 22, 60, 33),
  Has_Job = c(NA, "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "No", "Yes"),
  Job_Title = c(NA, "Manager", NA, "Analyst", NA, "Student",
                "Director", "Engineer", NA, "Designer"),
  Job_Satisfaction = c(5, 9, NA, 8, 7, NA, 10, 9, NA, 6),
  Last_Promotion_Year = c(2020, 2021, NA, NA, NA, NA, 2024, 2022, NA, 2023),
  Has_Insurance = c("Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "No", "No", "Yes"),
  Insurance_Provider = c("Provider A", NA, "Provider B", "Provider C",
                         "Provider D", NA, "Provider E", NA, NA, "Provider F"),
  Annual_Checkup = c("Yes", NA, "No", "Yes", NA, "Yes", "Yes", "No", NA, "Yes")
)

# 2. Run the item-wise check with plot
item_report <- check_missing_itemwise(
  S_data = Source_data, M_data = Meta_data, Show_Plot = TRUE
)
print(item_report)

DQA documentation built on April 20, 2026, 9:06 a.m.