check_missing_segments: Check Missing Data by Segments

check_missing_segments    R Documentation

Check Missing Data by Segments

Description

Analyzes data completeness at the segment level. A segment is a group of variables defined in the 'Segment_Names' column of the metadata.

Usage

check_missing_segments(S_data, M_data, Show_Plot = FALSE)

Arguments

S_data

A data frame containing the source data to be checked.

M_data

A metadata data frame containing the validation rules, including a 'Segment_Names' column.

Show_Plot

A logical value. If 'TRUE', a stacked bar chart showing the proportions of complete, incomplete, and fully missing rows for each segment is displayed. Defaults to 'FALSE'.

Details

For each segment, this function evaluates every row of the source data ('S_data') and classifies it into one of three categories:

  • **Complete:** The row has a non-missing value for every variable in the segment.

  • **Incomplete:** The row has an 'NA' value for at least one, but not all, of the variables in the segment.

  • **Fully Missing:** All variables belonging to the segment are 'NA' for that row.
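
The three categories can be illustrated with a minimal sketch. It assumes that classification depends only on the 'NA' pattern within a segment and ignores the dependency rules described below; 'classify_segment_rows' is a hypothetical helper written for this illustration, not a function of the package.

```r
# Hypothetical helper, not part of the package: label each row of one segment
# from its NA pattern alone (dependency rules are deliberately ignored).
classify_segment_rows <- function(data, vars) {
  n_missing <- rowSums(is.na(data[, vars, drop = FALSE]))
  unname(ifelse(
    n_missing == 0, "Complete",
    ifelse(n_missing == length(vars), "Fully Missing", "Incomplete")
  ))
}

# Toy data for a three-variable segment
toy <- data.frame(
  Has_Insurance      = c("Yes", NA, "No"),
  Insurance_Provider = c("Provider A", NA, NA),
  Annual_Checkup     = c("Yes", NA, "No"),
  stringsAsFactors   = FALSE
)

status <- classify_segment_rows(toy, names(toy))
print(status)
```

Row 1 has no missing values, row 2 is missing everything, and row 3 is missing one of three variables, so the rows fall into the three categories in that order.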

In addition to 'Segment_Names', the metadata ('M_data') must contain the following columns to define the rules:

  • **VARIABLE:** The name of the variable in the source data ('S_data') to be checked for missingness.

  • **VARIABLE_Code:** A unique numeric or character code assigned to each variable for identification and dependency mapping.

  • **Dependency:** Specifies the dependency of the variable on another variable. A value of '0' indicates no dependency, while other values indicate the 'VARIABLE_Code' of the parent variable.

  • **Dep_Value:** The specific value or condition of the parent variable (as referenced in 'Dependency') that must be met for the current variable to be applicable. Use '"ANY"' if the value of the parent variable can be any non-missing value.
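
One way to read the 'Dependency' and 'Dep_Value' columns is sketched below. This is an assumption about how the rules could be interpreted, not the package's internal logic; 'is_applicable', 'meta', and 'src' are hypothetical names used only here.

```r
# Hypothetical helper: returns TRUE for rows where `var` is applicable,
# based on the parent variable referenced in the metadata.
is_applicable <- function(var, S_data, M_data) {
  rule <- M_data[M_data$VARIABLE == var, ]
  if (rule$Dependency == 0) {
    return(rep(TRUE, nrow(S_data)))  # no parent: always applicable
  }
  # Look up the parent variable by its VARIABLE_Code
  parent_name <- M_data$VARIABLE[M_data$VARIABLE_Code == rule$Dependency]
  parent <- S_data[[parent_name]]
  if (rule$Dep_Value == "ANY") {
    !is.na(parent)                   # any non-missing parent value qualifies
  } else {
    !is.na(parent) & as.character(parent) == rule$Dep_Value
  }
}

meta <- data.frame(
  VARIABLE         = c("Has_Job", "Job_Title", "Job_Satisfaction"),
  VARIABLE_Code    = 1:3,
  Dependency       = c(0, 1, 2),
  Dep_Value        = c("0", "Yes", "ANY"),
  stringsAsFactors = FALSE
)
src <- data.frame(
  Has_Job          = c("Yes", "No", NA),
  Job_Title        = c("Analyst", NA, NA),
  Job_Satisfaction = c(8, NA, NA),
  stringsAsFactors = FALSE
)

job_title_applicable <- is_applicable("Job_Title", src, meta)
job_satisfaction_applicable <- is_applicable("Job_Satisfaction", src, meta)
```

Under this reading, 'Job_Title' is applicable only where 'Has_Job' is '"Yes"', and 'Job_Satisfaction' only where 'Job_Title' is non-missing.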

The function returns a summary table with counts and percentages for each category per segment.

Value

A 'data.frame' summarizing the analysis for each segment, with columns: 'SEGMENT', 'Total_Rows', 'Complete_Count', 'Incomplete_Count', 'Missing_Count', 'Percent_Complete', 'Percent_Incomplete', and 'Percent_Missing'.
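
The sketch below shows how a report with these columns might be used downstream. The data frame is built by hand with illustrative numbers and is not output from 'check_missing_segments()'. It also assumes that the three counts sum to 'Total_Rows' and that percentages are on a 0 to 100 scale, neither of which is stated above.

```r
# Illustrative values only; NOT actual output of check_missing_segments().
report <- data.frame(
  SEGMENT          = c("Demographic", "Employment"),
  Total_Rows       = c(10, 10),
  Complete_Count   = c(9, 4),
  Incomplete_Count = c(0, 5),
  Missing_Count    = c(1, 1),
  stringsAsFactors = FALSE
)

# Assumption: percentages are expressed on a 0-100 scale.
report$Percent_Complete   <- 100 * report$Complete_Count / report$Total_Rows
report$Percent_Incomplete <- 100 * report$Incomplete_Count / report$Total_Rows
report$Percent_Missing    <- 100 * report$Missing_Count / report$Total_Rows

# Assumption: every row falls into exactly one category.
stopifnot(with(
  report,
  Complete_Count + Incomplete_Count + Missing_Count == Total_Rows
))

# Flag segments whose share of complete rows is below a chosen threshold
low_segments <- report$SEGMENT[report$Percent_Complete < 80]
print(low_segments)
```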

See Also

Other missing data checks: check_missing_itemwise(), check_missing_record()

Examples

# 1. Define the sample metadata
Meta_data <- data.frame(
  stringsAsFactors = FALSE,
  VARIABLE = c(
    "ID", "Gender", "Age", "Has_Job", "Job_Title",
    "Job_Satisfaction", "Last_Promotion_Year", "Has_Insurance",
    "Insurance_Provider", "Annual_Checkup"
  ),
  VARIABLE_Code = 1:10,
  Var_order = 1:10,
  Segment_Names = c(
    "Demographic", "Demographic", "Demographic", "Employment", "Employment",
    "Employment", "Employment", "Health", "Health", "Health"
  ),
  Dependency = c(0, 0, 0, 0, 4, 5, 5, 0, 8, 8),
  Dep_Value = c(
    "0", "0", "0", "0", "Yes", "ANY", "ANY", "0", "Yes", "Yes"
  )
)

# 2. Define the sample source data
Source_data <- data.frame(
  ID = 1:10,
  Gender = c(
    "Male", NA, "Male", "Female", "Male",
    "Female", "Male", "Female", "Male", "Female"
  ),
  Age = c(25, NA, 31, 55, 29, 38, 45, 22, 60, 33),
  Has_Job = c("Yes", NA, "No", "Yes", "Yes", "No", "Yes", "Yes", "No", "Yes"),
  Job_Title = c(
    NA, NA, NA, "Analyst", NA,
    "Student", "Director", "Engineer", NA, "Designer"
  ),
  Job_Satisfaction = c(5, NA, NA, 8, 7, NA, 10, 9, NA, 6),
  Last_Promotion_Year = c(2020, NA, 2021, NA, NA, NA, 2024, 2022, NA, 2023),
  Has_Insurance = c("Yes", NA, "Yes", "Yes", "No", "Yes", "Yes", "No", "No", "Yes"),
  Insurance_Provider = c(
    "Provider A", NA, "Provider B", "Provider C", "Provider D",
    NA, "Provider E", NA, NA, "Provider F"
  ),
  Annual_Checkup = c("Yes", NA, "No", "Yes", NA, "Yes", "Yes", "No", NA, "Yes")
)

# 3. Run the segment check with plot
segment_report <- check_missing_segments(
  S_data = Source_data, M_data = Meta_data, Show_Plot = TRUE
)
print(segment_report)

DQA documentation built on April 20, 2026, 9:06 a.m.