View source: R/correctness_check.R
| correctness_check | R Documentation |
This function validates a data frame against a set of correctness rules specified in another data frame. It allows for complex validation operations, comparison with reference data, and detailed reporting.
correctness_check(
S_data,
M_data,
Result = FALSE,
show_column = NULL,
date_parser_fun = smart_to_gregorian_vec,
golden_data = NULL,
key_column = NULL,
external_data = NULL,
var_select = "all",
batch_size = 1000,
verbose = FALSE
)
S_data |
A data frame containing the data to be validated. |
M_data |
A data frame containing the validation rules. Must have at least the following columns:
|
Result |
Logical. If |
show_column |
Character vector. When |
date_parser_fun |
Function to convert date strings to Date objects. Default is |
golden_data |
Optional data frame or list containing reference data for validation.
Accessible within rules via the |
key_column |
Character string specifying the column name that links rows in S_data to corresponding rows in golden_data. Required when comparing individual rows with golden_data. |
external_data |
Optional list or data frame containing additional data for validation rules. |
var_select |
Character vector or numeric indices specifying which variables from M_data to validate. By default, it validates all variables. |
batch_size |
integer. Number of rows to process in each batch (for efficiency). |
verbose |
logical. If TRUE, prints progress messages. |
The function evaluates each rule specified in M_data against the corresponding data in S_data. Rules are R expressions written as strings, evaluated in an environment where:
Variables from S_data are available directly by name
val refers to the current variable being validated
GOLDEN provides access to reference data (when golden_data is provided)
Type conversion is applied to variables in S_data based on the TYPE column in M_data:
"date": Values are converted using the date_parser_fun
"numeric": Values are converted to numeric
Other types: No conversion is applied
Special handling for date comparisons is provided, including automatic wrapping of GOLDEN references when comparing dates.
If Result=FALSE (default): A data frame with one row per validated variable containing:
VARIABLE: Variable name
Condition_Met: Count of rows meeting the condition
Condition_Not_Met: Count of rows not meeting the condition
NA_Count: Count of rows where validation produced NA
Total_Applicable: Count of non-NA validation results
Total_Rows: Total number of rows in S_data
Percent_Met: Percentage of applicable rows meeting the condition
Percent_Not_Met: Percentage of applicable rows not meeting the condition
Error_Type: Value from Correctness_Error_Type column in M_data
If Result=TRUE: A data frame with one row per row in S_data, containing:
One column per validated variable with logical values (TRUE/FALSE/NA)
Any additional columns specified in show_column
Authorized_drug<-data.frame(
Drug_ID = 1:10,
Drug_Name = c("Atorvastatin", "Metformin", "Amlodipine", "Omeprazole", "Aspirin",
"Levothyroxine", "Sertraline", "Pantoprazole", "Losartan", "ASA"),
stringsAsFactors = FALSE
)
golde<-data.frame(
National_code = c("123", "456", "789","545","4454","554","665"),
LastName = c("Bahman","Johnson","Williams","Brown","Jones","Garcia","Miller"),
Certificate_Expiry = c("1404-07-01", "2030-01-12", "2025-01-11",
"1404-06-28","2025-09-19",NA,NA),
Blood_type = c("A-","B+","AB","A+","O-","O+","AB-"),
stringsAsFactors = FALSE
)
S_data <- data.frame(
National_code = c("123", "1456", "789","545","4454","554"),
LastName = c("Aliyar","Johnson","Williams","Brown","Jones","Garcia"),
VisitDate = c("2025-09-23", "2021-01-10", "2021-01-03","1404-06-28","1404-07-28",NA),
Test_date = c("1404-07-01", "2021-01-09", "2021-01-14","1404-06-29","2025-09-19",NA),
Certificate_validity = c("2025-09-23", "2025-01-12", "2025-02-11","1403-06-28","2025-09-19",NA),
Systolic_Reading1 = c(110, NA, 145, 125,114,NA),
Systolic_Reading2 = c(125, 150, NA, 110,100,NA),
Prescription_drug= c("Atorvastatin", "Metformin", "Amlodipine",
"Omeprazole", "Aspirin","Metoprolol"),
Blood_type = c("A-","B+","AB","A+","O-","O+"),
Height = c(178,195,165,NA,155,1.80),
stringsAsFactors = FALSE
)
M_data <- data.frame(
VARIABLE = c("National_code", "Certificate_validity", "VisitDate","Test_date",
"LastName","Systolic_Reading1","Systolic_Reading2",
"Prescription_drug","Blood_type","Height"),
Correctness_Rule = c(
"National_code %in% GOLDEN$National_code",
"val <= GOLDEN$Certificate_Expiry",
"((val >= '1404-06-01' & val <= '1404-06-31') | val == as.Date('2021-01-02'))",
"val != VisitDate",
"val %in% GOLDEN$LastName",
"",
"",
"val %in% Authorized_drug$Drug_Name",
"val %in% GOLDEN$Blood_type",
""),
TYPE=c("numeric","date","date","date","character","numeric",
"numeric","character","character","numeric"),
Correctness_Error_Type=c("Error",NA,"Warning","Error",NA,NA,NA,NA,"Error","Warning"),
stringsAsFactors = FALSE
)
result <- correctness_check(
S_data = S_data,
M_data = M_data,
golden_data = golde,
key_column = c("National_code"),
Result =FALSE,
external_data = Authorized_drug
)
print(result)
#
result <- correctness_check(
S_data = S_data,
M_data = M_data,
golden_data = golde,
#key_column = c("National_code"),#If you do not select a key, you can use Gold Data as a
#list and your logical rules will be NA.
Result =TRUE,
external_data = Authorized_drug
)
print(result)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.