View source: R/etl_qa_run_pipeline.R
etl_qa_run_pipeline | R Documentation |
This function runs a comprehensive quality assurance pipeline for ETL (Extract, Transform, Load) processes. It analyzes data for missingness, variable distributions, and optionally checks compliance with CHI (Community Health Indicators) standards.
etl_qa_run_pipeline(
connection = NULL,
data_source_type,
data_params = list(),
output_directory = NULL,
digits_mean = 3,
digits_prop = 3,
abs_threshold = 3,
rel_threshold = 2
)
connection |
A DBIConnection object. Required only when
|
data_source_type |
Character string specifying the type of data source.
Must be one of |
data_params |
List of data related parameters specific to the data source. Not all parameters are needed for all data sources. Please review the examples for details.
|
output_directory |
Character string specifying the directory where output
files will be saved. If |
digits_mean |
Integer specifying the number of decimal places for rounding
the reported mean, median, min, and max. Default is |
digits_prop |
Integer specifying the number of decimal places for rounding
proportions. Default is |
abs_threshold |
Numeric threshold for flagging absolute percentage changes
in proportions. Permissible range is [0, 100]. Default is |
rel_threshold |
Numeric threshold for flagging relative percentage changes
in means and medians. Permissible range is [0, 100]. Default is |
The function provides identical output whether using rads
, providing a
data.table that is in R's memory, or processing data directly on MS SQL Server.
The key is to correctly set up the arguments. Please refer to the examples
below for models that you should follow.
A list containing the final results from the ETL QA pipeline. Specifically, it includes:
config |
Configuration settings used for the analysis |
initial |
Initial ETL QA results |
final |
Final ETL QA results - ready for reporting |
exported |
File paths for exported tables and plots |
etl_qa_setup_config
for Step 1: Creating the config
object
etl_qa_initial_results
for Step 2: Initial ETL QA analysis
etl_qa_final_results
for Step 3: Final / formatted ETL QA analysis
etl_qa_export_results
for Step 4: Export of tables and plots
## Not run:
# The following three examples generate identical output:
# Example with RADS
qa.rads <- etl_qa_run_pipeline(
data_source_type = 'rads',
data_params = list(
function_name = 'get_data_birth',
time_var = 'chi_year',
time_range = c(2021, 2022),
cols = c('chi_age', 'race4', 'birth_weight_grams', 'birthplace_city',
'num_prev_cesarean', 'mother_date_of_birth'),
version = 'final',
kingco = FALSE,
check_chi = FALSE
),
output_directory = 'C:/temp/'
)
# Example with R dataframe
birth_data <- rads::get_data_birth(year = c(2021:2022),
kingco = F,
cols = c('chi_age', 'race4',
'birth_weight_grams', 'birthplace_city',
'num_prev_cesarean', 'chi_year',
'mother_date_of_birth'),
version = 'final')
qa.df <- etl_qa_run_pipeline(
data_source_type = 'r_dataframe',
data_params = list(
data = birth_data,
time_var = 'chi_year',
time_range = c(2021, 2022),
cols = c('chi_age', 'race4', 'birth_weight_grams', 'birthplace_city',
'num_prev_cesarean', 'mother_date_of_birth'),
check_chi = FALSE
),
output_directory = 'C:/temp/'
)
# Example with SQL Server
library(DBI)
myconnection <- rads::validate_hhsaw_key()
qa.sql <- etl_qa_run_pipeline(
data_source_type = 'sql_server',
connection = myconnection,
data_params = list(
schema_table = 'birth.final_analytic',
time_var = 'chi_year',
time_range = c(2021, 2022),
cols =c('chi_age', 'race4', 'birth_weight_grams', 'birthplace_city',
'num_prev_cesarean', 'mother_date_of_birth'),
check_chi = FALSE
),
output_directory = 'C:/temp/'
)
# Confirmation that the results are identical
all.equal(qa.rads$final, qa.df$final)
all.equal(qa.rads$final, qa.sql$final)
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.