etl_qa_run_pipeline: Run ETL Quality Assurance Pipeline
In PHSKC-APDE/apde: Common functions used in APDE's work

etl_qa_run_pipeline

R Documentation

Run ETL Quality Assurance Pipeline

Description

This function runs a comprehensive quality assurance pipeline for ETL (Extract, Transform, Load) processes. It analyzes data for missingness, variable distributions, and optionally checks compliance with CHI (Community Health Indicators) standards.

Usage

etl_qa_run_pipeline(
  connection = NULL,
  data_source_type,
  data_params = list(),
  output_directory = NULL,
  digits_mean = 3,
  digits_prop = 3,
  abs_threshold = 3,
  rel_threshold = 2
)

Arguments

`connection`	A DBIConnection object. Required only when `data_source_type = 'sql_server'`
`data_source_type`	Character string specifying the type of data source. Must be one of `'r_dataframe'`, `'sql_server'`, or `'rads'`.
`data_params`	List of data related parameters specific to the data source. Not all parameters are needed for all data sources. Please review the examples for details. `check_chi`: Logical vector of length 1. When `check_chi = TRUE`, function will add any available CHI related variables to `cols` and will assess whether their values align with standards in `rads.data::misc_chi_byvars`. Default is `FALSE` `cols`: Character vector specifying the column names to analyze, e.g., `cols = c('race4', 'birth_weight_grams', 'birthplace_city')` `time_range`: Character vector of length 2 specifying the start and end of the time range, e.g., `time_range = c(2015, 2024)` `time_var`: Character string specifying the time interval variable, e.g., `time_var = 'chi_year'` `data`: Name of a data.frame or data.table that you want to assess with this function, e.g., 'data = myDataTable'. Required only when `data_source_type = 'r_dataframe'`. `function_name`: Character string specifying the relevant `rads::get_data_xxx` function, e.g., `function_name = 'get_data_birth'`. Required only when `data_source_type = 'rads'` `kingco`: Logical vector of length 1. Identifies whether you want limit the data to King County. Required only when `data_source_type = 'rads'`. Default is `kingco = TRUE` `version`: Character string specifying either `'final'` or `'stage'`. Required only when `data_source_type = 'rads'`. Default is `version = 'stage'` `schema_table`: The name of the schema and table to be accessed within the SQL Server `connection`. Must be in the form `myschema.mytable`, with a period as a separator. Required only when `data_source_type = 'sql_server'`
`output_directory`	Character string specifying the directory where output files will be saved. If `NULL`, the current working directory is used. Default is `output_directory = NULL`.
`digits_mean`	Integer specifying the number of decimal places for rounding the reported mean, median, min, and max. Default is `digits_mean = 3`.
`digits_prop`	Integer specifying the number of decimal places for rounding proportions. Default is `digits_prop = 3`.
`abs_threshold`	Numeric threshold for flagging absolute percentage changes in proportions. Permissible range is [0, 100]. Default is `abs_threshold = 3`.
`rel_threshold`	Numeric threshold for flagging relative percentage changes in means and medians. Permissible range is [0, 100]. Default is `rel_threshold = 2`.

Details

The function provides identical output whether using rads, providing a data.table that is in R's memory, or processing data directly on MS SQL Server. The key is to correctly set up the arguments. Please refer to the examples below for models that you should follow.

Value

A list containing the final results from the ETL QA pipeline. Specifically, it includes:

`config`	Configuration settings used for the analysis
`initial`	Initial ETL QA results
`final`	Final ETL QA results - ready for reporting
`exported`	File paths for exported tables and plots

Examples

## Not run: 
# The following three examples generate identical output:

# Example with RADS 
qa.rads <- etl_qa_run_pipeline(
  data_source_type = 'rads',
  data_params = list(
    function_name = 'get_data_birth',
    time_var = 'chi_year',
    time_range = c(2021, 2022),
    cols = c('chi_age', 'race4', 'birth_weight_grams', 'birthplace_city', 
             'num_prev_cesarean', 'mother_date_of_birth'),
    version = 'final', 
    kingco = FALSE, 
    check_chi = FALSE
  ), 
  output_directory = 'C:/temp/'
)


# Example with R dataframe
birth_data <- rads::get_data_birth(year = c(2021:2022), 
                                   kingco = F, 
                                   cols = c('chi_age', 'race4', 
                                   'birth_weight_grams', 'birthplace_city', 
                                   'num_prev_cesarean', 'chi_year', 
                                   'mother_date_of_birth'), 
                                   version = 'final')
                                   
qa.df <- etl_qa_run_pipeline(
  data_source_type = 'r_dataframe',
  data_params = list(
    data = birth_data,
    time_var = 'chi_year',
    time_range = c(2021, 2022),
    cols = c('chi_age', 'race4', 'birth_weight_grams', 'birthplace_city', 
             'num_prev_cesarean', 'mother_date_of_birth'), 
    check_chi = FALSE
  ), 
  output_directory = 'C:/temp/'
)


# Example with SQL Server
library(DBI)
myconnection <- rads::validate_hhsaw_key()
qa.sql <- etl_qa_run_pipeline(
  data_source_type = 'sql_server',
  connection = myconnection,
  data_params = list(
    schema_table = 'birth.final_analytic',
    time_var = 'chi_year',
    time_range = c(2021, 2022),
    cols =c('chi_age', 'race4', 'birth_weight_grams', 'birthplace_city', 
            'num_prev_cesarean', 'mother_date_of_birth'), 
    check_chi = FALSE
  ), 
  output_directory = 'C:/temp/'
)

# Confirmation that the results are identical
all.equal(qa.rads$final, qa.df$final)
all.equal(qa.rads$final, qa.sql$final)


## End(Not run)

PHSKC-APDE/apde documentation built on April 14, 2025, 10:46 a.m.