etl_qa_run_pipeline: Run ETL Quality Assurance Pipeline

View source: R/etl_qa_run_pipeline.R

etl_qa_run_pipelineR Documentation

Run ETL Quality Assurance Pipeline

Description

This function runs a comprehensive quality assurance pipeline for ETL (Extract, Transform, Load) processes. It analyzes data for missingness, variable distributions, and optionally checks compliance with CHI (Community Health Indicators) standards.

Usage

etl_qa_run_pipeline(
  connection = NULL,
  data_source_type,
  data_params = list(),
  output_directory = NULL,
  digits_mean = 3,
  digits_prop = 3,
  abs_threshold = 3,
  rel_threshold = 2
)

Arguments

connection

A DBIConnection object. Required only when data_source_type = 'sql_server'

data_source_type

Character string specifying the type of data source. Must be one of 'r_dataframe', 'sql_server', or 'rads'.

data_params

List of data related parameters specific to the data source. Not all parameters are needed for all data sources. Please review the examples for details.

  • check_chi: Logical vector of length 1. When check_chi = TRUE, function will add any available CHI related variables to cols and will assess whether their values align with standards in rads.data::misc_chi_byvars. Default is FALSE

  • cols: Character vector specifying the column names to analyze, e.g., cols = c('race4', 'birth_weight_grams', 'birthplace_city')

  • time_range: Character vector of length 2 specifying the start and end of the time range, e.g., time_range = c(2015, 2024)

  • time_var: Character string specifying the time interval variable, e.g., time_var = 'chi_year'

  • data: Name of a data.frame or data.table that you want to assess with this function, e.g., 'data = myDataTable'. Required only when data_source_type = 'r_dataframe'.

  • function_name: Character string specifying the relevant rads::get_data_xxx function, e.g., function_name = 'get_data_birth'. Required only when data_source_type = 'rads'

  • kingco: Logical vector of length 1. Identifies whether you want limit the data to King County. Required only when data_source_type = 'rads'. Default is kingco = TRUE

  • version: Character string specifying either 'final' or 'stage'. Required only when data_source_type = 'rads'. Default is version = 'stage'

  • schema_table: The name of the schema and table to be accessed within the SQL Server connection. Must be in the form myschema.mytable, with a period as a separator. Required only when data_source_type = 'sql_server'

output_directory

Character string specifying the directory where output files will be saved. If NULL, the current working directory is used. Default is output_directory = NULL.

digits_mean

Integer specifying the number of decimal places for rounding the reported mean, median, min, and max. Default is digits_mean = 3.

digits_prop

Integer specifying the number of decimal places for rounding proportions. Default is digits_prop = 3.

abs_threshold

Numeric threshold for flagging absolute percentage changes in proportions. Permissible range is [0, 100]. Default is abs_threshold = 3.

rel_threshold

Numeric threshold for flagging relative percentage changes in means and medians. Permissible range is [0, 100]. Default is rel_threshold = 2.

Details

The function provides identical output whether using rads, providing a data.table that is in R's memory, or processing data directly on MS SQL Server. The key is to correctly set up the arguments. Please refer to the examples below for models that you should follow.

Value

A list containing the final results from the ETL QA pipeline. Specifically, it includes:

config

Configuration settings used for the analysis

initial

Initial ETL QA results

final

Final ETL QA results - ready for reporting

exported

File paths for exported tables and plots

See Also

  • etl_qa_setup_config for Step 1: Creating the config object

  • etl_qa_initial_results for Step 2: Initial ETL QA analysis

  • etl_qa_final_results for Step 3: Final / formatted ETL QA analysis

  • etl_qa_export_results for Step 4: Export of tables and plots

Examples

## Not run: 
# The following three examples generate identical output:

# Example with RADS 
qa.rads <- etl_qa_run_pipeline(
  data_source_type = 'rads',
  data_params = list(
    function_name = 'get_data_birth',
    time_var = 'chi_year',
    time_range = c(2021, 2022),
    cols = c('chi_age', 'race4', 'birth_weight_grams', 'birthplace_city', 
             'num_prev_cesarean', 'mother_date_of_birth'),
    version = 'final', 
    kingco = FALSE, 
    check_chi = FALSE
  ), 
  output_directory = 'C:/temp/'
)


# Example with R dataframe
birth_data <- rads::get_data_birth(year = c(2021:2022), 
                                   kingco = F, 
                                   cols = c('chi_age', 'race4', 
                                   'birth_weight_grams', 'birthplace_city', 
                                   'num_prev_cesarean', 'chi_year', 
                                   'mother_date_of_birth'), 
                                   version = 'final')
                                   
qa.df <- etl_qa_run_pipeline(
  data_source_type = 'r_dataframe',
  data_params = list(
    data = birth_data,
    time_var = 'chi_year',
    time_range = c(2021, 2022),
    cols = c('chi_age', 'race4', 'birth_weight_grams', 'birthplace_city', 
             'num_prev_cesarean', 'mother_date_of_birth'), 
    check_chi = FALSE
  ), 
  output_directory = 'C:/temp/'
)


# Example with SQL Server
library(DBI)
myconnection <- rads::validate_hhsaw_key()
qa.sql <- etl_qa_run_pipeline(
  data_source_type = 'sql_server',
  connection = myconnection,
  data_params = list(
    schema_table = 'birth.final_analytic',
    time_var = 'chi_year',
    time_range = c(2021, 2022),
    cols =c('chi_age', 'race4', 'birth_weight_grams', 'birthplace_city', 
            'num_prev_cesarean', 'mother_date_of_birth'), 
    check_chi = FALSE
  ), 
  output_directory = 'C:/temp/'
)

# Confirmation that the results are identical
all.equal(qa.rads$final, qa.df$final)
all.equal(qa.rads$final, qa.sql$final)


## End(Not run)


PHSKC-APDE/apde documentation built on April 14, 2025, 10:46 a.m.