vignettes/etl_qa_run_pipeline.md

etl_qa_run_pipeline()

Introduction

The etl_qa_run_pipeline() function is designed to identify data quality issues before performing data analysis. To maximize its effectiveness, it is recommended to use this tool throughout the ETL pipeline. It can be run after extraction to ensure that the data we received are as expected and have not been corrupted during transmission. After transformation, it can check for errors introduced by coding mistakes. Finally, after loading the data to its near-final destination, it can be run again — perhaps by an analyst familiar with the data — to check for any remaining data quality issues.

In essence, the function draws data from SQL Server, a data.table or data.frame in R’s memory, or via the rads::get_data_xxx functions. It identifies missing data and assesses changes in numeric statistics and categorical frequency distributions over time. Results are provided as tables that can be filtered and reviewed (and easily read into Tableau) and as PDF graphs that help identify changes over time. Optionally, it can check if variables align with CHI (Community Health Indicators) standards.

Sneak peek at QA output

The function returns a named list containing four items: config, initial, final, and exported.

Here is a peek at the output generated by the example code that will be discussed in detail below.

Graphs

Figure 1: Missingness

This graph shows, for each variable, the percentage of observations (i.e., rows) that are missing at each time period.
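For intuition, the plotted quantity is simply the share of rows with a missing value for a given variable in each time period. Here is a minimal, hand-rolled sketch of that calculation (illustrative only, not the package's internal code) using data.table and made-up data:

library(data.table)
toy <- data.table(chi_year = rep(2013:2014, each = 4),
                  birthplace_city = c('Seattle', NA, 'Kent', NA, NA, NA, NA, 'Renton'))
# percentage of rows with a missing value for birthplace_city, by time period
toy[, .(percent_missing = 100 * mean(is.na(birthplace_city))), by = chi_year]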

Figure 2a: Numeric statistics

This graph shows basic statistics for numeric variables at each time period.

Figure 2b: Date statistics

This graph shows basic statistics for date variables at each time period. Note that datetime variables such as POSIXct and POSIXt in R and datetime, datetime2, and smalldatetime in SQL will be assessed as dates.

Figure 2c: Categorical frequencies

This graph shows the proportion of observations with a given value at each time period. It displays the top eight most frequent values, plus missing values (dotted line), plus all other values combined and labeled 'Other values'. Note that numeric and date variables with six or fewer distinct values will be treated as categorical.
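If you want to anticipate how a numeric or date column will be handled, you can count its distinct values yourself. A minimal sketch with data.table and a made-up vector (whether missing values count toward the six is an assumption here):

library(data.table)
# a made-up numeric vector with three distinct non-missing values
x <- c(0, 0, 1, 2, 1, 0, NA)
# six or fewer distinct values ==> treated as categorical
uniqueN(x, na.rm = TRUE) <= 6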

Tables

As you can see in Figure 2c, the proportion lines can overlap and it can be difficult to identify all the differences that might be of interest. For a more detailed exploration of this data, you can view the tables stored in the exported Excel file (which are also saved in the final item of the returned list). Here is a snapshot of each of the tables:

missingness

Here is a sample of the ‘missingness’ worksheet saved to the Excel file. You would be able to access this in R by typing something like myresult$final$missingness. As per the default settings (which can be adjusted), it flags absolute changes of >= 3% in the proportion missing between adjacent time periods.

| time_period | varname         |  nrow | proportion | abs_change |
|------------:|:----------------|------:|-----------:|:-----------|
|        2013 | birthplace_city | 87829 |      1.000 | NA         |
|        2014 | birthplace_city | 89858 |      1.000 | NA         |
|        2015 | birthplace_city | 90371 |      1.000 | NA         |
|        2016 | birthplace_city | 91756 |      1.000 | NA         |
|        2017 | birthplace_city |  1347 |      0.015 | -98.5%     |
|        2018 | birthplace_city |  1398 |      0.016 | NA         |
|        2019 | birthplace_city |  1365 |      0.016 | NA         |
|        2020 | birthplace_city |  1423 |      0.017 | NA         |
|        2021 | birthplace_city |  1595 |      0.019 | NA         |
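The abs_change column is just the percentage-point difference in the proportion between adjacent time periods. Here is a minimal sketch of that flagging logic (illustrative only, not the package's internal code), using the 2013-2017 values from the table above:

library(data.table)
miss <- data.table(time_period = 2013:2017,
                   proportion  = c(1.000, 1.000, 1.000, 1.000, 0.015))
setorder(miss, time_period)
# change between adjacent time periods, expressed in percentage points
miss[, abs_change := 100 * (proportion - shift(proportion))]
# flag changes at or beyond the default abs_threshold of 3
miss[, flagged := !is.na(abs_change) & abs(abs_change) >= 3]
miss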

values

Similarly, here is a sample from the ‘values’ worksheet (myresult$final$values). The default settings flag rows with a >= 2% relative change in the mean value between adjacent years.

| time_period | varname | value | count | proportion | abs_proportion_change | vartype | mean | median | min | max | rel_mean_change | rel_median_change | median_date | min_date | max_date |
|---:|:---|:---|---:|---:|:---|:---|---:|---:|---:|---:|:---|:---|:---|:---|:---|
| 2020 | mother_date_of_birth | NA | NA | NA | NA | Date | NA | NA | NA | NA | NA | NA | 1989-12-03 | 1955-09-03 | 2007-07-07 |
| 2021 | mother_date_of_birth | NA | NA | NA | NA | Date | NA | NA | NA | NA | NA | NA | 1990-09-11 | 1958-02-28 | 2010-01-03 |
| 2022 | mother_date_of_birth | NA | NA | NA | NA | Date | NA | NA | NA | NA | NA | NA | 1991-07-24 | 1965-09-28 | 2010-04-23 |
| 2013 | num_prev_cesarean | NA | NA | NA | NA | Continuous | 0.180 | 0 | 0 | 9 | NA | NA | NA | NA | NA |
| 2014 | num_prev_cesarean | NA | NA | NA | NA | Continuous | 0.188 | 0 | 0 | 8 | 4.6% | NA | NA | NA | NA |
| 2015 | num_prev_cesarean | NA | NA | NA | NA | Continuous | 0.190 | 0 | 0 | 8 | NA | NA | NA | NA | NA |
| 2016 | num_prev_cesarean | NA | NA | NA | NA | Continuous | 0.190 | 0 | 0 | 7 | NA | NA | NA | NA | NA |
| 2017 | num_prev_cesarean | NA | NA | NA | NA | Continuous | 0.750 | 1 | 0 | 9 | 295% | Inf% | NA | NA | NA |
| 2018 | num_prev_cesarean | NA | NA | NA | NA | Continuous | 1.173 | 1 | 0 | 10 | 56.4% | NA | NA | NA | NA |
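The rel_mean_change column compares each time period's mean to the previous period's mean. Here is a minimal sketch of that calculation (illustrative only), using the rounded num_prev_cesarean means for 2016-2018 from the table above; because the means are rounded, the results only approximate the 295% and 56.4% shown:

library(data.table)
vals <- data.table(time_period = 2016:2018,
                   mean = c(0.190, 0.750, 1.173))
setorder(vals, time_period)
# relative change in the mean between adjacent time periods, as a percentage
vals[, rel_mean_change := 100 * (mean - shift(mean)) / shift(mean)]
# flag changes at or beyond the default rel_threshold of 2
vals[, flagged := !is.na(rel_mean_change) & abs(rel_mean_change) >= 2]
vals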

chi_standards

Finally, if you set check_chi = TRUE, your Excel workbook will have a sheet called ‘chi_standards’ (myresult$final$chi_standards). It provides a simple ‘*’ to indicate when there is a misalignment between the CHI standard and your data.

| chi_year | varname    | group    | your_data | chi | problem |
|---------:|:-----------|:---------|----------:|----:|:--------|
|     2019 | chi_race_7 | Black    |         1 |   1 | NA      |
|     2020 | chi_race_7 | Black    |         1 |   1 | NA      |
|     2021 | chi_race_7 | Black    |         1 |   1 | NA      |
|     2022 | chi_race_7 | Black    |         1 |   1 | NA      |
|       NA | chi_race_7 | Hispanic |         0 |   1 | *       |
|     2013 | chi_race_7 | Multiple |         1 |   1 | NA      |
|     2014 | chi_race_7 | Multiple |         1 |   1 | NA      |
|     2015 | chi_race_7 | Multiple |         1 |   1 | NA      |
|     2016 | chi_race_7 | Multiple |         1 |   1 | NA      |
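To focus on just the flagged rows, you can filter on the problem column. A small sketch, assuming (as above) the results were saved to an object called myresult and that the returned tables behave like data.frames:

# keep only the rows where the CHI check flagged a misalignment
flagged_chi <- subset(myresult$final$chi_standards, problem == '*')
flagged_chi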

Now that you’ve learned what this function can do, let’s learn how to use it!

Setting up the environment

Start by loading the package:

library(apde)

Function arguments

Since this function allows the user to QA data in three different data sources that need different specifications, understanding the arguments is not as simple as typing args(etl_qa_run_pipeline). Here are all the possible arguments as described in the helpfile (?etl_qa_run_pipeline):

data_source_type

Character string specifying the type of data source. Must be one of 'r_dataframe', 'sql_server', or 'rads'.

connection

A DBIConnection object. Required only when data_source_type = 'sql_server'.

data_params

List of data-related parameters specific to the data source. Not all parameters are needed for all data sources. Please review the examples for details. The following elements can be supplied within data_params:

check_chi

Logical vector of length 1. When check_chi = TRUE, the function will add any available CHI related variables to data_params$cols and will assess whether their values align with the standards in rads.data::misc_chi_byvars. Default is check_chi = FALSE.

cols

Character vector specifying the column names to analyze, e.g., cols = c('race4', 'birth_weight_grams', 'birthplace_city').

time_range

Character vector of length 2 specifying the start and end of the time range, e.g., time_range = c(2015, 2024).

time_var

Character string specifying the time interval variable, e.g., time_var = 'chi_year'.

data

Name of a data.frame or data.table that you want to assess with this function, e.g., data = myDataTable. Required only when data_source_type = 'r_dataframe'.

function_name

Character string specifying the relevant rads::get_data_xxx function, e.g., function_name = 'get_data_birth'. Required only when data_source_type = 'rads'.

kingco

Logical vector of length 1. Identifies whether you want to limit the data to King County. Required only when data_source_type = 'rads'. Default is kingco = TRUE.

version

Character string specifying either 'final' or 'stage'. Required only when data_source_type = 'rads'. Default is version = 'stage'.

schema_table

The name of the schema and table to be accessed within the SQL Server connection. Must be in the form myschema.mytable, with a period as a separator, e.g., schema_table = 'birth.stage_analytic'. Required only when data_source_type = 'sql_server'.

output_directory

Character string specifying the directory where output files will be saved. If NULL, the current working directory is used. Default is output_directory = NULL.

digits_mean

Integer specifying the number of decimal places for rounding the reported mean, median, min, and max. Default is digits_mean = 3.

digits_prop

Integer specifying the number of decimal places for rounding proportions. Default is digits_prop = 3.

abs_threshold

Numeric threshold for flagging absolute percentage changes in proportions. Permissible range is [0, 100]. Default is abs_threshold = 3.

rel_threshold

Numeric threshold for flagging relative percentage changes in means and medians. Permissible range is [0, 100]. Default is rel_threshold = 2.

Examples

As stated above, this function can QA data in SQL Server, in a local data.table/data.frame, or retrieved via rads. Here we provide an example of each method applied to the same data source. Since all three methods provide identical output, we will only explore the SQL Server QA in detail, knowing that it generalizes to the other two.

QA SQL Server data

myconnection <- rads::validate_hhsaw_key()
qaSQL <- etl_qa_run_pipeline(
  data_source_type = 'sql_server',
  connection = myconnection,
  data_params = list(
    schema_table = 'birth.final_analytic',
    time_var = 'chi_year',
    time_range = c(2013, 2022),
    cols = c('chi_age', 'race4', 'birth_weight_grams', 'birthplace_city', 
            'num_prev_cesarean', 'mother_date_of_birth'), 
    check_chi = TRUE
  ), 
  output_directory = tempdir()
)
Creating config object ... running etl_qa_setup_config()

Analyzing data ... running etl_qa_initial_results()

😦👿🤬⚠ 
The following varname and group combinations exist in the rads.data::misc_chi_byvars 
standards but are missing from your dataset. Please ensure your dataset complies with
the CHI standard.

|    varname|group        |note                                                      |
|----------:|:------------|:---------------------------------------------------------|
| chi_race_7|Hispanic     |It's OK! A race variable cannot also represent ethnicity. |
|      race3|Hispanic     |It's OK! A race variable cannot also represent ethnicity. |
|      race3|Non-Hispanic |It's OK! A race variable cannot also represent ethnicity. |

Preparing results ... running etl_qa_final_results()

Visualizing data ... running etl_qa_export_results()

Finished! Check your output_directory: C:\Users\DCOLOM~1.KC\AppData\Local\Temp\RtmpyIyuUr

Since we set check_chi = TRUE, the function pulled in all available CHI related variables and compared them to known standards. In this case, it identified three deviations between the birth data and the CHI standards. However, these are known issues related to the conception of the CHI variables rather than a data processing problem, so there is a note letting the user know not to worry about the discrepancy. I’ve hidden this message in the examples that follow to simplify this vignette.

The final statement asks the user to check their output_directory, which is tempdir(). Let’s see what is saved there:

list.files(tempdir())
 [1] "birth.final_analytic_qa_2024_10_04.xlsx"       
 [2] "birth.final_analytic_qa_missing_2024_10_04.pdf"
 [3] "birth.final_analytic_qa_values_2024_10_04.pdf" 
 [4] "file109410de117e"                              
 [5] "file109412717fc4"                              
 [6] "file109412ae2594"                              
 [7] "file109412f824f7"                              
 [8] "file10941c9941ee"                              
 [9] "file109436734218"                              
[10] "file109438ac984"                               
[11] "file10944e342276"                              
[12] "Rf109428a2405d"

We also saved the output as the object qaSQL. Let’s confirm that what was written about the returned object in the Sneak peek section above was correct:

# confirm it is a list
inherits(qaSQL, 'list')
[1] TRUE
# get the names of the objects in the returned list
names(qaSQL)
[1] "config"   "initial"  "final"    "exported"
# get the names of the returned tables (which correspond to those in the Excel file)
names(qaSQL$final)
[1] "missingness"   "values"        "chi_standards"

QA data.table / data.frame data in memory

birth_data <- rads::get_data_birth(year = c(2013:2022), 
                             kingco = FALSE, 
                             version = 'final')

qaDF <- etl_qa_run_pipeline(
  data_source_type = 'r_dataframe',
  data_params = list(
    data = birth_data,
    time_var = 'chi_year',
    time_range = c(2013, 2022),
    cols = c('chi_age', 'race4', 'birth_weight_grams', 'birthplace_city', 
             'num_prev_cesarean', 'mother_date_of_birth'), 
    check_chi = TRUE
  ), 
  output_directory = tempdir()
)

QA rads data

qaRADS <- etl_qa_run_pipeline(
  data_source_type = 'rads',
  data_params = list(
    function_name = 'get_data_birth',
    time_var = 'chi_year',
    time_range = c(2013, 2022),
    cols = c('chi_age', 'race4', 'birth_weight_grams', 'birthplace_city',
             'num_prev_cesarean', 'mother_date_of_birth'),
    version = 'final',
    kingco = FALSE,
    check_chi = TRUE
  ),
  output_directory = tempdir()
)

Confirm that the results from each method are identical

identical(qaDF$final, qaSQL$final)
[1] TRUE
identical(qaDF$final, qaRADS$final)
[1] TRUE

Conclusion

Remember to consult the function documentation for more detailed information on usage and parameters. Happy coding!

Updated by dcolombara, 2024-10-04
