vignettes/etl_qa_run_pipeline.md

etl_qa_run_pipeline()

Introduction

The etl_qa_run_pipeline() function is designed to identify data quality issues before performing data analysis. To maximize its effectiveness, it is recommended to use this tool throughout the ETL pipeline. It can be run after extraction to ensure that the data we received are as expected and have not been corrupted during transmission. After transformation, it can check for errors introduced by coding mistakes. Finally, after loading the data to its near-final destination, it can be run again — perhaps by an analyst familiar with the data — to check for any remaining data quality issues.

In essence, the function draws data from SQL Server, a data.table or data.frame in R’s memory, or via the rads::get_data_xxx functions. It identifies missing data and assesses changes in numeric statistics and categorical frequency distributions over time. Results are provided as tables that can be filtered and reviewed (and easily read into Tableau) and as PDF graphs that help identify changes over time. Optionally, it can check if variables align with CHI (Community Health Indicators) standards.

Sneak peek at QA output

The function returns a named list containing four items: config, initial, final, and exported.

Here is a peek at the output generated by the example code that will be discussed in detail below.

Graphs

Figure 1: Missingness

This graph shows, for each variable, the percentage of observations (i.e., rows) that are missing at each time period.
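For intuition, the plotted quantity is simply the share of rows with a missing value for a given variable in each time period. Here is a minimal, hand-rolled sketch of that calculation (illustrative only, not the package's internal code) using data.table and made-up data:

library(data.table)
toy <- data.table(chi_year = rep(2013:2014, each = 4),
                  birthplace_city = c('Seattle', NA, 'Kent', NA, NA, NA, NA, 'Renton'))
# percentage of rows with a missing value for birthplace_city, by time period
toy[, .(percent_missing = 100 * mean(is.na(birthplace_city))), by = chi_year]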

Figure 2a: Numeric statistics

This graph shows basic statistics for numeric variables at each time period.

Figure 2b: Date statistics

This graph shows basic statistics for date variables at each time period. Note that datetime variables such as POSIXct and POSIXt in R and datetime, datetime2, and smalldatetime in SQL will be assessed as dates.

Figure 2c: Categorical frequencies

This graph shows the proportion of observations with a given value at each time period. It displays the top eight most frequent values, plus missing values (dotted line), plus all other values combined and labeled 'Other values'. Note that numeric and date variables with six or fewer distinct values will be treated as categorical.
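If you want to anticipate how a numeric or date column will be handled, you can count its distinct values yourself. A minimal sketch with data.table and a made-up vector (whether missing values count toward the six is an assumption here):

library(data.table)
# a made-up numeric vector with three distinct non-missing values
x <- c(0, 0, 1, 2, 1, 0, NA)
# six or fewer distinct values ==> treated as categorical
uniqueN(x, na.rm = TRUE) <= 6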

Tables

As you can see in Figure 2c, the proportion lines can overlap and it can be difficult to identify all the differences that might be of interest. For a more detailed exploration of this data, you can view the tables stored in the exported Excel file (which are also saved in the final item of the returned list). Here is a snapshot of each of the tables:

missingness

Here is a sample of the ‘missingness’ worksheet saved to the Excel file. You would be able to access this in R by typing something like myresult$final$missingness. As per the default settings (which can be adjusted), it flags absolute changes of >= 3% in the proportion missing between adjacent time periods.

| time_period | varname         |  nrow | proportion | abs_change |
|------------:|:----------------|------:|-----------:|:-----------|
|        2013 | birthplace_city | 87829 |      1.000 | NA         |
|        2014 | birthplace_city | 89858 |      1.000 | NA         |
|        2015 | birthplace_city | 90371 |      1.000 | NA         |
|        2016 | birthplace_city | 91756 |      1.000 | NA         |
|        2017 | birthplace_city |  1347 |      0.015 | -98.5%     |
|        2018 | birthplace_city |  1398 |      0.016 | NA         |
|        2019 | birthplace_city |  1365 |      0.016 | NA         |
|        2020 | birthplace_city |  1423 |      0.017 | NA         |
|        2021 | birthplace_city |  1595 |      0.019 | NA         |
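The abs_change column is just the percentage-point difference in the proportion between adjacent time periods. Here is a minimal sketch of that flagging logic (illustrative only, not the package's internal code), using the 2013-2017 values from the table above:

library(data.table)
miss <- data.table(time_period = 2013:2017,
                   proportion  = c(1.000, 1.000, 1.000, 1.000, 0.015))
setorder(miss, time_period)
# change between adjacent time periods, expressed in percentage points
miss[, abs_change := 100 * (proportion - shift(proportion))]
# flag changes at or beyond the default abs_threshold of 3
miss[, flagged := !is.na(abs_change) & abs(abs_change) >= 3]
miss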

values

Similarly, here is a sample from the ‘values’ worksheet (myresult$final$values). The default settings flag rows with a >= 2% relative change in the mean value between adjacent years.

| time_period | varname | value | count | proportion | abs_proportion_change | vartype | mean | median | min | max | rel_mean_change | rel_median_change | median_date | min_date | max_date |
|---:|:---|:---|---:|---:|:---|:---|---:|---:|---:|---:|:---|:---|:---|:---|:---|
| 2020 | mother_date_of_birth | NA | NA | NA | NA | Date | NA | NA | NA | NA | NA | NA | 1989-12-03 | 1955-09-03 | 2007-07-07 |
| 2021 | mother_date_of_birth | NA | NA | NA | NA | Date | NA | NA | NA | NA | NA | NA | 1990-09-11 | 1958-02-28 | 2010-01-03 |
| 2022 | mother_date_of_birth | NA | NA | NA | NA | Date | NA | NA | NA | NA | NA | NA | 1991-07-24 | 1965-09-28 | 2010-04-23 |
| 2013 | num_prev_cesarean | NA | NA | NA | NA | Continuous | 0.180 | 0 | 0 | 9 | NA | NA | NA | NA | NA |
| 2014 | num_prev_cesarean | NA | NA | NA | NA | Continuous | 0.188 | 0 | 0 | 8 | 4.6% | NA | NA | NA | NA |
| 2015 | num_prev_cesarean | NA | NA | NA | NA | Continuous | 0.190 | 0 | 0 | 8 | NA | NA | NA | NA | NA |
| 2016 | num_prev_cesarean | NA | NA | NA | NA | Continuous | 0.190 | 0 | 0 | 7 | NA | NA | NA | NA | NA |
| 2017 | num_prev_cesarean | NA | NA | NA | NA | Continuous | 0.750 | 1 | 0 | 9 | 295% | Inf% | NA | NA | NA |
| 2018 | num_prev_cesarean | NA | NA | NA | NA | Continuous | 1.173 | 1 | 0 | 10 | 56.4% | NA | NA | NA | NA |
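The rel_mean_change column compares each time period's mean to the previous period's mean. Here is a minimal sketch of that calculation (illustrative only), using the rounded num_prev_cesarean means for 2016-2018 from the table above; because the means are rounded, the results only approximate the 295% and 56.4% shown:

library(data.table)
vals <- data.table(time_period = 2016:2018,
                   mean = c(0.190, 0.750, 1.173))
setorder(vals, time_period)
# relative change in the mean between adjacent time periods, as a percentage
vals[, rel_mean_change := 100 * (mean - shift(mean)) / shift(mean)]
# flag changes at or beyond the default rel_threshold of 2
vals[, flagged := !is.na(rel_mean_change) & abs(rel_mean_change) >= 2]
vals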

chi_standards

Finally, if you set check_chi = TRUE, your Excel workbook will have a sheet called ‘chi_standards’ (myresult$final$chi_standards). It provides a simple ‘*’ to indicate when there is a misalignment between the CHI standard and your data.

| chi_year | varname    | group    | your_data | chi | problem |
|---------:|:-----------|:---------|----------:|----:|:--------|
|     2019 | chi_race_7 | Black    |         1 |   1 | NA      |
|     2020 | chi_race_7 | Black    |         1 |   1 | NA      |
|     2021 | chi_race_7 | Black    |         1 |   1 | NA      |
|     2022 | chi_race_7 | Black    |         1 |   1 | NA      |
|       NA | chi_race_7 | Hispanic |         0 |   1 | *       |
|     2013 | chi_race_7 | Multiple |         1 |   1 | NA      |
|     2014 | chi_race_7 | Multiple |         1 |   1 | NA      |
|     2015 | chi_race_7 | Multiple |         1 |   1 | NA      |
|     2016 | chi_race_7 | Multiple |         1 |   1 | NA      |
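To focus on just the flagged rows, you can filter on the problem column. A small sketch, assuming (as above) the results were saved to an object called myresult and that the returned tables behave like data.frames:

# keep only the rows where the CHI check flagged a misalignment
flagged_chi <- subset(myresult$final$chi_standards, problem == '*')
flagged_chi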

Now that you’ve learned what this function can do, let’s learn how to use it!

Setting up the environment

Start by loading the package:

library(apde)

Function arguments

Since this function allows the user to QA data in three different data sources that need different specifications, understanding the arguments is not as simple as typing args(etl_qa_run_pipeline). Here are all the possible arguments as described in the helpfile (?etl_qa_run_pipeline):

data_source_type

Character string specifying the type of data source. Must be one of 'r_dataframe', 'sql_server', or 'rads'.

connection

A DBIConnection object. Required only when data_source_type = 'sql_server'.

data_params

List of data-related parameters specific to the data source. Not all parameters are needed for all data sources. Please review the examples for details. The following elements can be supplied within data_params:

check_chi

Logical vector of length 1. When check_chi = TRUE, the function will add any available CHI related variables to data_params$cols and will assess whether their values align with the standards in rads.data::misc_chi_byvars. Default is check_chi = FALSE.

cols

Character vector specifying the column names to analyze, e.g., cols = c('race4', 'birth_weight_grams', 'birthplace_city').

time_range

Character vector of length 2 specifying the start and end of the time range, e.g., time_range = c(2015, 2024).

time_var

Character string specifying the time interval variable, e.g., time_var = 'chi_year'.

data

Name of a data.frame or data.table that you want to assess with this function, e.g., data = myDataTable. Required only when data_source_type = 'r_dataframe'.

function_name

Character string specifying the relevant rads::get_data_xxx function, e.g., function_name = 'get_data_birth'. Required only when data_source_type = 'rads'.

kingco

Logical vector of length 1. Identifies whether you want to limit the data to King County. Required only when data_source_type = 'rads'. Default is kingco = TRUE.

version

Character string specifying either 'final' or 'stage'. Required only when data_source_type = 'rads'. Default is version = 'stage'.

schema_table

The name of the schema and table to be accessed within the SQL Server connection. Must be in the form myschema.mytable, with a period as a separator, e.g., schema_table = 'birth.stage_analytic'. Required only when data_source_type = 'sql_server'.

output_directory

Character string specifying the directory where output files will be saved. If NULL, the current working directory is used. Default is output_directory = NULL.

digits_mean

Integer specifying the number of decimal places for rounding the reported mean, median, min, and max. Default is digits_mean = 3.

digits_prop

Integer specifying the number of decimal places for rounding proportions. Default is digits_prop = 3.

abs_threshold

Numeric threshold for flagging absolute percentage changes in proportions. Permissible range is [0, 100]. Default is abs_threshold = 3.

rel_threshold

Numeric threshold for flagging relative percentage changes in means and medians. Permissible range is [0, 100]. Default is rel_threshold = 2.

Examples

As stated above, this function can QA data in SQL Server, in a local data.table/data.frame, or retrieved via rads. Here we provide an example of each method applied to the same data source. Since all three methods provide identical output, we will only explore the SQL Server QA in detail, knowing that it generalizes to the other two.

QA SQL Server data

myconnection <- rads::validate_hhsaw_key()
qaSQL <- etl_qa_run_pipeline(
  data_source_type = 'sql_server',
  connection = myconnection,
  data_params = list(
    schema_table = 'birth.final_analytic',
    time_var = 'chi_year',
    time_range = c(2013, 2022),
    cols = c('chi_age', 'race4', 'birth_weight_grams', 'birthplace_city', 
            'num_prev_cesarean', 'mother_date_of_birth'), 
    check_chi = TRUE
  ), 
  output_directory = tempdir()
)
Creating config object ... running etl_qa_setup_config()

Analyzing data ... running etl_qa_initial_results()

😦👿🤬⚠ 
The following varname and group combinations exist in the rads.data::misc_chi_byvars 
standards but are missing from your dataset. Please ensure your dataset complies with
the CHI standard.

|    varname|group        |note                                                      |
|----------:|:------------|:---------------------------------------------------------|
| chi_race_7|Hispanic     |It's OK! A race variable cannot also represent ethnicity. |
|      race3|Hispanic     |It's OK! A race variable cannot also represent ethnicity. |
|      race3|Non-Hispanic |It's OK! A race variable cannot also represent ethnicity. |

Preparing results ... running etl_qa_final_results()

Visualizing data ... running etl_qa_export_results()

Finished! Check your output_directory: C:\Users\DCOLOM~1.KC\AppData\Local\Temp\RtmpyIyuUr

Since we set check_chi = TRUE, the function pulled in all available CHI related variables and compared them to known standards. In this case, it identified three deviations between the birth data and the CHI standards. However, these are known issues related to the conception of the CHI variables rather than a data processing problem, so there is a note letting the user know not to worry about the discrepancy. I’ve hidden this message in the examples that follow to simplify this vignette.

The final statement asks the user to check their output_directory, which is tempdir(). Let’s see what is saved there:

list.files(tempdir())
 [1] "birth.final_analytic_qa_2024_10_04.xlsx"       
 [2] "birth.final_analytic_qa_missing_2024_10_04.pdf"
 [3] "birth.final_analytic_qa_values_2024_10_04.pdf" 
 [4] "file109410de117e"                              
 [5] "file109412717fc4"                              
 [6] "file109412ae2594"                              
 [7] "file109412f824f7"                              
 [8] "file10941c9941ee"                              
 [9] "file109436734218"                              
[10] "file109438ac984"                               
[11] "file10944e342276"                              
[12] "Rf109428a2405d"

We also saved the output as the object qaSQL. Let’s confirm that what was written about the returned object in the Sneak peek section above was correct:

# confirm it is a list
inherits(qaSQL, 'list')
[1] TRUE
# get the names of the objects in the returned list
names(qaSQL)
[1] "config"   "initial"  "final"    "exported"
# get the names of the returned tables (which correspond to those in the Excel file)
names(qaSQL$final)
[1] "missingness"   "values"        "chi_standards"

QA data.table / data.frame data in memory

birth_data <- rads::get_data_birth(year = c(2013:2022), 
                             kingco = FALSE, 
                             version = 'final')

qaDF <- etl_qa_run_pipeline(
  data_source_type = 'r_dataframe',
  data_params = list(
    data = birth_data,
    time_var = 'chi_year',
    time_range = c(2013, 2022),
    cols = c('chi_age', 'race4', 'birth_weight_grams', 'birthplace_city', 
             'num_prev_cesarean', 'mother_date_of_birth'), 
    check_chi = TRUE
  ), 
  output_directory = tempdir()
)

QA rads data

qaRADS <- etl_qa_run_pipeline(
  data_source_type = 'rads',
  data_params = list(
    function_name = 'get_data_birth',
    time_var = 'chi_year',
    time_range = c(2013, 2022),
    cols = c('chi_age', 'race4', 'birth_weight_grams', 'birthplace_city',
             'num_prev_cesarean', 'mother_date_of_birth'),
    version = 'final',
    kingco = FALSE,
    check_chi = TRUE
  ),
  output_directory = tempdir()
)

Confirm that the results from each method are identical

identical(qaDF$final, qaSQL$final)
[1] TRUE
identical(qaDF$final, qaRADS$final)
[1] TRUE

Conclusion

Remember to consult the function documentation for more detailed information on usage and parameters. Happy coding!

Updated by dcolombara, 2024-10-04
