knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Copyright 2022, University of Pittsburgh. All Rights Reserved. License: GPL-2
To install from CRAN use:
install.packages("dbGaPCheckup")
To install the development version from GitHub use:
devtools::install_github("lwheinsberg/dbGaPCheckup/pkg")
This document is designed to provide "quick start" guidance for using the dbGaPCheckup R package. Please see the table below and the dbGaPCheckup_vignette vignette for more detailed information.
fn.path <- system.file("extdata", "Functions.xlsx", package = "dbGaPCheckup", mustWork = TRUE)
fns <- readxl::read_xlsx(fn.path)
knitr::kable(fns, caption = "List of function names and types.")
After the dbGaPCheckup package has been installed, you can load the R package using this command:
library(dbGaPCheckup)
Then proceed as follows:
1. Read your data set into DS.data;
2. Read your data dictionary into DD.dict;
3. Run check_report, optionally defining any missing value codes (e.g., -9999) via the non.NA.missing.codes argument.

Note, as you will see below, this package requires several fields beyond those required by the dbGaP formatting requirements. Specifically, the data dictionary must also have MIN, MAX, and TYPE fields. If your data dictionary does not already include these fields, you can use the add_missing_fields function to auto-fill them (see below).
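A quick way to see whether a data dictionary already carries these extra fields is to compare its column names against the required set. The sketch below uses base R only and a small made-up dictionary (dd_toy and its contents are hypothetical, not shipped with the package); it is not how the package performs its own checks.

```r
# Toy data dictionary for illustration only (hypothetical values).
dd_toy <- data.frame(
  VARNAME = c("SUBJECT_ID", "HEIGHT"),
  VARDESC = c("Participant identifier", "Standing height"),
  UNITS   = c(NA, "cm"),
  stringsAsFactors = FALSE
)

# Fields this package expects in addition to the dbGaP-required ones.
extra_fields <- c("TYPE", "MIN", "MAX")

# Which of the extra fields are absent from the dictionary?
missing_fields <- setdiff(extra_fields, names(dd_toy))
print(missing_fields)
```

If missing_fields is non-empty, that is the situation in which add_missing_fields is meant to help.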
Load the dbGaPCheckup R package.
library(dbGaPCheckup)
Read your data set into DS.data.
DS.path <- system.file("extdata", "DS_Example.txt", package = "dbGaPCheckup", mustWork = TRUE)
DS.data <- read.table(DS.path, header = TRUE, sep = "\t", quote = "", as.is = TRUE)
Read your data dictionary into DD.dict.
DD.path <- system.file("extdata", "DD_Example2f.xlsx", package = "dbGaPCheckup", mustWork = TRUE)
DD.dict <- readxl::read_xlsx(DD.path)
Run check_report.
For many functions, specifying the missing value codes is important for accurate results.
report <- check_report(DD.dict = DD.dict, DS.data = DS.data, non.NA.missing.codes=c(-4444, -9999))
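To see why the missing value codes matter, consider what happens when a code such as -9999 is treated as a real measurement. The following base R sketch uses made-up values (sbp and missing_codes are illustrative names, not package objects) and only demonstrates the general idea, not the package's internals.

```r
# Toy blood pressure values; -9999 stands in for "missing" (illustrative data).
sbp <- c(118, 124, -9999, 131)
missing_codes <- c(-4444, -9999)

# Without recoding, the missing value code is treated as a real measurement.
naive_mean <- mean(sbp)

# Recode the missing value codes to NA before summarizing.
sbp_clean <- replace(sbp, sbp %in% missing_codes, NA)
clean_mean <- mean(sbp_clean, na.rm = TRUE)

print(c(naive = naive_mean, clean = clean_mean))
```

The naive mean is dragged far below any plausible value, which is the kind of distortion that declaring the codes up front avoids.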
Run add_missing_fields and repeat check_report.
As described in more detail in the dbGaPCheckup_vignette vignette, some checks contain embedded "pre-checks" that must be passed before the check can be run. For example, as mentioned above, this package requires MIN, MAX, and TYPE fields in the data dictionary. We have created a function that auto-fills these fields so that you can get further along in the checks.
DD.dict.updated <- add_missing_fields(DD.dict, DS.data)
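To give a rough feel for what auto-filling such fields involves, the sketch below derives a TYPE, MIN, and MAX for each column of a toy data set in base R. This is an assumption-laden illustration: summarize_var, ds_toy, and the "integer"/"decimal" labels are hypothetical, and the package's own add_missing_fields may use different rules.

```r
# Toy data set (hypothetical values); -9999 is a missing value code.
ds_toy <- data.frame(
  HEIGHT = c(150.2, 171.5, -9999),
  SEX    = c(1L, 2L, 1L)
)
missing_codes <- c(-9999)

# Hypothetical helper: summarize one variable after dropping missing codes.
summarize_var <- function(x, codes) {
  x <- x[!(x %in% codes) & !is.na(x)]
  data.frame(
    TYPE = if (is.integer(x)) "integer" else "decimal",
    MIN  = min(x),
    MAX  = max(x),
    stringsAsFactors = FALSE
  )
}

# Apply the helper to every column and stack the results.
fields <- do.call(rbind, lapply(ds_toy, summarize_var, codes = missing_codes))
fields <- cbind(VARNAME = names(ds_toy), fields)
print(fields)
```

Note that the missing value code is excluded before computing MIN, which is why supplying the codes matters here as well.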
Once the fields are added, you can return to run your checks.
report.v2 <- check_report(DD.dict = DD.dict.updated, DS.data = DS.data, non.NA.missing.codes = c(-4444, -9999))
Now we see that 13 out of 15 checks pass, but the workflow fails at description_check and missing_value_check. Specifically, in description_check we see that variables PREGNANT and REACT were identified as having missing variable descriptions (VARDESC), and variables HEIGHT and WEIGHT incorrectly have identical descriptions. In missing_value_check, we see that the variable CUFFSIZE contains a -9999 encoded value that is not specified in a VALUES column. While we have included functions that support "simple fixes", the issues identified here would need to be corrected manually in your data dictionary before moving on.
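The two kinds of problems reported above can be illustrated with a few lines of base R. The sketch below works on a made-up dictionary and data vector (dd_toy, cuffsize, and declared_values are hypothetical) and only mimics the idea behind these checks; it is not the package's implementation.

```r
# Toy data dictionary (hypothetical descriptions, for illustration only).
dd_toy <- data.frame(
  VARNAME = c("PREGNANT", "REACT", "HEIGHT", "WEIGHT", "SEX"),
  VARDESC = c(NA, "", "Body measurement", "Body measurement", "Sex of participant"),
  stringsAsFactors = FALSE
)

# Variables whose description is missing or blank.
is_blank <- is.na(dd_toy$VARDESC) | trimws(dd_toy$VARDESC) == ""
no_desc <- dd_toy$VARNAME[is_blank]

# Variables that share a non-blank description with another variable.
dup <- duplicated(dd_toy$VARDESC) | duplicated(dd_toy$VARDESC, fromLast = TRUE)
dup_desc <- dd_toy$VARNAME[dup & !is_blank]

# A missing value code that occurs in the data but is not declared.
cuffsize <- c(1, 2, -9999, 3)          # toy data values
declared_values <- c(1, 2, 3)          # codes assumed to be listed in the dictionary
undeclared <- intersect(c(-4444, -9999), setdiff(cuffsize, declared_values))

print(list(no_desc = no_desc, dup_desc = dup_desc, undeclared = undeclared))
```

In each case the fix is an edit to the data dictionary itself: add the description, make the descriptions distinct, or declare the code.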
Note that we have also created reporting functions that generate graphical and textual descriptions and awareness checks of the data in HTML format (see the dbGaPCheckup_vignette vignette: create_awareness_report (Appendix A) and create_report (Appendix B) for more details). These reports are designed to help you catch other potential errors in your data set. Note, however, that the report generated by create_report is quite long, so we recommend that you submit only a subset of variables at a time. Specifying the missing value codes is also important for effective plotting.
# Functions not run here as they work best when initiated interactively

# Awareness Report (See Appendix A of the `dbGaPCheckup` vignette)
create_awareness_report(DD.dict.updated, DS.data, non.NA.missing.codes = c(-9999, -4444), output.path = tempdir())

# Data Report (See Appendix B of the `dbGaPCheckup` vignette)
create_report(DD.dict.updated, DS.data, sex.split = TRUE, sex.name = "SEX", start = 3, end = 7, non.NA.missing.codes = c(-9999, -4444), output.path = tempdir(), open.html = TRUE)
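One way to work through a large data set in subsets is to precompute a series of start and end positions and run the report once per window. The base R sketch below only computes the windows; it assumes, based on the call shown above, that start and end refer to column positions, and the values of n_vars, chunk_size, and first_var are hypothetical.

```r
# Hypothetical settings: 20 variables, reported 5 at a time, starting at column 3.
n_vars     <- 20
chunk_size <- 5
first_var  <- 3

# Start and end position of each window; the last window is clipped to n_vars.
starts  <- seq(first_var, n_vars, by = chunk_size)
ends    <- pmin(starts + chunk_size - 1, n_vars)
windows <- data.frame(start = starts, end = ends)
print(windows)
```

Each row of windows could then supply the start and end arguments for one interactive run of create_report.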
More details on execution and interpretation have been provided in the dbGaPCheckup_vignette vignette.
After your data dictionary is fully consistent with your data, you can use the label_data function to convert your data to labelled data, essentially embedding the data dictionary into the data for future use!
DS_labelled_data <- label_data(DD.dict.updated, DS.data, non.NA.missing.codes = c(-9999))
labelled::var_label(DS_labelled_data$SEX)
labelled::val_labels(DS_labelled_data$SEX)
attributes(DS_labelled_data$SEX)
labelled::na_values(DS_labelled_data$HX_DEPRESSION)
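The idea of "embedding the data dictionary into the data" can be illustrated in base R by attaching a description and value labels to a vector as attributes. This is a simplified sketch with made-up values: it does not create a true labelled object as label_data does, and the attribute names "label" and "labels" are used here on the assumption that they mirror the convention of the labelled package.

```r
# Toy coded variable (hypothetical values and labels).
sex <- c(1, 2, 2, 1)

# Attach a variable description and value labels as attributes.
attr(sex, "label")  <- "Sex of participant"
attr(sex, "labels") <- c(Male = 1, Female = 2)

# The metadata now travels with the vector.
print(attributes(sex))

# Translate the codes back into their text labels.
value_labels <- attr(sex, "labels")
sex_text <- names(value_labels)[match(sex, value_labels)]
print(sex_text)
```

Keeping the codes and their meaning in one object is what makes the labelled data convenient for later analyses.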
If you have any questions or comments, please feel free to contact us!
Lacey W. Heinsberg: law145@pitt.edu
Daniel E. Weeks: weeks@pitt.edu
Bug reports: https://github.com/lwheinsberg/dbGaPCheckup/issues
This package was developed with partial support from the National Institutes of Health under award numbers R01HL093093, R01HL133040, and K99HD107030. The eval_function and dat_function functions that form the backbone of the awareness reports were inspired by an elegant 2016 homework answer submitted by Tanbin Rahman in our HUGEN 2070 course ‘Bioinformatics for Human Genetics’. We would also like to thank Nick Moshgat for testing and providing feedback on our package during development.