check_sattr: Check sample attributes file
In UW-GAC/dbgaptools: Creates and Checks Standard Files for dbGaP submission


View source: R/check_functions.R

Check contents of a sample attributes file for dbGaP posting.


check_sattr(dsfile, ddfile = NULL, na_vals = c("NA", "N/A", "na",
  "n/a"), samp_exp = NULL, sampleID_col = "SAMPLE_ID",
  topmed = FALSE)

`dsfile`	Path to the data file on disk
`ddfile`	Path to the data dictionary file on disk
`na_vals`	Vector of strings that should be read in as NA/missing in data file (see details of `read_ds_file`)
`samp_exp`	List of expected sample IDs
`sampleID_col`	Column name for sample-level ID
`topmed`	Logical to indicate TOPMed study
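
As an illustration of what `na_vals` is for, the sketch below reads a small tab-delimited file with base R's `read.delim`, passing the default `na_vals` vector as `na.strings`. This is an assumption-laden stand-in: the package reads files through `read_ds_file`, whose internals are not shown here, and the file contents are made up for the example.

```r
# Hypothetical sample attributes content, written to a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "SAMPLE_ID\tBODY_SITE\tIS_TUMOR",
  "S1\tBlood\tN",
  "S2\tn/a\tN",
  "S3\tNA\tN/A"
), tmp)

# Same strings as the default na_vals argument of check_sattr.
na_vals <- c("NA", "N/A", "na", "n/a")

# Base R reader used as a stand-in for read_ds_file (assumption).
sattr <- read.delim(tmp, na.strings = na_vals, stringsAsFactors = FALSE)

# Every string listed in na_vals is now a true missing value.
is.na(sattr$BODY_SITE)
```

If a study encodes missing values differently (for example "." or "missing"), those strings would need to be added to `na_vals` to be treated as NA.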

The sample attributes file should be a tab-delimited .txt file. When `topmed = TRUE`, the function also checks for the presence of additional TOPMed-specific sample attributes variables: SEQUENCING_CENTER, Funding_Source, TOPMed_Phase, TOPMed_Project, Study_Name.

Note that none of the BioSample variables (BODY_SITE, ANALYTE_TYPE, HISTOLOGICAL_TYPE, IS_TUMOR) are strictly required, in the sense that their absence will not break the dbGaP processing pipeline or delay study release. However, their inclusion is strongly encouraged, and indeed necessary for cancer studies and other tissue-specific studies, so they are considered "required" variables for the purposes of this checking script.
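
The required-variable checks described above amount to comparing the file's column names against fixed lists. A minimal sketch in base R, assuming a made-up data frame `sattr` and using only the variable names given in this section (the package's actual implementation may differ, for example in how it treats the sample ID column):

```r
# Hypothetical sample attributes table with some variables absent.
sattr <- data.frame(
  SAMPLE_ID = c("S1", "S2"),
  BODY_SITE = c("Blood", "Blood"),
  ANALYTE_TYPE = c("DNA", "DNA"),
  SEQUENCING_CENTER = c("CenterA", "CenterA"),
  stringsAsFactors = FALSE
)

# Variable lists as named in the text above.
biosample_vars <- c("BODY_SITE", "ANALYTE_TYPE", "HISTOLOGICAL_TYPE", "IS_TUMOR")
topmed_vars <- c("SEQUENCING_CENTER", "Funding_Source", "TOPMed_Phase",
                 "TOPMed_Project", "Study_Name")

# Variables expected but not found among the column names.
missing_vars <- setdiff(biosample_vars, names(sattr))
missing_topmed_vars <- setdiff(topmed_vars, names(sattr))

missing_vars
missing_topmed_vars
```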

If a data dictionary is provided (`ddfile` is not NULL), the function additionally checks the correspondence between column names in the data file and entries in the data dictionary. Data dictionary files can be Excel (.xls, .xlsx) or tab-delimited .txt.
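
One way to picture the data dictionary comparison is as a two-way set difference between the data file's column names and the variable names listed in the dictionary. The sketch below assumes the dictionary stores variable names in a column called `VARNAME`; that column name and both example tables are assumptions for illustration, not taken from the package source.

```r
# Hypothetical data file columns.
ds_cols <- c("SAMPLE_ID", "BODY_SITE", "ANALYTE_TYPE", "IS_TUMOR")

# Hypothetical data dictionary; the VARNAME column name is an assumption.
dd <- data.frame(
  VARNAME = c("SAMPLE_ID", "BODY_SITE", "HISTOLOGICAL_TYPE"),
  VARDESC = c("Sample identifier", "Body site", "Histological type"),
  stringsAsFactors = FALSE
)

# Columns in the data file with no dictionary entry, and vice versa.
in_ds_not_dd <- setdiff(ds_cols, dd$VARNAME)
in_dd_not_ds <- setdiff(dd$VARNAME, ds_cols)

in_ds_not_dd
in_dd_not_ds
```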

satt_report, a list of the following issues (when present):

`missing_vars`	Required variables that are missing from the data file
`dup_samples`	List of duplicated sample IDs
`blank_idx`	Row index of blank/missing sample IDs
`dd_errors`	Differences in fields between data file and data dictionary
`extra_samples`	Samples in data file missing from `samp_exp`
`missing_samples`	Samples in `samp_exp` missing from data file
`missing_topmed_vars`	Required TOPMed variables that are missing from the data file
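
The sample-level entries in the report can be sketched with base R set operations. The example below is illustrative only: the ID vectors are invented, and the logic is a plausible reading of the descriptions above rather than the package's own code.

```r
# Hypothetical sample IDs as read from the sampleID_col column.
sample_ids <- c("S1", "S2", "S2", NA, "", "S5")

# Hypothetical expected sample IDs (the samp_exp argument).
samp_exp <- c("S1", "S2", "S3")

# Rows where the sample ID is NA or an empty string.
is_blank <- is.na(sample_ids) | sample_ids == ""
blank_idx <- which(is_blank)

# Non-blank IDs are used for the remaining checks.
present <- sample_ids[!is_blank]

# IDs that occur more than once.
dup_samples <- unique(present[duplicated(present)])

# IDs in the file but not expected, and expected but not in the file.
extra_samples <- setdiff(present, samp_exp)
missing_samples <- setdiff(samp_exp, present)

list(blank_idx = blank_idx, dup_samples = dup_samples,
     extra_samples = extra_samples, missing_samples = missing_samples)
```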