check_ssm: Check sample subject mapping file


View source: R/check_functions.R

Description

Check contents of a sample subject mapping file for dbGaP posting.

Usage

check_ssm(dsfile, ddfile = NULL, na_vals = c("NA", "N/A", "na", "n/a"),
  ssm_exp = NULL, sampleID_col = "SAMPLE_ID",
  subjectID_col = "SUBJECT_ID", sample_uses = NULL, topmed = FALSE)
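A minimal usage sketch, assuming the dbgaptools package is installed. The IDs, the SAMPLE_USE value, and the temporary file are made up for illustration; the call itself is guarded so the rest of the block runs without the package.

```r
# Hypothetical example data; these IDs and values are invented for illustration.
ssm <- data.frame(
  SUBJECT_ID = c("S1", "S2", "S3"),
  SAMPLE_ID  = c("A1", "A2", "A3"),
  SAMPLE_USE = "Seq_DNA_WholeGenome",
  stringsAsFactors = FALSE
)

# Write a tab-delimited .txt file, the format the function expects.
ssm_file <- tempfile(fileext = ".txt")
write.table(ssm, file = ssm_file, sep = "\t", quote = FALSE,
            row.names = FALSE, na = "NA")

# Expected sample-to-subject mapping, passed as ssm_exp.
ssm_exp <- ssm[, c("SAMPLE_ID", "SUBJECT_ID")]

# Run the check only when dbgaptools is available.
if (requireNamespace("dbgaptools", quietly = TRUE)) {
  ssm_report <- dbgaptools::check_ssm(ssm_file, ssm_exp = ssm_exp,
                                      sample_uses = "Seq_DNA_WholeGenome")
  str(ssm_report)
}
```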

Arguments

dsfile

Path to the data file on disk

ddfile

Path to the data dictionary file on disk

na_vals

Vector of strings that should be read in as NA/missing in data file (see details of read_ds_file)

ssm_exp

Data frame of expected SAMPLE_ID and SUBJECT_ID, with an optional third column 'quarantine' (see Details below)

sampleID_col

Column name for sample-level ID

subjectID_col

Column name for subject-level ID

sample_uses

Either a single string for expected SAMPLE_USE across all samples, or a data frame with SAMPLE_ID and SAMPLE_USE values

topmed

Logical to indicate TOPMed study

Details

The sample subject mapping file should be a tab-delimited .txt file. When ssm_exp is not NULL, the function checks for the expected correspondence between SAMPLE_ID and SUBJECT_ID. Any differences in mapping between the two, or in the list of expected SAMPLE_IDs or SUBJECT_IDs, are returned in the output. If ssm_exp contains an additional logical field 'quarantine', the function checks that SAMPLE_USE is left blank (read in as NA) for quarantined records. Quarantined samples are otherwise treated like all other records when checking for missing or extra subjects or samples.
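The kind of comparison described here can be sketched in base R. This is an illustration only, not the package's internal code; the data frames and IDs are hypothetical, and the result names simply mirror the elements listed under Value.

```r
# Observed mapping, as it might be read from the data file (made-up IDs).
ssm_obs <- data.frame(
  SAMPLE_ID  = c("A1", "A2", "A3", "A5"),
  SUBJECT_ID = c("S1", "S2", "S9", "S5"),
  stringsAsFactors = FALSE
)

# Expected mapping, as it might be supplied through ssm_exp.
ssm_exp <- data.frame(
  SAMPLE_ID  = c("A1", "A2", "A3", "A4"),
  SUBJECT_ID = c("S1", "S2", "S3", "S4"),
  stringsAsFactors = FALSE
)

# IDs present on one side but not the other.
extra_samples    <- setdiff(ssm_obs$SAMPLE_ID, ssm_exp$SAMPLE_ID)
missing_samples  <- setdiff(ssm_exp$SAMPLE_ID, ssm_obs$SAMPLE_ID)
extra_subjects   <- setdiff(ssm_obs$SUBJECT_ID, ssm_exp$SUBJECT_ID)
missing_subjects <- setdiff(ssm_exp$SUBJECT_ID, ssm_obs$SUBJECT_ID)

# For samples present in both, find rows where the subject differs.
both <- merge(ssm_exp, ssm_obs, by = "SAMPLE_ID",
              suffixes = c("_exp", "_obs"))
ssm_diffs <- both[both$SUBJECT_ID_exp != both$SUBJECT_ID_obs, ]
ssm_diffs
```

Here sample A3 is mapped to S3 in the expected data but to S9 in the observed data, so it is the only row left in ssm_diffs.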

If topmed is TRUE, then SAMPLE_USE is expected to be either "Seq_DNA_WholeGenome; Seq_DNA_SNP_CNV" or "Seq_DNA_SNP_CNV; Seq_DNA_WholeGenome", except for samples marked as quarantine in ssm_exp.
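As a rough sketch of this rule in base R (again not the package's own implementation, and with invented sample data): non-quarantined samples must carry one of the two accepted strings, while quarantined samples must have a missing SAMPLE_USE.

```r
# The two SAMPLE_USE strings accepted for TOPMed, as stated above.
topmed_uses <- c("Seq_DNA_WholeGenome; Seq_DNA_SNP_CNV",
                 "Seq_DNA_SNP_CNV; Seq_DNA_WholeGenome")

# Hypothetical samples; 'quarantine' plays the role of the ssm_exp column.
ssm <- data.frame(
  SAMPLE_ID  = c("A1", "A2", "A3", "A4"),
  SAMPLE_USE = c("Seq_DNA_WholeGenome; Seq_DNA_SNP_CNV",
                 "Seq_DNA_SNP_CNV; Seq_DNA_WholeGenome",
                 "Seq_DNA_WholeGenome",
                 NA),
  quarantine = c(FALSE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Quarantined rows pass when SAMPLE_USE is NA; all others must match
# one of the accepted strings exactly.
ok <- ifelse(ssm$quarantine,
             is.na(ssm$SAMPLE_USE),
             ssm$SAMPLE_USE %in% topmed_uses)

sampuse_diffs <- ssm[!ok, c("SAMPLE_ID", "SAMPLE_USE")]
sampuse_diffs
```

Only A3 is flagged: it is not quarantined and lists a single use rather than one of the two accepted combinations.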

If a data dictionary is provided (ddfile is not NULL), the function additionally checks the correspondence between column names in the data file and entries in the data dictionary. Data dictionary files can be Excel (.xls, .xlsx) or tab-delimited .txt.
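A simplified base R sketch of such a correspondence check follows. It assumes the dictionary lists variable names in a column called VARNAME, which is an assumption made for this illustration; the variables shown are invented and the package may perform additional checks.

```r
# Column names as they might appear in the data file (hypothetical).
ds_cols <- c("SUBJECT_ID", "SAMPLE_ID", "SAMPLE_USE")

# A toy data dictionary; the VARNAME/VARDESC layout is assumed here.
dd <- data.frame(
  VARNAME = c("SUBJECT_ID", "SAMPLE_ID", "COLLECTION_DATE"),
  VARDESC = c("Subject identifier", "Sample identifier",
              "Date of sample collection"),
  stringsAsFactors = FALSE
)

# Variables in the data file that the dictionary does not describe.
in_ds_not_dd <- setdiff(ds_cols, dd$VARNAME)

# Variables described in the dictionary but absent from the data file.
in_dd_not_ds <- setdiff(dd$VARNAME, ds_cols)

list(in_ds_not_dd = in_ds_not_dd, in_dd_not_ds = in_dd_not_ds)
```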

Value

ssm_report, a list of the following issues (when present):

missing_vars

Required variables that are missing from the data file

dup_samples

List of duplicated sample IDs

blank_idx

Row index of blank/missing subject or sample IDs

dd_errors

Differences in fields between data file and data dictionary

extra_subjects

Subjects in data file missing from ssm_exp

missing_subjects

Subjects in ssm_exp missing from data file

extra_samples

Samples in data file missing from ssm_exp

missing_samples

Samples in ssm_exp missing from data file

ssm_diffs

Discrepancies in mapping between SAMPLE_ID and SUBJECT_ID. Lists entries in ssm_exp that disagree with mapping in the data file

sampuse_diffs

Discrepancies with expected SAMPLE_USE values


UW-GAC/dbgaptools documentation built on April 30, 2019, 9:41 p.m.