check_cross_file: Cross file checks
In UW-GAC/dbgaptools: Creates and Checks Standard Files for dbGaP submission


View source: R/check_cross_file.R

Check presence of expected subjects and samples across dbGaP files.

check_cross_file(subj_file, ssm_file, molecular_samples,
  sattr_file = NULL, pheno_file = NULL, ped_file = NULL,
  subjectID_col = "SUBJECT_ID", sampleID_col = "SAMPLE_ID",
  consent_col = "CONSENT")

`subj_file`	Path to subject consent file on disk
`ssm_file`	Path to sample-subject mapping file on disk
`molecular_samples`	Vector of sample IDs with molecular data
`sattr_file`	Path to sample attributes file on disk
`pheno_file`	Path to phenotype file on disk
`ped_file`	Path to pedigree file on disk
`subjectID_col`	Column name for subject-level ID across files
`sampleID_col`	Column name for sample-level ID across files
`consent_col`	Column name for consent in subject file
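
A minimal sketch of preparing the three required inputs, assuming the files are tab-delimited text with a header row and use the default column names. The file names and IDs are invented for illustration, and the call itself is left as a comment because it needs the dbgaptools package to be installed.

```r
# Write two minimal tab-delimited files to a temporary directory (toy data).
tmp <- tempdir()
subj_file <- file.path(tmp, "subject_consent.txt")
ssm_file  <- file.path(tmp, "sample_subject_mapping.txt")

write.table(
  data.frame(SUBJECT_ID = c("S1", "S2"), CONSENT = c(1, 0)),
  subj_file, sep = "\t", quote = FALSE, row.names = FALSE
)
write.table(
  data.frame(SUBJECT_ID = "S1", SAMPLE_ID = "A1"),
  ssm_file, sep = "\t", quote = FALSE, row.names = FALSE
)

# Sample IDs for which molecular data is being submitted (hypothetical).
molecular_samples <- c("A1")

# With dbgaptools installed, the required arguments would be passed like this:
# report <- dbgaptools::check_cross_file(subj_file, ssm_file, molecular_samples)
```

The optional `sattr_file`, `pheno_file`, and `ped_file` arguments default to `NULL`, so they can be added later without changing this setup.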

Checks for presence of expected subjects and samples across a set of dbGaP files. At a minimum, requires a subject consent file, a sample-subject mapping file, and a list of sample IDs for which molecular data is being submitted. Subjects whose consent code is neither 0 nor a positive integer are reported as an error and excluded from further checks. Including additional files increases the number of pairwise checks done across files. The basic principles behind these checks are:

subject file: must contain all subjects in phenotype and pedigree files
sample-subject mapping file: must contain all samples with molecular data; all samples listed here must map to subjects with consent=0 or consent >= 1 in subject file
pedigree file: subjects not mapping to samples with molecular data (i.e. linking individuals in a pedigree) are expected to have consent=0 in the subject file
phenotype file: should have no subjects missing consent or with consent=0
sample attributes file: all samples listed here must map to subjects with consent >= 1 in subject file
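
The consent validation and the sample-subject mapping checks can be illustrated with base R set operations. This is only a sketch of the principles on toy in-memory data frames, not the package's actual implementation; all IDs and variable names are invented.

```r
# Toy in-memory stand-ins for the dbGaP files (all IDs are made up).
subj <- data.frame(
  SUBJECT_ID = c("S1", "S2", "S3", "S4"),
  CONSENT    = c(1, 0, 2, -1),
  stringsAsFactors = FALSE
)
ssm <- data.frame(
  SUBJECT_ID = c("S1", "S3", "S4"),
  SAMPLE_ID  = c("A1", "A3", "A4"),
  stringsAsFactors = FALSE
)
molecular_samples <- c("A1", "A3", "A5")

# Valid consent codes are 0 or a positive integer; anything else is set aside.
is_valid <- !is.na(subj$CONSENT) & subj$CONSENT >= 0 & subj$CONSENT %% 1 == 0
consent_err <- subj$SUBJECT_ID[!is_valid]
subj_ok <- subj[is_valid, ]

# Molecular samples absent from the mapping file, and mapped samples
# that have no molecular data.
miss_molecular <- setdiff(molecular_samples, ssm$SAMPLE_ID)
no_molecular   <- setdiff(ssm$SAMPLE_ID, molecular_samples)

# Mapped subjects that are missing from the subject file or were
# dropped because of an invalid consent code.
subj_miss <- setdiff(ssm$SUBJECT_ID, subj_ok$SUBJECT_ID)
```

In this toy data, subject S4 is flagged twice: once for its invalid consent code, and again because its mapped sample no longer has a valid subject once S4 is excluded.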

Note that issues returned in the report do not always require corrective action. For example, consented study subjects may be missing from the current molecular data submission but expected in future submissions, and are therefore retained in dbGaP files with non-zero consent status.

cross_check_report, a list of the following issues (when present):

ssm_miss_molecular: List of molecular data samples missing from sample-subject mapping file
ssm_no_molecular: List of samples in the sample-subject mapping file that are not molecular data samples
subj_consent_err: List of subjects in subject consent file with invalid consent codes, which were excluded from subsequent checks
subj_miss_ssm: List of subjects in sample-subject mapping file either missing from subject file, or in subject file but with invalid consent code
sattr_miss_molecular: List of molecular data samples missing from the sample-attributes file
sattr_consent_err: List of samples in sample attributes file that map to subjects whose consent is not >= 1
pheno_consent_err: List of subjects in the phenotype file whose consent is not >= 1
pheno_miss_molecular: List of molecular data samples missing from the phenotype file
subj_miss_ped: List of subjects in pedigree file that are missing from subject consent file
ped_consent_err: List of subjects in pedigree file having non-0 consent and not mapping to a sample with molecular data
ped_miss_molecular: List of molecular samples mapped to a subject not present in the pedigree file, and thus assumed to be singletons/unrelateds
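
Because elements appear in the report only when the corresponding issue is present, a quick way to review a result is to count the entries per issue. The sketch below uses a mock list with the element names listed above; the contents are invented, and representing each element as a character vector is an assumption.

```r
# Mock report with the same element names as above (contents are invented).
cross_check_report <- list(
  ssm_miss_molecular = c("A5"),
  subj_consent_err   = c("S4"),
  subj_miss_ssm      = c("S4")
)

# Number of entries per issue; NROW() handles both vectors and data frames.
issue_counts <- vapply(cross_check_report, NROW, integer(1))

# Checks that raised nothing are simply absent from the list.
has_ped_issue <- "ped_consent_err" %in% names(cross_check_report)
```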