gwasqc: Quality Control Of GWA Data


Description

Performs the quality control of data from Genome-Wide Association Studies (GWAS).

Usage

gwasqc(script)

Arguments

script

Name of a textual input file with processing instructions. The file should contain the names and locations of all GWAS data files to be processed along with basic information from each individual study, and instructions for the quality check.

Specifying The Input Data Files

The names of the GWAS data files are specified in the input script with the command PROCESS (one line per file). A different directory path can be specified for each file.

Example:

PROCESS input_file_1.txt
PROCESS /dir_1/dir_2/input_file_2.csv

The QC is applied first to ‘input_file_1.txt’ and then to ‘input_file_2.csv’.
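As an illustration, an input script like the one above can be written from within R before calling gwasqc(). This is only a minimal sketch: the script is created in a temporary file (an arbitrary choice), and the data file names are the placeholders used in this section, so the call to gwasqc() is left commented out.

```r
# Lines of the input script: one PROCESS command per GWAS data file.
script_lines <- c(
  "PROCESS input_file_1.txt",
  "PROCESS /dir_1/dir_2/input_file_2.csv"
)

# Write the script to a temporary file (the location is an assumption).
script <- tempfile(fileext = ".txt")
writeLines(script_lines, script)

# Check what was written.
readLines(script)

# With GWAtoolbox installed and the data files in place:
# library(GWAtoolbox)
# gwasqc(script)
```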

Field Separator

The field (column) separator can be different for each GWAS data file. gwasqc() automatically detects the field separator for each input file based on the first 10 rows. However, the user can specify the separator manually for each individual file using the command SEPARATOR. The supported arguments and related separators are listed below:

Argument     Separator
COMMA        comma
TAB          tabulation
WHITESPACE   whitespace
SEMICOLON    semicolon

Example:

PROCESS input_file_1.txt
SEPARATOR TAB
PROCESS input_file_2.csv
PROCESS input_file_3.txt

For the input file ‘input_file_1.txt’, the field separator is determined automatically by the program. For the input files ‘input_file_2.csv’ and ‘input_file_3.txt’, the separator is manually set to tabulation by the user.
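The idea of detecting a separator from the first rows can be sketched in R as follows. This is not the algorithm used by gwasqc(), whose internals are not described here; guess_separator() is a hypothetical helper that picks the first candidate occurring the same non-zero number of times on each of the first n lines.

```r
# Hypothetical helper, NOT part of GWAtoolbox: guess the field separator
# from the first n lines of a file.
guess_separator <- function(lines, n = 10) {
  candidates <- c(COMMA = ",", TAB = "\t", SEMICOLON = ";", WHITESPACE = " ")
  first_lines <- head(lines, n)
  if (length(first_lines) == 0) {
    return(NA_character_)
  }
  for (name in names(candidates)) {
    # Count the occurrences of the candidate separator on every line.
    hits <- gregexpr(candidates[[name]], first_lines, fixed = TRUE)
    counts <- lengths(regmatches(first_lines, hits))
    # Accept the candidate if it occurs equally often on all lines.
    if (counts[1] > 0 && all(counts == counts[1])) {
      return(name)
    }
  }
  NA_character_
}

guess_separator(c("MARKER\tCHR\tPVALUE", "rs1\t1\t0.5", "rs2\t2\t0.01"))
guess_separator(c("MARKER;CHR", "rs1;1"))
```

When such a guess could be ambiguous, for example with free text containing commas in a tab-separated file, setting SEPARATOR explicitly avoids the problem.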

Missing Values

By default, gwasqc() assumes that missing values are labeled as NA. However, the label for missing values can be specified manually by the user with the command MISSING.

Example:

MISSING -
PROCESS input_file_1.txt
MISSING NA
PROCESS input_file_2.csv

The hyphen symbol identifies missing values in the input file ‘input_file_1.txt’ and NA identifies missing values in the input file ‘input_file_2.csv’.
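For comparison, base R treats a custom missing-value label in a similar way through the na.strings argument of read.table(). The sketch below is an analogy in plain R with made-up data, not a description of how gwasqc() parses its input. Only fields that consist entirely of the label are treated as missing, so negative numbers are unaffected.

```r
# Made-up example data in which "-" marks a missing value.
txt <- c(
  "MARKER\tEFFECT\tSTDERR",
  "rs1\t0.12\t0.03",
  "rs2\t-\t-",
  "rs3\t-0.40\t0.05"
)

# Read the data, treating a lone hyphen as missing.
dat <- read.table(text = txt, header = TRUE, sep = "\t",
                  na.strings = "-", stringsAsFactors = FALSE)

# Which effect sizes are missing?
is.na(dat$EFFECT)
```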

Column Names

The table below gives the complete list of default column names for the GWAS data file. These names uniquely identify the items in the GWAS data file.

Default column name(s)   Description
MARKER                   Marker name
CHR                      Chromosome number or name
POSITION                 Marker position
ALLELE1, ALLELE2         Coded and non-coded alleles
FREQLABEL                Allele frequency for the coded allele
STRAND                   Strand
IMPUTED                  Label value indicating if the marker was imputed (1) or genotyped (0)
IMP_QUALITY              Imputation quality statistics; this can be different depending on the software used for imputation: MACH's Rsq, IMPUTE's properinfo, ...
EFFECT                   Effect size
STDERR                   Standard error
PVALUE                   P-value
HWE_PVAL                 Hardy-Weinberg equilibrium p-value
CALLRATE                 Genotype callrate
N                        Sample size
USED_FOR_IMP             Label value indicating if a marker was used for imputation (1) or not (0)
AVPOSTPROB               Average posterior probability for imputed marker allele dosage

Because column names can differ between GWAS data files, gwasqc() allows the default names to be redefined for every input file in the input script. The redefinition command consists of the default column name followed by the new column name. To redefine the default column names for the coded and non-coded alleles, use the command ALLELE followed by the two new column names.

Example 1:

Assume there are two input files, ‘input_file_1.txt’ and ‘input_file_2.csv’. In ‘input_file_1.txt’, the column names for effect size and standard error are beta and SE, respectively. In ‘input_file_2.csv’, the column name for the effect size is the same as in ‘input_file_1.txt’, but the column name for the standard error is STDERR. The correct column redefinition is as follows:

EFFECT beta
STDERR SE
PROCESS input_file_1.txt
STDERR STDERR
PROCESS input_file_2.csv

There is no need to redefine the EFFECT field for the second file, because its column name is the same as in the first file.

Example 2:

Consider an input file, ‘input_file_1.txt’, with the following names for ALLELE1 and ALLELE2: myRefAllele and myNonRefAllele. The new column definition is applied as follows:

ALLELE myRefAllele myNonRefAllele
PROCESS input_file_1.txt

Case Sensitivity

By default, gwasqc() assumes that column names in the input files are case insensitive. For example, the column names STDERR, StdErr, and STDErr are all equivalent. This behaviour can be modified for every input file in the input script using the command CASESENSITIVE, which controls case sensitivity for the column names, as specified below:

Argument   Description
0          Column names in the input file are case insensitive (default)
1          Column names in the input file are case sensitive

Example:

CASESENSITIVE 1
PROCESS input_file_1.txt
CASESENSITIVE 0
PROCESS input_file_2.csv

Filter for Implausible Values

It is often necessary to identify implausible values in order to exclude unreliable results from the meta-analysis. Implausible values can arise from data sparseness, errors in data handling, or other causes.

gwasqc() identifies SNPs with suspicious statistics (p-value, standard error, etc.) by applying appropriate threshold values. After the data processing, a detailed report including the number of SNPs with implausible statistics and the nature of the problem is produced. In addition, suspicious SNPs are excluded from the calculation of the summary statistics on data quality.

The default filter thresholds are listed below:

Default column name   Default thresholds
STDERR                [0, 100000]
IMP_QUALITY           (0, 1.5)
PVALUE                (0, 1)
FREQLABEL             (0, 1)
HWE_PVAL              (0, 1)
CALLRATE              (0, 1)

The user can modify the thresholds to account for specific needs. The new thresholds are specified on the same line as the column name redefinition, after the new column name.

Example:

Assume that the input file ‘input_file_1.txt’ has a standard error column called STDERR and that the corresponding column in the input file ‘input_file_2.csv’ is called SE. In addition, the imputation quality column is defined as oevar_imp in both files. The following script shows how the user can re-define the column names while applying different plausibility filters:

STDERR STDERR 0 80000
IMP_QUALITY oevar_imp 0 1
PROCESS input_file_1.txt
STDERR SE 0 100000
PROCESS input_file_2.csv

The file ‘input_file_1.txt’ has new [0, 80000] thresholds for the standard error column and new (0, 1) thresholds for the imputation quality. For the file ‘input_file_2.csv’ the thresholds of [0, 100000] will be applied to the standard error column, while for the imputation quality column the same filters as for the ‘input_file_1.txt’ will be applied.
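The interval notation used above can be made concrete with a short R sketch: square brackets include the endpoints, round brackets exclude them. The helper functions and the data are made up for illustration, and treating missing values as failing the check is an assumption of this sketch, not documented gwasqc() behaviour.

```r
# Hypothetical helpers: closed interval [lower, upper] and open interval
# (lower, upper). Missing values are counted as failing (an assumption).
in_closed_interval <- function(x, lower, upper) {
  !is.na(x) & x >= lower & x <= upper
}
in_open_interval <- function(x, lower, upper) {
  !is.na(x) & x > lower & x < upper
}

# Made-up values for four SNPs.
stderr_values <- c(0, 0.05, 90000, 120000)
imp_quality   <- c(0.95, 1.2, 0, NA)

# Thresholds from the example script: STDERR in [0, 80000],
# imputation quality in (0, 1).
stderr_ok <- in_closed_interval(stderr_values, 0, 80000)
imp_ok    <- in_open_interval(imp_quality, 0, 1)

# Number of SNPs flagged for each column.
c(STDERR = sum(!stderr_ok), IMP_QUALITY = sum(!imp_ok))
```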

High Quality Filters

SNPs with low imputation quality or a very small minor allele frequency (MAF) can produce spuriously small p-values. Checking for cryptic relatedness or hidden population sub-structure by estimating the inflation factor lambda can be important, but this requires identifying the SNPs that could artificially increase the lambda value. gwasqc() identifies the 'high quality' SNPs by means of filters on the imputation quality and on the MAF. Summary statistics are calculated on the 'high quality' SNPs only. The default thresholds are listed below:

Default column name   Default thresholds
FREQLABEL             > 0.01
IMP_QUALITY           > 0.3

The default values can be redefined using the command HQ_SNP for every input file in the input script. The command is followed by two values: the first one corresponds to the threshold for the minor allele frequency, and the second one corresponds to the threshold for the imputation quality.

Example:

To define 'high quality' SNPs as those with MAF > 0.03 and imputation quality > 0.4, add the following lines to the input script:

HQ_SNP 0.03 0.4
PROCESS input_file_1.txt

Plotting Filter

The plotting filter is used to select appropriate data for the various summary plots. The filter has two threshold levels, and each of them is applied depending on the plot type and column. The default threshold values are listed below:

Default column name   1st level thresholds   2nd level thresholds
FREQLABEL             > 0.01                 > 0.05
IMP_QUALITY           > 0.3                  > 0.6

The default thresholds for the coded allele frequency and imputation quality can be redefined with the commands MAF and IMP, respectively, for each input file.

Example:

MAF 0.02 0.03
IMP 0.3 0.5
PROCESS input_file_1.txt

A new plotting filter is set for the input file ‘input_file_1.txt’. The first level selects SNPs with MAF > 0.02 and imputation quality > 0.3, and the second, stricter level selects SNPs with MAF > 0.03 and imputation quality > 0.5.

Output File Name

For both text and graphic output files, the output file names are created by adding a prefix to the input file names. The prefix is specified with the command PREFIX.

Example:

PREFIX res_
PROCESS input_file_1.txt
PROCESS input_file_2.csv
PREFIX result_
PROCESS input_file_3.tab

All the output files corresponding to the input files ‘input_file_1.txt’ and ‘input_file_2.csv’ will be prefixed with res_; the output files corresponding to the input file ‘input_file_3.tab’ will be prefixed with result_.

Verbosity Level

The command VERBOSITY controls the number of output figures, as described below:

Argument   Description
1          The default and the lowest verbosity level.
2          The highest verbosity level.

Example:

VERBOSITY 2
PROCESS input_file_1.txt
VERBOSITY 1
PROCESS input_file_2.csv

Number And Content Of Plots

The number and content of the output plots depend on the setting of the plotting filter and on the columns available in the input file. If a dependency is not satisfied because of missing columns or a filter setting, some plots may not be created or may be truncated at different levels than expected. See the tutorial for the list of dependencies.

The width of the boxplots comparing EFFECT distributions across studies can be set with the command BOXPLOTWIDTH, based on one of the other available columns (typically the sample size). As an argument, BOXPLOTWIDTH requires one of the default column names. If BOXPLOTWIDTH is not specified, all boxplots have the same width.

It is also possible to specify a label for every input file, to be used in the plots instead of the full file names, which could be too long and clutter the plots. The label is given after the file name in the PROCESS command.

Example:

Let n_total be the column name which identifies the sample size in the input file ‘input_file_1.txt’, and samplesize the corresponding name in ‘input_file_2.csv’. Consider the following input script:

N n_total
PROCESS input_file_1.txt first
N samplesize
PROCESS /dir_1/dir_2/input_file_2.csv second
BOXPLOTWIDTH N

The width of the boxplots will be based on the study sample sizes, which are reported with different names in the two input files. The labels "first" and "second" will be used to identify the two studies in the plots.
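The general idea of sample-size-dependent box widths can be reproduced in base R with the width argument of boxplot(). The sketch below uses simulated effect sizes and made-up sample sizes. Scaling the widths by the square root of the sample size is an assumption made for this illustration and is not necessarily the scaling gwasqc() applies.

```r
# Simulated effect sizes for two studies labeled "first" and "second".
set.seed(1)
effects <- list(
  first  = rnorm(200, sd = 0.10),
  second = rnorm(200, sd = 0.05)
)

# Made-up sample sizes for the two studies.
n <- c(first = 1200, second = 4800)

# Relative box widths (square-root scaling is an assumption).
widths <- sqrt(n) / max(sqrt(n))

# Draw on a null PDF device so that no file is created.
pdf(NULL)
boxplot(effects, width = widths, names = names(effects), ylab = "EFFECT")
invisible(dev.off())

widths
```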

The Output Files

gwasqc() produces 4 types of files:

  1. Figures, including QQ-plots, histograms, and boxplots.

  2. One textual report file with .txt extension.

  3. One comma-separated file with .csv extension, which contains all the summary statistics for the high quality imputation data.

  4. One HTML document, which combines the textual output and the figures and allows easy, dynamic querying of all the output in a hypertext browser.

Author(s)

Daniel Taliun, Christian Fuchsberger, Cristian Pattaro

Examples

# load the GWAtoolbox library
library(GWAtoolbox)

# name of the input script
script <- "GWASQC_script.txt"

# show the contents of the input script
file.show(script, title = script)

# run the gwasqc() function
gwasqc(script)

GWAtoolbox documentation built on May 2, 2019, 4:54 p.m.