gwasqc: Quality Control Of GWA Data
In GWAtoolbox: GWAS Quality Control

Description Usage Arguments Specifying The Input Data Files Field Separator Missing Values Column Names Case Sensitivity Filter for Implausible Values High Quality Filters Plotting Filter Output File Name Verbosity Level Number And Content Of Plots The Output Files Author(s) Examples

Performs the quality control of data from Genome-Wide Association Studies (GWAS).

gwasqc(script)

script

Name of a textual input file with processing instructions. The file should contain the names and locations of all GWAS data files to be processed along with basic information from each individual study, and instructions for the quality check.

The names of the GWAS data files are specified in the input script with the command PROCESS (one line per file). A different directory path can be specified for each file.

Example:

PROCESS input_file_1.txt

PROCESS /dir_1/dir_2/input_file_2.csv

The QC is applied first to ‘input_file_1.txt’ and then to ‘input_file_2.csv’.
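Because the input script is an ordinary text file, it can also be written from R before gwasqc() is called. The following is a minimal sketch that reuses the two PROCESS lines above. The temporary script file is chosen only for illustration, and checking the data files relative to the current working directory is an assumption about how paths are resolved.

```r
# Write a two-line input script to a temporary file (file name is illustrative).
script_lines <- c(
  "PROCESS input_file_1.txt",
  "PROCESS /dir_1/dir_2/input_file_2.csv"
)
script <- tempfile(fileext = ".txt")
writeLines(script_lines, script)

# Recover the data file paths from the PROCESS lines.
data_files <- sub("^PROCESS[[:space:]]+", "", script_lines)

# Run the QC only if GWAtoolbox is installed and the data files can be found
# (assumption: relative paths are resolved against the working directory).
if (requireNamespace("GWAtoolbox", quietly = TRUE) && all(file.exists(data_files))) {
  GWAtoolbox::gwasqc(script)
}
```

The guard keeps the snippet harmless on a machine where the package or the example files are absent.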

The field (column) separator can be different for each GWAS data file. gwasqc() automatically detects the field separator for each input file based on the first 10 rows. However, the user can also specify the separator manually for each individual file using the command SEPARATOR. The supported arguments and related separators are listed below:

Argument	Separator
COMMA	comma
TAB	tabulation
WHITESPACE	whitespace
SEMICOLON	semicolon

Example:

PROCESS input_file_1.txt

SEPARATOR TAB

PROCESS input_file_2.csv

PROCESS input_file_3.txt

For the input file ‘input_file_1.txt’, the field separator is determined automatically by the program, whereas for the input files ‘input_file_2.csv’ and ‘input_file_3.txt’ the separator is manually set to tabulation by the user.
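To give an intuition for automatic detection from the first rows of a file, the sketch below guesses a separator by looking for a candidate character that occurs equally often in every inspected line. It is an illustration only: the helper guess_separator() is hypothetical, and the rule it applies is an assumption, not the algorithm used inside gwasqc().

```r
# Hypothetical helper: guess the field separator from the first rows of a file.
# Illustration only, not the algorithm used inside gwasqc().
guess_separator <- function(lines, n = 10) {
  lines <- head(lines, n)
  candidates <- c(COMMA = ",", TAB = "\t", WHITESPACE = " ", SEMICOLON = ";")
  # Count how often each candidate occurs in every line.
  counts <- sapply(candidates, function(sep) {
    lengths(regmatches(lines, gregexpr(sep, lines, fixed = TRUE)))
  })
  counts <- matrix(counts, nrow = length(lines),
                   dimnames = list(NULL, names(candidates)))
  # A plausible separator occurs at least once and equally often in every line.
  ok <- apply(counts, 2, function(x) x[1] > 0 && all(x == x[1]))
  if (!any(ok)) return(NA_character_)
  names(candidates)[ok][1]
}

guess_separator(c("MARKER;CHR;PVALUE", "rs1;1;0.5", "rs2;2;0.01"))
```

When such a guess would be ambiguous, for example in files whose text fields contain commas, setting SEPARATOR explicitly avoids relying on detection.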

By default, gwasqc() assumes that missing values are labeled as NA. However, the label for missing values can be specified manually with the command MISSING.

Example:

MISSING -

PROCESS input_file_1.txt

MISSING NA

PROCESS input_file_2.csv

The hyphen symbol identifies missing values in the input file ‘input_file_1.txt’ and NA identifies missing values in the input file ‘input_file_2.csv’.
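The effect of a missing-value label can be mimicked in plain R with the na.strings argument of read.table(). The sketch below uses invented toy data. It assumes that the label must match a whole field, which is how read.table() behaves and may differ in detail from gwasqc().

```r
# Toy data in which a hyphen marks missing values, as with 'MISSING -'.
txt <- "MARKER EFFECT STDERR
rs1 0.12 0.03
rs2 - 0.05
rs3 -0.40 -"

# na.strings plays the role of the MISSING label when reading the data in R.
d <- read.table(text = txt, header = TRUE, na.strings = "-",
                stringsAsFactors = FALSE)

# Only the bare hyphen is treated as missing; -0.40 remains a number.
is.na(d$EFFECT)
```

The example also shows why a hyphen is a workable label: a negative number such as -0.40 is not confused with it when whole fields are compared.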

The table below reports the complete list of default column names for the GWAS data file. These names uniquely identify the items in the GWAS data file.

Default column name(s)	Description
MARKER	Marker name
CHR	Chromosome number or name
POSITION	Marker position
ALLELE1, ALLELE2	Coded and non-coded alleles
FREQLABEL	Allele frequency for the coded allele
STRAND	Strand
IMPUTED	Label value indicating if the marker was imputed (1) or genotyped (0)
IMP_QUALITY	Imputation quality statistic; this can differ depending on the software used for imputation: MACH's Rsq, IMPUTE's properinfo, ...
EFFECT	Effect size
STDERR	Standard error
PVALUE	P-value
HWE_PVAL	Hardy-Weinberg equilibrium p-value
CALLRATE	Genotype callrate
N	Sample size
USED_FOR_IMP	Label value indicating if a marker was used for imputation (1) or not (0)
AVPOSTPROB	Average posterior probability for imputed marker allele dosage

Because column names can differ between GWAS data files, gwasqc() allows the default column names to be redefined for every input file in the input script. The redefinition command consists of the default column name followed by the new column name. To redefine the default column names for coded and non-coded alleles, the command ALLELE followed by two new column names is used.

Example 1:

Assume there are two input files, ‘input_file_1.txt’ and ‘input_file_2.csv’. In ‘input_file_1.txt’, the column names for effect size and standard error are beta and SE, respectively. In ‘input_file_2.csv’, the column name for the effect size is the same as in ‘input_file_1.txt’, but the column name for the standard error is STDERR. The correct column redefinition is as follows:

EFFECT beta

STDERR SE

PROCESS input_file_1.txt

STDERR STDERR

PROCESS input_file_2.csv

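There is no need to redefine the EFFECT field for ‘input_file_2.csv’, because the redefinition given before the first PROCESS command remains in effect.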

Example 2:

Consider an input file, ‘input_file_1.txt’, with the following names for ALLELE1 and ALLELE2: myRefAllele and myNonRefAllele. The new column definition is applied as follows:

ALLELE myRefAllele myNonRefAllele

PROCESS input_file_1.txt
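The redefinitions in the two examples above amount to a lookup from a file's own column names to the default names. A minimal R sketch of that idea follows. The helper to_default() and the toy header are hypothetical and are not part of GWAtoolbox.

```r
# Column redefinitions in effect for one input file, written as
# default name -> name used in the file (mirrors the script commands above).
redefinitions <- c(EFFECT = "beta", STDERR = "SE",
                   ALLELE1 = "myRefAllele", ALLELE2 = "myNonRefAllele")

# Toy header of a GWAS data file.
file_columns <- c("MARKER", "myRefAllele", "myNonRefAllele", "beta", "SE", "PVALUE")

# Hypothetical helper: translate the file's column names to the default names.
to_default <- function(cols, redef) {
  idx <- match(cols, redef)
  ifelse(is.na(idx), cols, names(redef)[idx])
}

to_default(file_columns, redefinitions)
```

Columns without a redefinition, such as MARKER and PVALUE here, are expected to carry the default name already.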

By default, gwasqc() assumes that column names in the input files are case insensitive. For example, the column names STDERR, StdErr, and STDErr are all equivalent. This behaviour can be changed for every input file in the input script using the command CASESENSITIVE, which controls case sensitivity for the column names, as specified below:

Argument	Description
0	Column names in the input file are case insensitive (default)
1	Column names in the input file are case sensitive

Example:

CASESENSITIVE 1

PROCESS input_file_1.txt

CASESENSITIVE 0

PROCESS input_file_2.csv
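The difference between the two settings can be sketched in R as a column lookup that either compares names exactly or after converting both sides to upper case. The helper find_column() is hypothetical and only illustrates the behaviour described above.

```r
# Hypothetical helper: locate a column in a file header,
# with or without case sensitivity.
find_column <- function(name, header, case_sensitive = FALSE) {
  if (case_sensitive) {
    match(name, header)
  } else {
    match(toupper(name), toupper(header))
  }
}

header <- c("marker", "StdErr", "pvalue")
find_column("STDERR", header)                         # as with CASESENSITIVE 0
find_column("STDERR", header, case_sensitive = TRUE)  # as with CASESENSITIVE 1
```

With case sensitivity switched on, the lookup fails for this header unless STDERR is first redefined to StdErr.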

It is often necessary to identify implausible values in order to exclude unreliable results from the meta-analysis. Implausible values can arise from data sparseness, errors in data handling, or other causes.

gwasqc() identifies SNPs with suspicious statistics (p-value, standard error, etc.) by applying appropriate threshold values. After the data processing, a detailed report including the number of SNPs with implausible statistics and the nature of the problem is produced. In addition, suspicious SNPs are excluded from the calculation of the summary statistics on data quality.

The default filter thresholds are listed below:

Default column name	Default thresholds
STDERR	[0, 100000]
IMP_QUALITY	(0, 1.5)
PVALUE	(0, 1)
FREQLABEL	(0, 1)
HWE_PVAL	(0, 1)
CALLRATE	(0, 1)

The user has the option to modify the thresholds to account for specific needs. The new thresholds can be specified after the redefinition of the column name.

Example:

Assume that the input file ‘input_file_1.txt’ has a standard error column called STDERR and that the corresponding column in the input file ‘input_file_2.csv’ is called SE. In addition, the imputation quality column is defined as oevar_imp in both files. The following script shows how the user can re-define the column names while applying different plausibility filters:

STDERR STDERR 0 80000

IMP_QUALITY oevar_imp 0 1

PROCESS input_file_1.txt

STDERR SE 0 100000

PROCESS input_file_2.csv

The file ‘input_file_1.txt’ has new [0, 80000] thresholds for the standard error column and new (0, 1) thresholds for the imputation quality. For the file ‘input_file_2.csv’, thresholds of [0, 100000] are applied to the standard error column, while the imputation quality column keeps the same filter as for ‘input_file_1.txt’.
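The plausibility check can be pictured as an interval test per column. The sketch below reads square brackets as inclusive bounds and parentheses as exclusive bounds, which is the usual interval notation but remains an assumption about gwasqc(). The helper is_implausible() and the toy data are hypothetical.

```r
# Default plausibility thresholds taken from the table of defaults.
# 'closed' marks whether the interval end points are themselves accepted.
thresholds <- list(
  STDERR      = list(lower = 0, upper = 100000, closed = TRUE),
  IMP_QUALITY = list(lower = 0, upper = 1.5,    closed = FALSE),
  PVALUE      = list(lower = 0, upper = 1,      closed = FALSE)
)

# Hypothetical helper: flag values outside their interval.
# Missing values are not flagged here.
is_implausible <- function(x, th) {
  inside <- if (th$closed) {
    x >= th$lower & x <= th$upper
  } else {
    x > th$lower & x < th$upper
  }
  !is.na(x) & !inside
}

# Toy data (values invented for illustration).
gwas <- data.frame(
  MARKER = c("rs1", "rs2", "rs3", "rs4"),
  STDERR = c(0.02, 0, 250000, 0.1),
  PVALUE = c(0.5, 1e-8, 0.3, 1.2),
  stringsAsFactors = FALSE
)

flags <- sapply(c("STDERR", "PVALUE"),
                function(col) is_implausible(gwas[[col]], thresholds[[col]]))
colSums(flags)                    # implausible values per column
gwas$MARKER[rowSums(flags) > 0]   # markers with at least one implausible value
```

Changing a threshold in the input script, for example STDERR STDERR 0 80000, corresponds to changing the lower and upper entries for that column.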

SNPs with low imputation quality or a very small minor allele frequency (MAF) can produce spuriously small p-values. Checking for cryptic relatedness or hidden population sub-structure by estimating the inflation factor lambda can be important, but one first needs to identify the SNPs that could artificially increase the lambda value. gwasqc() identifies the 'high quality' SNPs by means of filters on the imputation quality and on the MAF. Summary statistics are calculated on the 'high quality' SNPs only. The default thresholds are listed below:

Default column name	Default thresholds
FREQLABEL	> 0.01
IMP_QUALITY	> 0.3

The default values can be redefined using the command HQ_SNP for every input file in the input script. The command is followed by two values: the first one corresponds to the threshold for the minor allele frequency, and the second one corresponds to the threshold for the imputation quality.

Example:

To define 'high quality' SNPs as those with MAF > 0.03 and imputation quality > 0.4, add the following lines to the input script:

HQ_SNP 0.03 0.4

PROCESS input_file_1.txt
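The following R sketch shows what the thresholds from HQ_SNP 0.03 0.4 select on a toy data set, and how an inflation factor can be estimated on the selected SNPs. Two assumptions are made: the MAF is derived from the coded allele frequency as the smaller of the two allele frequencies, and lambda is the standard genomic control estimate computed from p-values. How gwasqc() computes either quantity internally is not stated here, so this is not a description of its implementation.

```r
# Toy summary statistics (values invented for illustration).
gwas <- data.frame(
  FREQLABEL   = c(0.45, 0.005, 0.98, 0.20, 0.90),
  IMP_QUALITY = c(0.95, 0.90, 0.60, 0.25, 0.80),
  PVALUE      = c(0.50, 0.001, 0.20, 0.03, 0.70)
)

# Minor allele frequency: the smaller of the coded and non-coded allele frequency.
maf <- pmin(gwas$FREQLABEL, 1 - gwas$FREQLABEL)

# 'High quality' SNPs under the thresholds given by 'HQ_SNP 0.03 0.4'.
hq <- maf > 0.03 & gwas$IMP_QUALITY > 0.4

# Genomic control inflation factor: median observed 1-df chi-square statistic
# divided by the median of the chi-square distribution with 1 df.
lambda_gc <- function(p) {
  median(qchisq(p, df = 1, lower.tail = FALSE)) / qchisq(0.5, df = 1)
}

lambda_all <- lambda_gc(gwas$PVALUE)       # all SNPs
lambda_hq  <- lambda_gc(gwas$PVALUE[hq])   # 'high quality' SNPs only
```

Comparing lambda_all with lambda_hq on real data indicates how much of the apparent inflation is driven by rare or poorly imputed SNPs.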

The plotting filter selects the data shown in the various summary plots. The filter has two threshold levels, and which level is applied depends on the plot type and column. The default threshold values are listed below:

Default column name	1st level thresholds	2nd level thresholds
FREQLABEL	> 0.01	> 0.05
IMP_QUALITY	> 0.3	> 0.6

The default thresholds for the coded allele frequency and the imputation quality can be redefined for each input file with the commands MAF and IMP, respectively.

Example:

MAF 0.02 0.03

IMP 0.3 0.5

PROCESS input_file_1.txt

A new plotting filter is set for the input file ‘input_file_1.txt’. The first-level filter selects SNPs with MAF > 0.02 and imputation quality > 0.3, and the second, stricter, filter selects SNPs with MAF > 0.03 and imputation quality > 0.5.

For both text and graphic output files, the output file names are created by adding a prefix to the input file names. The prefix is specified with the command PREFIX.

Example:

PREFIX res_

PROCESS input_file_1.txt

PROCESS input_file_2.csv

PREFIX result_

PROCESS input_file_3.tab

All the output files corresponding to the input files ‘input_file_1.txt’ and ‘input_file_2.csv’ will be prefixed with res_; the output files corresponding to the input file ‘input_file_3.tab’ will be prefixed with result_.

The command VERBOSITY controls the number of output figures, as described below:

Argument	Description
1	The default and the lowest verbosity level.
2	The highest verbosity level.

Example:

VERBOSITY 2

PROCESS input_file_1.txt

VERBOSITY 1

PROCESS input_file_2.csv

The number and content of the output plots depend on the setting of the plotting filter and on the columns available in the input file. If a dependency is not satisfied because of missing columns or a filter setting, some plots may not be created, or may be truncated at levels different from those expected. See the tutorial for the list of dependencies.

The boxplots comparing EFFECT distributions across studies allow a BOXPLOTWIDTH to be specified, so that the box widths reflect another available quantity (typically the sample size). As an argument, BOXPLOTWIDTH requires one of the default column names. If BOXPLOTWIDTH is not specified, all boxplots have the same width.

It is also possible to specify a label for every input file, to be used in the plots instead of the full file names, which could be too long and therefore clutter the plots.

Example:

Let n_total be the column name which identifies the sample size in the input file ‘input_file_1.txt’, and samplesize the corresponding name in ‘input_file_2.csv’. Consider the following input script:

N n_total

PROCESS input_file_1.txt first

N samplesize

PROCESS /dir_1/dir_2/input_file_2.csv second

BOXPLOTWIDTH N

The width of the boxplots will be based on the study sample sizes, which are reported under different names in the two input files. The labels "first" and "second" will be used to identify the two studies in the plots.

gwasqc() produces four types of files:

Figures, including QQ-plots, histograms, and boxplots.
One textual report file with .txt extension.
One comma-separated file with .csv extension, containing all the summary statistics for the high quality imputation data.
One HTML document, combining the textual output and the figures, which allows all the output to be browsed easily in a web browser.

Daniel Taliun, Christian Fuchsberger, Cristian Pattaro

	

	# name of an input script
	script <- "GWASQC_script.txt"

	# load GWAtoolbox library
	library(GWAtoolbox)

	# show contents of the input script
	file.show(script, title = script)

	# run gwasqc() function
	gwasqc(script)

GWAtoolbox documentation built on May 2, 2019, 4:54 p.m.
