gwasformat: Formatting of GWAS data files.


Description

Renames and re-orders columns, sets tabulation as field (column) separator, calculates inflation factors and applies genomic control in GWAS data files.

Usage

gwasformat(script, logfile)

Arguments

script

Name of a textual input file with processing instructions. The file should contain the names and locations of all GWAS data files to be processed along with basic information from each individual study.

logfile

Name of a log file with processing output. The output contains calculated inflation factors, total number of markers and number of filtered markers.

Specifying The Input Data Files

The names of the GWAS data files are specified in the input script with the command PROCESS (one line per file). A different directory path can be specified for each file.

Example:

PROCESS input_file_1.txt
PROCESS /dir_1/dir_2/input_file_2.csv

The formatting is applied first to ‘input_file_1.txt’ and then to ‘input_file_2.csv’.
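The input script is an ordinary text file, so it can also be written from R before gwasformat() is called. The following is a minimal sketch using base R only; the script location in a temporary directory is an assumption made for illustration, and the PROCESS lines are those of the example above.

```r
# Hypothetical script location; any writable path would do.
script_path <- file.path(tempdir(), "GWASFORMAT_script.txt")

# One PROCESS line per GWAS data file, in the order of processing.
script_lines <- c(
  "PROCESS input_file_1.txt",
  "PROCESS /dir_1/dir_2/input_file_2.csv"
)
writeLines(script_lines, script_path)

# Read the script back and list the files named by PROCESS commands.
commands <- readLines(script_path)
process_lines <- commands[startsWith(commands, "PROCESS ")]
input_files <- sub("^PROCESS\\s+", "", process_lines)
print(input_files)
```

The resulting path can then be passed as the script argument of gwasformat().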

Field Separator

The field (column) separator can be different for each GWAS data file; during the formatting it is changed to tabulation. gwasformat() automatically detects the original field separator of each input file based on the first 10 rows. However, the user can also specify the original separator manually for each individual file using the command SEPARATOR. The supported arguments and related separators are listed below:

Argument Separator
COMMA comma
TAB tabulation
WHITESPACE whitespace
SEMICOLON semicolon

Example:

PROCESS input_file_1.txt
SEPARATOR COMMA
PROCESS input_file_2.csv
PROCESS input_file_3.txt

For the input file ‘input_file_1.txt’ the field separator is determined automatically by the program, but for the input files ‘input_file_2.csv’ and ‘input_file_3.txt’ the separator is manually set to comma by the user. After the formatting, all three files will have tabulation as the new field separator.
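To illustrate what detecting a separator from the first 10 rows can look like, here is a minimal sketch in base R. It is not the algorithm used by gwasformat(), which is not described here; the function name guess_separator and its rule (the first candidate that splits every inspected line into the same number of fields, greater than one) are assumptions made for illustration.

```r
# Illustrative only: NOT the detection algorithm used by gwasformat().
guess_separator <- function(path, n_lines = 10) {
  lines <- readLines(path, n = n_lines)
  if (length(lines) == 0) {
    return(NA_character_)
  }
  # Candidate separators, named after the SEPARATOR arguments.
  candidates <- c(COMMA = ",", TAB = "\t", SEMICOLON = ";", WHITESPACE = " ")
  for (name in names(candidates)) {
    # Number of fields each line splits into for this candidate.
    n_fields <- lengths(strsplit(lines, candidates[[name]], fixed = TRUE))
    # Accept the candidate if every line has the same number (> 1) of fields.
    if (n_fields[1] > 1 && all(n_fields == n_fields[1])) {
      return(name)
    }
  }
  NA_character_
}

# Small demonstration on temporary files.
demo_csv <- tempfile(fileext = ".csv")
writeLines(c("marker,chromosome,bp", "rs1,1,1000", "rs2,1,2000"), demo_csv)
demo_tab <- tempfile(fileext = ".txt")
writeLines(c("marker\tchromosome\tbp", "rs1\t1\t1000"), demo_tab)
print(guess_separator(demo_csv))
print(guess_separator(demo_tab))
```

When automatic detection is in doubt, setting the separator explicitly with SEPARATOR avoids relying on any such heuristic.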

Renaming Columns

The original column names in the GWAS data files are renamed using the command RENAME in the input script. The command is followed by two words: the first one corresponds to the original column name, and the second one corresponds to the new column name. The column names can't contain tabulation or space characters.

Example:

Assume there are three input files: ‘input_file_1.txt’, ‘input_file_2.csv’ and ‘input_file_3.txt’. Each file has a column named marker, which should be renamed. The new column name should be SNPID for ‘input_file_1.txt’, and rsId for ‘input_file_2.csv’ and ‘input_file_3.txt’. The correct column renaming is as follows:

RENAME marker SNPID
PROCESS input_file_1.txt
RENAME marker rsId
PROCESS input_file_2.csv
PROCESS input_file_3.txt
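The effect of RENAME on a single file can be pictured with a data frame in base R. This is a sketch of the idea only, with a hypothetical helper rename_column; it does not call GWAtoolbox.

```r
# Rename one column of a data frame; other columns are left untouched.
rename_column <- function(df, old, new) {
  names(df)[names(df) == old] <- new
  df
}

gwas <- data.frame(marker = c("rs1", "rs2"), pval = c(0.01, 0.5),
                   stringsAsFactors = FALSE)

# Equivalent in spirit to "RENAME marker SNPID".
renamed <- rename_column(gwas, "marker", "SNPID")
print(names(renamed))
```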

Column Names

The table below gives the complete list of the default column names for a GWAS data file. Each name uniquely identifies one item in the GWAS data file.

Default column name(s)  Description
MARKER                  Marker name
CHR                     Chromosome number or name
POSITION                Marker position
ALLELE1, ALLELE2        Coded and non-coded alleles
FREQLABEL               Allele frequency for the coded allele
STRAND                  Strand
IMPUTED                 Label value indicating if the marker was imputed (1) or genotyped (0)
IMP_QUALITY             Imputation quality statistics; this can be different depending on the software used for imputation: MACH's Rsq, IMPUTE's properinfo, ...
EFFECT                  Effect size
STDERR                  Standard error
PVALUE                  P-value
HWE_PVAL                Hardy-Weinberg equilibrium p-value
CALLRATE                Genotype callrate
N                       Sample size
USED_FOR_IMP            Label value indicating if a marker was used for imputation (1) or not (0)
AVPOSTPROB              Average posterior probability for imputed marker allele dosage

Because column names can differ between GWAS data files, gwasformat() allows the default names to be redefined for every input file in the input script. The redefinition command consists of the default column name followed by the column name actually present in the file. To redefine the default column names for the coded and non-coded alleles, the command ALLELE followed by the two present column names is used. If a column was renamed with the command RENAME, then the new column name must be used in the redefinition command.

Example 1:

Assume there are two input files, ‘input_file_1.txt’ and ‘input_file_2.csv’. In ‘input_file_1.txt’, the column names for the P-value and the standard error are pval and SE, respectively. In ‘input_file_2.csv’, the column name for the P-value is the same as in ‘input_file_1.txt’, but the column name for the standard error is STDERR. The correct column redefinition is as follows:

PVALUE pval
STDERR SE
PROCESS input_file_1.txt
STDERR STDERR
PROCESS input_file_2.csv

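There is no need to redefine the PVALUE field for ‘input_file_2.csv’, because the definition given before the first PROCESS command remains in effect. Alternatively, if the column pval in ‘input_file_1.txt’ and ‘input_file_2.csv’ needs to be renamed to p-value, then the input script is as follows: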

RENAME pval p-value
PVALUE p-value
PROCESS input_file_1.txt
STDERR STDERR
PROCESS input_file_2.csv

Example 2:

Consider an input file, ‘input_file_1.txt’, with the following names for ALLELE1 and ALLELE2: myRefAllele and myNonRefAllele. The new column definition is applied as follows:

ALLELE myRefAllele myNonRefAllele
PROCESS input_file_1.txt

Columns Ordering

By default, gwasformat() does not change the original ordering of columns in the input file. This behaviour can be changed for every input file in the input script using the command ORDER as specified below:

Argument                           Description
OFF                                The original ordering of columns is preserved
ON                                 Columns are re-ordered following the alphabetical ordering
ON column_1 column_2 ... column_n  Columns are re-ordered following the specified order column_1 column_2 ... column_n

Example:

Assume there are three input files: ‘input_file_1.txt’, ‘input_file_2.csv’ and ‘input_file_3.txt’. Each file contains the columns marker, chromosome and bp, in that order. The following input script renames the column marker to SNPID and switches the ordering mode for every input file:

RENAME marker SNPID
MARKER SNPID
CHR chromosome
POSITION bp
ORDER ON chromosome bp SNPID
PROCESS input_file_1.txt
ORDER OFF
PROCESS input_file_2.csv
ORDER ON
PROCESS input_file_3.txt

For the input file ‘input_file_1.txt’ the columns are re-ordered to: chromosome, bp, SNPID. For the input file ‘input_file_2.csv’ the original ordering of columns is preserved: SNPID, chromosome, bp. For the input file ‘input_file_3.txt’ the columns are re-ordered following the alphabetical ordering: bp, chromosome, SNPID.
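The three ordering modes can be mimicked on a data frame in base R. The sketch below uses a hypothetical helper reorder_columns. Two details are assumptions rather than documented behaviour: the alphabetical mode ignores case (which reproduces the order bp, chromosome, SNPID given above), and columns not listed in an explicit order are appended in their original order.

```r
# Illustrative sketch of the ORDER modes; not the gwasformat() implementation.
reorder_columns <- function(df, mode = "OFF", columns = NULL) {
  if (mode == "OFF") {
    return(df)                      # ORDER OFF: keep the original ordering
  }
  if (is.null(columns)) {
    # ORDER ON: alphabetical ordering (assumption: case is ignored)
    return(df[, order(tolower(names(df))), drop = FALSE])
  }
  # ORDER ON column_1 ... column_n: listed columns first
  # (assumption: unlisted columns follow in their original order)
  rest <- setdiff(names(df), columns)
  df[, c(columns, rest), drop = FALSE]
}

gwas <- data.frame(SNPID = "rs1", chromosome = 1, bp = 1000,
                   stringsAsFactors = FALSE)

print(names(reorder_columns(gwas, "ON", c("chromosome", "bp", "SNPID"))))
print(names(reorder_columns(gwas, "OFF")))
print(names(reorder_columns(gwas, "ON")))
```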

Case Sensitivity

By default, gwasformat() assumes that column names in the input files are case insensitive. For example, the column names STDERR, StdErr, and STDErr are all equivalent. This behaviour can be changed for every input file in the input script using the command CASESENSITIVE, as specified below:

Argument  Description
0         Column names in the input file are case insensitive (default)
1         Column names in the input file are case sensitive

Example:

CASESENSITIVE 1
PROCESS input_file_1.txt
CASESENSITIVE 0
PROCESS input_file_2.csv

Specifying Filters

gwasformat() filters SNPs based on minor allele frequency (MAF) and imputation quality. The default thresholds are listed below:

Default column name Default thresholds
FREQLABEL > 0.01
IMP_QUALITY > 0.3

The default values can be redefined using the command HQ_SNP for every input file in the input script. The command is followed by two values: the first one corresponds to the threshold for the minor allele frequency, and the second one corresponds to the threshold for the imputation quality.

Example 1:

To filter SNPs using the thresholds MAF > 0.03 and imputation quality > 0.4, add the following lines to the input script:

HQ_SNP 0.03 0.4
PROCESS input_file_1.txt

Example 2:

To disable filtering, change the input script as follows:

HQ_SNP 0 0
PROCESS input_file_1.txt
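A threshold filter of this kind can be sketched on a data frame in base R. The helper apply_hq_filter is hypothetical. It assumes that the MAF is derived from the coded allele frequency as min(f, 1 - f), that both comparisons are strict, and that a SNP failing either threshold is dropped; none of these details is confirmed here for gwasformat().

```r
# Illustrative sketch of a MAF / imputation quality filter.
apply_hq_filter <- function(df, maf_threshold = 0.01, quality_threshold = 0.3) {
  # Assumption: MAF is the smaller of the two allele frequencies.
  maf <- pmin(df$FREQLABEL, 1 - df$FREQLABEL)
  keep <- maf > maf_threshold & df$IMP_QUALITY > quality_threshold
  df[keep, , drop = FALSE]
}

gwas <- data.frame(
  MARKER      = c("rs1", "rs2", "rs3", "rs4"),
  FREQLABEL   = c(0.005, 0.20, 0.98, 0.50),
  IMP_QUALITY = c(0.9, 0.35, 0.8, 0.6),
  stringsAsFactors = FALSE
)

# Default thresholds (MAF > 0.01, imputation quality > 0.3).
kept_default <- apply_hq_filter(gwas)
# Thresholds corresponding to "HQ_SNP 0.03 0.4".
kept_strict <- apply_hq_filter(gwas, 0.03, 0.4)
print(kept_default$MARKER)
print(kept_strict$MARKER)
```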

Inflation Factor and Genomic Control

By default, gwasformat() does not calculate the inflation factor and does not apply genomic control. This behaviour can be changed for every input file in the input script using the command GC (or GENOMICCONTROL) as specified below:

Argument       Description
OFF            The inflation factor is not calculated and genomic control is not applied
ON             The inflation factor is calculated. Values in PVALUE and STDERR columns are corrected and saved to the new columns PVALUE_gc and STDERR_gc, accordingly. Has no effect if PVALUE column is not present.
numeric value  The inflation factor is assumed to be equal to the specified numeric value. Values in PVALUE and STDERR columns are corrected and saved to the new columns PVALUE_gc and STDERR_gc, accordingly.

If the inflation factor value is less than 1.0, then the genomic control is not applied.

Example:

GC ON
PROCESS input_file_1.txt
GC OFF
PROCESS input_file_2.csv
GC 1.1
PROCESS input_file_3.txt
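The exact formulas used by gwasformat() are not given here. As a sketch, the widely used form of genomic control estimates the inflation factor as the median of the 1-df chi-square statistics divided by the median of the chi-square distribution with 1 degree of freedom, multiplies standard errors by the square root of the inflation factor, and recomputes P-values from the deflated statistics. The function genomic_control below is hypothetical and follows that convention; returning the input unchanged when the inflation factor is below 1.0 is an assumption based on the note above.

```r
# Illustrative sketch of standard genomic control; not necessarily the
# computation performed by gwasformat().
genomic_control <- function(pvalue, stderr, lambda = NULL) {
  # Convert P-values to 1-df chi-square statistics.
  chisq <- qchisq(pvalue, df = 1, lower.tail = FALSE)
  if (is.null(lambda)) {
    # "GC ON": estimate the inflation factor from the data.
    lambda <- median(chisq, na.rm = TRUE) / qchisq(0.5, df = 1)
  }
  if (lambda < 1.0) {
    # Inflation factor below 1.0: no correction (values returned unchanged).
    return(list(lambda = lambda, PVALUE_gc = pvalue, STDERR_gc = stderr))
  }
  list(
    lambda    = lambda,
    PVALUE_gc = pchisq(chisq / lambda, df = 1, lower.tail = FALSE),
    STDERR_gc = stderr * sqrt(lambda)
  )
}

pvalue <- c(0.01, 0.5)
stderr <- c(0.1, 0.2)

# Analogous to "GC 1.1": a fixed inflation factor.
fixed <- genomic_control(pvalue, stderr, lambda = 1.1)
# An inflation factor below 1.0 leaves the values unchanged.
unchanged <- genomic_control(pvalue, stderr, lambda = 0.9)
# Analogous to "GC ON": the inflation factor is estimated.
estimated <- genomic_control(c(0.001, 0.01, 0.02), c(0.1, 0.1, 0.1))
print(fixed$STDERR_gc)
```

With an inflation factor above 1.0, the corrected P-values are larger and the corrected standard errors wider than the originals.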

Effective Sample Size

By default, gwasformat() computes the effective sample size from the IMP_QUALITY and N columns. The computed values are saved to the new column N_effective.
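The formula is not stated here. One common convention is to scale the sample size by the imputation quality, so that a perfectly imputed or genotyped marker keeps its full sample size. The sketch below uses that convention purely as an assumption; the helper effective_n is hypothetical and may differ from what gwasformat() computes.

```r
# Assumption: effective sample size = sample size * imputation quality.
# This may differ from the formula used by gwasformat().
effective_n <- function(n, imp_quality) {
  n * imp_quality
}

gwas <- data.frame(N = c(1000, 1000, 2000),
                   IMP_QUALITY = c(1.0, 0.5, 0.25))
gwas$N_effective <- effective_n(gwas$N, gwas$IMP_QUALITY)
print(gwas)
```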

Output File Name

The output file names are created by adding a prefix to the input file names. The prefix is specified with the command PREFIX.

Example:

PREFIX res_
PROCESS input_file_1.txt
PROCESS input_file_2.csv
PREFIX result_
PROCESS input_file_3.tab

All the output files corresponding to the input files ‘input_file_1.txt’ and ‘input_file_2.csv’ will be prefixed with res_; the output files corresponding to the input file ‘input_file_3.tab’ will be prefixed with result_.
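The naming rule can be sketched in base R. The helper output_name is hypothetical; it prefixes only the file name and discards any directory part, which is an assumption, since the location of the output files is not described here.

```r
# Build an output file name by prefixing the input file name.
# Assumption: only the file name (not the directory) receives the prefix.
output_name <- function(input_path, prefix) {
  paste0(prefix, basename(input_path))
}

print(output_name("input_file_1.txt", "res_"))
print(output_name("/dir_1/dir_2/input_file_2.csv", "res_"))
print(output_name("input_file_3.tab", "result_"))
```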

The Output Files

gwasformat() produces one formatted copy (renamed/re-ordered columns, genomic control correction, etc.) of every original GWAS data file. The formatting history, including the calculated inflation factors and the number of filtered markers, is saved to the log file specified by the logfile argument.

Author(s)

Daniel Taliun, Christian Fuchsberger, Cristian Pattaro

Examples

# name of an input script
script <- "GWASFORMAT_script.txt"

# name of a logfile
logfile <- "gwasformat_log.txt"

# load GWAtoolbox library
library(GWAtoolbox)

# show contents of the input script
file.show(script, title=script)

# run gwasformat() function
gwasformat(script, logfile)

GWAtoolbox documentation built on May 2, 2019, 4:54 p.m.