gwasformat: Formatting of GWAS data files.
In GWAtoolbox: GWAS Quality Control

Description Usage Arguments Specifying The Input Data Files Field Separator Renaming Columns Column Names Columns Ordering Case Sensitivity Specifying Filters Inflation Factor and Genomic Control Effective Sample Size Output File Name The Output Files Author(s) Examples

Renames and re-orders columns, sets tabulation as field (column) separator, calculates inflation factors and applies genomic control in GWAS data files.

gwasformat(script, logfile)

`script`	Name of a textual input file with processing instructions. The file should contain the names and locations of all GWAS data files to be processed along with basic information from each individual study.
`logfile`	Name of a log file with processing output. The output contains calculated inflation factors, total number of markers and number of filtered markers.

The names of the GWAS data files are specified in the input script with the command PROCESS (one line per file). A different directory path can be specified for each file.

Example:

PROCESS input_file_1.txt

PROCESS /dir_1/dir_2/input_file_2.csv

The formatting is applied first to ‘input_file_1.txt’ and then to ‘input_file_2.csv’.

The field (column) separator can differ between GWAS data files; during formatting it is changed to tabulation. gwasformat() automatically detects the original field separator of each input file based on the first 10 rows. However, the user can specify the original separator manually for each individual file using the command SEPARATOR. The supported arguments and the corresponding separators are listed below:

Argument	Separator
COMMA	comma
TAB	tabulation
WHITESPACE	whitespace
SEMICOLON	semicolon

Example:

PROCESS input_file_1.txt

SEPARATOR COMMA

PROCESS input_file_2.csv

PROCESS input_file_3.txt

For the input file ‘input_file_1.txt’ the field separator is determined automatically by the program, but for the input files ‘input_file_2.csv’ and ‘input_file_3.txt’ the separator is manually set to comma by the user. After formatting, all three files will have tabulation as the new field separator.
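The detection rule is not described here beyond "based on the first 10 rows". As a rough illustration only, the sketch below guesses a separator by choosing the candidate that splits every inspected line into the same number of fields (more than one). The helper name guess_separator and its tie-breaking rule are assumptions, not GWAtoolbox code.

```r
# Hypothetical helper: guess the field separator from the first n_rows lines.
# This is NOT the GWAtoolbox implementation, only an illustration of the idea.
guess_separator <- function(lines, n_rows = 10) {
  lines <- head(lines[nzchar(lines)], n_rows)
  candidates <- c(COMMA = ",", TAB = "\t", SEMICOLON = ";", WHITESPACE = " ")
  best <- NA_character_
  best_fields <- 1L
  for (name in names(candidates)) {
    # Count the fields obtained on each line when splitting on this candidate
    n_fields <- lengths(strsplit(lines, candidates[[name]], fixed = TRUE))
    # A plausible separator gives the same field count (> 1) on every line;
    # among plausible candidates, keep the one producing the most fields
    if (length(unique(n_fields)) == 1L && n_fields[1] > best_fields) {
      best <- name
      best_fields <- n_fields[1]
    }
  }
  best
}

detected <- guess_separator(c("marker,chr,bp", "rs1,1,1000", "rs2,1,2000"))
```

The returned name matches the SEPARATOR arguments listed above, so a result that looks wrong can be overridden by writing the corresponding SEPARATOR line into the input script.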

The original column names in the GWAS data files are renamed using the command RENAME in the input script. The command is followed by two words: the first one corresponds to the original column name, and the second one corresponds to the new column name. The column names can't contain tabulation or space characters.

Example:

Suppose there are three input files: ‘input_file_1.txt’, ‘input_file_2.csv’ and ‘input_file_3.txt’. Each file has a column named marker, which should be renamed. The new column name should be SNPID for ‘input_file_1.txt’, and rsId for ‘input_file_2.csv’ and ‘input_file_3.txt’. The correct column renaming is as follows:

RENAME marker SNPID

PROCESS input_file_1.txt

RENAME marker rsId

PROCESS input_file_2.csv

PROCESS input_file_3.txt

The table below lists all default column names for a GWAS data file. These names uniquely identify the items in the GWAS data file.

Default column name(s)	Description
MARKER	Marker name
CHR	Chromosome number or name
POSITION	Marker position
ALLELE1, ALLELE2	Coded and non-coded alleles
FREQLABEL	Allele frequency for the coded allele
STRAND	Strand
IMPUTED	Label value indicating if the marker was imputed (1) or genotyped (0)
IMP_QUALITY	Imputation quality statistics; this can be different depending on the software used for imputation: MACH's Rsq, IMPUTE's properinfo, ...
EFFECT	Effect size
STDERR	Standard error
PVALUE	P-value
HWE_PVAL	Hardy-Weinberg equilibrium p-value
CALLRATE	Genotype callrate
N	Sample size
USED_FOR_IMP	Label value indicating if a marker was used for imputation (1) or not (0)
AVPOSTPROB	Average posterior probability for imputed marker allele dosage

Because column names can differ between GWAS data files, gwasformat() allows the default names to be redefined for every input file in the input script. The redefinition command consists of the default column name followed by the column name actually present in the file. To redefine the default column names for the coded and non-coded alleles, use the command ALLELE followed by the two column names present in the file. If a column was renamed with the command RENAME, then the new column name must be used in the redefinition command.

Example 1:

Suppose there are two input files, ‘input_file_1.txt’ and ‘input_file_2.csv’. In ‘input_file_1.txt’, the column names for the P-value and the standard error are pval and SE, respectively. In ‘input_file_2.csv’, the column name for the P-value is the same as in ‘input_file_1.txt’, but the column name for the standard error is STDERR. The correct column redefinition is as follows:

PVALUE pval

STDERR SE

PROCESS input_file_1.txt

STDERR STDERR

PROCESS input_file_2.csv

There is no need to redefine the PVALUE field for ‘input_file_2.csv’, because the earlier definition remains in effect for the files that follow. Alternatively, if the column pval in ‘input_file_1.txt’ and ‘input_file_2.csv’ needs to be renamed to p-value, then the input script is as follows:

RENAME pval p-value

PVALUE p-value

PROCESS input_file_1.txt

STDERR STDERR

PROCESS input_file_2.csv

Example 2:

Consider an input file, ‘input_file_1.txt’, with the following names for ALLELE1 and ALLELE2: myRefAllele and myNonRefAllele. The new column definition is applied as follows:

ALLELE myRefAllele myNonRefAllele

PROCESS input_file_1.txt

By default, gwasformat() does not change the original ordering of columns in the input file. This behaviour can be modified for every input file in the input script using the command ORDER as specified below:

Argument	Description
OFF	The original ordering of columns is preserved
ON	Columns are re-ordered following the alphabetical ordering
ON column_1 column_2 ... column_n	Columns are re-ordered following the specified order column_1 column_2 ... column_n

Example:

Suppose there are three input files: ‘input_file_1.txt’, ‘input_file_2.csv’ and ‘input_file_3.txt’. Each file contains the columns marker, chromosome and bp, in that order. The following input script renames the column marker to SNPID and switches the ordering mode for every input file:

RENAME marker SNPID

MARKER SNPID

CHR chromosome

POSITION bp

ORDER ON chromosome bp SNPID

PROCESS input_file_1.txt

ORDER OFF

PROCESS input_file_2.csv

ORDER ON

PROCESS input_file_3.txt

For the input file ‘input_file_1.txt’ the columns are re-ordered to: chromosome, bp, SNPID. For the input file ‘input_file_2.csv’ the original ordering of columns is preserved: SNPID, chromosome, bp. For the input file ‘input_file_3.txt’ the columns are re-ordered following the alphabetical ordering: bp, chromosome, SNPID.
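The two ON modes can be mimicked in plain R to check what column order to expect. The sketch below is only an illustration and not GWAtoolbox code; in particular, it assumes that the alphabetical comparison ignores letter case, which is consistent with the result bp, chromosome, SNPID described above.

```r
# Column names after renaming, in their original order
columns <- c("SNPID", "chromosome", "bp")

# ORDER ON column_1 ... column_n: follow the explicitly requested order
requested <- c("chromosome", "bp", "SNPID")
explicit_order <- columns[match(requested, columns)]

# ORDER ON: alphabetical order (assumed here to ignore letter case)
alphabetical_order <- columns[order(tolower(columns))]
```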

By default, gwasformat() treats column names in the input files as case insensitive. For example, the column names STDERR, StdErr, and STDErr are all equivalent. This behaviour can be modified for every input file in the input script using the command CASESENSITIVE, which controls case sensitivity for the column names, as specified below:

Argument	Description
0	Column names in the input file are case insensitive (default)
1	Column names in the input file are case sensitive

Example:

CASESENSITIVE 1

PROCESS input_file_1.txt

CASESENSITIVE 0

PROCESS input_file_2.csv

gwasformat() filters SNPs based on minor allele frequency (MAF) and imputation quality. The default thresholds are listed below:

Default column name	Default thresholds
FREQLABEL	> 0.01
IMP_QUALITY	> 0.3

The default values can be redefined using the command HQ_SNP for every input file in the input script. The command is followed by two values: the first one corresponds to the threshold for the minor allele frequency, and the second one corresponds to the threshold for the imputation quality.

Example 1:

If we want to filter SNPs with MAF > 0.03 and with imputation quality > 0.4, we would add the following lines to the input script:

HQ_SNP 0.03 0.4

PROCESS input_file_1.txt

Example 2:

If we want to disable filtering, we would change the input script as follows:

HQ_SNP 0 0

PROCESS input_file_1.txt
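The thresholds can be pictured as a row-wise test on each marker. The sketch below flags markers that exceed both thresholds; it assumes that MAF is derived from the coded-allele frequency f as min(f, 1 - f). That derivation, the helper name passes_hq and the toy data are assumptions made for illustration, not GWAtoolbox code.

```r
# Hypothetical illustration of the HQ_SNP thresholds; not GWAtoolbox code.
passes_hq <- function(freq, imp_quality, maf_min = 0.01, imp_min = 0.3) {
  maf <- pmin(freq, 1 - freq)  # assumed: MAF from the coded-allele frequency
  maf > maf_min & imp_quality > imp_min
}

# Toy data using the default column names
gwas <- data.frame(
  MARKER      = c("rs1", "rs2", "rs3"),
  FREQLABEL   = c(0.005, 0.40, 0.98),
  IMP_QUALITY = c(0.90, 0.20, 0.95)
)
gwas$HQ <- passes_hq(gwas$FREQLABEL, gwas$IMP_QUALITY)
```

In this toy data only rs3 exceeds both default thresholds: rs1 is too rare and rs2 has low imputation quality.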

By default, gwasformat() does not calculate the inflation factor and does not apply genomic control. This behaviour can be modified for every input file in the input script using the command GC (or GENOMICCONTROL) as specified below:

Argument	Description
OFF	The inflation factor is not calculated and genomic control is not applied
ON	The inflation factor is calculated. Values in PVALUE and STDERR columns are corrected and saved to the new columns PVALUE_gc and STDERR_gc, respectively. Has no effect if PVALUE column is not present.
numeric value	The inflation factor is assumed to be equal to the specified numeric value. Values in PVALUE and STDERR columns are corrected and saved to the new columns PVALUE_gc and STDERR_gc, respectively.

If the inflation factor value is less than 1.0, then the genomic control is not applied.

Example:

GC ON

PROCESS input_file_1.txt

GC OFF

PROCESS input_file_2.csv

GC 1.1

PROCESS input_file_3.txt
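For orientation, the sketch below shows the textbook form of genomic control: the inflation factor is the median of the 1-df chi-square statistics divided by the expected median, standard errors are multiplied by the square root of the inflation factor, and P-values are recomputed from the deflated statistics. Whether gwasformat() uses exactly these formulas is an assumption, and the function name genomic_control is hypothetical.

```r
# Textbook genomic control; an illustration, not the GWAtoolbox implementation.
genomic_control <- function(pvalue, se, lambda = NULL) {
  # Convert P-values to 1-df chi-square statistics
  chisq <- qchisq(pvalue, df = 1, lower.tail = FALSE)
  if (is.null(lambda)) {
    # Inflation factor: observed median over expected median (about 0.455)
    lambda <- median(chisq, na.rm = TRUE) / qchisq(0.5, df = 1)
  }
  if (lambda < 1) {
    # Mirrors the rule above: no correction when the factor is below 1.0
    return(list(lambda = lambda, PVALUE_gc = pvalue, STDERR_gc = se))
  }
  list(
    lambda    = lambda,
    PVALUE_gc = pchisq(chisq / lambda, df = 1, lower.tail = FALSE),
    STDERR_gc = se * sqrt(lambda)
  )
}

# Counterpart of "GC 1.1": use a fixed inflation factor
p   <- c(0.5, 0.01, 0.2)
se  <- c(0.1, 0.1, 0.1)
res <- genomic_control(p, se, lambda = 1.1)
```

With an inflation factor above 1, the corrected P-values are never smaller than the originals and the corrected standard errors are larger.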

By default, gwasformat() computes the effective sample size based on the IMP_QUALITY and N columns. The computed values are saved to the new column N_effective.
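The formula is not stated here. A convention often used in GWAS meta-analysis is to scale the sample size by the imputation quality; the sketch below uses that convention purely as an assumption, and gwasformat() may compute N_effective differently.

```r
# Assumed convention: effective sample size = N * imputation quality.
# Illustration only; not taken from the GWAtoolbox source.
gwas <- data.frame(
  MARKER      = c("rs1", "rs2"),
  N           = c(1000, 1000),
  IMP_QUALITY = c(1.0, 0.5)
)
gwas$N_effective <- gwas$N * gwas$IMP_QUALITY
```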

The output file names are created by adding a prefix to the input file names. The prefix is specified with the command PREFIX.

Example:

PREFIX res_

PROCESS input_file_1.txt

PROCESS input_file_2.csv

PREFIX result_

PROCESS input_file_3.tab

All the output files corresponding to the input files ‘input_file_1.txt’ and ‘input_file_2.csv’ will be prefixed with res_; the output files corresponding to the input file ‘input_file_3.tab’ will be prefixed with result_.

gwasformat() produces one formatted copy (with renamed and re-ordered columns, genomic control correction, etc.) of every original GWAS data file. The formatting history, including the calculated inflation factors and the number of filtered markers, is saved to the log file under the provided logfile name.

Daniel Taliun, Christian Fuchsberger, Cristian Pattaro

	
	
# name of an input script
script <- "GWASFORMAT_script.txt"

# name of a logfile
logfile <- "gwasformat_log.txt"

# load GWAtoolbox library
library(GWAtoolbox)

# show contents of the input script
file.show(script, title=script)

# run gwasformat() function
gwasformat(script, logfile)
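The example above assumes that the input script already exists. One possible way to create such a script from R, using only commands described on this page, is sketched below. The data file name, the column names pval and SE, and the choice of a temporary directory are placeholders chosen for illustration.

```r
# Write a small input script; all file and column names are placeholders.
script <- file.path(tempdir(), "GWASFORMAT_script.txt")

script_lines <- c(
  "SEPARATOR COMMA",           # original separator of the data file
  "PVALUE pval",               # map default names to the names in the file
  "STDERR SE",
  "HQ_SNP 0.01 0.3",           # MAF and imputation quality thresholds
  "GC ON",                     # calculate the inflation factor, apply GC
  "PREFIX res_",               # prefix for the output file name
  "PROCESS input_file_1.csv"   # settings must precede the file they apply to
)
writeLines(script_lines, script)
```

Settings are written before the PROCESS line because, as in the examples above, each command applies to the files processed after it.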

GWAtoolbox documentation built on May 2, 2019, 4:54 p.m.
