Formats headers, orders columns, calculates inflation factors and applies genomic control in GWAS result files.
The function is analogous to gwasformat() and supports parallel processing of multiple GWAS data files.
The parallelization is implemented with the snow package using the “SOCK” cluster type.
pgwasformat(script, logfile, processes)
script | Name of a text input file with processing instructions. The file should contain the names and locations of all GWAS data files to be processed, along with basic information from each individual study and instructions for the quality check. |
logfile | Name of a log file with processing output. The output contains the calculated inflation factors, the total number of markers and the number of filtered markers. |
processes | An integer greater than 1 that specifies the number of parallel processes. All processes are created on the local host and communicate through sockets. |
The names of the GWAS data files are specified in the input script with the command PROCESS (one line per file). A different directory path can be specified for each file.
Example:
PROCESS input_file_1.txt
PROCESS /dir_1/dir_2/input_file_2.csv
The formatting is applied first to ‘input_file_1.txt’ and then to ‘input_file_2.csv’.
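As a minimal sketch, an input script of this kind can also be generated from R before calling pgwasformat(). The file names and the script location below are hypothetical; only the PROCESS command itself comes from the description above.

```r
# Hypothetical GWAS data files, listed in the order they should be formatted.
input_files <- c("input_file_1.txt", "/dir_1/dir_2/input_file_2.csv")

# One PROCESS command per file.
script_lines <- paste("PROCESS", input_files)

# Write the commands to a script file (a temporary location is assumed here).
script_path <- file.path(tempdir(), "pgwasformat_script.txt")
writeLines(script_lines, script_path)
```

The resulting path could then be passed as the script argument.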
The field (column) separator can differ between GWAS data files; during the formatting it is changed to tabulation.
pgwasformat() automatically detects the original field separator of each input file based on its first 10 rows.
However, the user can specify the original separator manually for each individual file using the command SEPARATOR.
The supported arguments and related separators are listed below:
Argument | Separator |
COMMA | comma |
TAB | tabulation |
WHITESPACE | whitespace |
SEMICOLON | semicolon |
Example:
PROCESS input_file_1.txt
SEPARATOR COMMA
PROCESS input_file_2.csv
PROCESS input_file_3.txt
For the input file ‘input_file_1.txt’, the field separator is determined automatically by the program, but for the input files ‘input_file_2.csv’ and ‘input_file_3.txt’ the separator is manually set to comma by the user. After the formatting, all three files will have tabulation as the new field separator.
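The detection rule used internally is not documented here. The sketch below shows one plausible heuristic for guessing a separator from the first 10 rows: a candidate is accepted if it occurs at least once and equally often on every inspected line. The function guess_separator() and its priority order are assumptions made for illustration, not part of GWAtoolbox.

```r
# Hypothetical helper: guess the field separator from the first rows of a file.
guess_separator <- function(path, n_rows = 10) {
  lines <- readLines(path, n = n_rows, warn = FALSE)
  if (length(lines) == 0) return(NA_character_)
  candidates <- c(COMMA = ",", TAB = "\t", SEMICOLON = ";", WHITESPACE = " ")
  # Count how often each candidate occurs on every line.
  counts <- sapply(candidates, function(sep) {
    lengths(regmatches(lines, gregexpr(sep, lines, fixed = TRUE)))
  })
  counts <- matrix(counts, nrow = length(lines),
                   dimnames = list(NULL, names(candidates)))
  # Accept a candidate that appears at least once and equally often on all lines.
  consistent <- apply(counts, 2, function(x) x[1] > 0 && all(x == x[1]))
  if (!any(consistent)) return(NA_character_)
  names(which(consistent))[1]
}
```

When such a heuristic is ambiguous for a given file, setting the separator explicitly with SEPARATOR avoids relying on detection.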
The original column names in the GWAS data files are renamed using the command RENAME in the input script. The command is followed by two words: the first one corresponds to the original column name, and the second one corresponds to the new column name. The column names can't contain tabulation or space characters.
Example:
Assume there are three input files: ‘input_file_1.txt’, ‘input_file_2.csv’ and ‘input_file_3.txt’. Each file has a column named marker, which should be renamed. The new column name should be SNPID for ‘input_file_1.txt’, and rsId for ‘input_file_2.csv’ and ‘input_file_3.txt’. The correct column renaming is as follows:
RENAME marker SNPID
PROCESS input_file_1.txt
RENAME marker rsId
PROCESS input_file_2.csv
PROCESS input_file_3.txt
In the table below, the complete list of the default column names for the GWAS data file is reported. These names identify uniquely the items in the GWAS data file.
Default column name(s) | Description |
MARKER | Marker name |
CHR | Chromosome number or name |
POSITION | Marker position |
ALLELE1, ALLELE2 | Coded and non-coded alleles |
FREQLABEL | Allele frequency for the coded allele |
STRAND | Strand |
IMPUTED | Label value indicating if the marker was imputed (1) or genotyped (0) |
IMP_QUALITY | Imputation quality statistics; this can be different depending on the software used for imputation: MACH's Rsq, IMPUTE's properinfo, ... |
EFFECT | Effect size |
STDERR | Standard error |
PVALUE | P-value |
HWE_PVAL | Hardy-Weinberg equilibrium p-value |
CALLRATE | Genotype callrate |
N | Sample size |
USED_FOR_IMP | Label value indicating if a marker was used for imputation (1) or not (0) |
AVPOSTPROB | Average posterior probability for imputed marker allele dosage |
Because the column names can differ between GWAS data files, pgwasformat() allows the default names to be redefined for every input file in the input script.
The redefinition command consists of the default column name followed by the column name present in the file.
To redefine the default column names for the coded and non-coded alleles, the command ALLELE followed by the two present column names is used.
If a present column name was renamed with the command RENAME, then the new column name must be used in the redefinition command.
Example 1:
Assume there are two input files, ‘input_file_1.txt’ and ‘input_file_2.csv’. In ‘input_file_1.txt’, the column names for the P-value and the standard error are pval and SE, respectively. In ‘input_file_2.csv’, the column name for the P-value is the same as in ‘input_file_1.txt’, but the column name for the standard error is STDERR. The correct column redefinition is as follows:
PVALUE pval
STDERR SE
PROCESS input_file_1.txt
STDERR STDERR
PROCESS input_file_2.csv
There is no need to redefine the PVALUE field for ‘input_file_2.csv’, because the definition given before ‘input_file_1.txt’ still applies. Alternatively, if the column pval in ‘input_file_1.txt’ and ‘input_file_2.csv’ needs to be renamed to p-value, then the input script is as follows:
RENAME pval p-value
PVALUE p-value
PROCESS input_file_1.txt
STDERR STDERR
PROCESS input_file_2.csv
Example 2:
Consider an input file, ‘input_file_1.txt’, with the following names for ALLELE1 and ALLELE2: myRefAllele and myNonRefAllele. The new column definition is applied as follows:
ALLELE myRefAllele myNonRefAllele
PROCESS input_file_1.txt
By default, pgwasformat() does not change the original ordering of columns in the input file.
This behaviour can be modified for every input file in the input script using the command ORDER as specified below:
Argument | Description |
OFF | The original ordering of columns is preserved |
ON | Columns are re-ordered following the alphabetical ordering |
ON column_1 column_2 ... column_n | Columns are re-ordered following the specified order column_1 column_2 ... column_n |
Example:
Assume there are three input files: ‘input_file_1.txt’, ‘input_file_2.csv’ and ‘input_file_3.txt’. Each file contains the columns marker, chromosome and bp, in that order. The following input script renames the column marker to SNPID and sets a different ordering mode for each input file:
RENAME marker SNPID
MARKER SNPID
CHR chromosome
POSITION bp
ORDER ON chromosome bp SNPID
PROCESS input_file_1.txt
ORDER OFF
PROCESS input_file_2.csv
ORDER ON
PROCESS input_file_3.txt
For the input file ‘input_file_1.txt’ the columns are re-ordered to: chromosome, bp, SNPID. For the input file ‘input_file_2.csv’ the original ordering of columns is preserved: SNPID, chromosome, bp. For the input file ‘input_file_3.txt’ the columns are re-ordered following the alphabetical ordering: bp, chromosome, SNPID.
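The three ordering modes can be mimicked on an R data frame as in the sketch below. The function reorder_columns() is hypothetical. Two details are assumptions: alphabetical ordering ignores letter case (consistent with bp, chromosome, SNPID in the example above), and columns not listed explicitly are appended in their original order.

```r
# Hypothetical illustration of ORDER OFF, ORDER ON and ORDER ON column_1 ... column_n.
reorder_columns <- function(df, mode = "OFF", columns = NULL) {
  if (mode == "OFF") {
    return(df)  # original ordering is preserved
  }
  if (is.null(columns)) {
    # Alphabetical ordering, ignoring letter case (assumption).
    return(df[, order(tolower(names(df))), drop = FALSE])
  }
  # Listed columns first; unlisted columns keep their original order (assumption).
  df[, c(columns, setdiff(names(df), columns)), drop = FALSE]
}

gwas <- data.frame(SNPID = "rs1", chromosome = 1, bp = 12345)
names(reorder_columns(gwas, "ON"))
names(reorder_columns(gwas, "ON", c("chromosome", "bp", "SNPID")))
```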
By default, pgwasformat() assumes that column names in the input files are case insensitive.
For example, the column names STDERR, StdErr, and STDErr are all equivalent.
This behaviour can be modified for every input file in the input script using the command CASESENSITIVE, which controls case sensitivity for the column names, as specified below:
Argument | Description |
0 | Column names in the input file are case insensitive (default) |
1 | Column names in the input file are case sensitive |
Example:
CASESENSITIVE 1
PROCESS input_file_1.txt
CASESENSITIVE 0
PROCESS input_file_2.csv
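The effect of the two settings on column lookup can be illustrated with a small R sketch. The helper find_column() is hypothetical and only demonstrates the matching rule described above.

```r
# Hypothetical helper: locate a column by name, with or without case sensitivity.
find_column <- function(column_names, wanted, case_sensitive = FALSE) {
  if (case_sensitive) {
    which(column_names == wanted)
  } else {
    which(tolower(column_names) == tolower(wanted))
  }
}

header <- c("MARKER", "StdErr", "PVALUE")
find_column(header, "STDERR")                         # matches the second column
find_column(header, "STDERR", case_sensitive = TRUE)  # no match
```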
pgwasformat() filters SNPs based on minor allele frequency (MAF) and imputation quality.
The default thresholds are listed below:
Default column name | Default thresholds |
FREQLABEL | > 0.01 |
IMP_QUALITY | > 0.3 |
The default values can be redefined using the command HQ_SNP for every input file in the input script. The command is followed by two values: the first one corresponds to the threshold for the minor allele frequency, and the second one corresponds to the threshold for the imputation quality.
Example 1:
To keep only SNPs with MAF > 0.03 and imputation quality > 0.4, add the following lines to the input script:
HQ_SNP 0.03 0.4
PROCESS input_file_1.txt
Example 2:
If we want to disable filtering, we would change the input script as follows:
HQ_SNP 0 0
PROCESS input_file_1.txt
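A filter of this kind can be sketched in R as follows. This is not the package's code: the function hq_filter() is hypothetical, MAF is assumed to be derived from the coded allele frequency as min(f, 1 - f), and rows with missing values are assumed to be dropped.

```r
# Hypothetical illustration of the HQ_SNP thresholds on a data frame
# that uses the default column names FREQLABEL and IMP_QUALITY.
hq_filter <- function(df, maf = 0.01, imp_quality = 0.3) {
  # Minor allele frequency from the coded allele frequency (assumption).
  maf_observed <- pmin(df$FREQLABEL, 1 - df$FREQLABEL)
  keep <- maf_observed > maf & df$IMP_QUALITY > imp_quality
  # Rows with missing values are dropped (assumption).
  df[keep & !is.na(keep), , drop = FALSE]
}

gwas <- data.frame(
  MARKER      = c("rs1", "rs2", "rs3", "rs4"),
  FREQLABEL   = c(0.50, 0.005, 0.98, 0.20),
  IMP_QUALITY = c(0.90, 0.90, 0.90, 0.10)
)
hq_filter(gwas)             # default thresholds
hq_filter(gwas, 0.03, 0.4)  # thresholds from Example 1
```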
By default, pgwasformat() does not calculate the inflation factor and does not apply genomic control.
This behaviour can be modified for every input file in the input script using the command GC (or GENOMICCONTROL) as specified below:
Argument | Description |
OFF | The inflation factor is not calculated and genomic control is not applied |
ON | The inflation factor is calculated. Values in PVALUE and STDERR columns are corrected and saved to the new columns PVALUE_gc and STDERR_gc, accordingly. Has no effect if PVALUE column is not present. |
numeric value | The inflation factor is assumed to be equal to the specified numeric value. Values in PVALUE and STDERR columns are corrected and saved to the new columns PVALUE_gc and STDERR_gc, accordingly. |
If the inflation factor value is less than 1.0, then the genomic control is not applied.
Example:
GC ON
PROCESS input_file_1.txt
GC OFF
PROCESS input_file_2.csv
GC 1.1
PROCESS input_file_3.txt
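The exact formulas used by the package are not given here. The sketch below follows the conventional genomic control procedure: the inflation factor is the median of the 1-df chi-square statistics divided by the expected median, the statistics are divided by that factor, and standard errors are multiplied by its square root. The function genomic_control() is hypothetical, and returning the uncorrected values when the factor is below 1.0 is an assumption.

```r
# Hypothetical illustration of conventional genomic control.
genomic_control <- function(pvalue, se, lambda = NULL) {
  # Convert p-values to 1-df chi-square statistics.
  chisq <- qchisq(pvalue, df = 1, lower.tail = FALSE)
  if (is.null(lambda)) {
    # Inflation factor: observed median over the expected median (about 0.455).
    lambda <- median(chisq, na.rm = TRUE) / qchisq(0.5, df = 1)
  }
  if (lambda < 1.0) {
    # No correction for factors below 1.0; values are returned unchanged (assumption).
    return(list(lambda = lambda, PVALUE_gc = pvalue, STDERR_gc = se))
  }
  list(
    lambda    = lambda,
    PVALUE_gc = pchisq(chisq / lambda, df = 1, lower.tail = FALSE),
    STDERR_gc = se * sqrt(lambda)
  )
}

# Fixed inflation factor, analogous to "GC 1.21" in the input script.
corrected <- genomic_control(c(0.5, 0.01), c(0.1, 0.2), lambda = 1.21)
```

With a factor above 1.0, corrected p-values become larger and standard errors wider, which is the intended conservative adjustment.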
By default, pgwasformat() computes the effective sample size based on the IMP_QUALITY and N columns.
The computed values are saved to the new column N_effective.
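The formula is not stated here. A common convention, used in the sketch below purely as an assumption, is to multiply the sample size by the imputation quality, with the quality bounded to the interval [0, 1]. The function effective_n() is hypothetical.

```r
# Hypothetical illustration: effective sample size as N times imputation quality.
# Both the formula and the clamping to [0, 1] are assumptions.
effective_n <- function(n, imp_quality) {
  n * pmin(pmax(imp_quality, 0), 1)
}

gwas <- data.frame(N = c(1000, 2000), IMP_QUALITY = c(0.5, 1.2))
gwas$N_effective <- effective_n(gwas$N, gwas$IMP_QUALITY)
```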
The output file names are created by adding a prefix to the input file names. The prefix is specified with the command PREFIX.
Example:
PREFIX res_
PROCESS input_file_1.txt
PROCESS input_file_2.csv
PREFIX result_
PROCESS input_file_3.tab
All the output files corresponding to the input files ‘input_file_1.txt’ and ‘input_file_2.csv’ will be prefixed with res_; the output files corresponding to the input file ‘input_file_3.tab’ will be prefixed with result_.
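The naming rule can be sketched in R as below. The helper output_name() is hypothetical and only builds the file name; the directory in which the output is written is not specified here.

```r
# Hypothetical helper: output file name = prefix + input file name.
output_name <- function(input_path, prefix) {
  paste0(prefix, basename(input_path))
}

output_name("input_file_1.txt", "res_")
output_name("/dir_1/dir_2/input_file_3.tab", "result_")
```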
pgwasformat() produces one formatted copy (renamed and re-ordered columns, genomic control correction, etc.) of every original GWAS data file.
The formatting history, containing the calculated inflation factors and the number of filtered markers, is saved to the log file under the provided logfile name.
Daniel Taliun, Christian Fuchsberger, Cristian Pattaro
## Not run:
# all input and output files are located in the subdirectory "doc" of the installed GWAtoolbox package
# change the workspace
currentWd <- getwd()
newWd <- paste(system.file(package="GWAtoolbox"), "doc", sep="/")
setwd(newWd)
# name of an input script
script <- "PGWASFORMAT_script.txt"
# name of a logfile
logfile <- "pgwasformat_log.txt"
# load GWAtoolbox library
library(GWAtoolbox)
# show contents of the input script
file.show(script, title=script)
# run pgwasformat() function with 2 parallel processes
pgwasformat(script, logfile, 2)
# restore previous workspace
setwd(currentWd)
## End(Not run)