pannotate: SNPs annotation with regions (e.g. genes).
In GWAtoolbox: GWAS Quality Control

Description Usage Arguments Details Specifying The Input Data Files Specifying The Regions Files Specifying The Map Files Specifying Column Names in Input Data Files Specifying Column Names in Regions Files Specifying Column Names in Map Files Field Separator in Input Data Files Field Separator in Regions Files Field Separator in Map Files Case Sensitivity Specifying Window Size For Annotation Specifying Output Format Output File Name Author(s) Examples

For the provided SNPs, finds all genes within the specified deviation (e.g. 0, +/- 50000, +/- 100000, ...). The function is analogous to annotate and supports parallel processing of multiple GWAS data files. The parallelization is implemented with the snow package using the “SOCK” cluster type.

pannotate(script, processes)

`script`	Name of a textual input file with processing instructions. The file should contain the names and locations of all GWAS data files to be annotated along with basic information from each individual study.
`processes`	An integer greater than 1 that specifies the number of parallel processes. All processes are created on the local host and communicate through sockets.

The function pannotate() annotates every marker in the input files with the regions (e.g. genes) that contain it or fall within a specified window around it (e.g. +/-50kb, +/-100kb, etc.). An arbitrary number of windows of various sizes can be specified in the input script. The regions and their chromosomal coordinates must be provided in a separate file. Markers can also be annotated when only their names (e.g. rsId) are available in the input files, or when their chromosomal positions need to be changed (e.g. when a different version of the human genome build should be used). In these cases, the chromosomal positions must be provided in a separate map file.

The names of the GWAS data files are specified in the input script with the command PROCESS (one line per file). A different directory path can be specified for each file.

Example:

PROCESS input_file_1.txt

PROCESS /dir_1/dir_2/input_file_2.csv

The annotation is applied first to ‘input_file_1.txt’ and then to ‘input_file_2.csv’.

The names of the regions files (e.g. files with genes) are specified in the input script with the command REGIONS_FILE. Within the same script, different regions files can be specified for different GWAS data files. A different directory path can also be specified for each regions file.

Example:

REGIONS_FILE genes_file_1.txt

PROCESS input_file_1.txt

REGIONS_FILE /dir_1/dir_2/genes_file_2.csv

PROCESS input_file_2.csv

PROCESS input_file_3.txt

The annotation is applied first to ‘input_file_1.txt’ using regions from ‘genes_file_1.txt’ file. Then, files ‘input_file_2.csv’ and ‘input_file_3.txt’ are annotated with regions in ‘genes_file_2.csv’ file.
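The input script is an ordinary text file, so it can also be generated from R. The following is a minimal sketch that writes the commands from the example above to a temporary file; the file names are the illustrative ones used in this section, and the variable names `script_lines`, `script` and `written` are assumptions made for this sketch.

```r
# File names below are the illustrative ones used in the example above.
script_lines <- c(
  "REGIONS_FILE genes_file_1.txt",
  "PROCESS input_file_1.txt",
  "REGIONS_FILE /dir_1/dir_2/genes_file_2.csv",
  "PROCESS input_file_2.csv",
  "PROCESS input_file_3.txt"
)

# Write the commands to a temporary script file, one command per line.
script <- file.path(tempdir(), "PANNOTATE_script.txt")
writeLines(script_lines, script)

# Read the file back to check what was written.
written <- readLines(script)
print(written)
```

The resulting path can then be passed as the `script` argument of pannotate(), provided that the referenced data and regions files exist.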

The names of the map files are specified in the input script with the command MAP_FILE. Within the same script, different map files can be specified for different GWAS data files. A different directory path can also be specified for each map file.

Example:

MAP_FILE map_file_1.txt

REGIONS_FILE genes_file_1.txt

PROCESS input_file_1.txt

MAP_FILE /dir_1/dir_2/map_file_2.csv

REGIONS_FILE /dir_1/dir_2/genes_file_2.csv

PROCESS input_file_2.csv

PROCESS input_file_3.txt

The annotation is applied first to ‘input_file_1.txt’ using marker genomic positions in ‘map_file_1.txt’ file and regions in ‘genes_file_1.txt’ file. Then, files ‘input_file_2.csv’ and ‘input_file_3.txt’ are annotated with regions in ‘genes_file_2.csv’ file using marker genomic positions in ‘map_file_2.csv’.
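Conceptually, a map file supplies the chromosome and position for each marker name. The sketch below illustrates that lookup on toy data frames in plain R; it is not the package's implementation, the marker names and positions are invented for illustration, and how pannotate() treats markers that are missing from the map file is not described here (in this sketch they simply get NA).

```r
# Toy GWAS data with marker names only (no chromosome or position columns).
gwas <- data.frame(MARKER = c("rs1", "rs2", "rs3"),
                   stringsAsFactors = FALSE)

# Toy map file contents, using the default map column names.
map <- data.frame(MAP_MARKER   = c("rs3", "rs1"),
                  MAP_CHR      = c("2", "1"),
                  MAP_POSITION = c(5000L, 1200L),
                  stringsAsFactors = FALSE)

# Look up each marker in the map; markers absent from the map get NA.
idx <- match(gwas$MARKER, map$MAP_MARKER)
gwas$CHR      <- map$MAP_CHR[idx]
gwas$POSITION <- map$MAP_POSITION[idx]
print(gwas)
```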

The table below lists the default column names for the GWAS data file. These names uniquely identify the items in the GWAS data file.

Default column name(s)	Description
MARKER	Marker name
CHR	Chromosome number or name
POSITION	Marker position

Because each GWAS data file may use different column names, pannotate() allows the default names to be redefined for every input file in the input script. The redefinition command consists of the default column name followed by the column name used in the file. When a map file is specified with the command MAP_FILE, the CHR and POSITION columns are not required in the GWAS data file.

Example:

Assume there are two input files, ‘input_file_1.txt’ and ‘input_file_2.csv’. In ‘input_file_1.txt’, the column names for the marker name, chromosome name and position are SNPID, CHR and POS, respectively. In ‘input_file_2.csv’, the column name for the marker name is the same as in ‘input_file_1.txt’, but the column names for the chromosome and position are chromosome and position, respectively. The correct column redefinition is as follows:

MARKER SNPID

POSITION POS

PROCESS input_file_1.txt

CHR chromosome

POSITION position

PROCESS input_file_2.csv

There is no need to define the CHR field for ‘input_file_1.txt’, since it matches the default name.
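The rule illustrated above (only columns whose names differ from the defaults need a redefinition line) can be sketched in R as follows. The helper `redefinition_lines()` is hypothetical and not part of GWAtoolbox; it compares names case-insensitively, which corresponds to the default case-insensitive matching of column names.

```r
# Build column-redefinition lines for one input file (hypothetical helper).
# `actual` is a named character vector: names are the default column names,
# values are the column names actually used in the file.
redefinition_lines <- function(actual) {
  defaults <- names(actual)
  used <- unname(actual)
  # Column names are matched case-insensitively by default, so compare in upper case.
  differs <- toupper(defaults) != toupper(used)
  paste(defaults[differs], used[differs])
}

# Column names of 'input_file_1.txt' from the example above.
file_1 <- c(MARKER = "SNPID", CHR = "CHR", POSITION = "POS")
lines_1 <- redefinition_lines(file_1)
print(lines_1)
```

For ‘input_file_1.txt’ this yields the two lines MARKER SNPID and POSITION POS; no line is produced for CHR.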

The table below lists the default column names for the regions file. These names uniquely identify the items in the regions file.

Default column name(s)	Description
REGION_NAME	Region name (e.g. gene name)
REGION_CHR	Chromosome number or name
REGION_START	Region (e.g. gene) start position
REGION_END	Region (e.g. gene) end position

Because each regions file may use different column names, pannotate() allows the default names to be redefined for every regions file in the input script. The redefinition command consists of the default column name followed by the column name used in the file.

Example:

Assume there are two regions files, ‘genes_file_1.txt’ and ‘genes_file_2.csv’. In ‘genes_file_1.txt’, the column names for the region name, chromosome, start and end positions are name, chr, REGION_START and REGION_END, respectively. In ‘genes_file_2.csv’, the column names for the region name and chromosome are the same as in ‘genes_file_1.txt’, but the column names for the region start and end positions are start and end, respectively. The correct column redefinition is as follows:

REGIONS_FILE genes_file_1.txt

REGION_NAME name

REGION_CHR chr

PROCESS input_file_1.txt

REGIONS_FILE genes_file_2.csv

REGION_START start

REGION_END end

PROCESS input_file_2.csv

There is no need to define the REGION_START and REGION_END fields for the ‘genes_file_1.txt’ regions file, since they match the default names. There is also no need to redefine the REGION_NAME and REGION_CHR fields for the ‘genes_file_2.csv’ regions file, since the earlier redefinitions still apply.

The table below lists the default column names for the map file. These names uniquely identify the items in the map file.

Default column name(s)	Description
MAP_MARKER	Marker name
MAP_CHR	Chromosome number or name
MAP_POSITION	Marker position

Because each map file may use different column names, pannotate() allows the default names to be redefined for every map file in the input script. The redefinition command consists of the default column name followed by the column name used in the file.

Example:

Assume there are two map files, ‘map_file_1.txt’ and ‘map_file_2.csv’. In ‘map_file_1.txt’, the column names for the marker name, chromosome and position are name, MAP_CHR and pos, respectively. In ‘map_file_2.csv’, the column names for the marker name and chromosome are the same as in ‘map_file_1.txt’, but the column name for the marker position is map_pos. The correct column redefinition is as follows:

MAP_FILE map_file_1.txt

MAP_MARKER name

MAP_POSITION pos

REGIONS_FILE genes_file_1.txt

PROCESS input_file_1.txt

MAP_FILE map_file_2.csv

MAP_POSITION map_pos

REGIONS_FILE genes_file_2.csv

PROCESS input_file_2.csv

There is no need to define the MAP_CHR field for either map file, since it matches the default name. There is also no need to redefine MAP_MARKER for the ‘map_file_2.csv’ map file, since the earlier redefinition still applies.

The field (column) separator can be different for each GWAS data file. pannotate() automatically detects the field separator of each input file based on the first 10 rows. However, the user can specify the separator manually for each individual file using the command SEPARATOR. The supported arguments and related separators are listed below:

Argument	Separator
COMMA	comma
TAB	tabulation
WHITESPACE	whitespace
SEMICOLON	semicolon

Example:

PROCESS input_file_1.txt

SEPARATOR COMMA

PROCESS input_file_2.csv

PROCESS input_file_3.txt

For the input file ‘input_file_1.txt’, the field separator is determined automatically by the program, but for the input files ‘input_file_2.csv’ and ‘input_file_3.txt’ the separator is manually set to comma by the user.
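The exact detection algorithm used by pannotate() is not described here. As an illustration only, the sketch below shows one plausible heuristic for guessing a separator from the first 10 rows: it accepts the first candidate character that occurs the same non-zero number of times in every inspected line. The function `guess_separator()` and the order in which candidates are tried are assumptions made for this sketch.

```r
# Guess the field separator from the first `n` lines of a file.
# Illustrative heuristic only; not the algorithm used by pannotate().
guess_separator <- function(path, n = 10) {
  lines <- readLines(path, n = n)
  if (length(lines) == 0) return(NA_character_)
  candidates <- c(COMMA = ",", TAB = "\t", SEMICOLON = ";", WHITESPACE = " ")
  for (name in names(candidates)) {
    # Count occurrences of the candidate character in each line.
    hits <- gregexpr(candidates[[name]], lines, fixed = TRUE)
    counts <- lengths(regmatches(lines, hits))
    # Accept a candidate that occurs equally often (and at least once) in every line.
    if (counts[1] > 0 && all(counts == counts[1])) return(name)
  }
  NA_character_
}

# Small comma-separated test file with invented content.
path <- tempfile(fileext = ".csv")
writeLines(c("MARKER,CHR,POSITION", "rs1,1,1200", "rs2,2,5000"), path)
sep <- guess_separator(path)
print(sep)
```

When such a guess would be ambiguous for a real file, setting the separator explicitly with SEPARATOR avoids relying on automatic detection.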

The field (column) separator can be different for each regions file. pannotate() automatically detects the field separator of each regions file based on the first 10 rows. However, the user can specify the separator manually for each individual file using the command REGIONS_FILE_SEPARATOR. The supported arguments and related separators are listed below:

Argument	Separator
COMMA	comma
TAB	tabulation
WHITESPACE	whitespace
SEMICOLON	semicolon

Example:

REGIONS_FILE genes_file_1.txt

PROCESS input_file_1.txt

REGIONS_FILE genes_file_2.csv

REGIONS_FILE_SEPARATOR COMMA

PROCESS input_file_2.csv

REGIONS_FILE genes_file_3.txt

PROCESS input_file_3.txt

For the regions file ‘genes_file_1.txt’, the field separator is determined automatically by the program, but for the regions files ‘genes_file_2.csv’ and ‘genes_file_3.txt’ the separator is manually set to comma by the user.

The field (column) separator can be different for each map file. pannotate() automatically detects the field separator of each map file based on the first 10 rows. However, the user can specify the separator manually for each individual file using the command MAP_FILE_SEPARATOR. The supported arguments and related separators are listed below:

Argument	Separator
COMMA	comma
TAB	tabulation
WHITESPACE	whitespace
SEMICOLON	semicolon

Example:

MAP_FILE map_file_1.txt

REGIONS_FILE genes_file_1.txt

PROCESS input_file_1.txt

MAP_FILE map_file_2.csv

MAP_FILE_SEPARATOR COMMA

REGIONS_FILE genes_file_2.csv

PROCESS input_file_2.csv

MAP_FILE map_file_3.txt

PROCESS input_file_3.txt

For the map file ‘map_file_1.txt’, the field separator is determined automatically by the program, but for the map files ‘map_file_2.csv’ and ‘map_file_3.txt’ the separator is manually set to comma by the user.

By default, pannotate() assumes that column names in all specified files are case insensitive. For example, the column names CHR, Chr, and chr are all equivalent. This behaviour can be modified for every input file in the input script using the command CASESENSITIVE, which controls case sensitivity for the column names, as specified below:

Argument	Description
0	Column names in the input file are case insensitive (default)
1	Column names in the input file are case sensitive

Example:

CASESENSITIVE 1

MAP_FILE map_file_1.txt

REGIONS_FILE genes_file_1.txt

PROCESS input_file_1.txt

CASESENSITIVE 0

MAP_FILE map_file_2.csv

REGIONS_FILE genes_file_2.csv

PROCESS input_file_2.csv
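The effect of the two settings can be illustrated with a small R sketch of column lookup in a file header. The helper `find_column()` is hypothetical and only mimics the described behaviour; it is not a GWAtoolbox function.

```r
# Find the index of a column in a header (hypothetical helper).
# Returns NA when the column is not present.
find_column <- function(header, name, case_sensitive = FALSE) {
  if (!case_sensitive) {
    # Case-insensitive matching: compare both sides in upper case.
    header <- toupper(header)
    name <- toupper(name)
  }
  match(name, header)
}

header <- c("Marker", "chr", "Position")

# Corresponds to CASESENSITIVE 0: "CHR" matches the column "chr".
insensitive <- find_column(header, "CHR")

# Corresponds to CASESENSITIVE 1: "CHR" does not match "chr".
sensitive <- find_column(header, "CHR", case_sensitive = TRUE)

print(c(insensitive = insensitive, sensitive = sensitive))
```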

Every marker in the GWAS data file is annotated with the regions (e.g. genes) that fall within a particular window around it. pannotate() allows multiple window sizes to be specified using the command REGIONS_DEVIATION. The command is followed by an arbitrary number of non-negative integers that specify window sizes around markers in base pairs. Each specified window size results in a new output column in which all regions overlapping with this window are reported. The output columns are ordered by window size, starting with the smallest. Therefore, every new output column represents a bigger window size and lists only those regions that were not reported previously. If REGIONS_DEVIATION is not specified, the default window sizes are 0, 100000 and 250000 (i.e. 0, +/-100kb and +/-250kb around the marker). If 0 is specified, only regions that include the marker are reported.

Example:

REGIONS_FILE genes_file_1.txt

REGIONS_DEVIATION 0 50000 100000

PROCESS input_file_1.txt

REGIONS_DEVIATION 0 100000 250000 500000

PROCESS input_file_2.csv

Every marker in ‘input_file_1.txt’ will be annotated with the regions that contain it or are within the +/-50kb and +/-100kb windows around it, while every marker in ‘input_file_2.csv’ will be annotated with the regions that contain it or are within the +/-100kb, +/-250kb and +/-500kb windows around it.
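The windowing rule, where each larger window reports only regions not already reported for a smaller one, can be sketched in plain R. This is a conceptual illustration on invented toy data, not the package's implementation: the function `annotate_marker()`, the output names `window_<size>`, and the use of inclusive window boundaries are all assumptions made for this sketch.

```r
# Annotate one marker with regions for increasing window sizes (conceptual sketch).
annotate_marker <- function(chr, pos, regions, windows = c(0, 100000, 250000)) {
  windows <- sort(unique(windows))
  same_chr <- regions[regions$REGION_CHR == chr, ]
  reported <- character(0)
  out <- list()
  for (w in windows) {
    # A region overlaps the window [pos - w, pos + w] if it starts before the
    # window ends and ends after the window starts (boundaries inclusive).
    hit <- same_chr$REGION_START <= pos + w & same_chr$REGION_END >= pos - w
    # Keep only regions that were not reported for a smaller window.
    new_names <- setdiff(same_chr$REGION_NAME[hit], reported)
    out[[paste0("window_", format(w, scientific = FALSE, trim = TRUE))]] <- new_names
    reported <- c(reported, new_names)
  }
  out
}

# Toy regions using the default regions column names.
regions <- data.frame(
  REGION_NAME  = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  REGION_CHR   = c("1", "1", "1", "2"),
  REGION_START = c(900, 40000, 90000, 1000),
  REGION_END   = c(1500, 45000, 95000, 2000),
  stringsAsFactors = FALSE
)

result <- annotate_marker("1", 1200, regions, windows = c(0, 50000, 100000))
print(result)
```

Here GENE_A contains the marker, GENE_B first appears in the +/-50kb window and GENE_C in the +/-100kb window; GENE_D lies on another chromosome and is never reported.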

GWAS data files often contain many columns that are not required in the output files with annotation results. By default, in addition to the columns with annotated regions, pannotate() outputs only the columns with the marker name, chromosome name and position. This behaviour can be modified for every input file in the input script using the command REGIONS_APPEND. The supported arguments are listed below:

Argument	Description
OFF	Only the original columns with marker name, chromosome name and position are preserved. Columns with annotated regions are appended to the end.
ON	All the original columns are preserved and columns with annotated regions are appended to the end.

Example:

REGIONS_FILE genes_file_1.txt

REGIONS_APPEND ON

PROCESS input_file_1.txt

REGIONS_APPEND OFF

PROCESS input_file_2.csv

The output file names are created by adding a prefix to the input file names. The prefix is specified with the command PREFIX.

Example:

REGIONS_FILE genes_file_1.txt

PREFIX annotated_

PROCESS input_file_1.txt

PROCESS input_file_2.csv

PREFIX annot_

PROCESS input_file_3.tab

All the output files corresponding to the input files ‘input_file_1.txt’ and ‘input_file_2.csv’ will be prefixed with annotated_; the output files corresponding to the input file ‘input_file_3.tab’ will be prefixed with annot_.

Daniel Taliun, Christian Fuchsberger, Cristian Pattaro

	
	
# name of an input script
script <- "PANNOTATE_script.txt"

# load GWAtoolbox library
library(GWAtoolbox)

# show contents of the input script
file.show(script, title=script)

# run pannotate() function with 2 parallel processes
pannotate(script, 2)

GWAtoolbox documentation built on May 2, 2019, 4:54 p.m.
