read_gwas: Read a GWAS results file into a data frame.


View source: R/read.r

Description

Read a GWAS results file into a data frame.

Usage

read_gwas(input, sep = "auto", missing = c("NA", "N/A", "null", "."),
  chromosome_style = "ucsc", preprocess = NULL, nrows = -1L,
  header = TRUE, col.names = NULL, verbose = TRUE)

Arguments

input

Path to a file containing GWAS summary statistics. If multiple paths are specified, all files are read and combined into a single data.frame.

sep

The separator between columns. Defaults to the first character in the set [,\t |;:] that exists on line autostart outside quoted ("") regions, and separates the rows above autostart into a consistent number of fields, too.

missing

Character vector of strings that represent missing values. By default the following strings are interpreted as NA: "", ".", "NA", "N/A", and "null".

chromosome_style

Convert chromosomes to ordered factors with labels based on the specified style (default is "ucsc"; see below for a comparison of the different styles). Set to NULL to leave chromosomes unchanged.

preprocess

A shell command that preprocesses the file before it is read; see the Preprocessing section below for details.

nrows

The number of rows to read, by default -1 means all. Unlike read.table, it doesn't help speed to set this to the number of rows in the file (or an estimate), since the number of rows is automatically determined and is already fast. Only set nrows if you require the first 10 rows, for example. 'nrows=0' is a special case that just returns the column names and types; e.g., a dry run for a large file or to quickly check format consistency of a set of files before starting to read any.

header

Does the first data line contain column names? Defaults according to whether every non-empty field on the first data line is type character. If so, or TRUE is supplied, any empty column names are given a default name.

col.names

A vector of optional names for the variables (columns). The default is to use the header column if present or detected, or if not "V" followed by the column number.

verbose

If TRUE, print a description of the processing steps.
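The header and col.names defaults described above can be illustrated with a minimal base R sketch. Both helpers (looks_like_header and default_col_names) are hypothetical names introduced here for illustration and are not part of the package; the real detection logic may differ.

```r
# Hypothetical helper: guess whether the first data line is a header by
# checking that every non-empty field is non-numeric (i.e. character).
looks_like_header <- function(first_line, sep = "\t") {
  fields <- strsplit(first_line, sep, fixed = TRUE)[[1]]
  fields <- fields[nzchar(fields)]
  # A field counts as character if it cannot be parsed as a number.
  is_char <- is.na(suppressWarnings(as.numeric(fields)))
  length(fields) > 0 && all(is_char)
}

# Hypothetical helper: default names used when no header is available,
# "V" followed by the column number.
default_col_names <- function(n) paste0("V", seq_len(n))

looks_like_header("SNP\tCHR\tBP\tP")       # all fields are text
looks_like_header("rs123\t1\t1000\t0.05")  # contains numeric fields
default_col_names(3)
```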

Chromosome Styles

We use the Homo sapiens chromosome styles defined in Bioconductor's GenomeInfoDb. Valid options include "ncbi", "ensembl", "ucsc" and "dbsnp". The following table provides a preview of each style (note ncbi and ensembl are identical):

ncbi/ensembl  ucsc  dbsnp
1             chr1  ch1
2             chr2  ch2
3             chr3  ch3
...           ...   ...
X             chrX  chX
Y             chrY  chY
MT            chrM  chMT
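To make the style conversion concrete, here is a minimal base R sketch that relabels a chromosome vector and returns an ordered factor, following the table above. The helper to_chromosome_style is a hypothetical name, not a function exported by the package, and it does not use GenomeInfoDb; it only hard-codes chromosomes 1-22, X, Y and MT.

```r
# Hypothetical helper, not part of the package: relabel chromosomes and
# return an ordered factor (1-22, X, Y, MT) in the requested style.
to_chromosome_style <- function(chr,
                                style = c("ucsc", "ncbi", "ensembl", "dbsnp")) {
  style <- match.arg(style)
  base <- c(as.character(1:22), "X", "Y", "MT")

  # Strip any existing "chr"/"ch" prefix and normalize the mitochondrial label.
  bare <- toupper(sub("^(chr|ch)", "", as.character(chr), ignore.case = TRUE))
  bare[bare == "M"] <- "MT"

  labels <- switch(style,
    ucsc  = paste0("chr", sub("^MT$", "M", base)),
    dbsnp = paste0("ch", base),
    base  # ncbi and ensembl share the bare labels
  )

  # Unrecognized chromosomes become NA.
  factor(labels[match(bare, base)], levels = labels, ordered = TRUE)
}

chr <- to_chromosome_style(c("1", "X", "MT", "22"), style = "ucsc")
sort(chr)  # sorted in karyotype order rather than alphabetically
```

Returning an ordered factor means sorting and plotting follow chromosome order (1, 2, ..., 22, X, Y, MT) instead of the alphabetical order a character vector would give.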

Preprocessing

The preprocess argument allows you to specify a shell command that preprocesses the file before it's read into R. For example, we could use grep to filter our results to include only markers with an RS number:

read_gwas("my-results.txt", preprocess = "grep -e '^rs'")

Note that read_gwas() handles the header row separately so column labels wouldn't be filtered out by grep in this example.

By default, the input filename is appended to the preprocess argument prior to execution. However, you can control where the filename is inserted in the command by using %s as a placeholder. In the following example, tr is used to remove null characters (the backslash is doubled because R does not allow a literal nul character inside a string):

read_gwas("my-results.txt", preprocess = "tr -d '\\000' < %s")
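The filename rule described above can be sketched in base R as follows. The helper build_preprocess_command is a hypothetical name used only for illustration; how the package actually assembles and runs the command is an assumption here, including the use of shQuote() to quote the filename.

```r
# Hypothetical helper illustrating the placeholder rule: insert the filename
# at %s if present, otherwise append it to the end of the command.
build_preprocess_command <- function(preprocess, input) {
  file <- shQuote(input)  # assumption: quote the path for the shell
  if (grepl("%s", preprocess, fixed = TRUE)) {
    sub("%s", file, preprocess, fixed = TRUE)
  } else {
    paste(preprocess, file)
  }
}

# Filename appended to the end of the command.
build_preprocess_command("grep -e '^rs'", "my-results.txt")

# Filename inserted at the %s placeholder.
build_preprocess_command("tr -d '\\000' < %s", "my-results.txt")
```

A literal substitution with sub(..., fixed = TRUE) is used instead of sprintf() so that any other percent signs in the command are left untouched.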

aaronwolen/gwasio documentation built on Dec. 16, 2019, 4:49 p.m.