snp2gene: Annotate Genes to SNPs

Description Usage Arguments Details Value See Also Examples

Description

Adds columns with gene annotation to an input data frame of SNPs. Two modes are available: by proximity (genes covering the SNP and the next up- and downstream genes (including overlapping genes)) or, alternatively, genes that are in linkage equilibrium/disequilibrium with the query SNPs. Also adds biomart position annotation.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
snp2gene.LD(
  snps, 
  gts.source = 2, 
  ld.win = 1000000, 
  biomart.config = biomartConfigs$hsapiens, 
  use.buffer = FALSE,
  by.genename = TRUE, 
  cores = 1
) 
snp2gene.prox(
  snps, 
  level = -1, 
  biomart.config = biomartConfigs$hsapiens, 
  use.buffer = FALSE,
  by.genename = TRUE,
  print.format = FALSE
)
snp2gene(
  snps, 
  gts.source = NULL, 
  biomart.config = biomartConfigs$hsapiens,
  use.buffer = FALSE, 
  by.genename = TRUE,
  level = -1, 
  ld.win = 1000000, 
  print.format = FALSE, 
  cores = 1
)

Arguments

snps

data frame. Has to contain a column named 'SNP' with rs-IDs (http://www.ncbi.nlm.nih.gov/projects/SNP/). May contain additional columns which are preserved. Will be made unique on the SNP column (duplicate SNPs discarded).

gts.source

vector(1). When NULL, annotates SNP by proximity. Otherwise, uses genotypes from a location as specified by the 'gts.source' argument to annotate genes by LD. Can take a numeric HapMap population ID to download genotypes for, or a GenABEL (.gwaa) or LINKAGE / PLINK format (.ped) genotype file (with corresponding .phe or .map file existing in the same directory), or an object of class snp.data. See the corresponding argument of the getGenotypes function for detailed specifications. GenABEL format is fastest and recommended.

ld.win

numeric(1). Specifies the total window size around a query SNP where genes are tested to be in LD with the query SNP.

biomart.config

list. Contains values that are needed for biomaRt connection and data retrieval (i.e. dataset name, attribute and filter names for SNPs and genes). This is organism specific. Normally, a predefined list from biomartConfigs is specified, e.g. biomartConfigs$mmusculus for the mouse organism. Available predefined lists are shown with names(biomartConfigs). See also biomartConfigs.

level

numeric(1). When 0, returns for each SNP the (single) closest gene. When > 0, returns the level-closest gene in each direction, i.e. the closest in each direction for level = 1, and the second closest in each direction for level = 2 and so on. When < 0, returns the closest gene for each direction as for level = 1, plus all genes that overlap with the closest in each direction.

print.format

boolean(1). When FALSE (default), a decomposed data frame is returned, with one row per SNP-gene association. Contains an additional column for the direction of the SNP-gene relation (upstream, downstream, ...). LD annotation always uses the decomposed return format. When TRUE, returns the input data frame (one row per SNP) with three additional columns for covering, up- and downstream genes.

cores

integer(1). When > 1, parallelizes LD calculation and genotype retrieval by forking the parent R process. Number of concurrent processes is defined by the cores argument which has to be > 0. Might consume cores-fold memory. Parallelization only works on UNIX systems and might confuse the order of status messages.

use.buffer

boolean(1). When TRUE, buffers downloaded annotation data in the buffer variables snps and genes. When these variables already exists, data is not downloaded but used from there instead. This facilitates the possibility to re-use data, alter data or provide custom data. See postgwasBuffer for more information on the buffer concept and variables. When setting use.buffer = TRUE, make sure that existing buffer data is current!

by.genename

boolean(1). When TRUE, only genes that have a name are annotated. For genes with a single name but multiple IDs, a random representative ID is selected. When FALSE, uses all genes.

Details

When gene annotation by proximity is requested, we have to deal with cases of overlapping genes. This can be many (e.g. TRBV gene variants or dense gene clusters). When level < 0, all overlapping genes will be annotated (capped by cluster.maxsize) and form multiple rows in the decomposed format or, for the print format, occur as single string where gene names are separated by '|'. LD calculation uses several measures to assess the amount of linkage disequilibrium between a SNP and a gene. In a preliminary step, the r-square correlation between the queried SNP and SNPs within the gene is calculated (uses the r2fast function of GenABEL). Due to efficency constraints, only genes within a user-defined window (default: 1 MB) are considered (genes that overlap partially with the window are included, and the nearest gene is always considered regardless of the window parameter), and there is a cap of 100 SNPs per gene for LD calculation (when there are more, 100 SNPs are selected by position, evenly distributed over the gene). Output then contains

When the gts.source is a HapMap population or linkage file, all genotypes retrieved will be written to the files snp2gene.gwaa, snp2gene.phe, snp2gene.ped and snp2gene.map in the current directory. The whole annotation process makes use of the biomaRt database interface. All positions used and returned refer to the genome build version as provided by the according biomaRt retrival. The database and variables used can be specified by the user, e.g. it is possible to annotate accession numbers by altering the biomart.config, setting myconfig$gene$attr$name <- 'uniprot_swissprot_accession'

Value

A data frame with all columns from the query data frame 'snps' plus columns 'BP' for the biomart position (renames an existing BP column to BP.original) and

Each association of a SNP to a gene forms a row of the data frame.

See Also

biomartConfigs

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
snps <- data.frame(SNP = c("rs172154", "rs759704"))



## Not run: 
## simplest annotation, online
annot.prox <- snp2gene.prox(snps)
annot.ld <- snp2gene.LD(snps)

## extract names of genes in LD
geneset <- annot.ld[annot.ld$ld.max > 0.3, "genename"]

## the same with a different organism
snps <- data.frame(SNP = c("s01-10027", "s04-1469331", "s10-240843", "s15-474479"))
snp2gene.prox(snps, biomart.config = biomartConfigs$scerevisiae)

## how to use protein accession numbers
bconf <- biomartConfigs$hsapiens
bconf$gene$attr$name <- "uniprot_swissprot_accession"
snp2gene.prox(snps, biomart.config = bconf)

## this shows a list of further biomart attributes for human genes:
listAttributes(useMart("ensembl", "hsapiens_gene_ensembl"))

## End(Not run)

## annotation using buffer variables (runs offline when set properly)
snp2gene.LD(snps, gts.source = genotype.file, use.buffer = TRUE)
snp2gene.prox(snps, use.buffer = TRUE)

merns/postgwas documentation built on May 22, 2019, 6:53 p.m.