knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
We developed a format for storing and harmonising GWAS summary data known as GWAS VCF format. This format is effective for being very fast when querying chromosome and position ranges, handling multiallelic variants and indels.
All the data in the IEU GWAS database is available for download in the GWAS VCF format. This R package provides fast and convenient functions for querying and creating GWAS summary data in GWAS VCF format. The package builds on the VariantAnnotation Bioconductor package, which itself is based on the widely used SummarizedExperiment Bioconductor package.
For some VCF querying functions it is faster to optionally use bcftools, and when available the R package will use that strategy. To set a location for the bcftools package, use
library(gwasvcf) set_bcftools('/path/to/bcftools')
Note that there is bcftools binary for Windows available, so some querying options will be slower on Windows.
For LD related functions the package uses plink 1.90. You can specify the location of your plink installation by running
set_plink('/path/to/plink')
Alternatively you can automatically use use the binaries bundled here: https://github.com/mrcieu/genetics.binaRies
remotes::install_github('mrcieu/genetics.binaRies') set_plink() set_bcftools()
To unset a path:
set_plink(NULL) set_bcftools(NULL)
For this vignette we will use the bundled binaries in genetics.binaRies
.
suppressWarnings(suppressPackageStartupMessages({ library(gwasvcf) library(VariantAnnotation) library(dplyr) library(magrittr) }))
set_bcftools()
To read an entire dataset use the readVcf
function. As an example we'll use the bundled data which is a small subset of the Speliotes et al 2010 BMI GWAS.
vcffile <- system.file("extdata", "data.vcf.gz", package="gwasvcf") vcf <- readVcf(vcffile) class(vcf)
Please refer to the VariantAnnotation
package documentation for full details about the CollapsedVCF
object. A brief summary follows.
General info about the dataset can be obtained by calling it:
vcf
There are 92 rows and 1 column which means 92 SNPs and one GWAS. See the header information:
header(vcf)
See the names of the GWAS datasets (in this case just one, and it refers to the IEU GWAS database ID name):
samples(header(vcf))
In this case you can obtain information about this study through the ieugwasr
package e.g. ieugwasr::gwasinfo("IEU-a-2")
.
There are a few components within the object:
header
which has the meta data describing the dataset, including the association result variablesrowRanges
which is information about each variantinfo
which is further metadata about each variantgeno
which is the actual association results for each GWASthe rowRanges
object is a GenomicRanges
class, which is useful for performing fast operations on chromosome position information.
rowRanges(vcf)
The VCF object is somewhat complex and you can read more about it in the 'VariantAnnotation package documentation](https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html). You can create various other formats that might be easier to use from it. For example, create a GRanges
object which is great for fast chromosome-position operations
vcf_to_granges(vcf)
Create a data frame:
vcf_to_granges(vcf) %>% dplyr::as_tibble()
The direct conversion to formats for tools such as TwoSampleMR, coloc, and many others can also be made using the https://github.com/mrcieu/gwasglue R package.
The query_gwas
function takes either a filename to a vcf file, or vcf object as the main argument. You can then query on rsid
, pval
or chrompos
. For example
vcfsubset <- query_gwas(vcffile, chrompos=c("1:1097291-1099437"))
and
vcf <- readVcf(vcffile) vcfsubset <- query_gwas(vcf, chrompos=c("1:1097291-1099437"))
are each identical, but the former saves time and memory because it is querying the file using an index and only reading in what is required.
Examples of other filters are here:
vcf <- query_gwas(vcffile, rsid=c("rs3128126", "rs3121561", "rs3813193")) vcf
vcf <- query_gwas(vcffile, pval=0.5) vcf
vcf <- query_gwas(vcffile, chrompos=c("1:1097291-1099437")) vcf
It's possible to chain filters together e.g.
vcf <- query_gwas(vcffile, rsid=c("rs3128126", "rs3121561", "rs3813193")) %>% query_gwas(pval=0.5) vcf
It's possible to have multiple GWAS studies per vcf. You can specify specific GWAS studies to read in using e.g.
vcf <- query_gwas(vcffile, rsid=c("rs3128126", "rs3121561", "rs3813193"), id="IEU-a-2")
Note that querying by chrompos is the fastest way to deal with VCFs, use this over rsid where possible when speed is an issue.
Querying by rsid is slow. If a large number of queries by rsid are to be performed then it could be worth generating an index which would speed up the querying. This approach uses SQLite to create a local database, linking rsid to chromosome and position. It strips out the 'rs' from the rs identifiers to make fast searchers by integer. The concept is based on that developed here: bioforensics/rsidx.
To create the index:
create_rsidx_index_from_vcf(vcffile, "index.rsidx")
To query using the index:
vcf <- query_gwas(vcffile, rsid=c("rs3128126", "rs3121561", "rs3813193"), rsidx="index.rsidx")
Querying by p-value is slow. It could be worth generating an index file for p-values to speed this up. Similar to rsid queries, it uses an sqlite database linking -log10 pvalues to chromosome and position.
To create the index:
create_pval_index_from_vcf(vcffile, maximum_pval=0.05, "index.pvali")
To query using the index:
vcf <- query_gwas(vcffile, pval=0.05, pvali="index.pvali")
The fastest way to query VCFs is by specifying chromosome and position. Can specify specific positions, or ranges. e.g.
cp <- c("1:10000", "2:10000-20000")
or as a data frame
cp <- dplyr::tibble(chrom=c(1,2), start=c(10000,10000), end=c(10000, 20000))
You can check what will be parsed out with:
parse_chrompos(cp)
Querying by p-value or rsid is also possible but is slower as only chrompos is indexed. On Mac and Linux, rsid and p-value queries are performed by calls to bcftools. On Windows it uses VariantAnnotation directly, because bcftools binaries are not available. This is unfortunately somewhat slower. If many operations are being performed it might be faster to read in the whole dataset and perform queries that way.
If a set of rsids are requested from a vcf but some are absent, a reference panel can be used to search for LD proxies, extract them, and align the effects and alleles against the original variants that were requested.
There are two ways to perform the LD proxy search:
An LD reference panel can be obtained from here: http://fileserve.mrcieu.ac.uk/ld/data_maf0.01_rs_ref.tgz. This dataset comprises Europeans from the 1000 genomes project, in plink format, and including only SNPs with MAF > 0.01, and with the reference alleles aligned to the human genome reference sequence. For this vignette we can use a small subset of that dataset:
ldfile <- system.file("extdata", "eur.bed", package="gwasvcf") %>% gsub(".bed", "", .)
We also need to provide a path to the plink binary used to generate LD calculations. This can be done through the genetics.binaRies
package as with bcftools
set_plink()
The rs4442317 variant is not present in the vcf file, i.e. if we query that variant:
query_gwas(vcffile, rsid="rs4442317") %>% nrow
vcf <- query_gwas(vcffile, rsid="rs4442317", proxies="yes", bfile=ldfile, tag_r2=0.05) vcf %>% vcf_to_granges()
Here we see that the proxy variant is r vcf_to_granges(vcf)$PR
.
You may also extract only the best available proxies even if the requested rsids are present, by using proxies="only"
. An example of this shows that the effect size estimates for the proxy variants are aligned to the effect alleles of the target variants:
# Read vcf a <- readVcf(vcffile) # Obtain the best LD proxy for each of the rsids b <- query_gwas(vcffile, rsid=names(a), proxies="only", bfile=ldfile, tag_r2=0.6) # Match the target data to the proxy data index <- match(names(b), names(a)) # Plot the target data effects against the proxy data effects plot(vcf_to_granges(b)$ES, vcf_to_granges(a)$ES[index])
Using the LD reference panel described above, it is possible to create a sqlite tag reference panel using the following commands. First get an example LD reference panel:
ldfile <- system.file("extdata", "eur.bed", package="gwasvcf") %>% gsub(".bed", "", .)
We also need to provide a path to the plink binary used to generate LD calculations. This can be done through the genetics.binaRies
package as with bcftools
set_plink()
Now generate the tagging database
dbfile <- tempfile() create_ldref_sqlite(ldfile, dbfile, tag_r2 = 0.05)
Perform the query
vcf <- query_gwas(vcffile, rsid="rs4442317", proxies="yes", dbfile=dbfile, tag_r2=0.05) vcf %>% vcf_to_granges()
unlink(dbfile)
If you have GWAS summary data in a text file or data frame, this can be converted to a VCF object.
vcf <- readVcf(vcffile) vv <- vcf_to_granges(vcf) %>% dplyr::as_tibble() out <- vv %$% create_vcf(chrom=seqnames, pos=start, nea=REF, ea=ALT, snp=ID, ea_af=AF, effect=ES, se=SE, pval=10^-LP, n=SS, name="a") out
It's possible to write the vcf file:
writeVcf(out, file="temp.vcf")
You may want to first harmonise the data so that all the non-effect alleles are aligned to the human genome reference. See the gwasglue package on some functions to do this.
Although still under development, if compared with its predecessor, the gwasglue2 package has several new features, including the use of S4 R objects.
It is possible to create a SummarySet
object from a GWAS-VCF file or VCF object e.g. output from VariantAnnotation::readVcf()
, create_vcf()
or query_gwas()
using the gwasvcf_to_summaryset()
function.
For example:
summaryset <- readVcf(vcffile) %>% gwasvcf_to_summaryset()
Once the SummarySet
objects are created, it is possible to use gwasglue2
to harmonise data, harmonise against a LD matrix, remap genomic coordinates to a different genome assembly, convert to other formats and more.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.