knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(mrpipeline)
The pipeline makes specific use of the "gwasvcf" format created by Bristol (see: https://github.com/MRCIEU/gwasvcf). The gwasvcf format is generally simple to use, fairly lightweight for storage of GWAS data and, most importantly, has a standard format meeting the pipeline can expect a specific format for each function. This makes it much easier to know what format of data the user will input into the pipeline and make automating many of the functions simple.
As mentioned above, the pipeline mainly accepts data in the gwasvcf format. The pipeline is bundled with some wrapper functions which can help with re-formatting your data into this format (see: XXX).
In place of file names, the pipeline will also accept OpenGWAS IDs, e.g. ieu-a-7. OpenGWAS fundamentally stores its data in the gwasvcf format and so expanding functions to accept these IDs meant that a large resource of GWAS data could be integrated into the pipeline for the user's ease of use without having to continually download and convert large files.
For more information, see the ieugwasr R package.
The pipeline follows the structure of a standard Mendelian randomisation (MR) analysis. Data are categorised as either exposure or outcome, with the pipeline expecting certain columns for both.
Generally, the exposure data will contain the following information:
exposure
- Name or identifier for exposure. Many exposures can be combined into the one dataframeSNP
- rsIDs for SNPsbeta
- The effect sizese
- The standard error of the effect sizeeffect_allele
- The allele for which the effect size has been measuredOther columns which may be present, and required for certain functions, are:
chr
- Chromosome of the SNPposition
- Position on the chromosome of the SNPpval
- P value for the SNP's association with the exposuresamplesize
- Sample size in which the effect was estimatedid
- Helper column with unique ID for the exposurefile
- Helper column with either OpenGWAS ID or filename for the exposureother_allele
- The non-effect alleleeaf
- Frequency of the effect allelemr_keep
- Should these SNPs be used in MR?exp_dat <- read_exposure("eqtl-a-ENSG00000167207") head(data.frame(exp_dat))
N.B. Many column names are shared between exposure and outcome data, and so these are prefixed with either .exposure
or .outcome
.
The exposure data can be constructed manually from, for example, a data file or another data.frame, so long as you convert the data into a format the pipeline will expect. This includes re-naming columns and ensuring the column types are as expected.
The read_exposure
function abstracts away a lot of steps that can alter the results and interpretations of an MR study. To this end, we decided upon a set of standard parameters that the pipeline will use to ensure that results for many users will be directly comparable. However, these can be changed by the more advanced user.
The steps taken by the read_exposure
function are as follows:
read_exposure
function a pathway to Plink and Plink reference files.?read_exposure
Will show the full list of arguments for this function, what they mean and their default values.
Generally, the outcome data will contain the following information:
outcome
- Name or identifier for outcome. Many outcome can be combined into the one dataframeSNP
- rsIDs for SNPsbeta
- The effect sizese
- The standard error of the effect sizeeffect_allele
- The allele for which the effect size has been measuredOther columns which may be present, and required for certain functions, are:
chr
- Chromosome of the SNPposition
- Position on the chromosome of the SNPpval
- P value for the SNP's association with the outcomesamplesize
- Sample size in which the effect was estimatedid
- Helper column with unique ID for the outcomefile
- Helper column with either OpenGWAS ID or filename for the outcomeother_allele
- The non-effect alleleeaf
- Frequency of the effect alleleColumns present when searching for proxy SNPs:
target_snp
- The original SNP providedproxy_snp
- The SNP which will be used in the analyses (may be the same as target_snp
, in which case no proxy was used)proxy
- Whether the SNP is a proxy or nottarget_a1
- Original effect alleletarget_a2
- Original non-effect alleleproxy_a1
- Proxy effect alleleproxy_a2
- Proxy non-effect alleleout_dat <- read_outcome("ieu-a-7", rsids = exp_dat$SNP) head(data.frame(out_dat))
The outcome data will require the exposure SNPs as an argument. This is to prevent the pipeline from having to load the entire outcome data into memory. The read_outcome
function will take these SNPs and perform a look up in the outcome datasets for only those SNPs.
Similar to the read_exposure
function, the read_outcome
function also incorporates searching for proxy SNPs into one function call. By default, the pipeline will attempt to look for proxies for SNPs which are not present in the outcome dataset (default: rsq = 0.8). Similar to clumping, the user can decide whether the proxy search is carried out on the OpenGWAS servers (default) or locally, in which case a pathway to Plink and a reference panel are required.
?read_outcome
For more details.
Before use in MR, the exposure and outcome datasets must be harmonised so that the alignment of the beta estimate and effect allele are the same for the exposure and outcome. This can be achieved by:
dat <- harmonise(exp_dat, out_dat) head(data.frame(dat))
For all intents and purposes, this function is a wrapper for the TwoSampleMR::harmonise_data
function (using action = 2 as default). That package has a good vignette describing how the process works, which we recommend reading.
Harmonising many exposures and outcomes can be slow and labourious. Worse still, if there are many analyses to be conducted then waiting for the harmonisation of all the data can take as long as conducting the actual functional analyses themselves. Therefore, this package has implemented a way of conducting what we are calling "pairwise" analyses by using the following function:
mr_res <- pairwise_analysis(exp_dat, out_dat, "../results")
This function parallelises (default threads: 1) the harmonise and analysis stages of the pipeline by harmonising one exposure and one outcome per thread, and then analysing these (e.g. MR) and writes the results to the disc. This way, results are output constantly as they are generated which makes this function ideal for large-scale analyses. The optional overwrite
argument (default: False
) when set to False
means that analyses will not be re-ran if those results already exist, meaning that if an analysis is interrupted then the script can be set to run again without massive overhead in computed already generated results. For more details, see the help page for this function:
?pairwise_analysis
check_snps
is a function which will check the compatibility with either exposure or outcome SNPs for use in MR or colocalisation analyses. By default, these SNPs are dropped but can be retained with a flag marking them for being dropped.
exp_dat <- check_snps(exp_dat, drop = F) head(data.frame(exp_dat))
Now if we set a beta to NA:
exp_dat[1, "beta.exposure"] <- NA exp_dat <- check_snps(exp_dat, drop = F) head(data.frame(exp_dat))
In particular, this function checks the presence of the following columns and values which are necessary as a minimum:
mr
:coloc
:The package also comes with two functions which simplify the process of saving data in the gwasvcf format, depending on how the data are stored. Both are fundamentally the same; one operates on a data.frame and the other a text file: dat_to_gwasvcf
and file_to_gwasvcf
respectively.
These functions require the column names for the requisite data; however, it is also recommended pointing the function to the bcf tools bin, as this will speed up the process. If bcf tools are not provided, then the header of the gwasvcf file will not be written correctly and will be missing some important information, including the total number of variants included and the study type which is required for colocalisation.
By default, these functions will output a zipped file. The gwasvcf files are not required to be unzipped before being used and so we strongly recommend keeping them in this format.
dat_to_gwasvcf(df, "../data/exposure_data", "chr", "pos", "nea", "ea", snp_col = "snp")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.