knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(mrpipeline)

Introduction

The pipeline makes specific use of the "gwasvcf" format created by Bristol (see: https://github.com/MRCIEU/gwasvcf). The gwasvcf format is generally simple to use, fairly lightweight for storage of GWAS data and, most importantly, has a standard format meeting the pipeline can expect a specific format for each function. This makes it much easier to know what format of data the user will input into the pipeline and make automating many of the functions simple.

Local gwasvcf Files

As mentioned above, the pipeline mainly accepts data in the gwasvcf format. The pipeline is bundled with some wrapper functions which can help with re-formatting your data into this format (see: XXX).

OpenGWAS

In place of file names, the pipeline will also accept OpenGWAS IDs, e.g. ieu-a-7. OpenGWAS fundamentally stores its data in the gwasvcf format and so expanding functions to accept these IDs meant that a large resource of GWAS data could be integrated into the pipeline for the user's ease of use without having to continually download and convert large files.

For more information, see the ieugwasr R package.


Data Types

The pipeline follows the structure of a standard Mendelian randomisation (MR) analysis. Data are categorised as either exposure or outcome, with the pipeline expecting certain columns for both.

Exposure data

Structure

Generally, the exposure data will contain the following information:

Other columns which may be present, and required for certain functions, are:

exp_dat <- read_exposure("eqtl-a-ENSG00000167207")

head(data.frame(exp_dat))

N.B. Many column names are shared between exposure and outcome data, and so these are prefixed with either .exposure or .outcome.

The exposure data can be constructed manually from, for example, a data file or another data.frame, so long as you convert the data into a format the pipeline will expect. This includes re-naming columns and ensuring the column types are as expected.

Details

The read_exposure function abstracts away a lot of steps that can alter the results and interpretations of an MR study. To this end, we decided upon a set of standard parameters that the pipeline will use to ensure that results for many users will be directly comparable. However, these can be changed by the more advanced user.

The steps taken by the read_exposure function are as follows:

  1. SNPs are selected based on a P value threshold (default: 5e-8).
  2. Clumping to ensure only independent SNPs are carried forward.
    • Clumping can be conducted either locally or on the OpenGWAS server.
    • If an OpenGWAS ID is used for your exposure data, these are generally pre-clumped on the server and will be reduced quicker than if the clumping needs to take place due to, for example, non-default parameters.
    • Local clumping requires giving the read_exposure function a pathway to Plink and Plink reference files.
    • Default parameters for clumping:
      • clump_r2 = 0.01
      • clump_kb = 10000
      • pop = "EUR"
?read_exposure

Will show the full list of arguments for this function, what they mean and their default values.

Outcome data

Structure

Generally, the outcome data will contain the following information:

Other columns which may be present, and required for certain functions, are:

Columns present when searching for proxy SNPs:

out_dat <- read_outcome("ieu-a-7", rsids = exp_dat$SNP)

head(data.frame(out_dat))

The outcome data will require the exposure SNPs as an argument. This is to prevent the pipeline from having to load the entire outcome data into memory. The read_outcome function will take these SNPs and perform a look up in the outcome datasets for only those SNPs.

Details

Similar to the read_exposure function, the read_outcome function also incorporates searching for proxy SNPs into one function call. By default, the pipeline will attempt to look for proxies for SNPs which are not present in the outcome dataset (default: rsq = 0.8). Similar to clumping, the user can decide whether the proxy search is carried out on the OpenGWAS servers (default) or locally, in which case a pathway to Plink and a reference panel are required.

?read_outcome

For more details.

Harmonised data

Structure

Before use in MR, the exposure and outcome datasets must be harmonised so that the alignment of the beta estimate and effect allele are the same for the exposure and outcome. This can be achieved by:

dat <- harmonise(exp_dat, out_dat)

head(data.frame(dat))

Details

For all intents and purposes, this function is a wrapper for the TwoSampleMR::harmonise_data function (using action = 2 as default). That package has a good vignette describing how the process works, which we recommend reading.

Pairwise analyses

Harmonising many exposures and outcomes can be slow and labourious. Worse still, if there are many analyses to be conducted then waiting for the harmonisation of all the data can take as long as conducting the actual functional analyses themselves. Therefore, this package has implemented a way of conducting what we are calling "pairwise" analyses by using the following function:

mr_res <- pairwise_analysis(exp_dat, out_dat, "../results")

This function parallelises (default threads: 1) the harmonise and analysis stages of the pipeline by harmonising one exposure and one outcome per thread, and then analysing these (e.g. MR) and writes the results to the disc. This way, results are output constantly as they are generated which makes this function ideal for large-scale analyses. The optional overwrite argument (default: False) when set to False means that analyses will not be re-ran if those results already exist, meaning that if an analysis is interrupted then the script can be set to run again without massive overhead in computed already generated results. For more details, see the help page for this function:

?pairwise_analysis

Data cleaning, preparation and saving

Checking SNPs

check_snps is a function which will check the compatibility with either exposure or outcome SNPs for use in MR or colocalisation analyses. By default, these SNPs are dropped but can be retained with a flag marking them for being dropped.

exp_dat <- check_snps(exp_dat, drop = F)

head(data.frame(exp_dat))

Now if we set a beta to NA:

exp_dat[1, "beta.exposure"] <- NA
exp_dat <- check_snps(exp_dat, drop = F)

head(data.frame(exp_dat))

In particular, this function checks the presence of the following columns and values which are necessary as a minimum:

Saving to gwasvcf format

The package also comes with two functions which simplify the process of saving data in the gwasvcf format, depending on how the data are stored. Both are fundamentally the same; one operates on a data.frame and the other a text file: dat_to_gwasvcf and file_to_gwasvcf respectively.

These functions require the column names for the requisite data; however, it is also recommended pointing the function to the bcf tools bin, as this will speed up the process. If bcf tools are not provided, then the header of the gwasvcf file will not be written correctly and will be missing some important information, including the total number of variants included and the study type which is required for colocalisation.

By default, these functions will output a zipped file. The gwasvcf files are not required to be unzipped before being used and so we strongly recommend keeping them in this format.

dat_to_gwasvcf(df, "../data/exposure_data", "chr", "pos", "nea", "ea", snp_col = "snp")


jwr-git/mrpipeline documentation built on Oct. 2, 2022, 3:41 p.m.