In jwr-git/mrpipeline: Implements a Ready-for-Use Mendelian Randomisation Pipeline

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(mrpipeline)

Introduction

The pipeline makes specific use of the "gwasvcf" format created by Bristol (see: https://github.com/MRCIEU/gwasvcf). The gwasvcf format is generally simple to use, fairly lightweight for storage of GWAS data and, most importantly, has a standard format meeting the pipeline can expect a specific format for each function. This makes it much easier to know what format of data the user will input into the pipeline and make automating many of the functions simple.

Local gwasvcf Files

As mentioned above, the pipeline mainly accepts data in the gwasvcf format. The pipeline is bundled with some wrapper functions which can help with re-formatting your data into this format (see: XXX).

OpenGWAS

In place of file names, the pipeline will also accept OpenGWAS IDs, e.g. ieu-a-7. OpenGWAS fundamentally stores its data in the gwasvcf format and so expanding functions to accept these IDs meant that a large resource of GWAS data could be integrated into the pipeline for the user's ease of use without having to continually download and convert large files.

For more information, see the ieugwasr R package.

Data Types

The pipeline follows the structure of a standard Mendelian randomisation (MR) analysis. Data are categorised as either exposure or outcome, with the pipeline expecting certain columns for both.

Exposure data

Structure

Generally, the exposure data will contain the following information:

exposure - Name or identifier for exposure. Many exposures can be combined into the one dataframe
SNP - rsIDs for SNPs
beta - The effect size
se - The standard error of the effect size
effect_allele - The allele for which the effect size has been measured

Other columns which may be present, and required for certain functions, are:

chr - Chromosome of the SNP
position - Position on the chromosome of the SNP
pval - P value for the SNP's association with the exposure
samplesize - Sample size in which the effect was estimated
id - Helper column with unique ID for the exposure
file - Helper column with either OpenGWAS ID or filename for the exposure
other_allele - The non-effect allele
eaf - Frequency of the effect allele
mr_keep - Should these SNPs be used in MR?

exp_dat <- read_exposure("eqtl-a-ENSG00000167207")

head(data.frame(exp_dat))

N.B. Many column names are shared between exposure and outcome data, and so these are prefixed with either .exposure or .outcome.

The exposure data can be constructed manually from, for example, a data file or another data.frame, so long as you convert the data into a format the pipeline will expect. This includes re-naming columns and ensuring the column types are as expected.

Details

The read_exposure function abstracts away a lot of steps that can alter the results and interpretations of an MR study. To this end, we decided upon a set of standard parameters that the pipeline will use to ensure that results for many users will be directly comparable. However, these can be changed by the more advanced user.

The steps taken by the read_exposure function are as follows:

SNPs are selected based on a P value threshold (default: 5e-8).
Clumping to ensure only independent SNPs are carried forward.
- Clumping can be conducted either locally or on the OpenGWAS server.
- If an OpenGWAS ID is used for your exposure data, these are generally pre-clumped on the server and will be reduced quicker than if the clumping needs to take place due to, for example, non-default parameters.
- Local clumping requires giving the read_exposure function a pathway to Plink and Plink reference files.
- Default parameters for clumping:
  - clump_r2 = 0.01
  - clump_kb = 10000
  - pop = "EUR"

?read_exposure

Will show the full list of arguments for this function, what they mean and their default values.

Outcome data

Structure

Generally, the outcome data will contain the following information:

outcome - Name or identifier for outcome. Many outcome can be combined into the one dataframe
SNP - rsIDs for SNPs
beta - The effect size
se - The standard error of the effect size
effect_allele - The allele for which the effect size has been measured

Other columns which may be present, and required for certain functions, are:

chr - Chromosome of the SNP
position - Position on the chromosome of the SNP
pval - P value for the SNP's association with the outcome
samplesize - Sample size in which the effect was estimated
id - Helper column with unique ID for the outcome
file - Helper column with either OpenGWAS ID or filename for the outcome
other_allele - The non-effect allele
eaf - Frequency of the effect allele

Columns present when searching for proxy SNPs:

target_snp - The original SNP provided
proxy_snp - The SNP which will be used in the analyses (may be the same as target_snp, in which case no proxy was used)
proxy - Whether the SNP is a proxy or not
target_a1 - Original effect allele
target_a2 - Original non-effect allele
proxy_a1 - Proxy effect allele
proxy_a2 - Proxy non-effect allele

out_dat <- read_outcome("ieu-a-7", rsids = exp_dat$SNP)

head(data.frame(out_dat))

The outcome data will require the exposure SNPs as an argument. This is to prevent the pipeline from having to load the entire outcome data into memory. The read_outcome function will take these SNPs and perform a look up in the outcome datasets for only those SNPs.

Details

Similar to the read_exposure function, the read_outcome function also incorporates searching for proxy SNPs into one function call. By default, the pipeline will attempt to look for proxies for SNPs which are not present in the outcome dataset (default: rsq = 0.8). Similar to clumping, the user can decide whether the proxy search is carried out on the OpenGWAS servers (default) or locally, in which case a pathway to Plink and a reference panel are required.

?read_outcome

For more details.

Harmonised data

Structure

Before use in MR, the exposure and outcome datasets must be harmonised so that the alignment of the beta estimate and effect allele are the same for the exposure and outcome. This can be achieved by:

dat <- harmonise(exp_dat, out_dat)

head(data.frame(dat))

Details

For all intents and purposes, this function is a wrapper for the TwoSampleMR::harmonise_data function (using action = 2 as default). That package has a good vignette describing how the process works, which we recommend reading.

Pairwise analyses

Harmonising many exposures and outcomes can be slow and labourious. Worse still, if there are many analyses to be conducted then waiting for the harmonisation of all the data can take as long as conducting the actual functional analyses themselves. Therefore, this package has implemented a way of conducting what we are calling "pairwise" analyses by using the following function:

mr_res <- pairwise_analysis(exp_dat, out_dat, "../results")

This function parallelises (default threads: 1) the harmonise and analysis stages of the pipeline by harmonising one exposure and one outcome per thread, and then analysing these (e.g. MR) and writes the results to the disc. This way, results are output constantly as they are generated which makes this function ideal for large-scale analyses. The optional overwrite argument (default: False) when set to False means that analyses will not be re-ran if those results already exist, meaning that if an analysis is interrupted then the script can be set to run again without massive overhead in computed already generated results. For more details, see the help page for this function:

?pairwise_analysis

Data cleaning, preparation and saving

Checking SNPs

check_snps is a function which will check the compatibility with either exposure or outcome SNPs for use in MR or colocalisation analyses. By default, these SNPs are dropped but can be retained with a flag marking them for being dropped.

exp_dat <- check_snps(exp_dat, drop = F)

head(data.frame(exp_dat))

Now if we set a beta to NA:

exp_dat[1, "beta.exposure"] <- NA
exp_dat <- check_snps(exp_dat, drop = F)

head(data.frame(exp_dat))

In particular, this function checks the presence of the following columns and values which are necessary as a minimum:

mr:
- P value
- Beta
- Standard error
coloc:
- P value
- Chromosome
- Position

Saving to gwasvcf format

The package also comes with two functions which simplify the process of saving data in the gwasvcf format, depending on how the data are stored. Both are fundamentally the same; one operates on a data.frame and the other a text file: dat_to_gwasvcf and file_to_gwasvcf respectively.

These functions require the column names for the requisite data; however, it is also recommended pointing the function to the bcf tools bin, as this will speed up the process. If bcf tools are not provided, then the header of the gwasvcf file will not be written correctly and will be missing some important information, including the total number of variants included and the study type which is required for colocalisation.

By default, these functions will output a zipped file. The gwasvcf files are not required to be unzipped before being used and so we strongly recommend keeping them in this format.

dat_to_gwasvcf(df, "../data/exposure_data", "chr", "pos", "nea", "ea", snp_col = "snp")

jwr-git/mrpipeline documentation built on Oct. 2, 2022, 3:41 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

jwr-git/mrpipeline
Implements a Ready-for-Use Mendelian Randomisation Pipeline

In jwr-git/mrpipeline: Implements a Ready-for-Use Mendelian Randomisation Pipeline

Introduction

Local gwasvcf Files

OpenGWAS

Data Types

Exposure data

Structure

Details

Outcome data

Structure

Details

Harmonised data

Structure

Details

Pairwise analyses

Data cleaning, preparation and saving

Checking SNPs

Saving to gwasvcf format

R Package Documentation

Browse R Packages

We want your feedback!

jwr-git/mrpipeline Implements a Ready-for-Use Mendelian Randomisation Pipeline

In jwr-git/mrpipeline: Implements a Ready-for-Use Mendelian Randomisation Pipeline

Introduction

Local gwasvcf Files

OpenGWAS

Data Types

Exposure data

Structure

Details

Outcome data

Structure

Details

Harmonised data

Structure

Details

Pairwise analyses

Data cleaning, preparation and saving

Checking SNPs

Saving to gwasvcf format

R Package Documentation

Browse R Packages

We want your feedback!

jwr-git/mrpipeline
Implements a Ready-for-Use Mendelian Randomisation Pipeline