In this page we walk through an analysis of all pairs of 16 traits with publicly available GWAS data using CAUSE. These results are also presented in the paper (Section 2.3), so this page serves two functions. First, it allows an interested person to replicate those results. Second, it is an example of how to set up a larger scale analysis using CAUSE and explains some of the technical differences from an analysis of a single pair of traits. In the paper, we present results for pairs of 20 traits. However, the blood pressure results from Ehret et al. (2011) (PMID: 21909115) must be obtained through dbGaP and so aren't included here.
We are using Snakemake and a conda environment to run this analysis. If you don't have Miniconda or Anaconda installed you can either install one of them (recommended) or just make sure that you have Python 3, pandas, and Snakemake installed. We have chosen not to include R in the conda environment, so you will need to have R installed outside the environment and also have the cause and tidyverse R packages installed and running. If you were able to work through the package tutorial you should be in good shape.
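If you still need to install these two packages, one approach is sketched below. The GitHub repository name (jean997/cause) is an assumption based on the CAUSE documentation; check there for the currently recommended installation command.

# In R: tidyverse and devtools from CRAN, cause from GitHub
install.packages(c("tidyverse", "devtools"))
devtools::install_github("jean997/cause")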
Snakemake is a tool for creating scalable workflows. In our case we are using it to perform the same set of data processing and analysis steps for all pairs of traits. We will walk through exactly what the Snakemake analysis is doing in the next sections.
The first setup step is to create the conda environment:
conda create -n cause_large python=3.6 snakemake
Next create a working directory that you would like to analyze the data in. Change to that directory using cd on Mac or Linux. For example:

mkdir gwas_pairs
cd gwas_pairs
The last step is to download the data and code that we will use. Inside of R, run
cause::setup_gwas_pairs()
This function sets up the analysis directory and is a shortcut to downloading all the data and code we will use in this analysis. Downloading the data might take up to half an hour depending on your network speed. Here is what the function downloads.
GWAS summary statistics, cleaned and formatted. We will talk about these more later on. These files go in the data/ subdirectory.
LD data estimated from the 1000 Genomes CEU population. This goes in the ld/ subdirectory.
Some R scripts that will be used to run the analysis. These are in the R/ subdirectory.
A Snakemake file called pairs_snakemake.py.
In the rest of this document we will go through the Snakemake file in detail and explore the scripts that it calls. The analysis is designed to be run on a cluster and can be executed with a single command. First activate the conda environment.
source activate cause_large
The Snakemake command below assumes you are using a cluster with a Slurm workload manager. If your cluster uses something else you should edit the value that is given to the --cluster argument. You may need to edit this anyway to include necessary information. For example, if you are working on the University of Chicago Research Computing Center you will need to add --account and --partition arguments.
nohup snakemake -s pairs_snakemake.py --keep-going --jobs 96 --cluster "sbatch --output={params.log}_%A.out --error={params.log}_%A.err --cpus-per-task={params.cpus} --ntasks=1 --mem-per-cpu={params.mem} --time=1:00:00 --job-name={params.jobname}" > pairs.out &
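If you would rather test the workflow on a single machine without a scheduler, Snakemake can also run the jobs locally (shown here with up to 4 concurrent jobs; adjust to your machine):

snakemake -s pairs_snakemake.py --keep-going --jobs 4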
Currently there is no standard format for GWAS summary statistics, so summary statistics downloaded from different sources are likely to have different formats and may not have effect alleles oriented the same way. Many analysis steps are faster and simpler if all the data files are in the same format and have effect alleles oriented consistently.
The data files downloaded by setup_gwas_pairs() have been pre-formatted so they are ready to use.
After running setup_gwas_pairs() you will find a file data/gwas_info.csv that gives information about each study and the original download source. Open up the file and take a look.
library(tidyverse)
library(cause)

info <- read_csv("data/gwas_info.csv")
head(info) %>% print(width = Inf)
For each study we have recorded: a short string indicating the consortium or data set used (e.g. glg = Global Lipid Genetics Consortium), a short string indicating the trait, the full trait name, information about the source publication (PMID, first author, and publication year), the total sample size from the publication (and the number of cases and controls if relevant), and the original download link. The remaining columns give the column names of important fields in the original data; we don't need this information for our analysis here. All of the formatted data files that have been downloaded are named data/{consortium}_{trait}_summary_statistics.tsv.gz.
To convert the data from their original formats to a standard format, we used a summary statistics processing tool created by Joseph Marcus. This tool is a little more complicated than is required for CAUSE and uses a large external data reference. We hope to include a simpler data formatting function in the cause R package soon. Let's take a look at some of the formatted data. Use the following R commands:
dat <- read_tsv("data/ckdgen_egfrcrea_summary_statistics.tsv.gz",
                col_types = list(col_character(), col_integer(),
                                 col_character(), col_character(),
                                 col_character(), col_double(), col_double(),
                                 col_double(), col_double(), col_double()))
head(dat) %>% print(width = Inf)
In our analysis we will exploit the following features of the data format:
All the files have the same information fields in the same order and with the same column names.
All of the alleles are oriented the same way. This means, for example, that if rs3094315 appears in another study, it will also have reference allele G and alternative allele A in that study.
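As a quick sanity check of these two properties, you can merge two of the formatted files and confirm that the alleles agree. This is a minimal sketch; the two file names are just examples:

d1 <- read_tsv("data/glg_ldl_summary_statistics.tsv.gz")
d2 <- read_tsv("data/glg_hdl_summary_statistics.tsv.gz")
# Shared column names get suffixes _1 and _2 after the join
both <- inner_join(d1, d2, by = "snp", suffix = c("_1", "_2"))
stopifnot(all(both$ref_allele_1 == both$ref_allele_2),
          all(both$alt_allele_1 == both$alt_allele_2))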
In the future, we will update this section with instructions on how to achieve this using CAUSE functions. If your data are formatted with the same columns we have used (and also saved as a tsv) then you will be able to analyze them using our code, making only minimal changes to the pairs_snakemake.py file.
We will now walk through all the steps in the Snakemake file and explain what they do in detail. If your data are formatted in the same way as ours (see above), you should be able to use all of this code and will only need to change the preamble portion so that your file names match. It will be helpful to familiarize yourself with Snakemake syntax a little before you begin.
In the preamble section we set up the analysis we want to do. First we list some directories.
import pandas as pd

data_dir = "data/"    # where the data is
ld_dir = "ld/"        # where the LD data is
cause_dir = "cause/"  # where CAUSE results will go
mr_dir = "mr/"        # where other MR method results will go
Next we create a table of all the trait pairs we'd like to analyze.
consortia = ["giant", "giant", "lu", "glg", "glg", "glg", "glg", "ckdgen",
             "gefos", "egg", "egg", "egg", "vanderHarst", "diagram",
             "megastroke", "magic"]
traits = ["height", "bmi", "bfp", "tg", "ldl", "hdl", "tc", "egfrcrea",
          "bone", "bl", "bw", "hc", "cad", "t2d", "as", "fg"]
tags = [consortia[i] + "_" + traits[i] for i in range(len(traits))]
tag_pairs = [(tag1, tag2) for tag1 in tags for tag2 in tags if tag1 != tag2]
We will refer to the string {consortium}_{trait} as a "tag" in the rest of this analysis.
Finally, in Snakemake, the rule all lists all the files we should have at the end of the analysis. The rest of the file explains how to produce them.
rule all:
    input:
        expand(mr_dir + '{tp[0]}__{tp[1]}_mr.RDS', tp = tag_pairs),
        expand(mr_dir + '{tp[0]}__{tp[1]}_mrpresso.RDS', tp = tag_pairs),
        expand(mr_dir + '{tp[0]}__{tp[1]}_mregger.RDS', tp = tag_pairs),
        expand(cause_dir + '{tp[0]}__{tp[1]}_cause.RDS', tp = tag_pairs)
We use Awk to write out temporary files that will speed up some of the later analysis steps:
rule data_overlap:
    input:
        file1 = data_dir + '{tag1}_summary_statistics.tsv.gz',
        file2 = data_dir + '{tag2}_summary_statistics.tsv.gz'
    output:
        out = temp(data_dir + "{tag1}__{tag2}_overlap.tsv.gz")
    params: log="tempov", mem="20G", cpus="1", jobname='dataov'
    shell:
        """
        snps() {{ gzip -cd "$@" | awk '{{ if ($7!="NA" && $7 > 0 && $6!="NA") print $3 }}' ;}}
        full() {{ gzip -cd "$@" | awk '{{ if ($7!="NA" && $7 > 0 && $6!="NA") print $0 }}' ;}}
        snps {input.file2} | awk 'NR==FNR{{F1[$0];next}}$3 in F1{{print}}' - <(full {input.file1}) | gzip > {output.out}
        """
This step takes as input two data files, one for tag1 (trait $M$) and one for tag2 (trait $Y$). It outputs a file that is simply the subset of the tag1 data for SNPs that are present in both the tag1 and tag2 GWAS. The temp() in the output: line tells Snakemake to delete these files when we are done with them. This step also filters out SNPs that have missing effect estimates or standard errors, or whose standard errors are non-positive.
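While the pipeline is running you can peek at one of these temporary files to see what the rule produces (an example tag pair; the file is deleted once the downstream rules that use it have finished):

gzip -cd data/giant_bmi__glg_ldl_overlap.tsv.gz | head -n 3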
CAUSE estimates posteriors using a set of LD pruned variants. To maximize power, we LD prune preferentially choosing variants with low trait $M$ $p$-values. The next two rules in the Snakemake file write a list of LD pruned variants for each trait pair.
First there is a rule that LD prunes a single chromosome:
rule ld_prune_one_chrom:
    input:
        data = data_dir + '{tag1}__{tag2}_overlap.tsv.gz',
        ld1 = ld_dir + 'chr{chrom}_AF0.05_0.1.RDS',
        ld2 = ld_dir + 'chr{chrom}_AF0.05_snpdata.RDS'
    output:
        out = temp(data_dir + "snps_{tag1}__{tag2}.pruned.{chrom}.RDS")
    params: log="ldprune", mem="10G", cpus="4", pval_thresh = "1e-3",
            r2_thresh = 0.1, jobname='ldprune_{chrom}', partition="broadwl"
    shell:
        'Rscript R/ld_prune_one_chrom.R {input.data} {wildcards.chrom} \
                 {params.pval_thresh} {params.r2_thresh} {input.ld1} {input.ld2} {output.out}'
This rule takes as input the overlap data set for trait $M$ that we created previously and LD data. The output is a pruned list of SNPs on a given chromosome. The rule calls an R script, R/ld_prune_one_chrom.R, that reads in the data, removes duplicated SNPs, and then LD prunes using the function cause::ld_prune.
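For reference, the core of R/ld_prune_one_chrom.R looks roughly like the sketch below. This is based on the LD pruning example in the cause package documentation; the argument names and the column names of the LD data objects are assumptions, so check the downloaded script and ?ld_prune for the exact usage.

# dat: overlap data for one chromosome; ld1: pairwise r^2 estimates;
# ld2: per-variant information for the same chromosome (names assumed)
dat <- dat %>% mutate(pval = 2 * pnorm(abs(beta_hat / se), lower.tail = FALSE))
pruned <- ld_prune(variants = dat,
                   ld = ld1, total_ld_variants = ld2$SNP,
                   pval_cols = c("pval"),
                   pval_thresh = c(1e-3))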
The next rule concatenates the pruned lists for all chromosomes for a single pair into one file.
rule ld_prune_combine:
    input:
        fls = expand(data_dir + "snps_{{tag1}}__{{tag2}}.pruned.{chr}.RDS",
                     chr = range(1, 23))
    output:
        out1 = data_dir + "snps_{tag1}__{tag2}.pruned.txt"
    params: log="ld_comb", mem="2G", cpus="1", jobname='combine', partition="broadwl"
    shell:
        "Rscript R/ld_cat.R {output.out1} {input.fls}"
The next step is to run CAUSE:
rule cause:
    input:
        file1 = data_dir + "{tag1}__{tag2}_overlap.tsv.gz",
        file2 = data_dir + '{tag2}__{tag1}_overlap.tsv.gz',
        snps = data_dir + 'snps_{tag1}__{tag2}.pruned.txt'
    output:
        params = cause_dir + '{tag1}__{tag2}_params.RDS',
        cause = cause_dir + '{tag1}__{tag2}_cause.RDS',
        data = data_dir + '{tag1}__{tag2}_data.RDS'
    params: log="cause", mem="5G", cpus="8", jobname='cause', seed = 100
    shell:
        'Rscript R/cause.R {input.file1} {input.file2} \
                 {input.snps} {output.params} \
                 {output.cause} {output.data} {params.seed}'
This rule takes as input the overlap tag1 and tag2 data sets and the list of pruned SNPs. It uses the R script R/cause.R, which we will go through below. We could have used the *_summary_statistics.tsv.gz files in place of the *_overlap.tsv.gz files. Using the overlap files saves reading some data into memory and can be especially helpful when one GWAS has many more variants measured than the other. The R script performs four steps. First, it reads in the two data sets and removes duplicated SNPs:
library(tidyverse)
library(cause)

args <- commandArgs(trailingOnly = TRUE)
# Input files
data_file_1 <- args[1]
data_file_2 <- args[2]
snp_file_asc <- args[3]
# Output files
params_out <- args[4]
cause_out <- args[5]
data_out <- args[6]
seed <- as.numeric(args[7])
#if(is.na(seed)) seed <- 100

col_types <- list(col_character(), col_integer(), col_character(),
                  col_character(), col_character(), col_double(), col_double(),
                  col_double(), col_double(), col_double())
# Read each data set and remove SNPs that appear more than once
d1 <- read_tsv(data_file_1, col_types = col_types)
dup1 <- d1$snp[duplicated(d1$snp)]
d1 <- d1 %>% filter(!snp %in% dup1)
d2 <- read_tsv(data_file_2, col_types = col_types)
dup2 <- d2$snp[duplicated(d2$snp)]
d2 <- d2 %>% filter(!snp %in% dup2)
Second, it merges the two data sets into a single cause_data object. This merge could be done with cause::gwas_format_cause. However, that function performs steps to ensure that alleles are oriented in the same way, which are unnecessary here because the data have already been formatted uniformly. Skipping these steps makes the merging procedure about eight times faster. We save the merged data to use with the other MR methods.

X <- d1 %>%
     select(snp, beta_hat, se) %>%
     rename(beta_hat_1 = beta_hat, seb1 = se) %>%
     inner_join(., d2, by = "snp") %>%
     rename(beta_hat_2 = beta_hat, seb2 = se,
            A1 = ref_allele, A2 = alt_allele) %>%
     select(snp, beta_hat_1, seb1, beta_hat_2, seb2, A1, A2) %>%
     new_cause_data(.)
saveRDS(X, file = data_out)
Third, it estimates the CAUSE nuisance parameters, using a random sample of at most one million variants:

set.seed(seed)
if (nrow(X) < 1e6) {
    snps_grid <- X$snp
} else {
    snps_grid <- sample(X$snp, size = 1e6, replace = FALSE)
}
params <- est_cause_params(X, snps_grid)
saveRDS(params, params_out)
Fourth, it runs CAUSE using the LD pruned variants:

snps_asc <- read_lines(snp_file_asc)
res <- cause(X = X, param_ests = params, variants = snps_asc,
             qalpha = 1, qbeta = 10, force = TRUE)
saveRDS(res, cause_out)
The force = TRUE argument ensures that CAUSE runs even if the parameter estimates didn't converge. In our analysis this never happens.
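Once the pipeline has finished, the saved results for any pair can be examined interactively using methods from the cause package, as in the package tutorial. The file name below is just an example pair:

res <- readRDS("cause/giant_bmi__vanderHarst_cad_cause.RDS")
res$elpd                       # model comparison statistics
summary(res, ci_size = 0.95)   # posterior medians and credible intervals
plot(res)                      # posterior distributions under each model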
The remaining rules in the Snakemake file run the alternative MR methods: IVW regression, Egger regression, and MR-PRESSO. These rules all have the same format, each calling its own R script:
rule ivw:
    input: data = data_dir + '{tag1}__{tag2}_data.RDS'
    output: out = mr_dir + '{tag1}__{tag2}_mr.RDS'
    params: log="mr", mem="1G", cpus="1", jobname='mr_{tag1}__{tag2}'
    shell: 'Rscript R/ivw.R {input.data} 5e-8 {output.out}'

rule egger:
    input: data = data_dir + '{tag1}__{tag2}_data.RDS'
    output: out = mr_dir + '{tag1}__{tag2}_mregger.RDS'
    params: log="mr", mem="1G", cpus="1", jobname='mr_{tag1}__{tag2}'
    shell: 'Rscript R/mregger.R {input.data} 5e-8 {output.out}'

rule mrpresso:
    input: data = data_dir + '{tag1}__{tag2}_data.RDS'
    output: out = mr_dir + '{tag1}__{tag2}_mrpresso.RDS'
    params: log="mrp", mem="1G", cpus="1", jobname='mrp_{tag1}__{tag2}'
    shell: 'Rscript R/mrpresso.R {input.data} 5e-8 {output.out}'
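To give a sense of what one of these scripts might contain, here is a minimal IVW sketch. This is hypothetical, not the contents of the downloaded R/ivw.R (which may, for example, use an MR package instead); it assumes the merged data format saved above and implements the standard fixed-effect IVW estimate as a weighted regression through the origin.

# Usage (hypothetical): Rscript ivw_sketch.R <data.RDS> <pval threshold> <out.RDS>
args <- commandArgs(trailingOnly = TRUE)
X <- readRDS(args[1])
thresh <- as.numeric(args[2])
# Select instruments: variants significant for trait M at the threshold
pval <- 2 * pnorm(abs(X$beta_hat_1 / X$seb1), lower.tail = FALSE)
X <- X[pval < thresh, ]
# IVW: regress trait Y effects on trait M effects, weighting by 1/seb2^2
fit <- lm(beta_hat_2 ~ beta_hat_1 - 1, data = X, weights = 1 / seb2^2)
saveRDS(summary(fit)$coefficients, args[3])

In a real analysis the instruments should also be LD pruned before fitting, as they are in the paper.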
Coming Soon!