knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction to transcriptome-wide association studies (TWAS)

TWAS are techniques that were concurrently created by Gamazon et al, 2015 and Gusev et al, 2016 to leverage eQTLs to identify gene-trait associations. A traditional TWAS pipeline is as follows:

  1. In a tissue that is relevant to a phenotype of interest (i.e. breast tumor for breast cancer outcomes or the pre-frontal cortex for neuropsychiatric disorders), we train predictive models of tissue-specific expression from genotype.
  2. Then, in a large GWAS cohort, we use these predictive models to impute tissue-specific genetically regulated expression to detect gene-trait associations.

In training predictive expression models, most TWAS methods (e.g. PrediXcan, FUSION, TIGAR, EpiXcan, etc) focus on local genetic variation around the gene of interest (0.5 to 1 Megabase around the gene). However, Boyle et al, 2017 and Liu et al, 2019 have proposed that up to 70% of gene expression can be attributed to distal variation in the genome, suggesting that the inclusion of these distal variants in TWAS models may improve prediction and power to detect gene-trait associations.

For this reason, we have developed Multi-Omic Strategies for TWAS (MOSTWAS), an intuitive suite of tools to prioritize distal variants in transcriptomic prediction and conduct TWAS-like association testing using GWAS summary statistics. MOSTWAS incorporates two methods to include distal-eQTLs in transcriptomic prediction: Mediator-enriched TWAS (MeTWAS) and distal-eQTL prioritization via mediation analysis (DePMA).

All MOSTWAS models output:

We recommend training genes with both MeTWAS and DePMA and prioritizing the model with greater cross-validation $R^2$ for association testing.

Using MOSTWAS

MOSTWAS is meant to be run on a high-performance cluster (HPC) and integrates with GCTA to perform file formatting, linkage disequilibrium estimation, and heritability estimation. If your HPC already has PLINK and GCTA installed, be sure to load and add those modules to your personal PATH with something like this:

module add plink
module add gcta

If PLINK and GCTA aren't automatically installed, you can simply download and install these software add them manually to your PATH:

module use /path/to/personal/plink
module use /path/to/persona/gcta
module load plink
module load gcta64
module add plink
module add gcta

Now, we can go to R to use MOSTWAS.

Installing MOSTWAS

MOSTWAS is available freely on Github. To install the package:

devtools::install_github('bhattacharya-a-bt/MOSTWAS')

Necessary input files

Let's assume that the reference eQTL panel has $n$ samples. MOSTWAS, at the very least, requires the input files for the following:

  1. SNP dosages and locations,
  2. gene expression and locations,
  3. mediator intensities and locations, and
  4. relevant covariates.

Mediators can include any biomarker that may influence the transcription of a gene (i.e. transcription factors, microRNAs, DNA CpG methylation sites, etc). The term "intensity" here is used as a catch-all term for expression and methylation of mediators.

MOSTWAS uses files that are formatted in the MatrixEQTL format (see the MatrixEQTL documentation). Briefly, SNP dosage, gene expression, and mediator intensity files all have $n+1$ columns, where the first column is an identifier for the biomarker (SNP, gene, or mediator) and the last $n$ columns are the scaled values for that biomarker. The location files have columns (in order) for the identifier, the chromosome, and the base-pair position of the biomarker. Gene location files have a fourth column with the end location of the gene.

We recommend that the file that contains the covariates include principal components of the SNP dosage matrix (in the range of 5-20, depending on the dataset). Relevant clinical covariates can be included as well. This file has $n+1$ columns, where the first column is the identifier for the covariate and the last $n$ columns contain the values for the covariates across the $n$ samples.

Here's an example of what a SNP dosage file looks like:

head(data.table::fread('C:/Users/Arjun/OneDrive - University of North Carolina at Chapel Hill/MOSTWAS/inst/extdata/snps.txt'))[,1:10]

Here's an example of what a SNP location file looks like:

head(data.table::fread('C:/Users/Arjun/OneDrive - University of North Carolina at Chapel Hill/MOSTWAS/inst/extdata/snpLocs.txt'))

For training models using the MeTWAS() and DePMA() models, we recommend keeping the genotype files in binary PLINK format (i.e. .bed/.bim/.fam) and reading in the file using the bigsnpr package. Here's a link to their documentation.

eQTL analysis

MeTWAS and DePMA require results for QTL analysis:

We have provided a wrapper function to do these eQTL analyses with MatrixEQTL:

eQTL_MOSTWAS(SNP_file_name = 'snps.txt',
             snps_location_file_name = 'snpLocs.txt',
             SNP_file_name_dis = 'snps_dis.txt',
             expression_file_name = 'genes.txt',
             gene_location_file_name = 'geneLocs.txt',
             mediators_file_name = 'mediators.txt',
             meds_location_file_name = 'medLocs.txt',
             covariates_file_name = 'covs.txt',
             output_file_name_loc_qtl = 'out_loc_qtl.txt',
             output_file_name_dis_qtl = 'out_dis_qtl.txt',
             output_file_name_loc_med = 'out_loc_med.txt',
             output_file_name_dis_med = 'out_dos_med.txt',
             p_loc_qtl = 1e-6,
             p_dis_qtl = 1e-6,
             FDRcut = 0.05,
             useModel = 'modelLinear',
             DePMA = F)

We have provided a toggle DePMA that takes in a logical object (i.e. TRUE or FALSE) to do either QTL analysis for DePMA or MeTWAS. MatrixEQTL is also a very intuitive software, so you can easily code your own analyses.

Please cite Shabalin 2012 if you use MOSTWAS!

Training models with MeTWAS

MOSTWAS wrapper functions have multiple options to provide users with freedom in training predictive models. It may get a little confusing, but we'll cover the options one-by-one.

You can run MeTWAS all in one, for a given gene (call this gene test):

MeTWAS(geneInt,
       snpObj,
       mediator,
       medLocs,
       covariates,
       dimNumeric,
       qtlFull,
       h2Pcutoff = .1,
       numMed = 5,
       seed = 1218,
       k = 5,
       cisDist = 1e6,
       parallel = F,
       prune = F,
       ldThresh = .5,
       cores,
       verbose = T,
       R2Cutoff = 0.01,
       modelDir,
       tempFolder)

A few notes on these options:

Training models with DePMA

You can run DePMA all in one, for a given gene (call this gene test). Many of the options are similar, so we'll highlight the different options in DePMA:

DePMA(geneInt = 'test',
      snpObj,
      mediator,
      medLocs,
      covariates,
      cisDist = 1e6,
      qtlTra,
      qtMed,
      h2Pcutoff,
      dimNumeric,
      verbose,
      seed,
      sobel = F,
      nperms = 1000,
      k,
      parallel,
      parType = 'no',
      prune,
      ldThresh = .5,
      cores,
      qtlTra_parts,
      qtMed_parts,
      modelDir,
      tempFolder,
      R2Cutoff)

A few notes on these options:

Running multiple genes in parallel

Usually, you'll need to train models for 15,000+ genes in an RNA-seq panel. Serially running DePMA is slow (MeTWAS is quite fast), even when parallelizing within a single gene. We recommend a batch submission procedure, perhaps using the rslurm package. Given that the genes of interest as in the vector geneList, we can simply define a data.frame of parameters and submit multiple jobs all in one:

pars = data.frame(geneInt = geneList)
batch_DeP <- function(geneInt){
  DePMA(geneInt,snpObj,mediator,...)
}
library(rslurm)
sjob <- slurm_apply(batch_DeP, pars, jobname = 'test_DePMA',
                    nodes = 2, cpus_per_node = 2, submit = TRUE)

This generates R scripts for every gene and submits them to your SLURM cluster. You can delete all the files using cleanup_files(sjob).

Other options include the BatchJobs package on CRAN or snakemake.

With slurm_apply(), we were able to fit a whole RNA-seq panel using both MeTWAS and DePMA in roughly 20 hours.

FUSION-like association testing

We provide functions for testing as well. The function burdenTest() does the weighted burned test of association and gives the option for permutation testing, as in FUSION:

burdenTest(wgt, ### .RData file name for a given gene's MOSTWAS model
           snps, ### data.frame/data.table for SNP dosages
           sumStats, ### data.frame/data.table for GWAS summary statistics
           snpAnnot = NULL,
           onlyCis = F, ### toggle to include only local SNPs
           beta, ### column name for effect sizes in GWAS summary stats
           se, ### column name for standard errors for betas in GWAS summary stats
           chr, ### column name for SNP chromosome locations in GWAS summary stats
           pos, ### column name for SNP basepair position in GWAS summary stats
           ref, ### column name for SNP reference allele in GWAS summary stats
           pval, ### column name for P-value in GWAS summary stats
           R2cutoff = 0.01, ### cross-validation R2 cutoff to test a gene
           alpha, ### P-value cutoff in TWAS association to conduct permuation test
           nperms = 1e3 ### number of permutations in permuation test)

A comment on permutation testing:

Distal-SNPs added last test

For genes with significant TWAS associations from burdenTest(), we may be interested in assessing whether the distal SNPs provide any added information beyond what we can observe from the local variation around the gene of interest. We developed an added-last test of distal SNPs using summary statistics. It can be run using the addedLastTest() function that take in similar options as burdenTest():

addedLastTest(wgt,
              snps,
              sumStats,
              snpAnnot = NULL,
              beta,
              se,
              chr,
              pos,
              ref,
              pval,
              R2cutoff,
              locChrom ### chromosome on which the gene of interest exists)

The added-last test need not be run only on genes prioritized for permutation testing. We recommned any significant gene in overall TWAS association to be prioritized for added-last testing. The results from this test can allow users to focus on these distal regions for functional hypothesis generation and possible follow-up studies.



bhattacharya-a-bt/MOSTWAS documentation built on April 6, 2023, 12:20 a.m.