knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(LocusXcanR)
The interpretation of TWAS results is not always straightforward and can be quite challenging, especially for large-scale analyses, such as the Genetic Epidemiology on Adult Health and Aging (GERA) cohort (N ~ 55,000) and UK Biobank (N ~ 500,000). Care must be taken to interpret TWAS results in context (e.g. alongside information known from other 'omics studies such as GWAS), understand how multiple TWAS signals at a given locus may be affected by correlated gene expression and/or the sharing of model variants, and evaluate how fine-mapping results either clarify the most likely causal genes, or may be confounded by some of the same challenges as the original TWAS analysis.
Thus, in order to facilitate the interpretation of TWAS findings, we present an R package which creates an R Shiny application designed to interactively visualize TWAS results. We discuss, in detail, the features of the R package and R Shiny application and show how it can be used to quickly interpret very complex TWAS loci in the context of prior knowledge from GWAS.
dir <- system.file("extdata", package="LocusXcanR") twas_ds <- read.table(file.path(dir, "twas_ds.txt"), header=T, stringsAsFactors = F) ld_gwas <- read.table(file.path(dir, "ld_gwas.txt"), header=T, stringsAsFactors = F) load(file.path(dir,"pred_exp_corr.Rda")) db_genes <- read.table(file.path(dir, "db_genes.txt"), header=T, stringsAsFactors = F) all_gwas <- read.table(file.path(dir, "all_gwas.txt"), header=T, stringsAsFactors = F) weight_tbl <- read.table(file.path(dir, "weight_tbl.txt"), header=T, stringsAsFactors = F) known_variants <- read.table(file.path(dir, "gwas_sentinel.txt"), header=T, stringsAsFactors = F) known_gwas <- read.table(file.path(dir, "known_gwas.txt"), header=T, stringsAsFactors = F) cytoband_ds <- read.table(file.path(dir, "cytoBandIdeo.txt.gz"), header = F, sep = "\t")
The LocusXcanR
function takes several datasets as input, including TWAS results, GWAS results, gene expression reference panel weights, GWAS LD, and predicted expression correlation. These results are then integrated into a single HTML webpage via R Shiny, and the interactive visualizations and data tables can then be used to facilitate TWAS results interpretation.
LocusXcanR(twas_result, weight_tbl, study_name = "", pred_exp_corr, conditional_present = FALSE, multiple_tissues = FALSE, known_variants, known_gwas, db_genes, all_gwas, ld_gwas, ref_expr_name = "", head_details = "", method_details = "", primary_tissue, meta_present = FALSE, ideogram_present = FALSE, genome_build, cytoband_ds, add_UI = "", add_server = "")
The LocusXcanR
function contains the following required parameters and their definitions:
twas_result
: character, file path to TWAS results (required). See TWAS results dataset section below for more details.pvalthresh
: numeric, -log10 p-value threshold for TWAS results.weight_tbl
: character, file path to TWAS weights database (required). See Weights table section below for more details.known_variants
: character, file path to known GWAS variants (required). See Known variants dataset section below for more details.pred_exp_corr
: character, file path to predicted expression correlation (required). See Predicted expression correlation dataset section for more details.known_gwas
: character, file path to study GWAS data matching known variants (required). See Known GWAS dataset section below for more details.db_genes
: character, file path to a list of genes in the database (required). See Reference panel genes for more details.all_gwas
: character, file path to study GWAS data (required). See GWAS results dataset for more details.ld_gwas
: character, file path to the LD among study variants or an LD reference panel (required). See LD GWAS section for more details.The LocusXcanR
function contains the following optional parameters and their definitions:
study_name
: character, the name of the study/cohort (optional, default is missing). For example study_name = "Genetic Epidemiology on Adult Health and Aging"
or study_name = "GERA"
.conditional_present
: logical, TRUE if conditional analysis results are available for plotting, FALSE otherwise (default is FALSE). See TWAS results dataset section for more details on including conditional analysis results.multiple_tissues
: logical, TRUE if TWAS results are available from more than one tissue, FALSE if TWAS results are only available from a single tissue or a multi-tissue analysis (default is FALSE). This parameter can be used to compare, for example, results from a primary tissue gene expression reference panel (e.g. Depression Genes and Networks whole blood gene expression) to one or more secondary tissue gene expression reference panel(s) (e.g. Genotype-tissue Expression Project whole blood gene expression), when multiple_tissues = TRUE
. See TWAS results dataset section for more details on including secondary tissue gene expression reference panels. This functionality is not currently available. The application only displays results for a single-tissue TWAS analysis or a multi-tissue analysis, but cannot be used to compare TWAS results from multiple tissues at this time.ref_expr_name
: character, name of the reference expression data set used in the analysis (optional, default is missing). For example ref_expr_name = "Depression Genes and Networks whole blood"
or ref_expr_name = "DGN"
.head_details
: character, any additional header details to be included in the app (optional, default is missing). HTML formatting commands may be used. See Frequently Asked Questions section for header_details
examples.method_details
: character, detailed methods section (optional, default is missing). HTML formatting commands may be used. See Frequently Asked Questions for method_details
examples.primary_tissue
: character, if multiple tissues are available for analysis, list the name of the primary tissue (required if multiple_tissues = TRUE
. For example, if you have results to compare from DGN and GTEx, and DGN is your primary tissue reference panel, then primary_tissue = "DGN"
. The character string given to primary_tissue
must match a tissue
name in the dataset loaded by twas_result
. See TWAS results dataset section for more details.meta_present
: logical, TRUE if results from TWAS meta-analysis are present for comparison, FALSE otherwise (optional, default is FALSE).meta_thresh
: numeric, p-value threshold for meta-analysis results (required if meta_present = TRUE
).ideogram_present
logical, TRUE if an IdeogramTrack should be plotted, FALSE otherwise (optional, default is FALSE)genome_build
: character, the genome build for the IdeogramTrack data (required if ideogram_present = TRUE
). For example, genome_build = "hg19"
for Genome Reference Consortium Human Build 37 (GRCh37).cytoband_ds
: character, file path to the cytogenetic bands data set (required if ideogram_present = TRUE
). See Ideogram plot section for more details on including an IdeogramTrack in the app.add_UI
: character, additional elements to add to UI (optional, default is missing). This parameter can be used to include your own additional items at the end of the UI.add_server
: character, additional elements to add to server (optional, default is missing). This parameter can be used to include your own additional items in the server.twas_ds
)The TWAS results dataset includes, at the very minimum, TWAS results (in the form of a p-value) from a primary TWAS analysis. P-values from a primary TWAS analysis may come from a single-tissue TWAS analysis, a multi-tissue TWAS analysis, or a TWAS meta-analysis. The TWAS results dataset may optionally include results from a TWAS meta-analysis (e.g. for replication), TWAS conditional analysis, and/or TWAS results from secondary tissue gene expression reference panels for comparison.
The dataset twas_ds
requires the following variables:
tissue
: character, name (or an abbreviated name) of the tissue reference panellocus
: integer, value indicating the locus numberlocstart
: integer, value indicating the genomic start position of the locuslocstop
: integer, value indicating the genomic stop position of the locuschr
: integer, chromosome numberindex
: character, the index (most significant) gene within the locuspheno
: character, the name of the phenotype being analyzed (may be one or more phenotypes)genestart
: integer, value indicating the genomic start position of the gene (may be the gene start position or transcription start site or whatever starting position you would like to plot for the gene)genestop
: integer, value indicating the genomic stop position of the gene (may be the gene stop position or transcription stop site or whatever stopping position you would like to plot for the gene)p
: numeric, p-value of the TWAS association test between phenotype and predicted gene expressiongenename
: character, the name of the gene (may be the gene name or the ENSMBL ID or any other name you use to identify the gene)SignifGene
: boolean, indicator of whether the gene-trait association is statistically significant (1 if significant, 0 otherwise)Optional variables may also be included in the twas_ds
dataset. These include the following:
p_meta
: numeric, p-value from a TWAS meta-analysis used for replication, if available (required if meta_present = TRUE
)p_conditional
: numeric, p-value from a TWAS conditional analysis, if available (required if conditional_present = TRUE
Additional variables (beyond those that are required or optional) may also be included in the twas_ds
dataset. Any additional variables will be ignored when plotting figures but will be printed in data tables within the application.
Here is an example of twas_ds
:
knitr::kable(twas_ds[79:83,])
all_gwas
)The GWAS results dataset all_gwas
contains all of the analytic cohort GWAS p-values that you wish to plot. It is recommended that you subset the GWAS results dataset based on a p-value threshold (e.g. only include variants with GWAS p-value < 0.05). The dataset contains variant information, the GWAS p-value, and the locus in which the variant needs to be plotted.
The dataset all_gwas
requires the following variables:
Here is an example of all_gwas
:
knitr::kable(all_gwas[c(1:5),])
weight_tbl
)The weight_tbl
dataset contains a set of genes from the primary gene expression reference panel along with their corresponding predictive model variant weights. This data set is used to connect TWAS genes to their respective GWAS model variants and report the weight and direction of the effect.
The dataset weight_tbl
requires the following variables:
Here is an example of weight_tbl
:
knitr::kable(weight_tbl[c(1:5),])
known_variants
)The known_variants
dataset contains a list of GWAS sentinel variants within the region. GWAS sentinel variants may come from GWAS Catalog or some other curated list (e.g. from the most recent and/or comprehensive GWAS for your trait of interest). This dataset is used to identify TWAS genes which have already been assigned to GWAS variants and/or identify within-cohort GWAS variants which have already been identified by previous GWAS studies.
The dataset known_variants
requires the following variables:
Additional fields may be included in the dataset. Any additional fields will be printed in the data table reported in the application.
Here is an example of known_variants
:
knitr::kable(known_variants[c(1:5),])
known_gwas
)The known_gwas
dataset contains the set of variants that are included in both the analytic GWAS cohort and are known sentinel GWAS variants from other GWAS studies. The dataset includes a variant identifier along with the GWAS p-value for the analytic cohort and the locus number in which you wish to plot the variant.
The dataset known_gwas
requires the following variables:
Additional fields (beyond those that are required) may also be included in the dataset. Any additional variables will be printed in the data table reported in the application.
Here is an example of known_gwas
:
knitr::kable(known_gwas[c(1:5),])
pred_exp_corr
)The predicted expression correlation dataset is an M X M
matrix of M
genes required for plotting, which contains pairwise correlations between each gene in the matrix. You may include all genes from your TWAS analysis; however, for computational efficiency, it is recommended that you only include those genes that you wish to plot (e.g. only the statistically significant genes and those previously known from GWAS). This correlation matrix must be in .Rdata
format, and the name of the correlation matrix object must be saved as capital M
.
Here is an example of the M
correlation matrix object loaded by the pred_exp_corr
parameter.
knitr::kable(M[1:5,1:5])
ld_gwas
)The dataset containing pair-wise LD values among the predictive model variants at the locus is specified by the ld_gwas
parameter. The dataset includes chromosome and genomic position for each variant, and each row in the table additionally contains the LD between the pair-wise variants and the locus in which the variants should be plotted. Chromosome and genomic position in this table should match the chromosome and genomic position of the model variants in the weight_tbl
dataset. The variables chra
and posa
at each locus represent the chromosome and position for the index SNP at the locus.
The dataset ld_gwas
requires the following variables:
chra
: numeric, the chromosome number for variant (a). This is for the index variant at the locus.posa
: numeric, the genomic position for variant (a). This is for the index variant at the locus.chrb
: numeric, the chromosome number for variant (b)posb
: numeric, the genomic position for variant (b)corrab
: numeric, LD between variant (a) and variant (b)locus
: numeric, the locus number in which these variants will be plottedHere is an example of ld_gwas
:
knitr::kable(ld_gwas[1:5,])
db_genes
)The db_genes
dataset contains a list of genes from the primary gene expression reference panel that were either not predicted in the analytic cohort or no gene expression model was fit for the gene. This data set is used to determine whether there are any genes from the reference panel which were not predicted in the analytic cohort or no gene expression model was fit for the gene, but one or more GWAS sentinel variants implicate the gene as potentially causal.
The dataset db_genes
requires the following variables:
Here is an example of db_genes
:
knitr::kable(db_genes[c(19:23),])
cytoband_ds
)A note on cytogenetic bands:
Each human chromosome has a short arm ("p" for "petit") and long arm ("q" for "queue"), separated by a centromere. The ends of the chromosome are called telomeres. Each chromosome arm is divided into regions, called cytogenetic bands, that can be seen using a microscope and special stains. The cytogenetic bands are labeled p1, p2, p3, q1, q2, q3, etc., counting from the centromere out toward the telomeres. At higher resolutions, sub-bands can be seen within the bands. The sub-bands are also numbered from the centromere out toward the telomere.
The dataset (cytoband_ds
) used to identify the cytogenetic bands in IdeogramTrack()
is typically downloaded directly from UCSC. For example, the example data used here was obtained from the UCSC genome annotation database for the Feb. 2009 assembly of the human genome (hg19, GRCh37 Genome Reference Consortium Human Reference 37 (GCA_000001405.1)), downloaded from http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/.
Here is an example of cytoband_ds
used in this example:
colnames(cytoband_ds) <- c("chrom", "chromStart", "chromEnd", "name", "gieStain") knitr::kable(cytoband_ds[1:5,])
Here is an example of the output from IdeogramTrack() which uses the cytogenetic band dataset (cytoband_ds
):
genome_build = "hg19" itrack = Gviz::IdeogramTrack(genome = genome_build, chromosome = "chr3", bands = cytoband_ds) Gviz::plotTracks(c(itrack), from = 51744800, to=53804965, cex=1.5)
An example of the R shiny application used on a real data analysis (i.e. results from TWAS and GWAS) is available in the R script R/example/example_script.R
.
R Shiny applications may be shared via several different formats. Shiny applications are fundamentally a web page, which subsequently allows them to be shared over the web. Essentially, in order for the application to be shared over the web, you must arrange to replace your own computer, which shares your app locally, with a web server that can share your app over the cloud.
Files must be saved in a standardized format and directory structure in order for any Shiny server to build your app. Once your app is appropriately packaged for a Shiny server, RStudio recommends starting by launching your app on shinyapps.io. Shinyapps.io is a service run by RStudio that allows you to share your apps online; the free server is built by RStudio and is easy to use, secure, and scalable to multiple users.
For additional details and recommendations about publishing your application online, see RStudio's R Shiny tutorial: https://vimeo.com/rstudioinc/review/131218530/212d8a5a7a/#t=25m49s
RStudio's R Shiny video tutorials provide a great beginner's guide to the R Shiny framework: https://shiny.rstudio.com/tutorial/. Links to more advanced Shiny topics are available from this web page as well.
The header_details
parameter is a character string parameter which may include simple text characters and/or HTML formatting commands. For example, simple text characters may include the following:
header_details = "All genomic positions are from GRCh37. TWAS results for 10 traits are presented in the figures and tables below."
HTML commands for formatting may be included as in the following example:
header_details = "All genomic positions are from GRCh37.<p>TWAS results for 10 traits are presented in the figures and tables below. Traits and trait categories are listed and defined as follows:<p><ul><li>Platelet count (PLT)</li><li>Red blood cell indices (RBC): red blood cell count (RBC), hematocrit (HCT), hemoglobin (HGB), mean corpuscular volume (MCV), and red cell distribution width (RDW)</li><li>White blood cell indices (WBC): white blood cell count (WBC), monocyte (MONO), neutrophil (NEUTRO), and lymphocyte (LYMPH)</li></ul>"
Similar to header_details
, the method_details
parameter is a character string parameter which may include simple text characters and/or HTML formatting commands. Additionally, method_details
may include R shiny functions as follows.
For example, you could save the following code in a script called methods.R
:
shiny::tabPanel("Methods", shiny::h3(shiny::strong("Methods")), "A current draft of the manuscript describing these results and detailing the methods can be found here:", shiny::tags$a(href="https://docs.google.com/document/d/1TfqzBPGhL6TEqUuxN5lpy7cJVZcZhyiy/edit", "https://docs.google.com/document/d/1TfqzBPGhL6TEqUuxN5lpy7cJVZcZhyiy/edit",target="_blank"), ". All cohorts included in the analysis are described individually below. We analyze 10 hematological phenotypes (platelet count, red blood cell count, hematocrit, hemoglobin, mean corpuscular volume, red cell distribution width, white blood cell count, monocyte count, neutrophil count, and lymphocyte count) across all cohorts.", shiny::hr(), shiny::h4(shiny::strong("PrediXcan method:")), shiny::br() )
Then load your methods.R
script into your working script, using for example:
method <- source("methods.R")
Finally, use the method
variable as the value of the method_details
parameter within the LocusXcanR
function, as in:
LocusXcanR(method_details = method)
For a more detailed example of this procedure, see the example_script.R
and the methods.R
scripts in R/example/
.
The order of the column names does not matter. However, the names of the columns in your data must match the column names (required or optional) listed for each dataset in the Data Pre-processing sections. Any additional variables included in your datasets which do not match those listed in the Data Pre-processing sections will also be printed in the data tables within the application.
methods.R
example file necessary for the shiny app?methods.R
isn't required for the app. If you don't give anything to the method_details
parameter then it defaults to method_details=""
.
gwas_sentinel.txt
and known_gwas.txt
example files?The gwas_sentinel.txt
example data file contains the data that are loaded into the known_variants
parameter. These data will represent your list of known GWAS sentinel variants from, for example, the latest GWAS analysis on your phenotype of interest. The known_gwas.txt
example file contains the analytic cohort GWAS p-values that correspond to the known variants from the latest GWAS analysis. So the variants in the known_gwas.txt
file match the variants in the gwas_sentinel.txt
file (where there are matches), but the p-value in the known_gwas.txt
file should come from GWAS in your analytic cohort.
Thank you to the TWAS working group in Dr. Yun Li's lab at the University of North Carolina at Chapel Hill. Special thanks go out to Bryce Rowland, Jia Wen, and Laura Raffield.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.