knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Why bother with ggplot2 extension for GWAS summary statistic data visualisations?
This vignette lays out the motivation why having a ggplot2 extension is needed. But it also discusses existing alternatives and ways to create a ggplot2-like Q-Q plot and Manhattanplot.
There are two main data visualisations^[There are also other tools used, such as locuszoom plots, but these are harder to implement because they need access to external information, and we will skip these for now.] that are done after the GWAS is run^[We will not discuss their usefulness here.]:
Q-Q plot: checks the deviation of the P-value distribution from the uniform distribution.
Manhattan plot: shows the P-value or summary statistic distribution along the chomosomal position.
By M. Kamran Ikram et al - Ikram MK et al (2010) Four Novel Loci (19q13, 6q24, 12q24, and 5q14) Influence the Microcirculation In Vivo. PLoS Genet. 2010 Oct 28;6(10):e1001184. doi:10.1371/journal.pgen.1001184.g001, CC BY 2.5, Link
Both plotting types can be applied in R with various functions. Just to name a few, here are some links which have also served as inspiration:
qqman
: R package for GWAS results: http://www.gettinggeneticsdone.com/2014/05/qqman-r-package-for-qq-and-manhattan-plots-for-gwas-results.htmlggman
: R package with ggplot2 wrapper with some neat hacks https://github.com/mkanai/ggmanqqman
, ggplot2
and plotly
: https://www.r-graph-gallery.com/wp-content/uploads/2018/02/Manhattan_plot_in_R.htmllattice
(for historical reasons) https://genome.sph.umich.edu/wiki/Code_Sample:_Generating_QQ_Plots_in_Rramwas
package: https://rdrr.io/bioc/ramwas/man/qqplotFast.htmlggplot2
Manhattan plots: https://github.com/pcgoddard/Burchardlab_Tutorials/wiki/GGplot2-Manhattan-Plot-Functionggbio
package, a toolkit for genomic data (plotGrandLinear
is an implementation for a Manhattan plot, uses GRanges
objects, ggplot2 wrapper): https://bioconductor.org/packages/release/bioc/html/ggbio.html ggman
, wrapper (but no ggplot2 extension) for Manhattan plots: https://github.com/drveera/ggmanAn often used package is qqman, that has a function for Q-Q plots qqman::qq
and Manhattan plots qqman::manhattan
.
And there are various ggplot2 wrappers (ggbio
, ggman
), but not actual extensions.
The problem is two fold:
ggplot2
can (e.g. facetting, colouring).The idea is therefore to create a ggplot2 extension that would facilitate all this: 1) fast plotting (through hexagons and filtering) and 2) building ggplot2
geoms and stats.
By writing a ggplot2 extension, we can inherit lots of the default ggplot2 functionalities and shorten the input.
Let's say we have GWAS summary statistics for a number of SNPs. We call this data gwas.summarystats
: for a number of SNPs listed by row, we know the SNP identifier (SNP
) and the P-value (P
).
SNP P rs3342 1e-2 rs83 1e-2 ... ...
For illustration purposes, we will use the summary statistics dataset GWAS.utils::giant
.
library(GWAS.utils) #skimr::skim(giant) summary(giant)
qqman::qq
is straightforward to use: just pass on the P-value column.
qqman::qq(giant$P)
There is a geom_qq
in ggplot2 that implements quantile-quantile plots. However, this is only useful, if we can transform the axes with -log10()
. How to create custom made ggplot2 scales is explained here.
library(ggplot2) theme_set(theme_bw()) ## source: (gg <- ggplot(data = giant) + geom_qq(aes(sample = P), distribution = stats::qunif)) + theme(aspect.ratio = 1) # + scale_y_log10() + scale_x_log10() # + coord_trans(x = "log10", y = "log10") f_trans_mlog10 <- function(x) -log10(x) f_trans_mlog10_inverse <- function(x) 10^(-x) mlog10_trans <- scales::trans_new("-log10", transform = f_trans_mlog10, inverse = f_trans_mlog10_inverse) gg + scale_y_continuous(trans = mlog10_trans) + scale_x_continuous(trans = mlog10_trans) + geom_abline(intercept = 0, slope = 1) + theme(aspect.ratio = 1)
N <- nrow(giant) expected <- (1:N) / N - 1 / (2 * N) giant <- giant %>% dplyr::arrange(P) %>% dplyr::mutate(P_expected = expected) ggplot(data = giant) + geom_point(aes(-log10(P_expected), -log10(P))) + geom_abline(intercept = 0, slope = 1) + theme(aspect.ratio = 1) + xlab(expression(Expected ~ ~-log[10](italic(p)))) + ## from qqman::qq ylab(expression(Observed ~ ~-log[10](italic(p)))) ## from qqman::qq ## or with the scale_y_continous feature ggplot(data = giant) + geom_point(aes(P_expected, P)) + scale_y_continuous(trans = mlog10_trans) + scale_x_continuous(trans = mlog10_trans) + geom_abline(intercept = 0, slope = 1) + theme(aspect.ratio = 1) #+ ## square shaped # expand_limits(x = -log10(max(giant$P)), y = -log10(max(giant$P)))
ggplot(data = giant) + geom_gwas_qq(aes(y = P))
Here is a whishlist of what a geom_gwas_qq
function should be able to do:
raster
version (for faster plotting), hexbins (also for faster plotting) and Pvalue thresholding (removing the high Pvalue SNPs from the plot)theme(aspect.ratio = 1)
)qqman::manhattan(giant, chr = "CHR" , bp = "POS", p = "P")
## computing new x axis giant <- giant %>% dplyr::arrange(CHR, POS) %>% dplyr::mutate(tmp = 1, cumsum.tmp = cumsum(tmp)) ## calculating x axis location for chromosome label med.dat <- giant %>% dplyr::group_by(CHR) %>% dplyr::summarise(median.x = median(cumsum.tmp)) ggplot(data = giant) + geom_point(aes(x = cumsum.tmp, y = -log10(P), color = CHR %%2 == 0)) + ## create alternate coloring geom_hline(yintercept = -log10(5e-8)) + ## add horizontal line scale_x_continuous(breaks = med.dat$median.x, labels = med.dat$CHR) + ## add new x labels guides(colour=FALSE) + ## remove legend xlab("Chromosome") + ylab(expression(-log[10](italic(p)))) + ## y label from qqman::qq scale_color_manual(values = c(gray(0.5), gray(0))) ## instead of colors, go for gray
ggplot(data = giant) + geom_gwas_manhattan(aes(pos = POS, chr = CHR, y = -log10(P)))
Here is a whishlist of what a geom_gwas_manhattan
function should be able to do:
raster
version (for faster plotting) and Pvalue thresholding (removing the high Pvalue SNPs from the plot)How to test plots?
One option is, to compare ggplot2 object data. In the example below, we are comparing two ggplot2 outputs, one created with qplot
and one with ggplot
.
gg1 <- qplot(Sepal.Length, Petal.Length, data = iris) gg2 <- ggplot(data = iris) + geom_point(aes(Sepal.Length, Petal.Length)) identical(gg1$data, gg2$data)
We can apply this to our package by creating the qqplot and manhattanplots manually by hand, and then comparing the to the function outputs.
Another option is to use https://github.com/lionel-/vdiffr
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.