knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(ggplot2) library(ggGWAS) library(GWAS.utils) theme_set(theme_bw())
This vignette contains some technical aspects about developing the ggGWAS
package.
-log10
There are different scales available in ggplot2 that apply a log10
transformation.
scale_x_log10
and scale_y_log10
scale_x_continuous(trans = "log10")
and scale_y_continuous(trans = "log10")
coord_trans(x = "log10", y = "log10")
They all to slightly different things.
First, there is scale_x_continuous
, which simply applies the log10
function to the data. Available transformations include "asn", "atanh", "boxcox", "exp", "identity", "log", "log10", "log1p", "log2", "logit", "probability", "probit", "reciprocal", "reverse" and "sqrt" (from ?scale_x_continuous
)
set.seed(3) df <- data.frame(y = runif(1000)) qp <- ggplot(df, aes(sample = y)) + stat_qq(distribution = stats::qunif) qp + scale_x_continuous(trans = "log10") + scale_y_continuous(trans = "log10")
Second, there is scale_{x|y}_log10
, which simply applies the log10
function to the data.
qp + scale_x_log10() + scale_y_log10()
Third, there is coord_trans
, that takes as input the transformation (which start with scales::*_trans
). Note that the grid is different.
qp + coord_trans(x = "log10", y = "log10")
For our Q-Q plot we want to create a new transformation that performs a -log10
transformation, which we will call mlog10
.
Either we use scale_y_continuous
qp + scale_y_continuous(trans = "mlog10") + scale_x_continuous(trans = "mlog10") + geom_abline(intercept = 0, slope = 1)
or coord_trans
(not sure why this won't work)
qp + coord_trans(x = "mlog10", y = "mlog10") + geom_abline(intercept = 0, slope = 1)
Three major problems with ggplot2 plots of large datasets are 1) object size, 2) rendering speed and 3) the resulting image file size (e.g. pdf or png).
Passing on a large dataset to ggplot2 will create a large object (because the dataset will actually - in various forms - become part of the object). To manipulate this, would mean to manipulate some core parts of ggplot2.
However, rendering a large object does not necessarily take long. For example, we can avoid plotting each single observation by using hexagons.
A more complex topic is rastering. Rastering actually increases the time to render a plot, but decreases the file output size.
In the following we are going through options to optimise plotting and then measure object size, rendering speed and file output size (as a pdf).
Here are the options:
ggrastr:::GeomPointRast
)ggplot::geom_hex
)And here are the the metrics used to measure object size, rendering time and file output size.
print_object_size <- function(x) { pryr::object_size(x) ## in bytes ## do not use object.size(), as not accurate, or use pryr::compare_size(x) } print_rendering_time <- function(x) { system.time(x) } ## check out a better benchmarking time post: https://www.alexejgossmann.com/benchmarking_r/ ## maybe microbenchmark package? print_file_size <- function(gg) { invisible(ggsave("tmp.pdf", gg, width = 4, height = 4)) cat(file.info("tmp.pdf")$size / 1024, " Kb.\n", sep = "") unlink("tmp.pdf") }
Our benchmark is the base plot version from the qqman
package:
n.sample <- 1e3 set.seed(3) df <- data.frame(P = runif(n.sample), GWAS = sample(c("a","b"), n.sample, replace = TRUE))
## Object size print_object_size(qqman::qq(df$P)) ## Rendering speed print_rendering_time(qqman::qq(df$P)) ## File output size print_file_size(qqman::qq(df$P))
qp_points <- ggplot(df, aes(y = P)) + stat_gwas_qq() + geom_abline(intercept = 0, slope = 1)
## Object size print_object_size(qp_points) ## Rendering speed print_rendering_time(print(qp_points)) ## File output size print_file_size(qp_points)
The R-package ggrastr
does ???. (for the price of longer rendering time).
It creates very light output, but is not necessarily fast.
We can use the rastering functionality as geom in our function.
qp_raster <- ggplot(df, aes(y = P)) + stat_gwas_qq(geom = ggrastr:::GeomPointRast) + geom_abline(intercept = 0, slope = 1)
## Object size print_object_size(qp_raster) ## Rendering speed print_rendering_time(print(qp_raster)) ## File output size print_file_size(qp_raster)
qp_hex <- ggplot(df, aes(y = P)) + stat_gwas_qq_hex() + geom_abline(intercept = 0, slope = 1)
## Object size print_object_size(qp_hex) ## Rendering speed print_rendering_time(print(qp_hex)) ## File output size print_file_size(qp_hex)
Problem is, that the original dataset is still passed on. This is not a problem in terms of speed, but can be for memory. More about this further down.
Note, that the used hex.function
called hexBinSummarise_custom
does not have the usual properties, in that it will have the hexagon shape, but not any count legend. Also has a tweaked hexbinheight
. If hex.function = hexBinSummarise
is used, the plot will have the usual properties.
## old version ggplot(data = GWAS.utils::giant) + stat_gwas_qq_hex(aes(y = P), hex.function = ggplot2:::hexBinSummarise) ## new version ggplot(data = GWAS.utils::giant) + stat_gwas_qq_hex(aes(y = P), hex.function = ggGWAS:::hexBinSummarise)
Something very simple is to subset the data, since we are usually not interested in the very high P-values (although, we are at times, when it comes to model sanity checks).
qp_subset <- ggplot(df, aes(y = P)) + stat_gwas_qq(y.thresh = 0.05) + geom_abline(intercept = 0, slope = 1)
## Object size print_object_size(qp_subset) ## Rendering speed print_rendering_time(print(qp_subset)) ## File output size print_file_size(qp_subset)
What makes a ggplot2 object really heavy, is that the dataset is passed on several times into the object.
## data structure #ggplot_build(qp_hex) %>% str() %>% sapply(dim) object.size(ggplot_build(qp_hex)) object.size(ggplot_build(qp_points)) ## grid object ## ggplot_gtable(ggplot_build(qp_hex)) %>% str() object.size(ggplot_gtable(ggplot_build(qp_points))) object.size(ggplot_gtable(ggplot_build(qp_hex)))
See also: https://github.com/mkanai/ggman/blob/master/R/scale-colour-dichromatic.R
See also: https://github.com/mkanai/ggman/blob/master/R/theme-publication.R
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.