knitr::opts_chunk$set(
  fig.width = 8,
  collapse = TRUE,
  comment = "#>"
)
library(ggpower)
library(scales)  # for trans_new
library(tidyverse)
library(cowplot)

Histograms and axis-transformations are a non-trivial combination, perhaps because people are not aware what histograms are, or because of the subtle difference between ggplot2's scale_x_log10 and coord_trans.

Bar height != number of observations

I used to think so too, but the height of each histogram bar does NOT indicate the number of observations in that bin. It is the bar's area that indicates that number. This has a beautiful consequence: the bar height is not an absolute number, but an estimate of density in the bin - histograms thus show estimates of the probability density irrespective of binwidth.

Why coord_trans is better than scale_x_continuous

tl;dr (too long, didn't read):

It is not uncommon to have the histogram of some variable x and the (allegedly) underlying theoretical distribution in one plot. On the linear scale, this is done easily by combining geom_histogram and geom_line. Note how the code creating xseq seems a bit complicated and for geom_line we could use a linear sequence as well. For geom_histogram, however, using exponential, non-zero breaks in the histogram is required for using coord_trans(trans="log"), so we do it already here.

# generate data from two log-normal distributions
dat <- data.frame(class=c("a","b"),
                  obs = rlnorm(1000, meanlog = c(0, 2), sdlog = c(.5, .5)))
# xseq: exponentially spaced and avoiding zero:
xseq <- pmax(1e-5, exp(seq(log(.9*min(dat$obs)), log(1.1*max(dat$obs)), length.out = 100)))
# empirical histogram and theoretical density in same plot:
p <- ggplot() +
  geom_histogram(data=dat, aes(obs, stat(density)), breaks = xseq)+
  geom_line(aes(xseq, .5*dlnorm(xseq, 0, .5))) + # half of dat is from this distribution
  geom_line(aes(xseq, .5*dlnorm(xseq, 2, .5)))   #  other half from herep
p

We see two log-normal distributions, and it would make sense to display them on the log-scale:

plot_grid(p + ggtitle("linear"),
          p+scale_x_continuous(trans = "log")+ggtitle("scale_x_continuous"),
          p+coord_trans(x="log")+ggtitle("coord_trans"), ncol=3 )

Evidently, log-transforming with scale_x_log10 leads to discrepancies between the histogram (empirical density) and the pdf (theoretical density). The help of coord_trans tells us in the Examples the reason: scale_x_* transform BEFORE statistics and coord_trans after. In our example, this means that scale_x creates a histogram of log(x), which has a different density function than x. Coord_trans, on the other hand, does what we want because it first computes the histogram and than transforms the coordinate system.

Note on exponentially spaced, non-zero breaks

We created xseq above somewhat complicated so that coord_trans works and here is the reason.

Log-transforming histograms is tricky because the left boundary of the left-most bin can be a negative number, which causes errors with coord_trans (while scale_x_continuous is fine). One could use trans_new to modify log_trans (e.g. replacing 0s with a small non-zero number), but I found it more useful to use geom_histogram's breaks argument to define the bin borders, starting from a small positive and non-zero number. While we are at it, we can make those breaks exponentially-spaced, as linear spacing leads to very broad left bars, masking information. I do this above with xseq, which is then passed to geom_histogram as breaks.

Using lhist makes it easy

Let's do the plot from above again, but this time with nicer breaks. We will also sqrt-transform the y-axis to visualize both distributions in spite of their drastic differences in maximum density:

p+coord_trans(x="log", y = "sqrt") + scale_x_continuous(breaks = function(lims) power_breaks(lims, .5, 100), labels = semi_scientific_formatting)

For quick plotting, there's the wrapper function lhist:

dat <- data.frame(class=c("a","b"),
                  obs = rlnorm(10000, meanlog = c(0, 2), sdlog = c(.5, .5)) )

lhist(dat$obs) + 
  geom_line(aes(dat$obs, .5*dlnorm(dat$obs, 0, .5))) +
  geom_line(aes(dat$obs, .5*dlnorm(dat$obs, 2, .5))) +
  geom_line(aes(dat$obs, .5*dlnorm(dat$obs, 2, .5) + .5*dlnorm(dat$obs, 0, .5)),
            linetype = "dashed", color = "red")

Using scale_x_log10 and geom_density

It is hacky, but if you replace geom_line with geom_density you can use scale_x_log10 after all:

dat <- data.frame(class=c("a","b"),
                  obs = rlnorm(10000, meanlog = c(0, 2), sdlog = c(.5, .5)),
                  model=rlnorm(10000, meanlog = c(0, 2), sdlog = c(.5, .5)), 
                  modelA=rlnorm(10000, meanlog = 0, sdlog = c(.5, .5)), 
                  modelB=rlnorm(10000, meanlog = 2, sdlog = c(.5, .5)) 
                  )

ggplot(data = dat)+
  geom_histogram(aes(obs, stat(density)), bins=100) + 
  geom_density(aes(model), color = "red") +
  geom_density(aes(modelA, .5*stat(density))) +
  geom_density(aes(modelB, .5*stat(density))) +
  scale_x_log10() 

This is not perfect yet: note how the red line is supposed to be the sum of the two black lines and clearly isn't. Probably because of the density estimation being off.



FelixTheStudent/ggpower documentation built on March 5, 2020, 8:37 p.m.