knitr::opts_chunk$set(
  fig.width = 8,
  collapse = TRUE,
  comment = "#>"
)
library(ggplot2)
library(ggpower)
library(scales)
# library(tidyverse)
library(cowplot)

The log-normal distribution plays an important role in biology (and everywhere else: finances, engineering, etc.) due to the Central Limith Theorem: if many random effects are added together, the result converges to a normal distribution, if they are multiplied it converges to a log-normal distribution.

The meanlog argument

Docu says that meanlog and sdlog are the mean and standard deviation of the distribution on the log scale, and here we look at what that means.

The log of a normal

If x is log-normally distributed, log(x) is normal with mean meanlog and variance $sdlog^2$:

x <- rlnorm(100000, meanlog = 2, sdlog = 1.5)
hist(log(x), 100, freq = F)
points(seq(-4,8, length.out = 50), dnorm(seq(-4, 8, length.out = 50), mean = 2, sd = 1.5),
       type="l")

This is obvious from the Lognormal's definition and helpful if log-transforming your data is an option. Let's look at building intuition for the meanlog argument in Lognormal distribution itself, i.e. the untransformed data.

meanlog is the logged median value

I compute mean, median and mode for given meanlog (mu1) and sdlog (s1) and plot it into the density function.

mu1 = 2; s1  = 0.5
lnorm_stats <- c(mode   = exp(mu1 - s1*s1),  # see e.g. wikipedia
                 median = exp(mu1),
                 mean   = exp(mu1+s1*s1/2))
# logarithmic spaced points on x-axis are not crucial here, they become useful
# when we do histograms, see below
xseq <- pmax(1e-5, exp(seq(log(1), log(25), length.out = 1000)))
p <- ggplot() +
  geom_line(aes(xseq, dlnorm(xseq, mu1, s1))) +
  geom_vline(aes(xintercept = lnorm_stats,
                 color = names(lnorm_stats)), linetype = "dashed") +
  xlab("x") + ggtitle(paste0("dlnorm(x, meanlog=", mu1, ", sdlog=", s1, ")")) +
  scale_color_hue(name = "lognorm stats",
                  labels = c("mean(x) = exp(meanlog+sdlog^2/2)",
                             "median(x) = exp(meanlog)",
                             "mode(x) = exp(meanlog-sdlog^2)")) +
  theme(legend.justification = c(1, 0), legend.position = c(.9, .5))

p

We see that meanlog has the most direct relationship with the expected median of a Lognormally distributed random variable X. Put differently, meanlog is the mean of log(X), not the mean of X itself. Or simply: meanlog is X's logged median value.

I'm repeating this on purpose because confusion arises easily when we log-transform the x-axis:

p+scale_x_log10() + theme(legend.justification = c(1, 0), legend.position = c(.4, .5))

On the log-scale, x looks like a normal with mean $meanlog-sdlog^2$. Note the difference to the above histogram of $log(x)$, which follows a normal distribution with mean $meanlog$.

Histogram of two log-normals

Currently, this visualization is the one I find the most useful (see lhist function):

x <- rlnorm(100000, meanlog = c(-6,.4), sdlog = 1)
xseq <- pmax(1e-5, exp(seq(log(.9*min(x)), log(1.1*max(x)), length.out = 100)))

ggplot()+
  geom_histogram(data=data.frame(x=x), aes(x, stat(density)), breaks=xseq) +
  geom_line(aes(xseq, .5*dlnorm(xseq, -6, 1)), color="blue") +
  geom_line(aes(xseq, .5*dlnorm(xseq, .4, 1)), color="red") +
  coord_trans(x="log", y="sqrt")

MLE estimation

Wikipedia gives rather complicated and confusing estimators. I find the much simpler formulas:

In the following master thesis:

With experiments I find that this works pretty well, as long as sdlog is > .7. Not sure why, that's just what I observe.

x <- rlnorm(1000, meanlog = 1.2, sdlog = .7)
xseq <- pmax(1e-5, exp(seq(log(.9*min(x)), log(1.1*max(x)), length.out = 100)))

m_estim <- mean(log(x))
s_estim <- mean((log(x)-m_estim)^2)
ggplot()+
  geom_histogram(data=data.frame(x=x), aes(x, stat(density)), breaks=xseq) +
  geom_line(aes(xseq, dlnorm(xseq, m_estim, s_estim)), color="red") +
  coord_trans(x="log", y="sqrt")


FelixTheStudent/ggpower documentation built on March 5, 2020, 8:37 p.m.