In marcpaterno/rcosmosis: R functions to support CosmoSIS

knitr::opts_chunk$set(echo = TRUE, fig.width = 5)
library(tidyverse)
library(rcosmosis)

Reading data

The function read.metropolis.hastings is the way to read the CosmoSIS output generated by the Metropolis-Hastings sampler. Because the MH sampler can run multiple chains, each of which results in a separate file, it will be common to specify a file glob to identify the set of files that are to be read. The result is a single data frame (more specifically, a tibble) containing one row for each sample. The data frame contains a column for each parameter being varied by the sampler, and also containing a column chain, which indicates the identify of the chain that generated the sample, and a column sample, which indicates a sample number within that chain.

While CosmoSIS writes chain output in uncompressed files, rcosmosis also supports reading compressed files. xz compression is especially effective in reducing the size of the chain output files.

root <- rprojroot::is_r_package
samples <- read.metropolis.hastings(root$find_file("inst/extdata/metropolis-hastings/chain_metro_*.txt.xz"))
head(samples) |> knitr::kable()

nchains <- max(samples$chain)
nsamples.per.chain <- max(samples$sample)
nsamples.total <- nrow(samples)

We have r nchains chains, each with at most r nsamples.per.chain samples, for a total of r nsamples.total samples.

Have the chains converged?

Graphical evaluation

nburn <- 6000
nthin <- 50

Each chain takes some time to achieve convergence; this period is often called the burn-in period. One way to evaluate convergence of the chains is to look at how they progress as a function of sample number; this is often called a trace for the chain. We will do this for the first r nburn values from each chain.

One thing we want is see is mixing. It is important that the different chains do not produce traces that stay separated; they should "forget" their initial values, and produce traces that eventually are well-mixed.

Since there are too many chains to visualize this, we look only at the first 8 chains. We also sample the chains sparsely, showing only 1 sample in r nthin:

alpha = 0.3
filter(samples, sample < nburn, sample %% nthin == 0, chain %in% 1:8) |>
  ggplot(aes(sample, omega_m, color = factor(chain))) +
  geom_path(alpha = alpha, show.legend = FALSE) +
  ylab(latex2exp::TeX("$\\Omega_m$"))

filter(samples, sample < nburn, sample %% nthin == 0, chain %in% 1:8) |>
  ggplot(aes(sample, sigma8_input, color = factor(chain))) +
  geom_path(alpha = alpha, show.legend = FALSE) +
  ylab(latex2exp::TeX("$\\sigma_8$"))

filter(samples, sample < nburn, sample %% nthin == 0, chain %in% 1:8) |>
  ggplot(aes(sample, concentration, color = factor(chain))) +
  geom_path(alpha = alpha, show.legend = FALSE)

We can also look at how the marginal density for each parameter converges as we go through fixed-size batches of samples. We'll group the data by 100-sample batches (for each chain). As above, we need to re-organize the samples dataframe, but this time we don't thin out the samples, and we also assign samples to batches.

batches <- samples |>
  #filter(sample < 3000) |>
  mutate(batch = sample %/% 100) |>
  pivot_longer(-c(sample, chain, batch),
               names_to = "param",
               values_to = "val")
maxbatches <- max(batches$batch)

We have r maxbatches+1 batches. We can then histogram the posterior densities, and observe how they change from batch to batch. An animation would show this well, but they are prohibitively expensive to generate in vignettes. Here, we will look at the first batch and the last batch.

filter(batches, batch == 0) |>
  ggplot(aes(val)) +
  geom_histogram(bins = 100) +
  geom_histogram(aes(y = ..density..), bins = 100) +
  facet_wrap(vars(param), scales = "free", ncol = 1) +
  labs(title = "Batch 0")

filter(batches, batch == 100) |>
  ggplot(aes(val)) +
  geom_histogram(bins = 100) +
  geom_histogram(aes(y = ..density..), bins = 100) +
  facet_wrap(vars(param), scales = "free", ncol = 1) +
  labs(title = "Batch 100")

The plots for the batch 0 show that the distribution of generated samples is dominated by the priors and the initial distribution of starting points of the chains. The plots for batch 100 show that by this time in the sampling the influence of the data is seen.

There are also commonly-used diagnostic statistics to assess convergence of chains. Several of these are implemented in R, in the package coda. We demonstrate some of their uses here.

Conversion to `coda` objects

The package coda makes use of the classes mcmc and mcmc.list. The dataframe returned by read.metropolis.hastings can be converted into an mcmc.list object. But first, let's look at a quirk of the data with which we have been working.

group_by(samples, chain) |>    # group data by chain
  summarize(nsamples = n(),     # for each chain, count number of samples
            .groups = "drop") |> 
  group_by(nsamples) |>        # group that data number of samples
  summarize(nchains = n(),      # count how many groups had each number of samples
            .groups = "drop") |>
  knitr::kable()

Two of the chains have more samples than the others. It turns out that the CosmoSIS run we have been analyzing was terminated externally, rather than running to completion. As we have seen, the read.metropolis.hastings function can deal with this. The mcmc.list object must represent chains all of the same length. When we convert a dataframe that contains unequal numbers of samples for its chains, all chains are truncated to the length of the shortest chain.

samples.mcmc <- mcmc.list.from.metropolis.hastings(samples)

mcmc.list objects have many associated functions; see the help for coda for a list of them all. A few can be used to sanity-check the conversion.

coda::nchain(samples.mcmc)  # number of chains
coda::niter(samples.mcmc)   # number of samples per chain
coda::nvar(samples.mcmc)    # number of variables being sampled

Many of the coda functions seem to struggle with long chains. The function window allows us to sub-sample (or thin) the chains, without losing too much statistical power:

coda::effectiveSize(samples.mcmc)
samples.thinned <- window(samples.mcmc, thin = 40)
coda::effectiveSize(samples.thinned)

Plotting the thinned sample is approximately 20 times faster than plotting the full sample.

par(mar = rep(2, 4))
plot(samples.thinned)

Gelman-Rubin statistic

The Gelman-Rubin statistic is a "potential scale reduction factor". Approximate convergence is diagnosed with the upper limit is close to 1. The function gelman.plot shows the evolution of the statistic as a function of sample (called iteration here).

par(mar = rep(2, 4))
coda::gelman.plot(samples.thinned, transform = TRUE, autoburnin = FALSE)

Posterior densities

For all our analysis of posterior distributions, we'll drop the first r nburn elements of each chain.

1-d marginal densities

We use a combined histogram and kernel density plot to show the 1-d marginal distributions.

To make these plots, we first transform the data from a form that has one row per sample per chain, to one that has one row per sample per chain per physical parameter. This reshaping allows us to condition the plot on the name of the physical parameter, and thus to have a single graphic to plot.

The transformation of the data is done using tidyr::pivot_longer.

x <-
  samples |>
  filter(sample > nburn) |>
  pivot_longer(-c(sample,chain), names_to = "param", values_to = "val")

x |>
  ggplot(aes(val)) +
  geom_histogram(aes(y = ..density..), bins = 50) + 
  geom_density(color = "red") +
  facet_wrap(vars(param), scales = "free", ncol = 1)

The posterior for concentration has a long high-end tail, and so a log scale in x can be useful.

x |>
  filter(param == "concentration") |>
  ggplot(aes(val)) +
  geom_histogram(aes(y = ..density..), bins = 50) +
  geom_density(color = "red") +
  xlab("concentration") +
  scale_x_log10()

We can see that these data and this model has little power to constrain the parameter concentration.

2-d marginal densities

We are using hexbin plots to show 2-d posterior densities because this produces few, and easy to understand, artifacts in the plotting. We use a log scale for the density because of wide range of density. We also plot the contours produced by a 2D kernel density estimator, for comparison with the hexbin results.

samples |>
  filter(sample > nburn) |>
  plot_density_2d(omega_m, sigma8_input) +
    labs(x=latex2exp::TeX("$\\Omega_m$"),
         y=latex2exp::TeX("$\\sigma_8$"))

samples |>
  filter(sample > nburn) |>
  plot_density_2d(omega_m, concentration) +
    scale_y_log10()

samples |>
  filter(sample > nburn) |>
  plot_density_2d(sigma8_input, concentration) +
    scale_y_log10()

Statistical summaries

samples |>
  select(-c(chain, sample)) |>
  pivot_longer(everything(), names_to = "param", values_to = "val") |>
  group_by(param) |>
  summarize(mean = mean(val),
            median = median(val),
            sd = sd(val),
            mad = mad(val),
            "5%" = quantile(val, 0.05),
            "95%" = quantile(val, 0.95),
            .groups = "drop") |>
  knitr::kable()