Compositional Data

Compositional data arise in statistics when the quantities of interest are the parts or constituents of some discrete sample space. A typical example is encountered in transcription factor binding site (TFBS) models in genetics, where data are often reported as positional frequencies of the four bases $A$, $C$, $G$ and $T$ at each position of a TF binding site. In this case, knowing the frequency or proportion of the base $A$ at a specific position is informative only when viewed relative to the frequencies or proportions of the other bases $C$, $G$ and $T$. This relative nature is the defining characteristic of compositional data. Another example is protein sequence data, where each position in a protein sequence is a composition of 20 amino acids.

Shrinkage in compositional data

Why is shrinkage important for compositional data in the TFBS context? Suppose the compositional count data at a specific position of a TFBS is

$$ (A, C, G, T) : = (6, 1, 2, 1) $$

and in another case it is

$$ (A, C, G, T) : = (600, 100, 200, 100) $$

Usually the background probability for the site is assumed to be equal, or close to equal, for the four bases. The PWM of estimated proportions of A, C, G and T is the same in both scenarios above. However, in the second case the proportion estimates are drawn from 100 times more data than in the first; the estimates in the first case are based on just 10 sites.

In such a scenario, one would want to shrink the estimated proportions towards $(0.25, 0.25, 0.25, 0.25)$ more strongly in the first case than in the second, since the force of the data is strong in the second case and much weaker in the first.
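As a point of reference, consider what a single fixed Dirichlet prior would do here. The pseudo-count a below is an arbitrary illustrative choice, not part of dash; having to pick it by hand is exactly what dash tries to avoid.

# posterior mean under a fixed symmetric Dirichlet(a, a, a, a) prior
a <- 1
c1 <- c(6, 1, 2, 1)
c2 <- c(600, 100, 200, 100)
(c1 + a) / (sum(c1) + 4 * a)   # pulled noticeably towards (0.25, 0.25, 0.25, 0.25)
(c2 + a) / (sum(c2) + 4 * a)   # nearly unchanged from (0.6, 0.1, 0.2, 0.1)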

But the big question is: what should the amount of shrinkage be in each case? Can we provide an adaptive approach that automatically learns the amount of shrinkage, scaling it with the amount of data available?

The answer to this question lies in our new approach called dash.

Dirichlet adaptive shrinkage (dash)

We define dash for generic compositional data, not restricted to TFBS or protein sequence data. Assume there are $L$ constituents in the compositional mix: $L = 4$ (corresponding to the A, C, G and T bases) for the transcription factor data and $L = 20$ (corresponding to the amino acids) for the protein sequence data.

Say we have observed the counts of the $L$ constituent categories for $n$ compositional samples. Here $n$ represents, for example, the number of binding site positions for a specific TF, or the number of TFs; $n$ therefore corresponds to the number of i.i.d. compositional samples available to us.

We model these compositional count vectors as follows:

$$ (c_{n1}, c_{n2}, \cdots, c_{nL}) \sim Mult \left ( c_{n+} ; p_{n1}, p_{n2}, \cdots, p_{nL} \right ) $$

where $c_{n+} = \sum_{l=1}^{L} c_{nl}$ is the total frequency of the different constituents observed for the $n$ th sample. $p_{nl}$ here represents the compositional probabilities, which satisfy

$$ p_{nl} \geq 0 \hspace{1cm} \sum_{l=1}^{L} p_{nl} = 1 $$
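For concreteness, a single such compositional sample can be simulated directly; the probabilities below are arbitrary illustrative values.

set.seed(1)
# one compositional sample with L = 4 categories and total count c_{n+} = 10
p_n <- c(0.6, 0.1, 0.2, 0.1)
c_n <- rmultinom(1, size = 10, prob = p_n)
t(c_n)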

Now we assume a prior distribution on the vector of compositional probabilities $(p_{n1}, p_{n2}, \cdots, p_{nL})$. A typical choice that maintains conjugacy of the posterior distribution is a Dirichlet prior. However, the choice of the concentration parameter of the Dirichlet prior can greatly impact the posterior, and hence the amount of shrinkage. To bypass this problem, we choose a mixture of known Dirichlet priors, each having the same mean as the background but with varying amounts of concentration, along with unknown mixture proportions to be estimated from the data.

$$ \left ( p_{n1}, p_{n2}, \cdots, p_{nL} \right ) \sim \sum_{k=1}^{K} \pi_{k} \, Dir \left (\alpha_{k} \mu_{k1}, \alpha_{k} \mu_{k2}, \cdots, \alpha_{k} \mu_{kL} \right ) \hspace{1cm} \alpha_{k} > 0 \hspace{1cm} \sum_{l=1}^{L} \mu_{kl} = 1 $$

We assume a Dirichlet prior on the mixture proportions $\pi = \left( \pi_{1}, \cdots, \pi_{K} \right)$:

$$ f(\pi) \propto \prod_{k=1}^{K} {\pi_{k}}^{\lambda_{k}-1} $$

Such a prior is similar to the ash prior introduced by Stephens (2016) for modeling false discovery rates in normal data, and we call it the dash prior.

In this specification of the dash prior, all Dirichlet components have the same mean. One can, however, add components with other means, some corresponding to corners of the simplex like $(1, 0, \cdots, 0)$, in which case the prior has multiple modes. For now we focus on the unimodal version of the dash prior; all the following calculations go through unchanged for the multimodal versions as well.
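To make the prior concrete, the sketch below draws one compositional probability vector from the unimodal dash prior using the gamma representation of the Dirichlet; the grids of alpha and pi here are illustrative values, not the package defaults.

# draw from sum_k pi_k Dir(alpha_k * mu): pick a component, then a Dirichlet draw
rdash_prior <- function(mu, alpha, pi) {
  k <- sample(length(alpha), 1, prob = pi)        # choose a mixture component
  g <- rgamma(length(mu), shape = alpha[k] * mu)  # gamma representation of Dirichlet
  g / sum(g)
}
mu <- rep(0.25, 4)              # common (background) mean of all components
alpha <- c(100, 10, 1, 0.1)     # varying concentrations around that mean
pi <- c(0.4, 0.3, 0.2, 0.1)
rdash_prior(mu, alpha, pi)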

Model estimation and output

The marginal distribution of the counts is given by

$$ f (c_{n} | \mu, \alpha) = \int f(c_{n}| p_{n}) \, f (p_{n} | \mu, \alpha) \, d p_{n} $$

Let

$$ l_{nk} : = \frac{ c_{n+} ! \, \Gamma (\delta_{0k})}{ \Gamma (c_{n+} + \delta_{0k})} \prod_{l=1}^{L} \frac{\Gamma (c_{nl} + \alpha_{k}\mu_{kl})}{c_{nl} ! \, \Gamma (\alpha_{k}\mu_{kl})} \hspace{1cm} \text{where} \hspace{1cm} \delta_{0k} = \alpha_{k} \sum_{l=1}^{L} \mu_{kl} $$
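In practice $l_{nk}$ is evaluated on the log scale to avoid overflow. The function below is a direct transcription of the formula above, given as an illustration rather than the package's internal implementation.

# log l_{nk} for one sample c_n and one component (alpha_k, mu_k), via lgamma
log_lnk <- function(c_n, alpha_k, mu_k) {
  delta0 <- alpha_k * sum(mu_k)
  lgamma(sum(c_n) + 1) + lgamma(delta0) - lgamma(sum(c_n) + delta0) +
    sum(lgamma(c_n + alpha_k * mu_k) - lgamma(c_n + 1) - lgamma(alpha_k * mu_k))
}
log_lnk(c(6, 1, 2, 1), alpha_k = 10, mu_k = rep(0.25, 4))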

$$ f(c_{n} | \mu, \alpha) = \sum_{k=1}^{K} \pi_{k} l_{nk} $$

We then use an EM algorithm or convex programming to estimate the mixture proportions $\pi_{k}$, which is equivalent to maximizing the penalized log-likelihood

$$ \log L (\pi) + \log h (\pi) = \sum_{n} \log \left (\sum_{k=1}^{K} \pi_{k} l_{nk} \right) + \sum_{k} (\lambda_{k} - 1) \log \pi_{k} $$

Once we have estimated $\pi$ by maximizing this objective, we define the posterior weight of sample $n$ in mixture component $k$ to be

$$ \omega_{nk} : = \frac{\hat{\pi}_{k} \, l_{nk}}{\sum_{k'} \hat{\pi}_{k'} \, l_{nk'}} $$
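For intuition, here is a bare-bones EM sketch for $\pi$, assuming a precomputed $n \times K$ likelihood matrix lik with entries $l_{nk}$ and prior counts lambda as above; it illustrates the updates and is not the package's optimizer.

# E-step: omega_{nk} is proportional to pi_k * l_{nk};
# M-step: pi_k is proportional to sum_n omega_{nk} + lambda_k - 1
fit_pi <- function(lik, lambda, n_iter = 200) {
  K <- ncol(lik)
  pi <- rep(1 / K, K)
  for (it in 1:n_iter) {
    w <- sweep(lik, 2, pi, `*`)
    omega <- w / rowSums(w)
    pi <- colSums(omega) + lambda - 1
    pi <- pi / sum(pi)
  }
  list(pi = pi, omega = omega)
}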

The posterior can be computed similarly as

$$ f(p_{n} | \hat{\pi}, c_{n}) : = \sum_{k=1}^{K} \omega_{nk} f_{k} (p_{n} | c_{n}) $$

where $f_{k} (p_{n} | c_{n})$ is the posterior component whose prior is the $k$ th component of the dash prior. This posterior component is equal to

$$ f_{k} (p_{n} | c_{n}) : = Dir \left ( c_{n1} + \alpha_{k}\mu_{k1}, c_{n2} + \alpha_{k}\mu_{k2}, \cdots, c_{nL} + \alpha_{k}\mu_{kL} \right) $$

The posterior mean therefore equals

$$ E(p_{n} | \hat{\pi}, c_{n}) := \sum_{k=1}^{K} \omega_{nk} \, \frac{c_{n} + \alpha_{k}\mu_{k}}{\sum_{l=1}^{L} (c_{nl} + \alpha_{k} \mu_{kl})} $$
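Computationally, the posterior mean is just a weighted average of the component posterior means. A minimal sketch, with mu stored as a $K \times L$ matrix of component means (the names here are illustrative):

# posterior mean of p_n: mix the component posterior means with omega_{nk};
# the alpha = Inf component is a point mass, so its posterior mean is mu[k, ]
post_mean <- function(c_n, omega_n, alpha, mu) {
  comp_means <- sapply(seq_along(alpha), function(k) {
    if (is.infinite(alpha[k])) mu[k, ]
    else (c_n + alpha[k] * mu[k, ]) / (sum(c_n) + alpha[k])
  })
  drop(comp_means %*% omega_n)
}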

To get an idea of how concentrated a sample is around the prior mean, we compute $\omega_{n1}$ - the posterior probability that the sample comes from the first component - where the first component corresponds to $\alpha_{k} = \infty$.

We can also find the corner posterior probability of each sample by summing the posterior probabilities of the components with concentration parameter less than 1. Similarly, the center posterior probability of each sample is the sum of the posterior probabilities of the components with very high concentration parameter (say, greater than 50).
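Given the full $n \times K$ matrix of posterior weights (omega below) and the concentration grid alpha, these summaries are simple row sums; the names are assumed for illustration.

# corner and center posterior probabilities from the posterior weights
corner_prob <- rowSums(omega[, alpha < 1, drop = FALSE])    # diffuse components
center_prob <- rowSums(omega[, alpha > 50, drop = FALSE])   # concentrated components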

Configuration of the Model

One pertinent issue is how to choose $K$. In general, $K$ can be chosen to be as large as desired, but adding components beyond a point is futile and computationally expensive.

We choose the default set of $\alpha_{k}$ to be $(\infty, 100, 50, 20, 10, 2, 1, 0.1, 0.01)$. Here $\alpha_{k} = \infty$ corresponds essentially to a point mass at the prior mean, and the subsequent choices of $\alpha_{k}$ have decreasing degrees of concentration. $\alpha_{k} = 1$ corresponds to the uniform distribution over the simplex, whereas $\alpha_{k} < 1$ corresponds to cases with probability mass at the edges of the simplex but with mean still at the prior mean. These components help direct points close to the corners towards the corners and away from the center, resulting in a clearer separation between points close to the prior mean and points away from it. We choose the prior amounts of shrinkage of the $\pi_{k}$, namely $\lambda_{k}$, to be $\left( 10, 1, 1, 1, \cdots, 1 \right )$.
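In R, this default configuration can be written down directly; the variable names below are illustrative, not necessarily those used inside the package.

alpha  <- c(Inf, 100, 50, 20, 10, 2, 1, 0.1, 0.01)   # concentration grid
lambda <- c(10, rep(1, length(alpha) - 1))           # prior counts on pi_k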

For $\alpha_{k} = \infty$, we use the Stirling formula (ref), under the assumption that $\alpha_{k} \rightarrow \infty$, to approximate the Gamma functions. For the other components, we use the LaplacesDemon R package to evaluate the Gamma functions on the log scale.
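To see why the $\alpha_{k} = \infty$ component is tractable: as $\alpha_{k} \rightarrow \infty$ the Dirichlet component collapses to a point mass at $\mu_{k}$, so $l_{nk}$ reduces to a multinomial likelihood evaluated at $\mu_{k}$. A quick numerical check of this limit, reusing the log_lnk sketch from above:

# the Dirichlet-multinomial likelihood approaches the plain multinomial as alpha grows
c_n <- c(6, 1, 2, 1)
mu_k <- rep(0.25, 4)
log_lnk(c_n, alpha_k = 1e8, mu_k = mu_k)   # large finite alpha
dmultinom(c_n, prob = mu_k, log = TRUE)    # the alpha = Inf limit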

Example Applications of dash

Position Weight Matrix

We first provide an example application of dash: shrinking the base compositional probabilities at a Transcription Factor Binding Site (TFBS) for a particular transcription factor.

devtools::install_github("kkdey/Logolas")
devtools::install_github("kkdey/dash")
devtools::install_github("kkdey/ecostructure")
library(Logolas)
library(grid)
library(gridBase)
library(ecostructure)
library(ggplot2)
library(Biobase)
library(dash)

We provide a simulation example of a position frequency matrix (PFM).

# simulated position frequency matrix (PFM): rows are bases, columns are positions
xmat <- cbind(c(5, 0, 2, 0),
              c(1, 1, 0, 1),
              c(100, 100, 50, 100),
              c(20, 50, 100, 10),
              c(10, 10, 200, 20),
              c(50, 54, 58, 53),
              c(1, 1, 1, 3),
              c(2, 4, 1, 1))
rownames(xmat) <- c("A", "C", "G", "T")
colnames(xmat) <- paste0("pos-", 1:ncol(xmat))
# column-normalize the counts to get the PWM of estimated proportions
xmat_norm <- apply(xmat, 2, function(x) return(x/sum(x)))

xmat

We fit the Dirichlet adaptive shrinkage (dash) model to the position frequency matrix generated above.

out <- dash(xmat, optmethod = "mixEM", verbose=FALSE, bf=TRUE)

We present logo plot representations of the PWM obtained by normalizing the sample PFM, and of the PWM after applying dash. We use the R package Logolas to visually represent the logos.

grid.newpage()
# set up a 1 x 2 viewport layout to place the two logo plots side by side
layout.rows <- 1
layout.cols <- 2
top.vp <- viewport(layout=grid.layout(layout.rows, layout.cols,
                                      widths=unit(rep(6,layout.cols), rep("null", 2)),
                                      heights=unit(c(20,50), rep("lines", 2))))

plot_reg <- vpList()
l <- 1
for(i in 1:layout.rows){
  for(j in 1:layout.cols){
    plot_reg[[l]] <- viewport(layout.pos.col = j, layout.pos.row = i, name = paste0("plotlogo", l))
    l <- l+1
  }
}


plot_tree <- vpTree(top.vp, plot_reg)

# one color per base (row), shared by the two logo plots
color_profile = list("type" = "per_row", 
                     "col" = RColorBrewer::brewer.pal(4,name ="Spectral"))

pushViewport(plot_tree)
seekViewport(paste0("plotlogo", 1))
# logo plot of the PWM before applying dash
logomaker(xmat_norm,color_profile = color_profile,
          frame_width = 1,
          pop_name = "pre dash PWM",
          newpage = FALSE)

seekViewport(paste0('plotlogo',2))
# logo plot of the dash posterior mean PWM
logomaker(out$posmean,color_profile = color_profile,
          frame_width = 1,
          pop_name = "post dash PWM",
          newpage = FALSE)

Ecological Abundance application

We look at bird taxonomic count data collected from across 38 Himalayan grid spots for 304 bird species. The data can be loaded from the R package ecostructure.

# load the ExpressionSet and extract the sites x species count matrix
data <- get(load(system.file("extdata", "HimalayanBirdsData.rda",
                             package = "ecostructure")))
taxonomic_counts <- t(exprs(data))
# mean count of each species across sites, used as the prior mode below
m1 <- colMeans(taxonomic_counts)

We look at the total number of birds recorded per site.

rowSums(taxonomic_counts)

We note that some sites have very low total counts. Also, the number of observed species is in most cases much smaller than the total number of possible species investigated (304). This calls for shrinkage of the compositional probabilities and motivates the use of dash.

# fit dash with a pseudo-count of 1 and the mean species composition as the mode
system.time(out <- dash(comp_data = t(taxonomic_counts+1),
                        optmethod = "mixEM",
                        mode = m1,
                        bf = TRUE,
                        verbose=TRUE))
# sites with lower total counts should put more weight on the concentrated components
plot(out$posterior_weights[,1], log(rowSums(taxonomic_counts+1)), pch=20, col="red",
     xlab = "posterior weight on first component",
     ylab = "log total bird count in grid")

Application of dash on a sparse matrix

We apply dash on a sparse matrix with density $0.3$, such that $30\%$ of the entries are filled with 1 and the rest are filled with 0. This corresponds to very weak data, with every non-zero entry equal to 1; under a uniform background, we would ideally expect the compositional probabilities to be shrunk to the prior. We validate that with the example below.

# simulate a 500 x 100 sparse matrix and binarize its non-zero entries
mat <- Matrix::rsparsematrix(nrow = 500, ncol = 100, density = 0.3)
mat2 <- as.matrix(mat)
mat2[mat2 != 0] = 1
# background mean across the 100 categories
m1 <- colMeans(as.matrix(mat2))
system.time(out <- dash(comp_data = t(mat2),
                        optmethod = "mixEM",
                        mode = m1,
                        bf = TRUE,
                        verbose=TRUE))

We find that all 500 samples are shrunk to the prior mean, as the check below shows.

# fraction of samples with center posterior probability above 0.99
length(which(out$center_prob_local > 0.99))/dim(mat2)[1]

Notes

Some notes on the Dirichlet adaptive shrinkage (dash) method are as follows.


