In mstrazar/csDEX: Condition-specific differential exon expression.

The notebook reproduces the comparison of dispersion fitting in the csDEX article (Stražar & Curk, 2017), Supplementary Figure 4. For this experiment, csDEX and DEXSeq packages are required. A temporary directory is created in the current working directory.

require(csDEX)
require(DEXSeq)
data.dir = file.path(getwd(), "csdex-temp/")

The terminology csDEX package estimates precisions (inverse dispersions), to keep a consistent terminology for both count and Percent Spliced-in (PSI) models, based on negative binomial and Beta distributions, respectively. Since csDEX assumes sum as the default replicate aggregation function, the estimated dispersion is multiplied by the number of replicates for correct comparison.

dispersion.csdex <- function(obj, nreps=1){
  # obj: a list returned by the csDEX::generate function
  # nreps: number of replicates
  cdx = obj$data
  cdx = csDEX::estimateSizeFactors(cdx)
  cdx = csDEX::estimatePrecisions(cdx)
  disp = nreps / csDEX::rowData(cdx)$precision
  return(disp)
}

Estimating dispersion with DEXSeq (Anders et. al, 2012) is also straightforward. The count files are loaded from disk and a standard workflow is used to first estimate the size factors (differences in sample means due to sequencing depth) and the using the Cox-Reid dispersion estimate.

dispersion.dexseq <- function(obj){
    # obj: a list returned by the csDEX::generate function
    # nreps: number of replicates
    sampleData = obj$design
    countfiles = file.path(obj$data.dir, "data", paste0(sampleData$File.accession, ".txt"))
    dex = DEXSeqDataSetFromHTSeq(
      countfiles = countfiles,
      sampleData = sampleData)
    dex = DEXSeq::estimateSizeFactors(dex)
    dex = DEXSeq::estimateDispersions(dex)
    disp = dispersions(dex)
    return(disp)
}

Generate a dataset with a fixed vector of dispersions. The values are the same for each of the repeats, enabling to compute mean predictions and standard deviations. The DEXSeq dispersion fitting can fail in the case of a large number of zero counts.

conditions = 10
exons = 20
repeats = 30
replicates = 3
genes = 3

dispersions = rev(sort(rgamma(exons*genes, 1, 2.5)))

est.cdx = matrix(0, nrow=repeats, ncol=exons*genes)
est.dex = matrix(0, nrow=repeats, ncol=exons*genes)
for (r in 1:repeats){
  obj = generate(exons=exons, conditions=conditions, 
                   interacting=0, replicates=replicates, genes=genes, 
                   dispersions = dispersions,
                   type="count", data.dir=data.dir)

  est.cdx[r,] = dispersion.csdex(obj, replicates)
  est.dex[r,] = tryCatch(dispersion.dexseq(obj), 
                          error=function(e) rep(NA, exons))
}

Compute means and stadard deviations. Finally, compare the fits on a plot and compute root mean square errors. Observe the error decreasing with a larger number of experimental conditions.

  x = 1:(exons*genes)
  mean.cdx = colMeans(est.cdx)
  std.cdx = apply(est.cdx, 2, sd)
  down.cdx = mean.cdx + std.cdx
  up.cdx = mean.cdx - std.cdx
  rmse.cdx = sqrt(sum((mean.cdx - dispersions)^2))

  est.dex[est.dex > 3] = NA
  mean.dex = colMeans(est.dex, na.rm=TRUE)
  std.dex = apply(est.dex, 2, sd, na.rm=TRUE)
  down.dex = mean.dex + std.dex
  up.dex = mean.dex - std.dex
  rmse.dex = sqrt(sum((mean.dex - dispersions)^2))

  plot(dispersions, type="l", col="black", xlab="Exon", ylab="Dispersion",
       main = sprintf("Conditions: %d", conditions))
  op = par(cex = 0.8)
  lines(mean.cdx, col="orange")
  lines(mean.dex, col="blue")
  polygon(c(x, rev(x)), c(up.cdx, rev(down.cdx)), col=rgb(1,0,0,0.2), border=NA)
  polygon(c(x, rev(x)), c(up.dex, rev(down.dex)), col=rgb(0,0,1,0.2), border=NA)
  legend("topright", inset=.01, 
         c(sprintf("qCML (%.2f)", rmse.cdx), sprintf("Cox-Reid (%.2f)", rmse.dex)),   
         fill=c("orange", "blue"), bty="n")