In LXQin/PRECISION.seq: Comparison of Depth Normalization Methods

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

rm(list = ls())
library(precision.seq)
library(limma)

MicroRNA (miRNA) sequencing data were collected for two subtypes of soft tissue sarcoma: myxofibrosarcoma (MXF) and pleomorphic malignant fibrous histiocytoma (PMFH). 27 MXF samples and 27 PMFH samples were collected from newly diagnosed untreated tumors at Memorial Sloan Kettering Cancer Center between 2002 and 2012. These tumor samples were sequenced twice: once with uniform handling and balanced sample-to-library-assignment to minimize data artifacts due to experimental handling and avoid their confounding with sample group (that is, tumor subtype); and a second time without, which resulted in excessive artifacts due to handling. The first dataset can be used to assess microRNAs’ differential expression status, serving as a benchmark; the second dataset can be used to test normalization methods against the benchmark. These two datsets are named as data.benchmark and data.test in this package.

We show (1) how to examine the overall distribution of each dataset, (2) how to perform differential expression analysis for each data set and compare the results between the two, and (3) how to create additional paired data sets based on a novel data permutation algorithm and analyze them.

Benchmark Data Analysis

Overall Data Distribution

We examine the overall distribution of the benchmark data (as log2 transformed read counts) using sample-specific boxplots.

benchmark.log <- log2(data.benchmark + 1)
boxplot(benchmark.log, 
        col = ifelse(grepl("MXF", colnames(benchmark.log)),
                     rainbow(2)[1], rainbow(2)[2]),
        ylab = "Log2 Count", ylim = c(0, 20),
        xaxt = "n", outline = FALSE)
legend("topright",c("MXF", "PMFH"), bty = "n",
       pch = "x", cex = 1, col = c(rainbow(2)[1], rainbow(2)[2]))

Differential Expression Analysis

We assess evidence for differential expression in the benchmark data using the voom-limma method. The analysis results are displayed with the volcano plot and the scatter plot of group mean difference versus group mean average. Red dots indicate the microRNAs that are significantly differentially expressed ($p < 0.01$).

benchmark.voom <- DE.voom(RC = data.benchmark, groups = data.group, Pval = 0.01)
DE.bench <- benchmark.voom$id.list
benchmark.voom.dat <- data.frame(dm = benchmark.voom$log2.FC,
                                 p.value = benchmark.voom$p.val)
mask <- with(benchmark.voom.dat, p.value < .01)
cols <- ifelse(mask,"red", "black")
with(benchmark.voom.dat, plot(dm, -log10(p.value), cex = .5, pch = 16,
                          col = cols, xlim = c(-3.6, 3.6),
                          ylim = c(0, 6),
                          xlab = "Mean Difference: PMFH - MXF"))
abline(h = 2, lty = 2)

cols <- ifelse(rownames(benchmark.log) %in% DE.bench, "red", "black")
plot(rowMeans(benchmark.log),
     apply(benchmark.log[,grepl("MXF", colnames(benchmark.log))], 1, mean) -
       apply(benchmark.log[,grepl("PMFH", colnames(benchmark.log))], 1, mean),
     pch = 16, cex = 0.5, col = cols,
     xlab = "Group Mean Average of Log2 Count", 
     ylab = "Group Mean Difference of Log2 Count",
     main = "")

Test Data Analysis

Overall Data Distribution

We examine the overall distribution of the test data (as log2 transformed read counts) using sample-specific boxplots.

test.log <- log2(data.test + 1)
boxplot(test.log, 
        col = ifelse(grepl("MXF", colnames(benchmark.log)),
                     rainbow(2)[1], rainbow(2)[2]),
        xaxt = "n", outline = FALSE, ylim = c(0, 20))
legend("topleft",c("MXF", "PMFH"), bty = "n",
       pch = "x", cex = 1, col = c(rainbow(2)[1], rainbow(2)[2]))

Differential Expression Analysis

We assess evidence for differential expression in the test data using the voom-limma method (without depth normalization). The analysis results are displayed with the volcano plot and the scatter plot of group mean difference versus group mean average. Red dots indicate the microRNAs that are significantly differentially expressed ($p < 0.01$).

test.voom <- DE.voom(RC = data.test, groups = data.group, Pval = 0.01)
test.voom.dat <- data.frame(dm = test.voom$log2.FC,
                            p.value = test.voom$p.val)
mask <- with(test.voom.dat, p.value < .01)
cols <- ifelse(mask,"red", "black")
with(test.voom.dat, plot(dm, -log10(p.value), cex = .5, pch = 16,
                         ylim = c(0, 6), xlim = c(-3.6, 3.6),
                         col = cols, xlab = "Mean Difference: PMFH - MXF"))
abline(h = 2, lty = 2)

Comparison of Differential Expression Status Between Benchmark Data and Test Data

Significance of differential expression is claimed using a p-value cutoff of 0.01 for both the benchmark data and the test data. They are compared with each other using the Venn diagram.

pval.bench.test <- data.frame(cbind(bench.pval = benchmark.voom$p.val, 
                                    test.pval = test.voom$p.val[names(benchmark.voom$p.val)]))
attach(pval.bench.test)
bench.sig <- (bench.pval < 0.01)
test.sig <- (test.pval < 0.01)
venn2 <- cbind(bench.sig, test.sig)
vennDiagram(vennCounts(venn2), 
            names = c("Benchmark", "Test"), 
            cex = 1.5, counts.col = rainbow(1))

Comparison of Estimated Group Means Between Benchmark Data and Test Data

Scatterplot for the estimated group means for MXF and PMFH between benchmark data and test data.

benchmark.log.MXF.mean <- rowMeans(benchmark.log[, grepl("MXF", colnames(benchmark.log))])
plot(benchmark.log.MXF.mean,
     rowMeans(test.log[,grepl("MXF", colnames(test.log))]), pch = 16, cex = 0.5,
     xlab = "Group Mean Average of Log2 Count in Benchmark", 
     ylab = "Group Mean Average of Log2 Count in Test",
     main = "MXF", xlim = c(0, 20), ylim = c(0, 20))
abline(0,1)

benchmark.log.PMFH.mean <- rowMeans(benchmark.log[, grepl("PMFH", colnames(benchmark.log))])
plot(benchmark.log.PMFH.mean,
     rowMeans(test.log[,grepl("PMFH", colnames(test.log))]), pch = 16, cex = 0.5,
     xlab = "Group Mean Average of Log2 Count in Benchmark", 
     ylab = "Group Mean AVerage of Log2 Count in Test",
     main = "PMFH", xlim = c(0, 20), ylim = c(0, 20))
abline(0,1)

Simulation

We use simulated.data() function to generate additional pairs of data sets by permuting the paired data sets to achieve a pre-specified proportion and magnitude of differential expression. As an example, we show how to generate data sets with the proportion of differential expression around 0.02 (more specifically, in the range of 0.0175 and 0.0225) and the median of mean differences (for log2 counts) around 0 (more specifically, in the range of -0.5 and 0.5).

simulated <- simulated.data(proportion = c(0.0175, 0.0225), median = c(-0.5, 0.5), numsets = 100)

head(simulated[[1]]$simulated_benchmark)[,1:5]
head(simulated[[1]]$simulated_test)[,1:5]