```r
knitr::opts_chunk$set(echo = TRUE)
```
This is an analysis of the yeast data with 48 biological replicates in each of two conditions (analyzed in Schurch et al. 2016). We chose this experiment because of the large number of biological replicates, which allows us to (1) implement null comparisons on random subsets of samples within one condition, and (2) start with a null comparison and add artificial differences to a subset of genes to define 'true positives'.
We will simulate both a strongly informative and a weakly informative independent covariate. We will also investigate the effect of an uninformative covariate by generating a covariate at random.
The processed count table is made available by the authors in the GitHub repository accompanying their paper.
In this Rmd we carry out simulations with multiple replications and plot results averaged over the replications. Effect sizes (log2 fold changes) are drawn from the fold changes observed when comparing the two conditions. Since this results in a distribution of effect sizes and test statistics under the alternative with a mode at zero, the assumptions of FDRreg-empirical are violated. However, since this assumption is impossible to check outside of a simulation setting, we still include the method in the benchmark comparisons.
In this setting, we assume 2000 DE genes and a unimodal alternative.
```r
# Load packages and source benchmark FDR
library(SummarizedBenchmark)
library(data.table)
library(tidyr)
library(dplyr)
library(readr)
library(ggplot2)
library(magrittr)
library(cowplot)
library(purrr)
library(DESeq2)
library(tibble)
library(ggthemes)
library(R.utils)

## load helper functions
for (f in list.files("../R", "\\.(r|R)$", full.names = TRUE)) {
    source(f)
}

# set up data / results directories
datdir <- "yeast-data"
resdir <- "../../results/RNAseq"
dir.create(datdir, showWarnings = FALSE)
dir.create(resdir, showWarnings = FALSE)

# results files that will be generated
# null
resfile_n5 <- file.path(resdir, "yeast-results-null5.rds")
resfile_n10 <- file.path(resdir, "yeast-results-null10.rds")

# strong covariate
resfile_d5 <- file.path(resdir, "yeast-results-de5.rds")
resfile_d10 <- file.path(resdir, "yeast-results-de10.rds")

# weak covariate
resfile_d5_w <- file.path(resdir, "yeastW-results-de5.rds")
resfile_d10_w <- file.path(resdir, "yeastW-results-de10.rds")

# uninformative covariate
resfile_d5_uninfCov <- file.path(resdir, "yeast-results-de5-uninfCov.rds")
resfile_d10_uninfCov <- file.path(resdir, "yeast-results-de10-uninfCov.rds")

# set up parallel backend
library(parallel)
nCores <- 20
```
First, we download the processed count data from GitHub. There is one file for the Snf2 condition and one for the wild-type condition. The Snf2 condition is a yeast strain in which the transcriptional regulator gene Snf2 has been knocked out. Each of these files is a compressed tar.gz archive containing a single count file per replicate, which we'll extract into the subdirectory referenced by `datdir`.
download.file(url = "https://github.com/bartongroup/profDGE48/raw/master/Preprocessed_data/Snf2_countdata.tar.gz", destfile = file.path(datdir, "Snf2_countdata.tar.gz")) download.file(url = "https://github.com/bartongroup/profDGE48/raw/master/Preprocessed_data/WT_countdata.tar.gz", destfile = file.path(datdir, "WT_countdata.tar.gz")) gunzip(file.path(datdir, "Snf2_countdata.tar.gz")) gunzip(file.path(datdir, "WT_countdata.tar.gz")) untar(file.path(datdir, "Snf2_countdata.tar"), exdir = datdir) untar(file.path(datdir, "WT_countdata.tar"), exdir = datdir) file.remove(file.path(datdir, "Snf2_countdata.tar"), file.path(datdir, "WT_countdata.tar"))
Each of the data files contains two columns: the gene/feature name and the count value.
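As a quick sanity check, we can peek at the first few rows of a single replicate's file (a minimal sketch, assuming the `*.bam.gbgout` file naming used in the chunks below):

```r
# the count files have no header row: column 1 is the feature, column 2 the count
first_file <- dir(datdir, pattern = "*.bam.gbgout", full.names = TRUE)[1]
head(readr::read_tsv(first_file, col_names = c("feature", "count")))
```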
We'll also download the list of 'bad' replicates that the authors determined were especially poorly correlated with the others.
download.file(url = "https://github.com/bartongroup/profDGE48/raw/master/Bad_replicate_identification/exclude.lst", destfile = (file.path(datdir, "badreps.txt")))
Here we make use of the `map` and `map2` functions from the purrr package to apply the `read_tsv` function from readr to each of the 96 sample tables, adding the sample name (derived from the file name) to each subtable. The `reduce` function then joins all the replicates together. Finally, we remove the samples that failed QC in the original study.
```r
files <- dir(path = datdir, pattern = "*.bam.gbgout", full.names = TRUE)
sample_names <- sapply(strsplit(dir(path = datdir, pattern = "*.bam.gbgout"), "_MID"),
                       function(x) x[[1]])

badreps <- read_tsv(file.path(datdir, "badreps.txt"), col_names = FALSE)$X1
badreps <- unlist(lapply(strsplit(badreps, "_MID"), function(x) x[1]))

counts <- files %>%
  purrr::map(read_tsv, col_names = FALSE) %>%  # read in each file individually
  purrr::map2(sample_names,
              ~ dplyr::rename(.x, !! .y := X2, feature = X1)) %>%  # add sample names
  purrr::reduce(left_join, by = "feature") %>%  # join all samples into one data frame
  dplyr::select(-any_of(badreps))               # remove bad replicates
```
Here we'll carry out an analysis of a subset of the full dataset, comparing the controls to the knockout samples. The fold changes observed in this comparison will be used when generating the non-null simulated data in the following sections. Here we use a subset of the samples (10 samples in each group) to mimic the sample sizes we use in simulation.
We're ready to construct a DESeq2 object. First we pull out the feature names and add them as rownames for the count table, and next we construct a column data object that houses the sample names, replicate numbers, and condition factor (WT versus Snf2 knockout).
First we set up the DESeq2 object.
```r
feats <- (counts %>% select(1))$feature
counts <- as.matrix(counts %>% select(-1))
rownames(counts) <- feats

coldat <- tibble(sample = colnames(counts)) %>%
  separate(sample, sep = "_", into = c("condition", "replicate"), remove = FALSE) %>%
  mutate(condition = factor(condition))

# filter low count genes
counts <- counts[rowMeans(counts) > 1, ]

dds_full <- DESeqDataSetFromMatrix(countData = counts,
                                   colData = coldat,
                                   design = ~ condition)
```
Next we run DESeq2 and tally the genes with an adjusted p-value (FDR) below 0.05.
```r
# subset 10 samples per condition and run DESeq2
set.seed(9384)
subset.wt <- dds_full[, colData(dds_full)$condition == "WT"]
subset.sn <- dds_full[, colData(dds_full)$condition == "Snf2"]
subset <- cbind(subset.wt[, sample(1:ncol(subset.wt), 10)],
                subset.sn[, sample(1:ncol(subset.sn), 10)])

dds <- DESeq(subset)
resultsNames(dds)  # lists the coefficients
res_10 <- results(dds, name = "condition_WT_vs_Snf2", independentFiltering = FALSE)
head(res_10)
sum(res_10$padj < 0.05, na.rm = TRUE)
```
Check densities of the test statistic and effect sizes for significant DE genes.
```r
data.frame(res_10[res_10$padj < 0.05 & !is.na(res_10$padj), ]) %>%
  ggplot(aes(x = stat)) +
  geom_density() +
  xlab("Test Statistic")

data.frame(res_10[res_10$padj < 0.05 & !is.na(res_10$padj), ]) %>%
  ggplot(aes(x = log2FoldChange)) +
  geom_density() +
  xlab("Effect Size (log2 Fold Change)")
```
Several thousand genes are detected at this threshold. Next, we'll build the input data frame for SummarizedBenchmark.
```r
geneExp <- as_tibble(data.frame(geneName = rownames(res_10),
                                pval = res_10$pvalue,
                                SE = res_10$lfcSE,
                                ind_covariate = res_10$baseMean,
                                effect_size = res_10$log2FoldChange,
                                test_statistic = res_10$stat,
                                pzero = rowSums(counts(subset) == 0) / ncol(counts(subset))))

# remove NAs and genes with zero counts in 50% or more of the samples
geneExp <- geneExp %>%
  na.omit() %>%
  dplyr::filter(pzero < 0.5)
```
We'll create a plot to examine the distribution of effect sizes, since the ash method assumes that the distribution of true (unobserved) effect sizes is unimodal.
```r
ggplot(data = geneExp, aes(effect_size)) +
  geom_histogram(bins = 30)
```
We'll also explore how the standard error (used by ash) correlates with the independent covariate (used by methods that incorporate covariates), in order to get an idea of how these pieces of information relate to one another.
```r
ggplot(data = geneExp, aes(x = ind_covariate, y = SE)) +
  geom_hex(bins = 100) +
  scale_x_continuous(trans = "log10") +
  xlab("Covariate: Mean gene expression")
```
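As a numeric complement to the hexbin plot, we can also compute a rank correlation between the covariate and the standard error (a small sketch using the `geneExp` columns built above):

```r
# Spearman correlation is rank-based, so it is unaffected by the
# log10 transform applied to the covariate in the plot
cor(geneExp$ind_covariate, geneExp$SE, method = "spearman")
```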
Look at covariate diagnostic plots.
strat_hist(geneExp, pvalue="pval", covariate="ind_covariate", maxy=30)
rank_scatter(geneExp, pvalue="pval", covariate="ind_covariate")
Next we build the common bench design. We also add Scott's FDR regression, since our test statistics are approximately t-distributed. Note that FDRreg-empirical assumes that the distribution of non-null test statistics has no substantial mass at zero; this assumption is violated here, but we still include fdrreg-e in the evaluation since the assumption is impossible to check outside of a simulation setting (it is not possible to know which tests are non-null). We would therefore like to evaluate how results change when the assumption is violated, since it plausibly is in the case studies.
```r
bd <- initializeBenchDesign()

bd <- addBMethod(bd, "fdrreg-t",
                 FDRreg::FDRreg,
                 function(x) { x$FDR },
                 z = test_statistic,
                 features = model.matrix(~ splines::bs(ind_covariate, df = 3) - 1),
                 nulltype = 'theoretical',
                 control = list(lambda = 0.01))

bd <- addBMethod(bd, "fdrreg-e",
                 FDRreg::FDRreg,
                 function(x) { x$FDR },
                 z = test_statistic,
                 features = model.matrix(~ splines::bs(ind_covariate, df = 3) - 1),
                 nulltype = 'empirical',
                 control = list(lambda = 0.01))
```
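To see what the covariate features passed to FDRreg look like, here is the same spline basis evaluated on a toy vector (an illustration only; the real basis is built from `ind_covariate` when the methods run):

```r
# a 3-df B-spline basis expansion of a toy covariate, with no intercept column
toy <- seq(0, 1, length.out = 10)
head(model.matrix(~ splines::bs(toy, df = 3) - 1))
```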
Run benchmark methods.
```r
sb <- bd %>% buildBench(data = geneExp, parallel = FALSE)
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
```
Plot results.
```r
rejections_scatter(sb, supplementary = FALSE)
```

```r
plotFDRMethodsOverlap(sb, alpha = 0.05, nsets = ncol(sb),
                      order.by = "freq", decreasing = TRUE,
                      supplementary = FALSE)
```
Here we'll carry out an analysis of a subset of the full dataset, comparing the controls to the knockout samples. The fold changes observed in this comparison will be used when generating the non-null simulated data in the following sections. Here we use a subset of the samples (5 samples in each group) to mimic the sample sizes we use in simulation.
First we run DESeq2.
```r
# subset 5 samples per condition and run DESeq2
set.seed(728)
subset.wt <- dds_full[, colData(dds_full)$condition == "WT"]
subset.sn <- dds_full[, colData(dds_full)$condition == "Snf2"]
subset <- cbind(subset.wt[, sample(1:ncol(subset.wt), 5)],
                subset.sn[, sample(1:ncol(subset.sn), 5)])

dds <- DESeq(subset)
resultsNames(dds)  # lists the coefficients
res_5 <- results(dds, name = "condition_WT_vs_Snf2", independentFiltering = FALSE)
head(res_5)
sum(res_5$padj < 0.05, na.rm = TRUE)
```
Check densities of the test statistic and effect sizes for significant DE genes.
```r
data.frame(res_5[res_5$padj < 0.05 & !is.na(res_5$padj), ]) %>%
  ggplot(aes(x = stat)) +
  geom_density() +
  xlab("Test Statistic")

data.frame(res_5[res_5$padj < 0.05 & !is.na(res_5$padj), ]) %>%
  ggplot(aes(x = log2FoldChange)) +
  geom_density() +
  xlab("Effect Size (log2 Fold Change)")
```
Several thousand genes are detected at this threshold. Next, we'll build the input data frame for SummarizedBenchmark.
```r
geneExp <- as_tibble(data.frame(geneName = rownames(res_5),
                                pval = res_5$pvalue,
                                SE = res_5$lfcSE,
                                ind_covariate = res_5$baseMean,
                                effect_size = res_5$log2FoldChange,
                                test_statistic = res_5$stat,
                                pzero = rowSums(counts(subset) == 0) / ncol(counts(subset))))

# remove NAs and genes with zero counts in 50% or more of the samples
geneExp <- geneExp %>%
  na.omit() %>%
  dplyr::filter(pzero < 0.5)
```
We'll create a plot to examine the distribution of effect sizes, since the ash method assumes that the distribution of true (unobserved) effect sizes is unimodal.
```r
ggplot(data = geneExp, aes(effect_size)) +
  geom_histogram(bins = 30)
```
We'll also explore how the standard error (used by ash) correlates with the independent covariate (used by methods that incorporate covariates), in order to get an idea of how these pieces of information relate to one another.
```r
ggplot(data = geneExp, aes(x = ind_covariate, y = SE)) +
  geom_hex(bins = 100) +
  scale_x_continuous(trans = "log10") +
  xlab("Covariate: Mean gene expression")
```
Look at covariate diagnostic plots.
```r
strat_hist(geneExp, pvalue = "pval", covariate = "ind_covariate", maxy = 23)
```

```r
rank_scatter(geneExp, pvalue = "pval", covariate = "ind_covariate")
```
Run benchmark methods.
```r
sb <- bd %>% buildBench(data = geneExp, parallel = FALSE)
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
```
Plot results.
```r
rejections_scatter(sb, supplementary = FALSE)
```

```r
plotFDRMethodsOverlap(sb, alpha = 0.05, nsets = ncol(sb),
                      order.by = "freq", decreasing = TRUE,
                      supplementary = FALSE)
```
First we'll illustrate the logistic curve used to simulate the informative covariate. We'll draw values along this curve and use them as probability weights when selecting DE genes. A weakly informative covariate is generated by adding noise to the strongly informative covariate.
```r
x <- seq(0, 10, by = 0.1)
infcov <- 1 / (1 + exp(-x + 5))
plot(x, infcov, type = "l", ylab = "informative covariate")
plot(infcov, pmin(1, abs(infcov + rnorm(length(x), 0, 0.25))),
     ylab = "weak covariate", xlab = "true covariate")
```
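Below is a minimal sketch of how such covariate values act as probability weights when picking DE genes (the gene count here is illustrative; the actual selection happens inside `simulateOneSplit`, defined later):

```r
set.seed(1)
ngenes <- 6000                                # illustrative number of genes
covar <- 1 / (1 + exp(-runif(ngenes, 0, 10) + 5))
DE <- sample(1:ngenes, 2000, prob = covar)    # weighted sampling of DE genes
mean(covar[DE]) > mean(covar)                 # should be TRUE: selected genes
                                              # skew toward high covariate values
```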
Next, we'll analyze random splits of one condition, both with and without the addition of simulated DE genes. Here we create a function that runs one replicate given the sample size and the number of DE genes. This will be looped over many replications, and the results averaged across them.
```r
#' @param X simulation number (replicate index)
#' @param rseed random seed
#' @param nDE number of DE genes
#' @param sampleSize number of samples in each condition
#' @param bd bench design object
#' @param uninformativeCovariate logical indicating whether to replace the
#'   informative covariate with a randomly generated (uninformative) covariate.
#'   Default is FALSE.
#' @param pvalHists logical; whether to return histograms of p-values by
#'   covariate instead of SummarizedBenchmark objects
#' @param strongCovariate logical; if FALSE, weaken the informative covariate
#'   by adding noise to it
simulateOneSplit <- function(X, rseed, nDE, sampleSize, bd,
                             uninformativeCovariate = FALSE,
                             pvalHists = FALSE,
                             strongCovariate = TRUE){
  # set random seed
  set.seed(as.numeric(X) * as.numeric(rseed))

  covar <- 1 / (1 + exp(-runif(nrow(dds_full), 0, 10) + 5))

  # select a random subset of 2 * sampleSize samples
  dds_test <- dds_full[, sample(1:ncol(dds_full), sampleSize * 2)]

  # add a fake condition column to the column data
  colData(dds_test)$fake <- factor(c(rep("A", sampleSize), rep("B", sampleSize))[
    sample(1:(sampleSize * 2), sampleSize * 2)])
  design(dds_test) <- ~ fake

  pzero <- rowSums(counts(dds_test) == 0) / ncol(counts(dds_test))
  dds_test <- dds_test[pzero < 0.5, ]
  covar <- covar[pzero < 0.5]
  truth <- rep(FALSE, nrow(dds_test))

  # make sure the null comparison is truly null: if the group means differ
  # significantly on PC1 or PC2, reshuffle the sample labels (up to 10 tries)
  dds_test <- estimateSizeFactors(dds_test)
  x <- t(counts(dds_test, normalized = TRUE))
  pc <- prcomp(log(x + 0.5), scale. = TRUE)
  a1 <- pc$x[colData(dds_test)$fake == "A", 1]
  b1 <- pc$x[colData(dds_test)$fake == "B", 1]
  a2 <- pc$x[colData(dds_test)$fake == "A", 2]
  b2 <- pc$x[colData(dds_test)$fake == "B", 2]
  p1 <- t.test(a1, b1)$p.value
  p2 <- t.test(a2, b2)$p.value
  tries <- 0
  while((p1 < 0.10 || p2 < 0.10) && tries < 10){
    colData(dds_test)$fake <- sample(colData(dds_test)$fake, ncol(dds_test))
    x <- t(counts(dds_test, normalized = TRUE))
    pc <- prcomp(log(x + 0.5), scale. = TRUE)
    a1 <- pc$x[colData(dds_test)$fake == "A", 1]
    b1 <- pc$x[colData(dds_test)$fake == "B", 1]
    a2 <- pc$x[colData(dds_test)$fake == "A", 2]
    b2 <- pc$x[colData(dds_test)$fake == "B", 2]
    p1 <- t.test(a1, b1)$p.value
    p2 <- t.test(a2, b2)$p.value
    tries <- tries + 1
  }

  if(nDE > 0){
    # pick a random set of nDE genes to add signal to, using the covariate
    # values as sampling weights
    DE <- sample(1:nrow(dds_test), nDE, prob = covar)
    truth[DE] <- TRUE

    # assign each DE gene the log2FC observed for it in the original comparison
    counts_new <- counts(dds_test)
    log2FC <- rep(0, nrow(dds_test))
    if(sampleSize == 5){
      log2FC[DE] <- res_5$log2FoldChange[pzero < 0.5][DE]
    }else if(sampleSize == 10){
      log2FC[DE] <- res_10$log2FoldChange[pzero < 0.5][DE]
    }else{
      stop("Only sample sizes 5 and 10 are currently supported with pre-",
           "computed fold changes for sampling from.")
    }

    # randomly choose, per gene, which fake condition the fold change is applied to
    ran <- runif(nrow(dds_test))
    refcond <- ifelse(ran < 0.5, "A", "B")
    down <- which(ran < 0.5)
    counts_new[down, colData(dds_test)$fake == unique(refcond[down])] <-
      counts(dds_test)[down, colData(dds_test)$fake == unique(refcond[down])] *
      2^log2FC[down]
    counts_new[-down, colData(dds_test)$fake == unique(refcond[-down])] <-
      counts(dds_test)[-down, colData(dds_test)$fake == unique(refcond[-down])] *
      2^log2FC[-down]
    counts_new <- apply(counts_new, 2, as.integer)
    counts(dds_test) <- counts_new
  }

  # replace the existing size factors and run DESeq2 on the fake design;
  # parallelization happens by forking multiple instances of this function,
  # so each DESeq call runs on a single worker
  dds_test <- estimateSizeFactors(dds_test)
  dds_test <- DESeq(dds_test, parallel = FALSE)
  resTEST <- results(dds_test, name = "fake_B_vs_A", independentFiltering = FALSE)

  geneExp <- as_tibble(data.frame(geneName = rownames(resTEST),
                                  pval = resTEST$pvalue,
                                  SE = resTEST$lfcSE,
                                  ind_covariate = covar,
                                  effect_size = resTEST$log2FoldChange,
                                  test_statistic = resTEST$stat,
                                  qvalue = truth))
  if (uninformativeCovariate){
    geneExp <- mutate(geneExp, ind_covariate = runif(length(covar)))
  }else if(!strongCovariate){
    geneExp <- mutate(geneExp,
                      ind_covariate = pmin(1, abs(covar + rnorm(length(covar), 0, 0.25))))
  }
  geneExp <- geneExp %>% dplyr::filter(!is.na(pval))

  if(pvalHists){
    return(strat_hist(geneExp, pvalue = "pval", covariate = "ind_covariate", maxy = 10))
  }

  sb <- bd %>% buildBench(data = geneExp, parallel = FALSE,
                          truthCols = "qvalue", ftCols = "ind_covariate")
  assayNames(sb) <- "qvalue"
  sb <- addDefaultMetrics(sb)
  rowData(sb)$log2FC <- geneExp$effect_size
  return(sb)
}
```
We'll also set some parameters that will be common to all simulations. These include the number of replications, the bench design object, the set of methods to exclude in the results plots, and the alpha cutoff level to be used when plotting the aggregated Upset results.
```r
B <- 100
excludeSet <- c("unadjusted", "bl-df02", "bl-df04", "bl-df05")
ualpha <- 0.05

# only keep one condition for subsetting in the simulations that follow
dds_full <- dds_full[, colData(dds_full)$condition == "Snf2"]
```
Here's a helper function that returns the number of methods with rejections at a particular alpha level. This helps us decide whether or not to draw the aggregated upset plot: if fewer than two methods reject, the plot would throw an error, which is a problem for the null simulations.
```r
# To be included in the aggregated upset plot, a method needs to have found,
# on average, more than one rejection per replicate. To create an upset plot,
# at least two methods must reject at this threshold.
#' @param res standardized metric data.table generated using
#'   plotsim_standardize
#' @param alphacutoff alpha cutoff
#' @param filterSet methods to exclude from consideration
numberMethodsReject <- function(res, alphacutoff, filterSet){
  res <- res %>%
    filter(is.na(param.alpha) | (param.alpha == alphacutoff)) %>%
    filter(!(blabel %in% filterSet)) %>%
    filter(alpha == alphacutoff) %>%
    filter(performanceMetric == "rejections") %>%
    select(blabel, performanceMetric, value) %>%
    group_by(blabel) %>%
    summarize(mean_value = mean(value)) %>%
    filter(mean_value > 1)
  return(nrow(res))
}
```
Here we'll repeat the above, but for a null comparison of 5 versus 5 Snf2 samples, where the groups are selected randomly. This will be done for 100 random splits.
```r
rseed <- 225
sampleSize <- 5
nDE <- 0
if (!file.exists(resfile_n5)){
  null5 <- mclapply(X = 1:B, FUN = simulateOneSplit, rseed = rseed,
                    nDE = nDE, sampleSize = sampleSize, bd = bd,
                    mc.cores = nCores)
  saveRDS(null5, file = resfile_n5)
}else{
  null5 <- readRDS(file = resfile_n5)
}
```
Plot results.
```r
# Check for missing results (if any methods threw an error for relevant metrics).
rowSums(sapply(null5, function(x) colSums(is.na(assays(x)$qvalue)) > 0))

res5 <- plotsim_standardize(null5, alpha = seq(0.01, 0.10, 0.01))

plotsim_average(res5, met = "rejections", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res5, met = "rejectprop", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res5, met = "FWER", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res5, met = "TNR", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)

covariateLinePlot(null5, alpha = 0.05, covname = "log2FC", nbins = 25,
                  trans = "log1p")
covariateLinePlot(null5, alpha = 0.05, covname = "ind_covariate", nbins = 25,
                  trans = "log1p")

if (numberMethodsReject(res5, alphacutoff = ualpha, filterSet = excludeSet) >= 2){
  aggupset(null5, alpha = ualpha, supplementary = FALSE, return_list = FALSE)
}else{
  message("Not enough methods found rejections at alpha ", ualpha,
          "; skipping upset plot")
}
```
Here we'll repeat the above, but for a null comparison of 10 versus 10 Snf2 samples, where the groups are selected randomly. This will be done for 100 random splits. Note that here the covariate-aware methods use a covariate that is by definition uninformative (since there are no non-null observations).
```r
rseed <- 837
sampleSize <- 10
nDE <- 0
if (!file.exists(resfile_n10)){
  null10 <- mclapply(X = 1:B, FUN = simulateOneSplit, rseed = rseed,
                     nDE = nDE, sampleSize = sampleSize, bd = bd,
                     mc.cores = nCores)
  saveRDS(null10, file = resfile_n10)
}else{
  null10 <- readRDS(file = resfile_n10)
}
```
Plot results.
```r
# Check for missing results (if any methods threw an error for relevant metrics).
rowSums(sapply(null10, function(x) colSums(is.na(assays(x)$qvalue)) > 0))

res10 <- plotsim_standardize(null10, alpha = seq(0.01, 0.10, 0.01))

plotsim_average(res10, met = "rejections", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res10, met = "rejectprop", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res10, met = "FWER", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res10, met = "TNR", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)

covariateLinePlot(null10, alpha = 0.05, covname = "log2FC", nbins = 25,
                  trans = "log1p")
covariateLinePlot(null10, alpha = 0.05, covname = "ind_covariate", nbins = 25,
                  trans = "log1p")

if (numberMethodsReject(res10, alphacutoff = ualpha, filterSet = excludeSet) >= 2){
  aggupset(null10, alpha = ualpha, supplementary = FALSE, return_list = FALSE)
}else{
  message("Not enough methods found rejections at alpha ", ualpha,
          "; skipping upset plot")
}
```
Here we'll repeat the above, but for a DE comparison of 5 versus 5 Snf2 samples, where the groups are selected randomly and 2000 DE genes are added. We'll use a strongly informative covariate. This will be done for 100 random splits.
```r
rseed <- 198
sampleSize <- 5
nDE <- 2000
if (!file.exists(resfile_d5)){
  de5 <- mclapply(X = 1:B, FUN = simulateOneSplit, rseed = rseed,
                  nDE = nDE, sampleSize = sampleSize, bd = bd,
                  mc.cores = nCores)
  saveRDS(de5, file = resfile_d5)
}else{
  de5 <- readRDS(file = resfile_d5)
}

# p-value histograms
plotfile <- file.path(datdir, "de5_pvalhists.pdf")
if (!file.exists(plotfile)){
  hists <- mclapply(X = 1:B, FUN = simulateOneSplit, rseed = rseed,
                    nDE = nDE, sampleSize = sampleSize, bd = bd,
                    pvalHists = TRUE, mc.cores = nCores)
  pdf(plotfile, width = 8, height = 4)
  for(i in 1:length(hists)){
    print(hists[[i]])
  }
  dev.off()
}
```
Plot results.
```r
# Check for missing results (if any methods threw an error for relevant metrics).
rowSums(sapply(de5, function(x) colSums(is.na(assays(x)$qvalue)) > 0))

res5d <- plotsim_standardize(de5, alpha = seq(0.01, 0.10, 0.01))

plotsim_average(res5d, met = "rejections", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res5d, met = "FDR", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res5d, met = "TPR", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res5d, met = c("FPR", "TPR"), filter_set = excludeSet,
                merge_ihw = TRUE)

covariateLinePlot(de5, alpha = 0.05, covname = "log2FC", nbins = 25,
                  trans = "log1p")
covariateLinePlot(de5, alpha = 0.05, covname = "ind_covariate", nbins = 25,
                  trans = "log1p")

if (numberMethodsReject(res5d, alphacutoff = ualpha, filterSet = excludeSet) >= 2){
  aggupset(de5, alpha = ualpha, supplementary = FALSE, return_list = FALSE)
}else{
  message("Not enough methods found rejections at alpha ", ualpha,
          "; skipping upset plot")
}
```
Here we'll repeat the above, but for a DE comparison of 10 versus 10 Snf2 samples, where the groups are selected randomly and 2000 DE genes are added. We'll use a strongly informative covariate. This will be done for 100 random splits.
```r
rseed <- 961
sampleSize <- 10
nDE <- 2000
if (!file.exists(resfile_d10)){
  de10 <- mclapply(X = 1:B, FUN = simulateOneSplit, rseed = rseed,
                   nDE = nDE, sampleSize = sampleSize, bd = bd,
                   mc.cores = nCores)
  saveRDS(de10, file = resfile_d10)
}else{
  de10 <- readRDS(file = resfile_d10)
}

# p-value histograms
plotfile <- file.path(datdir, "de10_pvalhists.pdf")
if (!file.exists(plotfile)){
  hists <- mclapply(X = 1:B, FUN = simulateOneSplit, rseed = rseed,
                    nDE = nDE, sampleSize = sampleSize, bd = bd,
                    pvalHists = TRUE, mc.cores = nCores)
  pdf(plotfile, width = 8, height = 4)
  for(i in 1:length(hists)){
    print(hists[[i]])
  }
  dev.off()
}
```
Plot results.
```r
# Check for missing results (if any methods threw an error for relevant metrics).
rowSums(sapply(de10, function(x) colSums(is.na(assays(x)$qvalue)) > 0))

res10d <- plotsim_standardize(de10, alpha = seq(0.01, 0.10, 0.01))

plotsim_average(res10d, met = "rejections", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res10d, met = "FDR", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res10d, met = "TPR", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res10d, met = c("FPR", "TPR"), filter_set = excludeSet,
                merge_ihw = TRUE)

covariateLinePlot(de10, alpha = 0.05, covname = "log2FC", nbins = 25,
                  trans = "log1p")
covariateLinePlot(de10, alpha = 0.05, covname = "ind_covariate", nbins = 25,
                  trans = "log1p")

if (numberMethodsReject(res10d, alphacutoff = ualpha, filterSet = excludeSet) >= 2){
  aggupset(de10, alpha = ualpha, supplementary = FALSE, return_list = FALSE)
}else{
  message("Not enough methods found rejections at alpha ", ualpha,
          "; skipping upset plot")
}
```
Here we'll repeat the previous section, but using a weaker informative covariate (that has noise added to it).
```r
rseed <- 198
sampleSize <- 5
nDE <- 2000
if (!file.exists(resfile_d5_w)){
  de5 <- mclapply(X = 1:B, FUN = simulateOneSplit, rseed = rseed,
                  nDE = nDE, sampleSize = sampleSize, bd = bd,
                  mc.cores = nCores, strongCovariate = FALSE)
  saveRDS(de5, file = resfile_d5_w)
}else{
  de5 <- readRDS(file = resfile_d5_w)
}
```
Plot results.
```r
# Check for missing results (if any methods threw an error for relevant metrics).
rowSums(sapply(de5, function(x) colSums(is.na(assays(x)$qvalue)) > 0))

res5d <- plotsim_standardize(de5, alpha = seq(0.01, 0.10, 0.01))

plotsim_average(res5d, met = "rejections", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res5d, met = "FDR", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res5d, met = "TPR", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res5d, met = c("FPR", "TPR"), filter_set = excludeSet,
                merge_ihw = TRUE)

covariateLinePlot(de5, alpha = 0.05, covname = "log2FC", nbins = 25,
                  trans = "log1p")
covariateLinePlot(de5, alpha = 0.05, covname = "ind_covariate", nbins = 25,
                  trans = "log1p")

if (numberMethodsReject(res5d, alphacutoff = ualpha, filterSet = excludeSet) >= 2){
  aggupset(de5, alpha = ualpha, supplementary = FALSE, return_list = FALSE)
}else{
  message("Not enough methods found rejections at alpha ", ualpha,
          "; skipping upset plot")
}
```
Here we'll repeat the above, but using a weaker informative covariate (that has noise added to it).
```r
rseed <- 961
sampleSize <- 10
nDE <- 2000
if (!file.exists(resfile_d10_w)){
  de10 <- mclapply(X = 1:B, FUN = simulateOneSplit, rseed = rseed,
                   nDE = nDE, sampleSize = sampleSize, bd = bd,
                   mc.cores = nCores, strongCovariate = FALSE)
  saveRDS(de10, file = resfile_d10_w)
}else{
  de10 <- readRDS(file = resfile_d10_w)
}
```
Plot results.
```r
# Check for missing results (if any methods threw an error for relevant metrics).
rowSums(sapply(de10, function(x) colSums(is.na(assays(x)$qvalue)) > 0))

res10d <- plotsim_standardize(de10, alpha = seq(0.01, 0.10, 0.01))

plotsim_average(res10d, met = "rejections", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res10d, met = "FDR", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res10d, met = "TPR", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res10d, met = c("FPR", "TPR"), filter_set = excludeSet,
                merge_ihw = TRUE)

covariateLinePlot(de10, alpha = 0.05, covname = "log2FC", nbins = 25,
                  trans = "log1p")
covariateLinePlot(de10, alpha = 0.05, covname = "ind_covariate", nbins = 25,
                  trans = "log1p")

if (numberMethodsReject(res10d, alphacutoff = ualpha, filterSet = excludeSet) >= 2){
  aggupset(de10, alpha = ualpha, supplementary = FALSE, return_list = FALSE)
}else{
  message("Not enough methods found rejections at alpha ", ualpha,
          "; skipping upset plot")
}
```
Here we repeat the previous section, using an uninformative covariate.
Here we'll repeat the above, but for a DE comparison of 5 versus 5 Snf2 samples, where the groups are selected randomly and 2000 DE genes are added. This will be done for 100 random splits.
```r
rseed <- 198
sampleSize <- 5
nDE <- 2000
if (!file.exists(resfile_d5_uninfCov)){
  de5 <- mclapply(X = 1:B, FUN = simulateOneSplit, rseed = rseed,
                  nDE = nDE, sampleSize = sampleSize, bd = bd,
                  uninformativeCovariate = TRUE, mc.cores = nCores)
  saveRDS(de5, file = resfile_d5_uninfCov)
}else{
  de5 <- readRDS(file = resfile_d5_uninfCov)
}
```
Plot results.
```r
# Check for missing results (if any methods threw an error for relevant metrics).
rowSums(sapply(de5, function(x) colSums(is.na(assays(x)$qvalue)) > 0))

res5d <- plotsim_standardize(de5, alpha = seq(0.01, 0.10, 0.01))

plotsim_average(res5d, met = "rejections", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res5d, met = "FDR", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res5d, met = "TPR", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res5d, met = c("FPR", "TPR"), filter_set = excludeSet,
                merge_ihw = TRUE)

covariateLinePlot(de5, alpha = 0.05, covname = "log2FC", nbins = 25,
                  trans = "log1p")
covariateLinePlot(de5, alpha = 0.05, covname = "ind_covariate", nbins = 25,
                  trans = "log1p")

if (numberMethodsReject(res5d, alphacutoff = ualpha, filterSet = excludeSet) >= 2){
  aggupset(de5, alpha = ualpha, supplementary = FALSE, return_list = FALSE)
}else{
  message("Not enough methods found rejections at alpha ", ualpha,
          "; skipping upset plot")
}
```
Here we'll repeat the above, but for a DE comparison of 10 versus 10 Snf2 samples, where the groups are selected randomly and 2000 DE genes are added. This will be done for 100 random splits.
```r
rseed <- 961
sampleSize <- 10
nDE <- 2000
if (!file.exists(resfile_d10_uninfCov)){
  de10 <- mclapply(X = 1:B, FUN = simulateOneSplit, rseed = rseed,
                   nDE = nDE, sampleSize = sampleSize, bd = bd,
                   uninformativeCovariate = TRUE, mc.cores = nCores)
  saveRDS(de10, file = resfile_d10_uninfCov)
}else{
  de10 <- readRDS(file = resfile_d10_uninfCov)
}
```
Plot results.
```r
# Check for missing results (if any methods threw an error for relevant metrics).
rowSums(sapply(de10, function(x) colSums(is.na(assays(x)$qvalue)) > 0))

res10d <- plotsim_standardize(de10, alpha = seq(0.01, 0.10, 0.01))

plotsim_average(res10d, met = "rejections", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res10d, met = "FDR", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res10d, met = "TPR", filter_set = excludeSet,
                merge_ihw = TRUE, errorBars = TRUE)
plotsim_average(res10d, met = c("FPR", "TPR"), filter_set = excludeSet,
                merge_ihw = TRUE)

covariateLinePlot(de10, alpha = 0.05, covname = "log2FC", nbins = 25,
                  trans = "log1p")
covariateLinePlot(de10, alpha = 0.05, covname = "ind_covariate", nbins = 25,
                  trans = "log1p")

if (numberMethodsReject(res10d, alphacutoff = ualpha, filterSet = excludeSet) >= 2){
  aggupset(de10, alpha = ualpha, supplementary = FALSE, return_list = FALSE)
}else{
  message("Not enough methods found rejections at alpha ", ualpha,
          "; skipping upset plot")
}
```
Here we compare the method ranks for the different sample sizes and informativeness settings at alpha = 0.10.
```r
plotMethodRanks(c(resfile_d5, resfile_d5_w, resfile_d10, resfile_d10_w,
                  resfile_d5_uninfCov, resfile_d10_uninfCov),
                colLabels = c("5 S", "5 W", "10 S", "10 W", "5 UI", "10 UI"),
                alpha = 0.10, xlab = "Comparison",
                excludeMethods = NULL)
```
```r
sessionInfo()
```