#' @title Empirical RNA-seq Sample Size Analysis
#'
#' @description
#' ERSSA is a package designed to test whether an currently available RNA-seq
#' dataset has sufficient biological replicates to detect a majority of
#' differentially expressed (DE) genes between two conditions. Base on the
#' number of biological replicates available, the algorithm subsamples at
#' step-wise replicate levels and uses existing differentially expression
#' analysis softwares (e.g. edgeR and DESeq2) to identify the number of DE
#' genes. This process is repeated for a given number of times with unique
#' combinations of samples to generate a distribution of DE genes at each
#' replicate level. Compare to existing RNA-seq sample size analysis
#' algorithms, ERSSA does not rely on any a priori assumptions about the
#' dataset, but rather uses an user-supplied pilot RNA-seq dataset to
#' determine whether the current replicate level is sufficient to detect a
#' majority of DE genes.
#'
#' @details
#' \code{erssa} function is a wrapper that calls several ERSSA functions in
#' sequence. For additional description of the functions called, please see
#' their respective manual.
#'
#' For the majority of current RNA-seq analysis, RNA-seq samples are aligned to
#' the reference, followed by running quantification packages to generate
#' count tables. ERSSA can then take the unfiltered count table and remove non-
#' to low-expressing genes with count_filter function. Alternatively, a
#' filtered count table can be supplied and filtering will be skipped.
#'
#' Next, unique combinations of samples at various biological replicate levels
#' will be generated by comb_gen function and passed to a differential
#' expression analysis software for statistical testing for DE genes. The
#' pipeline currently supports edgeR and DESeq2, but additional software
#' support can be easily added.
#'
#' The generated differential expression results are then analyzed by several
#' plotting functions briefly described here. ggplot2_dotplot function plots
#' the trend in DE gene identification. ggplot2_marginPlot function plots
#' the marginal change in the number of DE genes as replicate level
#' increases. ggplot2_intersectPlot function plots the number of DE genes that
#' is common across combinations. ggplot2_TPR_FPRPlot function plots the TPR
#' and FPR of DE detection using the full dataset's list of DE gene as the
#' ground truth. Base on insights from these plots, the user can determine
#' whether a desirable level of DE gene discovery has been reached.
#'
#' At the default setting, the results of statistical tests are not saved,
#' only the list of DE genes is. However, all of the test results can be
#' optionally saved for further analysis.
#'
#' At default setting, only one CPU node is employed and depend on the number
#' of tests that needs to be done, the calculations can take some time to
#' complete. If additional nodes are available, additional nodes can be
#' employed by specifying the num_workers argument. Parallel computing
#' requires BiocParallel package.
#'
#' The results including list of DE genes and ggplot2 objects are returned.
#' All runtime parameters can optionally be saved in a log file named
#' "erssa.log".
#'
#' @param count_table A RNA-seq count matrix with genes on each row and samples
#' on each column. If count_table has already been filtered to remove non- or
#' low-expressing genes, then counts_filtered argument should be changed to
#' TRUE.
#' @param condition_table A condition table with two columns and each sample as
#' a row. Column 1 contains sample names and Column 2 contains sample
#' conditions (e.g. Control, Treatment).
#' @param DE_ctrl_cond The name of control condition in the comparison. Must be
#' one of the two conditions in the condition table.
#' @param filter_cutoff The average CPM threshold set for filtering genes.
#' Default to 1.
#' @param counts_filtered Boolean. Whether count table has already been
#' filtered. Default = FALSE with the function run filtering by average CPM
#' at the cutoff specified by filter_cutoff value.
#' @param comb_gen_repeat The number of maximum unique combinations to generate
#' at each replicate level. More tests will be performed with a bigger value,
#' but run time also increases linearly. Default set to 30 unique combinations
#' at maximum.
#' @param DE_software The name of DE analysis software to use. Current options
#' include "edgeR" and "DESeq2". Default to "edgeR".
#' @param DE_cutoff_stat The cutoff in FDR or adjusted p-value used to
#' determine whether a gene is differentially expressed. Genes with lower
#' FDR or adjusted p-value pass the cutoff. Default = 0.05.
#' @param DE_cutoff_Abs_logFC The cutoff in abs(log2FoldChange) for differential
#' expression consideration. Genes with higher abs(log2FoldChange) pass the
#' cutoff. Default = 1.
#' @param DE_save_table Boolean. The results of differential expression tests
#' can be saved to the drive for further analysis. Default setting does not
#' save the results to save drive space. Default = FALSE.
#' @param marginalPlot_stat The statistic used for plotting of values in
#' marginal plot function. Options include 'mean', 'median'. Default='median'.
#' @param TPR_FPR_stat The statistics used to summarize TPR and FPR at
#' each replicate level in ggplot2_TPR_FPRPlot function. Options include
#' 'mean', 'median'. Default = 'mean'.
#' @param path The path to which the plots and results will be saved. Default
#' to current working directory.
#' @param num_workers Number of nodes to use for parallel computing the DE tests
#' @param save_log Boolean. Whether to save runtime parameters in log file.
#' Defualt to false.
#' @param save_plot Boolean. Wehther to save ggplot2 plots to drive. Default to
#' true.
#'
#' @return A list of objects generated during the analysis is returned:
#' \itemize{
#' \item{count_table.filtered}{filtered count table}
#' \item{samp.name.comb}{the samples involved in each statistical test}
#' \item{list.of.DE.genes}{list of DE genes in each statistical test}
#' \item{gg.dotplot.obj}{a list of objects that can be used to recreate the
#' dot plot. See function ggplot2_dotplot manual for more detail.}
#' \item{gg.marinPlot.obj}{a list of objects that can be used to recreate the
#' marginal num. of DE genes plot. See function ggplot2_marginPlot manual
#' for more detail.}
#' \item{gg.intersectPlot.obj}{a list of objects that can be used to
#' recreate the num. of intersect genes plot. See function
#' ggplot2_intersectPlot manual for more detail.}
#' \item{gg.TPR_FPRPlot.obj}{a list of objects that can be used to
#' recreate the TPR vs. FPR plot. See function
#' ggplot2_TPR_FPRPlot manual for more detail.}
#' }
#'
#'
#' @author Zixuan Shao, \email{Zixuanshao.zach@@gmail.com}
#'
#' @examples
#' # load example dataset containing 1000 genes, 4 replicates and 5 comb. per
#' # rep. level
#' data(condition_table.partial, package = "ERSSA")
#' data(count_table.partial, package = "ERSSA")
#'
#' # run erssa with the "partial" dataset, use default edgeR for DE
#' ssa = erssa(count_table.partial, condition_table.partial,
#' DE_ctrl_cond='heart')
#'
#' # run erssa with the "full" dataset containing 10 replicates per heart and
#' # muscle, all genes included.
#' # Remove comments to run
#' # set.seed(1)
#' # data(condition_table.full, package="ERSSA")
#' # data(count_table.full, package="ERSSA")
#' # ssa = erssa(count_table.full, condition_table.full, DE_ctrl_cond='heart')
#'
#' @references
#' Ching, Travers, Sijia Huang, and Lana X. Garmire. “Power Analysis and Sample
#' Size Estimation for RNA-Seq Differential Expression.” RNA, September 22,
#' 2014. https://doi.org/10.1261/rna.046011.114.
#'
#' Hoskins, Stephanie Page, Derek Shyr, and Yu Shyr. “Sample Size Calculation
#' for Differential Expression Analysis of RNA-Seq Data.” In Frontiers of
#' Biostatistical Methods and Applications in Clinical Oncology, 359–79.
#' Springer, Singapore, 2017. https://doi.org/10.1007/978-981-10-0126-0_22.
#'
#' @export
#'
#' @importFrom utils write.csv
erssa = function(count_table=NULL, condition_table=NULL, DE_ctrl_cond=NULL,
filter_cutoff=1, counts_filtered=FALSE,
comb_gen_repeat=30, DE_software='edgeR', DE_cutoff_stat = 0.05,
DE_cutoff_Abs_logFC = 1, DE_save_table=FALSE,
marginalPlot_stat='median', TPR_FPR_stat='mean',
path='.', num_workers=1, save_log=FALSE,
save_plot=TRUE){
# check all required arguments supplied
if (is.null(count_table)){
stop('Missing required count_table argument')
} else if (is.null(condition_table)){
stop("Missing required condition_table argument")
} else if (is.null(DE_ctrl_cond)){
stop("Missing name of control condition in comparison")
} else if (!(is.data.frame(count_table))){
stop('count_table is not an expected data.frame object')
} else if (!(is.data.frame(condition_table))){
stop('condition_table is not an expected data.frame object')
} else if (length(unique(sapply(count_table, class)))!=1){
stop(paste0('More than one data type detected in count table, please ',
'make sure count table contains only numbers and that the ',
'list of gene names is the data.frame index'))
} else if (!(DE_software %in% c('edgeR','DESeq2'))) {
stop('only edgeR or DEseq2 supported')
}
if (save_log==TRUE){
# start log file
log = file(file.path(path, "ERSSA.log"))
log_l = c('-------------', 'ERSSA run log', '-------------',' ')
log_l = c(log_l, paste0('Start time: ',Sys.time()))
}
# filter the count table by average CPM value cutoff
# default is to filter the count_table with cutoff=1
if (counts_filtered==FALSE){
count_table.filtered = count_filter(count_table=count_table,
cutoff=filter_cutoff)
} else if (counts_filtered==TRUE){
count_table.filtered = count_table
} else {
stop("counts_filtered needs to be a boolean variable!")
}
# generate sample name combinations at various replicate levels
combinations = comb_gen(condition_table=condition_table,
n_repetition=comb_gen_repeat)
# run DE software to generate DE gene list
if (DE_software=='edgeR' & num_workers==1){
deg = erssa_edger(count_table.filtered=count_table.filtered,
combinations=combinations,
condition_table=condition_table,
control=DE_ctrl_cond, cutoff_stat = DE_cutoff_stat,
cutoff_Abs_logFC = DE_cutoff_Abs_logFC,
save_table=DE_save_table, path=path)
} else if (DE_software=='DESeq2' & num_workers==1){
deg = erssa_deseq2(count_table.filtered=count_table.filtered,
combinations=combinations,
condition_table=condition_table,
control=DE_ctrl_cond,
cutoff_stat = DE_cutoff_stat,
cutoff_Abs_logFC = DE_cutoff_Abs_logFC,
save_table=DE_save_table, path=path)
} else if (DE_software=='edgeR' & num_workers>1){
deg = erssa_edger_parallel(
count_table.filtered=count_table.filtered,
combinations=combinations,
condition_table=condition_table, control=DE_ctrl_cond,
cutoff_stat = DE_cutoff_stat,
cutoff_Abs_logFC = DE_cutoff_Abs_logFC,
save_table=DE_save_table, path=path, num_workers=num_workers)
} else if (DE_software=='DESeq2' & num_workers>1){
deg = erssa_deseq2_parallel(
count_table.filtered=count_table.filtered,
combinations=combinations,
condition_table=condition_table, control=DE_ctrl_cond,
cutoff_stat = DE_cutoff_stat,
cutoff_Abs_logFC = DE_cutoff_Abs_logFC,
save_table=DE_save_table, path=path, num_workers=num_workers)
}
# plot number of DE genes as dot plot and boxplot
gg_dot = ggplot2_dotplot(deg=deg, path=path, save_plot=save_plot)
# plot the marginal increase in DE genes with more replicates
gg_margin = ggplot2_marginPlot(deg=deg, stat=marginalPlot_stat,
path=path, save_plot=save_plot)
# plot the number of DE genes that is common across combinations
gg_intersect = ggplot2_intersectPlot(deg=deg, path=path,
save_plot=save_plot)
# plots the TPR and FPR of DE detection using the full dataset's list of DE
# gene as the ground truth
gg_TPR_FPR = ggplot2_TPR_FPRPlot(deg=deg, count_table.filtered=
count_table.filtered,
stat=TPR_FPR_stat,
path=path, save_plot=save_plot)
if (save_log==TRUE){
# finish log file
log_l = c(log_l, paste0('Finish time: ',Sys.time()), ' ')
log_l = c(log_l, paste0('Control condition: ', DE_ctrl_cond))
log_l = c(log_l, paste0('Count table filtered by ERSSA: ',
!counts_filtered))
if (counts_filtered == FALSE){
log_l = c(log_l, paste0('Count table CPM filter: ', filter_cutoff))
}
log_l = c(log_l, paste0('Num. of unique sample combination tested: ',
comb_gen_repeat))
log_l = c(log_l, paste0('DE software used: ', DE_software))
log_l = c(log_l, paste0('Statistical cutoff for DE: ',
DE_cutoff_stat))
log_l = c(log_l, paste0('Absolute log2FC cutoff for DE: ',
DE_cutoff_Abs_logFC))
log_l = c(log_l, paste0('DE tables saved to drive: ', DE_save_table))
log_l = c(log_l, paste0('Statistic used in marginal plot: ',
marginalPlot_stat))
log_l = c(log_l, paste0('Number of CPU nodes used: ', num_workers))
writeLines(log_l, log)
close(log)
}
# save the condition table used to drive
write.csv(x = condition_table,
file = file.path(path, 'ERSSA_ConditionTable.csv'))
# create a list of ERSSA generated objects
ERSSA.objects = list(count_table.filtered=count_table.filtered,
samp.name.comb=combinations,
list.of.DE.genes=deg,
gg.dotPlot.obj = gg_dot,
gg.marinPlot.obj=gg_margin,
gg.intersectPlot.obj = gg_intersect,
gg.TPR_FPRPlot.obj=gg_TPR_FPR)
return(ERSSA.objects)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.