ERSSA: Empirical RNA-seq Sample Size Analysis

Documented in erssa

#' @title Empirical RNA-seq Sample Size Analysis
#'
#' @description
#' ERSSA is a package designed to test whether an currently available RNA-seq
#' dataset has sufficient biological replicates to detect a majority of
#' differentially expressed (DE) genes between two conditions. Base on the
#' number of biological replicates available, the algorithm subsamples at
#' step-wise replicate levels and uses existing differentially expression
#' analysis softwares (e.g. edgeR and DESeq2) to identify the number of DE
#' genes. This process is repeated for a given number of times with unique
#' combinations of samples to generate a distribution of DE genes at each
#' replicate level. Compare to existing RNA-seq sample size analysis
#' algorithms, ERSSA does not rely on any a priori assumptions about the
#' dataset, but rather uses an user-supplied pilot RNA-seq dataset to
#' determine whether the current replicate level is sufficient to detect a
#' majority of DE genes.
#'
#' @details
#' \code{erssa} function is a wrapper that calls several ERSSA functions in
#' sequence. For additional description of the functions called, please see
#' their respective manual.
#'
#' For the majority of current RNA-seq analysis, RNA-seq samples are aligned to
#' the reference, followed by running quantification packages to generate
#' count tables. ERSSA can then take the unfiltered count table and remove non-
#' to low-expressing genes with count_filter function. Alternatively, a
#' filtered count table can be supplied and filtering will be skipped.
#'
#' Next, unique combinations of samples at various biological replicate levels
#' will be generated by comb_gen function and passed to a differential
#' expression analysis software for statistical testing for DE genes. The
#' pipeline currently supports edgeR and DESeq2, but additional software
#' support can be easily added.
#'
#' The generated differential expression results are then analyzed by several
#' plotting functions briefly described here. ggplot2_dotplot function plots
#' the trend in DE gene identification. ggplot2_marginPlot function plots
#' the marginal change in the number of DE genes as replicate level
#' increases. ggplot2_intersectPlot function plots the number of DE genes that
#' is common across combinations. ggplot2_TPR_FPRPlot function plots the TPR
#' and FPR of DE detection using the full dataset's list of DE gene as the
#' ground truth. Base on insights from these plots, the user can determine
#' whether a desirable level of DE gene discovery has been reached.
#'
#' At the default setting, the results of statistical tests are not saved,
#' only the list of DE genes is. However, all of the test results can be
#' optionally saved for further analysis.
#'
#' At default setting, only one CPU node is employed and depend on the number
#' of tests that needs to be done, the calculations can take some time to
#' complete. If additional nodes are available, additional nodes can be
#' employed by specifying the num_workers argument. Parallel computing
#' requires BiocParallel package.
#'
#' The results including list of DE genes and ggplot2 objects are returned.
#' All runtime parameters can optionally be saved in a log file named
#' "erssa.log".
#'
#' @param count_table A RNA-seq count matrix with genes on each row and samples
#' on each column. If count_table has already been filtered to remove non- or
#' low-expressing genes, then counts_filtered argument should be changed to
#' TRUE.
#' @param condition_table A condition table with two columns and each sample as
#' a row. Column 1 contains sample names and Column 2 contains sample
#' conditions (e.g. Control, Treatment).
#' @param DE_ctrl_cond The name of control condition in the comparison. Must be
#' one of the two conditions in the condition table.
#' @param filter_cutoff The average CPM threshold set for filtering genes.
#' Default to 1.
#' @param counts_filtered Boolean. Whether count table has already been
#' filtered. Default = FALSE with the function run filtering by average CPM
#' at the cutoff specified by filter_cutoff value.
#' @param comb_gen_repeat The number of maximum unique combinations to generate
#' at each replicate level. More tests will be performed with a bigger value,
#' but run time also increases linearly. Default set to 30 unique combinations
#' at maximum.
#' @param DE_software The name of DE analysis software to use. Current options
#' include "edgeR" and "DESeq2". Default to "edgeR".
#' @param DE_cutoff_stat The cutoff in FDR or adjusted p-value used to
#' determine whether a gene is differentially expressed. Genes with lower
#' FDR or adjusted p-value pass the cutoff. Default = 0.05.
#' @param DE_cutoff_Abs_logFC The cutoff in abs(log2FoldChange) for differential
#' expression consideration. Genes with higher abs(log2FoldChange) pass the
#' cutoff. Default = 1.
#' @param DE_save_table Boolean. The results of differential expression tests
#' can be saved to the drive for further analysis. Default setting does not
#' save the results to save drive space. Default = FALSE.
#' @param marginalPlot_stat The statistic used for plotting of values in
#' marginal plot function. Options include 'mean', 'median'. Default='median'.
#' @param TPR_FPR_stat The statistics used to summarize TPR and FPR at
#' each replicate level in ggplot2_TPR_FPRPlot function. Options include
#' 'mean', 'median'. Default = 'mean'.
#' @param path The path to which the plots and results will be saved. Default
#' to current working directory.
#' @param num_workers Number of nodes to use for parallel computing the DE tests
#' @param save_log Boolean. Whether to save runtime parameters in log file.
#' Defualt to false.
#' @param save_plot Boolean. Wehther to save ggplot2 plots to drive. Default to
#' true.
#'
#' @return A list of objects generated during the analysis is returned:
#' \itemize{
#'  \item{count_table.filtered}{filtered count table}
#'  \item{samp.name.comb}{the samples involved in each statistical test}
#'  \item{list.of.DE.genes}{list of DE genes in each statistical test}
#'  \item{gg.dotplot.obj}{a list of objects that can be used to recreate the
#'  dot plot. See function ggplot2_dotplot manual for more detail.}
#'  \item{gg.marinPlot.obj}{a list of objects that can be used to recreate the
#'  marginal num. of DE genes plot. See function ggplot2_marginPlot manual
#'  for more detail.}
#'  \item{gg.intersectPlot.obj}{a list of objects that can be used to
#'  recreate the num. of intersect genes plot. See function
#'  ggplot2_intersectPlot manual for more detail.}
#'  \item{gg.TPR_FPRPlot.obj}{a list of objects that can be used to
#'  recreate the TPR vs. FPR plot. See function
#'  ggplot2_TPR_FPRPlot manual for more detail.}
#'           }
#'
#'
#' @author Zixuan Shao, \email{Zixuanshao.zach@@gmail.com}
#'
#' @examples
#' # load example dataset containing 1000 genes, 4 replicates and 5 comb. per
#' # rep. level
#' data(condition_table.partial, package = "ERSSA")
#' data(count_table.partial, package = "ERSSA")
#'
#' # run erssa with the "partial" dataset, use default edgeR for DE
#' ssa = erssa(count_table.partial, condition_table.partial,
#'             DE_ctrl_cond='heart')
#'
#' # run erssa with the "full" dataset containing 10 replicates per heart and
#' # muscle, all genes included.
#' # Remove comments to run
#' #   set.seed(1)
#' #   data(condition_table.full, package="ERSSA")
#' #   data(count_table.full, package="ERSSA")
#' #   ssa = erssa(count_table.full, condition_table.full, DE_ctrl_cond='heart')
#'
#' @references
#' Ching, Travers, Sijia Huang, and Lana X. Garmire. “Power Analysis and Sample
#' Size Estimation for RNA-Seq Differential Expression.” RNA, September 22,
#' 2014. https://doi.org/10.1261/rna.046011.114.
#'
#' Hoskins, Stephanie Page, Derek Shyr, and Yu Shyr. “Sample Size Calculation
#' for Differential Expression Analysis of RNA-Seq Data.” In Frontiers of
#' Biostatistical Methods and Applications in Clinical Oncology, 359–79.
#' Springer, Singapore, 2017. https://doi.org/10.1007/978-981-10-0126-0_22.
#'
#' @export
#'
#' @importFrom utils write.csv


erssa = function(count_table=NULL, condition_table=NULL, DE_ctrl_cond=NULL,
                 filter_cutoff=1, counts_filtered=FALSE,
                 comb_gen_repeat=30, DE_software='edgeR', DE_cutoff_stat = 0.05,
                 DE_cutoff_Abs_logFC = 1, DE_save_table=FALSE,
                 marginalPlot_stat='median', TPR_FPR_stat='mean',
                 path='.', num_workers=1, save_log=FALSE,
                 save_plot=TRUE){

    # check all required arguments supplied
    if (is.null(count_table)){
        stop('Missing required count_table argument')
    } else if (is.null(condition_table)){
        stop("Missing required condition_table argument")
    } else if (is.null(DE_ctrl_cond)){
        stop("Missing name of control condition in comparison")
    } else if (!(is.data.frame(count_table))){
        stop('count_table is not an expected data.frame object')
    } else if (!(is.data.frame(condition_table))){
        stop('condition_table is not an expected data.frame object')
    } else if (length(unique(sapply(count_table, class)))!=1){
        stop(paste0('More than one data type detected in count table, please ',
                    'make sure count table contains only numbers and that the ',
                    'list of gene names is the data.frame index'))
    } else if (!(DE_software %in% c('edgeR','DESeq2'))) {
        stop('only edgeR or DEseq2 supported')
    }

    if (save_log==TRUE){
        # start log file
        log = file(file.path(path, "ERSSA.log"))
        log_l = c('-------------', 'ERSSA run log', '-------------',' ')
        log_l = c(log_l, paste0('Start time: ',Sys.time()))
    }




    # filter the count table by average CPM value cutoff
    # default is to filter the count_table with cutoff=1
    if (counts_filtered==FALSE){
        count_table.filtered = count_filter(count_table=count_table,
                                            cutoff=filter_cutoff)
    } else if (counts_filtered==TRUE){
        count_table.filtered = count_table
    } else {
        stop("counts_filtered needs to be a boolean variable!")
    }


    # generate sample name combinations at various replicate levels
    combinations = comb_gen(condition_table=condition_table,
                            n_repetition=comb_gen_repeat)


    # run DE software to generate DE gene list
    if (DE_software=='edgeR' & num_workers==1){
        deg = erssa_edger(count_table.filtered=count_table.filtered,
                          combinations=combinations,
                          condition_table=condition_table,
                          control=DE_ctrl_cond, cutoff_stat = DE_cutoff_stat,
                          cutoff_Abs_logFC = DE_cutoff_Abs_logFC,
                          save_table=DE_save_table, path=path)

    } else if (DE_software=='DESeq2' & num_workers==1){
        deg = erssa_deseq2(count_table.filtered=count_table.filtered,
                           combinations=combinations,
                           condition_table=condition_table,
                           control=DE_ctrl_cond,
                           cutoff_stat = DE_cutoff_stat,
                           cutoff_Abs_logFC = DE_cutoff_Abs_logFC,
                           save_table=DE_save_table, path=path)

    } else if (DE_software=='edgeR' & num_workers>1){
        deg = erssa_edger_parallel(
            count_table.filtered=count_table.filtered,
            combinations=combinations,
            condition_table=condition_table, control=DE_ctrl_cond,
            cutoff_stat = DE_cutoff_stat,
            cutoff_Abs_logFC = DE_cutoff_Abs_logFC,
            save_table=DE_save_table, path=path, num_workers=num_workers)

    } else if (DE_software=='DESeq2' & num_workers>1){
        deg = erssa_deseq2_parallel(
            count_table.filtered=count_table.filtered,
            combinations=combinations,
            condition_table=condition_table, control=DE_ctrl_cond,
            cutoff_stat = DE_cutoff_stat,
            cutoff_Abs_logFC = DE_cutoff_Abs_logFC,
            save_table=DE_save_table, path=path, num_workers=num_workers)
    }

    # plot number of DE genes as dot plot and boxplot
    gg_dot = ggplot2_dotplot(deg=deg, path=path, save_plot=save_plot)

    # plot the marginal increase in DE genes with more replicates
    gg_margin = ggplot2_marginPlot(deg=deg, stat=marginalPlot_stat,
                                   path=path, save_plot=save_plot)

    # plot the number of DE genes that is common across combinations
    gg_intersect = ggplot2_intersectPlot(deg=deg, path=path,
                                         save_plot=save_plot)

    # plots the TPR and FPR of DE detection using the full dataset's list of DE
    # gene as the ground truth
    gg_TPR_FPR = ggplot2_TPR_FPRPlot(deg=deg, count_table.filtered=
                                         count_table.filtered,
                                     stat=TPR_FPR_stat,
                                     path=path, save_plot=save_plot)


    if (save_log==TRUE){
        # finish log file
        log_l = c(log_l, paste0('Finish time: ',Sys.time()), ' ')
        log_l = c(log_l, paste0('Control condition: ', DE_ctrl_cond))

        log_l = c(log_l, paste0('Count table filtered by ERSSA: ',
                                !counts_filtered))

        if (counts_filtered == FALSE){
            log_l = c(log_l, paste0('Count table CPM filter: ', filter_cutoff))
        }

        log_l = c(log_l, paste0('Num. of unique sample combination tested: ',
                                comb_gen_repeat))
        log_l = c(log_l, paste0('DE software used: ', DE_software))
        log_l = c(log_l, paste0('Statistical cutoff for DE: ',
                                DE_cutoff_stat))
        log_l = c(log_l, paste0('Absolute log2FC cutoff for DE: ',
                                DE_cutoff_Abs_logFC))
        log_l = c(log_l, paste0('DE tables saved to drive: ', DE_save_table))
        log_l = c(log_l, paste0('Statistic used in marginal plot: ',
                                marginalPlot_stat))
        log_l = c(log_l, paste0('Number of CPU nodes used: ', num_workers))
        writeLines(log_l, log)
        close(log)
    }

    # save the condition table used to drive
    write.csv(x = condition_table,
              file = file.path(path, 'ERSSA_ConditionTable.csv'))

    # create a list of ERSSA generated objects
    ERSSA.objects = list(count_table.filtered=count_table.filtered,
                         samp.name.comb=combinations,
                         list.of.DE.genes=deg,
                         gg.dotPlot.obj = gg_dot,
                         gg.marinPlot.obj=gg_margin,
                         gg.intersectPlot.obj = gg_intersect,
                         gg.TPR_FPRPlot.obj=gg_TPR_FPR)

    return(ERSSA.objects)
}