In xuqiao93/PowerExplorer: Power Estimation Tool for RNA-Seq and proteomics data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

\newpage

Abstract

This vignette demonstrates R package PowerExplorer as a power and sample size estimation tool for RNA-Seq and quantitative proteomics data.

PowerExplorer contains the following main features:

Estimation of power based on the current data
Prediction of power corresponding to the increased sample sizes
Result visualizations

Introduction

Power and sample size estimation is one of the important principles in designing next-generation sequencing experiments to discover differential expressions. PowerExplorer is a power estimation and prediction tool currently applicable to RNA-Seq and quantitative proteomics experiments.

The calculation procedure starts with estimating the distribution parameters of each gene or protein. With the obtained prior distribution of each feature, a specified amount of simulations are executed to generate data (read counts for RNA-Seq and protein abundance for proteomics) repetitively for each entry based on null and alternative hypotheses. Furthermore, the corresponding statistical tests (t-test or Wald-test) are performed and the test statistics are collected. Eventually the statistics will be summarized to calculate the statistical power.

\newpage

Input Data Preparation

For both RNA-Seq (gene expression levels) and quantitative proteomics (protein abundance levels) datasets, the data matrix should be arranged as genes/proteins in rows and samples in columns. Here we show a RNA dataset as an example:

library(PowerExplorer)
data("exampleProteomicsData")
head(exampleProteomicsData$dataMatrix)

A grouping vector indicating the sample groups to which all the samples belong should also be created, for example:

show(exampleProteomicsData$groupVec)

The sample groups corresponding to the data:

colnames(exampleProteomicsData$dataMatrix)

Note that the grouping vector length should be equal to the column number of the data matrix.

\newpage

Power Estimation

Here we use a randomly generated Proteomics dataset exampleProteomicsData as an example to estimate the current power of the dataset. The input dataset is named as dataMatrix and the grouping vector as groupVec.

To run the estimation, apart from the input, we still need to specify the following parameters:

isLogTransformed: FALSE; the input data is not log-transformed.
dataType: "Proteomics"; the datatype can be declared as "Proteomics" or "RNA-Seq".
minLFC: 0.5; the threshold of Log2 Fold Change, proteins with lower LFC will be discarded.
enableROTS: TRUE; Using Reproducibility-Optimized Test Statistic (ROTS) as the statistical model.
paraROTS: the parameters to be passed to ROTS (if enabled). Check ROTS documentation for further details on the parameters.
alpha: 0.05; the controlled false positive (Type I Error) rate.
ST: 50; the simulation of each gene will be run 50 times (ST>50 is recommended).
seed: 345; optional, a seed value for the random number generator to maintain the reproducibility.
showProcess: FALSE; no detailed processes will be shown, set to TRUE if debug is needed.
saveSimulatedData: FALSE; if TRUE, save the simulated data in ./savedData directory.

The results will be summaried in barplot, boxplot and summary table.

library(PowerExplorer)
data("exampleProteomicsData")
res <- estimatePower(inputObject = exampleProteomicsData$dataMatrix,
                     groupVec = exampleProteomicsData$groupVec,
                     isLogTransformed = FALSE,
                     dataType = "Proteomics",
                     minLFC = 0.5,
                     enableROTS = TRUE,
                     paraROTS = list(B = 1000, K = NULL),
                     alpha = 0.05,
                     ST = 50,
                     seed = 345, 
                     showProcess = FALSE, 
                     saveResultData = FALSE
                     )

\newpage

Visualization

The estimated results can be summarized using plotEstPwr, the only input needed is the estimatedPower, which should be the estimated power object returned from estimatePower.

plotEstPwr(res)

The graph contains 3 plots, the barplot vertically shows the number of genes/proteins above the minLFC threshold, columns indicates the comparison pairs, each column presents the proportions of three power levels in three colours as indicated in the legend power.level; The boxplot shows the overall power distribution of each comparsion; And the summary table summarize the power in a numerical way with the same information shown in the previous two plots.

\newpage

Result Summary

With the result PowerExplorerStorage object, summarized information can be shown by show method.

res

If interested in specific genes/proteins or a ranking list, one can use listEstPwr with the following parameters:

inputObject: A PowerExplorerStorage returned from PowerExplorer as input
decreasing: logical; TRUE, decreasing order; FALSE, increasing order.
top: numeric; the number of genes/proteins in the top list
selected: default as NA; specify as a list of geneID or protein ID to show power of a list of interested genes/proteins.

To show the top 10 genes in an example result object exampleObject in decreasing order:

data(exampleObject)
listEstPwr(exampleObject, decreasing = TRUE, top = 10)

To show the results of specific genes:

listEstPwr(exampleObject, 
           selected = c("ENSMUSG00000000303", 
                         "ENSMUSG00000087272", 
                         "ENSMUSG00000089921"))

\newpage

Power Predictions

With the same dataset, to run a prediction, a different parameter is needed:

rangeSimNumRep: the power of replicate number 5 to 20 will be predicted.

Similar to the estimation process, however, the simulations will be excuted with each sample size specified in rangeSimNumRep. (Note: the term sample size in this vignette refers to the replicate number of each group/case)

It is possible to append the prediction results within the same object by using the same result object as an input.

data("exampleProteomicsData")
res <- predictPower(inputObject = res,
                    groupVec = exampleProteomicsData$groupVec,
                    isLogTransformed = FALSE,
                    dataType = "Proteomics",
                    minLFC = 0.5,
                    rangeSimNumRep = c(5, 10, 15, 20),
                    enableROTS = TRUE,
                    paraROTS = list(B = 1000, K = NULL),
                    alpha = 0.05,
                    ST = 50,
                    seed = 345)

\newpage

Visualization

The predicted results can be summaried using plotPredPwr. The input should be the predicted power object returned from predictPower, the summary can be optionally visualized by setting the following parameters:

inputObject: A PowerExplorerStorage returned from PowerExplorer as input
minLFC and maxLFC: to observe power in a specific range of LFC
LFCscale: to determine the LFC scale of the observation

Lineplot (LFCscale = 0.5):

plotPredPwr(res, LFCscale = 0.5)

The output figure contains a lineplot and a summary table. For each comparison, the lineplot shows the power tendency across every Log2 Fold Change segment resulted from a complete LFC list divided by a specified LFCscale. Each dot on the lines represents the average power (y-axis) of the genes/proteins at a certain sample size (x-axis) within different LFC ranges. In addition, a summary table below displays the average power of each comparison across the sample sizes.

For instance, the line plot here shows the average power at four different sample sizes (5 to 30, with increment of 5) in LFCscale of 0.5. The LFC ranges from 0 to 5, and within each LFC segment, the graph shows the average power of the genes/proteins. Here, the higher LFC shows higher power, the average power of each LFC range increases with the larger sample sizes, as expected.

\newpage

Result Summary

With the result PowerExplorerStorage object, summarized information can be shown by show method. Both estimated and predicted results can be summaried.

res

If interested in specific genes/proteins or a ranking list of predicted powerat each sample size, one can use listPrePwr with the following parameters:

inputObject: A PowerExplorerStorage returned from PowerExplorer as input
decreasing: logical; TRUE, decreasing order; FALSE, increasing order.
top: numeric; the number of genes/proteins in the top list
selected: default as NA; specify as a list of geneID or protein ID to show power of a list of interested genes/proteins.

To show the top 10 genes in an example result object exampleObject in decreasing order at each sample size:

listPredPwr(exampleObject, decreasing = TRUE, top = 10)

To show the results of specific genes at each sample size:

listPredPwr(exampleObject, 
            selected = c("ENSMUSG00000000303", 
                         "ENSMUSG00000087272", 
                         "ENSMUSG00000089921"))

Parallel computation

The calculation may take much longer time when an input dataset contains more than thousands of features, especially for the power prediction process. The computational time can be significantly shortened by using parallelised computation, and the simulations will be distributed to multiple cores. This can be done by loading Bioconductor pacakge BiocParallel and then set the following arguments of estimatePower and predictPower: parallel=TRUE and BPPARAM=bpparam(). This will distribute the jobs to all the available cores. One can register the number of cores to be used by setting BPPARAM=MulticoreParam(4), for instance, distributing simulations (jobs) to 4 cores. However, MulticoreParam() only supports non-Windows platforms. For Windows platforms, one can use SnowParam() instead. For further details, please check the BiocParallel documentation.

xuqiao93/PowerExplorer documentation built on May 16, 2019, 9:13 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com