knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
\newpage
This vignette demonstrates R package PowerExplorer
as a power and sample size
estimation tool for RNA-Seq and quantitative proteomics data.
PowerExplorer
contains the following main features:
Power and sample size estimation is one of the important principles in designing
next-generation sequencing experiments to discover differential expressions.
PowerExplorer
is a power estimation and prediction tool currently applicable
to RNA-Seq and quantitative proteomics experiments.
The calculation procedure starts with estimating the distribution parameters of each gene or protein. With the obtained prior distribution of each feature, a specified amount of simulations are executed to generate data (read counts for RNA-Seq and protein abundance for proteomics) repetitively for each entry based on null and alternative hypotheses. Furthermore, the corresponding statistical tests (t-test or Wald-test) are performed and the test statistics are collected. Eventually the statistics will be summarized to calculate the statistical power.
\newpage
For both RNA-Seq (gene expression levels) and quantitative proteomics (protein abundance levels) datasets, the data matrix should be arranged as genes/proteins in rows and samples in columns. Here we show a RNA dataset as an example:
library(PowerExplorer) data("exampleProteomicsData") head(exampleProteomicsData$dataMatrix)
A grouping vector indicating the sample groups to which all the samples belong should also be created, for example:
show(exampleProteomicsData$groupVec)
The sample groups corresponding to the data:
colnames(exampleProteomicsData$dataMatrix)
Note that the grouping vector length should be equal to the column number of the data matrix.
\newpage
Here we use a randomly generated Proteomics dataset exampleProteomicsData
as an example to estimate the current power of the dataset.
The input dataset is named as dataMatrix
and the grouping vector
as groupVec
.
To run the estimation, apart from the input, we still need to specify the following parameters:
isLogTransformed
: FALSE; the input data is not log-transformed.dataType
: "Proteomics"; the datatype can be declared as "Proteomics" or
"RNA-Seq".minLFC
: 0.5; the threshold of Log2 Fold Change, proteins with lower LFC
will be discarded.enableROTS
: TRUE; Using Reproducibility-Optimized Test Statistic
(ROTS) as the statistical model.paraROTS
: the parameters to be passed to ROTS (if enabled).
Check ROTS documentation for further details on the parameters.alpha
: 0.05; the controlled false positive (Type I Error) rate.ST
: 50; the simulation of each gene will be run 50 times (ST>50 is
recommended).seed
: 345; optional, a seed value for the random number generator to
maintain the reproducibility.showProcess
: FALSE; no detailed processes will be shown, set to TRUE if
debug is needed.saveSimulatedData
: FALSE; if TRUE, save the simulated data in ./savedData
directory.The results will be summaried in barplot, boxplot and summary table.
library(PowerExplorer) data("exampleProteomicsData") res <- estimatePower(inputObject = exampleProteomicsData$dataMatrix, groupVec = exampleProteomicsData$groupVec, isLogTransformed = FALSE, dataType = "Proteomics", minLFC = 0.5, enableROTS = TRUE, paraROTS = list(B = 1000, K = NULL), alpha = 0.05, ST = 50, seed = 345, showProcess = FALSE, saveResultData = FALSE )
\newpage
The estimated results can be summarized using plotEstPwr
, the only input
needed is the estimatedPower
, which should be the estimated power object
returned from estimatePower
.
plotEstPwr(res)
The graph contains 3 plots, the barplot
vertically shows the number of
genes/proteins above the minLFC threshold, columns indicates the comparison
pairs, each column presents the proportions of three power levels in three
colours as indicated in the legend power.level
; The boxplot shows the overall
power distribution of each comparsion; And the summary table summarize the power
in a numerical way with the same information shown in the previous two plots.
\newpage
With the result PowerExplorerStorage
object, summarized information can be shown by
show
method.
res
If interested in specific genes/proteins or a ranking list, one can use
listEstPwr
with the following parameters:
inputObject
: A PowerExplorerStorage returned from PowerExplorer as inputdecreasing
: logical; TRUE, decreasing order; FALSE, increasing order.top
: numeric; the number of genes/proteins in the top listselected
: default as NA; specify as a list of geneID or protein ID
to show power of a list of interested genes/proteins.To show the top 10 genes in an example result object exampleObject
in decreasing order:
data(exampleObject) listEstPwr(exampleObject, decreasing = TRUE, top = 10)
To show the results of specific genes:
listEstPwr(exampleObject, selected = c("ENSMUSG00000000303", "ENSMUSG00000087272", "ENSMUSG00000089921"))
\newpage
With the same dataset, to run a prediction, a different parameter is needed:
rangeSimNumRep
: the power of replicate number 5 to 20 will be predicted.Similar to the estimation process, however, the simulations will be excuted with
each sample size specified in rangeSimNumRep
. (Note: the term sample size in
this vignette refers to the replicate number of each group/case)
It is possible to append the prediction results within the same object by using the same result object as an input.
data("exampleProteomicsData") res <- predictPower(inputObject = res, groupVec = exampleProteomicsData$groupVec, isLogTransformed = FALSE, dataType = "Proteomics", minLFC = 0.5, rangeSimNumRep = c(5, 10, 15, 20), enableROTS = TRUE, paraROTS = list(B = 1000, K = NULL), alpha = 0.05, ST = 50, seed = 345)
\newpage
The predicted results can be summaried using plotPredPwr
. The input should be
the predicted power object returned from predictPower
, the summary can be
optionally visualized by setting the following parameters:
inputObject
: A PowerExplorerStorage returned from PowerExplorer as inputminLFC
and maxLFC
: to observe power in a specific range of LFCLFCscale
: to determine the LFC scale of the observationLineplot (LFCscale = 0.5):
plotPredPwr(res, LFCscale = 0.5)
The output figure contains a lineplot and a summary table.
For each comparison, the lineplot shows the power tendency across
every Log2 Fold Change segment resulted from a complete LFC list divided
by a specified LFCscale
. Each dot on the lines represents the average
power (y-axis) of the genes/proteins at a certain sample size (x-axis)
within different LFC ranges. In addition, a summary table below displays
the average power of each comparison across the sample sizes.
For instance, the line plot here shows the average power at four different sample sizes (5 to 30, with increment of 5) in LFCscale of 0.5. The LFC ranges from 0 to 5, and within each LFC segment, the graph shows the average power of the genes/proteins. Here, the higher LFC shows higher power, the average power of each LFC range increases with the larger sample sizes, as expected.
\newpage
With the result PowerExplorerStorage
object, summarized information can be shown by
show
method. Both estimated and predicted results can be summaried.
res
If interested in specific genes/proteins or a ranking list of predicted
powerat each sample size, one can use listPrePwr
with the following
parameters:
inputObject
: A PowerExplorerStorage returned from PowerExplorer as inputdecreasing
: logical; TRUE, decreasing order; FALSE, increasing order.top
: numeric; the number of genes/proteins in the top listselected
: default as NA; specify as a list of geneID or protein ID
to show power of a list of interested genes/proteins.To show the top 10 genes in an example result object exampleObject
in decreasing order at each sample size:
listPredPwr(exampleObject, decreasing = TRUE, top = 10)
To show the results of specific genes at each sample size:
listPredPwr(exampleObject, selected = c("ENSMUSG00000000303", "ENSMUSG00000087272", "ENSMUSG00000089921"))
The calculation may take much longer time when an input dataset contains more than
thousands of features, especially for the power prediction process.
The computational time can be significantly shortened by using parallelised
computation, and the simulations will be distributed to multiple cores.
This can be done by loading Bioconductor pacakge BiocParallel and then set the
following arguments of estimatePower
and predictPower
:
parallel=TRUE
and BPPARAM=bpparam()
.
This will distribute the jobs to all the available cores.
One can register the number of cores to be used by
setting BPPARAM=MulticoreParam(4)
,
for instance, distributing simulations (jobs) to 4 cores.
However, MulticoreParam()
only supports non-Windows platforms.
For Windows platforms, one can use SnowParam()
instead. For further details,
please check the BiocParallel documentation.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.