library(fbseq) library(methods)
The fbseq
package part of a collection of packages for the fully Bayesian analysis of RNA-sequencing (RNA-seq) count data, where a hierarchical model is fit with Markov chain Monte Carlo (MCMC). fbseq
is the user interface, and it contains top level functions for calling the MCMC and analyzing output. The other packages, fbseqOpenMP
and fbseqCUDA
, are backend packages that run the MCMC behind the scenes. Only one is required. fbseqOpenMP
can run on most machines, but it is slow for large datasets. fbseqCUDA
requires special hardware (i.e. a CUDA general-purpose graphics processing unit), but it's much faster due to parallel computing.
An understanding of the underlying hierarchical model is important for understanding how to use this package. For best viewing, download the html file to your desktop and then open it with a browser.
See the "Depends", "Imports", and "Suggests" fields of the "package's DESCRIPTION R version and R package requirements.
fbseq
For fbseq
, you can navigate to a list of stable releases on the project's GitHub page. Download the desired tar.gz
bundle, then install it either with install.packages(..., repos = NULL, type="source")
from within R R CMD INSTALL
from the Unix/Linux command line.
install_github
to install the development version.For this option, you need the devtools
package, available from CRAN or GitHub. Open R and run
library(devtools) install_github("wlandau/fbseq")
Open a command line program such as Terminal in Mac/Linux and enter the following commands.
git clone git@github.com:wlandau/fbseq.git R CMD build fbseq R CMD INSTALL ...
where ...
is replaced by the name of the tarball produced by R CMD build
.
fbseqOpenMP
and fbseqCUDA
, are backend packages that run the MCMC behind the scenes. Only one is required. fbseqOpenMP
can run on most machines, but it is slow for large datasets. fbseqCUDA
requires special hardware (i.e. a CUDA general-purpose graphics processing unit), but it's much faster due to parallel computing. Installation is similar to that of fbseq
and is detailed in their respective README.md
files and package vignettes.
After installing fbseq
and fbseqOpenMP
, the following should take a couple seconds to run. The example walks through an example data analysis and shows a few key features of the package.
library(fbseq) back_end = "OpenMP" # change this to "CUDA" to use fbseqCUDA as the backend # Example RNA-seq dataset wrapped in an S4 class. data(paschold) # Use a small subset of the data (few genes). paschold@counts = head(paschold@counts, 20) # Run the MCMC for only a few iterations. configs = Configs(burnin = 10, iterations = 10, thin = 1) # Set up the MCMC. chain = Chain(paschold, configs) # Run 4 dispersed independent Markov chains. chain_list = fbseq(chain, backend = back_end) # Get total runtime. attr(chain_list, "runtime") # Monitor convergence with in all parameters with # Gelman-Rubin potential scale reduction factors head(psrf(chain_list)) # Means, standard deviations, and credible intervals of all parameters head(estimates(chain_list)) # Means, standard deviations, and credible intervals of # linear combinations of the "beta" parameters used to calculate # gene-specific posterior probabilities. head(contrast_estimates(chain_list)) # Gene-specific (row-specific) posterior probabilities head(probs(chain_list)) # Monte Carlo samples on a small subset of parameters m = mcmc_samples(chain_list) dim(m) m[1:5, 1:5] # Set max_iter=Inf below to automatically run until convergence. max_iter = 3 iter = 1 chain@verbose = as.integer(0) # turn off console messages chain_list = fbseq(chain, backend = back_end) # saveRDS(chain_list, "chain_list_so_far.rds") # option to save progress gelman = psrf(chain_list) if(any(gelman >= 1.1)) while(iter < max_iter){ chain_list = lapply(chain_list, fbseq, additional_chains = 0, backend = back_end) gelman = psrf(chain_list) # saveRDS(chain_list, "chain_list_so_far.rds") # option to save progress if(all(gelman < 1.1)) break iter = iter + 1 }
There are multiple options for activating parallel computing.
backend = "CUDA"
rather than backend = "OpenMP
in the fbseq
function. This option uses CUDA
to massively parallelize each individual MCMC chain if CUDA
is installed on your machine.backend = "OpenMP"
and threads = n
in the fbseq
function, where n
is greater than 1. This option uses OpenMP
to
parallelize each individual MCMC chain if OpenMP
is available on your machine.backend = "OpenMP"
and processes = n
in the fbseq
function, where n
is greater than 1. Whether OpenMP
is installed
or not, this option will distribute some of the MCMC chains over n
parallel processes. Cannot combine with backend = "CUDA"
or threads = n
(n > 1
).Here is the breakdown for finer control of the package.
Scenario
object.Scenario
is an S4 class for storing the count data, design matrix, and instructions for calculating posterior probabilities of quantities of interest.
str(new("Scenario"))
Type help("Scenario-class")
for details about the slots. When you have these slots ready, use the Scenario
function to create a Scenario
object.
scenario = Scenario(counts = counts, design = design, contrasts = contrasts, bounds = bounds, propositions = propositions, supplement = supplement)
For this constructor, arguments should be explicitly named. For example,
scenario = Scenario(counts, design, contrasts, bounds, propositions, supplement)
would throw an error.
The package comes with a Scenario
with data taken from @paschold. The data is from an RNA-sequencing (RNA-seq) data experiment on maize.
data(paschold) head(paschold@counts)
The dataset is large, so if you're just trying out the package on your home computer, it would be best to simply use a subset of the data
dim(paschold@counts) paschold@counts = head(paschold@counts, 20) dim(paschold@counts)
In the data, the genes are rows and the columns are RNA-seq replicates. Each count is the relative measure of the expression of a gene in a replicate. The observations are divided into 4 genetic varieties:
B73
: an inbred Iowa maize varietyMo17
: an inbred Missouri maize varietyB73xMo17
: a first-generation (F1) hybrid produced from pollenating B73
with Mo17
Mo17xB73
: an F1 hybrid from pollenating Mo17
with B73
The goal of this scenario is to detect genes that display heterosis, or hybrid vigor, with respect to expression level (RNA-seq count). Heterosis is the phenomenon in which a hybrid surpasses both parents with respect to a given trait. The different types of heterosis are defined as follows in terms of the log RNA-seq counts.
|------------------------------------|----------|--------------------------------------------------|
| high-parent heterosis of the hybrid mean | $\qquad$ | hybrid mean > max(B73
and Mo17
) |
| low-parent heterosis of the hybrid mean | | hybrid mean < min(B73
and Mo17
) |
| high-parent heterosis of B73xMo17
| | B73xMo17
> max(B73
and Mo17
) |
| low-parent heterosis of B73xMo17
| | B73xMo17
< min(B73
and Mo17
) |
| high-parent heterosis of Mo17xB73
| | Mo17xB73
> max(B73
and Mo17
) |
| low-parent heterosis of Mo17xB73
| | Mo17xB73
< min(B73
and Mo17
) |
To define the parameterization of the model coefficients $\beta_{\ell, g}$, we use the design matrix slot in our Scenario
object.
paschold@design
We can interpret each parameters as follows.
|------------------------------------|----------|--------------------------------------------------|
| $\beta_{1, g}$ | $\qquad$ | mean of B73
and Mo17
|
| $\beta_{2, g}$ | | half the difference between Mo17
and the hybrid mean |
| $\beta_{3, g}$ | | half the difference between B73
and the hybrid mean |
|$\beta_{4, g}$ | | difference between B73xMo17
and Mo17xB73
|
|$\beta_{5, g}$ | | experimental block effect |
In terms of the $\beta_{\ell, g}$'s, the conditions for heterosis become the following.
|------------------------------------|----------|--------------------------------------------------|
| high-parent heterosis of the hybrid mean | $\qquad$ | $\beta_{2, g} > 0$ and $\beta_{3, g} > 0$ |
| low-parent heterosis of the hybrid mean | | $-\beta_{2, g} > 0$ and $-\beta_{3, g} > 0$ |
| high-parent heterosis of B73xMo17
| | $2\beta_{2, g} + \beta_{4, g} > 0$ and $2\beta_{3, g} + \beta_{4, g} > 0$ |
| low-parent heterosis of B73xMo17
| | $-2\beta_{2, g} - \beta_{4, g} > 0$ and $-2\beta_{3, g} - \beta_{4, g} > 0$ |
| high-parent heterosis of Mo17xB73
| | $2\beta_{2, g} - \beta_{4, g} > 0$ and $2\beta_{3, g} - \beta_{4, g} > 0$ |
| low-parent heterosis of Mo17xB73
| | $-2\beta_{2, g} + \beta_{4, g} > 0$ and $-2\beta_{3, g} + \beta_{4, g} > 0$ |
Gene-specific (i.e., row-specific) posterior probabilities are calculated using Monte Carlo samples $\beta_{\ell, g}^{(m)}$ of the $\beta_{\ell, g}$'s for MCMC iteration $m = 1, \ldots, M$. Below, $\text{I}(\cdot)$ is the indicator function (which returns 1 when its argument is true and 0 otherwise).
$$ \begin{align} P(\text{high-parent heterosis, mean of hybrids} \ | \ \text{data}) &\approx \frac{1}{M} \sum_{m = 1}^M \text{I}\left (\beta_{2, g}^{(m)} > 0 \text{ and } \beta_{3, g}^{(m)} > 0 \right ) \ P(\text{low-parent heterosis, mean of hybrids}) \ | \ \text{data}) &\approx \frac{1}{M} \sum_{m = 1}^M \text{I}\left ( -\beta_{2, g}^{(m)} > 0 \text{ and } -\beta_{3, g}^{(m)} > 0 \right ) \ P(\text{high-parent heterosis, B73xMo17}) \ | \ \text{data}) &\approx \frac{1}{M} \sum_{m = 1}^M \text{I}\left ( 2\beta_{2, g}^{(m)} + \beta_{4, g}^{(m)} > 0 \text{ and } 2\beta_{3, g}^{(m)} + \beta_{4, g}^{(m)} > 0 \right ) \ P(\text{low-parent heterosis, B73xMo17}) \ | \ \text{data}) &\approx \frac{1}{M} \sum_{m = 1}^M \text{I}\left ( -2\beta_{2, g}^{(m)} - \beta_{4, g}^{(m)} > 0 \text{ and } -2\beta_{3, g}^{(m)} - \beta_{4, g}^{(m)} > 0 \right ) \ P(\text{high-parent heterosis, Mo17xB73}) \ | \ \text{data}) &\approx \frac{1}{M} \sum_{m = 1}^M \text{I}\left ( 2\beta_{2, g}^{(m)} - \beta_{4, g}^{(m)} > 0 \text{ and } 2\beta_{3, g}^{(m)} - \beta_{4, g}^{(m)} > 0 \right ) \ P(\text{low-parent heterosis, Mo17xB73}) \ | \ \text{data}) &\approx \frac{1}{M} \sum_{m = 1}^M \text{I}\left ( -2\beta_{2, g}^{(m)} + \beta_{4, g}^{(m)} > 0 \text{ and }-2\beta_{3, g}^{(m)} + \beta_{4, g}^{(m)} > 0 \right ) \end{align} $$
To tell the software to calculate these posterior probabilities, use of the contrasts
, bounds
, and propositions
slots of the Scenario
object. In general, the package can calculate probabilities of that depend on linear combinations of the $\beta_{\ell, g}$'s, and the contrasts
slot specifies these linear combinations.
paschold@contrasts
As in the heterosis example, the probabilities are calculated by comparing the linear combinations to lower bounds. In the heterosis example, linear combinations are being compared to zero, so all the slots in the bounds
slot should be zero.
paschold@bounds
The conditions for heterosis are formulated as logical propositions, and these propositions are logical conjunctions that depend on the linear combinations and lower bounds from before. To write these heterosis propositions in terms of contrasts
and bounds
, use the propositions
slot.
paschold@propositions
With the contrasts
, bounds
, and propositions
defined, the-specific posterior probabilities are fully-specified.
Lastly, every Scenario
object has an optional suppelement
slot with additional information about the scenario. In this case, paschold@supplement
is an empty list.
scenario_heterosis_model
To simulate a heterosis scenario similar to that of @paschold, use the scenario_heterosis_model
function. Count data are simulated from the model given fixed values for the hyperparameters.
s = scenario_heterosis_model(genes = 1000, libraries = 16) head(s@counts) s@design s@contrasts s@bounds s@propositions
Configs
object.The Configs
S4 class contains the practical MCMC configuration parameters, such as the number of Monte Carlo iterations and the length of burnin, and important details about model specification. All of them have default settings, so just for the sake of trying out the package, you should be able to get away with calling the Configs
function with no arguments. For a full list of slots and their explanations, type help("Configs-class")
.
The main slots to consider are the iterations
, burnin
, and thin
slots, which control the duration of the MCMC. The total number of MCMC iterations is burnin + iterations * thin
. burinin
is the number of throwaway iterations at the beginning of the MCMC. Afterwards the next iterations * thin
iterations are used to compute posterior probabilities and estimated means and mean squares for every parameter. Only iterations
Monte Carlo parameter samples are returned to the user, and then, only for parameters specified in the parameter_sets_return
slot. The parameter_sets_update
slot specifies which parameters to update during the MCMC, giving the user the option to run empirical Bayes, for example. For gene-specific parameters in parameter_sets_update
, not all parameters are returned for all the genes. (Downloading data from the GPU is expensive). Select the genes for parameter samples you want with the genes_return
slot. For the $\epsilon_{n, g}$ parameters, use genes_return_epsilon
to specify the values of $g$ and libraries_return_epsilon
to specify the values of $n$.
Starts
object.The Starts
S4 class has the starting values of an MCMC chain. Usually, the user does not need to worry about it, as the unspecified starting values are automatically generated. However, the user may wish to specify starting values by setting the slots manually. For more information, type help("Starts-class")
and be sure to read the vignette on the hierarchical model.
Chain
object.The Chain
class is the container for a whole Markov chain. It stores all the information in your Scenario
, Configs
, and Starts
objects, in addition to Monte Carlo output and results. There are many, many S4 slots. For details on the slots, type help("Chain-class")
and be sure to read the hierarchical model vignette. For help on how to create Chain
objects, type help("Chain")
.
With only a Scenario
object, you can create a Chain
object with
chain = Chain(scenario)
If you have Configs
and/or Starts
objects, you can pass them in as well.
chain = Chain(scenario, configs, starts)
As an example, here is the Chain
created from the paschold
scenario visited earlier.
chain = Chain(paschold) # Call str(chain) to inspect the slots.
You can recover the scenario, configuration, or starting values of a Chain
object with the Scenario
, Configs
, and Starts
objects. For example,
scenario = Scenario(chain) configs = Configs(chain) starts = Starts(chain) str(configs)
Note that the unspecified starting values were automatically generated in the construction of the Chain
object.
fbseq
on the Chain
object.The fbseq
function runs the MCMC algorithm to estimate parameters and calculate posterior probabilities. It takes a single Chain
object as an argument, and outputs either another Chain
object or a list of Chain
objects, depending on the value of additional_chains
. To run a single chain, use:
out = fbseq(chain, additional_chains = 0)
Here, out
is another Chain
object with the results. The starting values of out
are the final parameter samples of the previous MCMC. That way, you can call fbseq
on out
and the MCMC will resume where it last left off.
continuation = fbseq(out, additional_chains = 0)
To monitor convergence, you can run multiple chains and use the psrf
function to compute Gelman-Rubin potential scale reduction factors on a list of chains [@bda]. For example, to automatically run the MCMC until convergence, you could do the following.
chain_list = fbseq(chain) # equivalent to fbseq(chain, additional_chains = 3) saveRDS(chain, "chain_list_so_far.rds") if(any(psrf(chain_list)) >= 1.1) repeat{ chain_list = lapply(chain_list, fbseq) saveRDS(chain, "chain_list_so_far.rds") if(all(psrf(chain_list)) < 1.1) break }
If the additional_chains
argument is greater than 0, an additional additional_chains
MCMC chains will be run after the first one, and the output object from fbseq
will a list of the total 1 + additional_chains
run. This is useful for computing Gelman-Rubin potential scale reduction factors with the psrf
function to assess convergence.
out = fbseq(chain, additional_chains = 3) psrf(out)
fbseqOpenMP
or fbseqCUDA
fbseqOpenMP
can run on most machines, but it is slow for datasets with lots of genes. Select the fbseqOpenMP
backend by specifying backend = "OpenMP"
in fbseq
:
out = fbseq(chain, backend = "OpenMP")
For the OpenMP backend, you can distribute the MCMC chains after the first chain over multiple parallel processes.
out = fbseq(chain, backend = "OpenMP", processes = 3)
fbseqCUDA
requires a CUDA-capable graphics processing unit (GPU), but it is much faster. Elapsed runtimes are in hours rather than days for typical-sized RNA-seq datasets. Select this backend with either
out = fbseq(chain, backend = "CUDA")
or simply
out = fbseq(chain)
The fbseq
package has several functions for analyzing MCMC output. Below obj
, can either be a Chain
object or a list of Chain
objects.
|------------------------------------|----------|--------------------------------------------------|
| psrf(obj)
| | Gelman-Rubin potential scale reduction factors on all parameters, used to monitor convergence [@bda]. Here, obj
must be a list of at least 2 Chain
objects. |
| probs(obj)
| | Estimated gene-specific posterior probabilities specified in Scenario
|
| effect_sizes(obj)
| | Effect sizes that correspond to the posterior probabilities. For example, for "$P (2\beta_{2, g} - \beta_{4, g} > 1$ and $2\beta_{3, g} - \beta_{4, g} > 0.5 | \text{data})$", the effect size is the positive part of $\min \left (2\beta_{2, g} - \beta_{4, g} - 1, \ 2\beta_{3, g} - \beta_{4, g} - 0.5 \right )$. |
| estimates(obj, level)
| | Posterior means, standard deviations, and credible intervals (of level level
) for all parameters. The credible intervals are computed using the normal distributions given by the posterior means and standard deviations. |
|mcmc_samples(obj)
| $\qquad$ | Monte Carlo samples of returned parameters selected in Configs
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.