library(fbseq)
library(methods)

Introduction

The fbseq package part of a collection of packages for the fully Bayesian analysis of RNA-sequencing (RNA-seq) count data, where a hierarchical model is fit with Markov chain Monte Carlo (MCMC). fbseq is the user interface, and it contains top level functions for calling the MCMC and analyzing output. The other packages, fbseqOpenMP and fbseqCUDA , are backend packages that run the MCMC behind the scenes. Only one is required. fbseqOpenMP can run on most machines, but it is slow for large datasets. fbseqCUDA requires special hardware (i.e. a CUDA general-purpose graphics processing unit), but it's much faster due to parallel computing.

Read the model vignette first.

An understanding of the underlying hierarchical model is important for understanding how to use this package. For best viewing, download the html file to your desktop and then open it with a browser.

Check your system.

See the "Depends", "Imports", and "Suggests" fields of the "package's DESCRIPTION R version and R package requirements.

Install fbseq

Option 1: install a stable release (recommended).

For fbseq , you can navigate to a list of stable releases on the project's GitHub page. Download the desired tar.gz bundle, then install it either with install.packages(..., repos = NULL, type="source") from within R R CMD INSTALL from the Unix/Linux command line.

Option 2: use install_github to install the development version.

For this option, you need the devtools package, available from CRAN or GitHub. Open R and run

library(devtools)
install_github("wlandau/fbseq")

Option 3: build the development version from the source.

Open a command line program such as Terminal in Mac/Linux and enter the following commands.

git clone git@github.com:wlandau/fbseq.git
R CMD build fbseq
R CMD INSTALL ...

where ... is replaced by the name of the tarball produced by R CMD build.

Install an MCMC backend package

fbseqOpenMP and fbseqCUDA, are backend packages that run the MCMC behind the scenes. Only one is required. fbseqOpenMP can run on most machines, but it is slow for large datasets. fbseqCUDA requires special hardware (i.e. a CUDA general-purpose graphics processing unit), but it's much faster due to parallel computing. Installation is similar to that of fbseq and is detailed in their respective README.md files and package vignettes.

Quick start

After installing fbseq and fbseqOpenMP , the following should take a couple seconds to run. The example walks through an example data analysis and shows a few key features of the package.

library(fbseq)

back_end = "OpenMP" # change this to "CUDA" to use fbseqCUDA as the backend

# Example RNA-seq dataset wrapped in an S4 class.
data(paschold) 

# Use a small subset of the data (few genes).
paschold@counts = head(paschold@counts, 20) 

# Run the MCMC for only a few iterations.
configs = Configs(burnin = 10, iterations = 10, thin = 1) 

# Set up the MCMC.
chain = Chain(paschold, configs) 

# Run 4 dispersed independent Markov chains.
chain_list = fbseq(chain, backend = back_end)

# Get total runtime.
attr(chain_list, "runtime")

# Monitor convergence with in all parameters with 
# Gelman-Rubin potential scale reduction factors
head(psrf(chain_list)) 

# Means, standard deviations, and credible intervals of all parameters 
head(estimates(chain_list))

# Means, standard deviations, and credible intervals of 
# linear combinations of the "beta" parameters used to calculate
# gene-specific posterior probabilities.
head(contrast_estimates(chain_list))

# Gene-specific (row-specific) posterior probabilities
head(probs(chain_list))

# Monte Carlo samples on a small subset of parameters
m = mcmc_samples(chain_list) 
dim(m)
m[1:5, 1:5]

# Set max_iter=Inf below to automatically run until convergence.
max_iter = 3
iter = 1
chain@verbose = as.integer(0) # turn off console messages
chain_list = fbseq(chain, backend = back_end)
# saveRDS(chain_list, "chain_list_so_far.rds") # option to save progress
gelman = psrf(chain_list)
if(any(gelman >= 1.1)) while(iter < max_iter){
  chain_list = lapply(chain_list, fbseq, additional_chains = 0, backend = back_end)
  gelman = psrf(chain_list)
  # saveRDS(chain_list, "chain_list_so_far.rds") # option to save progress
  if(all(gelman < 1.1)) break
  iter = iter + 1
}

Activate parallel computing

There are multiple options for activating parallel computing.

Operational details

Here is the breakdown for finer control of the package.

Create a Scenario object.

Scenario is an S4 class for storing the count data, design matrix, and instructions for calculating posterior probabilities of quantities of interest.

str(new("Scenario"))

Type help("Scenario-class") for details about the slots. When you have these slots ready, use the Scenario function to create a Scenario object.

scenario = Scenario(counts = counts, design = design, 
  contrasts = contrasts, bounds = bounds, propositions = propositions, 
  supplement = supplement)

For this constructor, arguments should be explicitly named. For example,

scenario = Scenario(counts, design, contrasts, bounds, propositions, supplement)

would throw an error.

Example scenario from @paschold

The package comes with a Scenario with data taken from @paschold. The data is from an RNA-sequencing (RNA-seq) data experiment on maize.

data(paschold)
head(paschold@counts)

The dataset is large, so if you're just trying out the package on your home computer, it would be best to simply use a subset of the data

dim(paschold@counts)
paschold@counts = head(paschold@counts, 20)
dim(paschold@counts)

In the data, the genes are rows and the columns are RNA-seq replicates. Each count is the relative measure of the expression of a gene in a replicate. The observations are divided into 4 genetic varieties:

The goal of this scenario is to detect genes that display heterosis, or hybrid vigor, with respect to expression level (RNA-seq count). Heterosis is the phenomenon in which a hybrid surpasses both parents with respect to a given trait. The different types of heterosis are defined as follows in terms of the log RNA-seq counts.

|------------------------------------|----------|--------------------------------------------------| | high-parent heterosis of the hybrid mean | $\qquad$ | hybrid mean > max(B73 and Mo17) | | low-parent heterosis of the hybrid mean | | hybrid mean < min(B73 and Mo17) | | high-parent heterosis of B73xMo17 | | B73xMo17 > max(B73 and Mo17) | | low-parent heterosis of B73xMo17 | | B73xMo17 < min(B73 and Mo17) | | high-parent heterosis of Mo17xB73 | | Mo17xB73 > max(B73 and Mo17) | | low-parent heterosis of Mo17xB73 | | Mo17xB73 < min(B73 and Mo17) |

To define the parameterization of the model coefficients $\beta_{\ell, g}$, we use the design matrix slot in our Scenario object.

paschold@design

We can interpret each parameters as follows.

|------------------------------------|----------|--------------------------------------------------| | $\beta_{1, g}$ | $\qquad$ | mean of B73 and Mo17 | | $\beta_{2, g}$ | | half the difference between Mo17 and the hybrid mean | | $\beta_{3, g}$ | | half the difference between B73 and the hybrid mean | |$\beta_{4, g}$ | | difference between B73xMo17 and Mo17xB73 | |$\beta_{5, g}$ | | experimental block effect |

In terms of the $\beta_{\ell, g}$'s, the conditions for heterosis become the following.

|------------------------------------|----------|--------------------------------------------------| | high-parent heterosis of the hybrid mean | $\qquad$ | $\beta_{2, g} > 0$ and $\beta_{3, g} > 0$ | | low-parent heterosis of the hybrid mean | | $-\beta_{2, g} > 0$ and $-\beta_{3, g} > 0$ | | high-parent heterosis of B73xMo17 | | $2\beta_{2, g} + \beta_{4, g} > 0$ and $2\beta_{3, g} + \beta_{4, g} > 0$ | | low-parent heterosis of B73xMo17 | | $-2\beta_{2, g} - \beta_{4, g} > 0$ and $-2\beta_{3, g} - \beta_{4, g} > 0$ | | high-parent heterosis of Mo17xB73 | | $2\beta_{2, g} - \beta_{4, g} > 0$ and $2\beta_{3, g} - \beta_{4, g} > 0$ | | low-parent heterosis of Mo17xB73 | | $-2\beta_{2, g} + \beta_{4, g} > 0$ and $-2\beta_{3, g} + \beta_{4, g} > 0$ |

Gene-specific (i.e., row-specific) posterior probabilities are calculated using Monte Carlo samples $\beta_{\ell, g}^{(m)}$ of the $\beta_{\ell, g}$'s for MCMC iteration $m = 1, \ldots, M$. Below, $\text{I}(\cdot)$ is the indicator function (which returns 1 when its argument is true and 0 otherwise).

$$ \begin{align} P(\text{high-parent heterosis, mean of hybrids} \ | \ \text{data}) &\approx \frac{1}{M} \sum_{m = 1}^M \text{I}\left (\beta_{2, g}^{(m)} > 0 \text{ and } \beta_{3, g}^{(m)} > 0 \right ) \ P(\text{low-parent heterosis, mean of hybrids}) \ | \ \text{data}) &\approx \frac{1}{M} \sum_{m = 1}^M \text{I}\left ( -\beta_{2, g}^{(m)} > 0 \text{ and } -\beta_{3, g}^{(m)} > 0 \right ) \ P(\text{high-parent heterosis, B73xMo17}) \ | \ \text{data}) &\approx \frac{1}{M} \sum_{m = 1}^M \text{I}\left ( 2\beta_{2, g}^{(m)} + \beta_{4, g}^{(m)} > 0 \text{ and } 2\beta_{3, g}^{(m)} + \beta_{4, g}^{(m)} > 0 \right ) \ P(\text{low-parent heterosis, B73xMo17}) \ | \ \text{data}) &\approx \frac{1}{M} \sum_{m = 1}^M \text{I}\left ( -2\beta_{2, g}^{(m)} - \beta_{4, g}^{(m)} > 0 \text{ and } -2\beta_{3, g}^{(m)} - \beta_{4, g}^{(m)} > 0 \right ) \ P(\text{high-parent heterosis, Mo17xB73}) \ | \ \text{data}) &\approx \frac{1}{M} \sum_{m = 1}^M \text{I}\left ( 2\beta_{2, g}^{(m)} - \beta_{4, g}^{(m)} > 0 \text{ and } 2\beta_{3, g}^{(m)} - \beta_{4, g}^{(m)} > 0 \right ) \ P(\text{low-parent heterosis, Mo17xB73}) \ | \ \text{data}) &\approx \frac{1}{M} \sum_{m = 1}^M \text{I}\left ( -2\beta_{2, g}^{(m)} + \beta_{4, g}^{(m)} > 0 \text{ and }-2\beta_{3, g}^{(m)} + \beta_{4, g}^{(m)} > 0 \right ) \end{align} $$

To tell the software to calculate these posterior probabilities, use of the contrasts, bounds, and propositions slots of the Scenario object. In general, the package can calculate probabilities of that depend on linear combinations of the $\beta_{\ell, g}$'s, and the contrasts slot specifies these linear combinations.

paschold@contrasts

As in the heterosis example, the probabilities are calculated by comparing the linear combinations to lower bounds. In the heterosis example, linear combinations are being compared to zero, so all the slots in the bounds slot should be zero.

paschold@bounds

The conditions for heterosis are formulated as logical propositions, and these propositions are logical conjunctions that depend on the linear combinations and lower bounds from before. To write these heterosis propositions in terms of contrasts and bounds, use the propositions slot.

paschold@propositions

With the contrasts, bounds, and propositions defined, the-specific posterior probabilities are fully-specified.

Lastly, every Scenario object has an optional suppelement slot with additional information about the scenario. In this case, paschold@supplement is an empty list.

Example simulated scenario with scenario_heterosis_model

To simulate a heterosis scenario similar to that of @paschold, use the scenario_heterosis_model function. Count data are simulated from the model given fixed values for the hyperparameters.

s = scenario_heterosis_model(genes = 1000, libraries = 16)
head(s@counts)
s@design
s@contrasts
s@bounds
s@propositions

(Optional) create a Configs object.

The Configs S4 class contains the practical MCMC configuration parameters, such as the number of Monte Carlo iterations and the length of burnin, and important details about model specification. All of them have default settings, so just for the sake of trying out the package, you should be able to get away with calling the Configs function with no arguments. For a full list of slots and their explanations, type help("Configs-class").

The main slots to consider are the iterations, burnin, and thin slots, which control the duration of the MCMC. The total number of MCMC iterations is burnin + iterations * thin. burinin is the number of throwaway iterations at the beginning of the MCMC. Afterwards the next iterations * thin iterations are used to compute posterior probabilities and estimated means and mean squares for every parameter. Only iterations Monte Carlo parameter samples are returned to the user, and then, only for parameters specified in the parameter_sets_return slot. The parameter_sets_update slot specifies which parameters to update during the MCMC, giving the user the option to run empirical Bayes, for example. For gene-specific parameters in parameter_sets_update, not all parameters are returned for all the genes. (Downloading data from the GPU is expensive). Select the genes for parameter samples you want with the genes_return slot. For the $\epsilon_{n, g}$ parameters, use genes_return_epsilon to specify the values of $g$ and libraries_return_epsilon to specify the values of $n$.

(Optional) create a Starts object.

The Starts S4 class has the starting values of an MCMC chain. Usually, the user does not need to worry about it, as the unspecified starting values are automatically generated. However, the user may wish to specify starting values by setting the slots manually. For more information, type help("Starts-class") and be sure to read the vignette on the hierarchical model.

Create a Chain object.

The Chain class is the container for a whole Markov chain. It stores all the information in your Scenario, Configs, and Starts objects, in addition to Monte Carlo output and results. There are many, many S4 slots. For details on the slots, type help("Chain-class") and be sure to read the hierarchical model vignette. For help on how to create Chain objects, type help("Chain").

With only a Scenario object, you can create a Chain object with

chain = Chain(scenario)

If you have Configs and/or Starts objects, you can pass them in as well.

chain = Chain(scenario, configs, starts)

As an example, here is the Chain created from the paschold scenario visited earlier.

chain = Chain(paschold)
# Call str(chain) to inspect the slots.

You can recover the scenario, configuration, or starting values of a Chain object with the Scenario, Configs, and Starts objects. For example,

scenario = Scenario(chain)
configs = Configs(chain)
starts = Starts(chain)
str(configs)

Note that the unspecified starting values were automatically generated in the construction of the Chain object.

Call fbseq on the Chain object.

The fbseq function runs the MCMC algorithm to estimate parameters and calculate posterior probabilities. It takes a single Chain object as an argument, and outputs either another Chain object or a list of Chain objects, depending on the value of additional_chains. To run a single chain, use:

out = fbseq(chain, additional_chains = 0)

Here, out is another Chain object with the results. The starting values of out are the final parameter samples of the previous MCMC. That way, you can call fbseq on out and the MCMC will resume where it last left off.

continuation = fbseq(out, additional_chains = 0)

To monitor convergence, you can run multiple chains and use the psrf function to compute Gelman-Rubin potential scale reduction factors on a list of chains [@bda]. For example, to automatically run the MCMC until convergence, you could do the following.

chain_list = fbseq(chain)  # equivalent to fbseq(chain, additional_chains = 3)
saveRDS(chain, "chain_list_so_far.rds")
if(any(psrf(chain_list)) >= 1.1) repeat{
  chain_list = lapply(chain_list, fbseq)
  saveRDS(chain, "chain_list_so_far.rds")
  if(all(psrf(chain_list)) < 1.1) break
}

If the additional_chains argument is greater than 0, an additional additional_chains MCMC chains will be run after the first one, and the output object from fbseq will a list of the total 1 + additional_chains run. This is useful for computing Gelman-Rubin potential scale reduction factors with the psrf function to assess convergence.

out = fbseq(chain, additional_chains = 3)
psrf(out)

Choosing a backend: fbseqOpenMP or fbseqCUDA

fbseqOpenMP can run on most machines, but it is slow for datasets with lots of genes. Select the fbseqOpenMP backend by specifying backend = "OpenMP" in fbseq :

out = fbseq(chain, backend = "OpenMP")

For the OpenMP backend, you can distribute the MCMC chains after the first chain over multiple parallel processes.

out = fbseq(chain, backend = "OpenMP", processes = 3)

fbseqCUDA requires a CUDA-capable graphics processing unit (GPU), but it is much faster. Elapsed runtimes are in hours rather than days for typical-sized RNA-seq datasets. Select this backend with either

out = fbseq(chain, backend = "CUDA")

or simply

out = fbseq(chain)

Inspect the output.

The fbseq package has several functions for analyzing MCMC output. Below obj, can either be a Chain object or a list of Chain objects.

|------------------------------------|----------|--------------------------------------------------| | psrf(obj) | | Gelman-Rubin potential scale reduction factors on all parameters, used to monitor convergence [@bda]. Here, obj must be a list of at least 2 Chain objects. | | probs(obj) | | Estimated gene-specific posterior probabilities specified in Scenario | | effect_sizes(obj) | | Effect sizes that correspond to the posterior probabilities. For example, for "$P (2\beta_{2, g} - \beta_{4, g} > 1$ and $2\beta_{3, g} - \beta_{4, g} > 0.5 | \text{data})$", the effect size is the positive part of $\min \left (2\beta_{2, g} - \beta_{4, g} - 1, \ 2\beta_{3, g} - \beta_{4, g} - 0.5 \right )$. | | estimates(obj, level) | | Posterior means, standard deviations, and credible intervals (of level level) for all parameters. The credible intervals are computed using the normal distributions given by the posterior means and standard deviations. | |mcmc_samples(obj) | $\qquad$ | Monte Carlo samples of returned parameters selected in Configs |

References



wlandau/fbseq documentation built on May 4, 2019, 8:43 a.m.