In mnsekula/hbfm: Generating gene co-expression matrices from scRNA-seq data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

This vignette provides an introduction to using the hbfm package for constructing a gene co-expression matrix from discrete single-cell RNA-sequencing (scRNA-seq) data. The hbfm package contains functions to implement the stochastic EM algorithm and MCMC sampler defined in the "A sparse Bayesian factor model for the construction of gene co-expression networks from single-cell RNA sequencing count data" manuscript.

To begin, load the hbfm package.

# Load the hbfm package
library(hbfm)

# Load example dataset from hbfm package
data("gene.mat")
dim(gene.mat)

Functions in the hbfm package require discrete scRNA-seq gene expression data, so either a matrix or data frame of read counts is acceptable. The rows of the data correspond to genes and the columns correspond to cells. Here, the gene.mat dataset is a matrix of the expression counts of G = 10 genes (rows) and N = 100 cells (columns).

Stochastic EM

The first step in constructing a gene co-expression matrix with the hbfm package is to implement a stochastic EM algorithm with the stoc.em function to obtain a "good" set of initial starting values for the MCMC sampler. The stoc.em function requires two inputs:

Y - matrix or data.frame of gene expression count where the rows correspond to genes and columns correspond to cells; Y must contain intergets and have row names
Fac - the number of latent factors to consider in the model

The number of latent factors Fac is unknown, but should be less than the number of genes (G) in the dataset. Here, we will consider 5 factors in our example because our dataset only contains G = 10 genes. However, for datasets with larger numbers of genes, we typically recommend choosing a value for Fac that is greater than 10 (e.g., 15, 20, or 25).

The other main parameters of the stoc.em function relate to the number of iterations in the stochastic EM algorithm. These parameters and their default values are listed below:

M.stoc - total number of stochastic EM iterations (default = 2000)
M.int - initial number of MCMC draws before starting the EM algorithm (default = 100)
M.eval - number iterations to be used for parameter estimation (default = 200)

## Run stochastic EM first
## Consider F=5 factors
fit1 <- stoc.em(Y=gene.mat, Fac = 5, M.stoc = 2000, M.int = 100, M.eval = 200)

MCMC sampler

The hbfm.fit function will run an MCMC sampler with the initial values generated by the stoc.em function. The only required input for hbfm.fit is stoc.em.param, which is the object created by the stoc.em function.

The other main parameters of the hbfm.fit function relate to the number of iterations in the MCMC algorithm. These parameters and their default values are listed below:

M - total number of MCMC iterations (default = 4000)
M.save - number of iterations to calculate and save correlation estimates (default = 1000)

## Run MCMC sampler with initial parameter values from stoc.em
fit.res1 <- hbfm.fit(fit1, M = 4000, M.save = 1000)

Gene co-expression matrix

After running the MCMC sampler, the corr.est function is used to analyze the MCMC samples and obtain an estimated correlation matrix along with the upper and lower bounds of the 95% credible interval (CI). The input for the corr.est function is a list of objects created by the hbfm.fit function. In our example, we are inputting a list of just one object named fit.res1.

## Obtain estimated gene-gene correlations from MCMC samples
fit.corr <- corr.est(list(fit.res1))
print(fit.corr)

The print output from the corr.est function provides a summary of all gene-gene pairs with the estimated correlation, the lower and upper bound of the 95% CI, and an approximate "p-value" that represents the proportion of posterior samples outside of the smallest CI that contains 0. The star in the CI.sig column denotes the gene pair correlations with CI's that do not overlap with 0.

The gene co-expression matrix can also be obtained directly from the object created by corr.est.

## Obtain gene co-expression matrix
fit.corr$corr

Model selection

Because the number of latent factors is unknown, model selection using the Deviance Information Criterion (DIC) can be used to find an appropriate number of factors. The input for the hbfm.DIC function is a list of objects created by the hbfm.fit function. In our example, we are inputting a list of just one object named fit.res1.

## Obtain DIC for model with 5 factors
hbfm.DIC(list(fit.res1))

Extensions

This vignette provides the details on how to use the different functions in the hbfm package. In a typical analysis, multiple chains would be run and different numbers of latent factors (Fac) would be considered. Both the corr.est function and hbfm.DIC function will combine the results from multiple runs of hbfm.fit if the input list consists of multiple objects created by this function. The choice of Fac with the lowest DIC would be considered the "best" model choice and the correlation estimates created with this model would be the most appropriate to use.

mnsekula/hbfm documentation built on June 29, 2020, 5:12 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com