knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This vignette provides an introduction to using the hbfm
package for constructing a gene co-expression matrix from discrete single-cell RNA-sequencing (scRNA-seq) data. The hbfm
package contains functions to implement the stochastic EM algorithm and MCMC sampler defined in the "A sparse Bayesian factor model for the construction of gene co-expression networks from single-cell RNA sequencing count data" manuscript.
To begin, load the hbfm
package.
# Load the hbfm package library(hbfm) # Load example dataset from hbfm package data("gene.mat") dim(gene.mat)
Functions in the hbfm
package require discrete scRNA-seq gene expression data, so either a matrix or data frame of read counts is acceptable. The rows of the data correspond to genes and the columns correspond to cells. Here, the gene.mat
dataset is a matrix of the expression counts of G = 10 genes (rows) and N = 100 cells (columns).
The first step in constructing a gene co-expression matrix with the hbfm
package is to implement a stochastic EM algorithm with the stoc.em
function to obtain a "good" set of initial starting values for the MCMC sampler. The stoc.em
function requires two inputs:
Y
- matrix or data.frame of gene expression count where the rows correspond to genes and columns correspond to cells; Y must contain intergets and have row namesFac
- the number of latent factors to consider in the modelThe number of latent factors Fac
is unknown, but should be less than the number of genes (G) in the dataset. Here, we will consider 5 factors in our example because our dataset only contains G = 10 genes. However, for datasets with larger numbers of genes, we typically recommend choosing a value for Fac
that is greater than 10 (e.g., 15, 20, or 25).
The other main parameters of the stoc.em
function relate to the number of iterations in the stochastic EM algorithm. These parameters and their default values are listed below:
M.stoc
- total number of stochastic EM iterations (default = 2000)M.int
- initial number of MCMC draws before starting the EM algorithm (default = 100)M.eval
- number iterations to be used for parameter estimation (default = 200)## Run stochastic EM first ## Consider F=5 factors fit1 <- stoc.em(Y=gene.mat, Fac = 5, M.stoc = 2000, M.int = 100, M.eval = 200)
The hbfm.fit
function will run an MCMC sampler with the initial values generated by the stoc.em
function. The only required input for hbfm.fit
is stoc.em.param
, which is the object created by the stoc.em
function.
The other main parameters of the hbfm.fit
function relate to the number of iterations in the MCMC algorithm. These parameters and their default values are listed below:
M
- total number of MCMC iterations (default = 4000)M.save
- number of iterations to calculate and save correlation estimates (default = 1000)## Run MCMC sampler with initial parameter values from stoc.em fit.res1 <- hbfm.fit(fit1, M = 4000, M.save = 1000)
After running the MCMC sampler, the corr.est
function is used to analyze the MCMC samples and obtain an estimated correlation matrix along with the upper and lower bounds of the 95% credible interval (CI). The input for the corr.est
function is a list of objects created by the hbfm.fit
function. In our example, we are inputting a list of just one object named fit.res1
.
## Obtain estimated gene-gene correlations from MCMC samples fit.corr <- corr.est(list(fit.res1)) print(fit.corr)
The print output from the corr.est
function provides a summary of all gene-gene pairs with the estimated correlation, the lower and upper bound of the 95% CI, and an approximate "p-value" that represents the proportion of posterior samples outside of the smallest CI that contains 0. The star in the CI.sig column denotes the gene pair correlations with CI's that do not overlap with 0.
The gene co-expression matrix can also be obtained directly from the object created by corr.est
.
## Obtain gene co-expression matrix fit.corr$corr
Because the number of latent factors is unknown, model selection using the Deviance Information Criterion (DIC) can be used to find an appropriate number of factors. The input for the hbfm.DIC
function is a list of objects created by the hbfm.fit
function. In our example, we are inputting a list of just one object named fit.res1
.
## Obtain DIC for model with 5 factors hbfm.DIC(list(fit.res1))
This vignette provides the details on how to use the different functions in the hbfm
package. In a typical analysis, multiple chains would be run and different numbers of latent factors (Fac
) would be considered. Both the corr.est
function and hbfm.DIC
function will combine the results from multiple runs of hbfm.fit
if the input list consists of multiple objects created by this function. The choice of Fac
with the lowest DIC would be considered the "best" model choice and the correlation estimates created with this model would be the most appropriate to use.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.