Introduction to MFA

library(mfa)
library(ggplot2)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE, cache = TRUE, fig.width = 6, fig.height = 4,
                      warning = FALSE, message = FALSE)

Introduction

mfa is an R package for fitting a Bayesian mixture of factor analysers to infer developmental trajectories with bifurcations from single-cell gene expression data. It is able to jointly infer pseudotimes, branching, and genes differentially regulated across branches using a generative, Bayesian hierarchical model. Inference is performed using fast Gibbs sampling.

Installation

mfa can be installed in one of two ways:

From Bioconductor

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("mfa")
library(mfa)

From Github

This requires the devtools package to be installed first

install.packages("devtools") # If not already installed
devtools::install_github("kieranrcampbell/mfa")
library(mfa)

An example on synthetic data

Generating synthetic data

We first create some synthetic data for 100 cells and 40 genes calling the mfa function create_synthetic. This returns a list with gene expression, pseudotime, branch allocation, and various parameter estimates:

synth <- create_synthetic(C = 100, G = 40)
print(str(synth))

We can then PCA and put into a tidy format:

df_synth <- as_data_frame(prcomp(synth$X)$x[,1:2]) %>% 
  mutate(pseudotime = synth$pst,
        branch = factor(synth$branch))

and have a look at a PCA representation, coloured by both pseudotime and branch allocation:

ggplot(df_synth, aes(x = PC1, y = PC2, color = pseudotime)) + geom_point()
ggplot(df_synth, aes(x = PC1, y = PC2, color = branch)) + geom_point()

Calling mfa

The input to mfa is either an ExpressionSet (e.g. from using the package Scater) or a cell-by-gene expression matrix. If an ExpressionSet is provided then the values in the exprs slot are used for gene expression.

We invoke mfa with a call to the mfa(...) function. Depending on the size of the dataset and number of MCMC iterations used, this may take some time:

m <- mfa(synth$X)
print(m)

Particular care must be paid to the initialisation of the pseudotimes: by default they are initialised to the first principal component, though if the researcher suspects (based on plotting marker genes) that the trajectory corresponds to a different PC, this can be set using the pc_initialise argument.

MCMC diagnostics

As in any MCMC analysis, basic care is needed to make sure the samples have converged to something resembling the stationary distribution (see e.g. @cowles1996markov for a full discussion).

For a quick summary of these, mfa provides two functions: plot_mfa_trace and plot_mfa_autocorr for quick plotting of the trace and autocorrelation of the posterior log-likelihood:

plot_mfa_trace(m)
plot_mfa_autocorr(m)

Plotting results

We can extract posterior mean estimates along with credible intervals using the summary function:

ms <- summary(m)
print(head(ms))

This has six entries:

We can compare the inferred pseudotimes to the true values:

qplot(synth$pst, ms$pseudotime, color = factor(synth$branch)) +
  xlab('True pseudotime') + ylab('Inferred pseudotime') +
  scale_color_discrete(name = 'True\nbranch')

And we can equivalently plot the PCA representation coloured by MAP branch:

mutate(df_synth, inferred_branch = ms[['branch']]) %>% 
  ggplot(aes(x = PC1, y = PC2, color = inferred_branch)) +
  geom_point() +
  scale_color_discrete(name = 'Inferred\nbranch')

Finding genes that bifurcate

A unique part of this model is that through an ARD-like prior structure on the loading matrices we can automatically infer which genes are involved in the bifurcation process. For a quick-and-dirty look we can use the plot_chi function, where larger values of inverse-chi imply the gene is associated with the bifurcation:

plot_chi(m)

To calculate the MAP values for chi we can call the calculate_chi function, which returns a data_frame with the feature names and values:

posterior_chi_df <- calculate_chi(m)
head(posterior_chi_df)

Advanced usage

The mfa class

A call to mfa(...) returns an mfa object that contains all the information about the dataset and the MCMC inference performed. Note that it does not contain a copy of the original data. We can see the structure by calling str on an mfa object:

str(m, max.level = 1)

This contains the following slots:

Accessing MCMC traces

MCMC traces can be accessed through the traces slot of an mfa object. This gives a list with an element for each variable, along with the log-likelihood:

print(names(m$traces))

For non-branch-specific variables this is simply a matrix. For example, for the variable $\tau$ is just an interation-by-gene matrix:

str(m$traces$tau_trace)

We can easily get the posterior mean by calling colMeans. More fancy posterior density estimation can be perfomed using the MCMCglmm package, such as posterior.mode(...) for MAP estimation (though in practice this is often similar to posterior mean). We can estimate posterior intervals using the HPDInterval(...) function from the coda package (note that traces must be converted to coda objects before calling either of these).

Some variables are branch dependent, meaning the traces returned are arrays (or tensors in fashionable speak) that have dimension iteration x gene x branch. An example is the $k$ variable:

str(m$traces$k_trace)

To get posterior means (or modes, or intervals) we then need to use the apply function to iterate over the branches. To find the posterior means of k, we then call

pmean_k <- apply(m$traces$k_trace, 3, colMeans)
str(pmean_k)

This returns a gene-by-branch matrix of posterior estimates.

Technical

sessionInfo()

References



Try the mfa package in your browser

Any scripts or data that you put into this service are public.

mfa documentation built on Nov. 8, 2020, 8:10 p.m.