Introduction to MFA
In mfa: Bayesian hierarchical mixture of factor analyzers for modelling genomic bifurcations

library(mfa)
library(ggplot2)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE, cache = TRUE, fig.width = 6, fig.height = 4,
                      warning = FALSE, message = FALSE)

Introduction

mfa is an R package for fitting a Bayesian mixture of factor analysers to infer developmental trajectories with bifurcations from single-cell gene expression data. It is able to jointly infer pseudotimes, branching, and genes differentially regulated across branches using a generative, Bayesian hierarchical model. Inference is performed using fast Gibbs sampling.

Installation

mfa can be installed in one of two ways:

From Bioconductor

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("mfa")
library(mfa)

From Github

This requires the devtools package to be installed first

install.packages("devtools") # If not already installed
devtools::install_github("kieranrcampbell/mfa")
library(mfa)

An example on synthetic data

Generating synthetic data

We first create some synthetic data for 100 cells and 40 genes calling the mfa function create_synthetic. This returns a list with gene expression, pseudotime, branch allocation, and various parameter estimates:

synth <- create_synthetic(C = 100, G = 40)
print(str(synth))

We can then PCA and put into a tidy format:

df_synth <- as_data_frame(prcomp(synth$X)$x[,1:2]) %>% 
  mutate(pseudotime = synth$pst,
        branch = factor(synth$branch))

and have a look at a PCA representation, coloured by both pseudotime and branch allocation:

ggplot(df_synth, aes(x = PC1, y = PC2, color = pseudotime)) + geom_point()
ggplot(df_synth, aes(x = PC1, y = PC2, color = branch)) + geom_point()

Calling `mfa`

The input to mfa is either an ExpressionSet (e.g. from using the package Scater) or a cell-by-gene expression matrix. If an ExpressionSet is provided then the values in the exprs slot are used for gene expression.

We invoke mfa with a call to the mfa(...) function. Depending on the size of the dataset and number of MCMC iterations used, this may take some time:

m <- mfa(synth$X)
print(m)

Particular care must be paid to the initialisation of the pseudotimes: by default they are initialised to the first principal component, though if the researcher suspects (based on plotting marker genes) that the trajectory corresponds to a different PC, this can be set using the pc_initialise argument.

MCMC diagnostics

As in any MCMC analysis, basic care is needed to make sure the samples have converged to something resembling the stationary distribution (see e.g. @cowles1996markov for a full discussion).

For a quick summary of these, mfa provides two functions: plot_mfa_trace and plot_mfa_autocorr for quick plotting of the trace and autocorrelation of the posterior log-likelihood:

plot_mfa_trace(m)
plot_mfa_autocorr(m)

Plotting results

We can extract posterior mean estimates along with credible intervals using the summary function:

ms <- summary(m)
print(head(ms))

This has six entries:

pseudotime The MAP pseudotime estimate
branch The MAP branch estimate
branch_certainty The proportion of MCMC traces (after burn-in) for which the cell was assigned to the MAP branch
pseudotime_lower and pseudotime_upper: the lower and upper 95\% highest-probability-density posterior credible intervals

We can compare the inferred pseudotimes to the true values:

qplot(synth$pst, ms$pseudotime, color = factor(synth$branch)) +
  xlab('True pseudotime') + ylab('Inferred pseudotime') +
  scale_color_discrete(name = 'True\nbranch')

And we can equivalently plot the PCA representation coloured by MAP branch:

mutate(df_synth, inferred_branch = ms[['branch']]) %>% 
  ggplot(aes(x = PC1, y = PC2, color = inferred_branch)) +
  geom_point() +
  scale_color_discrete(name = 'Inferred\nbranch')

Finding genes that bifurcate

A unique part of this model is that through an ARD-like prior structure on the loading matrices we can automatically infer which genes are involved in the bifurcation process. For a quick-and-dirty look we can use the plot_chi function, where larger values of inverse-chi imply the gene is associated with the bifurcation:

plot_chi(m)

To calculate the MAP values for chi we can call the calculate_chi function, which returns a data_frame with the feature names and values:

posterior_chi_df <- calculate_chi(m)
head(posterior_chi_df)

Advanced usage

The `mfa` class

A call to mfa(...) returns an mfa object that contains all the information about the dataset and the MCMC inference performed. Note that it does not contain a copy of the original data. We can see the structure by calling str on an mfa object:

str(m, max.level = 1)

This contains the following slots:

traces - the raw MCMC traces (discussed in following section)
iter - the number of MCMC iterations
thin - the thinning of the MCMC chain
burn - the number of MCMC iterations thrown away as burn-in
b - the number of branches modelled
collapse - whether collapsed Gibbs sampling was implemented
N - the number of cells
G - the number of features (e.g. genes)
feature_names - the names of the features (e.g. genes)
cell_names - the names of the cells

Accessing MCMC traces

MCMC traces can be accessed through the traces slot of an mfa object. This gives a list with an element for each variable, along with the log-likelihood:

print(names(m$traces))

For non-branch-specific variables this is simply a matrix. For example, for the variable $\tau$ is just an interation-by-gene matrix:

str(m$traces$tau_trace)

We can easily get the posterior mean by calling colMeans. More fancy posterior density estimation can be perfomed using the MCMCglmm package, such as posterior.mode(...) for MAP estimation (though in practice this is often similar to posterior mean). We can estimate posterior intervals using the HPDInterval(...) function from the coda package (note that traces must be converted to coda objects before calling either of these).

Some variables are branch dependent, meaning the traces returned are arrays (or tensors in fashionable speak) that have dimension iteration x gene x branch. An example is the $k$ variable:

str(m$traces$k_trace)

To get posterior means (or modes, or intervals) we then need to use the apply function to iterate over the branches. To find the posterior means of k, we then call

pmean_k <- apply(m$traces$k_trace, 3, colMeans)
str(pmean_k)

This returns a gene-by-branch matrix of posterior estimates.

Technical

sessionInfo()

References

Any scripts or data that you put into this service are public.

mfa documentation built on Nov. 8, 2020, 8:10 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

mfa
Bayesian hierarchical mixture of factor analyzers for modelling genomic bifurcations

Introduction to MFA
In mfa: Bayesian hierarchical mixture of factor analyzers for modelling genomic bifurcations

Introduction

Installation

From Bioconductor

From Github

An example on synthetic data

Generating synthetic data

Calling `mfa`

MCMC diagnostics

Plotting results

Finding genes that bifurcate

Advanced usage

The `mfa` class

Accessing MCMC traces

Technical

References

Try the mfa package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

mfa Bayesian hierarchical mixture of factor analyzers for modelling genomic bifurcations

Introduction to MFA In mfa: Bayesian hierarchical mixture of factor analyzers for modelling genomic bifurcations

Introduction

Installation

From Bioconductor

From Github

An example on synthetic data

Generating synthetic data

Calling mfa

MCMC diagnostics

Plotting results

Finding genes that bifurcate

Advanced usage

The mfa class

Accessing MCMC traces

Technical

References

Try the mfa package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

mfa
Bayesian hierarchical mixture of factor analyzers for modelling genomic bifurcations

Introduction to MFA
In mfa: Bayesian hierarchical mixture of factor analyzers for modelling genomic bifurcations

Calling `mfa`

The `mfa` class