Introduction to MFA In mfa: Bayesian hierarchical mixture of factor analyzers for modelling genomic bifurcations

library(mfa)
library(ggplot2)
library(dplyr)

branch = factor(synth$branch))  and have a look at a PCA representation, coloured by both pseudotime and branch allocation: ggplot(df_synth, aes(x = PC1, y = PC2, color = pseudotime)) + geom_point() ggplot(df_synth, aes(x = PC1, y = PC2, color = branch)) + geom_point()  Calling mfa The input to mfa is either an ExpressionSet (e.g. from using the package Scater) or a cell-by-gene expression matrix. If an ExpressionSet is provided then the values in the exprs slot are used for gene expression. We invoke mfa with a call to the mfa(...) function. Depending on the size of the dataset and number of MCMC iterations used, this may take some time: m <- mfa(synth$X)
print(m)


Particular care must be paid to the initialisation of the pseudotimes: by default they are initialised to the first principal component, though if the researcher suspects (based on plotting marker genes) that the trajectory corresponds to a different PC, this can be set using the pc_initialise argument.

MCMC diagnostics

As in any MCMC analysis, basic care is needed to make sure the samples have converged to something resembling the stationary distribution (see e.g. @cowles1996markov for a full discussion).

For a quick summary of these, mfa provides two functions: plot_mfa_trace and plot_mfa_autocorr for quick plotting of the trace and autocorrelation of the posterior log-likelihood:

plot_mfa_trace(m)
plot_mfa_autocorr(m)


Plotting results

We can extract posterior mean estimates along with credible intervals using the summary function:

ms <- summary(m)


This has six entries:

• pseudotime The MAP pseudotime estimate
• branch The MAP branch estimate
• branch_certainty The proportion of MCMC traces (after burn-in) for which the cell was assigned to the MAP branch
• pseudotime_lower and pseudotime_upper: the lower and upper 95\% highest-probability-density posterior credible intervals

We can compare the inferred pseudotimes to the true values:

qplot(synth$pst, ms$pseudotime, color = factor(synth$branch)) + xlab('True pseudotime') + ylab('Inferred pseudotime') + scale_color_discrete(name = 'True\nbranch')  And we can equivalently plot the PCA representation coloured by MAP branch: mutate(df_synth, inferred_branch = ms[['branch']]) %>% ggplot(aes(x = PC1, y = PC2, color = inferred_branch)) + geom_point() + scale_color_discrete(name = 'Inferred\nbranch')  Finding genes that bifurcate A unique part of this model is that through an ARD-like prior structure on the loading matrices we can automatically infer which genes are involved in the bifurcation process. For a quick-and-dirty look we can use the plot_chi function, where larger values of inverse-chi imply the gene is associated with the bifurcation: plot_chi(m)  To calculate the MAP values for chi we can call the calculate_chi function, which returns a data_frame with the feature names and values: posterior_chi_df <- calculate_chi(m) head(posterior_chi_df)  Advanced usage The mfa class A call to mfa(...) returns an mfa object that contains all the information about the dataset and the MCMC inference performed. Note that it does not contain a copy of the original data. We can see the structure by calling str on an mfa object: str(m, max.level = 1)  This contains the following slots: • traces - the raw MCMC traces (discussed in following section) • iter - the number of MCMC iterations • thin - the thinning of the MCMC chain • burn - the number of MCMC iterations thrown away as burn-in • b - the number of branches modelled • collapse - whether collapsed Gibbs sampling was implemented • N - the number of cells • G - the number of features (e.g. genes) • feature_names - the names of the features (e.g. genes) • cell_names - the names of the cells Accessing MCMC traces MCMC traces can be accessed through the traces slot of an mfa object. This gives a list with an element for each variable, along with the log-likelihood: print(names(m$traces))


For non-branch-specific variables this is simply a matrix. For example, for the variable $\tau$ is just an interation-by-gene matrix:

str(m$traces$tau_trace)


We can easily get the posterior mean by calling colMeans. More fancy posterior density estimation can be perfomed using the MCMCglmm package, such as posterior.mode(...) for MAP estimation (though in practice this is often similar to posterior mean). We can estimate posterior intervals using the HPDInterval(...) function from the coda package (note that traces must be converted to coda objects before calling either of these).

Some variables are branch dependent, meaning the traces returned are arrays (or tensors in fashionable speak) that have dimension iteration x gene x branch. An example is the $k$ variable:

str(m$traces$k_trace)


To get posterior means (or modes, or intervals) we then need to use the apply function to iterate over the branches. To find the posterior means of k, we then call

pmean_k <- apply(m$traces$k_trace, 3, colMeans)
str(pmean_k)


This returns a gene-by-branch matrix of posterior estimates.

Technical

sessionInfo()


References

Try the mfa package in your browser

Any scripts or data that you put into this service are public.

mfa documentation built on Nov. 8, 2020, 8:10 p.m.