This vignette shows how MOFA can be used to disentangle the heterogeneity in single-cell DNA methylation and RNA expression (scMT) data.

Briefly, the data set consists of 87 mouse embryonic stem cells (mESCs), comprising of 16 cells cultured in ā€˜2iā€™ media, which induces a naive pluripotency state, and 71 serum-grown cells, which commits cells to a primed pluripotency state poised for cellular differentiation (Angermueller, 2016).

Load data and create MOFA object

The data is stored as a MultiAssayExperiment object. Notice that there are 4 views, one with normalized RNA expression assay and 3 views with DNA methylation information. Each DNA methylation view is a different genomic context (i.e. Enhancers, Promoters and CpG Islands) and each feature is an individual CpG site.

library(MultiAssayExperiment)
library(MOFA)
library(MOFAdata)
library(ggplot2)
data("scMT_data")
scMT_data

First, we create the MOFA object:

MOFAobject <- createMOFAobject(scMT_data)
MOFAobject

The function plotDataOverview can be used to obtain an overview of the structure of the data:

plotDataOverview(MOFAobject, colors=c("#31A354","#377EB8","#377EB8","#377EB8"))

Fit the MOFA model

The next step is to train the model. Internally, this is done via Python, so make sure you have the corresponding package installed (see installation instructions and read the FAQ if you have problems).

Define options

Define data options

Next, we define data options. The most important are:

DataOptions <- getDefaultDataOptions()
DataOptions

Define model options

Next, we define model options. The most important are:

ModelOptions <- getDefaultModelOptions(MOFAobject)
ModelOptions

Define training options

Next, we define training options. The most important are:

TrainOptions <- getDefaultTrainOptions()
TrainOptions$seed <- 2018

# Automatically drop factors that explain less than 1% of variance in all omics
TrainOptions$DropFactorThreshold <- 0.01

TrainOptions

Prepare MOFA

prepareMOFA internally performs a set of sanity checks and fills the DataOptions, TrainOptions and ModelOptions slots of the MOFAobject

MOFAobject <- prepareMOFA(
  MOFAobject, 
  DataOptions = DataOptions,
  ModelOptions = ModelOptions,
  TrainOptions = TrainOptions
)

Run MOFA

Now we are ready to train the MOFAobject, which is done with the function runMOFA. This step can take some time (around 5 min with default parameters for a single trial). For illustration we provide an existing trained MOFAobject

MOFAobject <- runMOFA(MOFAobject)
filepath <- system.file("extdata", "scMT_model.hdf5", package = "MOFAdata")
MOFAobject <- loadModel(filepath, MOFAobject)
MOFAobject

Analyse a trained MOFA model

After training, we can explore the results from MOFA. Here we provide a semi-automated pipeline to disentangle and characterize all the identified sources of variation (the factors).

Part 1: Disentangling the heterogeneity

Calculation of variance explained by each factor in each view. This is probably the most important plot that MOFA generates, as it summarises the entire heterogeneity of the dataset in a single figure.

Part 2: Characterization of individual factors

Other analyses, including imputation of missing values and prediction of clinical data are also available. See below for a short guide on how to impute data. Detailed vignettes will follow soon.

Disentangling the heterogeneity, calculation of variance explained by each factor in each view

This is done by calculateVarianceExplained (to get the numerical values) and plotVarianceExplained (to get the plot).
The resulting figure gives an overview of which factors are active in which view(s). If a factor is active in more than one view, this means that is capturing shared signal (co-variation) between features of different data modalities.

In this data set MOFA identified 3 Factors with a minimum variance of 1%. While Factor 1 is shared across all data modalities (12% variance explained in the RNA data and between 53% and 72% in the methylation data sets), Factors 2 and 3 are active primarily in the RNA data

# Calculate the variance explained (R2) per factor in each view 
r2 <- calculateVarianceExplained(MOFAobject)
r2$R2Total

# Variance explained by each factor in each view
head(r2$R2PerFactor)

# Plot it
plotVarianceExplained(MOFAobject)

Characterisation of individual factors

Inspection of loadings for Factor 1

Plotting the RNA expression gene loadings for Factor 1, we can see that it is enriched for naive pluripotency marker genes such as Rex1/Zpf42, Tbx3, Fbxo15 and Essrb. Hence, based on previous studies (Mohammed et al, 2016) we can hypothesise that Factor 1 captures a transition from a naive pluripotent state to a primed pluripotent states.

# Plot all weights and highlight specific gene markers
plotWeights(
  object = MOFAobject, 
  view = "RNA expression", 
  factor = 1, 
  nfeatures = 0, 
  abs = FALSE, 
  manual = list(c("Zfp42","Esrrb","Morc1","Fbxo15",
                  "Jam2","Klf4","Tcl1","Tbx3","Tex19.1"))
)

# Plot top 10 genes
plotTopWeights(
  object = MOFAobject,
  view = "RNA expression", 
  factor = 1, 
  nfeatures = 10
)

Also, instead of looking at the "abstract" weights, it is useful to observe, in the original data, the heterogeneity captured by a Factor. This can be done using the plotDataHeatmap function.

# Add metadata to the plot
factor1 <- sort(getFactors(MOFAobject,"LF1")[,1])
order_samples <- names(factor1)
df <- data.frame(
  row.names = order_samples,
  culture = getCovariates(MOFAobject, "culture")[order_samples],
  factor = factor1
)

plotDataHeatmap(
  object = MOFAobject, 
  view = "RNA expression", 
  factor = "LF1", 
  features = 20, 
  transpose = TRUE, 
  show_colnames=FALSE, show_rownames=TRUE, # pheatmap options
  cluster_cols = FALSE, annotation_col=df # pheatmap options
)

We can now connect these transcriptomic changes to coordinated changes on the DNA methylation. As Factor 1 is active in all genomic contexts, it suggests that there is a massive genome-wide DNA methylation remodelling. This can confirmed by inspecting the weights in the DNA methylation views, as we do here for the CpG sites in enhancers: Notice that most features (CpG sites) have a negative weight, which suggests that their DNA methylation decrease in a manner that is inversely proportional to the direction of Factor 1.

plotWeights(
  object = MOFAobject,
  view = "Met Enhancers", 
  factor = 1, 
  nfeatures = 0,
  abs = FALSE,
  scale = FALSE
)

Similar observations can be made for the CpG sites in other genomic contexts, here CpG Islands or Promotors, by plotting the weights of the respective views (i.e. "Met CpG Islands" or "Met Promoters").

As done before, let's observe the heterogeneity captured by Factor 1 in the original data space. This clearly confirms that most of the CpG sites are getting methylated as cells progress in Factor 1 from naive to primed pluripotent stem cells

plotDataHeatmap(
  object = MOFAobject, 
  view = "Met Enhancers", 
  factor = 1, 
  features = 500,
  transpose = TRUE,
  cluster_rows=FALSE, cluster_cols=FALSE, # pheatmap options
  show_rownames=FALSE, show_colnames=FALSE, # pheatmap options
  annotation_col=df  # pheatmap options
)

Inspection of loadings for Factor 2

A similar analysis for Factor 2 reveals that it captures a second axis of differentiation from the primed pluripotency state to a differentiated state, with highest RNA loadings for known differentiation markers such as keratins and annexins.

# Plot all weights and highlight specific gene markers
plotWeights(
  object = MOFAobject, 
  view="RNA expression", 
  factor = 2, 
  nfeatures = 0, 
  manual = list(c("Krt8","Cald1","Anxa5","Tagln",
                  "Ahnak","Dsp","Anxa3","Krt19")), 
  scale = FALSE,
  abs = FALSE
)

# Plot top 10 genes
plotTopWeights(
  object = MOFAobject, 
  view="RNA expression", 
  factor = 2, 
  nfeatures = 10
)

Interestingly, the $R^2$ plot suggests that this second axis of variability is not associated with DNA methylation. We can confirm this by plotting the weights (they are all zero) and the heatmaps (no coherent pattern) as we have done before:

plotWeights(
  object = MOFAobject, 
  view = "Met Enhancers", 
  factor = 2, 
  nfeatures = 0, 
  abs = FALSE,
  scale = FALSE
)
factor2 <- sort(getFactors(MOFAobject,"LF2")[,1])
order_samples <- names(factor2)
df <- data.frame(
  row.names = order_samples,
  culture =  getCovariates(MOFAobject, "culture")[order_samples],
  factor = factor2
)
plotDataHeatmap(
  object = MOFAobject, 
  view = "Met Enhancers", 
  factor = 2, 
  features = 500,
  transpose = TRUE,
  cluster_rows=FALSE, cluster_cols=FALSE,  # pheatmap options
  show_rownames=FALSE, show_colnames=FALSE,  # pheatmap options
  annotation_col=df  # pheatmap options
)

Ordination of samples by factors

Samples can be visualized along factors of interest using the plotFactorScatter and plotFactorBeeswarm functions.

In this data set, the combination of Factor 1 and Factor 2 captured the entire differentiation trajectory from naive pluripotent cells via primed pluripotent cells to differentiated cells. This illustrates the importance of learning continuous latent factors rather than discrete sample assignments.

p <- plotFactorScatter(
  object = MOFAobject, 
  factors=1:2, 
  color_by = "culture")

p + scale_color_manual(values=c("lightsalmon","orangered3"))

Finally, Factor 3 captured the cellular detection rate, a known technical covariate that represents the number of expressed genes.

plotFactorBeeswarm(
  object = MOFAobject,
  factors = 3, 
  color_by = "cellular_detection_rate"
)

Customized analysis

For customized exploration of weights and factors, you can directly fetch the variables from the MOFAobject using 'get' functions: getWeights, getFactors and getTrainData. As an example, here we do a scatterplot between Factor 3 and the true Cellular Detection Rate values

cdr <- colMeans(getTrainData(MOFAobject)$`RNA expression`>0,na.rm=TRUE)
factor3 <- getFactors(MOFAobject,
                      factors=3)

foo <- data.frame(factor = as.numeric(factor3), cdr = cdr)
ggplot(foo, aes_string(x = "factor", y = "cdr")) + 
  geom_point() + xlab("Factor 3") +
  ylab("Cellular Detection Rate") +
  stat_smooth(method="lm") +
  theme_bw()

Further functionalities

Imputation of missing observations

The factors can be used to impute missing data (see Methods of the publication). This is done using the impute function, which stores the imputed data in the ImputedData slot of the MOFAobject. It can be accessed via the getImputedData function:

MOFAobject <- impute(MOFAobject)
nonImputedMethylation <- getTrainData(MOFAobject,
                                      view="Met CpG Islands")[[1]]
imputedMethylation <- getImputedData(MOFAobject,
                                     view="Met CpG Islands")[[1]]
# non-imputed data
pheatmap::pheatmap(nonImputedMethylation[1:100,1:20],
         cluster_rows = FALSE, cluster_cols = FALSE,
         show_rownames = FALSE, show_colnames = FALSE)

# imputed data
pheatmap::pheatmap(imputedMethylation[1:100,1:20],
         cluster_rows = FALSE, cluster_cols = FALSE,
         show_rownames = FALSE, show_colnames = FALSE)

Clustering of samples based on latent factors

Samples can be clustered according to their values on the latent factors using the clusterSamples function.

# kmeans clustering with K=3 using Factor 1
set.seed(1234)
clusters <- clusterSamples(MOFAobject, k=3, factors=1)

# Scatterplot colored by the predicted cluster labels,
# and shaped by the true culture conditions
plotFactorScatter(
  object = MOFAobject,
  factors = 1:2,
  color_by = clusters,
  shape_by = "culture"
)

SessionInfo

sessionInfo()


bioFAM/MOFA documentation built on Oct. 3, 2020, 12:53 a.m.