This simple tutorial will guide you through a typical analysis workflow with the NMFEM pipeline. By the end of this tutorial, the reader should be able to perform unsupervised clustering and important gene calling using non-negative matrix factorization (NMF), and important gene module calling using the spin-glass-based algorithm FEM. Throughout the tutorial we will use the mouse embryonic lung epithelial cell dataset [@treutlein2014reconstructing].

Unsupervised clustering and important gene calling

Let's begin by loading the NMFEM package:

library(NMFEM)

and loading the dataset bundled with the package:

data(fpkm)

The unsupervised clustering and important gene calling are packed into a single function. If you are running this on a multi-core computer, feel free to set a higher n_threads_ to take advantage of parallelism. Depending on your hardware this might take a while, so grab a cup of coffee while waiting for the results. :)

nmf_results <- nmf_subpopulation(fpkm, n_threads_ = 30)
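The value 30 above suits a large server. As a minimal sketch of how one might pick a thread count that fits the current machine, the snippet below uses the base `parallel` package. The variable names `n_cores` and `n_threads` are illustrative, and the commented-out call simply reuses the function shown above.

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to 1.
n_cores <- detectCores()
if (is.na(n_cores)) n_cores <- 1L

# Leave one core free for the rest of the system, but never go below 1.
n_threads <- max(1L, n_cores - 1L)
n_threads

# Possible use with the call above (requires NMFEM and the fpkm data):
# nmf_results <- nmf_subpopulation(fpkm, n_threads_ = n_threads)
```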

The nmf_results object contains various data frames and ggplot objects for different aspects of the analysis. Let's take a look at a few important ones. We begin by looking at the clustering result:

predict(nmf_results$nmf_result)

Then we can ask "what are the genes that provided the evidence for the separation of these two clusters?"

nmf_results$gene_info

One way to understand the connection between the called clusters and the called important genes is that the two clusters show two distinct expression patterns. That is, some genes are highly expressed only in one cluster, while other genes are highly expressed only in the other cluster.
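To make this idea concrete, here is a toy sketch of non-negative matrix factorization in base R. It is not the NMFEM implementation: it uses the standard multiplicative update rules on a small simulated matrix, and every name in it (`V`, `W`, `H`, `cell_cluster`, `gene_pattern`) is made up for illustration. The matrix `V` (genes by cells) is approximated by `W %*% H`, where each column of `W` is an expression pattern over genes and each row of `H` says how strongly each cell uses that pattern.

```r
set.seed(1)

# Toy expression matrix: 6 genes x 8 cells with two blocks of high expression.
n_genes <- 6
n_cells <- 8
V <- matrix(0.1, n_genes, n_cells,
            dimnames = list(paste0("gene", 1:n_genes), paste0("cell", 1:n_cells)))
V[1:3, 1:4] <- 5   # genes 1-3 are high only in cells 1-4
V[4:6, 5:8] <- 5   # genes 4-6 are high only in cells 5-8
V <- V + matrix(runif(n_genes * n_cells, 0, 0.5), n_genes, n_cells)

# Rank-2 factorization with multiplicative updates (Frobenius objective).
k <- 2
W <- matrix(runif(n_genes * k), n_genes, k)
H <- matrix(runif(k * n_cells), k, n_cells)
eps <- 1e-9  # guards against division by zero
for (i in seq_len(500)) {
  H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)
  W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)
}

# Rescale so each column of W sums to 1; W %*% H is unchanged.
s <- colSums(W)
W <- sweep(W, 2, s, "/")
H <- sweep(H, 1, s, "*")

# Assign each cell to its dominant pattern, and each gene likewise.
cell_cluster <- apply(H, 2, which.max)
gene_pattern <- apply(W, 1, which.max)
cell_cluster
gene_pattern
```

In this toy setting the cells split into two groups, and the genes that are high in only one group load onto the matching pattern. That is the same relationship between clusters and important genes described above.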

Let's see what these two expression patterns are:

nmf_results$coef_line_plot

We can also see the D-score distribution of these two patterns. This plot also gives you an idea of the relative abundance of genes for each pattern.

nmf_results$d_score_frequency_plot

There are also other interesting plots that offer further insight into the clusters, important genes, and expression patterns. We encourage the reader to explore the results object by looking into its structure (it's just a list):

str(nmf_results, max.level=1)
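Besides str, a compact way to survey a list is to tabulate the class of each element. The sketch below runs on a small stand-in list, because the exact contents of nmf_results are not reproduced here. The names `toy_results`, `cluster_tb`, `scores`, and `label` are invented for illustration.

```r
# Toy stand-in for a results list; the element names here are made up.
toy_results <- list(
  cluster_tb = data.frame(cell = c("c1", "c2"), cluster = c(1L, 2L)),
  scores     = c(0.2, 0.8),
  label      = "example"
)

# First class of each element, as a named character vector.
element_classes <- vapply(toy_results, function(x) class(x)[1], character(1))
element_classes

# Names of the elements that are data frames.
df_names <- names(toy_results)[vapply(toy_results, is.data.frame, logical(1))]
df_names
```

The same pattern should carry over to nmf_results. To pick out plot objects instead, a test such as `inherits(x, "ggplot")` is the usual choice.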

By the way, all the ggplot objects in the list have data frame counterparts for easy access to the underlying data. For example, coef_line_dat has the data for coef_line_plot:

nmf_results$coef_line_dat
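Because these are ordinary data frames, they can be inspected and exported with base R tools. The sketch below uses a toy data frame in place of a real result. Its column names are invented and are not the actual columns of coef_line_dat.

```r
# Toy data frame standing in for something like nmf_results$coef_line_dat;
# the column names are invented for illustration.
toy_dat <- data.frame(
  sample  = c("cell1", "cell2", "cell3"),
  pattern = c("A", "A", "B"),
  value   = c(0.9, 0.7, 0.1)
)

# Peek at the first rows.
head(toy_dat, 2)

# Write to a temporary CSV file and read it back to confirm the round trip.
out_file <- tempfile(fileext = ".csv")
write.csv(toy_dat, out_file, row.names = FALSE)
round_trip <- read.csv(out_file, stringsAsFactors = FALSE)
```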

Important module generation

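Again, everything you need is packed into the function spinglass_procedure. Note that the call below expects the objects phe, leading_genes, and mppi to already exist in your session; this tutorial does not show how they are constructed.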

module_results <- spinglass_procedure(fpkm, phe, leading_genes, mppi, 'mouse', n_threads_ = 30)

We can view the module graphs with:

module_results$graph_plot

A summary table with key information about these modules is available in:

module_results$final_tb

References



lanagarmire/NMFEM documentation built on May 20, 2019, 7:34 p.m.