knitr::opts_chunk$set(
  cache = TRUE,
  collapse = TRUE,
  comment = "##"
)

Single-cell RNA sequencing (scRNA-seq) now produces large amounts of single-cell transcriptomics data for the unbiased quantification of cellular heterogeneity. Although many labs have successfully generated scRNA-seq data, less attention has been paid to how knowledge derived from these data can be integrated across studies and leveraged by the whole single-cell community. This R package provides a user-friendly scRNA-seq integration tool that uses statistical methods to map new cell clusters to reference cell clusters.

Our method, FR-Match, is a novel application of the Friedman-Rafsky (FR) test, a non-parametric statistical test for multivariate data comparison, to single-cell clustering results. We tailor the classical testing procedure to scRNA-seq data. The null hypothesis is that the two clusters being compared have the same distribution (i.e. a match), and the alternative hypothesis is that their distributions differ (i.e. a non-match), in the high-dimensional data space defined by the selected gene features. The procedure takes clustered gene expression matrices from the query and reference experiments, and returns the FR statistic with a p-value as evidence of whether each pair of cell clusters is matched.

General steps of FR-Match include:

  1. Select features (gene expression values) using the supervised NS-Forest marker gene selection algorithm [ref] recently developed by the project team, which identifies minimum sets of marker genes that maximize the classification power of differentiating clusters in the reference dataset.
  2. Construct minimum spanning trees for each pair of query and reference clusters.
  3. Calculate FR statistics and p-values by counting the number of edges that connect nodes from different clusters (different colors) in the minimum spanning tree plots.
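
Steps 2 and 3 can be illustrated with a minimal base-R sketch. It is not the package's implementation: `mst_edges()` and `fr_sketch()` are hypothetical helper names, Euclidean distance is assumed, inputs are assumed to have genes in rows and cells in columns, and the p-value comes from permuting cluster labels over the fixed tree rather than from an analytic approximation.

```r
# Prim's algorithm on a dense distance matrix d (N x N, N >= 2).
# Returns an (N - 1) x 2 integer matrix of node indices, one row per MST edge.
mst_edges <- function(d) {
  n <- nrow(d)
  in_tree <- c(TRUE, rep(FALSE, n - 1))   # start the tree at node 1
  best_dist <- d[1, ]                     # cheapest link from the tree to each node
  best_from <- rep(1L, n)                 # tree node that provides that link
  edges <- matrix(NA_integer_, nrow = n - 1, ncol = 2)
  for (k in seq_len(n - 1)) {
    cand <- which(!in_tree)
    j <- cand[which.min(best_dist[cand])] # closest node not yet in the tree
    edges[k, ] <- c(best_from[j], j)
    in_tree[j] <- TRUE
    closer <- !in_tree & d[j, ] < best_dist
    best_dist[closer] <- d[j, closer]
    best_from[closer] <- j
  }
  edges
}

# x, y: feature matrices (genes in rows, cells in columns) for two clusters.
fr_sketch <- function(x, y, n_perm = 999) {
  cells <- t(cbind(x, y))                       # pooled cells in rows
  label <- rep(c(1L, 2L), c(ncol(x), ncol(y)))  # 1 = query, 2 = reference
  e <- mst_edges(as.matrix(dist(cells)))        # MST on the pooled cells
  # Number of runs = number of edges joining different clusters, plus one.
  count_runs <- function(lab) 1L + sum(lab[e[, 1]] != lab[e[, 2]])
  runs_obs <- count_runs(label)
  runs_perm <- replicate(n_perm, count_runs(sample(label)))
  # Few cross-cluster edges (few runs) is evidence against a match.
  list(runs = runs_obs,
       p.value = (1 + sum(runs_perm <= runs_obs)) / (n_perm + 1))
}

set.seed(1)
same <- fr_sketch(matrix(rnorm(100), nrow = 5), matrix(rnorm(100), nrow = 5))
far  <- fr_sketch(matrix(rnorm(100), nrow = 5),
                  matrix(rnorm(100, mean = 10), nrow = 5))
```

In this sketch, `far` compares two well-separated samples, so the tree contains very few cross-cluster edges and the permutation p-value is small. `same` compares two samples from one distribution, so cross-cluster edges are common.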

Figure: Overview of FR-Match for cross-comparison between two scRNA-seq experiments.

Installing FRmatch

The R package FRmatch is under active development. The latest version is maintained on GitHub.

To install FRmatch from the GitHub repository, use

install.packages("devtools")
devtools::install_github("JCVenterInstitute/FRmatch")

After successful installation, load FRmatch into your R environment.

library(FRmatch)

Launching Shiny App

To illustrate FR-Match, we complement this package with a Shiny app, which includes two preloaded data objects containing the same cells but different clusters. The app demonstrates useful features of FRmatch, including MST plots and visualization of the final results. You may launch the app with

runShiny()

Getting input data ready

scRNA-seq analyses require many pieces of data. We use the SingleCellExperiment class, a convenient container for single-cell genomics data, to hold the gene expression data and metadata needed by FRmatch. For instructions on how to construct a SingleCellExperiment object, please see An introduction to the SingleCellExperiment class.

For FRmatch, the following data items are essential:

In addition, information such as the F-measure and cluster order is not essential, but it facilitates the visualization and customized analysis provided in this package.

An example data object

This package includes an example data object, sce.example. For more details, see help("sce.example").

data(sce.example)
sce.example

This example dataset contains 16487 genes (in rows) and 865 cells (in columns). Below is a quick check of the clusters and their sizes (the SingleCellExperiment package is needed to work with data objects of this class).

library(SingleCellExperiment)
knitr::kable(table(colData(sce.example)), col.names=c("Cluster", "Size"))

FR-Match

For illustration purposes, we show how FRmatch works in the context of cross-validating our example data.

Create toy datasets

We randomly select 50% of the cells in proportion to cluster sizes as query, and set the rest of the cells as reference.

library(dplyr)
library(tibble)
set.seed(999)
## subsampling
all <- colData(sce.example) %>% as.data.frame() %>% rownames_to_column()
sam1 <- all %>% group_by(cluster_membership) %>% sample_frac(.5)
sam2 <- dplyr::setdiff(all, sam1)

sce.sam1 <- sce.example[,sam1$rowname] #query
sce.sam2 <- sce.example[,sam2$rowname] #reference

Run FRmatch

Now we have two data objects that we can feed to FRmatch. The FRmatch() function is a wrapper function that takes two required input arguments, sce.query= and sce.ref=, which are the query experiment data and the reference experiment data, respectively. In this function, steps include:

rst <- FRmatch(sce.query = sce.sam1, sce.ref = sce.sam2)

This main function returns a list of results, which can be visualized using our graphical functions.

Visualization tools for FRmatch

Plot FRmatch results

We provide the function plot_FRmatch() to facilitate the visualization of FRmatch results.

plot_FRmatch(rst)
plot_FRmatch(rst, type="padj")
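
Because the null hypothesis of the FR test is a match, a query-reference pair is read as a match when its p-value is not small. The base-R sketch below shows that logic on a made-up p-value matrix. The matrix, the Benjamini-Hochberg method, and the 0.05 cutoff are illustrative assumptions; they are not the package's defaults and do not reflect the structure of `rst`.

```r
# Hypothetical p-values: rows are query clusters, columns are reference clusters.
pmat <- matrix(c(0.62,   0.001, 0.0004,
                 0.002,  0.48,  0.03,
                 0.0001, 0.004, 0.0009),
               nrow = 3, byrow = TRUE,
               dimnames = list(paste0("query_", 1:3), paste0("ref_", 1:3)))

# Adjust all pairwise p-values together. p.adjust() returns a plain vector,
# so restore the matrix shape and names afterwards.
padj <- matrix(p.adjust(pmat, method = "BH"), nrow = nrow(pmat),
               dimnames = dimnames(pmat))

# Failing to reject the null (adjusted p-value above the cutoff) is read as a match.
is_match <- padj > 0.05

# Query clusters with no matching reference cluster stay unassigned.
unassigned <- rownames(is_match)[rowSums(is_match) == 0]
```

With these toy values, query_1 matches ref_1, query_2 matches ref_2, and query_3 is left unassigned.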

Non-zero expression plot

We also provide a supporting function that calculates and plots the "% expressed per marker gene per cluster" for the FRmatch input data object. The percentage is defined as

number of cells that express the marker gene in the cluster / cluster size
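
The definition above can be written out in a few lines of base R. This is a sketch on toy data, not the code behind plot_nonzero(): `pct_expressed()` is a hypothetical helper, and treating any non-zero value as "expressed" is an assumption.

```r
# expr: genes x cells matrix; cluster: one label per cell (column).
pct_expressed <- function(expr, cluster) {
  cluster <- as.factor(cluster)
  # For each cluster, the fraction of its cells with a non-zero value per gene.
  sapply(levels(cluster), function(cl) {
    rowMeans(expr[, cluster == cl, drop = FALSE] > 0)
  })
}

# Toy data: 2 hypothetical marker genes, 5 cells in 2 clusters.
expr <- matrix(c(3, 0, 2, 0, 0,
                 0, 0, 0, 4, 1),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("geneA", "geneB"), paste0("cell", 1:5)))
cluster <- c("c1", "c1", "c1", "c2", "c2")

# Rows are genes, columns are clusters.
pct <- pct_expressed(expr, cluster)
```

Here geneA is expressed in two of the three c1 cells and in no c2 cells, while geneB is expressed in both c2 cells and in no c1 cells.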

The NS-Forest algorithm is designed to select, for each cluster, the minimum set of binary genes with the "best" classification score (measured by F-measure) for differentiating that cluster from all other clusters pooled together. Binary expression is desirable because it is pragmatically important for many downstream use cases of marker genes. We therefore expect few dropouts of the marker genes in the clusters that they mark, which can be checked using

plot_nonzero(sce.example, return.value=FALSE, return.plot=TRUE)
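
To make the F-measure criterion mentioned above concrete, the sketch below scores one gene as a binary marker for one cluster, calling a cell into the cluster when the gene is non-zero. The function name `f_measure()`, the non-zero rule, and the `beta` weight are illustrative assumptions; NS-Forest's own scoring procedure may differ.

```r
# expressed:  logical, TRUE if the gene is non-zero in the cell.
# in_cluster: logical, TRUE if the cell belongs to the cluster of interest.
f_measure <- function(expressed, in_cluster, beta = 1) {
  tp <- sum(expressed & in_cluster)     # marker on, cell in the cluster
  fp <- sum(expressed & !in_cluster)    # marker on, cell elsewhere
  fn <- sum(!expressed & in_cluster)    # marker off in the cluster (dropout)
  if (tp == 0) return(0)
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}

# Toy example: 4 cells in the cluster (one dropout), 4 cells outside (one false call).
expressed  <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)
in_cluster <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
f1 <- f_measure(expressed, in_cluster)
```

Dropouts inside the cluster lower recall, and expression outside the cluster lowers precision, so a marker with few dropouts and little off-target expression scores close to 1.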

Friedman-Rafsky test

We also implemented our own function for the Friedman-Rafsky (FR) test, with customized options and a graphical tool. The FR test is a multivariate generalization of nonparametric two-sample tests. It is a graph-based method built on the minimum spanning tree (MST), which naturally provides a way to visualize high-dimensional clustered data in a two-dimensional plot. Below is a trivial example.

samp1 <- matrix(rnorm(50),nrow=5)
samp2 <- matrix(rnorm(100),nrow=5)
FR.test(samp1, samp2, plot.MST=TRUE, main="Minimum spanning tree plot")

We encourage users to apply this MST graphical tool to visually examine their clusters of interest.

Session info

sessionInfo()


JCVenterInstitute/FRmatch documentation built on Jan. 25, 2020, 8:38 p.m.