In tallulandrews/CycleMix: Cell-Cycle using Mixture Models

require(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)

set.seed(1973)
library("CycleMix")
library("SingleCellExperiment")
library("scater")

Introduction

Droplet-based single-cell RNA sequencing (scRNAseq) enables the assaying of tens of thousands to millions of cells. One important feature of those cells is often which of those cells are actively proliferating and which are not. In addition, we may want to remove the transcriptional signature of this proliferation from our scRNAseq to avoid cells being clustered together simply because they are within the same cell-cycle stage.

This package implements a method for robustly classifying single cells by their cell-cycle phase and flexibly regressing out the transcriptional differences between any combination of phases specified by the user. This enables both complete regression of transcriptional differences related to the cell-cycle or simply regressing out differences between different phases of the cellcycle while preserving differences between proliferating and non-proliferating cells.

Genesets

This package provides some published cell-cycle marker gene sets for human (HGeneSets) and mouse (MGeneSets).

names(HGeneSets)

Sources of the gene sets can be found in the help manual, i.e.

?HGeneSets

You can use your own geneset simply by creating a data.frame of the same format:

head(MGeneSets$Cyclone)

The "Dir" column is used to weight the gene expression we have simplified this to positive (1) and negative (-1) markers, but any numeric value can be used, for instance fold change.

We can convert human cell-cycle genes to mouse orthologs using biomaRt with our provided wrappers. For instance if we want to use the quiescence markers in mouse we would convert them as so:

#require("biomaRt")
#map <- downloadEnsemblData()
mouse_g0 <- HGeneSets$Quiesc
#mouse_g0$Gene <- mapGeneNames(map, mouse_g0$Gene, in.name="symbol", in.org="Hsap", out.name="symbol", out.org="Mmus")
#mouse_g0 <- mouse_g0[mouse_g0$Gene != "",]

Gene sets can also be combined together easily:

mouse_CC <- rbind(mouse_g0, MGeneSets$Cyclone)

Input Data

CycleMix can use a SingleCellExperiment object, a Seurat object, or a matrix or sparse matrix as input. or this tutorial we have provided some example data already formatted correctly. You must normalize the expression data before using CycleMix, but any normalization will work - vst, sctransform, log-cpm, integrated counts etc....

When using a SingleCellExperiment object, CycleMix will automatically look for a column called "feature_symbol" for the gene names that match those in the marker table.

head(rowData(Ex))

You should also have a log-transformed normalized expression matrix in the SingleCellExperiment object, in our case this matrix is called "logcounts":

names(assays(Ex))

ee the SingleCellExperiment or scater packages for details on creating and normalizing data in SingleCellExperiment objects.

Assigning Cells

We can now assign cell-cycle stages to our cells. For our example data we know all the cells are cycling thus won't include the quiescence markers.

SingleCellExperiment Object

output <- classifyCells(Ex, MGeneSets$Cyclone) #### CAUSES ERROR!!!!
print(summary(factor(output$phase)))

Our example data comes from staged cells so we can compare our assignments to the ground truth:

table(factor(output$phase), Ex$cell_type1)

We can also examine the gaussian mixture model fit to each cell-cycle stage:

plotMixture(output$fit[["G2M"]], BIC=TRUE)

The plot on the left shows the 6 different models considered : mixtures of 1-3 gaussian distributions with equal (E) or different (V) variances. The BIC criterion was used to select the optimal model. The plot on the right shows the distribution of expression scores across all cells. The curves of the fitted distributions are plotted on top. The threshold for assigning cells to the stage (if applicable) is indicated with the red dotted line.