cyclone: Cell cycle phase classification

Description Usage Arguments Details Value Description of the score calculation Author(s) References See Also Examples

View source: R/cyclone.R


Classify single cells into their cell cycle phases based on gene expression data.


## S4 method for signature 'ANY'
cyclone(x, pairs, gene.names=rownames(x), iter=1000, min.iter=100, min.pairs=50, 
    BPPARAM=SerialParam(), verbose=FALSE, subset.row=NULL)

## S4 method for signature 'SingleCellExperiment'
cyclone(x, pairs, subset.row=NULL, ..., assay.type="counts", get.spikes=FALSE)



A numeric matrix-like object of gene expression values where rows are genes and columns are cells. Alternatively, a SingleCellExperiment object containing such a matrix.


A list of data.frames produced by sandbag, containing pairs of marker genes.


A character vector of gene names.


An integer scalar specifying the number of iterations for random sampling to obtain a cycle score.


An integer scalar specifying the minimum number of iterations for score estimation.


An integer scalar specifying the minimum number of pairs for cycle estimation.


A BiocParallelParam object to use in bplapply for parallel processing.


A logical scalar specifying whether diagnostics should be printed to screen.


A logical, integer or character scalar indicating the rows of x to use.


Additional arguments to pass to cyclone,ANY-method.


A string specifying which assay values to use, e.g., "counts" or "logcounts".


A logical specifying whether spike-in transcripts should be used.


This function implements the classification step of the pair-based prediction method described by Scialdone et al. (2015). To illustrate, consider classification of cells into G1 phase. Pairs of marker genes are identified with sandbag, where the expression of the first gene in the training data is greater than the second in G1 phase but less than the second in all other phases. For each cell, cyclone calculates the proportion of all marker pairs where the expression of the first gene is greater than the second in the new data x (pairs with the same expression are ignored). A high proportion suggests that the cell is likely to belong in G1 phase, as the expression ranking in the new data is consistent with that in the training data.

Proportions are not directly comparable between phases due to the use of different sets of gene pairs for each phase. Instead, proportions are converted into scores (see below) that account for the size and precision of the proportion estimate. The same process is repeated for all phases, using the corresponding set of marker pairs in pairs. Cells with G1 or G2M scores above 0.5 are assigned to the G1 or G2M phases, respectively. (If both are above 0.5, the higher score is used for assignment.) Cells can be assigned to S phase based on the S score, but a more reliable approach is to define S phase cells as those with G1 and G2M scores below 0.5.

For cyclone,SingleCellExperiment-method, the matrix of counts is used but can be replaced with expression values by setting assay.type. By default, get.spikes=FALSE which means that any rows corresponding to spike-in transcripts will not be considered for score calculation. This is for the same reasons as described in ?sandbag.

Users can also manually set subset.row to specify which rows of x are to be used. This is better than subsetting x directly, as it reduces memory usage and also subsets gene.names at the same time. If this is specified, it will overwrite any setting of get.spikes.

While this method is described for cell cycle phase classification, any biological groupings can be used here – see ?sandbag for details. However, for non-cell cycle phase groupings, the output phases will be an empty character vector. Users should manually apply their own score thresholds for assigning cells into specific groups.


A list is returned containing:


A character vector containing the predicted phase for each cell.


A data frame containing the numeric phase scores for each phase and cell (i.e., each row is a cell).


A data frame containing the row-normalized scores (i.e., where the row sum for each cell is equal to 1).

Description of the score calculation

To make the proportions comparable between phases, a distribution of proportions is constructed by shuffling the expression values within each cell and recalculating the proportion. The phase score is defined as the lower tail probability at the observed proportion. High scores indicate that the proportion is greater than what is expected by chance if the expression of marker genes were independent (i.e., with no cycle-induced correlations between marker pairs within each cell).

By default, shuffling is performed iter times to obtain the distribution from which the score is estimated. However, some iterations may not be used if there are fewer than min.pairs pairs with different expression, such that the proportion cannot be calculated precisely. A score is only returned if the distribution is large enough for stable calculation of the tail probability, i.e., consists of results from at least min.iter iterations.

Note that the score calculation in cyclone is slightly different from that described originally by Scialdone et al. The original code shuffles all expression values within each cell, while in this implementation, only the expression values of genes in the marker pairs are shuffled. This modification aims to use the most relevant expression values to build the null score distribution.


Antonio Scialdone, with modifications by Aaron Lun


Scialdone A, Natarajana KN, Saraiva LR et al. (2015). Computational assignment of cell-cycle stage from single-cell transcriptome data. Methods 85:54–61

See Also



example(sandbag) # Using the mocked-up data in this example.

# Classifying (note:! in real cases)
test <- training 
assignments <- cyclone(test, out)

# Visualizing
col <- character(ncells)
col[is.G1] <- "red"
col[is.G2M] <- "blue"
col[is.S] <- "darkgreen"
plot(assignments$score$G1, assignments$score$G2M, col=col, pch=16)

Example output

Loading required package: BiocParallel
Loading required package: scater
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,, cbind, colMeans, colSums, colnames,,
    duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
    lapply, lengths, mapply, match, mget, order, paste, pmax,,
    pmin,, rank, rbind, rowMeans, rowSums, rownames, sapply,
    setdiff, sort, table, tapply, union, unique, unsplit, which,
    which.max, which.min

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: ggplot2

Attaching package: 'scater'

The following object is masked from 'package:stats':


sandbg> ncells <- 50

sandbg> ngenes <- 20

sandbg> training <- matrix(rnorm(ncells*ngenes), ncol=ncells)

sandbg> rownames(training) <- paste0("X", seq_len(ngenes))

sandbg> is.G1 <- 1:20

sandbg> is.S <- 21:30

sandbg> is.G2M <- 31:50

sandbg> out <- sandbag(training, list(G1=is.G1, S=is.S, G2M=is.G2M))

sandbg> str(out)
List of 3
 $ G1 :'data.frame':	104 obs. of  2 variables:
  ..$ first : chr [1:104] "X1" "X1" "X1" "X1" ...
  ..$ second: chr [1:104] "X2" "X8" "X12" "X16" ...
 $ S  :'data.frame':	67 obs. of  2 variables:
  ..$ first : chr [1:67] "X1" "X1" "X1" "X1" ...
  ..$ second: chr [1:67] "X5" "X6" "X9" "X12" ...
 $ G2M:'data.frame':	79 obs. of  2 variables:
  ..$ first : chr [1:79] "X1" "X1" "X1" "X4" ...
  ..$ second: chr [1:79] "X9" "X10" "X11" "X1" ...

sandbg> # Getting pre-trained marker sets
sandbg> mm.pairs <- readRDS(system.file("exdata", "mouse_cycle_markers.rds", package="scran"))

sandbg> hs.pairs <- readRDS(system.file("exdata", "human_cycle_markers.rds", package="scran"))
     G1     S   G2M
1 0.920 0.154 0.479
2 0.523 0.039 0.211
3 0.002 0.158 0.897
4 0.967 0.447 0.009
5 0.698 0.610 0.384
6 0.972 0.316 0.040
[1] "G1"  "G1"  "G2M" "G1"  "G1"  "G1" 
Warning message:
system call failed: Cannot allocate memory 

scran documentation built on May 21, 2018, 6 p.m.

Related to cyclone in scran...