# Cell cycle phase classification

### Description

Classify single cells into their cell cycle phases based on gene expression data.

### Usage


### Arguments

| Argument | Description |
|---|---|
| `x` | A numeric matrix of gene expression values where rows are genes and columns are cells. Alternatively, a SCESet object containing such a matrix. |
| `pairs` | A list of data.frames produced by `sandbag`, containing the marker pairs for each phase. |
| `gene.names` | A character vector of gene names. |
| `iter` | An integer scalar specifying the number of iterations for random sampling to obtain a cycle score. |
| `min.iter` | An integer scalar specifying the minimum number of iterations for score estimation. |
| `min.pairs` | An integer scalar specifying the minimum number of pairs for cycle estimation. |
| `BPPARAM` | A BiocParallelParam object to use for parallel processing. |
| `verbose` | A logical scalar specifying whether diagnostics should be printed to screen. |
| `subset.row` | A logical, integer or character vector indicating the rows of `x` to use. |
| `...` | Additional arguments to pass to … |
| `assay` | A string specifying which assay values to use. |
| `get.spikes` | A logical scalar specifying whether spike-in transcripts should be used. |

### Details

This function implements the classification step of the pair-based prediction method described by Scialdone et al. (2015).
Consider classification into G1 phase.
Pairs of marker genes are identified with `sandbag`, where the expression of the first gene in the training data is greater than the second in G1 phase but less than the second in all other phases.
For each cell, `cyclone` calculates the proportion of all marker pairs where the expression of the first gene is greater than the second in the new data `x` (pairs with the same expression are ignored).
A high proportion suggests that the cell is likely to belong in G1 phase, as the expression ranking in the new data is consistent with that in the training data.
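
The proportion described above can be illustrated with a minimal base R sketch for a single cell. The helper `pair_proportion`, the gene names and the expression values are all invented for illustration and are not part of `cyclone` itself.

```r
# Hypothetical helper: fraction of marker pairs in which the first gene is
# more highly expressed than the second in one cell. Tied pairs are ignored.
pair_proportion <- function(expr, first, second) {
    diffs <- expr[first] - expr[second]
    informative <- diffs != 0              # drop pairs with identical expression
    if (!any(informative)) return(NA_real_)
    sum(diffs[informative] > 0) / sum(informative)
}

# Toy cell with five genes and four marker pairs (all values invented).
cell <- c(geneA=5, geneB=1, geneC=3, geneD=3, geneE=0)
first <- c("geneA", "geneC", "geneE", "geneA")
second <- c("geneB", "geneD", "geneB", "geneC")

# The geneC/geneD pair is tied and dropped; two of the remaining three pairs
# have the first gene above the second.
prop <- pair_proportion(cell, first, second)
```

A proportion close to 1 would suggest that the cell's expression ranking agrees with the training data for that phase.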

To make the proportions comparable between phases, a distribution of proportions is constructed by shuffling the expression values within the cell and recalculating the proportion at each iteration.
The phase score for that cell is then defined as the lower tail probability of this distribution at the observed proportion.
By default, shuffling is performed `iter` times to obtain the distribution from which the score is estimated.
However, some iterations may not be used if there are fewer than `min.pairs` pairs with different expression, such that the proportion cannot be calculated precisely.
Also, a score is only returned if the distribution is large enough for stable calculation of the tail probability, i.e., consists of results from at least `min.iter` iterations.
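
The shuffling procedure can be sketched in base R as follows. This is an illustration under stated assumptions rather than the actual implementation: `phase_score` is a hypothetical helper, the default values shown for `iter`, `min.iter` and `min.pairs` are assumed, and the use of a strict inequality for the tail probability is a guess.

```r
# Hypothetical sketch of the permutation score for one cell and one phase.
# 'first' and 'second' are integer indices of the genes in each marker pair.
phase_score <- function(expr, first, second,
                        iter=1000, min.iter=100, min.pairs=50) {
    proportion <- function(v) {
        d <- v[first] - v[second]
        keep <- d != 0                               # ignore tied pairs
        if (sum(keep) < min.pairs) return(NA_real_)  # too few informative pairs
        mean(d[keep] > 0)
    }
    observed <- proportion(expr)
    if (is.na(observed)) return(NA_real_)

    # Null distribution: shuffle expression values within the cell.
    null <- replicate(iter, proportion(sample(expr)))
    null <- null[!is.na(null)]                       # unusable iterations are dropped
    if (length(null) < min.iter) return(NA_real_)    # distribution too small

    mean(null < observed)                            # lower tail probability
}

# Toy cell in which the first gene of every pair is clearly higher.
set.seed(1)
expr <- c(rep(10, 60), rep(1, 60)) + runif(120)
score <- phase_score(expr, first=1:60, second=61:120, iter=200)

# With only five pairs, fewer than 'min.pairs' are available, so no score is returned.
too.few <- phase_score(expr, first=1:5, second=61:65, iter=200)
```

In this sketch, iterations with too few informative pairs return `NA` and are excluded, which is why the retained distribution can be smaller than `iter`.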

The same process is repeated for all phases, using the appropriate set of marker pairs in `pairs` for each phase.
Cells with G1 or G2M scores above 0.5 should be assigned to the G1 or G2M phases, respectively.
(If both are above 0.5, the higher score is used for assignment.)
This is based on the interpretation of the score as 1 minus the p-value for the null distribution of proportions.
The null hypothesis here is that expression of the marker genes is independent within each cell, i.e., with no cycle-induced correlations between marker pairs.
Cells can be assigned to S phase based on the S phase score, but a more reliable approach is to define S phase cells based on those cells with G1 and G2M scores below 0.5.
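
The assignment rule described above can be written as a short base R sketch. The function `assign_phase` and the score values are hypothetical, and the handling of scores exactly equal to 0.5 or to each other is an assumption, as the text does not specify it.

```r
# Hypothetical helper applying the 0.5 threshold to G1 and G2M scores.
assign_phase <- function(g1, g2m) {
    phase <- ifelse(g1 >= g2m, "G1", "G2M")  # the higher score wins
    phase[g1 <= 0.5 & g2m <= 0.5] <- "S"     # neither score is above 0.5
    phase
}

# Invented scores for four cells.
scores <- data.frame(G1=c(0.9, 0.2, 0.3, 0.8),
                     G2M=c(0.1, 0.7, 0.4, 0.95))
phases <- assign_phase(scores$G1, scores$G2M)
# Cell 1 is G1, cell 2 is G2M, cell 3 is S, and cell 4 is G2M (both scores
# exceed 0.5, but the G2M score is higher).
```

This sketch does not handle missing scores, which would need separate treatment in real data.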

For `cyclone,SCESet-method`, the matrix of counts is used by default but can be replaced with expression values by setting `assay`.
By default, `get.spikes=FALSE`, which means that any rows corresponding to spike-in transcripts will not be considered for score calculation.
This is for the same reasons as described in `?sandbag`.

Users can also manually set `subset.row` to specify which rows of `x` are to be used.
This is better than subsetting `x` directly, as it reduces memory usage and also subsets `gene.names` at the same time.
If `subset.row` is specified, it will override any setting of `get.spikes`.

### Value

A list is returned containing:

`phases`: A character vector containing the predicted phase for each cell.

`scores`: A data frame containing the numeric phase scores for each phase and cell (i.e., each row is a cell).

`normalized.scores`: A data frame containing the row-normalized scores (i.e., where the row sum for each cell is equal to 1).
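
As an illustration of the row normalization, the following base R sketch divides each row of a scores data frame by its sum. The score values are invented, and this is only one plausible way to obtain a result with the stated property.

```r
# Invented phase scores for two cells (rows are cells, columns are phases).
scores <- data.frame(G1=c(0.8, 0.1), S=c(0.4, 0.3), G2M=c(0.4, 0.6))

# Dividing by the row sums rescales each cell's scores to sum to 1.
normalized <- scores / rowSums(scores)
```

Here the first cell has a row sum of 1.6, so its normalized G1 score is 0.5.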

### Author(s)

Antonio Scialdone, with modifications by Aaron Lun

### References

Scialdone A, Natarajan KN, Saraiva LR et al. (2015).
Computational assignment of cell-cycle stage from single-cell transcriptome data.
*Methods* 85:54–61

### See Also

`sandbag`

### Examples

```r
# Run the sandbag example first; the objects used below (training, out,
# ncells, is.G1, is.S, is.G2M) are expected to come from it.
example(sandbag)

# Classifying (note: test.data != training.data in real cases)
test <- training
assignments <- cyclone(test, out)

# Visualizing: color each cell by its known phase
col <- character(ncells)
col[is.G1] <- "red"
col[is.G2M] <- "blue"
col[is.S] <- "darkgreen"
plot(assignments$scores$G1, assignments$scores$G2M, col=col, pch=16)
```