segmentExpression2CopyNumber: Calling CNVs.

Description Usage Arguments Details Value Author(s) References See Also Examples

View source: R/segmentExpression2CopyNumber.R


Maps single cell expression profiles to copy number profiles.


segmentExpression2CopyNumber(eps, gpc, cn, seed=0, outF=NULL, maxPloidy=8, 
                                       nCores=2, stdOUT="log.applyAR2seg")



Segment-by-cell matrix of expression.


Number of genes expressed per cell.


Average copy number across cells for each segment (i.e. row in eps).


The fraction of entries in a-priori segment-by-cell copy number matrix to be used as seed for association rule mining.


Output file prefix in which to print intermediary heatmaps and histograms, or NULL (default) if no print.


The maximum ploidy to accept as solution.


The numbers of threads used.


Log-file to which standard output is redirected during parallel processing.


Let S := { S_1, S_2, ... S_n } be the set of n genomic segments obtained from bulk DNA-sequencing. Let E_{ij} and G_{ij} be the average number of UMIs and the number of expressed genes per segment i per cell j. The segment-by-cell expression matrix is first normalized by gene coverage. For each x \in S, the linear regression model:

E_{x*} \sim ∑_{i \in S}G_{i*}

, fits the average segment expression per cell onto the cell's overall gene coverage. The model's residuals R_{ij} reflect inter-cell differences in expression per segment that cannot be explained by differential gene coverage per cell. A first approximation of the segment-by-cell copy number matrix CN is given by:

CN_{ij} := R_{ij} * (cn_i / \bar{R_{i*}} )

, where cn_i is the population-average copy number of segment i derived from DNA-seq. Above transformation of E_{ij} into CN_{ij} is in essence a numerical optimization, shifting the distribution of each segment to the average value expected from bulk DNA-seq.

Let x' \in CN be the measured copy number of a given segment-cell pair, and x its corresponding true copy number state. The probability of assigning copy number x to a cell j at locus i depends on:
A. Cell j's read count at locus i, calculated conditional on the measurement x'. Using a Gaussian smoothing kernel, we compute the kernel density estimate of the read counts at locus i across cells to identify the major (M) and the minor (m) copy number states of i as the highest and second highest peak of the fit respectively. Then we calculate the proportion of cells expected at state m as f = \frac{cn_i - M}{m - M} . The probability of assigning copy number x to a cell j at locus i is calculated as:
P_A(x|x') \sim
: 0, if x \notin {m,M}
: P_{ij}(x'|N(m, sd = f)), if x == m
: P_{ij}(x'|N(M, sd = 1-f)), if x == M

B. Cell j's read count at other loci, i.e. how similar the cell is to other cells that have copy number x at locus i. We use Apriori - an algorithm for association rule mining - to find groups of loci that tend to have correlated copy number states across cells. Let V_{i,K \to x} be the set of rules concluding copy number x for locus i, where k \in K are copy number profiles of up to n=4 loci in the form { S_1=x_1, S_2=x_2, ... S_n=x_n }. Further let C_r be the confidence of a rule r \in V_{i,K \to x}. For each cell j \in J matching any of the copy number profiles in K, we calculate:
P_B(x) \sim ∑_{r \in V_{i,K \to x}}C_r
, the cumulative confidence of the rules in support of x at i.

We first obtain a seed of cell-segment pairs by assigning a-priori copy number states only when argmax_{x \in [1,8]} P_A (x|x') > t. We use this seed as input to B. Finally, a-posteriori copy number for segment i in cell j is calculated as:

argmax_{x \in [1,8]} P_A(x|x') + P_B(x)


Segment-by-cell matrix of copy number states.


Noemi Andor


Andor, N.*, Lau, B.*, Catalanotti, C., Kumar, V., Sathe, A., Belhocine, K., Wheeler, T., et al. (2018) Joint single cell DNA-Seq and RNA-Seq of gastric cancer reveals subclonal signatures of genomic instability and gene expression. doi:

Borgelt C & Kruse R. (2002) Induction of Association Rules: Apriori Implementation.

See Also



##Calculate number of genes expressed per each cell:
gpc = apply(epg>0, 2, sum)

##Call function:
cnps = segmentExpression2CopyNumber(eps, gpc, cn, seed=0.5, nCores=2, stdOUT="log")
head(eps[,1:5]); ##Expression of first five cells
head(cnps[,1:5]); ##Copy number of first five cells

noemiandor/liayson documentation built on Dec. 13, 2018, 10:24 p.m.