segmentExpression2CopyNumber: Calling CNVs. In noemiandor/liayson: Linking Singe-Cell Transcriptomes Atween Contemporary Subpopulation Genomes

Description

Maps single cell expression profiles to copy number profiles.

Usage

 1 2 segmentExpression2CopyNumber(eps, gpc, cn, seed=0, outF=NULL, maxPloidy=8, nCores=2, stdOUT="log.applyAR2seg") 

Arguments

 eps Segment-by-cell matrix of expression. gpc Number of genes expressed per cell. cn Average copy number across cells for each segment (i.e. row in eps). seed The fraction of entries in a-priori segment-by-cell copy number matrix to be used as seed for association rule mining. outF Output file prefix in which to print intermediary heatmaps and histograms, or NULL (default) if no print. maxPloidy The maximum ploidy to accept as solution. nCores The numbers of threads used. stdOUT Log-file to which standard output is redirected during parallel processing.

Details

Let S := { S_1, S_2, ... S_n } be the set of n genomic segments obtained from bulk DNA-sequencing. Let E_{ij} and G_{ij} be the average number of UMIs and the number of expressed genes per segment i per cell j. The segment-by-cell expression matrix is first normalized by gene coverage. For each x \in S, the linear regression model:

E_{x*} \sim ∑_{i \in S}G_{i*}

, fits the average segment expression per cell onto the cell's overall gene coverage. The model's residuals R_{ij} reflect inter-cell differences in expression per segment that cannot be explained by differential gene coverage per cell. A first approximation of the segment-by-cell copy number matrix CN is given by:

CN_{ij} := R_{ij} * (cn_i / \bar{R_{i*}} )

, where cn_i is the population-average copy number of segment i derived from DNA-seq. Above transformation of E_{ij} into CN_{ij} is in essence a numerical optimization, shifting the distribution of each segment to the average value expected from bulk DNA-seq.

Let x' \in CN be the measured copy number of a given segment-cell pair, and x its corresponding true copy number state. The probability of assigning copy number x to a cell j at locus i depends on:
A. Cell j's read count at locus i, calculated conditional on the measurement x'. Using a Gaussian smoothing kernel, we compute the kernel density estimate of the read counts at locus i across cells to identify the major (M) and the minor (m) copy number states of i as the highest and second highest peak of the fit respectively. Then we calculate the proportion of cells expected at state m as f = \frac{cn_i - M}{m - M} . The probability of assigning copy number x to a cell j at locus i is calculated as:
P_A(x|x') \sim
: 0, if x \notin {m,M}
: P_{ij}(x'|N(m, sd = f)), if x == m
: P_{ij}(x'|N(M, sd = 1-f)), if x == M

B. Cell j's read count at other loci, i.e. how similar the cell is to other cells that have copy number x at locus i. We use Apriori - an algorithm for association rule mining - to find groups of loci that tend to have correlated copy number states across cells. Let V_{i,K \to x} be the set of rules concluding copy number x for locus i, where k \in K are copy number profiles of up to n=4 loci in the form { S_1=x_1, S_2=x_2, ... S_n=x_n }. Further let C_r be the confidence of a rule r \in V_{i,K \to x}. For each cell j \in J matching any of the copy number profiles in K, we calculate:
P_B(x) \sim ∑_{r \in V_{i,K \to x}}C_r
, the cumulative confidence of the rules in support of x at i.

We first obtain a seed of cell-segment pairs by assigning a-priori copy number states only when argmax_{x \in [1,8]} P_A (x|x') > t. We use this seed as input to B. Finally, a-posteriori copy number for segment i in cell j is calculated as:

argmax_{x \in [1,8]} P_A(x|x') + P_B(x)

Value

Segment-by-cell matrix of copy number states.

Noemi Andor

References

Andor, N.*, Lau, B.*, Catalanotti, C., Kumar, V., Sathe, A., Belhocine, K., Wheeler, T., et al. (2018) Joint single cell DNA-Seq and RNA-Seq of gastric cancer reveals subclonal signatures of genomic instability and gene expression. doi: https://doi.org/10.1101/445932

Borgelt C & Kruse R. (2002) Induction of Association Rules: Apriori Implementation.

apriori
  1 2 3 4 5 6 7 8 9 10 11 ##Calculate number of genes expressed per each cell: data(epg) gpc = apply(epg>0, 2, sum) ##Call function: data(eps) data(segments) cn=segments[rownames(eps),"CN_Estimate"] cnps = segmentExpression2CopyNumber(eps, gpc, cn, seed=0.5, nCores=2, stdOUT="log") head(eps[,1:5]); ##Expression of first five cells head(cnps[,1:5]); ##Copy number of first five cells