segmentExpression2CopyNumber: Calling CNVs.
In liayson: Linking Singe-Cell Transcriptomes Atween Contemporary Subpopulation Genomes

View source: R/segmentExpression2CopyNumber.R

segmentExpression2CopyNumber

R Documentation

Calling CNVs.

Description

Maps single cell expression profiles to copy number profiles.

Usage

segmentExpression2CopyNumber(eps, gpc, cn, seed=0, outF=NULL, maxPloidy=8, 
                                       nCores=2, stdOUT="log.applyAR2seg")

Arguments

`eps`	Segment-by-cell matrix of expression.
`gpc`	Number of genes expressed per cell.
`cn`	Average copy number across cells for each segment (i.e. row in eps).
`seed`	The fraction of entries in a-priori segment-by-cell copy number matrix to be used as seed for association rule mining.
`outF`	Output file prefix in which to print intermediary heatmaps and histograms, or NULL (default) if no print.
`maxPloidy`	The maximum ploidy to accept as solution.
`nCores`	The numbers of threads used.
`stdOUT`	Log-file to which standard output is redirected during parallel processing.

Details

Let S := { S_1, S_2, ... S_n } be the set of n genomic segments obtained from bulk DNA-sequencing. Let E_{ij} and G_{ij} be the average number of UMIs and the number of expressed genes per segment i per cell j. The segment-by-cell expression matrix is first normalized by gene coverage. For each x \in S, the linear regression model:

E_{x*} \sim ∑_{i \in S}G_{i*}

, fits the average segment expression per cell onto the cell's overall gene coverage. The model's residuals R_{ij} reflect inter-cell differences in expression per segment that cannot be explained by differential gene coverage per cell. A first approximation of the segment-by-cell copy number matrix CN is given by:

CN_{ij} := R_{ij} * (cn_i / \bar{R_{i*}} )

, where cn_i is the population-average copy number of segment i derived from DNA-seq. Above transformation of E_{ij} into CN_{ij} is in essence a numerical optimization, shifting the distribution of each segment to the average value expected from bulk DNA-seq.

Let x' \in CN be the measured copy number of a given segment-cell pair, and x its corresponding true copy number state. The probability of assigning copy number x to a cell j at locus i depends on:
A. Cell j's read count at locus i, calculated conditional on the measurement x'. Using a Gaussian smoothing kernel, we compute the kernel density estimate of the read counts at locus i across cells to identify the major (M) and the minor (m) copy number states of i as the highest and second highest peak of the fit respectively. Then we calculate the proportion of cells expected at state m as f = \frac{cn_i - M}{m - M} . The probability of assigning copy number x to a cell j at locus i is calculated as:
P_A(x|x') \sim
: 0, if x \notin {m,M}
: P_{ij}(x'|N(m, sd = f)), if x == m
: P_{ij}(x'|N(M, sd = 1-f)), if x == M

B. Cell j's read count at other loci, i.e. how similar the cell is to other cells that have copy number x at locus i. We use Apriori - an algorithm for association rule mining - to find groups of loci that tend to have correlated copy number states across cells. Let V_{i,K \to x} be the set of rules concluding copy number x for locus i, where k \in K are copy number profiles of up to n=4 loci in the form { S_1=x_1, S_2=x_2, ... S_n=x_n }. Further let C_r be the confidence of a rule r \in V_{i,K \to x}. For each cell j \in J matching any of the copy number profiles in K, we calculate:
P_B(x) \sim ∑_{r \in V_{i,K \to x}}C_r
, the cumulative confidence of the rules in support of x at i.

We first obtain a seed of cell-segment pairs by assigning a-priori copy number states only when argmax_{x \in [1,8]} P_A (x|x') > t. We use this seed as input to B. Finally, a-posteriori copy number for segment i in cell j is calculated as:

argmax_{x \in [1,8]} P_A(x|x') + P_B(x)

Value

Segment-by-cell matrix of copy number states.

Author(s)

Noemi Andor

References

Andor, N.*, Lau, B.*, Catalanotti, C., Kumar, V., Sathe, A., Belhocine, K., Wheeler, T., et al. (2018) Joint single cell DNA-Seq and RNA-Seq of gastric cancer reveals subclonal signatures of genomic instability and gene expression. doi: https://doi.org/10.1101/445932

Borgelt C & Kruse R. (2002) Induction of Association Rules: Apriori Implementation.

Examples

##Calculate number of genes expressed per each cell:
data(epg)
gpc = apply(epg>0, 2, sum)

##Call function:
data(eps)
data(segments)
cn=segments[rownames(eps),"CN_Estimate"]

	cnps = segmentExpression2CopyNumber(eps, gpc, cn, seed=0.5, nCores=2, stdOUT="log")
	head(eps[,1:3]); ##Expression of first three cells
	head(cnps[,1:3]); ##Copy number of first three cells

liayson documentation built on Sept. 30, 2022, 1:07 a.m.