segmentExpression2CopyNumber: Calling CNVs.


View source: R/segmentExpression2CopyNumber.R


Maps single cell expression profiles to copy number profiles.


segmentExpression2CopyNumber(eps, gpc, cn, seed=0, outF=NULL, maxPloidy=8,
                             nCores=2, stdOUT="log.applyAR2seg")



eps: Segment-by-cell matrix of expression.


gpc: Number of genes expressed per cell.


cn: Average copy number across cells for each segment (i.e. each row in eps).


seed: The fraction of entries in the a-priori segment-by-cell copy number matrix to be used as seed for association rule mining.


outF: Output file prefix under which to print intermediary heatmaps and histograms, or NULL (default) to print nothing.


Let S := { S_1, S_2, ... S_n } be the set of n genomic segments obtained from bulk DNA-sequencing. The segment-by-cell expression matrix is first normalized by gene coverage. Let EN_{ij} and G_{ij} be the average number of UMIs and the number of expressed genes per segment j per cell i. The linear regression model:

EN_{*x} \sim \sum_{j \in S} G_{*j}

fits the average segment expression per cell onto the cell's overall expression, for each x \in S. The model's residuals R_{ij} reflect inter-cell differences in expression per segment that cannot be explained by differential gene coverage per cell. A first approximation of the cell-by-segment copy number matrix CN is given by:

CN_{ij} := R_{ij} \cdot (cn_j / \mu_j),

where \mu_j = mean_x(R_{xj}) is the mean residual per segment across cells and cn_j is the population-average copy number of segment j derived from DNA-seq.
The above transformation of EN_{ij} into CN_{ij} is in essence a numerical optimization that shifts the distribution of each segment to the average value expected from bulk DNA-seq.
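
The rescaling step can be illustrated with a minimal base-R sketch. The residual matrix R and the copy numbers cn below are made-up toy values, laid out segment-by-cell like eps; the residuals are chosen positive so that the per-segment mean is non-zero. This is not the package's internal code, only an instance of the formula CN_{ij} := R_{ij} \cdot (cn_j / \mu_j).

```r
# Toy residuals, segment-by-cell (3 segments x 4 cells); values are hypothetical.
R <- matrix(c(0.8, 1.0, 1.2, 1.0,
              1.5, 2.5, 2.0, 2.0,
              0.9, 1.1, 0.7, 1.3),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("S", 1:3), paste0("cell", 1:4)))

# Hypothetical population-average copy number per segment (from bulk DNA-seq).
cn <- c(S1 = 2, S2 = 3, S3 = 1)

# Mean residual per segment across cells.
mu <- rowMeans(R)

# Rescale each segment (row) by cn_j / mu_j. A vector of length nrow(R) is
# recycled down the columns, so row i is multiplied by element i.
CN <- R * (cn / mu)

# After rescaling, the mean of each segment across cells equals cn.
rowMeans(CN)
```

By construction the per-segment mean of CN matches the bulk estimate, while the relative differences between cells within a segment are preserved.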

Let x' \in CN be the measured copy number of a given cell-segment pair, and x its corresponding true copy number state. Further, let CNF be the matrix of assigned copy number states per segment per cell. The probability of assigning copy number x to a cell i at locus j depends on:
A. Cell i's read count at locus j, calculated conditional on the measurement x'. We fit a Gaussian kernel on the read counts at locus j across cells to identify the major (M) and the minor (m) copy number states of j as the highest and second highest peak of the fit, respectively. We then calculate the proportion of cells expected at state m as f = \frac{cn_j - M}{m - M}. The probability of assigning copy number x to a cell i at locus j is then calculated as:
P_A(x|x') \sim
  0, if x \notin \{m, M\}
  P_{ij}(x' | N(m, sd = f)), if x = m
  P_{ij}(x' | N(M, sd = 1-f)), if x = M
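
Step A can be sketched in base R as follows. The measured values x_meas, the bulk copy number cn_j and the helper p_A are hypothetical and chosen only for illustration; the peak detection shown (local maxima of stats::density) is one simple way to find the two highest peaks and is not necessarily how the package does it.

```r
# Toy measured copy numbers x' for one locus across 100 cells (made-up values):
# 70 cells scattered around state 2 and 30 cells around state 3.
x_meas <- c(2 + seq(-0.2, 0.2, length.out = 70),
            3 + seq(-0.2, 0.2, length.out = 30))
cn_j <- 2.3  # hypothetical bulk average copy number of this locus

# Gaussian kernel density fit; peaks are the local maxima of the fitted curve.
d <- density(x_meas, kernel = "gaussian")
peak_idx <- which(diff(sign(diff(d$y))) == -2) + 1
peak_idx <- peak_idx[order(d$y[peak_idx], decreasing = TRUE)]
M <- d$x[peak_idx[1]]  # major state: highest peak
m <- d$x[peak_idx[2]]  # minor state: second highest peak

# Expected proportion of cells at the minor state.
f <- (cn_j - M) / (m - M)

# P_A for the two candidate states; every other state has probability 0.
p_A <- function(x_prime) {
  c(m = dnorm(x_prime, mean = m, sd = f),
    M = dnorm(x_prime, mean = M, sd = 1 - f))
}

names(which.max(p_A(2.9)))  # state favoured for a cell measured at 2.9
names(which.max(p_A(2.1)))  # state favoured for a cell measured at 2.1
```

With these toy values f comes out close to 0.3, so a cell measured near 3 is assigned the minor state and a cell measured near 2 the major state.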

B. Cell i's read count at other loci, i.e. how similar the cell is to other cells that have copy number x at locus j. We use Apriori, an algorithm for association rule mining, to find groups of loci that tend to have correlated copy number states across cells. Let R_{j,K \to x} be the set of rules concluding copy number x for locus j, where k \in K are copy number profiles of up to n=4 loci in the form { S_1=x_1, S_2=x_2, ... S_n=x_n }. Further, let C_r be the confidence of a rule r \in R_{j,K \to x}. For each cell i \in I matching any of the copy number profiles in K, we calculate:
P_B(x) \sim \sum_{r \in R_{j,K \to x}} C_r,
the cumulative confidence of the rules in support of x at j.
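
The cumulative confidence in step B can be sketched in base R. The rules below are hand-written toy rules for a single target locus, not rules mined by Apriori, and the list layout (lhs, rhs, confidence) as well as the helpers matches_profile and P_B are assumptions made for this illustration; in practice a rule miner would produce the rules and their confidences.

```r
# Hypothetical rules for one target locus j. Each rule maps a copy number
# profile of other loci (lhs) to a state of locus j (rhs) with a confidence.
rules <- list(
  list(lhs = c(S1 = 2, S2 = 3), rhs = 3, confidence = 0.9),
  list(lhs = c(S1 = 2),         rhs = 2, confidence = 0.6),
  list(lhs = c(S2 = 3, S4 = 1), rhs = 3, confidence = 0.7),
  list(lhs = c(S4 = 2),         rhs = 2, confidence = 0.8)
)

# Copy number states already assigned to one cell at other loci (toy values).
cell <- c(S1 = 2, S2 = 3, S4 = 1)

# TRUE if every locus in the rule's profile is present in the cell's profile
# with the same state.
matches_profile <- function(lhs, profile) {
  all(names(lhs) %in% names(profile)) && all(profile[names(lhs)] == lhs)
}

# Sum the confidences of all matching rules that conclude state x.
P_B <- function(x, profile, rules) {
  conf <- vapply(rules, function(r) {
    if (r$rhs == x && matches_profile(r$lhs, profile)) r$confidence else 0
  }, numeric(1))
  sum(conf)
}

P_B(3, cell, rules)  # rules 1 and 3 match: 0.9 + 0.7
P_B(2, cell, rules)  # only rule 2 matches: 0.6
```

The fourth rule does not contribute because the cell's state at S4 differs from the one the rule requires.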

We first assign CNF_{ij} := argmax_{x \in [1,8]} P_A(x|x'), but only when P_A(x|x') exceeds a threshold t, to obtain a seed of cell-segment pairs with assigned a-priori copy number states. This seed serves as input to B. Finally, the a-posteriori copy number for segment j in cell i is calculated as:

CNF_{ij} := argmax_{x \in [1,8]} [ P_A(x|x') + P_B(x) ]
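
The two assignment steps can be illustrated for a single cell-segment pair with a small base-R sketch. The values of P_A, P_B and the threshold thr (standing in for t) are made up for this example and do not come from the package.

```r
states <- 1:8  # candidate copy number states

# Hypothetical scores for one cell-segment pair, one entry per state.
P_A <- c(0, 0.56, 0.01, 0, 0, 0, 0, 0)  # evidence from the locus itself
P_B <- c(0, 0.60, 1.60, 0, 0, 0, 0, 0)  # cumulative rule confidence
thr <- 0.5                              # hypothetical value of the threshold t

# A-priori (seed) state: assigned only if the best P_A exceeds the threshold.
seed_state <- if (max(P_A) > thr) states[which.max(P_A)] else NA

# A-posteriori state: argmax over the sum of both sources of evidence.
posterior_state <- states[which.max(P_A + P_B)]

seed_state
posterior_state
```

In this toy case the seed state is 2, but the rules derived from other loci shift the a-posteriori call to 3, which shows how evidence from B can override a call based on A alone.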


Segment-by-cell matrix of copy number states.


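## Assumes liayson is installed and that epg (assumed gene-by-cell expression),
## eps (segment-by-cell expression) and cn (copy number per row of eps)
## are already loaded; their origin is not shown here.
library(liayson)

## Calculate the number of genes expressed per cell:
gpc <- colSums(epg > 0)

## Call function:
cnps <- segmentExpression2CopyNumber(eps, gpc, cn, seed=0.5, nCores=2, stdOUT="log")
head(eps[, 1:5])   ## Expression of first five cells
head(cnps[, 1:5])  ## Copy number of first five cells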

noemiandor/liayson documentation built on Oct. 27, 2018, 12:15 a.m.