View source: R/segmentExpression2CopyNumber.R
segmentExpression2CopyNumber | R Documentation |
Maps single cell expression profiles to copy number profiles.
segmentExpression2CopyNumber(eps, gpc, cn, seed=0, outF=NULL, maxPloidy=8, nCores=2, stdOUT="log.applyAR2seg")
eps |
Segment-by-cell matrix of expression. |
gpc |
Number of genes expressed per cell. |
cn |
Average copy number across cells for each segment (i.e. row in eps). |
seed |
The fraction of entries in a-priori segment-by-cell copy number matrix to be used as seed for association rule mining. |
outF |
Output file prefix in which to print intermediary heatmaps and histograms, or NULL (default) if no print. |
maxPloidy |
The maximum ploidy to accept as solution. |
nCores |
The numbers of threads used. |
stdOUT |
Log-file to which standard output is redirected during parallel processing. |
Let S := { S_1, S_2, ... S_n } be the set of n genomic segments obtained from bulk DNA-sequencing. Let E_{ij} and G_{ij} be the average number of UMIs and the number of expressed genes per segment i per cell j. The segment-by-cell expression matrix is first normalized by gene coverage. For each x \in S, the linear regression model:
E_{x*} \sim ∑_{i \in S}G_{i*}
, fits the average segment expression per cell onto the cell's overall gene coverage. The model's residuals R_{ij} reflect inter-cell differences in expression per segment that cannot be explained by differential gene coverage per cell. A first approximation of the segment-by-cell copy number matrix CN is given by:
CN_{ij} := R_{ij} * (cn_i / \bar{R_{i*}} )
, where cn_i is the population-average copy number of segment i derived from DNA-seq. Above transformation of E_{ij} into CN_{ij} is in essence a numerical optimization, shifting the distribution of each segment to the average value expected from bulk DNA-seq.
Let x' \in CN be the measured copy number of a given segment-cell pair, and x its corresponding true copy number state. The probability of assigning copy number x to a cell j at locus i depends on:
A. Cell j's read count at locus i, calculated conditional on the measurement x'.
Using a Gaussian smoothing kernel, we compute the kernel density estimate of the read counts at locus i across cells to identify the major (M) and the minor (m) copy number states of i as the highest and second highest peak of the fit respectively. Then we calculate the proportion of cells expected at state m as f = \frac{cn_i - M}{m - M} . The probability of assigning copy number x to a cell j at locus i is calculated as:
P_A(x|x') \sim
: 0, if x \notin {m,M}
: P_{ij}(x'|N(m, sd = f)), if x == m
: P_{ij}(x'|N(M, sd = 1-f)), if x == M
B. Cell j's read count at other loci, i.e. how similar the cell is to other cells that have copy number x at locus i. We use Apriori - an algorithm for association rule mining - to find groups of loci that tend to have correlated copy number states across cells. Let V_{i,K \to x} be the set of rules concluding copy number x for locus i, where k \in K are copy number profiles of up to n=4 loci in the form { S_1=x_1, S_2=x_2, ... S_n=x_n }. Further let C_r be the confidence of a rule r \in V_{i,K \to x}. For each cell j \in J matching any of the copy number profiles in K, we calculate:
P_B(x) \sim ∑_{r \in V_{i,K \to x}}C_r
, the cumulative confidence of the rules in support of x at i.
We first obtain a seed of cell-segment pairs by assigning a-priori copy number states only when argmax_{x \in [1,8]} P_A (x|x') > t. We use this seed as input to B. Finally, a-posteriori copy number for segment i in cell j is calculated as:
argmax_{x \in [1,8]} P_A(x|x') + P_B(x)
Segment-by-cell matrix of copy number states.
Noemi Andor
Andor, N.*, Lau, B.*, Catalanotti, C., Kumar, V., Sathe, A., Belhocine, K., Wheeler, T., et al. (2018) Joint single cell DNA-Seq and RNA-Seq of gastric cancer reveals subclonal signatures of genomic instability and gene expression. doi: https://doi.org/10.1101/445932
Borgelt C & Kruse R. (2002) Induction of Association Rules: Apriori Implementation.
apriori
##Calculate number of genes expressed per each cell: data(epg) gpc = apply(epg>0, 2, sum) ##Call function: data(eps) data(segments) cn=segments[rownames(eps),"CN_Estimate"] cnps = segmentExpression2CopyNumber(eps, gpc, cn, seed=0.5, nCores=2, stdOUT="log") head(eps[,1:3]); ##Expression of first three cells head(cnps[,1:3]); ##Copy number of first three cells
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.