library(knitr) opts_chunk$set(fig.width = 12)

The research question is that "Can we reconstruct the lineage of how cells differentiate?". McKenna et al Science (2016) shows the possibility that lineage can be reconstructed through the edited barcode of CRISPR/Cas9 target sites. Scientists can reconstruct the cell lineange based on the information from barcodes of each cell.

The next question is how accurately can we recover the cell lineage? The Allen Institute Cell Lineage Reconstruction DREAM Challenge is hosted to answer the question and search for useful approaches.

We are Il-Youp Kwak and Wuming Going who participated this competition as a team (Kwak_Gong). We would like to share our methods and experience. Hope our findings helpful to everyone! :)

library(DCLEAR) library(phangorn) library(tidyverse) library(ape)

We tried to generate the sequence data, the way Salvador-Martinez and Grillo et al. (2019) described in 'General description of the simulation' from Result section. Basically, it simulate binary structure of cell lineage. For a simple example, given the tree structure of 7 cells below,

1 2 3 4 5 6 7

The barcode of cells generated as

1: '0000000000' 2: 'E00A000000' 3: '0000C00000' 4: 'EA0A000E00' 5: 'E00A000000' 6: 'E0A0CD0000' 7: '0000C---00'

Here, '0' stands for the initial state, '-' stands for interval dropout, and character letter stands for mutational outcomes.

Here is a code to generate simulation data.

m = 30 # number of targets acell = as.integer(rep(1,m)) ## mother cell with an unmutated state mu_d = 0.1 ## mutation rate (ex 0.01 to 0.3) d = 3 ## number of cell division n_s = 5 ## the number of possible mutational outcomes (1:n_s) p_d = 0.05 ## dropout probability nmstrings = c( '0', '-', LETTERS[1:n_s] ) sim_tree = list() sim_tree[[1]] = acell k = 2 while(k < 2^d) { ## codes for point mutation mother_cell = sim_tree[[k%/%2]] mu_loc = runif(m) < mu_d mutation_cites = (mother_cell == 1) & mu_loc n_mut = sum(mutation_cites) if (n_mut > 0) { mother_cell[mutation_cites] = as.integer(sample(n_s, n_mut, replace = T)+2) } ## codes for dropout dropout_cites = runif(m) < p_d if (sum(dropout_cites) > 2 ) { dropout_between = sample(which(dropout_cites), 2 ) mother_cell[dropout_between[1]:dropout_between[2]] = as.integer(2) } sim_tree[[k]] = mother_cell k = k+1 }

Simulated cells are shown below:

1:7 %>% map(~paste(nmstrings[sim_tree[[.]]], collapse=""))

By extending the code, we made our code `sim_seqdata`

for simulating data.

set.seed(1) mu_d1 = c( 30, 20, 10, 5, 5, 1, 0.01, 0.001) mu_d1 = mu_d1/sum(mu_d1) simn = 100 # number of cell samples m = 200 ## number of targets mu_d = 0.03 # mutation rate d = 12 # number of cell division p_d = 0.005 # dropout probability

To generate `r simn`

number of cells with `r m`

number of targets, `r mu_d`

mutation rate, `r d`

number of cell divisions, `r p_d`

dropout probability, and `r length(mu_d1)`

number of outcome states with outcome probability of `r mu_d1`

, we can run the code below.

sD = sim_seqdata(sim_n = simn, m = m, mu_d = mu_d, d = d, n_s = length(mu_d1), outcome_prob = mu_d1, p_d = p_d )

The result of sim_seqdata function, sD, is a list of two object, 'seqs' and 'tree'.

'seqs' is a sequence data of 'phyDat' object,

```
sD$seqs
```

and 'tree' is ground truth tree structure of 'phylo' object.

class(sD$tree)

One easy way to construct a tree is to calculate distance among cells and construct a tree from the distance matrix. One of widely used method for the sequence distance is hamming distance.

distH = dist.hamming(sD$seqs)

With Neighbor Joining or fastme by Desper, R. and Gascuel, O. (2002), we can construct tree from the distance matrix.

TreeNJ = NJ(distH) TreeFM = fastme.ols(distH)

We calculate Robinson-Foulds distance, one of evaluation metric used in the competition, to evaluate performance.

print( RF.dist(TreeNJ, sD$tree, normalize = TRUE) ) print( RF.dist(TreeFM, sD$tree, normalize = TRUE) )

Hamming distance simply count the number of different characters given two sequences. Say we have two sequences '00AB0' and '0-CB0'. Hamming distance is simply 2.

The second position, we have '0' and '-', and the third position, we have 'A' and 'C'. Would it be reasonable to put equal weights for these two differences? If not, what difference would be farther?

To account for this consideration, we proposed weights for '0' (initial state), '-' (interval dropout), ''(point dropout) and each outcome state (A-Z). Say $w_1$ is weight for '0', $w_2$ is for '-', $w_3$ is for '', and $w_a, \cdots, w_z$ are for outcome states A to Z.

So, for the given example, '00AB0' and '0-CB0', weighted hamming distance is $w_1 w_2 + w_a w_c$ .

We can estimate mutation probability for each outcome state from the data by their frequencies. Less frequent outcomes are less likely to observe. Less likely outcomes would corresponds with farther distances. Thus one idea for outcome states is with their information entropy. $w_a = -log P(\text{base is A}), \cdots, w_z = -log P(\text{base is Z})$.

Or, we may assign simple constant weight for $w_a, \cdots, w_z$.

We set weight for initial state, $w_1$, as 1. We can search for weights $w_2$ and $w_3$ that minimize averaged RF score on train set with simple grid search.

The proposed weighted hamming didn't account for interval dropout. It maybe better to consider interval dropout event in distance calculation. Previously, we assigned weights to each target site. Our extended idea is to assign weight on dropout interval. Say we have two sequences '00AB0' and '00- - -'. The Hamming distance score would be 3, and the weighted hamming score would be $w_a w_2 + w_b w_2 + w_1 w_2$. However, this interval dropout maybe one event of interval missing. So, with the new algorithm, weighted hamming distance II is just $w_2^*$.

Here's how we specify weights.

InfoW = -log(mu_d1) ## entropy as their weights InfoW[1:2] = 1 ## weight for initial state is 1 , weight for '-' is set to 1. InfoW[3:7] = 4.5 ## Constant weight for outcome states dist_wh1 = WH(sD$seqs, InfoW) #dist_wh1 = dist_weighted_hamming(sD$seqs, InfoW, dropout = FALSE)

We can construct tree using NJ or fastme.

TreeNJ_wh1 = NJ(dist_wh1) TreeFM_wh1 = fastme.ols(dist_wh1)

RF performance is as below.

print( RF.dist(TreeNJ_wh1, sD$tree, normalize = TRUE) ) print( RF.dist(TreeFM_wh1, sD$tree, normalize = TRUE) )

Here's how we specify weights.

InfoW = -log(mu_d1) ## entropy as their weights InfoW[1] = 1 # weight for initial state is 1 InfoW[2] = 12 # weight for interval dropout, '----', is set to 12. InfoW[3:7] = 3 # Constant weight for outcome states dist_wh2 = WH(sD$seqs, InfoW, dropout=TRUE)

We can construct tree using NJ or fastme.

TreeNJ_wh2 = NJ(dist_wh2) TreeFM_wh2 = fastme.ols(dist_wh2)

RF performance is as below.

print( RF.dist(TreeNJ_wh2, sD$tree, normalize = TRUE) ) print( RF.dist(TreeFM_wh2, sD$tree, normalize = TRUE) )

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.