This function generates simulated data for evaluation purposes. It starts from an underlying multinomial population whose clone frequencies follow a power law in clone rank, with a fixed exponent. For each desired replicate, a fixed number of cells is drawn multinomially from this population. The resulting counts are then subjected to PCR noise (log-normal or Pareto, depending on `pcr.noise.type`) and scaled up to the expected number of reads. Finally, the expected number of reads for each clone is passed through a Poisson process to produce integer read counts. Because of this Poisson "rounding", the resulting read totals do not exactly match the values specified.
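The steps described above can be sketched in base R. This is only an illustrative reconstruction, not the package's implementation: the variable names, the helper `simulate.replicate`, the small problem size, and the choice to apply the noise multiplicatively to each clone's cell count are all assumptions made for the example.

```r
# Illustrative sketch only; not the lymphclon implementation.
set.seed(1)
n <- 1000                      # number of distinct clones (kept small for speed)
power <- -sqrt(2)              # exponent applied to clone rank
cells.taken <- c(200, 500)     # cells sampled in each replicate
reads.wanted <- c(2000, 2000)  # target read depth for each replicate

# Step 1: power-law clone frequencies, normalised to sum to 1.
p <- (1:n)^power
p <- p / sum(p)

# Hypothetical helper: simulate one replicate.
simulate.replicate <- function(cells, reads) {
  # Step 2: multinomial draw of a fixed number of cells.
  cell.counts <- as.vector(rmultinom(1, size = cells, prob = p))
  # Step 3 (assumption): multiplicative log-normal noise per clone.
  noisy <- cell.counts * rlnorm(n, meanlog = 0, sdlog = 1)
  # Step 4: scale so the expected total equals the target read depth.
  expected.reads <- noisy / sum(noisy) * reads
  # Step 5: Poisson draw turns expected reads into integer counts.
  rpois(n, lambda = expected.reads)
}

# Rows are clones, columns are replicates.
read.count.matrix <- mapply(simulate.replicate, cells.taken, reads.wanted)

# Column totals land near, but not exactly on, reads.wanted.
colSums(read.count.matrix)
```

The column totals differ slightly from `reads.wanted` because each entry is an independent Poisson draw around its expected value.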

```
generate.clonal.data(
  n = 2e+07,
  num.cells.taken.vector = c(2000, 5000, 10000, 20000, 50000, 50000),
  read.count.per.replicate.vector = rep(20000, length(num.cells.taken.vector)),
  clonal.distribution.power = -sqrt(2),
  pcr.noise.type = 'pareto',
  pcr.pareto.location = 1,
  pcr.pareto.shape = 1,
  pcr.lognormal.meanlog = 0,
  pcr.lognormal.sdlog = 1)
```

`n`
The true number of distinct clones in the underlying assemblage.

`num.cells.taken.vector`
A vector specifying the number of cells taken in each independent biological replicate.

`read.count.per.replicate.vector`
A vector of the same length as `num.cells.taken.vector`, specifying the number of reads generated from each biological replicate, matched by index.

`clonal.distribution.power`
The true underlying clonal multinomial distribution is proportional to (1:n)^clonal.distribution.power. With the default of -sqrt(2), clone frequency decays as a power law in rank.

`pcr.noise.type`
A string denoting the type of PCR noise: either 'pareto' (default) or 'lognormal'. The package author, Yi Liu, has found anecdotally and empirically that Pareto distributions model sequencing amplification bonanzas much better than lognormal distributions.

`pcr.pareto.location`
The location parameter for the Pareto distribution; matters only if the noise type is 'pareto'.

`pcr.pareto.shape`
The shape parameter for the Pareto distribution; matters only if the noise type is 'pareto'.

`pcr.lognormal.meanlog`
The meanlog parameter for the lognormal distribution; matters only if the noise type is 'lognormal'.

`pcr.lognormal.sdlog`
The sdlog parameter for the lognormal distribution; matters only if the noise type is 'lognormal'.
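To give a feel for why the two noise types behave differently, the sketch below draws from both distributions at the default parameter values and compares their upper quantiles. Base R has no Pareto sampler, so `rpareto.sketch` is a hypothetical helper using inverse-transform sampling for a type I Pareto; whether the package uses this exact parameterisation is an assumption.

```r
# Illustrative comparison of Pareto and lognormal noise; not package code.
set.seed(1)

# Hypothetical helper: inverse-transform sampler for a type I Pareto,
# with CDF F(x) = 1 - (location / x)^shape for x >= location.
rpareto.sketch <- function(k, location = 1, shape = 1) {
  location / runif(k)^(1 / shape)
}

k <- 1e5
pareto.noise <- rpareto.sketch(k, location = 1, shape = 1)
lognormal.noise <- rlnorm(k, meanlog = 0, sdlog = 1)

# The Pareto draws have a far heavier upper tail, which is what lets
# a few clones be amplified far beyond the rest.
probs <- c(0.5, 0.99, 0.9999)
quantile(pareto.noise, probs)
quantile(lognormal.noise, probs)
```

With shape 1 the Pareto distribution has no finite mean, so occasional extreme multipliers are expected rather than exceptional.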

The returned list has the following components:

`read.count.matrix`
A matrix of simulated counts, with rows corresponding to clones (classes, or species) and columns corresponding to biological replicates.

`true.clone.prob`
The underlying simulated assemblage multinomial distribution used to generate `read.count.matrix`.

`true.clonality`
The true clonality score of the underlying simulated assemblage.

Yi Liu (liuyipei@stanford.edu / liu.yi.pei@gmail.com)

```
library(lymphclon)

# n ~ 2e7 is more appropriate for a realistic B cell repertoire
my.data <- generate.clonal.data(n = 2e3)
# infer.clonality gives a consistently improved estimate of clonality
# (the squared 2-norm of the underlying multinomial distribution)
my.lymphclon.results <- infer.clonality(my.data$read.count.matrix)
```
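Clonality, as defined in the example's comment, is the squared 2-norm of the clone frequency vector, which equals the probability that two cells drawn independently from the assemblage belong to the same clone. The following sketch computes it directly for a known distribution; the variable names are illustrative and it does not call the package.

```r
# Illustrative computation of clonality for a known distribution.
n <- 2000

# Power-law clone frequencies, normalised to sum to 1.
p <- (1:n)^(-sqrt(2))
p <- p / sum(p)

# Clonality: the squared 2-norm of the frequency vector.
clonality <- sum(p^2)

# Reference point: a perfectly even assemblage has clonality 1 / n,
# the smallest value possible for n clones.
uniform <- rep(1 / n, n)
uniform.clonality <- sum(uniform^2)

clonality > uniform.clonality  # skewed assemblages score higher
```

Clonality ranges from 1/n for a perfectly even assemblage up to 1 when a single clone makes up the entire population.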

