Genotypes and selection model

Given two loci, A and B, with alleles A and a, and B and b, there must be nine possible gammetes. If the A allele and B allele are selected with coefficients $s_{A}$ and $s_{B}$, and we assume no dominance and no epistasis, we can write the average fitness per genotype. For simplicity we use an haploid model, which would be similar to the diploid homozygotes case (also the case of highly self-fertilizing species):

\ \begin{center} \begin{tabular}{ c | c c} & B & b \ \hline A & $[(1+s_A)(1+s_B)]^{\epsilon} $&$(1+s_{A}) (1-s_{B}) $\
a & $(1- s_{A})(1+ s_{B})$&$ [(1-s_A)(1-s_B)]^{\epsilon} $
\end{tabular} \end{center} \

From a previous experiment, we obtain values of absolute fitness (percentage of survival x number of offspring) denoted as $y$ of length $N$. For those $N$ genotypes there is whole-genome information in the form of biallelic SNPs, a total of $p$ loci. The genotypic information is represented in a genome matrix $X$ of $N$ rows and $p$ columns. The genotypes are represented as -1 for homozygotes of the reference allele and +1 for homozygote of the alternative allele (there are no heterozygotes in the dataset but they could be represented as 0).

Then we can propose that the observed fitness $y$ is a function of the selection coefficients at every locus, the genotypes, and possibly some epistatic term that informs about the interaction of multiple alleles at the same locus. We express such fitness as:

$$ y = w(s,X,e) = \Big[\prod_{i=1}^{p} (1+s_i \odot X_{i}) \Big]^{e\odot |X_{i}|} $$

The probabilistic model

Observed fitness

We can assume the data $y$ follows a Poisson distribution:

$$ y \sim \operatorname{\Gamma}( \alpha, \beta ) $$ $$ \lambda = w(\vec{s},X,e) = \prod_{i=1}^{p} (1+s_i \odot X_{i}) ^e $$

$$f(y) = \frac {\lambda ^{y}e^{-\lambda }}{y!} $$ So the probability of observing the $y$ value of a haplotype $h$ would be:

$$f(y_h) = (y!)^{-1} \times \prod_{i=1}^{p} (1+s_i \odot X_{hi})^{e \ y_h} \times e^{\prod_{i=1}^{p} (1+s_i \odot X_{hi})^e} $$

Selection coefficients

And the selection coefficients a Gamma distribution (typically with $\alpha$ and $beta$ parameters or $\alpha = k$ and $\theta = 1/\beta$):

$$s \sim \Gamma(\alpha, \beta)$$

,where the moments are:

$$ E[s] = \alpha / \beta = k\theta$$ $$ Var[s] = \alpha / \beta^2 = k\theta^2$$

The probability density function of Gamma in terms of $k$ and $\theta$ is:

$$f(s) = \frac{1}{\Gamma(k) \theta^{k}} s^{k-1} e^{-\frac{s}{\theta}}$$

In log likelihood form this can be written as:

$$\ln f(y) = (k-1) \ln(y) {-\frac{y}{\theta}} - \ln \Gamma(k) - k \ln \theta $$

The likelihood function

$$\ell(e,\alpha_s, \beta_s) = \prod_{h\ \in\ haps.} \sum_{m \in K} f(y_h | s_i, e)f(s_i | \alpha_s, \beta_s) $$

Extension where $e$ is a random variable (needs to be worked out)

An important extension of this method would be to instead of estimating a common epistatic term for the whole genome, we infer its ditribution.

To do that we integrate the likelihood over the distribution of e given some parameters $\theta_e$ of such distribution with known probability denstiy function.

$$ \ln \ell(\vec{s},\theta_e ) = \int_{-\inf}^{+\inf} \ln \ell(\vec{s}, e) f(e; \theta_e) de $$

Or for feasibility, the $e$ distribution can be bined in equal sizes and approximate:

$$ \ln \ell(\vec{s},\theta_e ) \approx \sum_{m \in e \ bins} \ln \ell(\vec{s}, e) P(e; \theta_e) $$ The distribution of $e$ should be optimally a conjugate of the likelihood.



MoisesExpositoAlonso/popgensim documentation built on May 7, 2019, 8:57 p.m.