knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

In the "WExT Statistics" vignette, I walked through the statistical theory underlying WExT. Here, I explain the two algorithms Leirson et al. devised to execute the test.

As a reminder, we are trying to solve

$$ \begin{align} \tag{4} \Phi_{WR}(M) = \text{Pr}(T_M \ge t_M | Y_M = \vec{r}_M) \end{align} $$

This is the probability of observing at least $t_M$ mutually exclusive mutations in a gene set $M$ under the proposed model where $Y_i = \sum^{n}{j=1}{X{ij}}$ is a Poisson binomial distributed variable for the number of mutation in $g_i$ ($Y_M = [Y_i]_{i \in M}$).

Recursive formula for the weighted exclusivity test

For any gene sets $M$ of size $k$, assuming ${Y_i}^m_{i=1}$ are mutually independent (i.e. the mutations in one sample do not affect another), equation $(4)$ can be written as

$$ \tag{8} \Phi_{WR}(M) = \frac{\text{Pr}(T_M \ge t_M, Y_M = \vec{r}M)}{\prod{i \in M}{\text{Pr}(Y_i = r_i)}} $$

This is the probability of each more extreme case than $t_M$ divided by the total probability of any assortment of mutations given fixed rows.

A recursive formula for the numerator

Without loss of generality, let $M = {1, ..., k}$.

The joint probability in the numerator of $(8)$ can be found using a recursive formula.

With $F(t_M, r_1, ... , r_k, n) = \text{Pr}(T_M \ge t_M, Y_M = \vec{r}_M)$, the recurrence relationship can be written as

$$ \tag{9} F(t, x_1, ..., x_k, j) = \sum_{\pi \in {0,1}^k} \prod^k_{i=1} q_{ij \pi_i} F(w_{\pi}(t), y_{\pi_1}(x_1), ..., y_{\pi_k}(x_k), j-1) $$

where

$$ \tag{10} q_{ijl}= \begin{cases} p_{ij} & \text{if } l=1 \ 0 & \text{otherwise} \end{cases} $$ $$ \tag{11} w_{\pi}(t)= \begin{cases} t-1 & \text{if } \sum^{k}{i=1}{\pi{i}} = 1 \ t & \text{otherwise} \end{cases} $$

$$ \tag{12} y_{l}(x)= \begin{cases} x-1 & \text{if } l=1 \ x & \text{otherwise} \end{cases} $$

The base cases for $(8)$ are

$$ \tag{13} F(t, x_1, ..., x_k, j) = \begin{cases} 1 & \text{if } t = x_1 = ... = x_k = j = 0 \ 0 & \text{if } \text{min}{ t, x_1, ..., x_k, j } < 0, \text{ or } t > \sum^{k}{i=1}{x_i}, \text{ or } \text{max}^{k}{i=1}x_i > n \end{cases} $$

Explanation

The outer sum of $(9)$ introduces $\pi$, a vector of length $k$ with all values either $0$ or $1$. Thus, the sum runs through all possible sets $\pi$. For instance, if $M$ had three genes (i.e. $k = 3$), then $\pi$ would run through the vectors ${0,0,0}, {1,0,0}, {0,1,0}, {0,0,1}, ..., {1,1,1}$. Each vector is the possible mutations of a sample, where $1$ indicates is mutated and $0$ indicates is not.

Within the sum in $(9)$, there is a product that iterates over each gene in $M$ (indexed as $i \in {1, .., k}$). For each gene, it is multiplying two values: $q_{ij \pi_i}$ and $F(w_{\pi}(t), y_{\pi_1}(x_1), ..., y_{\pi_k}(x_k), j-1)$.

The first value, $q_{ij \pi_i}$, is $p_{ij}$ if $\pi_i = 1$ and $0$ otherwise. Thus, if the $i^{th}$ value of $\pi$ is $1$ (i.e. gene number $i$ is mutated), then the probability of sample $j$ having a mutation in gene $g_i$ is returned. This probability comes from the weight matrix $W$, defined in the "WExT Statistics" vignette, which is the probability of observing a mutation in gene $g_i$ in sample $s_j$

$$ \begin{align} \tag{6} W = [w_{ij}] = \frac{1}{|\Omega|} \sum_{B \in \Omega}{B} \end{align} $$

The function $F$ takes in several parameters. The first is $w_{\pi}(t)$ which is $t-1$ is the sum of all values in the binary vector $\pi$ is equal to $1$, and $t$ otherwise. If the sum of $\pi$ is $1$, this indicates that there is only one gene being considered in that step of the iteration, so the number of mutually exclusive events ($t = t_M$) is reduced by $1$ for the next step in the recursion through $F$. This ensures the number of mutually exclusive events is held constant.

The next set of parameters for $F$ are $y_{\pi}(x)$ for every row-sum (i.e. number of samples with a mutation in gene $g_i$) of genes in $M$. This simply reduces the row-sum by $1$ if the the index in $\pi$ is $1$. This holds the row-sums constant through the recursion.

Finally, $j-1$ reduces the number of samples by $1$ to pass to the next layer in recursion

The base case for $F$ is the final value for the recursion.

A dynamic programming solution for the denominator

The authors claim that a standard dynamic programming method exists for calculating the denominator. It is a Poisson-Binomial probability mass function, so I will need to look into R functions for this.



jhrcook/wext documentation built on May 17, 2021, 1:19 a.m.