# EM: EM algorithm for the negative binomial mixture model

*In johannabertl/SelectionMix: Negative binomial mixture model*

## Description

EM algorithm to estimate the parameters of a negative binomial mixture model with k components.

## Usage

```r
EM(theta.null, x.syn, x.non, cvec, iter, epsilon = NULL, ui, ci)
```

## Arguments

- `theta.null`: Starting value; vector consisting of alpha, beta, p1, ..., p(k-1).
- `x.syn`: Vector of synonymous mutation counts per gene.
- `x.non`: Vector of non-synonymous mutation counts per gene.
- `cvec`: Parameter vector c1, ..., ck.
- `iter`: Maximum number of iterations.
- `epsilon`: Log-likelihood difference stopping criterion. If `NULL`, the algorithm runs for `iter` iterations. Otherwise, it runs until the increase in the log-likelihood is smaller than `epsilon`, but at most for `iter` iterations.
- `ui`: Linear constraint matrix for theta. See Details and Examples.
- `ci`: Linear constraint vector for theta. See Details and Examples.

## Details

The parameters alpha, beta and p1, ..., p(k-1) are estimated, with pk = 1 - p1 - ... - p(k-1). The constants c1, ..., ck are fixed.

In the M step, constrained optimization is carried out with the function `constrOptim`. `ui` and `ci` are defined accordingly (see `constrOptim`): the feasible region is `ui %*% theta - ci >= 0`. It is important that `ui` and `ci` are specified such that p is a probability vector; see the examples.
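As a quick illustration, for a two-component model with theta = (alpha, beta, p1), the constraints alpha > 0, beta > 0 and 0 <= p1 <= 1 can be encoded as below (a sketch; these are the same matrices used in the two-component example):

```r
# theta = (alpha, beta, p1)
ui = rbind(c(1, 0, 0),   # alpha > 0
           c(0, 1, 0),   # beta > 0
           c(0, 0, 1),   # p1 > 0
           c(0, 0, -1))  # -p1 > -1, i.e. p1 < 1
ci = c(0, 0, 0, -1)

# The starting value must lie strictly inside the feasible region:
theta.null = c(5, 1, 0.5)
all(ui %*% theta.null - ci > 0)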

The difference in log-likelihood can be used as a stopping criterion. To make sure that there is always at least a short trajectory for visual checks, the criterion is only applied after at least 5 iterations.
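The stopping rule can be sketched as follows (a toy illustration, not the package's internal code; the shrinking increment stands in for one E and M step):

```r
iter = 25; epsilon = 10^(-8)
loglik = numeric(iter)
loglik[1] = -1000
for (t in 2:iter) {
  # stand-in for one EM iteration: log-likelihood increase shrinks geometrically
  loglik[t] = loglik[t - 1] + 10 * 0.1^t
  # stop once the increase drops below epsilon, but only after 5 iterations
  if (t > 5 && (loglik[t] - loglik[t - 1]) < epsilon) break
}
```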

## Value

- `theta`: Matrix with the trajectories of the parameter values.
- `Q`: Vector with the trajectory of the Q values.
- `loglikelihood`: Vector with the trajectory of the log-likelihood values.
- `convergence`: Vector with the convergence code for each iteration of the M step of the algorithm; see `stats::optim` (0 indicates successful completion).

## Author(s)

Johanna Bertl

## See Also

`rnegbinmix::EM_random` for a wrapper function with randomly drawn starting values.
## Examples

```r
### Example with 2 components ###

# Simulate dataset of synonymous and non-synonymous mutations
c1 = 0.5; c2 = 10
p1 = 0.2; p2 = 0.8
alpha = 10
beta = 5

mutations = rnegbinmix_syn_nonsyn(n = 3000, alpha = alpha, beta = beta,
                                  c = c(c1, c2), p = c(p1, p2))

# function Q
Qtest = Q(theta = c(5, 1, 0.5), theta.prime = c(5, 1, 0.5),
          x.syn = mutations$Syn, x.non = mutations$Non, cvec = c(c1, c2))

# EM algorithm
# Constraints: ui %*% theta - ci >= 0
# alpha > 0, beta > 0, 0 <= p1 <= 1
ui = rbind(c(1, 0, 0), c(0, 1, 0), c(0, 0, 1), c(0, 0, -1))
ci = c(0, 0, 0, -1)

EMres = EM(theta.null = c(5, 1, 0.5), x.syn = mutations$Syn,
           x.non = mutations$Non, cvec = c(c1, c2), iter = 25,
           epsilon = 10^(-8), ui = ui, ci = ci)

matplot(EMres$theta, t = "b")
plot(EMres$loglikelihood, t = "b")

### Example with 2 components and non-integer values ###

# Usually, the mutation counts are integer values. If they have been
# corrected, e.g. for gene length, they might not be integers.

EMres_nonint = EM(theta.null = c(5, 1, 0.5), x.syn = mutations$Syn + 0.25,
                  x.non = mutations$Non + 0.25, cvec = c(c1, c2), iter = 25,
                  epsilon = 10^(-8), ui = ui, ci = ci)

matplot(EMres_nonint$theta, t = "b")
plot(EMres_nonint$loglikelihood, t = "b")

### Example with 3 components ###

c1 = 0.01; c2 = 1; c3 = 100
p1 = 0.33; p2 = 0.33; p3 = 0.34
alpha = 10
beta = 5

mutations = rnegbinmix_syn_nonsyn(n = 3000, alpha = alpha, beta = beta,
                                  c = c(c1, c2, c3), p = c(p1, p2, p3))

# Q
Qtest = Q(theta = c(5, 1, 0.33, 0.33), theta.prime = c(5, 1, 0.33, 0.33),
          x.syn = mutations$Syn, x.non = mutations$Non, c = c(c1, c2, c3))

# EM algorithm
# Constraints: ui %*% theta - ci >= 0
# alpha > 0, beta > 0, 0 <= p1, p2 <= 1, p1 + p2 <= 1
ui = rbind(c(1, 0, 0, 0), c(0, 1, 0, 0), c(0, 0, 1, 0), c(0, 0, 0, 1),
           c(0, 0, -1, 0), c(0, 0, 0, -1), c(0, 0, -1, -1))
ci = c(0, 0, 0, 0, -1, -1, -1)

EMres = EM(theta.null = c(5, 1, 0.33, 0.33), x.syn = mutations$Syn,
           x.non = mutations$Non, cvec = c(c1, c2, c3), iter = 25,
           epsilon = 10^(-8), ui = ui, ci = ci)

matplot(EMres$theta, t = "b")
plot(EMres$loglikelihood, t = "b")
```