EM: EM algorithm for the negative binomial mixture model
In johannabertl/SelectionMix: Negative binomial mixture model

Description Usage Arguments Details Value Author(s) See Also Examples

EM algorithm to estimate the parameters of a negative binomial mixture model with k components.

1	EM(theta.null, x.syn, x.non, cvec, iter, epsilon = NULL, ui, ci)

`theta.null`	Starting value, vector consisting of alpha, beta, p1, ..., pk-1.
`x.syn`	Vector of synonymous mutation counts per gene.
`x.non`	Vector of non-synonymous mutation counts per gene.
`cvec`	Parameter vector c1, ..., ck.
`iter`	Maximum number of iterations.
`epsilon`	Log-likelihood difference stopping criterion. If NULL, the algorithm is run for `iter` iterations. Otherwise, it is run until the increase in the log-likelihood is smaller than `epsilon`, but at most for `iter` iterations.
`ui`	Linear constraints for theta. See details and examples.
`ci`	Linear constraints for theta. See details and examples.

The parameters alpha, beta and p1, ..., pk-1 are estimated. pk = 1 - p1 - ... -pk-1. c1, ..., ck are fixed.

In the M step, constrained optimization is conducted with the function constrOptim. ui and ci are defined accordingly (see constrOptim): ui %*% theta - ci >= 0. It is important that ui and ci are specified such that p is a probability vector, see examples.

As a stopping criterion, the difference in log-likelihood can be used. To make sure that there is always at least a short trajectory for visual checks, the criterion is only used after at least 5 iterations.

theta: Matrix with the trajectories of the parameter values.
Q: Vector with the trajectory of the Q values.
loglikelihood: Vector with the trajectory of the log likelihood values.
convergence: Vector with the convergence code for each iteration of the M-step of the algorithm, see stats::optim (0 indicates successful completion).

Johanna Bertl

rnegbinmix::EM_random for a wrapper function with randomly drawn starting values.

### Example with 2 components ###

# Simulate dataset of synonymous and non-synonymous mutations

c1 = 0.5; c2 = 10
p1 = 0.2; p2 = 0.8
alpha = 10
beta = 5

mutations = rnegbinmix_syn_nonsyn(n=3000, alpha=alpha, beta=beta, c=c(c1, c2), p = c(p1, p2))

# function Q
Qtest = Q(theta = c(5, 1, 0.5), theta.prime = c(5, 1, 0.5), x.syn = mutations$Syn, x.non = mutations$Non, cvec = c(c1, c2))

# EM algorithm

# Constraints: ui %*% theta - ci >= 0
# alpha > 0, beta > 0, 0 <= p1 <=1

ui = rbind(c(1, 0, 0), c(0, 1, 0), c(0, 0, 1), c(0, 0, -1))
ci = c(0, 0, 0, -1)

# EM algorithm
EMres = EM(theta.null = c(5, 1, 0.5), x.syn = mutations$Syn, x.non = mutations$Non, cvec=c(c1, c2), iter=25, epsilon = 10^(-8), ui = ui, ci=ci)
matplot(EMres$theta, t="b")
plot(EMres$loglikelihood, t="b")


### Example with 2 components and non-integer values ###
# Usually, the mutation counts are integer values. If they have been corrected, e. g. for gene length, they might not be integers.

# EM algorithm
EMres_nonint = EM(theta.null = c(5, 1, 0.5), x.syn = mutations$Syn + 0.25, x.non = mutations$Non + 0.25, cvec=c(c1, c2), iter=25, epsilon = 10^(-8), ui = ui, ci=ci)
matplot(EMres_nonint$theta, t="b")
plot(EMres_nonint$loglikelihood, t="b")


### Example with 3 components ####

c1 = 0.01; c2 = 1; c3 = 100
p1 = 0.33; p2 = 0.33; p3 = 0.34
alpha = 10
beta = 5

mutations = rnegbinmix_syn_nonsyn(n=3000, alpha=alpha, beta=beta, c=c(c1, c2, c3), p = c(p1, p2, p3))

# Q

Qtest = Q(theta = c(5, 1, 0.33, 0.33), theta.prime = c(5, 1, 0.33, 0.33), x.syn = mutations$Syn, x.non = mutations$Non, c = c(c1, c2, c3))

# EM algorithm

# Constraints: ui %*% theta - ci >= 0
# alpha > 0, beta > 0, 0 <= p1, p2 <=1, p1 + p2 <= 1

ui = rbind(c(1, 0, 0, 0), c(0, 1, 0, 0), c(0, 0, 1, 0), c(0, 0, 0, 1),
 c(0, 0, -1, 0), c(0, 0, 0, -1), c(0, 0, -1, -1))
ci = c(0, 0, 0, 0, -1, -1, -1)

EMres = EM(theta.null = c(5, 1, 0.33, 0.33), x.syn = mutations$Syn, x.non = mutations$Non, cvec=c(c1, c2, c3), iter=25, epsilon = 10^(-8), ui = ui, ci=ci)
matplot(EMres$theta, t="b")
plot(EMres$loglikelihood, t="b")