EM_smartstart: EM algorithm for the negative binomial mixture model with...
In johannabertl/SelectionMix: Negative binomial mixture model

Description Usage Arguments Details See Also Examples

EM algorithm to estimate the parameters alpha, beta, p1, ..., pk-1 of a negative binomial mixture model with k components.

1 2	EM_smartstart(theta.null, x.syn, x.non, cvec, iter1, iter2, epsilon1 = NULL, epsilon2 = NULL, ui, ci, full.output = F)

`theta.null`	Starting values for the parameters p1, ..., pk-1 in EM1.
`x.syn`	Vector of synonymous mutation counts per gene.
`x.non`	Vector of non-synonymous mutation counts per gene.
`cvec`	Parameters c1, ..., ck.
`iter1`	Maximum number of iterations for EM1.
`iter2`	Maximum number of iterations for EM2.
`epsilon1`	Stopping criterion for EM1. If NULL, the maximum number of iterations is used.
`epsilon2`	Stopping criterion for EM2. If NULL, the maximum number of iterations is used.
`ui`	Constraints on the parameters. See details.
`ci`	Constraints on the parameters. See details.
`full.output`	logical. If T, the results of the ML estimation and EM1 are output. Otherwise, only the output of EM2 is output (in the same format as by the function `EM`).

The parameters alpha and beta are estimated on the synonymous mutations by ML. Then, the parameters p1, ..., pk-1 are estimated by an EM algorithm on the non-synonymous mutations using the estimated alpha and beta as known parameters (EM1). Finally, alpha, beta and p1, ..., pk-1 are used as starting values in another EM algorithm, where all of them are re-estimated (EM2).

For the estimation of alpha and beta on the synonymous mutations, the function MASS::fitdistr is used.

In the M step of both EM algorithms, constrained optimization is conducted with the function constrOptim. ui and ci are defined accordingly (see constrOptim): ui %*% theta - ci >= 0. It is important that ui and ci are specified such that p is a probability vector, see examples. It is not allowed to specify constraints that affect (alpha and beta) and (p1, ..., pk-1) jointly.

As a stopping criterion, the difference in log-likelihood can be used. To make sure that there is always at least a short trajectory for visual checks, the criterion is only used after at least 5 iterations.

The function EM_smartstart_apply can be used to run the algorithm on a list of datasets (e. g. using mclapply for parallel execution).

MASS::fitdistr

### Example with 2 components ###

# Simulate dataset of synonymous and non-synonymous mutations

c1 = 0.5; c2 = 10
p1 = 0.2; p2 = 0.8
alpha = 10
beta = 5

mutations = rnegbinmix_syn_nonsyn(n=3000, alpha=alpha, beta=beta, c=c(c1, c2), p = c(p1, p2))

# EM algorithm

# Constraints: ui %*% theta - ci >= 0
# alpha > 0, beta > 0, 0 <= p1 <=1

ui = rbind(c(1, 0, 0), c(0, 1, 0), c(0, 0, 1), c(0, 0, -1))
ci = c(0, 0, 0, -1)

# EM algorithm
EMtest = EM_smartstart(theta.null = 0.5, x.syn = mutations$Syn, x.non = mutations$Non, cvec=c(c1, c2), iter1=10, iter2=10, epsilon2 = 10^(-8), ui = ui, ci = ci)
matplot(EMtest$theta, t="b")
plot(EMtest$loglikelihood, t="b")



### Example with 4 components ###

c1 = 0.1; c2 = 1; c3 = 10; c4 = 100;
cvec = c(c1, c2, c3, c4)
p1 = 0.2; p2 = 0.6; p3 = 0.1; p4 = 0.1;
pvec = c(p1, p2, p3, p4)
alpha = 1.72396
beta = 89.019

mutations = rnegbinmix_syn_nonsyn(n=3000, alpha=alpha, beta=beta, c=cvec, p = pvec)


# EM algorithm

# Constraints: ui %*% theta - ci >= 0
# alpha > 0, beta > 0, 0 <= p1 <=1

ui = rbind(diag(1, 5),
     cbind(rep(0, 3), rep(0, 3), diag(-1, 3)),
     c(0, 0, rep(-1, 3)))

ci = c(rep(0, 5), rep(-1, 4))

# EM algorithm
EMtest = EM_smartstart(theta.null = c(0.25, 0.25, 0.25), x.syn = mutations$Syn, x.non = mutations$Non, cvec=cvec, iter1=10, iter2=50, epsilon2 = 10^(-8), ui = ui, ci=ci, full.output=T)
matplot(EMtest$EM_final$theta, t="b")
plot(EMtest$EM_final$loglikelihood, t="b")