#' @name MBASIC
#' @title Bayesian clustering model for a state-space matrix.
#'
#' @description This function is designed to analyze general state-space models. The data consists of observations over I units under N experiments with K different conditions. There are S states for each experiment and unit.
#' @param Y An N by I matrix containing the data from N experiments across I observation units.
#' @param Gamma The data for background information. Default: \code{NULL}. See details for more information.
#' @param S An integer for the number of states.
#' @param statemap A vector the same length as the number of mixture components, and taking values from 1 to S representing the states of each component. Default: NULL. See details for more information.
#' @param fac A vector of levels repr1esenting the conditions of each replicate.
#' @param struct A K by J matrix indicating the structures of each cluster.
#' @param J The number of clusters to be identified.
#' @param family The distribution of family to be used. Either "lognormal", "negbin", "binom", "gamma-binom" or "scaled-t". See details for more information.
#' @param method A string for the fitting method, 'MBASIC' (default), 'PE-MC', 'SE-HC', or 'SE-MC'. See details for more information.
#' @param maxitr The maximum number of iterations in the E-M algorithm. Default: 100.
#' @param tol Tolerance for the relative increment in log-likelihood value in checking the algorithm convergence. Default: 1e-10.
#' @param tol.par Tolerance for the relative error in parameter updates in checking the algorithm convergence. Default: 1e-5.
#' @param zeta The initial value for the proportion of units that are not clustered. Default: 0.1. If 0, no singleton cluster is fitted.
#' @param out The file directory for writing fitting information in each E-M iteration. Default: NULL (no information is outputted).
#' @param verbose A boolean variable indicating whether intermediate model fitting metrics should be printed. Default: FALSE.
#' @param min.count The minimum count threshold for each component. This argument is only used when \code{family='negbin'}. If it is a single number, it is the threshold for all components >= 2. If it is a vector, it is the threshold for each component.
#' @param para A list object that contains the true model parameters. Default: NULL. See details for more information.
#' @param initial Either a \code{\link{list}} or \linkS4class{MBASICFit} object that provides initial values for model parameters. Default: NULL.
#' @details
#' MBASIC assumes that there are S underlying states for each expeirment and each loci. A single state may also include multiple mixture components, indexed by m. In total, we can have M mixture components. The mapping from mixture components to the states are provided by \code{statemap}. By default, \code{statemap=NULL}, in which case each state has only one component, and M=S.\cr
#' Function MBASIC currently supports five different distributional families: log-normal, negative binomial, binomial, gamma-binomial and scaled-t distributions. This should be specified by the \code{family} argument.\cr
#' For the log-normal distributions, log(Y+1) is modeled as normal distributions. For experiment n, if locus i has component m, distribution for log(Y[n,i]+1) is N(Mu[n,m]*Gamma[n,i+I(m-1)], Sigma[n,m]).\cr
#' For the negative binomial distributions, the meanings of Mu and Sigma are different. For experiment n, if locus i has component m, distribution of Y[n,i]-min.count[n,m] is NB(Mu[n,m]*Gamma[n,i+I(m-1)], Sigma[n,m]). In this package, NB(mu, a) denotes the negative-binomial distribution with mean mu and size a (i.e. the variance is mu*(1+mu/a)). Notice that if a single value of 'min.count' is provided, it will be converted to a vector of \code{c(0, rep(min.count, M-1))}.\cr
#' The 'min.count' for the negative binomial distribution specifies the minimum enrichment for each replicate and compoment. It will be formed in an N by M matrix, but the function accepts its input as a vector of length N (recommended) or M, or a single value. If it is a single value, it will be used as the common threshold for all replicates and compoments >= 2. If it is a vector of length N, it will be used as the replicate specific thresholds for all components >= 2. If it is a vector of length M, it will be used as the threshold for each component for all replicates. If \code{min.count=NULL}, no threshold is applied.\cr
#' For the binomial distribution, for experiment n, if locus i has component m, distribution for Y[n,i] is Binom(Gamma[n,i], Mu[n,m]).\cr
#' For the gamma-binomial distribution, for experiment n, if locus i has component m, distribution for Y[n,i] is Binom(Gamma[n,i], p) where p follows a gamma prior of gamma(Mu[n,m], Sigma[n,m]).\cr
#' For the scaled-t distribution, for experiment n, if locus i has component m, distribution for Y[n,i]/Gamma[n,i+I(m-1)]/Mu[n,m] is t distribution with Sigma[n,m] degrees of freedom.\cr
#' The \code{Gamma} parameter encodes the background information for all N experiments, I units and M components. It can be a matrix with dimension K by I * M, where the background datum for experiment n, unit i and component m is Gamma[n,i+I*(m-1)]. If in the input \code{Gamma=NULL}, then it is regenerated as a matrix of entries 1 with dimension N x IM. If in the input \code{Gamma} is a N x I matrix, then this function adds I(M-1) columns of all 1s to this matrix.\cr
#' The \code{method} argument determines what fitting method will be used. The default is 'MBASIC', where the states and the clustering are simultaneously estimated. 'SE-HC' and 'SE-MC' methods use 2-step algorithms. In the first step, both estimate the states for each unit by an E-M algorithm for each experiment. In the second step, 'SE-HC' uses hierarchical clustering to cluster the units, while 'SE-MC' uses function \code{\link{MBASIC.state}} to identify clusters.\cr
#' The \code{para} argument takes a list object that is supposed to include the following fields:
#'\tabular{ll}{
#' W \tab A K by (J*S) matrix. The (k,J*(s-1)+j)-th entry is the probability that the units in cluster j has state s in the k-th experiment.\cr
#' Z \tab An I by J matrix. The (i,j)-th entry is the indicator whether the i-th unit belongs to cluster j.\cr
#' Theta \tab A K by (I*S) matrix. The (k,I*(s-1)+i)-th entry is the probability that the i-th unit has state s in the k-th experiment.\cr
#' non.id \tab A binary vector of length I. The i-th entry is the indicator whether the i-th unit does not belong to any cluster.
#' }
#' This argument is intended to carry the true parameters in simulation studies. If it is not null, then the model also computes a number of metrics that describes the error in model fitting. Users should be cautious that the order of the rows and columns of matrices in the fields of para should match the Y matrix.
#' @return An object of class \linkS4class{MBASICFit}.
#' @author Chandler Zuo \email{zuo@@stat.wisc.edu}
#' @examples
#' ## Simulate a dataset
#' dat.sim <- MBASIC.sim(xi = 2, family = "lognormal", I = 1000, fac = rep(1:10, each = 2), J = 3, S = 3, zeta = 0.1)
#' ## Fit the model
#' dat.sim.fit <- MBASIC(Y = dat.sim$Y, S = 3, fac = rep(1:10, each = 2), J = 3, maxitr = 3, para = NULL, family = "lognormal", method = "MBASIC", zeta = 0.1, tol = 1e-6)
#' @useDynLib MBASIC
#' @export
MBASIC <- function(Y, Gamma = NULL, S, fac, J=NULL, maxitr = 100, struct = NULL, para = NULL, family="lognormal", method = "MBASIC", zeta = 0.1, min.count = 0, tol = 1e-10, tol.par = 0.001, out = NULL, verbose = FALSE, statemap = NULL, initial = NULL) {
Mu.init <- Sigma.init <- V.init <- ProbMat.init <- W.init <- Z.init <- b.init <- P.init <- NULL
if(!is.null(initial)) {
if(class(initial) == "list") {
for(var.init in c("Mu", "Sigma", "V", "ProbMat", "W", "Z", "b", "P")) {
if(var.init %in% names(initial)) {
assign(paste(var.init, "init", sep = "."), initial[[var.init]])
}
}
} else if(class(initial) == "MBASICFit") {
for(var.init in c("Mu", "Sigma", "V", "W", "Z", "b", "P")) {
if(var.init %in% names(initial)) {
if(length(slot(initial, var.init)) > 0) {
assign(paste(var.init, "init", sep = "."), slot(initial, var.init))
}
}
}
ProbMat.init <- initial@Theta
} else {
warning("'initial' is not list or class 'MBASICFit'. Its value is skipped.")
}
}
if(verbose) {
write.out(out, "Started")
}
if(! method %in% c("SE-HC", "SE-MC", "PE-MC", "MBASIC")) {
stop("Error: 'method' must be one of 'SE-HC', 'SE-MC', 'PE-MC' or 'MBASIC'.")
}
if(! family %in% c("lognormal", "negbin", "binom", "gamma-binom", "scaled-t")) {
stop("Error: 'family' must be one of 'lognormal', 'negbin', 'binom', 'gamma-binom', 'scaled-t'.")
}
if(is.null(statemap)) {
## states and components have one-to-one matching
statemap <- seq(S)
}
M <- length(statemap)
stateMap <- matrix(0, nrow = M, ncol = S)
stateMap[cbind(seq(M), statemap)] <- 1
## prespecified
K <- length(unique(fac))
facNames <- as.character(unique(fac))
facMap <- seq(K)
names(facMap) <- facNames
fac <- as.character(fac)
fac <- facMap[as.character(fac)]
I <- ncol(Y)
N <- nrow(Y)
if(length(fac) != N)
stop("Error: total number of replicates do not match with the number of rows in Y")
if(length(min.count) == 1) {
min.count <- matrix(min.count, nrow = N, ncol = M)
min.count[, statemap == 1] <- 0
} else if(is.null(min.count)) {
min.count <- matrix(0, nrow = N, ncol = M)
} else if(length(min.count) == N) {
tmp <- min.count
min.count <- matrix(0, nrow = N, ncol = M)
min.count[, statemap != 1] <- tmp
} else if(length(min.count) == M) {
min.count <- matrix(rep(min.count, each = N), ncol = M)
} else if(ncol(min.count) != M | nrow(min.count) != N) {
warning("The dimension of 'min.count' is invalid. No minimum value threshold is applied.")
min.count <- matrix(0, nrow = N, ncol = M)
}
if(family != "negbin" & sum(min.count != 0) > 0) {
warning("'min.count' is only supported for the negative binomial distribution. No minimum count threshold value is set.")
}
if(sum(min.count < 0) > 0) {
warning("Negative values in 'min.count' are reset as 0.")
min.count[min.count < 0] <- 0
}
if(family %in% c("binom", "gamma-binom")) {
if(is.null(Gamma)) {
Gamma <- matrix(apply(Y, 1, max), nrow = N, ncol = I)
} else if(nrow(Gamma) != N | ncol(Gamma) != I) {
stop("Dimension of 'Gamma' must be the same as 'Y'.")
} else {
if(sum(Gamma < Y) > 0) {
warning("Some 'Gamma' is smaller than 'Y'. Those covariates are automatically augmented.")
Gamma[Gamma < Y] <- Y[Gamma < Y]
}
}
Gamma <- matrix(c(Gamma), nrow = N, ncol = I * M)
}
if(is.null(Gamma)) {
Gamma <- matrix(1, nrow = N, ncol = I * M)
} else if(nrow(Gamma) == N & ncol(Gamma) == I) {
Gamma.add <- matrix(apply(Gamma, 1, mean), nrow = N, ncol = I * (M - 1))
Gamma <- cbind(Gamma, Gamma.add)
} else if(nrow(Gamma) != N | ncol(Gamma) != I * M) {
stop("Error: structure of 'Gamma' is not correct. See details.")
}
if(min(Gamma) == 0) {
Gamma <- Gamma + min(Gamma[Gamma > 0])
}
## scale the Gamma matrix if the distribution is not binomial or gamma-binomial
scaleMat <- matrix(1, nrow = N, ncol = M)
if(family %in% c("lognormal", "negbin", "scaled-t")) {
for(m in seq(M)) {
scaleMat[, m] <- apply(Gamma[, I * (m - 1) + seq(I)], 1, mean)
Gamma[, I * (m - 1) + seq(I)] <- Gamma[, I * (m - 1) + seq(I)] / scaleMat[, m]
}
}
if(is.null(struct)) {
if(is.null(J))
stop("Error: either struct or J must not be missing.")
struct <- matrix(seq_len(K), nrow = K, ncol = J)
} else {
if(is.null(J))
J <- ncol(struct)
J <- sum(J)
if(ncol(struct)!= sum(J) | nrow(struct) != K)
stop("Error: the dimension of struct is inconsistent with grouping structure!")
}
if(prod(sort(unique(statemap)) == seq(S)) != 1) {
stop("Error: 'statemap' must consists of values 1 to 'S'.")
}
## design matrix D is K by N
designMap <- matrix(0, nrow = N, ncol = K)
for(k in seq(K)) {
designMap[fac == unique(fac)[k], k] <- 1
}
SampleToExp <- apply(designMap, 1, function(x) which(x==1))
unitMap <- matrix(0, nrow = N * M, ncol = N * S)
for(m in seq(M)) {
rows <- (m - 1) * N + seq(N)
cols <- (statemap[m] - 1) * N + seq(N)
unitMap[cbind(rows, cols)] <- 1
}
numpar <- I * (S - 1) + ## P
J + ## probz and zeta
sum(apply(struct, 2, function(x) length(unique(x)))) * (S - 1) + ## W
N * M + ## Mu
N * (M - S) ## V
if(family != "binom") {
numpar <- numpar + N * M ## Sigma
}
outitr <- 0
totallik <- oldlik <- -Inf
alllik <- allerr <- allzeta <- allmisclass <- matchId1 <- W.err <- matchId2 <- allari <- numeric(0)
maxlik <- -Inf
if(verbose) {
write.out(out, "Initialize parameters")
}
## initialize distributions
V <- Sigma <- Mu <- matrix(0, nrow = N, ncol = M)
if(is.null(Mu.init) | is.null(Sigma.init)) {
InitDist()
} else {
Mu <- Mu.init
Sigma <- Sigma.init
}
b <- rep(0, I)
## initialize the matrices by K-means
## in constructing Z, cluster all locis
ProbMat <- matrix(0, nrow = K * S, ncol = I)
if(M == S) {
V.init <- matrix(1, ncol = M, nrow = N)
}
if(is.null(ProbMat.init) | is.null(V.init)) {
InitStates()
} else {
ProbMat <- ProbMat.init
V <- V.init
Pi <- matrix(0, nrow = K, ncol = S)
for(s in seq_len(S)) {
idx <- seq_len(K) + (s-1) * K
if(length(idx) > 1) {
Pi[,s] <- apply(ProbMat[idx,], 1 ,mean)
} else {
Pi[, s] <- mean(ProbMat[idx, ])
}
}
ProbMat.full <- matrix(0, nrow = M * N, ncol = I)
for(m in seq(M)) {
ProbMat.full[(m - 1) * N + seq(N), ] <- V[, m] * (designMap %*% ProbMat[(statemap[m] - 1) * K + seq(K), ])
}
}
if(!is.null(para)) {
tryCatch({
ProbMat.true <- matrix(0, nrow = S * K, ncol = I)
for(s in seq(S)) {
ProbMat.true[(s - 1) * K + seq(K), ] <- as.numeric(para$Theta == s)
}
W.true <- para$W
Z.true <- para$Z
nonid.true <- para$non.id
Mu.true <- para$Mu
}, error = function(e) {
warning("'para' is invalid and reset as NULL.")
})
}
if(method != "MBASIC") {
## SE-HC, PE-MC or SE-MC method
## em step to estimate Theta
allpar <- c(c(V), c(Mu), c(Sigma), c(Pi))
oldpar <- 0
alllik <- numeric(0)
totallik <- oldlik <- maxlik <- -Inf
conv <- FALSE
for(itr in seq(maxitr)) {
oldlik <- totallik
## M step
UpdateDist()
## E step
UpdateStates()
oldpar <- allpar
allpar <- c(c(V), c(Mu), c(Sigma), c(Pi))
alllik <- c(alllik, totallik)
if(maxlik < totallik) {
assignBest(c("ProbMat", "Mu", "Sigma", "Pi", "V"))
maxlik <- totallik
}
## check for convergence
if(max(na.omit(abs(oldpar / allpar - 1))) < tol.par) {
conv <- TRUE
break
}
if(itr > 10 & oldlik < totallik & totallik - oldlik < tol * abs(oldlik)) {
conv <- TRUE
break
}
if(itr > 100 & which.max(alllik) < itr / 2) {
conv <- TRUE
break
}
} ## finish iteration
getBest(c("ProbMat", "Mu", "Sigma", "Pi", "V"))
Theta <- matrix(-1, nrow = K, ncol = I)
for(k in seq_len(K)) {
idx <- k + K * (seq_len(S) - 1)
Theta[k,] <- apply(ProbMat[idx,], 2, which.max)
}
if(!is.null(para))
allerr <- sqrt(sum((ProbMat - ProbMat.true) ^ 2) / I / K / S)
if(method != "PE-MC") {
ret <- MBASIC.state(Theta, J=J, zeta = zeta, struct = struct, method = method, maxitr = maxitr, tol = tol, tol.par = tol.par, para = para, out = out, W.init = W.init, Z.init = Z.init, P.init = P.init, b.init = b.init)
conv <- (conv & ret@converged)
## Pi is the proportion for components in the k experiment to have state s
## Pi is different from Z. Z is the posterior probability.
Mu.err <- numeric(0)
if(prod(dim(Mu) == dim(Mu.true)) == 1) {
Mu.err <- sqrt(mean((Mu - Mu.true) ^ 2))
}
write.out(out, paste("mis-class rate ", ret@MisClassRate))
write.out(out, paste("Error for W ", round(ret@W.err, 3)))
write.out(out, paste("ARI ", ret@ARI))
write.out(out, paste("loglik", round(tail(ret@alllik, 1), 3), "err", round(allerr, 3)))
write.out(out, paste("Error for Mu", round(Mu.err, 3)))
W <- ret@W
rownames(ProbMat) <- rownames(W) <- rep(facNames, S)
rownames(Mu) <- rownames(Sigma) <- rownames(V) <- facNames[fac]
return(new("MBASICFit",
Theta = ProbMat,
W = W,
Z = ret@Z,
V = V,
b = ret@b,
clustProb = ret@clustProb,
lik = ret@lik,
alllik = ret@alllik,
zeta = ret@zeta,
Mu = Mu / scaleMat,
Sigma = Sigma,
probz = ret@probz,
P = ret@P,
converged = conv,
Theta.err = allerr,
ARI = ret@ARI,
W.err = ret@W.err,
MisClassRate = ret@MisClassRate,
Mu.err = Mu.err,
Iter = ret@Iter
)
)
}
}
## initialize W, Z, b
## ProbMat <- D.rep %*% ProbMat
if(is.null(W.init) | is.null(Z.init)) {
InitWZb()
} else {
W <- W.init
Z <- Z.init
}
if(!is.null(b.init)) {
}
predZ <- Zcond <- Z
## initialize p, probz
if(is.null(P.init)) {
P <- matrix(0, nrow = I, ncol = S)
for(s in seq_len(S)) {
idx <- seq_len(K) + K * (s - 1)
P[, s] <- apply(ProbMat[idx, ], 2, mean)
}
} else {
P <- P.init
}
probz <- apply(rbind(Z, diag(rep(1, J))), 2, mean)
## EM algorithm
## change storage modes for C modules
storage.mode(statemap) <- "integer"
storage.mode(fac) <- "integer"
oldpar <- 0
newpar <- c(c(W), probz, zeta, c(P), c(V), c(Mu), c(Sigma))
alllik <- numeric(0)
totallik <- oldlik <- maxlik <- -Inf
conv <- FALSE
for(outitr in seq_len(maxitr)) {
if(outitr == 1 | method == "MBASIC") {
## only compute the PDF once if the method is PE-MC
PDF <- matrix(0, nrow = N * M, ncol = I)
for(m in seq_len(M)) {
Gamma.m <- Gamma[, (m - 1) * I + seq(I)]
PDF[seq(N) + (m - 1) * N, ] <- logdensity(Y, Mu[, m], Sigma[, m], Gamma.m, family, min.count[, m])
}
PDF <- trimLogValue(PDF)
}
Theta <- matrix(-1, nrow = K, ncol = I)
for(k in seq_len(K)) {
idx <- k + K * (seq_len(S) - 1)
Theta[k,] <- apply(ProbMat[idx,], 2, which.max)
}
## E step
## M step for some parameters
estep.result <- .Call("e_step", W, P, V, zeta, probz, PDF, fac - 1, statemap - 1, package = "MBASIC")
if(FALSE) {
estep.r1 <- estep(W, P, V, zeta, probz, PDF, designMap, stateMap, unitMap)
for(s in names(estep.result)) {
print(s)
print(max(abs(estep.result[[s]] - estep.r1[[s]])))
}
}
## Expected Theta matrix
ProbMat <- estep.result[["Theta"]]
ProbMat.full <- estep.result[["Theta_nu"]]
## Maximizers
zeta <- estep.result[["zeta"]]
W <- estep.result[["W"]]
probz <- estep.result[["probz"]]
predZ <- estep.result[["Z"]]
Zcond <- estep.result[["Zcond"]]
b.prob <- estep.result[["b_prob"]]
V <- estep.result[["V"]]
P <- estep.result[["P"]]
clustOrder <- .orderCluster(W, struct)
W <- W[, clustOrder, drop = FALSE]
W <- .structure(W, struct)
probz <- probz[clustOrder]
Zcond <- Zcond[, clustOrder, drop = FALSE]
predZ <- predZ[, clustOrder, drop = FALSE]
if(method != "PE-MC") {
## skip this step if the method is PE-MC
## M-step for Mu and Sigma
UpdateDist()
}
oldlik <- totallik
totallik <- .Call("loglik", W, P, V, zeta, probz, PDF, fac - 1, statemap - 1, package="MBASIC")
if(verbose) {
write.out(out, paste("itr", outitr, "lik", round(tail(totallik, 1), 2), "zeta", round(zeta, 2)))
}
alllik <- c(alllik, totallik)
allzeta <- c(allzeta, zeta)
if(length(para) > 1 & verbose) {
PrintUpdate()
}
if(maxlik < totallik) {
maxlik <- totallik
assignBest(c("ProbMat", "Theta", "W", "V", "P", "Mu", "Sigma", "zeta", "probz", "predZ", "Zcond", "b.prob"))
}
oldpar <- newpar
newpar <- c(c(W), probz, zeta, c(P), c(V), c(Mu), c(Sigma))
if(max(na.omit(abs(newpar / oldpar - 1))) < tol.par) {
conv <- TRUE
break
}
if(outitr > 10 & oldlik < totallik & totallik - oldlik < tol * abs(oldlik)) {
conv <- TRUE
break
}
if(outitr > 100 & which.max(alllik) < outitr - 100) {
conv <- TRUE
break
}
} ## finish outer loop
getBest(c("ProbMat", "Theta", "W", "V", "P", "Mu", "Sigma", "zeta", "probz", "predZ", "Zcond", "b.prob"))
Mu.err <- W.err <- numeric(0)
if(length(para) > 1)
PrintUpdate()
rownames(ProbMat) <- rownames(W) <- rep(facNames, S)
rownames(Mu) <- rownames(Sigma) <- rownames(V) <- facNames[fac]
new("MBASICFit",
Theta = ProbMat,
W = W,
V = V,
Z = Z,
b = b.prob,
clustProb = cbind(b.prob, Zcond * (1 - b.prob)),
aic = - 2 * tail(maxlik, 1) + 2 * numpar,
bic = - 2 * tail(maxlik, 1) + log(N * I) * numpar,
aicc = -2 * tail(maxlik, 1) + 2 * numpar + 2 * numpar * (numpar + 1) / (N * I - numpar - 1),
alllik = alllik,
lik = maxlik,
zeta = zeta,
Mu = Mu / scaleMat,
Sigma = Sigma,
probz = probz,
P = P,
converged = conv,
Theta.err = allerr,
ARI = tail(allari, 1),
W.err = tail(W.err, 1),
MisClassRate = tail(allmisclass, 1),
Struct = struct,
Mu.err = Mu.err,
Iter = outitr
)
}
InitStates <- function() {
Inherit(c("S", "statemap", "V", "K", "M", "I", "N", "Y", "Sigma", "Mu", "family", "unitMap", "designMap", "stateMap", "Gamma", "min.count"))
## initialize V
for(s in seq(S)) {
ids <- which(statemap == s)
V[, ids] <- 1 / length(ids)
}
## initialize ProbMat
totalF <- matrix(0, nrow = K, ncol = I)
F1 <- matrix(0, nrow = K * S, ncol = I)
totalF.full <- matrix(0, nrow = N, ncol = I)
F1.full <- matrix(0, nrow = N * M, ncol = I)
for(m in seq(M)) {
Gamma.m <- Gamma[, I * (m - 1) + seq(I)]
idx <- (m - 1) * N + seq_len(N)
F1.full[idx,] <- logdensity(Y, Mu[, m], Sigma[, m], Gamma.m, family, min.count[, m])
}
F1.full <- trimLogValue(F1.full)
F1.full <- exp(F1.full)
## convert into (NS) x I
F1.tmp <- log(crossprod(unitMap, F1.full))
## initialize ProbMat.full
for(m in seq(M)) {
idx <- (m - 1) * N + seq(N)
totalF.full <- totalF.full + F1.full[idx, ]
}
totalF.full <- t(matrix(rep(c(t(totalF.full)), M), nrow = I))
ProbMat.full <- F1.full / totalF.full
ProbMat.full[totalF.full == 0] <- 1 / M
ProbMat.full <- trimProbValue(ProbMat.full)
## convert to KS by I
for(s in seq(S)) {
idx <- (s - 1) * K + seq_len(K)
F1[idx, ] <- exp(crossprod(designMap, F1.tmp[(s - 1) * N + seq_len(N), ]))
totalF <- totalF + F1[idx, ]
}
totalF <- t(matrix(rep(c(t(totalF)), S), nrow = I))
ProbMat <- F1 / totalF
ProbMat[totalF == 0] <- 1/S
ProbMat <- trimProbValue(ProbMat)
Pi <- matrix(rep(apply(stateMap, 2, sum) / M, each = K), nrow = K, ncol = S)
Return(c("ProbMat.full", "ProbMat", "V", "Pi"))
}
InitDist <- function() {
Inherit(c("M", "Y", "family", "Mu", "Sigma", "Gamma", "min.count"))
Gamma.m <- Gamma[, seq(ncol(Y))]
for(m in seq(M)) {
m3 <- NULL
thr <- min.count[, m]
if(family == "lognormal") {
Y.sec <- c(Y)[c(Y) <= quantile(c(Y), m / M) & c(Y) >= quantile(c(Y), (m - 1) / M)]
Y.sec <- log(Y.sec + 1)
} else if(family == "negbin") {
Y.sec <- c(Y - thr)[c(Y - thr) < quantile(c(Y - thr), m / M) & c(Y - thr) > quantile(c(Y - thr), (m - 1) / M)]
Y.sec[Y.sec < 0] <- 0
## Y.sec <- Y.sec - thr
Gamma.sec <- c(Gamma * Gamma)[c(Y - thr) < quantile(c(Y - thr), m / M) & c(Y - thr) > quantile(c(Y - thr), (m - 1) / M)]
m3 <- mean(Gamma.sec)
} else if(family == "scaled-t") {
Y.sec <- abs(Y)
Y.sec <- c(Y.sec)[c(Y.sec) < quantile(c(Y.sec), m / M) & c(Y.sec) > quantile(c(Y.sec), (m - 1) / M)]
} else if(family == "gamma-binom") {
## gamma-binomial distribution
ratio <- Y / Gamma.m
ratio[Gamma.m == 0] <- mean(Y[Gamma.m > 0] / Gamma.m[Gamma.m > 0])
Y.sec <- c(ratio)[ratio <= quantile(ratio, m / M) & ratio >= quantile(ratio, (m - 1) / M)]
} else {
ratio <- Y / Gamma.m
id1 <- which(Gamma.m > 0)
ratio <- ratio[id1]
id <- which(ratio <= quantile(ratio, m / M) & ratio >= quantile(ratio, (m - 1) / M))
Y.sec <- sum(Y[id1[id]]) / sum(Gamma.m[id1[id]])
}
m1 <- mean(Y.sec)
m2 <- var(Y.sec)
MomentEstimate()
}
Return(c("Mu", "Sigma"))
}
UpdateDist <- function() {
Inherit(c("M", "N", "family", "Y", "ProbMat.full", "Mu", "Sigma", "Gamma", "min.count"))
I <- ncol(Y)
for(m in seq_len(M)) {
idx <- seq(N) + (m - 1) * N
Gamma.m <- Gamma[, I * (m - 1) + seq(I)]
thr <- min.count[, m]
m3 <- NULL
if(family == "lognormal") {
m1 <- apply(log(Y + 1) * ProbMat.full[idx, ], 1, sum) / apply(Gamma.m * ProbMat.full[idx, ], 1, sum)
m2 <- apply((log(Y + 1) - m1 * Gamma.m) * (log(Y + 1) - m1 * Gamma.m) * ProbMat.full[idx, ], 1, sum) / apply(ProbMat.full[idx, ], 1, sum)
} else if(family == "negbin"){
## negative binomial family
m1 <- apply((Y - thr) * ProbMat.full[idx, ], 1, sum) / apply(Gamma.m * ProbMat.full[idx, ], 1, sum)
m2 <- apply((Y - thr) * (Y - thr) * ProbMat.full[idx, ], 1, sum) / apply(ProbMat.full[idx, ], 1, sum)
m3 <- apply(Gamma.m * Gamma.m * ProbMat.full[idx, ], 1, sum) / apply(ProbMat.full[idx, ], 1, sum)
} else if(family == "scaled-t"){
## scaled-t family
m1 <- apply(abs(Y / Gamma.m) * ProbMat.full[idx, ], 1, sum) / apply(ProbMat.full[idx, ], 1, sum)
m2 <- apply(Y * Y / Gamma.m / Gamma.m * ProbMat.full[idx, ], 1, sum) / apply(ProbMat.full[idx, ], 1, sum)
} else if(family == "gamma-binom") {
## gamma-binomial distribution
ratio <- Y / Gamma.m
ratio[Gamma.m == 0] <- mean(Y[Gamma.m > 0] / Gamma.m[Gamma.m > 0])
m1 <- apply(ratio * ProbMat.full[idx, ], 1, sum) / apply(ProbMat.full[idx, ], 1, sum)
m2 <- apply(ratio * ratio * ProbMat.full[idx, ], 1, sum) / apply(ProbMat.full[idx, ], 1, sum)
} else {
## binomial
m1 <- apply(Y * ProbMat.full[idx, ], 1, sum) / apply(Gamma.m * ProbMat.full[idx, ], 1, sum)
m2 <- NULL
}
MomentEstimate()
}
## order the means
Mu <- Mu + min.count
od <- apply(Mu, 1, order)
Mu <- matrix(Mu[cbind(rep(seq_len(N), each = M), c(od))], ncol = M, byrow = TRUE)
Mu <- Mu - min.count
Sigma <- matrix(Sigma[cbind(rep(seq_len(N), each = M), c(od))], ncol = M, byrow = TRUE)
Return(c("Mu", "Sigma"))
}
UpdateStates <- function() {
Inherit(c("K", "S", "I", "M", "N", "Y", "Gamma", "Mu", "Sigma", "family", "V", "unitMap", "designMap", "stateMap", "Pi", "statemap", "out", "verbose", "min.count"))
F1 <- matrix(0, nrow = K * S, ncol = I)
F1.full <- matrix(0, nrow = N * M, ncol = I)
## compute F1
## recompute replicate density
for(m in seq(M)) {
idx <- (m - 1) * N + seq_len(N)
Gamma.m <- Gamma[, seq(I) + I * (m - 1)]
F1.full[idx, ] <- exp(logdensity(Y, Mu[, m], Sigma[, m], Gamma.m, family, min.count[, m])) * V[, m]
}
F1.tmp <- F1.full
## convert to (NS) x I
F1.full <- log(crossprod(unitMap, F1.full))
for(s in seq(S)) {
F1[(s - 1) * K + seq_len(K), ] <- crossprod(designMap, F1.full[(s - 1) * N + seq(N), ]) + log(Pi[, s])
}
F1 <- trimLogValue(F1)
totalF <- matrix(0, nrow = K, ncol = I)
for(s in seq_len(S)) {
idx <- seq_len(K) + (s-1) * K
F1[idx,] <- exp(F1[idx,])
totalF <- totalF + F1[idx,]
}
totalF <- t(matrix(rep(c(t(totalF)), S), nrow = I))
ProbMat <- F1 / totalF
ProbMat <- trimProbValue(ProbMat)
## joint likelihood for each replicate and component
totallik <- sum(log(totalF))
if(verbose)
write.out(out, paste("Likelihood for component estimation: ", round(totallik, 3), sep = ""))
## ProbMat.full
ProbMat.full <- F1.tmp
for(s in seq(S)) {
Vwei <- 0
for(m in which(statemap == s)) {
Vwei <- Vwei + ProbMat.full[(m - 1) * N + seq(N), ]
}
for(m in which(statemap == s)) {
ProbMat.full[(m - 1) * N + seq(N), ] <- ProbMat.full[(m - 1) * N + seq(N), ] / Vwei
}
}
ProbMat.aug <- matrix(0, nrow = N * M, ncol = I)
for(m in seq(M)) {
ProbMat.aug[(m - 1) * N + seq(N), ] <- designMap %*% ProbMat[(statemap[m] - 1) * K + seq(K), ]
}
ProbMat.aug <- trimProbValue(ProbMat.aug)
ProbMat.full <- ProbMat.full * ProbMat.aug
ProbMat.full <- trimProbValue(ProbMat.full)
## update Pi
for(s in seq_len(S)) {
idx <- seq_len(K) + (s-1) * K
if(length(idx) > 1) {
Pi[,s] <- apply(ProbMat[idx,], 1 ,mean)
} else {
Pi[, s] <- mean(ProbMat[idx, ])
}
}
## update V
EV <- ProbMat.aug * c(V) + ProbMat.full
V.new <- matrix(apply(EV, 1, sum), nrow = N)
V.new <- V.new / tcrossprod(V.new %*% stateMap, stateMap)
Return(c("V", "Pi", "ProbMat", "ProbMat.full", "totallik"))
}
PrintUpdate <- function() {
Inherit(c("ProbMat", "ProbMat.true", "I", "K", "S", "J", "W", "W.true", "Z.true", "nonid.true", "b.prob", "Zcond", "allmisclass", "W.err", "allari", "Mu", "Mu.true", "totallik", "out", "verbose"))
allerr <- sqrt(sum((ProbMat - ProbMat.true) ^ 2) / I / K / S)
## compute misclassification rate
W.f <- matrix(0, nrow = K * S, ncol = J)
for(s in seq_len(S))
W.f[s + S * seq(0, K - 1),] <- W[seq_len(K) + K * (s - 1),]
mc <- matchCluster(W.f, W.true, Zcond, Z.true, b.prob, nonid.true)
allmisclass <- c(allmisclass, mc$mcr)
W.err <- c(W.err, mc$W.err)
allari <- c(allari, mc$ari)
Mu.err <- numeric(0)
if(prod(dim(Mu) == dim(Mu.true)) == 1) {
Mu.err <- sqrt(mean((Mu - Mu.true) ^ 2))
}
if(verbose) {
write.out(out, paste("mis-class rate ", mc$mcr))
write.out(out, paste("Error for W ", round(mc$W.err, 3)))
write.out(out, paste("ARI ", mc$ari))
write.out(out, paste("loglik", totallik, "err", round(allerr, 3)))
write.out(out, paste("Error for Mu", round(Mu.err, 3)))
}
Return(c("allerr", "allari", "allmisclass", "W.err", "Mu.err"))
}
Inherit <- function(vnames = NULL) {
if(is.null(vnames)) {
vnames <- ls(envir = parent.frame(2))
}
for(v in vnames) {
## do not use get() since it will cause error for non-defined variables
assign(v, parent.frame(2)[[v]], envir = parent.frame())
}
}
Return <- function(vnames = NULL) {
if(is.null(vnames)) {
vnames <- ls(envir = parent.frame())
}
for(v in vnames) {
## do not use get() since it will cause error for non-defined variables
assign(v, parent.frame()[[v]], envir = parent.frame(2))
}
}
logdensity <- function(y, mu, sigma, gamma, family, thr) {
if(family == "lognormal") {
y <- log(y + 1)
return(-(y - mu * gamma) ^ 2 / sigma / 2 - log(sigma) / 2 - log(2 * pi) / 2 )
} else if(family == "scaled-t") {
return(dt(y / mu / gamma, df = sigma, log = TRUE) - log(mu * gamma))
} else if(family == "negbin") {
return(dnbinom(y - thr, mu = mu * gamma, size = sigma, log = TRUE))
} else if(family == "gamma-binom") {
a <- mu / (1 - mu) * sigma
b <- sigma
return(
log(beta(a + y, gamma - y + b)) -
log(beta(a, b)) + log(choose(gamma, y))
)
} else {
return(dbinom(y, prob = mu, size = gamma, log = TRUE))
}
}
MomentEstimate <- function() {
Inherit(c("m1", "m2", "m3", "Mu", "Sigma", "family", "m", "thr"))
if(family == "lognormal") {
Mu[, m] <- m1
m2[m2 < 0.01] <- 0.01
Sigma[, m] <- m2
} else if(family == "scaled-t") {
func <- function(df) {
g1 <- gamma((df - 1) / 2)
g2 <- gamma(df / 2)
(df - 2) * g1 * g1 / g2 / g2 / pi
}
solve <- function(v) {
lower <- 2.01
upper <- 100
fun.up <- func(upper)
fun.low <- func(lower)
if(v >= fun.up) {
return(upper)
} else if(v <= fun.low) {
return(lower)
} else {
while(abs(fun.up - fun.low) > 0.01 & abs(upper - lower) > 0.01) {
med <- (upper + lower) / 2
fun.med <- func(med)
if(fun.med > v + 0.005) {
upper <- med
fun.up <- fun.med
} else if(fun.med < v - 0.005){
lower <- med
fun.low <- fun.med
} else {
return(med)
}
}
return((upper + lower) / 2)
}
}
## df
Sigma[, m] <- sapply(m1 * m1 / (m2 + m1 * m1), solve)
## scale
Mu[, m] <- sqrt((m2 + m1 * m1) * (Sigma[, m] - 2) / Sigma[, m])
} else if(family == "negbin") {
m1[m1 < 0.001] <- 0.001
Mu[, m] <- m1
m2 <- (m1 * m1 * m3) / (m2 - m1 * m1 - m1 * m1 * m3)
m2[m2 < 0] <- 100
Sigma[, m] <- m2
} else if(family == "gamma-binom") {
m2 <- m2 - m1 * m1
a.plus.b <- 1 / (m1 * (1 - m1)) - 1
a <- m1 * a.plus.b
b <- a.plus.b - a
a[a < 0.01] <- 0.01
b[b < 0.01] <- 0.01
Mu[, m] <- a / (a + b)
Sigma[, m] <- b
} else {
Mu[, m] <- m1
Sigma[, m] <- 0
}
Return(c("Mu", "Sigma"))
}
assignBest <- function(varnames) {
for(v in varnames) {
bestname <- paste("best", v, sep = "")
assign(bestname, get(v, envir = parent.frame()), envir = parent.frame())
}
}
getBest <- function(varnames) {
for(v in varnames) {
bestname <- paste("best", v, sep = "")
assign(v, get(bestname, envir = parent.frame()), envir = parent.frame())
}
}
InitWZb <- function() {
Inherit(c("ProbMat", "zeta", "I", "J", "S", "K", "struct"))
km.fit <- NULL
try(km.fit <- kmeans(t(ProbMat), centers = J))
if(is.null(km.fit)) {
## use Kmeans++ initialization
km.fit <- kmeanspp(ProbMat, J)
}
Z <- matrix(0, nrow = I, ncol = J)
Z[cbind(seq(I), km.fit$cluster)] <- 1
W <- t(km.fit$centers)
d <- apply(ProbMat - tcrossprod(W, Z), 1, function(x) sum(x * x))
thr <- quantile(d, 1 - zeta)
id <- which(d < thr)
b <- rep(1, I)
b[id] <- 0
clustOrder <- .orderCluster(W, struct)
W <- W[, clustOrder, drop = FALSE]
Z <- Z[, clustOrder, drop = FALSE]
W <- .structure(W, struct)
b.prob <- b
Return(c("W", "Z", "b.prob"))
}
#' @name MBASIC.full
#' @title Simultaneously fitting MBASIC model for different numbers of clusters.
#' @description Simultaneously fitting MBASIC model for different numbers of clusters.
#' @param struct A list of K by J matrix indicating the structures of each cluster. Default: NULL.
#' @param J A vector of the numbers of clusters to be identified.
#' @param ncores The number of CPUs to be used in parallele. Default: 1.
#' @param out The prefix of the file names to write model output information. Default: NULL, when the information is written to the R output.
#' @param ... Other parameters that are passed to \code{\link{MBASIC}}.
#' @details
#' This function fits MBASIC models with different numbers of clusters and/or different structures in parallel.
#' @return A list object including the following fields:
#' \tabular{ll}{
#' allFits \tab A list of \linkS4class{MBASICFit} objects with different numbers of clusters and/or structures.\cr
#' BestFit \tab The best model with minimum BIC.\cr
#' Time \tab The time used for the model fitting.\cr}
#' @author Chandler Zuo \email{zuo@@stat.wisc.edu}
#' @seealso \code{\link{MBASIC}}
#' @examples
#' ## Simulate a dataset
#' dat.sim <- MBASIC.sim(xi = 2, family = "lognormal", I = 1000, fac = rep(1:10, each = 2), J = 3, S = 3, zeta = 0.1)
#' ## Fit the model
#' dat.sim.fit <- MBASIC.full(Y = dat.sim$Y, S = 3, fac = rep(1:10, each = 2), J = 3:5, maxitr = 3, para = NULL, family = "lognormal", method = "MBASIC", zeta = 0.1, tol = 1e-6, tol.par = 0.001)
#' @useDynLib MBASIC
#' @import foreach
#' @export
MBASIC.full <- function(J=NULL, struct = NULL, out = NULL, ncores = 1, ...) {
t0 <- Sys.time()
allJs <- J
allstructs <- struct
if(is.null(allstructs)) {
nmodels <- length(allJs)
} else {
nmodels <- length(allstructs)
}
if(nmodels == 0) {
stop("Error: at least J or struct must be not NULL.")
}
startParallel(ncores)
allfits <- foreach(i = seq(nmodels)) %dopar% {
if(!is.null(allstructs)) {
struct <- allstructs[[i]]
}
if(!is.null(allJs)) {
J <- allJs[i]
}
if(!is.null(out)) {
out <- paste(out, "model", i, ".txt", sep = "")
}
MBASIC(J = J, struct = struct, out = out, ...)
}
endParallel()
bestBIC <- Inf
bestFit <- NULL
for(fit in allfits) {
if(fit@bic < bestBIC) {
bestBIC <- fit@bic
bestFit <- fit
}
}
return(list(allFits = allfits, BestFit = bestFit,
Time = as.numeric(Sys.time() - t0, units = "secs")))
}
kmeanspp <- function(dat, J) {
dist.mat <- matrix(Inf, nrow = ncol(dat), ncol = J)
centers <- matrix(0, nrow = nrow(dat), ncol = J)
## centers[, 1] <- apply(dat, 1, mean)
centers[, 1] <- dat[, sample(seq(ncol(dat)), 1)]
dist.mat[, 1] <- apply(dat, 2, function(x) sum((x - centers[, 1]) ^ 2))
for(j in seq(J)[-1]) {
centers[, j] <- dat[, sample(seq(ncol(dat)), 1, prob = apply(dist.mat, 1, min))]
dist.mat[, j] <- apply(dat, 2, function(x) sum((x - centers[, j]) ^ 2))
}
return(list(cluster = apply(dist.mat, 1, which.min),
centers = t(centers)))
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.