MBASIC: Bayesian clustering model for a state-space matrix.


Description

This function is designed to analyze general state-space models. The data consists of observations over I units under N experiments with K different conditions. There are S states for each experiment and unit.

Usage

MBASIC(Y, Gamma = NULL, S, fac, J = NULL, maxitr = 100, struct = NULL,
  para = NULL, family = "lognormal", method = "MBASIC", zeta = 0.1,
  min.count = 0, tol = 1e-10, tol.par = 0.001, out = NULL,
  verbose = FALSE, statemap = NULL, initial = NULL)

Arguments

Y

An N by I matrix containing the data from N experiments across I observation units.

Gamma

The data for background information. Default: NULL. See details for more information.

S

An integer for the number of states.

fac

A vector of levels representing the conditions of each replicate.

J

The number of clusters to be identified.

maxitr

The maximum number of iterations in the E-M algorithm. Default: 100.

struct

A K by J matrix indicating the structures of each cluster.

para

A list object that contains the true model parameters. Default: NULL. See details for more information.

family

The distributional family to be used. One of "lognormal", "negbin", "binom", "gamma-binom" or "scaled-t". See details for more information.

method

A string for the fitting method, 'MBASIC' (default), 'PE-MC', 'SE-HC', or 'SE-MC'. See details for more information.

zeta

The initial value for the proportion of units that are not clustered. Default: 0.1. If 0, no singleton cluster is fitted.

min.count

The minimum count threshold for each component. This argument is only used when family='negbin'. If it is a single number, it is the threshold for all components >= 2. If it is a vector, it is the threshold for each component.

tol

Tolerance for the relative increment in log-likelihood value in checking the algorithm convergence. Default: 1e-10.

tol.par

Tolerance for the relative error in parameter updates in checking the algorithm convergence. Default: 0.001.

out

The file directory for writing fitting information in each E-M iteration. Default: NULL (nothing is written).

verbose

A boolean variable indicating whether intermediate model fitting metrics should be printed. Default: FALSE.

statemap

A vector whose length equals the number of mixture components, taking values from 1 to S that give the state of each component. Default: NULL. See details for more information.

initial

Either a list or MBASICFit object that provides initial values for model parameters. Default: NULL.

Details

MBASIC assumes that there are S underlying states for each experiment and each locus. A single state may include multiple mixture components, indexed by m. In total, there are M mixture components. The mapping from mixture components to states is provided by statemap. By default, statemap=NULL, in which case each state has only one component and M=S.
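As a small illustration of the component-to-state mapping, the sketch below uses made-up values with S = 2 states, where the second state is split into two mixture components (so M = 3). Collapsing component-level probabilities by summing within each state is shown here as an assumption about how statemap might be used, not as the package's internal computation.

```r
S <- 2
statemap <- c(1, 2, 2)   # component m belongs to state statemap[m]
M <- length(statemap)    # total number of mixture components

# Components grouped by the state they belong to
state_factor <- factor(statemap, levels = seq_len(S))
components_by_state <- split(seq_len(M), state_factor)

# Hypothetical component-level probabilities for one unit in one experiment
comp_prob <- c(0.5, 0.3, 0.2)

# Assumption: state-level probabilities are sums over the components of a state
state_prob <- as.numeric(tapply(comp_prob, state_factor, sum))
```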
Function MBASIC currently supports five different distributional families: log-normal, negative binomial, binomial, gamma-binomial and scaled-t distributions. This should be specified by the family argument.
For the log-normal family, log(Y+1) is modeled as normally distributed. For experiment n, if locus i has component m, the distribution of log(Y[n,i]+1) is N(Mu[n,m]*Gamma[n,i+I*(m-1)], Sigma[n,m]).
For the negative binomial family, Mu and Sigma have different meanings. For experiment n, if locus i has component m, the distribution of Y[n,i]-min.count[n,m] is NB(Mu[n,m]*Gamma[n,i+I*(m-1)], Sigma[n,m]). In this package, NB(mu, a) denotes the negative binomial distribution with mean mu and size a (i.e. the variance is mu*(1+mu/a)). Note that if a single value of 'min.count' is provided, it is converted to the vector c(0, rep(min.count, M-1)).
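The mean/size parameterization of NB(mu, a) described above coincides with the mu/size parameterization of base R's dnbinom. The following sketch checks the stated variance formula numerically; the values of mu and a are arbitrary and chosen only for illustration.

```r
mu <- 4      # mean (illustrative value)
a  <- 2.5    # size (illustrative value)

# Evaluate the probability mass function on a grid wide enough that the
# truncated upper tail is negligible
y <- 0:2000
p <- dnbinom(y, size = a, mu = mu)

nb_mean <- sum(y * p)
nb_var  <- sum((y - nb_mean)^2 * p)

# Compare the numerical moments with the closed form mu * (1 + mu / a)
c(mean = nb_mean, variance = nb_var, formula = mu * (1 + mu / a))
```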
The 'min.count' argument for the negative binomial distribution specifies the minimum enrichment for each replicate and component. Internally it is expanded into an N by M matrix, but the function accepts it as a vector of length N (recommended) or M, or as a single value. If it is a single value, it is used as the common threshold for all replicates and components >= 2. If it is a vector of length N, it is used as the replicate-specific threshold for all components >= 2. If it is a vector of length M, it is used as the threshold for each component across all replicates. If min.count=NULL, no threshold is applied.
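The expansion rules for 'min.count' can be sketched as follows. The helper expand_min_count is hypothetical and not part of the package; it assumes that "no threshold" corresponds to a matrix of zeros and that a vector whose length equals both N and M is treated as replicate-specific.

```r
# Hypothetical helper illustrating the documented rules; not the package's code
expand_min_count <- function(min.count, N, M) {
  if (is.null(min.count)) {
    # Assumption: "no threshold" is represented by zeros
    return(matrix(0, nrow = N, ncol = M))
  }
  if (length(min.count) == 1) {
    # Common threshold for components 2..M, zero for component 1
    return(matrix(c(0, rep(min.count, M - 1)), nrow = N, ncol = M, byrow = TRUE))
  }
  if (length(min.count) == N) {
    # Replicate-specific thresholds for components 2..M, zero for component 1
    return(cbind(0, matrix(min.count, nrow = N, ncol = M - 1)))
  }
  if (length(min.count) == M) {
    # Component-specific thresholds shared by all replicates
    return(matrix(min.count, nrow = N, ncol = M, byrow = TRUE))
  }
  stop("min.count must be NULL, a single value, or a vector of length N or M")
}

# Toy dimensions: N = 4 replicates, M = 3 components
single   <- expand_min_count(5, N = 4, M = 3)
per_rep  <- expand_min_count(c(1, 2, 3, 4), N = 4, M = 3)
per_comp <- expand_min_count(c(0, 3, 6), N = 4, M = 3)
```

In all three cases the result is a 4 by 3 matrix whose (n, m) entry is the threshold subtracted from Y[n,i] when locus i has component m.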
For the binomial distribution, for experiment n, if locus i has component m, distribution for Y[n,i] is Binom(Gamma[n,i], Mu[n,m]).
For the gamma-binomial distribution, for experiment n, if locus i has component m, distribution for Y[n,i] is Binom(Gamma[n,i], p) where p follows a gamma prior of gamma(Mu[n,m], Sigma[n,m]).
For the scaled-t distribution, for experiment n, if locus i has component m, the distribution of Y[n,i]/Gamma[n,i+I*(m-1)]/Mu[n,m] is a t distribution with Sigma[n,m] degrees of freedom.
The Gamma parameter encodes the background information for all N experiments, I units and M components. It can be a matrix with dimension N by I*M, where the background datum for experiment n, unit i and component m is Gamma[n,i+I*(m-1)]. If Gamma=NULL, it is generated as an N by I*M matrix of ones. If Gamma is supplied as an N by I matrix, the function appends I*(M-1) columns of ones to it.
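The column layout of Gamma can be illustrated with a small base-R sketch. The dimensions and values are made up, and the accessor gamma_at is a hypothetical helper introduced only to show the index arithmetic.

```r
N <- 2; I <- 3; M <- 2   # toy dimensions (illustrative only)

# Case 1: Gamma = NULL corresponds to an N x (I * M) matrix of ones
Gamma_default <- matrix(1, nrow = N, ncol = I * M)

# Case 2: an N x I input is padded with I * (M - 1) columns of ones
Gamma_input  <- matrix(c(10, 20, 30,
                         40, 50, 60), nrow = N, ncol = I, byrow = TRUE)
Gamma_padded <- cbind(Gamma_input, matrix(1, nrow = N, ncol = I * (M - 1)))

# Hypothetical accessor: background value for experiment n, unit i, component m
gamma_at <- function(Gamma, n, i, m, I) Gamma[n, i + I * (m - 1)]

user_value   <- gamma_at(Gamma_padded, n = 2, i = 3, m = 1, I = I)  # from the input
padded_value <- gamma_at(Gamma_padded, n = 2, i = 3, m = 2, I = I)  # from the padding
```

Columns 1 to I hold the background for component 1, columns I+1 to 2I hold it for component 2, and so on.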
The method argument determines what fitting method will be used. The default is 'MBASIC', where the states and the clustering are simultaneously estimated. 'SE-HC' and 'SE-MC' methods use 2-step algorithms. In the first step, both estimate the states for each unit by an E-M algorithm for each experiment. In the second step, 'SE-HC' uses hierarchical clustering to cluster the units, while 'SE-MC' uses function MBASIC.state to identify clusters.
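To give a feel for the second step of a two-step approach such as 'SE-HC', the sketch below applies base R's hclust to a made-up matrix of already-estimated states. It is not the package's implementation: the state matrix, the mismatch distance and the average linkage are all assumptions chosen for illustration, and the first step (state estimation by E-M) is not shown.

```r
# Toy matrix of estimated states: rows are 4 conditions, columns are 6 units,
# with states coded 1..3 (values are made up for illustration)
states <- cbind(
  c(1, 1, 3, 3), c(1, 1, 3, 3), c(1, 1, 3, 2),   # units following one profile
  c(2, 3, 1, 1), c(2, 3, 1, 1), c(2, 3, 1, 2)    # units following another profile
)

# Assumed distance: number of conditions in which two units have different states
I <- ncol(states)
mismatch <- matrix(0, nrow = I, ncol = I)
for (a in seq_len(I)) {
  for (b in seq_len(I)) {
    mismatch[a, b] <- sum(states[, a] != states[, b])
  }
}

# Hierarchical clustering of the units, cut into J = 2 clusters
tree    <- hclust(as.dist(mismatch), method = "average")
cluster <- cutree(tree, k = 2)
```

Here cluster is an integer vector of length I giving the cluster label of each unit.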
The para argument takes a list object that should include the following fields:

W: A K by (J*S) matrix. The (k,J*(s-1)+j)-th entry is the probability that the units in cluster j have state s in the k-th condition.
Z: An I by J matrix. The (i,j)-th entry is the indicator of whether the i-th unit belongs to cluster j.
Theta: A K by (I*S) matrix. The (k,I*(s-1)+i)-th entry is the probability that the i-th unit has state s in the k-th condition.
non.id: A binary vector of length I. The i-th entry is the indicator of whether the i-th unit does not belong to any cluster.

This argument is intended to carry the true parameters in simulation studies. If it is not NULL, the function also computes a number of metrics that describe the error in model fitting. Users should make sure that the order of the rows and columns of the matrices in para matches that of the Y matrix.
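A list with the documented field shapes can be assembled in base R as sketched below. All values are made up; representing an unclustered unit by an all-zero row of Z, giving it uniform state probabilities, and letting clustered units inherit the probabilities of their cluster are assumptions made for this illustration.

```r
set.seed(1)
K <- 2; I <- 5; J <- 2; S <- 3   # toy dimensions

# Z: I x J cluster indicators; unit 5 is left unclustered
Z <- matrix(0, nrow = I, ncol = J)
Z[cbind(1:4, c(1, 1, 2, 2))] <- 1
non.id <- c(0, 0, 0, 0, 1)

# W: K x (J * S); column J * (s - 1) + j holds the probability of state s in cluster j
raw    <- array(runif(K * J * S), dim = c(K, J, S))
totals <- apply(raw, c(1, 2), sum)   # K x J normalising constants
W <- matrix(0, nrow = K, ncol = J * S)
for (s in seq_len(S)) {
  for (j in seq_len(J)) {
    W[, J * (s - 1) + j] <- raw[, j, s] / totals[, j]
  }
}

# Theta: K x (I * S); column I * (s - 1) + i holds the probability of state s for unit i
Theta <- matrix(1 / S, nrow = K, ncol = I * S)   # unclustered unit keeps 1 / S
clustered <- non.id == 0
for (s in seq_len(S)) {
  cols_W <- J * (s - 1) + seq_len(J)
  cols_T <- I * (s - 1) + seq_len(I)
  # Assumption: a clustered unit inherits the state probabilities of its cluster
  Theta[, cols_T[clustered]] <- W[, cols_W, drop = FALSE] %*% t(Z[clustered, , drop = FALSE])
}

para <- list(W = W, Z = Z, Theta = Theta, non.id = non.id)
```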

Value

An object of class MBASICFit.

Author(s)

Chandler Zuo zuo@stat.wisc.edu

Examples

## Simulate a dataset
dat.sim <- MBASIC.sim(xi = 2, family = "lognormal", I = 1000,
                      fac = rep(1:10, each = 2), J = 3, S = 3, zeta = 0.1)
## Fit the model
dat.sim.fit <- MBASIC(Y = dat.sim$Y, S = 3, fac = rep(1:10, each = 2),
                      J = 3, maxitr = 3, para = NULL, family = "lognormal",
                      method = "MBASIC", zeta = 0.1, tol = 1e-6)

chandlerzuo/mbasic documentation built on May 13, 2019, 3:24 p.m.