MVN biclustering

Description

This function performs MVN biclustering on an n by p matrix. Details are given in Tan and Witten (2014).

Usage

matrixBC(x, k, r, lambda, alpha, beta, nstart = 20, Cs.init = NULL,
 Ds.init = NULL, max.iter = 50, threshold = 1e-04, Sigma.init = NULL, 
 Delta.init = NULL, center=TRUE)

Arguments

x

Data matrix; samples are rows and columns are features. Cannot contain missing values.

k

The number of row clusters, i.e., the number of clusters for the observations.

r

The number of column clusters, i.e., the number of clusters for the features.

lambda

Non-negative regularization parameter for lasso on the mean of each bicluster. lambda=0 means no regularization.

alpha

Non-negative regularization parameter for the graphical lasso to estimate the covariance matrix of the samples. alpha=0 means no regularization. alpha>0 is recommended.

beta

Non-negative regularization parameter for the graphical lasso to estimate the covariance matrix of the features. beta=0 means no regularization. beta>0 is recommended.

nstart

The number of random initialization sets used in the kmeans function. The default is 20.

Cs.init

Starting values for the row labels. The default value is NULL – kmeans clustering is performed to estimate the row labels.

Ds.init

Starting values for the column labels. The default value is NULL – kmeans clustering is performed to estimate the column labels.

max.iter

Maximum number of iterations. The default value is 50 iterations.

threshold

Threshold value for convergence. The default is 1e-4.

Sigma.init

Starting values for the covariance matrix of the observations. The default value is NULL – the graphical lasso as described in Friedman, Hastie, and Tibshirani (2008) is performed to estimate the covariance matrix of the observations.

Delta.init

Starting values for the covariance matrix of the features. The default value is NULL – the graphical lasso as described in Friedman, Hastie, and Tibshirani (2008) is performed to estimate the covariance matrix of the features.

center

Mean center the data matrix before performing sparse biclustering. The default is TRUE.

Details

This function implements MVN biclustering using Algorithm (3) of Tan and Witten (2014), 'Sparse biclustering of transposable data'. The approach accounts for the correlation among the features within the same cluster and for the correlation among the observations within the same cluster. The row labels for the observations and the column labels for the features are estimated, and the mean of each bicluster is encouraged to be sparse through a lasso penalty.
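
To illustrate how a lasso penalty encourages sparse bicluster means, here is a minimal sketch of the generic soft-thresholding operator in base R. It is not the package's internal update: the function name soft_threshold, the matrix raw_means and the threshold value are hypothetical, and the threshold used in Algorithm (3) presumably also depends on lambda, the cluster sizes and the covariance estimates.

```r
# Soft-thresholding: closed-form minimizer over m of
#   0.5 * (m - z)^2 + t * abs(m)
# (generic lasso building block; hypothetical helper, not a sparseBC function)
soft_threshold <- function(z, t) {
  sign(z) * pmax(abs(z) - t, 0)
}

# Hypothetical unpenalized bicluster means: k = 2 row clusters, r = 3 column clusters
raw_means <- matrix(c( 2.5, -0.3, 0.1,
                      -1.8,  0.4, 0.0), nrow = 2, byrow = TRUE)

# Entries with absolute value below the threshold are set exactly to zero;
# the remaining entries are shrunk toward zero by the threshold
sparse_means <- soft_threshold(raw_means, t = 0.5)
print(sparse_means)
```

Larger thresholds zero out more bicluster means, which matches the role the lambda argument is described as playing.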

If Sigma.init and Delta.init are NULL, the graphical lasso of Friedman, Hastie, and Tibshirani (2008) is used to estimate the covariance matrices of the observations and features. If Sigma.init is provided, the covariance matrices are not updated in the algorithm. Note that when Sigma and Delta equal the identity matrix up to a scaling factor, this approach is exactly that of sparse biclustering, and the function sparseBC should be used instead.

Note that most of the computation time comes from the graphical lasso algorithm. We recommend setting the tuning parameters alpha and beta to be large so that the graphical lasso runs efficiently (see the glasso package). When n > p, alpha=0 will return an error. Similarly, when p > n, beta=0 will return an error.
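
One plausible reading of the alpha=0 and beta=0 errors (an assumption, not stated in the source) is that the corresponding sample covariance matrix is singular, so an unpenalized inverse covariance estimate does not exist. The base R sketch below shows the rank deficiency on simulated data; the way the package actually forms these matrices may differ, and all variable names are illustrative.

```r
set.seed(1)
n <- 10  # observations (rows)
p <- 4   # features (columns), so n > p
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
xc <- x - mean(x)  # center by the overall mean

# Assumed forms of the sample covariances (illustrative only)
S_obs  <- tcrossprod(xc) / p  # n x n, covariance among observations
S_feat <- crossprod(xc) / n   # p x p, covariance among features

# The n x n matrix is built from only p columns, so its rank is at most p < n
rank_obs  <- qr(S_obs)$rank
rank_feat <- qr(S_feat)$rank
print(c(rank_obs = rank_obs, rank_feat = rank_feat))
```

With n > p the n by n matrix cannot be inverted without regularization, which would be consistent with alpha > 0 being required in that case; the situation is mirrored for beta when p > n.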

If center=TRUE, the data matrix x is mean centered before performing sparse biclustering. The reported mean matrix mus is the sum of the subtracted mean, mean(x), and the estimated mean matrix from sparse biclustering on the mean centered data.
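
The centering step can be sketched in base R as follows. This is only an illustration of the description above: est_centered is a hypothetical stand-in for the mean matrix estimated on the centered data, not output from matrixBC.

```r
set.seed(2)
x <- matrix(rnorm(6 * 4, mean = 5), nrow = 6)

# Subtract the overall (scalar) mean of the matrix, mean(x)
grand_mean <- mean(x)
x_centered <- x - grand_mean

# Hypothetical stand-in for the estimate obtained on the centered data
# (all zeros here, as if every bicluster mean had been shrunk to zero)
est_centered <- matrix(0, nrow = nrow(x), ncol = ncol(x))

# The reported mean matrix adds the subtracted mean back
mus <- grand_mean + est_centered
print(mus[1:2, ])
```

Under this description, a bicluster whose centered mean is shrunk to zero is reported with mean equal to mean(x) rather than zero.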

Value

An object of class matrixBC.

Among some internal variables, this object includes the elements

Cs

Cs is the output for the row labels.

Ds

Ds is the output for the column labels.

mus

mus is the estimated mean matrix for the entire matrix.

Mus

Mus is the estimated mean matrix for each bicluster.

Sigma

Sigma is the estimated covariance matrix of the observations.

Delta

Delta is the estimated covariance matrix of the features.

objs

objs is the maximized objective value of the negative l1 penalized log-likelihood of the matrix-variate normal distribution.

iteration

The number of iterations until convergence.
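
The relationship between Cs, Ds, Mus and mus can be sketched in base R. The sketch assumes that mus is obtained by expanding the k by r matrix Mus over the row and column labels; this matches the descriptions above but is an assumption about the returned object, and all values below are made up for illustration.

```r
# Hypothetical labels: n = 5 observations in k = 2 clusters,
# p = 4 features in r = 3 clusters
Cs <- c(1, 1, 2, 2, 2)
Ds <- c(1, 2, 2, 3)

# Hypothetical k x r matrix of bicluster means
Mus <- matrix(c(1.0, 0.0, -2.0,
                0.0, 3.0,  0.5), nrow = 2, byrow = TRUE)

# Expand to an n x p matrix: entry (i, j) is the mean of
# the bicluster (Cs[i], Ds[j]) that contains it
mus <- Mus[Cs, Ds]
print(dim(mus))
print(mus)
```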

Author(s)

Kean Ming Tan and Daniela Witten

References

KM Tan and D Witten (2014) Sparse biclustering of transposable data. Journal of Computational and Graphical Statistics 23(4):985-1008.

J Friedman, T Hastie, and R Tibshirani (2008). Sparse inverse covariance estimation with the lasso. Biostatistics 9, 432–441.

See Also

sparseBC, matrixBC.BIC, summary.matrixBC, image.matrixBC

Examples

# Lung cancer data
# Not run to save time
#data(lung)
#truecluster<-as.numeric(as.factor(rownames(lung)))
#cancersd<-apply(lung,2,sd)
# Pick the top 400 genes that have the largest standard deviation
#lung<-lung[,rank(cancersd)>=length(cancersd)-399]

# Example of MVN Biclustering
#set.seed(5)
#res<-matrixBC(lung,k=4,r=10,lambda=60,alpha=0.4,beta=0.4) 
# one misclassification
#res$Cs

# lambda chosen such that the estimated mean matrix of sparseBC has a
# similar number of nonzeros as matrixBC
#res2<-sparseBC(lung,k=4,r=10,lambda=230)
# a few observations are being misclassified
#res2$Cs

# print information from the object matrixBC
#summary(res)

# Plot the estimated mean matrix for the object matrixBC
#image(res)
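
The feature-selection and label-comparison steps used above can be tried quickly on simulated data with base R only. This sketch does not call matrixBC; the data, the label vector est and the variable names are all hypothetical.

```r
set.seed(3)
x <- matrix(rnorm(12 * 10), nrow = 12)
x <- sweep(x, 2, seq_len(10), `*`)  # give the columns different spreads

# Keep the n_keep columns with the largest standard deviation,
# using the same rank() idiom as the lung example
sds <- apply(x, 2, sd)
n_keep <- 3
keep <- rank(sds) >= length(sds) - (n_keep - 1)
x_top <- x[, keep]

# Compare estimated labels with known labels. Cluster labels are arbitrary,
# so cross-tabulate instead of comparing the vectors directly.
truth <- rep(1:3, each = 4)
est <- c(2, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1)  # hypothetical estimated labels
tab <- table(truth, est)
print(tab)

# Crude count: assumes each true cluster maps to a distinct estimated label
misclassified <- length(truth) - sum(apply(tab, 1, max))
print(misclassified)
```

The row-maximum count is only a rough summary; when two true clusters share the same dominant estimated label, a one-to-one matching of labels would be needed instead.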