README.md

rccm

Random Covariance Clustering Model

Description

This R package implements methods for clustering and joint estimation of multiple sparse precision matrices. Methods implemented include the Random Covariance Model and Random Covariance Clustering Model.

Table of contents

Overview of main functions

Function Description rccm Implements the Random Covariance Clustering Model (RCCM) for joint estimation of sparse precision matrices belonging to multiple clusters or groups. Optimization is conducted using block coordinate descent. randCov Implements the Random Covariance Model (RCM) for joint estimation of multiple sparse precision matrices. Optimization is conducted using block coordinate descent. gapSelect Selects the optimal number of clusters for the Random Covariance Clustering Model (RCCM) based on a Gap statistic as proposed by Tibshirani et al. (2001). rccSim Simulates data based on the Random Covariance Clustering Model (RCCM). Data is generated in a hierarchical manner, beginning with group-level networks and precision matrices and then subject-level networks and matrices. starsRccm Implements a modified stability approach for regularization selection (stARS) method for tuning parameter selection. Methods available to implement include the fused graphical lasso (Danaher et al., 2014), group graphical lasso (Danaher et al., 2014), graphical lasso (Friedman et al., 2008), random covariance clustering model (RCCM), and the random covariance model (Zhang et al., 2019). cvTune Implements k-fold cross-validation for tuning parameter selection. Methods available to implement include the fused graphical lasso (Danaher et al., 2014), group graphical lasso (Danaher et al., 2014), graphical lasso (Friedman et al., 2008), random covariance clustering model (RCCM), and the Random Covariance Model (Zhang et al., 2019). dwishart Probability density function for the Wishart distribution. randCalc Calculates the rand index describing the amount of agreement between two integer vectors of cluster memberships.

Installation

Install the latest version of the package from GitHub with the following R code:

if("devtools" %in% installed.packages() == FALSE) {
    install.packages("devtools")
}

devtools::install_github("dilernia/rccm")

Examples

Here we walk through brief examples of using some key functions. For a more detailed simulation example for implementing the RCCM, see this link for a simulation with precision matrices with higher magnitude entries, and this link for a simulation with low magnitude entries.

Analyzing single data set

Below we generate a single data set with 2 clusters with 12 and 10 subjects respectively, 15 variables for each subject, 100 observations for each variable for each subject, the groups sharing about 50% of network connections, and 10% of differential connections within each group:

library(rccm)

# Simulate data
set.seed(1994)
myData <- rccSim(G = 2, clustSize = c(12, 10), p = 15, n = 100, overlap = 0.50, rho = 0.10)

# Analyze simulated data with RCCM
result <- rccm(x = myData$simDat, lambda1 = 10, lambda2 = 50, lambda3 = 2, nclusts = 2)

# Check clustering performance
zHats <- apply(result$weights, MARGIN = 2, FUN = which.max)
randCalc(zHats, myData$zgks)

Selecting optimal tuning parameters using stARS

Below we generate a single data set and then obtain the optimal estimation results using stARS for tuning parameter selection with 10 bootstrap samples for demonstration. A higher number of bootstrap samples (30 or 50) should be used in practice for more reliable results. We also include examples for using Ward clustering & GGL and K-means clustering and GLasso, and note that the magnitude of the tuning parameters that work well for each method vary, so we adjust / scale them accordingly to achieve the best results for each respective method.

# Simulate data
set.seed(1994)
G <- 2
myData <- rccSim(G = G, clustSize = 10, p = 5, n = 100, overlap = 0.50, rho = 0.10)

# Standardizing data
simData$simDat <- lapply(simData$simDat, FUN = scale)

# Grid of tuning parameters to search over
lambdas <- expand.grid(lambda1 = c(25, 35, 45), lambda2 = c(300, 325), lambda3 = 0.01)

# Find optimal tuning parameter set using modified stARS with 10 bootstrap samples
optTune <- starsRccm(datf = myData$simDat, lambs = lambdas, G = G, N = 10, method = "RCCM")

optWardggl <- starsRccm(datf = myData$simDat, lambs = lambdas, G = G, N = 10, method = "GGL")

optGL <- starsRccm(datf = myData$simDat, lambs = lambdas, G = G, N = 10, method = "GLasso")

# Analyze with optimally selected tuning parameters for RCCM
resultRccm <- rccm(x = myData$simDat, lambda1 = optTune$lambda1[1],
lambda2 = optTune$lambda2[1], lambda3 = optTune$lambda3[1], nclusts = G)

# Using Ward & GGL approach

# Function for calculating pair-wise Frob norm differences and then clustering
MatClust <- function(mats, G) {
K <- dim(mats)[3]
combos <- expand.grid(s1 = 1:K, s2 = 1:K)
distMat <- matrix(NA, nrow = K, ncol = K)
for(r in 1:K) {
  for(s in 1:K) {
    distMat[r, s] <- norm(mats[, , r] - mats[, , s], type = 'F')
  }
}

cl0 <- cutree(hclust(d = as.dist(distMat), method = "ward.D"), k = G)

wgk <- matrix(NA, nrow = G, ncol = K)

for(i in 1:G) {
  for(j in 1:K) {
    wgk[i, j] <- ifelse(cl0[j] == i, 1, 0)
  }
}
return(wgk)
}

K <- length(simData$simDat)
Sl <- sapply(simData$simDat, cov, simplify = "array")
gl <- sapply(1:K, simplify = "array", FUN = function(x){glasso::glasso(Sl[, , x], rho = 0.001, 
             penalize.diagonal = FALSE)$wi})

# Estimating cluster memberships using Ward clustering based on dissimilarity matrix of Frob norm differences
GGLres <- list()
GGLres$weights <- MatClust(gl, G = G)

# Analyzing using GGL within each estimated cluster
GGLres$res <- unlist(lapply(FUN = function(g) {
          prec <- JGL::JGL(Y = simData$simDat[which(as.logical(GGLres$weights[g, ]))], penalty = "group", 
                           penalize.diagonal = FALSE, lambda1 = optWardggl$lambda1[1] / 100, 
                           lambda2 = optWardggl$lambda2[1] / 50000, return.whole.theta = TRUE)$theta
                            return(setNames(prec, c(which(as.logical(GGLres$weights[g, ])))))
                            }, X = 1:settings[i, ]$G), recursive = F)

# Using GLasso & K-means clustering
GLassores <- list()
GLassores$res <- lapply(myData$simDat, FUN = function(x) {
                       glasso::glasso(cov(x), rho = optGL$lambda1[1] / 100, penalize.diagonal = FALSE)$wi})

# Creating matrix of vectorized precision matrix estimates
vMat <- do.call(rbind, lapply(X = GLassores$res, FUN = as.numeric))

# Finding estimated cluster memberships using k-means clustering
zHat <- as.integer(factor(kmeans(x = vMat, centers = G)$cluster, levels = c(1:G)))

Selecting number of clusters using gap statistic

Below we generate a single data set with 2 clusters with 10 subjects in each group, 10 variables, 150 observations, with the groups sharing about 50% of network connections, and 10% of differential connections within each group:

# Simulate data
set.seed(1994)
myData <- rccSim(G = 2, clustSize = 10, p = 10, n = 150, overlap = 0.50, rho = 0.10)

# Analyze simulated data with RCCM
optLambdas <- data.frame(lambda1 = 25, lambda2 = 100, lambda3 = 0.1, G = 2:3)
result2 <- rccm(x = myData$simDat, lambda1 = optLambdas$lambda1[1],
                lambda2 = optLambdas$lambda2[1], lambda3 = optLambdas$lambda3[1],
            nclusts = 2)
result3 <- rccm(x = myData$simDat, lambda1 = optLambdas$lambda1[2],
                lambda2 = optLambdas$lambda2[2], lambda3 = optLambdas$lambda3[2],
            nclusts = 3)

# Estimated cluster memberships
zHats <- cbind(apply(result2$weights, MARGIN = 2, FUN = which.max),
               apply(result3$weights, MARGIN = 2, FUN = which.max))

# Selecting number of clusters (takes ~ 2 minutes)
clustRes <- gapSelect(x = myData$simDat, gMax = 3, B = 30, zs = zHats,
optLambdas = optLambdas)

clustRes

References

Danaher, P., Wang, P., & Witten, D. (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(2), 373-397.

Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 432-441.

Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411-423.

Zhang, L., DiLernia, A., Quevedo, K., Camchong, J., Lim, K., & Pan, W. (2020). A random covariance model for bi‐level graphical modeling with application to resting‐state fMRI data. Biometrics.



dilernia/rccm documentation built on Sept. 25, 2022, 9:40 a.m.