R/test_indep_clust_subset.R

Defines functions test_indep_clust_subset

Documented in test_indep_clust_subset

#' Pseudo likelihood ratio test for independence of clusterings
#'
#'
#' Implements the pseudo likelihood ratio test described in Section 3 of
#' Gao et. al. (2019) "Are Clusterings of Multiple Data Views Independent?"
#' for testing for dependence between clusterings of two data views.
#' Fits Gaussian mixture models in each view, in the case where
#' observations can be removed in one view but not the other
#' (e.g. when view 1 data is available on 1st observation but
#' view 2 data is not available on 2nd observation).
#' Allows for fitting model-based clusterings on the full data views,
#' and comparing these clusterings on subsets of the data views where
#' observations are not removed in both views.
#'
#' @param x Multi-view data with two views; a list of two numeric vectors
#'  (in the case of univariate data) or matrices containing the two data views.
#'  In matrix format, rows correspond to observations and columns correspond to variables.
#' @param B An integer specifying the number of permutations to
#' use for the permutation procedure. The default number is 200.
#' @param model1 A character string indicating the model
#' to be fitted for Gaussian model-based clustering in view 1 using
#' the function \code{\link[mclust]{Mclust}}. The default is \code{"EII"}
#' (spherical, equal volume).
#' The help file for \code{\link[mclust]{mclustModelNames}} describes
#' the available model options.
#' @param model2 A character string indicating indicating the model
#' to be fitted for Gaussian model-based clustering in view 1 using
#' the function \code{\link[mclust]{Mclust}}. The default is \code{"EII"}
#' (spherical, equal volume). The help file for
#'  \code{\link[mclust]{mclustModelNames}} describes the available model options.
#' @param subset1 A numeric vector indicating the rows of the matrix containing
#' the first data view that correspond to observations which have not been
#' removed in the second data view.
#' @param subset2 A numeric vector indicating the rows of the matrix containing
#' the second data view that correspond to observations which have not been removed
#' in the first data view.
#' @param step A numeric value containing the fixed step size to be used in the optimization
#' algorithm for estimating Pi. The default step size is 0.001. See Appendix B
#' in the Supplementary Materials of "Are Clusterings of Multiple Data Views Independent?"
#' for details.
#' @param maxiter A numeric value containing the maximum number of iterations to run in
#'  the optimization algorithm. The default maximum is 1000.
#' @param init1 An optional argument containing the model to be fitted in the
#' hierarchical clustering initialization in Gaussian model-based clustering
#' in view 1. The default is \code{"VVV"} (ellipsoidal, varying volume,
#'  shape, and orientation).
#' The help file for \code{\link[mclust]{hc}} describes
#' the available model options.
#' @param init2 An optional argument containing the model to be fitted in the
#' hierarchical clustering initialization in Gaussian model-based clustering
#' in view 2. The default is \code{"VVV"} (ellipsoidal, varying volume,
#'  shape, and orientation).
#' The help file for \code{\link[mclust]{hc}} describes
#' the available model options.
#' @param K1 An optional argument containing the number of clusters in View 1.
#'  If left out, then the number of clusters is chosen with BIC as described in
#'  Section 2.3.3 of "Are Clusterings of Multiple Data Views Independent?"
#' @param K2 An optional argument containing the number of clusters in View 2.
#'  If left out, then the number of clusters is chosen with BIC as described in
#'  Section 2.3.3 of "Are Clusterings of Multiple Data Views Independent?"
#' @import mclust
#' @export
#' @return
#' A list containing the following output components:
#' \item{K1}{The number of clusters in view 1}
#' \item{K2}{The number of clusters in view 2}
#' \item{Pi.est}{The estimated Pi matrix}
#' \item{PLRstat}{The pseudo likelihood ratio test statistic}
#' \item{pval}{The p-value}
#' \item{modelfit1}{The object of class '\code{Mclust}' corresponding to the model-based clustering fitted
#' in View 1; contains eg. estimated parameters and cluster assignments. The help file for \code{\link[mclust]{Mclust}} describes the components of the
#' object.}
#' \item{modelfit2}{The object of class '\code{Mclust}' corresponding to the model-based clustering fitted
#' in View 2; contains eg. estimated parameters and cluster assignments.
#' The help file for \code{\link[mclust]{Mclust}} describes the components of the
#' object.}
#' @examples
#' set.seed(1)
#' n <- 50
#' sig <- 2
#' p <- 2
#' K <- 3
#' mu1 <- cbind(c(2, 0), c(0, 2),  c(2, -2), c(-2, 0), c(0, -2), c(-2, 2))
#' mu2 <- cbind(c(-2, 0), c(0, -2), c(-2, 2), c(2, 0), c(0, 2), c(2, -2))
#' # Generates two-view data where the clusters are independent.
#' x1 <- list(matrix(sig* rnorm(n*p), n, p) + t(mu1)[sample(1:K, n, replace=TRUE), ],
#'         matrix(sig * rnorm(n*p), n, p) + t(mu2)[sample(1:K, n, replace=TRUE), ])
#' # Generate two-view data where the clusters are identical.
#' n <- 71
#' cl <- sample(1:K, n, replace=TRUE)
#' x2 <- list(matrix(sig* rnorm(n*p), n, p) + t(mu1)[cl, ],
#' matrix(sig * rnorm(n*p), n, p) + t(mu2)[cl, ])
#'
#' # Run the function on independent data views; we do not reject the null hypothesis.
#'  # By default, not specifying K1 and K2 means the number of clusters
#'  # to use in the test in each view is chosen via BIC.
#'  # Covariance matrix model specified is shared sigma^2 I covariance matrix in view 1
#'  # and shared diagonal covariance matrix in view 2.
#'  # B specifies the number of permutations to do for the permutation test.
#'  # Covariance matrix model specified for initialization
#'  # is shared sigma^2 I covariance matrix in view 1
#'  # Estimates Gaussian mixture model parameters on x1[[1]] and x1[[2]],
#'  # and compares the estimated clusterings on the subsetted data
#'  # x1[[1]][2:48, ] and x1[[1]][2:48, ].
#' indep1 <- test_indep_clust_subset(x1,model1="EII", model2="EEI",
#' subset1=2:48, subset2=2:48, init1="EII", B=51)
#' # The estimated cluster parameters in view 1
#' indep1$modelfit1$parameters
#' # The cluster assignments in view 2
#' indep1$modelfit2$classification
#'
#' @references
#' Fraley C. and Raftery A. E. (2002) Model-based clustering,
#' discriminant analysis and density estimation,
#' Journal of the American Statistical Association, 97/458, pp. 611-631.
#'
#' Gao, L.L., Bien, J., Witten, D. (2019) Are Clusterings of Multiple Data Views Independent?
#' Biostatistics, <DOI:10.1093/biostatistics/kxz001>
#'
#' Scrucca L., Fop M., Murphy T. B. and Raftery A. E. (2016)
#' mclust 5: clustering, classification and density estimation
#' using Gaussian finite mixture models, The R Journal, 8/1, pp. 205-233.
test_indep_clust_subset <- function(x,  model1="EII", model2="EII",
                                    K1=NULL, K2=NULL,
                                    subset1, subset2,
                                    init1=NULL, init2=NULL,
                                    B=200, step=0.001, maxiter=1000) {
  input_check <- function(x, model1, model2, K1, K2, subset1, subset2, 
                          init1, init2, B, step, maxiter) { 
    if(typeof(x) != "list") {
      stop("x is not in the right format; should be a list of two matrices
           (or vectors, for univariate data)")
    } 
    
    if(length(x) != 2) {
      stop("x is not in the right format; should be a list of two numeric matrices
           (or vectors, for univariate data)")
    }  
    
    if(!is.numeric(x[[1]]) || !is.numeric(x[[2]])) {
      stop("x is not in the right format; should be a list of two numeric matrices
           (or vectors, for univariate data)")
    } 
    
    if(is.matrix(x[[1]]) != is.matrix(x[[2]])) {
      stop("x is not in the right format; should be a list of two numeric matrices
           (or vectors, for univariate data)")
    } 
    
    val1 <- ifelse(is.matrix(x[[1]]), nrow(x[[1]]), length(x[[1]]))
    val2 <- ifelse(is.matrix(x[[2]]), nrow(x[[2]]), length(x[[2]]))
    if(val1 != val2){
      stop("view 1 should contain the same number of observations as view 2")
    }
    
    if(any(is.na(x[[1]])) || any(is.na(x[[2]])))
      stop(paste("views cannot contain any missing values"))

    if(!is.numeric(subset1) || !is.numeric(subset2)) {
      stop("subset is not in the right format; should be a numeric vector")
    } 
    
    if(sum(c(subset1 <= 0, subset2 <= 0)) > 0) {
      stop("cannot provide a subset with a negative index")
    } 
    
    if(length(subset1) != length(subset2)){
      stop("the two subsets should contain the same number of observations")
    }
    
    if(!model1 %in% c("E", "V", "EII", "VII", "EEI", "VEI", 
                      "EVI", "VVI", "EEE", "EVE", "VEE", "VVE", 
                      "EEV", "VEV", "EVV", "VVV"))
      stop(paste("Did not pass a valid model for Mclust in View 1") )
    
    if(model1 %in% c("E", "V") && is.matrix(x[[1]]))
      stop(paste(model1, "is a univariate model for Mclust, cannot be used for
               multivariate data") )
    
    if(model1 %in% mclust.options("emModelNames") && !is.matrix(x[[1]]))
      stop(paste(model1, "is a multivariate model for Mclust, cannot be used for
               univariate data"))
    
    if(!model2 %in% c("E", "V", "EII", "VII", "EEI", "VEI", 
                      "EVI", "VVI", "EEE", "EVE", "VEE", "VVE", 
                      "EEV", "VEV", "EVV", "VVV"))
      stop(paste("Did not pass a valid model for Mclust in View 2") )
    
    if(model2 %in% c("E", "V") && is.matrix(x[[2]]))
      stop(paste(model2, "is a univariate model for Mclust, cannot be used for
               multivariate data"))
    if(model2 %in% c("E", "V", "EII", "VII", "EEI", "VEI", 
                     "EVI", "VVI", "EEE", "EVE", "VEE", "VVE", 
                     "EEV", "VEV", "EVV", "VVV") && !is.matrix(x[[2]]))
      stop(paste(model2, "is a multivariate model for Mclust, cannot be used for
               univariate data") )
    
    if(!is.null(init1)) {
      if(!init1 %in% c("E", "V", "EII", "VII", "EEE", "VVV"))
        stop(paste("Did not pass a valid initialization model for Mclust in View 1") )
      
      if(init1 %in% c("E", "V") && is.matrix(x[[1]]))
        stop(paste(init1, "is a univariate initialization model for Mclust,
                 cannot be used for multivariate data") )
      
      if(init1 %in% c("EII", "VII", "EEE", "VVV") && !is.matrix(x[[1]]))
        stop(paste(init1, "is a multivariate initialization model for Mclust,
                 cannot be used for univariate data"))
      
      agree1 <- ((init1 %in% c("E", "V")) == (model1 %in% c("E", "V")))
      if(!agree1) {
        stop("In view 1, initialization model is for univariate data,
           but model is for multivariate data, or
           initialization model is for multivariate data,
           but model is for univariate data")
      }
    }
    
    if(!is.null(init2)) {
      if(!init2 %in% c("E", "V", "EII", "VII", "EEE", "VVV"))
        stop(paste("Did not pass a valid initialization model for Mclust in View 2") )
      
      if(init2 %in% c("E", "V") && is.matrix(x[[2]]))
        stop(paste(init2, "is a univariate initialization model for Mclust,
                 cannot be used for multivariate data"))
      if(init2 %in% c("EII", "VII", "EEE", "VVV") && !is.matrix(x[[2]]))
        stop(paste(init2, "is a univariate initialization model for Mclust,
                 cannot be used for univariate data") )
      agree2 <- ((init2 %in% c("E", "V")) == (model2 %in% c("E", "V")))
      if(!agree2) {
        stop("In view 2, initialization model is for univariate data,
           but model is for multivariate data, or
           initialization model is for multivariate data,
           but model is for univariate data")
      }
    }
    
    if(step <=0 ) stop('Step size for optimization algorithm should be > 0')
    if(B < 1) stop('Number of permutations should be > 0')
    if(B <= 50) warning('Number of permutation iterations is specified to be small;
                      p-value approximation will likely be imprecise')
    if(maxiter %% 1 != 0|| maxiter <= 0) stop('Maximum number of iterations should be
                                            a positive integer')
    if(!is.null(K1)) { 
      if(!is.numeric(K1)) stop('Number of clusters must be a positive integer in View 1')
      if(K1 %% 1 != 0 || K1 < 0) stop('Number of clusters must be a positive integer in View 1')
      if(K1 < 1) stop('Number of clusters must be greater than 1 in View 1')
    }
    
    if(!is.null(K2)) { 
      if(!is.numeric(K2)) stop('Number of clusters must be a positive integer in View 2')
      if(K2 %% 1 != 0 || K2 < 0) stop('Number of clusters must be a positive integer in View 2')
      if(K2 < 1) stop('Number of clusters must be greater than 1 in View 2')
    }
  }
  
  input_check(x, model1, model2, K1, K2, subset1, subset2, 
              init1, init2, B, step, maxiter)
  
  if(is.null(init1))  init1 <- "VVV"
  if(is.null(init2))  init2 <- "VVV"
  
  # Model-based clustering of each view
  if(is.null(K1)) {
    EM.View1 <- mclust::Mclust(x[[1]], G=2:9, modelNames=c(model1),
                               initialization=list(hcPairs=hc(x[[1]],
                                                              modelName=init1)))
    if(is.null(EM.View1)) stop("The model specified for model-based clustering
                               could not be fitted in view 1")
    K1 <- EM.View1$G
  } else {
   EM.View1 <- mclust::Mclust(x[[1]], K1, modelNames=c(model1),
                               initialization=list(hcPairs=hc(x[[1]],
                                                              modelName=init1)))
    if(is.null(EM.View1)) stop("The model specified for model-based clustering
                               could not be fitted in view 1")
  }

  if(is.null(K2)) {
    EM.View2 <- mclust::Mclust(x[[2]], G=2:9, modelNames=c(model2),
                               initialization=list(hcPairs=hc(x[[2]],
                                                              modelName=init2)))
    if(is.null(EM.View2)) stop("The model specified for model-based clustering
                               could not be fitted in view 1")
    K2 <- EM.View2$G
  } else {
    EM.View2 <- mclust::Mclust(x[[2]], K2, modelNames=c(model2),
                               initialization=list(hcPairs=hc(x[[2]],
                                                              modelName=init2)))
    if(is.null(EM.View2)) stop("The model specified for model-based clustering
                               could not be fitted in view 2")
  }


  EM.View1.param <- EM.View1$parameters
  EM.View2.param <- EM.View2$parameters

  if(is.matrix(x[[1]])) {
    # Density matrices for each mixture component
    logphi1 <-  mclust::cdens(modelName=model1, data=x[[1]][subset1, ], logarithm=TRUE, parameters=EM.View1.param)
  } else {
    logphi1 <-  mclust::cdens(modelName=model1, data=x[[1]][subset1], logarithm=TRUE, parameters=EM.View1.param)
  }

  if(is.matrix(x[[1]])) {
    logphi2 <- mclust::cdens(modelName=model2, data=x[[2]][subset2, ], logarithm=TRUE, parameters=EM.View2.param)
  } else {
    logphi2 <- mclust::cdens(modelName=model2, data=x[[2]][subset2], logarithm=TRUE, parameters=EM.View2.param)
  }

  n <- nrow(logphi1)
  pi1.est <- EM.View1.param$pro
  pi2.est <- EM.View2.param$pro
  # Optimization to estimate Pi
  Pi.est <- optimize_over_Pi(logphi1, logphi2,pi1.est, pi2.est,
                             max.iter=maxiter, stepsz=step)

  # Compute pseudo likelihood ratio test statistic
  nullLogLik <- EM.View1$loglik + EM.View2$loglik
  log.Lambda = -Pi.est$obj - nullLogLik
  PLRstat <- 2*log.Lambda

  cat("Computing p-value: ")
  # Permutation procedure
  PLRstat.perm <- rep(0, B)
  for(b in 1:B) {

    if(b %% 10^round(log10(B) - 1) == 0) {
      cat(round(b/B*100))
      cat("% of permutations done ... ")
    }

    if(b == 1001) {
      earlyPval <- mean(ifelse(PLRstat.perm[1:1000] >= PLRstat, 1, 0))
      if(earlyPval >= 0.1) {
        PLRstat.perm <- PLRstat.perm[1:1000]
        cat("stopped permuting at 1000 permutations (calculated permutation p-value >= 0.1)")
        cat("\n")
        cat("\n")
        break
      }
    }
    Pi.est.perm <- optimize_over_Pi(logphi1, logphi2[sample(1:n, replace=F), ],
                                          pi1.est, pi2.est, max.iter=maxiter, stepsz=step)
    log.Lambda.perm <- -Pi.est.perm$obj - nullLogLik
    PLRstat.perm[b] <- 2*log.Lambda.perm
  }
  cat("\n")
  cat("\n")

  pval.perm <- mean(ifelse(PLRstat.perm >= PLRstat, 1, 0))

  # Likelihood Ratio statistic
  return(list(PLRstat=PLRstat,
              Pi.est=Pi.est$Pi,
              K1=K1,
              K2=K2,
              pval=pval.perm,
              modelfit1=EM.View1,
              modelfit2=EM.View2))
  }

Try the multiviewtest package in your browser

Any scripts or data that you put into this service are public.

multiviewtest documentation built on Oct. 13, 2021, 5:08 p.m.