LOVE: Latent-model based OVErlapping clustering
In bingx1990/LOVE: Overlapping clustering based on latent factor models

Perform overlapping (variable) clustering of a p-dimensional feature vector generated from the latent factor model

X = AZ + E

with identifiability conditions on A and Cov(Z).

LOVE(
  X,
  lbd = 0.5,
  mu = 0.5,
  est_non_pure_row = "HT",
  verbose = FALSE,
  pure_homo = FALSE,
  diagonal = FALSE,
  delta = NULL,
  merge = FALSE,
  rep_CV = 50,
  ndelta = 50,
  q = 2,
  exact = FALSE,
  max_pure = NULL,
  nfolds = 10
)

`X`	A n by p data matrix.
`lbd`	The grid of leading constants of λ.
`mu`	The leading constant used for thresholding the loading matrix.
`est_non_pure_row`	String. Procedure used for estimating the non-pure rows. One of {"HT", "ST", "Dantzig"}.
`verbose`	Logical. Set to FALSE to suppress progress messages.
`pure_homo`	Logical. TRUE if the pure loadings have the same magnitude.
`diagonal`	Logical. If TRUE, the covariance matrix of Z is diagonal; else FALSE.
`delta`	The grid of leading constants of δ.
`merge`	Logical. If TRUE, take the union of all candidate pure variables; otherwise, take the intersection.
`rep_CV`	The number of repetitions used for cross validation.
`ndelta`	Integer. The length of the grid of `delta`.
`q`	Either `2` or `Inf` to specify the type of score.
`exact`	Logical. Only used when computing the `Inf` score. If TRUE, compute the `Inf` score exactly by solving a linear program. Otherwise, use an approximation to compute the `Inf` score.
`max_pure`	A numeric value in (0, 1] specifying the maximal proportion of pure variables. Default is NULL. When not specified, `max_pure` = 1 if n > p, and `max_pure` = 0.8 otherwise.
`nfolds`	The number of folds. Default is 10.

LOVE performs overlapping clustering of the feature variables X generated from the latent factor model

X = AZ+E

where the loading matrix A and the covariance matrix of Z satisfy certain identifiability conditions. The main goal is to estimate the loading matrix A whose support is used to form overlapping groups of X.
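To make "support is used to form overlapping groups" concrete, here is a minimal base-R sketch. The matrix `A_hat` and its values are hypothetical, chosen only for illustration, and the layout of the `group` element actually returned by LOVE may differ from the plain list built here.

```r
# Hypothetical 5 x 2 loading matrix (illustrative values, not package output).
A_hat <- rbind(
  c(1,    0),
  c(-1,   0),
  c(0,    1),
  c(0.4,  0.6),
  c(0.5, -0.5)
)

# Cluster k collects the variables whose k-th loading is non-zero,
# i.e. the support of the k-th column.
groups <- lapply(seq_len(ncol(A_hat)), function(k) which(A_hat[, k] != 0))

# Variables with more than one non-zero loading belong to several clusters.
overlapping <- which(rowSums(A_hat != 0) > 1)
```

In this toy matrix, variables 4 and 5 load on both factors, so they appear in both clusters; this is what makes the clustering overlapping.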

The first step estimates the pure loadings, defined as the rows of A that are proportional to canonical vectors. When the pure loadings are expected to have the same magnitudes (up to the sign), for instance,

A_{1.} = (1, 0, 0), A_{2.} = (-1, 0, 0),

the pure loadings are estimated by setting pure_homo to TRUE. When different magnitudes are expected for the pure loadings, such as

A_{1.} = (1, 0, 0), A_{2.} = (-0.5, 0, 0),

the estimation uses a different approach, selected by setting pure_homo to FALSE.
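The definition of a pure row can be checked directly on a known loading matrix. The sketch below is only an illustration of the definition in base R: the matrix `A` and all helper names are assumptions, and LOVE itself has to estimate the pure rows from data rather than read them off a known `A`.

```r
# Illustrative loading matrix (values are assumptions for this sketch).
A <- rbind(
  c(1,    0,   0),  # pure, loads on factor 1
  c(-0.5, 0,   0),  # pure, factor 1, different magnitude and sign
  c(0,    1,   0),  # pure, loads on factor 2
  c(0,    0,   1),  # pure, loads on factor 3
  c(0.3,  0.7, 0)   # mixed (non-pure)
)

# A row is proportional to a canonical vector when exactly one entry is non-zero.
pure_rows <- which(rowSums(A != 0) == 1)

# For each pure row: the factor it loads on and the magnitude of that loading.
abs_pure <- abs(A[pure_rows, , drop = FALSE])
pure_factor <- apply(abs_pure, 1, which.max)
pure_magnitude <- apply(abs_pure, 1, max)

# TRUE only if all pure loadings share one magnitude (the pure_homo = TRUE setting).
same_magnitude <- length(unique(pure_magnitude)) == 1
```

Here the pure loadings have magnitudes 1 and 0.5, so `same_magnitude` is FALSE, which corresponds to the situation described for pure_homo = FALSE.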

The second step estimates the non-pure (mixed) loadings of A. Three procedures are available, selected by est_non_pure_row. "HT" uses hard-thresholding, which is computationally fast, while "ST" uses soft-thresholding instead. Both "ST" and "Dantzig" require solving linear programs. Unlike "HT" and "ST", "Dantzig" does not require estimating the precision matrix of Z.
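For readers unfamiliar with the two thresholding rules, the following base-R sketch shows the generic elementwise operators. The function names `hard_threshold` and `soft_threshold` are hypothetical and are not part of the package. The sketch does not reproduce how LOVE chooses the threshold level or how the "ST" and "Dantzig" procedures set up their linear programs.

```r
# Hard-thresholding: keep entries whose magnitude exceeds t, zero out the rest.
hard_threshold <- function(x, t) x * (abs(x) > t)

# Soft-thresholding: shrink every entry toward zero by t, zeroing small ones.
soft_threshold <- function(x, t) sign(x) * pmax(abs(x) - t, 0)

x <- c(-1.2, -0.3, 0.1, 0.6, 2.0)
x_hard <- hard_threshold(x, 0.5)  # surviving entries keep their original size
x_soft <- soft_threshold(x, 0.5)  # surviving entries are reduced by 0.5
```

Both rules zero out the same small entries; they differ only in whether the surviving entries are left unchanged (hard) or shrunk by the threshold (soft).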

A list of objects including:

K The estimated number of clusters.
pureVec The estimated set of pure variables.
pureInd The estimated partition of pure variables.
group The estimated clusters (indices of each cluster).
A The estimated p by K assignment matrix.
C The covariance matrix of Z.
Omega The precision matrix of Z.
Gamma The diagonal of the covariance matrix of E.
optDelta The selected value of δ.

Bing, X., Bunea, F., Ning, Y. and Wegkamp, M. (2020) Adaptive estimation in structured factor models with applications to overlapping clustering. Annals of Statistics, 48(4), 2055-2081. https://projecteuclid.org/journals/annals-of-statistics/volume-48/issue-4/Adaptive-estimation-in-structured-factor-models-with-applications-to-overlapping/10.1214/19-AOS1877.short

Bing, X., Bunea, F. and Wegkamp, M. (2021) Detecting approximate replicate components of a high-dimensional random vector with latent structure. https://arxiv.org/abs/2010.02288.

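library(LOVE)
set.seed(1)  # make the simulated data reproducible

p <- 6    # number of features
n <- 100  # number of observations
K <- 2    # number of latent factors
# Loading matrix: rows 1-4 are pure, rows 5-6 are mixed
A <- rbind(c(1, 0), c(-1, 0), c(0, 1), c(0, 1), c(1/3, 2/3), c(1/2, -1/2))
Z <- matrix(rnorm(n * K, sd = sqrt(2)), n, K)  # latent factors
E <- matrix(rnorm(n * p), n, p)                # noise
X <- Z %*% t(A) + E                            # n by p data matrix
# Pure loadings may differ in magnitude; delta left at its default (NULL)
res_LOVE <- LOVE(X, pure_homo = FALSE, delta = NULL)
# Pure loadings assumed to share one magnitude; user-supplied grid for delta
res_LOVE <- LOVE(X, pure_homo = TRUE, delta = seq(0.1, 1.1, 0.1))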