iClusterVB: Fast Integrative Clustering for High-Dimensional Multi-View...

iClusterVB R Documentation

Fast Integrative Clustering for High-Dimensional Multi-View Data Using Variational Bayesian Inference

Description

iClusterVB offers a novel, fast, and integrative approach to clustering high-dimensional, mixed-type, and multi-view data. By employing variational Bayesian inference, iClusterVB facilitates effective feature selection and identification of disease subtypes, enhancing clinical decision-making.

Usage

iClusterVB(
  mydata,
  dist,
  K = 10,
  initial_method = "VarSelLCM",
  VS_method = 0,
  initial_cluster = NULL,
  initial_vs_prob = NULL,
  initial_fit = NULL,
  initial_omega = NULL,
  input_hyper_parameters = NULL,
  max_iter = 200,
  early_stop = 1,
  per = 10,
  convergence_threshold = 1e-04
)

Arguments

mydata

A list of length R, where R is the number of datasets, containing the input data.

  • Note: For categorical data, any value coded as 0 must be re-coded to a non-zero value.
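The re-coding of categorical zeros can be done in base R before the view is placed in mydata. The following is a minimal sketch: the matrix cat_view and its values are made up for illustration, and shifting every code up by one is only one possible re-coding.

```r
# Hypothetical categorical view: 5 samples x 3 features, coded 0/1/2.
cat_view <- matrix(
  c(0, 1, 2,
    1, 0, 0,
    2, 2, 1,
    0, 0, 1,
    1, 2, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(NULL, c("f1", "f2", "f3"))
)

# Shift all codes by one so that no category is coded as 0.
cat_view_recoded <- cat_view + 1

# Check that no zeros remain.
any(cat_view_recoded == 0)
```

Shifting all codes by the same constant keeps the categories distinct, which is all a multinomial view needs.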

dist

A vector of length R specifying the type of data or distribution. Options include: 'gaussian' (for continuous data), 'multinomial' (for binary or categorical data), and 'poisson' (for count data).

K

The maximum number of clusters. The default is 10. The algorithm converges to a model that keeps only the dominant clusters and removes redundant ones, so the number of clusters is determined automatically.

initial_method

The initialization method for cluster allocation. Options include: "VarSelLCM" (default), "random", "kproto" (k-prototypes), "kmeans" (continuous data only), "mclust" (continuous data only), or "lca" (poLCA, categorical data only).

VS_method

The variable/feature selection method. Options are 0 for clustering without variable/feature selection (default) and 1 for clustering with variable/feature selection.

initial_cluster

The initial cluster membership. The default is NULL, which uses initial_method for the initial cluster allocation. If not NULL, the supplied membership overrides the allocation produced by initial_method.

initial_vs_prob

The initial variable/feature selection probability, a scalar. The default is NULL, which assigns a value of 0.5.

initial_fit

Initial values based on a previously fitted iClusterVB model (an iClusterVB object). The default is NULL.

initial_omega

Customized initial values for feature inclusion probabilities. The default is NULL. If not NULL, the supplied values override the default initial values for these probabilities. If VS_method = 1, initial_omega is a list of length R, with each element being an array with dim = c(N, p[[r]]). Here, N is the sample size and p[[r]] is the number of features for dataset r, where r = 1, ..., R.
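The shape described for initial_omega can be built in base R. This is a minimal sketch: the views are random placeholders, and the starting value of 0.5 is an assumption chosen for illustration, not a value recommended by the package.

```r
# Hypothetical views: N = 6 samples, with 4 and 3 features respectively.
set.seed(1)
N <- 6
views <- list(
  view1 = matrix(rnorm(N * 4), nrow = N),
  view2 = matrix(rnorm(N * 3), nrow = N)
)

# Number of features per view, p[[r]].
p <- lapply(views, ncol)

# One N x p[[r]] array per view, filled with an assumed starting value of 0.5.
initial_omega <- lapply(p, function(p_r) array(0.5, dim = c(N, p_r)))

# Dimensions of each element, one column per view.
sapply(initial_omega, dim)
```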

input_hyper_parameters

A list of the initial hyper-parameters of the prior distributions for the model. The default is NULL, which assigns alpha_00 = 0.001, mu_00 = 0, s2_00 = 100, a_00 = 1, b_00 = 1, kappa_00 = 1, u_00 = 1, v_00 = 1.
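A custom hyper-parameter list might be assembled as follows. This is a sketch only: it assumes the list elements are named exactly like the hyper-parameters listed above, and it does not establish whether a partial list is accepted. The changed value is illustrative.

```r
# Default values as listed above; element names are assumed to match
# the hyper-parameter names in the documentation.
hyper <- list(
  alpha_00 = 0.001,
  mu_00 = 0,
  s2_00 = 100,
  a_00 = 1,
  b_00 = 1,
  kappa_00 = 1,
  u_00 = 1,
  v_00 = 1
)

# Change one value for illustration only.
hyper$s2_00 <- 10

str(hyper)
```

Under that assumption, the list would then be passed as input_hyper_parameters = hyper.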

max_iter

The maximum number of iterations for the VB algorithm. The default is 200.

early_stop

Whether to stop the algorithm upon convergence or to continue until max_iter is reached. Options are 1 (default) to stop when the algorithm converges, and 0 to stop only when max_iter is reached.

per

Print information every "per" iterations. The default is 10.

convergence_threshold

The convergence threshold for the change in ELBO. The default is 0.0001.
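How early_stop, max_iter, and convergence_threshold interact can be pictured with a small stopping rule. The sketch below assumes the rule compares the absolute change in ELBO between consecutive iterations; the package may use a different criterion (for example, a relative change). The function has_converged and the ELBO values are hypothetical.

```r
# Sketch of an ELBO-based stopping rule (assumed form: absolute change
# between consecutive iterations; the package may use a different rule).
has_converged <- function(elbo, threshold = 1e-04) {
  n <- length(elbo)
  if (n < 2) return(FALSE)
  abs(elbo[n] - elbo[n - 1]) < threshold
}

# Made-up ELBO trace that flattens out.
elbo_trace <- c(-1500, -1200, -1100.5, -1100.49996)

has_converged(elbo_trace[1:3])  # change of 99.5, above the threshold
has_converged(elbo_trace)       # change of about 4e-05, below the threshold
```

With early_stop = 1, a rule of this kind would end the run before max_iter is reached; with early_stop = 0, the run continues regardless.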

Value

The iClusterVB function creates an object (list) of class iClusterVB. Relevant outputs include:

elbo:

The evidence lower bound for each iteration.

cluster:

The cluster assigned to each individual.

initial_values:

A list of the initial values.

hyper_parameters:

A list of the hyper-parameters.

model_parameters:

A list of the model parameters after the algorithm is run.

  • Of particular interest is rho, a list of the posterior inclusion probabilities for the features in each of the data views. Each value is the probability that a given feature is included in the model, given the observed data. rho is only available if VS_method = 1.
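The posterior inclusion probabilities in rho can be turned into a set of selected features by thresholding. The sketch below uses a mock object in place of a real fit: it assumes rho holds one numeric vector per view, and the 0.5 cutoff is a common convention rather than something prescribed by the package.

```r
# Mock stand-in for a fitted object; the real structure may differ.
mock_fit <- list(
  model_parameters = list(
    rho = list(
      c(0.98, 0.02, 0.75, 0.10),  # view 1: four features
      c(0.40, 0.91, 0.55)         # view 2: three features
    )
  )
)

# Indices of features whose inclusion probability exceeds an assumed 0.5 cutoff.
selected <- lapply(mock_fit$model_parameters$rho, function(r) which(r > 0.5))
selected
```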

Note

If any of the data views are "gaussian", they must come first, both in the input data mydata and in the corresponding positions of the distribution vector dist. For example, use dist <- c("gaussian", "gaussian", "poisson", "multinomial"), and not dist <- c("poisson", "gaussian", "gaussian", "multinomial") or dist <- c("gaussian", "poisson", "gaussian", "multinomial").
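The ordering requirement can be enforced before calling iClusterVB by reordering the data list and the distribution vector together. This is a minimal base R sketch; the views are random placeholders, and the names views and view_dist are hypothetical.

```r
# Hypothetical views in arbitrary order (small random matrices).
set.seed(1)
views <- list(
  counts = matrix(rpois(20, lambda = 3), nrow = 5),
  expr_a = matrix(rnorm(20), nrow = 5),
  categ  = matrix(sample(1:3, 20, replace = TRUE), nrow = 5),
  expr_b = matrix(rnorm(20), nrow = 5)
)
view_dist <- c("poisson", "gaussian", "multinomial", "gaussian")

# Gaussian views first; ties keep their original order, so the
# remaining views stay in their relative order.
ord <- order(view_dist != "gaussian")
mydata_ordered <- views[ord]
dist_ordered <- view_dist[ord]

dist_ordered
```

Applying the same index vector to both objects keeps each view aligned with its distribution label.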

Examples

# sim_data comes with the iClusterVB package.
dat1 <- list(
  gauss_1 = sim_data$continuous1_data[c(1:20, 61:80, 121:140, 181:200), 1:75],
  gauss_2 = sim_data$continuous2_data[c(1:20, 61:80, 121:140, 181:200), 1:75],
  poisson_1 = sim_data$count_data[c(1:20, 61:80, 121:140, 181:200), 1:75])

dist <- c(
  "gaussian", "gaussian",
  "poisson")

# Note: Run time grows with `max_iter`.
# To test that the code runs, use a small value (e.g. 10).
# For more accurate results, use a larger value (e.g. 200).

fit_iClusterVB <- iClusterVB(
  mydata = dat1,
  dist = dist,
  K = 4,
  initial_method = "VarSelLCM",
  VS_method = 1,
  max_iter = 10
)

# We can obtain a summary using the summary() function
summary(fit_iClusterVB)


iClusterVB documentation built on April 3, 2025, 6:22 p.m.