klic: Kernel learning integrative clustering
In acabassi/klic: Kernel Learning Integrative Clustering

View source: R/klic.R

This function performs Kernel Learning Integrative Clustering (KLIC) on M datasets that describe the same observations. The similarities between the observations in each dataset are summarised into M kernels, which are then fed into a kernel k-means clustering algorithm. The output is a clustering of the observations that takes all available data types into account, together with a set of weights that sum to one and indicate how much each dataset contributed to the kernel k-means clustering.

klic(
  data,
  M,
  individualK = NULL,
  individualMaxK = 6,
  individualClAlgorithm = "kkmeans",
  globalK = NULL,
  globalMaxK = 6,
  B = 1000,
  C = 100,
  scale = FALSE,
  savePNG = FALSE,
  fileName = "klic",
  verbose = TRUE,
  annotations = NULL,
  ccClMethods = "kmeans",
  ccDistHCs = "euclidean",
  widestGap = FALSE,
  dunns = FALSE,
  dunn2s = FALSE
)

`data`	List of M datasets, each of size N X P_m, m = 1, ..., M.
`M`	number of datasets.
`individualK`	Vector containing the number of clusters in each dataset. Default is NULL. If the number of clusters is not provided, then all the possible values between 2 and individualMaxK are considered and the best value is chosen for each dataset by maximising the silhouette.
`individualMaxK`	Maximum number of clusters considered for the individual data. Default is 6.
`individualClAlgorithm`	Clustering algorithm used to cluster each dataset individually when the best number of clusters has to be found. Default is "kkmeans".
`globalK`	Number of global clusters. Default is NULL. If the number of clusters is not provided, then all the possible values between 2 and globalMaxK are considered and the best value is chosen by maximising the silhouette.
`globalMaxK`	Maximum number of clusters considered for the final clustering. Default is 6.
`B`	Number of iterations for consensus clustering. Default is 1000.
`C`	Maximum number of iterations for localised kernel k-means. Default is 100.
`scale`	Boolean. If TRUE, each dataset is scaled so that each column has zero mean and unit variance. Default is FALSE.
`savePNG`	Boolean. If TRUE, a plot of the silhouette is saved in the working folder. Default is FALSE.
`fileName`	If `savePNG` is TRUE, this is the name of the png file. Can be used to specify the folder path too. Default is "klic".
`verbose`	Boolean. Default is TRUE.
`annotations`	Data frame containing annotations for final plot.
`ccClMethods`	The i-th element of this vector is passed to the `clMethod` argument of `consensusCluster()` for the i-th dataset. If only one string is provided, the same method is used for all datasets. Default is "kmeans".
`ccDistHCs`	The i-th element of this vector is passed to the `dist` argument of `consensusCluster()` for the i-th dataset. Default is "euclidean".
`widestGap`	Boolean. If TRUE, the widest gap index is also computed to choose the best number of clusters. Default is FALSE.
`dunns`	Boolean. If TRUE, Dunn's index is also computed to choose the best number of clusters. Default is FALSE.
`dunn2s`	Boolean. If TRUE, the alternative Dunn's index is also computed to choose the best number of clusters. Default is FALSE.
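
When `individualK` or `globalK` is left as NULL, the number of clusters is chosen by maximising the silhouette over the values from 2 up to the relevant maximum. The following is only a minimal sketch of that selection rule, not the package's implementation: it assumes plain k-means from `stats` in place of kernel k-means, uses `cluster::silhouette()` for the silhouette widths, and runs on hypothetical toy data (`x`, `maxK`, `candidateK` and `avgSil` are illustrative names).

```r
# Illustrative only: plain k-means stands in for klic's kernel k-means.
set.seed(1)

# Hypothetical toy data: two well-separated groups of 20 points in 2-D.
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))

maxK <- 6              # plays the role of individualMaxK / globalMaxK
candidateK <- 2:maxK   # values of K that are compared
d <- dist(x)           # Euclidean distances between observations

# Average silhouette width for each candidate number of clusters.
avgSil <- sapply(candidateK, function(k) {
  labels <- kmeans(x, centers = k, nstart = 10)$cluster
  mean(cluster::silhouette(labels, d)[, "sil_width"])
})

# The chosen K is the candidate with the largest average silhouette.
bestK <- candidateK[which.max(avgSil)]
```

Supplying `individualK` and `globalK` directly skips this search, which is why the example at the end of this page runs quickly.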

The function returns a list containing:

`consensusMatrices`	an array containing one consensus matrix per data set.
`weights`	a vector containing the weights assigned by the kernel k-means algorithm to each consensus matrix.
`weightedKM`	the weighted kernel matrix obtained by taking a weighted sum of all kernels, where the weights are those specified in the `weights` matrix.
`globalClusterLabels`	a vector containing the cluster labels of the observations, according to kernel k-means clustering done on the kernel matrices.
`bestK`	a vector containing the best number of clusters between 2 and `individualMaxK` for each kernel. These are chosen so as to maximise the silhouette and are only returned if the number of clusters `individualK` is not provided.
`globalK`	the best number of clusters for the final (global) clustering. This is chosen so as to maximise the silhouette and only returned if the final number of clusters `globalK` is not provided.
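
To give an idea of what a consensus matrix of the kind returned in `consensusMatrices` represents, here is a rough sketch of consensus clustering by repeated subsampling. It is not the code behind `consensusCluster()`: the helper `consensusSketch()`, the 80% subsampling proportion `pItem` and the use of plain k-means are all assumptions made for illustration. Entry (i, j) of the result is the proportion of the `B` runs in which observations i and j were assigned to the same cluster, out of the runs in which both were sampled.

```r
# Hypothetical helper, for illustration only (not part of klic).
consensusSketch <- function(x, K, B = 50, pItem = 0.8) {
  N <- nrow(x)
  together <- matrix(0, N, N)  # times i and j fell in the same cluster
  sampled  <- matrix(0, N, N)  # times i and j were both subsampled
  for (b in seq_len(B)) {
    idx <- sample(N, size = floor(pItem * N))
    labels <- kmeans(x[idx, , drop = FALSE], centers = K)$cluster
    same <- outer(labels, labels, "==")
    together[idx, idx] <- together[idx, idx] + same
    sampled[idx, idx]  <- sampled[idx, idx] + 1
  }
  # Guard against 0/0 for pairs that were never sampled together.
  cm <- together / pmax(sampled, 1)
  diag(cm) <- 1
  cm
}

set.seed(1)
# Hypothetical toy data: two groups of 20 observations each.
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
cm <- consensusSketch(x, K = 2, B = 30)
```

The resulting matrix is symmetric with entries in [0, 1], so it can be read as a similarity between observations, which is the role the consensus matrices play as kernels in the description above.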

Alessandra Cabassi alessandra.cabassi@mrc-bsu.cam.ac.uk

Cabassi, A. and Kirk, P. D. W. (2019). Multiple kernel learning for integrative consensus clustering of genomic datasets. arXiv preprint. arXiv:1904.07701.

# The example only runs if Rmosek is installed and configured
if (requireNamespace("Rmosek", quietly = TRUE) &&
    (!is.null(utils::packageDescription("Rmosek")$Configured.MSK_VERSION))) {

  # Load synthetic data
  data1 <- as.matrix(read.csv(system.file('extdata',
    'dataset1.csv', package = 'klic'), row.names = 1))
  data2 <- as.matrix(read.csv(system.file('extdata',
    'dataset2.csv', package = 'klic'), row.names = 1))
  data3 <- as.matrix(read.csv(system.file('extdata',
    'dataset3.csv', package = 'klic'), row.names = 1))
  data <- list(data1, data2, data3)

  # Perform clustering with KLIC, assuming that the number of
  # clusters in each individual dataset and in the final
  # clustering is known
  klicOutput <- klic(data, 3, individualK = c(4, 4, 4),
    globalK = 4, B = 30, C = 5)

  # Extract cluster labels
  klic_labels <- klicOutput$globalClusterLabels

  # Load the true cluster labels
  cluster_labels <- as.matrix(read.csv(system.file('extdata',
    'cluster_labels.csv', package = 'klic'), row.names = 1))

  # Compute the adjusted Rand index (ARI) between the KLIC
  # labels and the true labels
  ari <- mclust::adjustedRandIndex(klic_labels, cluster_labels)
}