Semi-supervised clustering

Share:

Description

For a given partition, this function assigns new samples to one of the clusters in the partition. Partition (clustering) is composed of non-overlapping clusters

Usage

1
2
3
cluster_pred(X, partition, surv.time = NULL, status = NULL, te.index, minclus = 4, 
             te.surv.time = NULL, te.status = NULL, method = "conc", maxmiss = 30, 
              plot.it = FALSE, ...)

Arguments

X

An object of class ExpressionSet or data matrix in which columns are assumed to represent the samples, and rows represents the sample's features. X can also be a square distance matrix or object cass of dist. It must include the data set from which partition is obtained and the data set of test samples.

partition

A numeric vector contains the non-overlapping cluster labels.

surv.time

A numeric vector contains follow-up time of patients in the partition.

status

A binary vector contains survival status of patients in the partition, 0 = alive, 1 = dead.

te.index

A numeric vector contains the indices of columns in X corresponds to the test samples.

minclus

The minimum number samples allowed to form a cluster for the test set. This is to avoid returning tiny clusters and reduce the effect of outliers.

te.surv.time

An optional vector contains follow-up time of patients in the test set. If supplied with te.status, the logrank test p-value is calculated for the test set.

te.status

An optional vector contains survival status of patients in the test set. If supplied with te.surv.time, the logrank test p-value is calculated for the test set.

method

Type of methods to use in assigning test samples to one of the clusters in the partition. Must be either Ward distance 'ward' or Harrel's concordance index 'conc' (default).

maxmiss

Maximum percentage of missing values per row in X

plot.it

If TRUE and follow-up data of the test samples are given, Kaplan Meier curve(s) will be generated for each cluster in the test set.

...

Arguments for impute.knn from the impute package.

Details

User has two options to assign test set to one of the clusters in the partition. One option is to use the Ward distance. Specifically, an average distance is calculated between a test sample and samples in each cluster in the partition, separately. The test sample is assigned to a cluster for which average distance is the smallest. Follow-up data is not required for this option.

Second option is to use the Harrel's concordance index (Harrel et al., 1982). For this option both main and follow-up data corresponds to the given partition are required. Main data is used to find the pseudo nearest neighbours (PNN) of a test sample (Obulkasim et a., 2011), and follow-up data is used to check how much PNN's follow-up info is concordant with follow-up info of samples in each cluster. The test sample is assigned to a cluster for which average concordance is the highest.

Before selecting either one of the options, we recommend user to check the correlation between main data and follow-up info (e.g. using global test). If correlation is relatively large, we recommend to use 'conc' option, and vice versa.

Value

If plot.it is FALSE, function returns a vector of predicted cluster labels of the test set. If TRUE and follow-up data of the test set are given, function returns a list object contains following components:

St

a data frame with following five columns:

[,1] the unique survival times found in the test set's follow-up.
[,2] the cluster labels of the test set.
[,3] the number of patients at risk at each unique time point.
[,4] the survival probability at each unique time point.
[,5] the variance of the survival probability at each unique time point.
value

logrank test p-value for the test set

Author(s)

Askar Obulkasim

References

Obulkasim,A. et al., (2013). "Semi-supervised adaptive-height snipping of the Hierarchical Clustering tree", submitted.

Harrel,E.F. et al., (1982). "Evaluating the yield of medical tests", JAMA, 247, 2543-2546.

Obulkasim,A. et al., (2011). "Stepwise classification of cancer samples using clinical and molecular data", BMC Bioinformatics, 12, 422.

Troyanskaya,O. et al., (2001). "Missing value estimation methods for DNA microarrays". Bioinformatics, 17, 520-525.

See Also

TwoHC_assign

Examples

1
2
3
4
5
6
7
data(BullingerLeukemia)
attach(BullingerLeukemia)
cl <- HCsnipper(em[, 1:30], min = 5)
cl <- cl$partitions[cl$id, ]
result <- cluster_pred(X = em[, 1:50], partition = cl[1, ], surv.time = surv.time[1:30], 
                       status = status[1:30], te.index = 31:50)
names(result)