Function to assign new samples to one of the two given hierarchical clustering trees in a semi-supervised way

Description

Given molecular data from two non-overlapping groups of patients, this function constructs two independent hierarchical clustering (HC) trees and assigns new samples to one of them in a semi-supervised way. See Details.

Usage

TwoHC_assign(X, index1, index2, new.X, dis.method = "cor", link.method = "ward",
             minclus = 4, maxmiss = 30, surv.time, status, method1 = "BIC",
             method2 = "g2")

Arguments

X

An object of class ExpressionSet or a data matrix from which the two HC trees are to be derived. Columns are assumed to represent samples and rows to represent features. Missing values are allowed.

index1

Column indices of the patients in X that belong to the first group.

index2

Column indices of the patients in X that belong to the second group.

new.X

An object of class ExpressionSet or a data matrix containing the new samples. Columns are assumed to represent samples and rows to represent features. Missing values are allowed.

dis.method

The distance measure to be used. This must be one of the methods accepted by the dist function, or "cor" for the Pearson correlation (default).
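The two kinds of distance can be illustrated in base R. This is a minimal sketch on toy data; the form 1 - r for the correlation-based dissimilarity and the pairwise handling of missing values are assumptions for illustration, not necessarily what TwoHC_assign computes internally.

```r
set.seed(1)

# Toy data: 20 features (rows) x 6 samples (columns), with one missing value
X <- matrix(rnorm(120), nrow = 20, dimnames = list(NULL, paste0("s", 1:6)))
X[3, 2] <- NA

# Correlation-based dissimilarity between samples (assumed form: 1 - Pearson r)
d_cor <- as.dist(1 - cor(X, use = "pairwise.complete.obs"))

# A measure handled by dist(); dist() compares rows, so transpose first
d_euc <- dist(t(X), method = "euclidean")
```

Both objects are of class "dist" and can be passed directly to hclust.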

link.method

The agglomeration method to be used. This should be one of "ward" (default), "single", "complete", "average", "mcquitty", "median" or "centroid".

minclus

The minimum number of samples allowed to form a cluster. This parameter is inversely related to the number of partitions returned from an HC tree: a large value returns a small number of partitions, and vice versa. Default is 4.
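The effect of a minimum cluster size can be sketched with fixed-height cuts from cutree. This is an illustration only: the function extracts its partitions in its own way, and the toy data and the range of k below are assumptions.

```r
set.seed(2)

# Toy data: 20 features (rows) x 12 samples (columns)
X <- matrix(rnorm(20 * 12), nrow = 20)
hc <- hclust(dist(t(X)), method = "average")

# Candidate partitions: cut the tree into k = 2, ..., 6 clusters
ks <- 2:6
min_size <- sapply(ks, function(k) min(table(cutree(hc, k = k))))

# Keep only the partitions whose smallest cluster has at least minclus samples
minclus <- 4
ks_kept <- ks[min_size >= minclus]
```

Raising minclus can only shrink ks_kept, which is the inverse relationship described above.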

maxmiss

Maximum percentage of missing values allowed per row in X. Default is 30.
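A row-wise missingness threshold of this kind can be computed as follows. The sketch assumes that rows exceeding the threshold are dropped; how TwoHC_assign actually treats such rows is not stated here.

```r
# Toy data: 4 features (rows) x 10 samples (columns)
X <- matrix(as.numeric(1:40), nrow = 4)
X[1, 1:2] <- NA   # 20% missing in row 1
X[2, 1:5] <- NA   # 50% missing in row 2

maxmiss <- 30

# Percentage of missing values per row
pct_missing <- rowMeans(is.na(X)) * 100

# Assumed behaviour: keep rows with at most maxmiss percent missing
X_kept <- X[pct_missing <= maxmiss, , drop = FALSE]
```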

surv.time

A numeric vector containing the follow-up information of the patients in X.

status

A binary vector containing the survival status of the patients in X, normally 0 = alive, 1 = dead.

method1

The partition evaluation measure used to assess the relationship between follow-up and a partition. Default is "BIC".
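One plausible way to score a partition against follow-up data with BIC is to fit a Cox model with cluster membership as a factor. The sketch below uses the survival package on simulated data; whether TwoHC_assign computes its "BIC" score in exactly this way is an assumption, and the partitions partA and partB are hypothetical.

```r
library(survival)
set.seed(3)

# Simulated follow-up for 40 patients (hypothetical data)
time <- rexp(40, rate = 0.1)
status <- rbinom(40, size = 1, prob = 0.7)   # 0 = alive, 1 = dead

# Two hypothetical candidate partitions of the same 40 patients
partA <- rep(1:2, each = 20)
partB <- rep(1:4, each = 10)

# BIC of a Cox model with cluster membership as the only covariate
bic <- sapply(list(A = partA, B = partB), function(p) {
  BIC(coxph(Surv(time, status) ~ factor(p)))
})
```

Under this scoring, the partition with the lower BIC would be preferred.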

method2

The partition evaluation measure used to assess the relationship between the data matrix X and a partition. Default is the Goodman and Kruskal index, "g2".

Details

Suppose molecular profiles are available for two non-overlapping groups of patients, treated with two different drugs or with the same drugs in different combinations, together with their follow-up information. When a new patient arrives for whom only a molecular profile is available, the question is which group this patient should be assigned to so that he or she benefits most from the treatment that group received.

This function is designed for this problem. It works as follows: first, two independent HC trees are derived from the given data; second, partitions are extracted from each HC tree and the optimal partition is selected for each tree separately; third, the new patient's molecular profile is compared with each cluster in each optimal partition to calculate the average similarity and to identify the most similar cluster in each of the two HC trees (the two competing clusters); finally, the new sample is assigned to whichever of the two competing clusters has the better overall survival.
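The four steps can be sketched in base R on toy data. This is a simplified illustration, not the function's implementation: a single fixed cut with cutree stands in for partition extraction and selection, correlation stands in for the similarity measure, and median follow-up time (ignoring censoring) stands in for the survival comparison. All data and helper names (build_tree, avg_sim) are hypothetical.

```r
set.seed(42)
n_feat <- 50

# Toy training data: two groups of 12 patients (columns), 50 features (rows)
X <- matrix(rnorm(n_feat * 24), nrow = n_feat)
index1 <- 1:12
index2 <- 13:24
surv.time <- rexp(24, rate = 0.1)   # hypothetical follow-up times
new.x <- rnorm(n_feat)              # profile of one new patient

# Step 1: one HC tree per group, using correlation-based dissimilarity
build_tree <- function(M) hclust(as.dist(1 - cor(M)), method = "average")
hc1 <- build_tree(X[, index1])
hc2 <- build_tree(X[, index2])

# Step 2 (simplified): a single fixed partition per tree
part1 <- cutree(hc1, k = 2)
part2 <- cutree(hc2, k = 2)

# Step 3: average similarity of the new sample to each cluster
avg_sim <- function(M, part, x) {
  sims <- as.vector(cor(M, x))   # correlation of x with every sample in M
  tapply(sims, part, mean)       # mean similarity per cluster
}
sim1 <- avg_sim(X[, index1], part1, new.x)
sim2 <- avg_sim(X[, index2], part2, new.x)
c1 <- as.integer(names(which.max(sim1)))   # competing cluster in tree 1
c2 <- as.integer(names(which.max(sim2)))   # competing cluster in tree 2

# Step 4 (crude stand-in): prefer the cluster with the longer median follow-up
med1 <- median(surv.time[index1][part1 == c1])
med2 <- median(surv.time[index2][part2 == c2])
assigned_tree <- if (med1 >= med2) 1L else 2L

c(tree = assigned_tree, cluster.hc1 = c1, cluster.hc2 = c2)
```

The final vector mirrors the layout of one row of the Assign component: the chosen tree, followed by the most similar cluster in each tree.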

Value

A list object containing the following components:

hc1

HC tree derived from the data corresponding to the first group.

hc2

HC tree derived from the data corresponding to the second group.

partitions.hc1

A matrix of partitions extracted from hc1. Rows represent partitions and columns represent samples.

partitions.hc2

A matrix of partitions extracted from hc2. Rows represent partitions and columns represent samples.

best.hc1

Optimal partition found on hc1.

best.hc2

Optimal partition found on hc2.

score.hc1

A matrix with two columns. The first column contains the quality scores of partitions.hc1 calculated using the follow-up data. The second column contains the quality scores of partitions.hc1 calculated using X.

score.hc2

The same as score.hc1, but for partitions.hc2.

Assign

A matrix with three columns. The first column contains the indices of HC trees to which a test sample was assigned. The second column contains the indices of clusters in best.hc1 to which a test sample was most similar. The third column contains the indices of clusters in best.hc2 to which a test sample was most similar.
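A matrix with this layout can be summarised as below. The values and the column names are made up for illustration; the actual Assign component may not carry column names.

```r
# Hypothetical Assign matrix for five test samples
Assign <- cbind(tree        = c(1, 2, 2, 1, 2),
                cluster.hc1 = c(1, 1, 2, 3, 1),
                cluster.hc2 = c(2, 2, 1, 1, 2))

# Number of test samples assigned to each tree
per_tree <- table(Assign[, 1])

# Cluster each sample ends up in, taken from the tree it was assigned to
winning <- ifelse(Assign[, 1] == 1, Assign[, 2], Assign[, 3])
```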

surv.time

The same as input

status

The same as input

index1

The same as input

index2

The same as input

new.X

The same as input

X

The same as input

method1

The same as input

method2

The same as input

minclus

The same as input

id1

Indices of the partitions obtained from hc1 in which the minimum cluster size is equal to or larger than minclus.

id2

Indices of the partitions obtained from hc2 in which the minimum cluster size is equal to or larger than minclus.

Author(s)

Askar Obulkasim

References

Harrell,F.E. et al., (1982). "Evaluating the yield of medical tests", JAMA, 247, 2543-2546.

Obulkasim,A. et al., (2011). "Stepwise classification of cancer samples using clinical and molecular data", BMC Bioinformatics, 12, 422.

Troyanskaya,O. et al., (2001). "Missing value estimation methods for DNA microarrays", Bioinformatics, 17, 520-525.

Obulkasim,A. et al., (2013). "Semi-supervised adaptive-height snipping of the Hierarchical Clustering tree", submitted.

See Also

TwoHC_perm, cluster_pred

Examples

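library(HCsnip)  # package assumed to provide TwoHC_assign and the TcgaGBM data
data(TcgaGBM)
attach(TcgaGBM)

# Patients treated with each drug
id1 <- which(drugs == "Avastin")
id2 <- which(drugs == "Temodar")

# First 30 patients per drug build the trees; the next 30 per drug are the new samples
train <- c(id1[1:30], id2[1:30])
test <- c(id1[31:60], id2[31:60])
result <- TwoHC_assign(X = em[, train], index1 = 1:30, index2 = 31:60,
                       new.X = em[, test], minclus = 4,
                       surv.time = surv.time[train], status = status[train])
detach(TcgaGBM)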