For a given set of partitions (each partition is composed of non-overlapping clusters), this function uses two types of data to evaluate each partition and selects the optimal one, i.e. the partition that ranks highest in terms of both data types (score1 and score2 are presumed to come from two different data sources). A permutation approach is used to calculate the corrected p-value of the selected partition.
partitions |
A matrix in which rows represent partitions and columns represent samples |
surv.time |
A numeric vector containing the follow-up times of the patients in the partition |
status |
A binary vector containing the survival status of the patients in the partition: 0 = alive, 1 = dead |
score1 |
A numeric vector containing the quality score of each partition. Scores are assumed to be calculated using the follow-up data. Note: prepare this vector so that a high value corresponds to a good-quality partition. |
score2 |
A numeric vector containing the quality score of each partition, calculated using any data type other than follow-up. As with score1, this vector must be prepared so that a high value corresponds to a good-quality partition. |
method |
Type of partition evaluation measure to use. Must be the same as the measure used to calculate score1. Default is 'BIC'. |
nperm |
The number of permutations. |
When studying the association of cluster membership with follow-up data, we cannot use the standard testing procedures, because score1 has already used the follow-up data. A standard test would therefore use the follow-up data twice, and the resulting p-value would likely be too small. We avoid this bias by also applying the semi-supervised partition selection under the null hypothesis. This null hypothesis is simply the absence of association between the follow-up and the data type used to generate score2. Our partition selection, in combination with a suitable test statistic, is designed to detect associations that can be represented by groups of samples. We adapt the p-value computation as follows:
Use a suitable test statistic (e.g. log-rank for time-to-event data and chi-square for nominal data) to compute the conditional p-value given the cluster labels in the selected partition: p_obs.
For i = 1, ..., nperm:
Randomly permute the follow-up data among the samples.
Apply exactly the same type of evaluation measure to evaluate all partitions, i.e. generate a new score1 while keeping score2 fixed. Select the best partition as before.
Conditional on the resulting partition, compute the p-value p_i.
Finally, the p-value of interest equals the number of times p_i is smaller than or equal to p_obs, divided by the number of permutations run.
Here, p satisfies a crucial property of a p-value: it is uniformly distributed when the null hypothesis is true, because p_obs and the p_i are then exchangeable random variables. The exchangeability is a result of the null hypothesis and of using exactly the same procedure to compute p_obs and p_i.
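The steps above can be sketched in base R. This is only an illustration under stated assumptions, not the source of perm_test(): the helper names rank_select() and perm_p_sketch() are hypothetical, combining the two scores by a sum of ranks is an assumed selection rule, and a chi-square test on a toy binary outcome stands in for the log-rank test so that the sketch needs no extra packages.

```r
# Hypothetical helper: pick the partition with the best combined rank.
# Higher score = better partition; the sum-of-ranks rule is an assumption.
rank_select <- function(score1, score2) {
  which.min(rank(-score1) + rank(-score2))
}

# Hypothetical sketch of the permutation scheme (not the package code).
# score_fun(labels, y) and test_fun(labels, y) are user-supplied functions.
perm_p_sketch <- function(partitions, outcome, score2,
                          score_fun, test_fun, nperm = 100) {
  select_and_test <- function(y) {
    # Recompute score1 from the (possibly permuted) outcome; score2 is fixed.
    score1 <- apply(partitions, 1, score_fun, y = y)
    best <- rank_select(score1, score2)
    list(best = best, p = test_fun(partitions[best, ], y))
  }
  obs <- select_and_test(outcome)
  # Repeat selection and testing on permuted outcomes.
  perm.p <- vapply(seq_len(nperm),
                   function(i) select_and_test(sample(outcome))$p,
                   numeric(1))
  list(obs.p = obs$p, perm.p = perm.p, best = partitions[obs$best, ],
       p = mean(perm.p <= obs$p))  # share of p_i <= p_obs
}

# Toy data: 3 candidate partitions of 20 samples and a binary outcome.
set.seed(1)
partitions <- rbind(rep(1:2, each = 10),
                    rep(1:2, times = 10),
                    rep(c(1, 2, 2, 1), times = 5))
outcome <- sample(rep(0:1, each = 10))
score2 <- c(0.2, 0.9, 0.5)  # made-up scores from the second data type

test_fun <- function(labels, y) {
  suppressWarnings(chisq.test(table(labels, y))$p.value)
}
score_fun <- function(labels, y) -log(test_fun(labels, y))

res <- perm_p_sketch(partitions, outcome, score2, score_fun, test_fun,
                     nperm = 50)
res$p
```

Because the selection step is rerun inside every permutation, the permuted p-values go through the same data-driven choice of partition as the observed one, which is what makes the comparison fair.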
A list containing the following objects:
obs.p |
Observed p-value |
perm.p |
A vector of p-values from permutations. |
best |
Selected optimal partition |
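As a small illustration of how the returned elements relate to the corrected p-value described in the details, the proportion of permutation p-values that do not exceed the observed one can be computed directly. The list below is a made-up stand-in for a real return value; only the element names obs.p and perm.p come from this page.

```r
# Made-up stand-in for the list returned by perm_test() (values are invented).
result <- list(
  obs.p  = 0.03,
  perm.p = c(0.01, 0.20, 0.03, 0.50, 0.04, 0.75, 0.02, 0.33, 0.61, 0.09)
)

# Corrected p-value: share of permutation p-values <= the observed p-value.
p.corrected <- mean(result$perm.p <= result$obs.p)
p.corrected
```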
Askar Obulkasim
Obulkasim,A. et al., (2013). "Semi-supervised adaptive-height snipping of the Hierarchical Clustering tree", submitted.
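# Load the example data set and make its objects (em, surv.time, status) available.
data(BullingerLeukemia)
attach(BullingerLeukemia)
# Generate candidate partitions for the first 30 samples.
cl <- HCsnipper(em[, 1:30], min = 5)
cl <- cl$partitions[cl$id, ]
# score2: partition quality computed from em (no follow-up data involved).
m <- apply(cl, 1, function(x) measure(parti = x, dis = 1 - cor(em[, 1:30])))
# score1: partition quality computed from the follow-up data.
s <- apply(cl, 1, function(x) surv_measure(x, surv.time[1:30], status[1:30]))
# Run the permutation test with 10 permutations.
result <- perm_test(cl, surv.time[1:30], status[1:30], score1 = s, score2 = m, nperm = 10)
### Visualize cluster differences in terms of entropy.
H <- EnvioPlot(X = em[, 1:30], parti = result$best)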