# A function to select an optimal partition (clustering) from a large number of candidates and calculate its p-value

### Description

For a given set of partitions (each partition is composed of non-overlapping clusters), this function uses two types of data to evaluate each partition and selects the optimal one, i.e. the partition that ranks highest with respect to both data types (*score1* and *score2* are presumed to come from two different data sources). A permutation approach is used to calculate the corrected *p*-value of the selected partition.

### Usage

1 |

### Arguments

| Argument | Description |
| --- | --- |
| `partitions` | A matrix in which rows represent partitions and columns represent samples. |
| `surv.time` | A numeric vector containing the follow-up times of the patients in the partition. |
| `status` | A binary vector containing the survival status of the patients in the partition: 0 = alive, 1 = dead. |
| `score1` | A numeric vector containing the quality score of each partition. Scores are assumed to be calculated from the follow-up data. Prepare this vector so that a high value corresponds to a good-quality partition. |
| `score2` | A numeric vector containing the quality score of each partition, calculated from any data type other than the follow-up data. As with `score1`, a high value should correspond to a good-quality partition. |
| `method` | Type of partition evaluation measure to use. Must be the same type of measure that was used to calculate `score1`. |
| `nperm` | The number of permutations. |
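
Both score vectors are expected to be oriented so that larger means better. As a minimal sketch of that preparation step, assuming a hypothetical lower-is-better measure (the `pvals` vector below is made up for illustration and is not produced by any function documented here), the scores can be transformed before being passed on:

```r
# Hypothetical lower-is-better scores, one per candidate partition
# (illustrative values only).
pvals <- c(0.30, 0.02, 0.15)

# Flip the orientation so that a high value marks a good partition.
score1 <- -log10(pvals)

# The partition with the smallest p-value now has the largest score.
best_by_score <- which.max(score1)
```

Any monotone decreasing transformation would serve equally well here, since only the ordering of the partitions is described as mattering for the selection.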

### Details

When studying the association of cluster membership with follow-up data, we cannot use the standard testing procedures, because *score1* has already been computed from the follow-up data. We would thus use the follow-up data twice, and the resulting *p*-value would likely be too small. We avoid this bias by also applying the semi-supervised partition selection under the null hypothesis. This null hypothesis is simply the absence of association between the follow-up data and the data type used to generate *score2*. Our partition selection, in combination with a suitable test statistic, is designed to detect associations that can be represented by groups of samples. We adapt the *p*-value computation as follows:

1. Use a suitable test statistic (e.g. log-rank for time-to-event data and chi-square for nominal data) to compute the conditional *p*-value given the cluster labels in the selected partition: *p_obs*.
2. For i = 1, ..., nperm:
    1. Randomly permute the follow-up data among the samples.
    2. Apply exactly the same type of evaluation measure to evaluate all partitions, i.e. generate a new *score1* while *score2* stays fixed. Select the best partition as before.
    3. Conditional on the resulting partition, compute the *p*-value *p_i*.
3. Finally, the *p*-value of interest equals the number of times *p_i* is smaller than or equal to *p_obs*, divided by the number of permutations run.
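
The procedure above can be sketched in a few lines of R. This is an illustration under stated assumptions, not the package's implementation: the helper names (`logrank_p`, `select_best`, `sketch_perm_test`) are hypothetical, *score1* is assumed to be the log-rank chi-square statistic, and the "best" partition is assumed to be the one with the largest sum of ranks of *score1* and *score2*. The simulated data are only there to make the sketch runnable.

```r
library(survival)  # provides Surv() and survdiff() for the log-rank test

# Log-rank p-value conditional on one partition (a vector of cluster labels).
logrank_p <- function(lab, time, status) {
  fit <- survdiff(Surv(time, status) ~ lab)
  pchisq(fit$chisq, df = length(fit$n) - 1, lower.tail = FALSE)
}

# ASSUMPTION: score1 is the log-rank chi-square statistic, and the selected
# partition maximises rank(score1) + rank(score2).
select_best <- function(partitions, time, status, score2) {
  score1 <- apply(partitions, 1, function(lab) {
    survdiff(Surv(time, status) ~ lab)$chisq
  })
  which.max(rank(score1) + rank(score2))
}

sketch_perm_test <- function(partitions, time, status, score2, nperm = 100) {
  # Step 1: select on the observed data and compute p_obs.
  best <- select_best(partitions, time, status, score2)
  p_obs <- logrank_p(partitions[best, ], time, status)

  # Step 2: repeat selection and testing on permuted follow-up data.
  p_perm <- replicate(nperm, {
    idx <- sample(length(time))  # time and status are permuted together
    b <- select_best(partitions, time[idx], status[idx], score2)
    logrank_p(partitions[b, ], time[idx], status[idx])
  })

  # Step 3: fraction of permuted p-values not exceeding the observed one.
  list(obs.p = p_obs, perm.p = p_perm,
       p = mean(p_perm <= p_obs), best = partitions[best, ])
}

# Simulated inputs: 5 random two-cluster partitions of 40 samples.
set.seed(1)
n <- 40
time <- rexp(n)
status <- rbinom(n, 1, 0.7)
partitions <- t(replicate(5, sample(rep(1:2, length.out = n))))
score2 <- runif(5)  # stands in for a score from a second data source

res <- sketch_perm_test(partitions, time, status, score2, nperm = 50)
```

Note that *score2* is computed once and reused in every permutation, while the follow-up-based score is recomputed each time, which mirrors the distinction made in step 2.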

Here, *p* satisfies a crucial property of a *p*-value: it is uniformly distributed when the null hypothesis is true, because then *p_obs* and *p_i* are exchangeable random variables. The exchangeability is a consequence of the null hypothesis and of using exactly the same procedure to compute *p_obs* and *p_i*.
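
Written as a formula, the final *p*-value described above is the fraction of permuted *p*-values that do not exceed the observed one. The indicator notation and the symbol for the number of permutations are notational choices made here for illustration:

```latex
p \;=\; \frac{1}{n_{\mathrm{perm}}} \sum_{i=1}^{n_{\mathrm{perm}}} \mathbf{1}\{\, p_i \le p_{\mathrm{obs}} \,\}
```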

### Value

A list containing the following objects:

| Object | Description |
| --- | --- |
| `obs.p` | Observed *p*-value |
| `perm.p` | A vector of *p*-values obtained from the permutations |
| `best` | Selected optimal partition |
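
As a minimal sketch of how the returned components relate to the corrected *p*-value described in Details, assuming `perm.p` holds the per-permutation values *p_i* and `obs.p` holds *p_obs*, the final value could be computed as below. The `result` list here is a hand-made stand-in with illustrative numbers, not real output of the function.

```r
# Stand-in for a returned list (illustrative values only).
result <- list(obs.p = 0.01,
               perm.p = c(0.20, 0.005, 0.43, 0.01, 0.77))

# Fraction of permuted p-values that are smaller than or equal to the observed one.
p_corrected <- mean(result$perm.p <= result$obs.p)
p_corrected  # 0.4: two of the five permuted values are <= 0.01
```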

### Author(s)

Askar Obulkasim

### References

Obulkasim, A. et al. (2013). "Semi-supervised adaptive-height snipping of the Hierarchical Clustering tree", submitted.

### See Also

`TwoHC_perm`

### Examples

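```
library(HCsnip)  # package providing the functions and data set used below
data(BullingerLeukemia)
attach(BullingerLeukemia)
# Generate candidate partitions for the first 30 samples.
cl <- HCsnipper(em[, 1:30], min = 5)
cl <- cl$partitions[cl$id, ]
# score2: quality of each partition based on the expression data (1 - correlation as dissimilarity).
m <- apply(cl, 1, function(x) measure(parti = x, dis = 1-cor(em[, 1:30])))
# score1: quality of each partition based on the follow-up data.
s <- apply(cl, 1, function(x) surv_measure(x, surv.time[1:30], status[1:30]))
# Select the optimal partition and run the permutation test.
result <- perm_test(cl, surv.time[1:30], status[1:30], score1 = s, score2 = m, nperm = 10)
### Visualize cluster differences in terms of Entropy.
H <- EnvioPlot(X = em[, 1:30], parti = result$best)
```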