Perform subtyping using one type of high-dimensional data
PINSPerturbationClustering(
  data,
  Kmax = 10,
  noisePercent = "med",
  iter = 200,
  kmIter = 20,
  PCAFunction = NULL
)
data | Input matrix or data frame. Rows represent samples and columns represent features.
Kmax | The maximum number of clusters. Default is 10.
noisePercent | Determines the noise standard deviation. Default is "med", i.e. the noise standard deviation is the median standard deviation of the features. If noisePercent is numeric, the noise standard deviation is noisePercent * sd(data).
iter | The number of perturbed datasets. Default is 200.
kmIter | The number of random initializations (sets of initial centers) used in k-means clustering. Default is 20.
PCAFunction | Custom PCA function for dimension reduction. Default is NULL.
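The noisePercent rule described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's code: the helper name `noise_sd` is hypothetical, and `sd(data)` is read here as the standard deviation over all entries of the matrix.

```r
# Hypothetical helper (not part of the package): map noisePercent to a
# noise standard deviation, following the argument description.
noise_sd <- function(data, noisePercent = "med") {
  data <- as.matrix(data)
  if (identical(noisePercent, "med")) {
    # median of the per-feature (column) standard deviations
    stats::median(apply(data, 2, stats::sd))
  } else if (is.numeric(noisePercent)) {
    # assumption: sd(data) means the sd over all entries of the matrix
    noisePercent * stats::sd(as.vector(data))
  } else {
    stop("noisePercent must be \"med\" or numeric")
  }
}

set.seed(1)
x <- matrix(rnorm(50 * 4), nrow = 50)  # 50 samples, 4 features
sd_med <- noise_sd(x)                  # "med" setting
sd_num <- noise_sd(x, 0.5)             # numeric setting
```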
The data are first clustered using k-means. For each value of k in the range [2:Kmax], the algorithm builds an original connectivity matrix from the partitioning obtained by k-means. The algorithm then adds Gaussian noise to the data and rebuilds the connectivity between samples. For each value of k, the algorithm builds iter connectivity matrices and then averages them to provide one perturbed connectivity matrix.
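The construction of the original and perturbed connectivity matrices for a single k can be sketched as follows. This is a minimal base R illustration, not the package's implementation: the helper `connectivity`, the toy data, and the names `n_iter` and `noise_sigma` are assumptions, and a small number of perturbations is used to keep the example quick.

```r
# Hypothetical helper: entry (i, j) is 1 when samples i and j share a cluster.
connectivity <- function(labels) {
  outer(labels, labels, "==") * 1
}

set.seed(42)
# toy data: 20 samples (rows), 5 features, two well-separated groups
x <- rbind(matrix(rnorm(50, mean = 0), nrow = 10),
           matrix(rnorm(50, mean = 6), nrow = 10))

k <- 2
n_iter <- 20                                          # stands in for `iter`
noise_sigma <- stats::median(apply(x, 2, stats::sd))  # the "med" setting

# original connectivity from k-means on the unperturbed data
orig <- connectivity(stats::kmeans(x, centers = k, nstart = 20)$cluster)

# perturbed connectivity: average over noisy copies of the data
pert <- matrix(0, nrow(x), nrow(x))
for (i in seq_len(n_iter)) {
  noisy <- x + matrix(rnorm(length(x), sd = noise_sigma), nrow = nrow(x))
  pert <- pert + connectivity(stats::kmeans(noisy, centers = k, nstart = 20)$cluster)
}
pert <- pert / n_iter
```

Each entry of `pert` is then the fraction of perturbed runs in which the two samples were clustered together.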
For each value of k, the algorithm then constructs a difference matrix, which is the absolute difference between the original and perturbed connectivity matrices for the given k. It then calculates the empirical cumulative distribution function (CDF) of the entries of the difference matrix. The area under the CDF curve (AUC) is used to assess the stability of the clustering: the more entries of the difference matrix are close to zero, the larger the AUC. The algorithm selects as optimal the value of k that maximizes the AUC.
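The stability score can be illustrated with a short base R sketch. The helper `cdf_auc`, the integration grid, and the trapezoidal rule are assumptions made for this example; the package may compute the area differently.

```r
# Hypothetical helper: area under the empirical CDF of |orig - pert| on [0, 1].
cdf_auc <- function(orig, pert, grid = seq(0, 1, by = 0.01)) {
  diff_entries <- as.vector(abs(orig - pert))
  cdf <- stats::ecdf(diff_entries)
  y <- cdf(grid)
  # trapezoidal rule over the grid
  sum(diff(grid) * (y[-length(y)] + y[-1]) / 2)
}

# toy original connectivity: samples 1 and 2 together, sample 3 alone
orig <- matrix(c(1, 1, 0,
                 1, 1, 0,
                 0, 0, 1), nrow = 3, byrow = TRUE)

# perturbation never changes the partition
stable <- orig

# every pair is clustered together in half of the perturbed runs
unstable <- matrix(0.5, 3, 3)
diag(unstable) <- 1

auc_stable   <- cdf_auc(orig, stable)
auc_unstable <- cdf_auc(orig, unstable)
```

With one such score per candidate k collected in a vector, the selected k would be the one at the position returned by `which.max`.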
It is well known that the k-means algorithm may converge to a local minimum depending on the initialization. To overcome this, the k-means algorithm is run multiple times (controlled by the kmIter parameter) with randomly chosen seeds, and the partitioning with the smallest residual sum of squares (RSS) is returned.
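The restart strategy can be sketched in base R. The toy data and the name `km_iter` (standing in for kmIter) are assumptions for illustration, and `tot.withinss` from `stats::kmeans` is used as the RSS.

```r
set.seed(7)
# toy data: 90 samples, 2 features, three groups
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2),
           matrix(rnorm(60, mean = 8), ncol = 2))

km_iter <- 20   # stands in for kmIter

# Manual restarts: run k-means km_iter times from random initial centers
# and keep the run with the smallest total within-cluster sum of squares.
runs <- lapply(seq_len(km_iter), function(i) stats::kmeans(x, centers = 3))
rss  <- vapply(runs, function(r) r$tot.withinss, numeric(1))
best <- runs[[which.min(rss)]]

# Base R exposes the same idea through the nstart argument.
best_nstart <- stats::kmeans(x, centers = 3, nstart = km_iter)
```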
PerturbationClustering returns a list with at least the following components:
k | The optimal number of clusters
groups | A vector of labels indicating the cluster to which each sample is allocated
origS | A list of original connectivity matrices
pertS | A list of perturbed connectivity matrices
Tin Nguyen and Sorin Draghici
Tin Nguyen, Rebecca Tagett, Diana Diaz, and Sorin Draghici. A novel method for data integration and disease subtyping. Genome Research, 27(12):2025-2039, 2017.