progenyClust: Progeny Clustering

Description

Select the optimal number of clusters using Progeny Clustering.

Usage

progenyClust(data, FUNclust = kmeans, method = "gap", score.invert = F, ncluster = 2:10, 
size = 10, iteration = 100, repeats = 1, nrandom = 10, ...)

## S3 method for class 'progenyClust'
summary(object,...)

Arguments

data

data matrix or data frame for clustering: each row corresponds to a sample or observation, whereas each column corresponds to a feature or variable.

FUNclust

clustering function: accepts data as its first argument and the number of clusters as its second argument; returns a list containing a component called 'cluster', which is a vector of integers recording the cluster assignment for all samples. The default function is kmeans.
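As an illustration of that interface, the sketch below wraps base R hierarchical clustering so that it takes the data first, the number of clusters second, and returns a list with a 'cluster' component. The name hclustFUN and the toy data are hypothetical and not part of the package.

```r
# Hypothetical wrapper (not part of progenyClust): hierarchical clustering
# exposed through the interface that FUNclust is described as expecting.
hclustFUN <- function(x, k, ...) {
  tree <- hclust(dist(x), ...)          # extra arguments are passed to hclust()
  list(cluster = cutree(tree, k = k))   # integer vector, one label per row
}

# Quick check on a toy dataset with two well-separated groups of 10 samples
set.seed(1)
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 10), ncol = 2))
res <- hclustFUN(toy, 2)
table(res$cluster)
```

Under these assumptions, a wrapper of this shape could presumably be supplied as FUNclust = hclustFUN in place of the default kmeans.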

method

character string indicating the criterion used to pick the optimal cluster number.
'gap': the default value, selecting the cluster number whose score has the biggest (or, when score.invert=TRUE, the smallest) gap from the scores of its neighboring numbers. The optimal cluster number is picked based on the input data only and is not compared against any random datasets, so it is quick to compute. Note that this method does not evaluate the minimum and maximum cluster numbers.
'score': selects the cluster number that has the highest or lowest (when score.invert=TRUE) score when compared against scores generated from random datasets. Because progeny clustering is repeated on each random dataset, this method is slower to compute.
'both': uses and outputs results from both the 'gap' and 'score' criteria.
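To make the 'gap' criterion concrete, here is a minimal sketch under one assumed definition of the gap: twice the score of a cluster number minus the scores of its two neighbors. This formula and the score values are assumptions for illustration only; the exact computation used by the package is not stated on this page. The sketch also shows why the smallest and largest candidates cannot be evaluated: each has only one neighbor.

```r
# Hypothetical stability scores for candidate cluster numbers 2..6
ncluster <- 2:6
score <- c(3.1, 9.4, 4.0, 3.2, 2.9)

# Assumed gap definition: how far each interior score stands above both neighbors.
# The first and last candidates are dropped because they lack a neighbor on one side.
n <- length(score)
interior <- 2:(n - 1)
gap <- 2 * score[interior] - score[interior - 1] - score[interior + 1]
names(gap) <- ncluster[interior]

gap
best <- ncluster[interior][which.max(gap)]   # use which.min(gap) for inverted scores
best
```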

score.invert

logical flag: specifies whether the score should be inverted. The default score is the ratio of true classification probabilities over false classification probabilities. The inverted score is the ratio of false classification probabilities over true classification probabilities, which can prevent the algorithm from generating infinite score values in cases of perfect clustering. When score.invert=TRUE, the optimal cluster number is picked based on the lowest score.
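The effect of inverting the ratio can be seen with a small arithmetic sketch. The two probability values below are hypothetical and stand in for a perfect clustering, where the false classification probability is zero.

```r
# Hypothetical probabilities for a perfectly stable clustering
p_true  <- 0.8   # illustrative true classification probability
p_false <- 0     # no false classifications at all

score          <- p_true / p_false   # division by zero gives Inf in R
score_inverted <- p_false / p_true   # stays finite; lower is better

is.infinite(score)
score_inverted
```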

ncluster

sequence of integers specifying candidate cluster numbers for evaluation: ncluster must be a sequence of consecutive integers if the method 'gap' is chosen.

size

integer specifying the number of progenies generated from each cluster. Default value is 10.

iteration

integer specifying the number of times the algorithm samples progenies and evaluates similarity among progenies. Default value is 100.

repeats

integer specifying the number of times the algorithm should be run: needs to be greater than 0. Values greater than 1 output standard deviations of the scores, which are plotted as error bars in print(...,errorbar=T,...) function. Default value is 1.

nrandom

integer specifying the number of random datasets used to generate reference scores when using method 'score'. Default value is 10.

object

the S3 object of class "progenyClust".

...

additional arguments for FUNclust in progenyClust(...).

Value

progenyClust returns an object of class "progenyClust", which has plot and summary methods. It is a list with the following components:

cluster

matrix of clustering memberships for all samples under given cluster numbers: each row corresponds to a sample; each column corresponds to a given cluster number.

score

matrix of stability scores from clustering the input data under given cluster numbers: each column corresponds to a given cluster number; each row corresponds to a repeat, the number of which is defined by 'repeats' in the input argument.

random.score

matrix of stability scores from clustering random datasets under given cluster numbers: each column corresponds to a given cluster number; each row corresponds to a random dataset, the number of which is defined by 'nrandom' in the input argument.

mean.gap

vector of mean stability scores based on the 'gap' criterion when the input argument 'method' is set to be 'gap' or 'both'.

mean.score

vector of mean stability scores based on the 'score' criterion when the input argument 'method' is set to be 'score' or 'both'.

sd.gap

vector of standard deviations of stability scores for each given cluster number based on the 'gap' criterion, when the input argument 'method' is set to be 'gap' or 'both'.

sd.score

vector of standard deviations of stability scores for each given cluster number based on the 'score' criterion, when the input argument 'method' is set to be 'score' or 'both'.

call

the call with arguments specified.

ncluster

the specified value of input argument 'ncluster'.

method

the specified value of input argument 'method'.

score.invert

the specified value of input argument 'score.invert'.

Author(s)

C.W. Hu, Rice University

References

Hu, C.W., et al. "Progeny Clustering: A Method to Identify Biological Phenotypes." Scientific reports 5 (2015).
http://www.nature.com/articles/srep12894

Examples

# a 3-cluster 2-dimensional example dataset
data('test')

# default progeny clustering
pc <- progenyClust(test, ncluster = 2:5)

summary(pc)
plot(pc)

progenyClust documentation built on May 2, 2019, 6:40 a.m.