condcomp: Comparison of data conditions in a clustering


View source: R/condcomp.R

Description

Performs a condition comparison on a given clustering. The comparison is performed separately on each cluster, between the conditions given by cond. Several statistics are computed and, when analysed together, they may give some insight into the heterogeneity of some of the clusters.

Usage

condcomp(clustering, cond, dmatrix, n = 1000, remove.na = TRUE)

Arguments

clustering

A clustering of the data.

cond

A factor indicating the condition to which each data point is subject.

dmatrix

A distance matrix describing the data to be analysed.

n

The number of random silhouettes to be computed. Keep in mind that computing the random silhouettes is the bottleneck of this process.

remove.na

Logical. Remove rows with NA (i.e. clusters for which the silhouette could not be computed).

Details

For a given cluster, several metrics are computed; see the 'Value' section for details about each metric. Some metrics make use of Random Silhouettes, which are defined as follows: given a labeled data set, assign a random label (from the set of labels) to each data point without changing the original ratio of groups. Then compute the silhouette index for this data using the randomly assigned labels. The average silhouette width is the Random Silhouette for the data (with randomly assigned labels). Since this is a stochastic process, a Monte Carlo approach is applied, which gives a vector of several Random Silhouettes.
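The definition above can be sketched in a few lines of R. This is only an illustration of the idea, not the package's actual implementation: the helper name random_silhouette is hypothetical, and the use of silhouette() from the cluster package is an assumption.

```r
library(cluster)  # assumed here for silhouette(); not confirmed as what condcomp uses

# Hypothetical helper: computes one Random Silhouette for labelled data.
random_silhouette <- function(labels, dmatrix) {
  # sample() without replacement permutes the labels, so the
  # original ratio of groups is preserved.
  shuffled <- sample(labels)
  sil <- silhouette(as.integer(factor(shuffled)), dmatrix = dmatrix)
  # The average silhouette width is the Random Silhouette.
  mean(sil[, "sil_width"])
}

set.seed(42)
dmatrix <- as.matrix(dist(iris[-length(iris)]))
labels <- iris$Species

# Monte Carlo step: repeat the random labelling to get a vector of
# Random Silhouettes (100 repetitions chosen arbitrarily for this sketch).
rand_sils <- replicate(100, random_silhouette(labels, dmatrix))
```

Permuting the labels, rather than drawing them independently, is what keeps the group sizes identical to the original labelling.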

Value

A data frame with various statistics regarding data heterogeneity inside each cluster.

Each row of the data frame contains several metrics regarding the conditions found in a specific cluster. For now, a maximum of two conditions is supported. These metrics are described below:

x_perc

Numeric. The percentage of data points belonging to condition 'x'.

x_ratio

Numeric. The ratio of data points belonging to condition 'x' relative to the other condition. For example, given another condition 'y', 'x_ratio' would be computed as x_perc / y_perc.

true_sil

Numeric. True silhouette. The silhouette for the data in this cluster considering the conditions, as defined by the parameter cond, as groups.

zscore

Numeric. The Z-score computed based on the silhouette. See the 'Details' section.

pval

Numeric. The p-value for 'true_sil'. Computed from the number of Random Silhouettes (see 'Details') that are greater than the 'true_sil' for this cluster.

iqr

Factor. Interquartile Range (IQR) based outlier detection. Considering the vector including the random silhouettes (see 'Details') and the 'true_sil', the method checks whether 'true_sil' is an outlier in said vector. This will be set to 'Diff' in case 'true_sil' is an outlier or 'Same' otherwise.
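As a rough illustration of how the metrics above relate to each other, the following sketch derives them for a single cluster from made-up inputs. It is not the package's code: the input values are stand-ins rather than condcomp output, and the exact formulas (percentages on a 0 to 100 scale, a z-score against the mean and standard deviation of the Random Silhouettes, and the common 1.5 * IQR fences) are assumptions.

```r
# Illustrative inputs only (not produced by condcomp).
cond_in_cluster <- factor(c(rep("young", 30), rep("old", 10)))
set.seed(1)
rand_sils <- rnorm(1000, mean = 0, sd = 0.01)  # stand-in for Random Silhouettes
true_sil <- 0.05                                # stand-in for the true silhouette

# x_perc and x_ratio for condition 'young' (0-100 scale assumed).
perc <- 100 * prop.table(table(cond_in_cluster))
young_perc <- perc[["young"]]
young_ratio <- perc[["young"]] / perc[["old"]]

# zscore: distance of true_sil from the Random Silhouettes (assumed formula).
zscore <- (true_sil - mean(rand_sils)) / sd(rand_sils)

# pval: share of Random Silhouettes greater than true_sil.
pval <- mean(rand_sils > true_sil)

# iqr: is true_sil an outlier in the combined vector? (1.5 * IQR fences assumed)
all_sils <- c(rand_sils, true_sil)
q <- quantile(all_sils, c(0.25, 0.75))
fence <- 1.5 * IQR(all_sils)
is_outlier <- true_sil < q[[1]] - fence || true_sil > q[[2]] + fence
iqr <- factor(if (is_outlier) "Diff" else "Same", levels = c("Same", "Diff"))
```

Read together, a high 'zscore', a low 'pval' and an 'iqr' of 'Diff' all point the same way: the conditions separate the data in that cluster more than random labels do.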

Examples

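library(condcomp)
# Use the iris species as the clustering and Euclidean distances between the measurements
clustering <- iris$Species
dmatrix <- as.matrix(dist(iris[-length(iris)]))
# Suppose the conditions are 'young' and 'old' fish
set.seed(1)  # make the random condition assignment reproducible
cond <- sample(c("young", "old"), nrow(iris), replace = TRUE)
# Keep n small here, since computing the random silhouettes is the bottleneck
comp <- condcomp(clustering, cond, dmatrix = dmatrix, n = 10)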

CostaLab/condcomp documentation built on May 25, 2019, 7:16 a.m.