03-splits: Bimodal Splitting of Data Sets

splitters    R Documentation

Bimodal Splitting of Data Sets

Description

Functions to split a data set into two parts.

Usage

registerSplitter(tag, description, FUN)
availableSplitters()
findSplit(data, metric = "euclidean", splitter = names(splitters),
  LR = c("L", "R"))
evalSplit(split, data, metric, tool = c("sw", "ssq"), N = 100)

Arguments

tag

A character string representing a short abbreviation used to identify and eventually call the associated "splitter" function that can dichotomize the data.

description

A character string containing the full name of the splitting algorithm. Potentially, this can be used in data summaries or plots.

FUN

A "splitter"; that is, a function that takes a distance matrix as input and returns a numeric vector that assigns samples in the data set to one of two classes labeled "1" or "2".

data

The data set to be split into two parts.

metric

A distance metric. This metric should be one of the values supported by the distanceMatrix function from the ClassDiscovery package. All metrics computed by the dist function are supported.

splitter

A tag that uniquely identifies one of the splitting functions that have been registered. Built-in registered "splitters", with their tags, include: (hc) agglomerative hierarchical clustering; (km) k-means clustering; (dv) divisive hierarchical clustering; and (ap) affinity propagation clustering.

LR

A character vector of length two denoting the labels to be used for the two classes after splitting the data.

split

A clustering factor, typically the output of a call to the findSplit function.

tool

The tool/splitter to be used to evaluate the quality of the split. Built-in tools include the mean silhouette width (sw) or the within-group sum of square distances (ssq).

N

Number of random splits used to estimate the empirical p-value.
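
As an illustration of the tag, description, and FUN arguments described above, here is a minimal sketch of registering a custom splitter. The "pm" tag, the pamSplitter name, and the use of cluster::pam are illustrative assumptions, not part of the package.

library(cluster)
pamSplitter <- function(dmat) {
  # partition the samples into two groups around medoids; pam returns
  # cluster codes 1 and 2 in its 'clustering' component
  pam(as.dist(dmat), k = 2)$clustering
}
registerSplitter("pm", "Partitioning Around Medoids", pamSplitter)
availableSplitters()   # the new "pm" tag should now appear in the list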

Details

The core idea of this package is to construct decision trees that simultaneously learn how to cluster the items in a data set while also learning a classifier that can predict the cluster assignments of new data. We support four built-in clustering algorithms to learn how to split a single data set into two parts (a brief comparison sketch follows the list):

hc

agglomerative hierarchical clustering as implemented in the hclust function, followed by cutree.

km

k-means clustering with k = 2.

dv

divisive hierarchical clustering as implemented in the diana function.

ap

affinity propagation clustering, as implemented in the apcluster function.
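
To illustrate how the four tags are used, the following sketch applies each built-in splitter to the same small simulated matrix and tabulates the resulting class sizes. The matrix X is invented for illustration, and the "dv" and "ap" splitters are assumed to require the cluster and apcluster packages, respectively.

set.seed(2025)
X <- cbind(matrix(rnorm(50 * 10, mean = -1), nrow = 50),   # 10 samples centered at -1
           matrix(rnorm(50 * 10, mean =  1), nrow = 50))   # 10 samples centered at +1
colnames(X) <- paste0("S", 1:20)
tags <- c("hc", "km", "dv", "ap")
splits <- lapply(tags, function(tag) findSplit(X, splitter = tag))
names(splits) <- tags
sapply(splits, table)   # samples assigned to "L" and "R" by each splitter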

Value

The registerSplitter function invisibly returns the tag assigned to the new splitter.

The findSplit function returns a two-level factor with length equal to the number of columns (samples) in the data set.

The evalSplit function returns a list of length two, containing a statistic (either the mean silhouette width or the within-group sum of square distances, depending on the tool being used) along with an estimated empirical p-value; a conceptual sketch of this computation appears after the Examples.

Author(s)

Kevin R. Coombes <krc@silicovore.com>

Examples

nr <- 200 # features
nc <- 60  # samples
set.seed(80123)
comat <- matrix(rnorm(nr*nc, 0, 1), nrow = nr)
dimnames(comat) <- list(paste0("F", 1:nr),
                        paste0("S", 1:nc))
splay <- rep(c(2, -2), each = nc/2)        # offsets that separate the samples into two groups
for(J in 1:30) comat[J, ] <- comat[J, ] + splay   # shift the first 30 features
C.hc <- findSplit(comat, splitter = "hc")  # split using hierarchical clustering
evalSplit(C.hc, comat, "euclid", "sw")     # assess the split with the mean silhouette width
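
For intuition about the empirical p-value reported by evalSplit, the following conceptual sketch (not the package's own implementation) compares the observed mean silhouette width of C.hc against N = 100 random splits of the same sizes, using the objects created above.

library(cluster)
d <- dist(t(comat))                       # distances between the 60 samples
obs <- mean(silhouette(as.integer(C.hc), d)[, "sil_width"])
rand <- replicate(100, {
  perm <- sample(as.integer(C.hc))        # random relabeling with the same class sizes
  mean(silhouette(perm, d)[, "sil_width"])
})
mean(rand >= obs)                         # fraction of random splits at least as good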
