03-splits: Bimodal Splitting of Data Sets

splitters    R Documentation

Bimodal Splitting of Data Sets

Description

Functions to split a data set into two parts.

Usage

registerSplitter(tag, description, FUN)
availableSplitters()
findSplit(data, metric = "euclidean", splitter = names(splitters),
  LR = c("L", "R"))
evalSplit(split, data, metric, tool = c("sw", "ssq"), N = 100)

Arguments

tag

A character string representing a short abbreviation used to identify and eventually call the associated "splitter" function that can dichotomize the data.

description

A character string containing the full name of the splitting algorithm. Potentially, this can be used in data summaries or plots.

FUN

A "splitter"; that is, a function that takes a distance matrix as input and returns a numeric vector that assigns samples in the data set to one of two classes labeled "1" or "2".

data

The data set to be split into two parts.

metric

A distance metric. This metric should be one of the values supported by the distanceMatrix function from the ClassDiscovery package. All metrics computed by the dist function are supported.

splitter

A tag that uniquely identifies one of the splitting functions that have been registered. Built-in registered "splitters", with their tags, include: (hc) agglomerative hierarchical clustering; (km) k-means clustering; (dv) divisive hierarchical clustering; and (ap) affinity propagation clustering.

LR

A character vector of length two denoting the labels to be used for the two classes after splitting the data.

split

A clustering factor, typically the output of a call to the findSplit function.

tool

The tool/splitter to be used to evaluate the quality of the split. Built-in tools include the mean silhouette width (sw) or the within-group sum of square distances (ssq).

N

Number of random splits used to estimate the empirical p-value.
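
As an illustration of the tag, description, and FUN arguments described above, here is a minimal sketch of registering a custom splitter. The "pm" tag, the pamSplitter name, and the use of cluster::pam are illustrative assumptions, not part of the package.

library(cluster)
pamSplitter <- function(dmat) {
  # partition the samples into two groups around medoids; pam returns
  # cluster codes 1 and 2 in its 'clustering' component
  pam(as.dist(dmat), k = 2)$clustering
}
registerSplitter("pm", "Partitioning Around Medoids", pamSplitter)
availableSplitters()   # the new "pm" tag should now appear in the list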

Details

The core idea of this package is to construct decision trees that simultaneously learn how to cluster the items in a data set while also learning a classifier that can predict the cluster assignments of new data. We support four built-in clustering algorithms to learn how to split a single data set into two parts (a brief comparison sketch follows the list):

hc

agglomerative hierarchical clustering as implemented in the hclust function, followed by cutree.

km

k-means clustering with k = 2.

dv

divisive hierarchical clustering as implemented in the diana function.

ap

affinity propagation clustering, as implemented in the apcluster function.
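
To illustrate how the four tags are used, the following sketch applies each built-in splitter to the same small simulated matrix and tabulates the resulting class sizes. The matrix X is invented for illustration, and the "dv" and "ap" splitters are assumed to require the cluster and apcluster packages, respectively.

set.seed(2025)
X <- cbind(matrix(rnorm(50 * 10, mean = -1), nrow = 50),   # 10 samples centered at -1
           matrix(rnorm(50 * 10, mean =  1), nrow = 50))   # 10 samples centered at +1
colnames(X) <- paste0("S", 1:20)
tags <- c("hc", "km", "dv", "ap")
splits <- lapply(tags, function(tag) findSplit(X, splitter = tag))
names(splits) <- tags
sapply(splits, table)   # samples assigned to "L" and "R" by each splitter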

Value

The registerSplitter function invisibly returns the tag assigned to the new splitter.

The findSplit function returns a two-level factor with length equal to the number of columns (samples) in the data set.

The evalSplit function returns a list of length two, containing a statistic (either the mean silhouette width or the within-group sum of square distances, depending on the tool being used) along with an estimated empirical p-value; a conceptual sketch of this computation appears after the Examples.

Author(s)

Kevin R. Coombes <krc@silicovore.com>

Examples

nr <- 200 # features
nc <- 60  # samples
set.seed(80123)
comat <- matrix(rnorm(nr*nc, 0, 1), nrow = nr)
dimnames(comat) <- list(paste0("F", 1:nr),
                        paste0("S", 1:nc))
splay <- rep(c(2, -2), each = nc/2)        # offsets that separate the samples into two groups
for(J in 1:30) comat[J, ] <- comat[J, ] + splay   # shift the first 30 features
C.hc <- findSplit(comat, splitter = "hc")  # split using hierarchical clustering
evalSplit(C.hc, comat, "euclid", "sw")     # assess the split with the mean silhouette width
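
For intuition about the empirical p-value reported by evalSplit, the following conceptual sketch (not the package's own implementation) compares the observed mean silhouette width of C.hc against N = 100 random splits of the same sizes, using the objects created above.

library(cluster)
d <- dist(t(comat))                       # distances between the 60 samples
obs <- mean(silhouette(as.integer(C.hc), d)[, "sil_width"])
rand <- replicate(100, {
  perm <- sample(as.integer(C.hc))        # random relabeling with the same class sizes
  mean(silhouette(perm, d)[, "sil_width"])
})
mean(rand >= obs)                         # fraction of random splits at least as good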
