nTARP_bisecting: Run nTARP repeatedly in a bisecting fashion

View source: R/nTARP_bisecting.R

nTARP_bisecting    R Documentation

Description

Applies 'nTARP' repeatedly to bisect a dataset, stopping once the minimum cluster size threshold is reached.

Usage

nTARP_bisecting(
  data,
  number_of_projections,
  withinss_threshold,
  ids = NULL,
  minimum_cluster_size_percent = 20,
  contextual_variable = NULL
)

Arguments

data

Numeric matrix — dataset to be clustered using 'nTARP'

number_of_projections

Numeric — number of random projections for 'nTARP' to try for each run

withinss_threshold

Numeric — upper bound on a solution's normalized within-cluster sum of squares for it to count as a "quality cluster" (typically 0.36)

ids

Numeric or character vector — identifying labels for individuals in the clusters

minimum_cluster_size_percent

Numeric — minimum allowable cluster size, expressed as a percentage (default 20)

contextual_variable

Vector of integers or characters — variable to use as the basis for comparing clusters. The default is 'NULL', in which case splits are selected by option (1) in Details (the within-cluster compactness criterion); supplying a variable selects option (2).
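
To make the roles of 'number_of_projections' and 'withinss_threshold' concrete, here is a minimal sketch of a single random-projection attempt, written in base R only. The helper 'normalized_wss_once' is hypothetical and not part of the package, and the normalization used (total within-cluster sum of squares divided by total sum of squares of the projected data) is an assumption; the package may normalize differently.

```r
# Hypothetical helper, NOT part of the nTARP package.
# Assumption: "normalized WSS" = total within-cluster SS / total SS
# of the one-dimensional projected data.
normalized_wss_once <- function(x) {
  x <- as.matrix(x)
  direction <- rnorm(ncol(x))                 # random projection vector
  projected <- as.numeric(x %*% direction)    # project each row onto one dimension
  fit <- kmeans(projected, centers = 2, nstart = 5)  # two-cluster split of the projection
  list(cluster = fit$cluster, nwss = fit$tot.withinss / fit$totss)
}

set.seed(1)
# Two well-separated synthetic groups of 20 points in 3 dimensions
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 3),
           matrix(rnorm(60, mean = 5), ncol = 3))

# Try 10 projections (the role of number_of_projections)
attempts <- lapply(seq_len(10), function(i) normalized_wss_once(x))
nwss <- vapply(attempts, function(a) a$nwss, numeric(1))

passing <- which(nwss < 0.36)   # attempts below the threshold (the role of withinss_threshold)
best <- which.min(nwss)         # most compact attempt
```

Under these assumptions, trying more projections raises the chance that at least one of them falls below the threshold.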

Details

This function supports two strategies for selecting the optimal split at each step:

(1) Within-Cluster Compactness Criterion: The optimal solution is selected based on the normalized within-cluster sum of squares (WSS). The split that minimizes normalized WSS is retained.

(2) Contextual Purity Criterion: The optimal solution is selected using a contextual variable. Inspired by decision tree learning, the algorithm evaluates candidate splits based on improvements in class purity (i.e., Gini reduction) with respect to the contextual variable. The split that maximizes purity gain is retained.
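
The purity criterion in option (2) can be illustrated with a short base R sketch of Gini reduction as it is commonly defined in decision tree learning. The functions 'gini_impurity' and 'gini_gain' are hypothetical names introduced for illustration; the package's internal computation may differ in detail.

```r
# Hypothetical helpers, NOT part of the nTARP package.

# Gini impurity of a set of class labels: 1 - sum of squared class proportions
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Gini reduction: parent impurity minus the size-weighted impurity of the children
gini_gain <- function(labels, assignment) {
  children <- split(labels, assignment)
  weights <- lengths(children) / length(labels)
  child_impurity <- vapply(children, gini_impurity, numeric(1))
  gini_impurity(labels) - sum(weights * child_impurity)
}

# Contextual variable from the Examples section
latent_group <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2,
                  3, 3, 3, 3, 3, 3, 1, 1, 2, 3)

# Two candidate bisections of the same 20 points
split_a <- ifelse(latent_group == 1, 1, 2)  # isolates group 1 in its own cluster
split_b <- rep(c(1, 2), times = 10)         # alternating assignment, unrelated to the groups

gain_a <- gini_gain(latent_group, split_a)
gain_b <- gini_gain(latent_group, split_b)
```

Here 'gain_a' is larger than 'gain_b', so a purity-based selection would retain the first candidate split.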

The process repeats, each time bisecting the largest eligible cluster, and stops once no further split yields clusters that meet the user-defined minimum size threshold ('minimum_cluster_size_percent').
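
The bisecting loop described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: plain 'kmeans' stands in for 'nTARP', the minimum size is assumed to be a percentage of the full dataset, and 'bisect_sketch' is a hypothetical name.

```r
# Hypothetical sketch, NOT the package's implementation.
# Assumptions: kmeans() replaces nTARP as the splitting step, and the
# minimum cluster size is a percentage of the full dataset.
bisect_sketch <- function(x, minimum_cluster_size_percent = 20) {
  x <- as.matrix(x)
  min_size <- max(1, ceiling(nrow(x) * minimum_cluster_size_percent / 100))
  labels <- rep(1L, nrow(x))   # start with everything in one cluster
  eligible <- TRUE             # eligible[k]: may cluster k still be split?

  while (any(eligible)) {
    # Pick the largest cluster that is still eligible
    sizes <- tabulate(labels, nbins = length(eligible))
    candidates <- which(eligible)
    target <- candidates[which.max(sizes[candidates])]
    members <- which(labels == target)

    # Too small to yield two children of at least min_size
    if (length(members) < 2 * min_size) {
      eligible[target] <- FALSE
      next
    }

    fit <- kmeans(x[members, , drop = FALSE], centers = 2, nstart = 5)

    # Reject the split if either child falls below the minimum size
    if (min(fit$size) < min_size) {
      eligible[target] <- FALSE
      next
    }

    # Accept the split: the second child gets a new label
    new_label <- length(eligible) + 1L
    labels[members[fit$cluster == 2]] <- new_label
    eligible[new_label] <- TRUE
  }
  labels
}

set.seed(1)
x <- rbind(matrix(rnorm(30, mean = 0), ncol = 3),
           matrix(rnorm(30, mean = 5), ncol = 3))
labels <- bisect_sketch(x, minimum_cluster_size_percent = 20)
table(labels)
```

Because a split is accepted only when both children reach the minimum size, every cluster in the final labelling satisfies the threshold by construction.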

Value

A list containing:

(1) Complete solutions (i.e., outputs from the 'nTARP' function).

(2) Clusters with the best gains, identified using the 'pull_best_solution_and_gain' function.

(3) Within-cluster sum of squares for each solution.

(4) Gains for each solution (if a contextual variable is used).

Examples

# 20-point example dataset
data <- data.frame(
  X1 = c(0.5, -0.2, 0.1, 0.3, -0.1, 0.2, 5.2, 4.8, 5.1, 5.0,
         -4.5, -5.2, -4.8, -5.1, -4.9, -5.3, 0.0, 0.2, 5.3, -5.0),
  X2 = c(0.3, -0.1, 0.2, 0.1, 0.0, 0.2, 5.0, 4.9, 5.3, 5.1,
         5.0, 5.2, 4.7, 4.9, 5.1, 4.8, -0.2, 0.0, 5.2, -4.9),
  X3 = c(0.4, 0.0, 0.1, -0.1, 0.2, 0.0, 5.1, 4.7, 5.2, 5.0,
         -5.0, -4.8, -5.3, -5.1, -4.9, -5.2, 0.1, 0.3, 5.0, -5.1)
)

# Run nTARP without contextual variable
result1 <- nTARP_bisecting(
  data = data,
  number_of_projections = 10,
  withinss_threshold = 0.36
)
str(result1)

# Add a latent group as contextual variable
latent_group <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2,
                  3, 3, 3, 3, 3, 3, 1, 1, 2, 3)

# Run nTARP with contextual variable
result2 <- nTARP_bisecting(
  data = data,
  number_of_projections = 10,
  withinss_threshold = 0.36,
  contextual_variable = latent_group
)
str(result2)



nTARP documentation built on March 20, 2026, 5:09 p.m.