createDisMatrix: Create a Dissimilarity Matrix from an Ensemble Model

View source: R/createDisMatrix.R

createDisMatrix    R Documentation

Create a Dissimilarity Matrix from an Ensemble Model

Description

The function createDisMatrix computes a dissimilarity matrix among observations from a fitted tree ensemble. This optimized version targets large datasets (50K-500K observations), with improved memory management and optional chunking.

Usage

createDisMatrix(
  ensemble,
  data,
  label,
  parallel = list(active = FALSE, no_cores = 1),
  verbose = FALSE,
  chunk_size = NULL,
  memory_limit = NULL,
  use_disk = FALSE,
  temp_dir = tempdir(),
  batch_aggregate = 10
)

Arguments

ensemble

An ensemble tree object of one of the supported types listed in Details.

data

A data frame containing the variables in the model. This is the data frame that was used to train the ensemble.

label

Character. The name of the response variable (for example, "Species").

parallel

A list with two elements: active (logical) and no_cores (integer). If active = TRUE, the function performs parallel computation using the number of cores specified in no_cores. If no_cores is NULL or equal to 0, it defaults to using all available cores minus one. If active = FALSE, the function runs on a single core. Default: list(active = FALSE, no_cores = 1).
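
The core-selection rule described above can be sketched as follows. The helper name resolve_cores is hypothetical and not part of e2tree; it only mirrors the documented defaults and may differ from the internal implementation.

```r
# Hypothetical helper (not part of e2tree): turns a `parallel` list
# into a worker count following the rule documented above.
resolve_cores <- function(parallel = list(active = FALSE, no_cores = 1)) {
  # Single core whenever parallel computation is switched off
  if (!isTRUE(parallel$active)) return(1L)

  n <- parallel$no_cores
  if (is.null(n) || n == 0) {
    # All available cores minus one; detectCores() may return NA
    avail <- parallel::detectCores()
    if (is.na(avail)) avail <- 2L
    n <- max(1L, avail - 1L)
  }
  as.integer(n)
}

resolve_cores(list(active = FALSE, no_cores = 8))    # single core
resolve_cores(list(active = TRUE, no_cores = 2))     # two cores
resolve_cores(list(active = TRUE, no_cores = NULL))  # all cores minus one
```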

verbose

Logical. If TRUE, the function prints progress messages and other information during execution. If FALSE (the default), messages are suppressed.

chunk_size

Integer. Number of rows to process in each chunk. If NULL, automatically determined based on available memory and dataset size. Default: NULL (auto).

memory_limit

Numeric. Maximum memory to use in GB. Default: NULL (no limit).
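
To see why chunk_size and memory_limit matter, it may help to estimate the size of a dense n x n matrix of doubles. The helpers below are hypothetical and only do the arithmetic; how createDisMatrix actually derives an automatic chunk size is not documented here, so the chunk formula is an assumption.

```r
# Hypothetical helpers (not part of e2tree).

# Size in GB (10^9 bytes) of a dense n x n matrix of doubles (8 bytes each)
dense_matrix_gb <- function(n) n^2 * 8 / 1e9

# Assumed rule: largest number of n-column rows that fits in `memory_limit` GB
rows_per_chunk <- function(n, memory_limit) {
  bytes_per_row <- n * 8
  max(1, min(n, floor(memory_limit * 1e9 / bytes_per_row)))
}

dense_matrix_gb(50000)    # 20 GB for 50K observations
dense_matrix_gb(500000)   # 2000 GB for 500K observations
rows_per_chunk(50000, 4)  # 10000 rows of a 50K-column block fit in 4 GB
```

At the upper end of the stated range a dense matrix does not fit in typical memory, which is consistent with the emphasis on sparsity, chunking, and disk-based computation in Details.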

use_disk

Logical. If TRUE and the dataset is very large, intermediate results are saved to disk. Default: FALSE.

temp_dir

Character. Directory for temporary files if use_disk = TRUE. Default: tempdir().

batch_aggregate

Integer. Number of tree results to aggregate at once before adding to main matrix (reduces memory peaks). Default: 10.

Details

This optimized version implements several strategies for handling large datasets:

  • Memory-efficient aggregation: Results from parallel trees are aggregated in batches to avoid memory peaks

  • Chunking: For very large matrices, computation can be split into manageable chunks

  • Sparse matrix optimization: Maintains sparsity throughout computation

  • Automatic garbage collection: Explicit memory cleanup at critical points

  • Disk-based computation: Optional saving of intermediate results for datasets exceeding memory capacity
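
The batch aggregation idea from the list above can be illustrated with a toy sketch. It uses a simplified, unweighted leaf co-occurrence on made-up terminal-node ids; the measure actually computed by createDisMatrix may be weighted or otherwise different, and all names below are hypothetical.

```r
# Toy setup: hypothetical terminal-node ids for 6 observations in 5 trees
set.seed(1)
n <- 6
n_trees <- 5
leaf_ids <- matrix(sample(1:3, n * n_trees, replace = TRUE), nrow = n)

# Co-occurrence for one tree: 1 if two observations share a leaf, else 0
co_occurrence <- function(ids) outer(ids, ids, "==") * 1

# Sum per-tree matrices in batches, so that besides the running total
# only one batch sum is held in memory at a time
aggregate_in_batches <- function(leaf_ids, batch_aggregate = 2) {
  n <- nrow(leaf_ids)
  total <- matrix(0, n, n)
  tree_idx <- seq_len(ncol(leaf_ids))
  batches <- split(tree_idx, ceiling(tree_idx / batch_aggregate))
  for (idx in batches) {
    batch_sum <- matrix(0, n, n)
    for (k in idx) batch_sum <- batch_sum + co_occurrence(leaf_ids[, k])
    total <- total + batch_sum
    rm(batch_sum)  # release the batch before starting the next one
  }
  total
}

# Share of trees in which each pair co-occurs, turned into a dissimilarity
S <- aggregate_in_batches(leaf_ids, batch_aggregate = 2) / n_trees
D_toy <- 1 - S
```

The batched sum equals the plain sum over all trees; batching only changes how many per-tree results are alive at once.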

Supported ensemble types for classification or regression tasks:

  • randomForest

  • ranger

  • xgb.Booster (xgboost)

  • lgb.Booster (lightgbm)

  • gbm (gbm)

  • catboost.CatBoost (catboost)
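
A quick way to check whether an object belongs to one of the classes listed above is inherits(). The helper below is a hypothetical convenience, not an e2tree function, and the mock objects are empty stand-ins used only to show the class test.

```r
# Class names taken from the list of supported ensemble types
supported_classes <- c("randomForest", "ranger", "xgb.Booster",
                       "lgb.Booster", "gbm", "catboost.CatBoost")

# Hypothetical helper: TRUE if the object has any supported class
is_supported_ensemble <- function(ensemble) {
  inherits(ensemble, supported_classes)
}

# Empty mock objects that only carry a class attribute
mock_ranger <- structure(list(), class = "ranger")
mock_lm <- structure(list(), class = "lm")

is_supported_ensemble(mock_ranger)  # TRUE
is_supported_ensemble(mock_lm)      # FALSE
```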

Value

A dissimilarity matrix measuring the discordance between each pair of observations with respect to the given ensemble model.

Interpretation note (RF vs boosting)

For bagging ensembles (randomForest, ranger) the trees are grown independently on bootstrap samples; co-occurrence in the same leaf captures local similarity in the predictor space. For boosting ensembles (xgb.Booster, lgb.Booster, gbm, catboost.CatBoost) each tree is fit to the residual of the previous ones, so leaf co-occurrence reflects similarity in the error-correction trajectory rather than in the final prediction space. The resulting dissimilarity matrices therefore have systematically different scales: the mean dissimilarity typically lies in [0.85, 0.95] for bagging vs. [0.35, 0.70] for boosting. The surrogate tree built on top of D should be interpreted accordingly.
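
To compare a given matrix against the ranges quoted above, one can compute its mean dissimilarity. The sketch below averages the entries above the diagonal of a small made-up matrix; whether the quoted ranges exclude the zero diagonal is an assumption, and mean_dissimilarity is a hypothetical helper.

```r
# Hypothetical helper: mean over distinct pairs (entries above the diagonal)
mean_dissimilarity <- function(D) {
  D <- as.matrix(D)
  mean(D[upper.tri(D)])
}

# Made-up 3 x 3 dissimilarity matrix, for illustration only
D_toy <- matrix(c(0.0, 0.9, 0.8,
                  0.9, 0.0, 0.7,
                  0.8, 0.7, 0.0), nrow = 3, byrow = TRUE)

mean_dissimilarity(D_toy)  # average of 0.9, 0.8 and 0.7
```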

The returned matrix carries an ensemble_backend attribute identifying the backend used, which downstream functions check to detect mismatched (D, ensemble) pairs.
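
The attribute can be read with attr(). In the sketch below the toy matrix, the attribute value "ranger", and the check_backend helper are all assumptions made for illustration; the values createDisMatrix actually stores and the checks that downstream functions perform may differ.

```r
# Toy matrix tagged by hand; the value "ranger" is an assumed example
D_toy <- structure(diag(0, 3), ensemble_backend = "ranger")

attr(D_toy, "ensemble_backend")

# Hypothetical check: does the stored backend match the ensemble's class?
check_backend <- function(D, ensemble) {
  backend <- attr(D, "ensemble_backend")
  !is.null(backend) && inherits(ensemble, backend)
}

mock_ranger <- structure(list(), class = "ranger")  # empty stand-in object
mock_gbm <- structure(list(), class = "gbm")

check_backend(D_toy, mock_ranger)  # TRUE: backend and class agree
check_backend(D_toy, mock_gbm)     # FALSE: mismatched (D, ensemble) pair
```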

Examples


data("iris")

# Create training and validation set (seed fixed for reproducibility):
set.seed(42)
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
training <- iris[train_ind, ]
validation <- iris[-train_ind, ]
response_training <- training[, 5]
response_validation <- validation[, 5]

# Perform training:
## "randomForest" package
ensemble <- randomForest::randomForest(Species ~ ., data = training,
  importance = TRUE, proximity = TRUE)

## "ranger" package (if installed, this replaces the randomForest fit above)
if (requireNamespace("ranger", quietly = TRUE)) {
  # Train on `training` so the ensemble matches the data passed to createDisMatrix
  ensemble <- ranger::ranger(Species ~ ., data = training,
    num.trees = 1000, importance = 'impurity')
}

# Compute dissimilarity matrix with optimizations
D <- createDisMatrix(
  ensemble,
  data = training,
  label = "Species",
  parallel = list(active = FALSE, no_cores = 1),
  chunk_size = 10000,  # Process 10K rows at a time
  batch_aggregate = 20, # Aggregate 20 trees at once
  verbose = TRUE
)



e2tree documentation built on May 15, 2026, 5:06 p.m.