createDisMatrix: Create a Dissimilarity Matrix from an Ensemble Model
In e2tree: Explainable Ensemble Trees

createDisMatrix

R Documentation

Description

createDisMatrix computes a pairwise dissimilarity matrix among observations from a fitted tree ensemble. This optimized version targets large datasets (50K-500K observations) and adds improved memory management and chunked computation.

Usage

createDisMatrix(
  ensemble,
  data,
  label,
  parallel = list(active = FALSE, no_cores = 1),
  verbose = FALSE,
  chunk_size = NULL,
  memory_limit = NULL,
  use_disk = FALSE,
  temp_dir = tempdir(),
  batch_aggregate = 10
)

Arguments

`ensemble`	An ensemble tree object.
`data`	A data frame containing the variables in the model. It must be the data frame used to train the ensemble.
`label`	Character. Name of the response variable in `data`.
`parallel`	A list with two elements: `active` (logical) and `no_cores` (integer). If `active = TRUE`, the function performs parallel computation using the number of cores specified in `no_cores`. If `no_cores` is NULL or equal to 0, it defaults to using all available cores minus one. If `active = FALSE`, the function runs on a single core. Default: `list(active = FALSE, no_cores = 1)`.
`verbose`	Logical. If TRUE, the function prints progress messages and other information during execution. If FALSE (the default), messages are suppressed.
`chunk_size`	Integer. Number of rows to process in each chunk. If NULL, automatically determined based on available memory and dataset size. Default: NULL (auto).
`memory_limit`	Numeric. Maximum memory to use in GB. Default: NULL (no limit).
`use_disk`	Logical. If TRUE and the dataset is very large, intermediate results are saved to disk. Default: FALSE.
`temp_dir`	Character. Directory for temporary files if use_disk = TRUE. Default: tempdir().
`batch_aggregate`	Integer. Number of tree results to aggregate at once before adding to main matrix (reduces memory peaks). Default: 10.
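
The defaulting rule described for `parallel` can be made explicit before the call. The following is a minimal sketch, assuming a hypothetical helper `make_parallel()` that is not part of e2tree; it only mirrors the documented behaviour (all available cores minus one when `no_cores` is NULL or 0).

```r
# Hypothetical helper (not part of e2tree): build the `parallel` argument.
make_parallel <- function(active = FALSE, no_cores = NULL) {
  if (!active) {
    # Single-core execution, matching the documented default.
    return(list(active = FALSE, no_cores = 1L))
  }
  if (is.null(no_cores) || no_cores == 0) {
    # detectCores() can return NA on some platforms; fall back to 1 core.
    available <- parallel::detectCores()
    no_cores <- if (is.na(available)) 1L else max(1L, available - 1L)
  }
  list(active = TRUE, no_cores = as.integer(no_cores))
}

par_serial <- make_parallel()                           # single core
par_auto   <- make_parallel(active = TRUE)              # all cores minus one
par_two    <- make_parallel(active = TRUE, no_cores = 2)
```

Any of these lists could then be passed as `parallel = ...` to createDisMatrix.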

Details

This optimized version implements several strategies for handling large datasets:

Memory-efficient aggregation: Results from parallel trees are aggregated in batches to avoid memory peaks
Chunking: For very large matrices, computation can be split into manageable chunks
Sparse matrix optimization: Maintains sparsity throughout computation
Automatic garbage collection: Explicit memory cleanup at critical points
Disk-based computation: Optional saving of intermediate results for datasets exceeding memory capacity
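
The batch aggregation idea can be illustrated on toy data. This is a sketch under stated assumptions, not the package's internal code: the leaf-membership matrix `leaves`, the helpers `cooccur_one_tree()` and `aggregate_in_batches()`, and the unweighted co-occurrence measure are all hypothetical, and the measure computed by e2tree may differ (for example by weighting co-occurrences).

```r
# Toy leaf-membership matrix: rows = observations, columns = trees.
# Entry [i, t] is the (made-up) terminal-node id of observation i in tree t.
set.seed(1)
n_obs <- 6
n_trees <- 25
leaves <- matrix(sample(1:3, n_obs * n_trees, replace = TRUE),
                 nrow = n_obs, ncol = n_trees)

# Co-occurrence indicator for one tree: 1 if two observations share a leaf.
cooccur_one_tree <- function(leaf_ids) {
  outer(leaf_ids, leaf_ids, "==") * 1
}

# Hypothetical batched aggregation: sum `batch_size` trees into a small
# partial matrix, then add that partial result to the running total.
aggregate_in_batches <- function(leaves, batch_size = 10) {
  n <- nrow(leaves)
  total <- matrix(0, n, n)
  batches <- split(seq_len(ncol(leaves)),
                   ceiling(seq_len(ncol(leaves)) / batch_size))
  for (tree_idx in batches) {
    partial <- matrix(0, n, n)
    for (t in tree_idx) {
      partial <- partial + cooccur_one_tree(leaves[, t])
    }
    total <- total + partial
    rm(partial)  # release the batch buffer before the next batch
  }
  total
}

co_batched <- aggregate_in_batches(leaves, batch_size = 10)

# Unweighted dissimilarity: 1 - share of trees in which a pair co-occurs.
D_toy <- 1 - co_batched / n_trees
```

Only one batch-sized partial result is held at a time, which is the memory argument behind `batch_aggregate`; the final matrix is the same as summing all trees at once.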

Supported ensemble types for classification or regression tasks:

randomForest
ranger
xgb.Booster (xgboost)
lgb.Booster (lightgbm)
gbm (gbm)
catboost.CatBoost (catboost)

Value

A dissimilarity matrix measuring the discordance between each pair of observations with respect to the given ensemble model.

Interpretation note (RF vs boosting)

For bagging ensembles (randomForest, ranger), the trees are grown independently on bootstrap samples, so co-occurrence in the same leaf captures local similarity in the predictor space. For boosting ensembles (xgb.Booster, lgb.Booster, gbm, catboost.CatBoost), each tree is fit to the residuals of the previous ones, so leaf co-occurrence reflects similarity in the error-correction trajectory rather than in the final prediction space. The resulting dissimilarity matrices therefore have systematically different scales (typically $\bar{D} \in [0.85, 0.95]$ for bagging vs. $\bar{D} \in [0.35, 0.70]$ for boosting). The surrogate tree built on top of D should be interpreted accordingly.
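
To see where a given matrix falls relative to the ranges quoted above, one can compute a summary of its entries. The sketch below assumes that the scale summary is the mean of the off-diagonal entries of D; that reading, the helper `mean_offdiag()`, and the small example matrix are illustrative assumptions rather than anything defined by the package.

```r
# Assumption: the scale summary is the mean of the off-diagonal entries of D.
mean_offdiag <- function(D) {
  D <- as.matrix(D)
  # For a symmetric matrix the upper triangle holds each pair exactly once.
  mean(D[upper.tri(D)])
}

# Made-up 3 x 3 symmetric dissimilarity matrix, for illustration only.
D_example <- matrix(c(0.0, 0.9, 0.8,
                      0.9, 0.0, 1.0,
                      0.8, 1.0, 0.0), nrow = 3, byrow = TRUE)

d_bar <- mean_offdiag(D_example)
```

Applied to the output of createDisMatrix, a value near the upper range would be consistent with a bagging backend and a lower one with boosting, subject to the assumption above.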

The returned matrix carries an ensemble_backend attribute identifying the backend used, which downstream functions check to detect mismatched (D, ensemble) pairs.
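
The attribute can be read with base R's `attr()`. The following sketch uses a stand-in matrix so that it runs without a fitted model; the value "ranger" is a made-up example, since the exact strings stored in ensemble_backend are not listed here.

```r
# Stand-in for a matrix returned by createDisMatrix(); the attribute value
# "ranger" is a made-up example, not a documented value.
D_mock <- structure(diag(0, 3), ensemble_backend = "ranger")

# exact = TRUE avoids partial matching on the attribute name.
backend <- attr(D_mock, "ensemble_backend", exact = TRUE)
if (is.null(backend)) {
  warning("No ensemble_backend attribute found on D")
}
```

Note that some matrix operations drop non-standard attributes, so it may be worth checking the attribute again after transforming D.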

Examples


data("iris")

# Create training and validation sets (seed fixed for reproducibility):
set.seed(42)
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
training <- iris[train_ind, ]
validation <- iris[-train_ind, ]
response_training <- training[,5]
response_validation <- validation[,5]

# Perform training:
## "randomForest" package
ensemble <- randomForest::randomForest(Species ~ ., data = training,
  importance = TRUE, proximity = TRUE)

## "ranger" package
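if (requireNamespace("ranger", quietly = TRUE)) {
  # Replaces the randomForest model above. Train on `training` so that the
  # `data` argument of createDisMatrix matches the data used for learning.
  ensemble <- ranger::ranger(Species ~ ., data = training,
    num.trees = 1000, importance = 'impurity')
}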

# Compute dissimilarity matrix with optimizations
D <- createDisMatrix(
  ensemble,
  data = training,
  label = "Species",
  parallel = list(active = FALSE, no_cores = 1),
  chunk_size = 10000,  # Process 10K rows at a time
  batch_aggregate = 20, # Aggregate 20 trees at once
  verbose = TRUE
)

e2tree documentation built on May 15, 2026, 5:06 p.m.