View source: R/createDisMatrix.R
createDisMatrix (R Documentation)
The function createDisMatrix creates a dissimilarity matrix among observations from a tree ensemble. This optimized version is designed for large datasets (50K-500K observations), with improved memory management and chunking capabilities.
createDisMatrix(
ensemble,
data,
label,
parallel = list(active = FALSE, no_cores = 1),
verbose = FALSE,
chunk_size = NULL,
memory_limit = NULL,
use_disk = FALSE,
temp_dir = tempdir(),
batch_aggregate = 10
)
ensemble: an ensemble tree object.

data: a data frame containing the variables in the model. It is the same data frame used for ensemble learning.

label: a character string. It indicates the response label.

parallel: a list with two elements: active (logical; if TRUE, trees are processed in parallel) and no_cores (integer; the number of cores to use). Default: list(active = FALSE, no_cores = 1).

verbose: logical. If TRUE, the function prints progress messages and other information during execution. If FALSE (the default), messages are suppressed.

chunk_size: integer. Number of rows to process in each chunk. If NULL (the default), it is determined automatically from the available memory and the dataset size.

memory_limit: numeric. Maximum memory to use, in GB. Default: NULL (no limit).

use_disk: logical. If TRUE and the dataset is very large, intermediate results are saved to disk. Default: FALSE.

temp_dir: character. Directory for temporary files when use_disk = TRUE. Default: tempdir().

batch_aggregate: integer. Number of per-tree results to aggregate at once before adding them to the main matrix (reduces memory peaks). Default: 10.
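As an illustration of how chunk_size and memory_limit interact, the sketch below estimates a chunk size from a memory budget. This is an assumption for illustration only: choose_chunk_size is a hypothetical helper, and the package's actual auto-selection logic is internal.

```r
# Hypothetical helper (not part of the package): pick a chunk size so that
# one dense chunk-by-n block of doubles fits within the memory budget.
choose_chunk_size <- function(n_obs, memory_limit_gb = 2) {
  bytes_per_row <- n_obs * 8                   # 8 bytes per double entry
  budget_bytes  <- memory_limit_gb * 1024^3
  max(1L, min(n_obs, floor(budget_bytes / bytes_per_row)))
}

choose_chunk_size(100000, memory_limit_gb = 2)  # 2684 rows per chunk
```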
This optimized version implements several strategies for handling large datasets:

- Memory-efficient aggregation: results from parallel trees are aggregated in batches to avoid memory peaks.
- Chunking: for very large matrices, the computation can be split into manageable chunks.
- Sparse matrix optimization: sparsity is maintained throughout the computation.
- Automatic garbage collection: explicit memory cleanup at critical points.
- Disk-based computation: optional saving of intermediate results for datasets exceeding memory capacity.
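The batch aggregation strategy can be sketched as follows. This is a simplified illustration, not the package's internal code; tree_cooccurrence is a hypothetical stand-in for the per-tree leaf co-occurrence computation.

```r
library(Matrix)  # sparse matrix support

# Hypothetical per-tree result: a sparse co-occurrence matrix.
tree_cooccurrence <- function(n) {
  i <- sample(n, 10, replace = TRUE)
  j <- sample(n, 10, replace = TRUE)
  sparseMatrix(i = i, j = j, x = 1, dims = c(n, n))
}

n <- 100; n_trees <- 50; batch_aggregate <- 10
total <- Matrix(0, n, n, sparse = TRUE)
batch <- NULL
for (t in seq_len(n_trees)) {
  res <- tree_cooccurrence(n)
  batch <- if (is.null(batch)) res else batch + res
  if (t %% batch_aggregate == 0) {   # fold a full batch into the total
    total <- total + batch
    batch <- NULL
    gc()                             # explicit cleanup at batch boundaries
  }
}
if (!is.null(batch)) total <- total + batch
```

Accumulating batch_aggregate trees locally before touching the full n-by-n matrix keeps the number of large allocations proportional to n_trees / batch_aggregate rather than to n_trees.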
Supported ensemble types for classification or regression tasks:

- randomForest
- ranger
- xgb.Booster (xgboost)
- lgb.Booster (lightgbm)
- gbm (gbm)
- catboost.CatBoost (catboost)
A dissimilarity matrix measuring the discordance between pairs of observations with respect to the given ensemble model.
For bagging ensembles (randomForest, ranger) the trees are
grown independently on bootstrap samples; co-occurrence in the same leaf
captures local similarity in the predictor space. For boosting ensembles
(xgb.Booster, lgb.Booster, gbm, catboost)
each tree is fit to the residual of the previous ones, so leaf
co-occurrence reflects similarity in the error-correction trajectory
rather than in the final prediction space. The resulting dissimilarity matrices therefore have systematically different scales (mean dissimilarity typically in [0.85, 0.95] for bagging vs. [0.35, 0.70] for boosting). The surrogate tree built on top of D should be interpreted accordingly.
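As a quick sanity check of the scale, one can inspect the mean off-diagonal entry of D; mean_offdiag below is a small helper written here for illustration, and the ranges are just the indicative values quoted above.

```r
# Mean off-diagonal dissimilarity as a scale diagnostic.
mean_offdiag <- function(D) mean(D[upper.tri(D)])

# Toy example: a 3x3 dissimilarity matrix in the typical bagging range.
D_toy <- matrix(c(0.00, 0.90, 0.88,
                  0.90, 0.00, 0.92,
                  0.88, 0.92, 0.00), nrow = 3, byrow = TRUE)
mean_offdiag(D_toy)  # 0.9
```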
The returned matrix carries an ensemble_backend attribute identifying
the backend used, which downstream functions check to detect mismatched
(D, ensemble) pairs.
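A downstream consistency check might look like the sketch below. check_backend is illustrative, not a package function; only the ensemble_backend attribute name is taken from the documentation above.

```r
# Illustrative guard: refuse to pair D with the wrong ensemble backend.
check_backend <- function(D, expected) {
  backend <- attr(D, "ensemble_backend")
  if (!identical(backend, expected)) {
    stop(sprintf("D was built with backend '%s', expected '%s'",
                 backend, expected))
  }
  invisible(TRUE)
}

D <- matrix(0, 2, 2)
attr(D, "ensemble_backend") <- "ranger"
check_backend(D, "ranger")  # passes silently
```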
data("iris")
# Create training and validation sets (seed set for reproducibility):
set.seed(123)
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
training <- iris[train_ind, ]
validation <- iris[-train_ind, ]
response_training <- training[, 5]
response_validation <- validation[, 5]

# Perform training:
## "randomForest" package
ensemble <- randomForest::randomForest(Species ~ ., data = training,
                                       importance = TRUE, proximity = TRUE)

## "ranger" package (alternative backend; trained on the same data
## that is later passed to createDisMatrix)
if (requireNamespace("ranger", quietly = TRUE)) {
  ensemble <- ranger::ranger(Species ~ ., data = training,
                             num.trees = 1000, importance = "impurity")
}
# Compute dissimilarity matrix with optimizations
D <- createDisMatrix(
ensemble,
data = training,
label = "Species",
parallel = list(active = FALSE, no_cores = 1),
chunk_size = 10000, # Process 10K rows at a time
batch_aggregate = 20, # Aggregate 20 trees at once
verbose = TRUE
)
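Once computed, D can feed a surrogate tree or any distance-based method. The snippet below shows standard hierarchical clustering on a toy symmetric matrix standing in for D (illustrative only, not part of the package):

```r
# Toy stand-in for the dissimilarity matrix D above.
set.seed(1)
D_toy <- as.matrix(dist(matrix(rnorm(20), nrow = 10, ncol = 2)))

hc <- hclust(as.dist(D_toy), method = "average")  # average linkage
clusters <- cutree(hc, k = 3)                     # cut into 3 groups
```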