SCE | R Documentation |
This package provides two main modeling approaches:
SCA (Stepwise Cluster Analysis): A single tree model that recursively partitions the data space based on Wilks' Lambda statistic, creating a tree structure for prediction.
SCE (Stepwise Clustered Ensemble): An ensemble of SCA trees built using bootstrap samples and random feature selection, providing improved prediction accuracy and robustness.
Both functions include comprehensive input validation for data types, missing values, and sample size requirements, and support both single and multiple predictants.
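The checks described above can be pictured with a short base R sketch. It is illustrative only: `check_training_input` is a hypothetical helper and not a function exported by the package, and the package's actual checks and error messages may differ.

```r
# Hypothetical helper (not part of the SCE package) mirroring the documented checks.
check_training_input <- function(Training_data, X, Y, Nmin) {
  if (!is.data.frame(Training_data) && !is.matrix(Training_data)) {
    stop("Training_data must be a data.frame or matrix")
  }
  # Every predictor and predictant must be a column of the training data
  missing_cols <- setdiff(c(X, Y), colnames(Training_data))
  if (length(missing_cols) > 0) {
    stop("Columns not found in Training_data: ",
         paste(missing_cols, collapse = ", "))
  }
  used <- as.data.frame(Training_data)[, c(X, Y), drop = FALSE]
  if (!all(vapply(used, is.numeric, logical(1)))) {
    stop("All predictors and predictants must be numeric")
  }
  if (any(is.na(used))) {
    stop("Training_data must not contain missing values")
  }
  if (!is.numeric(Nmin) || length(Nmin) != 1 || Nmin <= 0 || Nmin >= nrow(used)) {
    stop("Nmin must be a positive number smaller than the sample size")
  }
  invisible(TRUE)
}

# Toy data (made up for illustration)
toy <- data.frame(Prcp = c(1.2, 0.0, 3.4, 2.1),
                  Tmax = c(20, 25, 18, 22),
                  Flow = c(5, 3, 9, 6))
ok <- check_training_input(toy, X = c("Prcp", "Tmax"), Y = "Flow", Nmin = 2)

# A missing value is rejected with an informative message
bad <- toy
bad$Prcp[2] <- NA
na_result <- tryCatch(
  check_training_input(bad, X = c("Prcp", "Tmax"), Y = "Flow", Nmin = 2),
  error = function(e) conditionMessage(e)
)
```

Running a check of this kind before calling SCA() or SCE() can surface data problems early, for example after merging several data sources.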
SCA(Training_data, X, Y, Nmin, alpha = 0.05, resolution = 1000, verbose = FALSE)
SCE(Training_data, X, Y, mfeature, Nmin, Ntree,
    alpha = 0.05, resolution = 1000, verbose = FALSE, parallel = TRUE)
Training_data: A data.frame or matrix containing the training data. Must include all specified predictors and predictants. Must not contain missing values.
X: A character vector specifying the names of the independent variables (e.g., c("Prcp","SRad","Tmax")). All names must be present in Training_data, and all variables must be numeric.
Y: A character vector specifying the name(s) of the dependent variable(s) (e.g., c("swvl3","swvl4")). All names must be present in Training_data, and all variables must be numeric.
Nmin: Integer specifying the minimum number of samples in a leaf node for cutting. Must be positive and less than the sample size.
mfeature: An integer specifying how many features are randomly selected for each tree. Recommended value is round(0.5 * length(X)). Only used for SCE.
Ntree: An integer specifying how many trees (ensemble members) are built. Recommended values range from 50 to 500 depending on data complexity. Only used for SCE.
alpha: Numeric significance level for clustering, between 0 and 1. Default value is 0.05.
resolution: Numeric value specifying the resolution for splitting. Controls the granularity of the search for optimal split points. Default value is 1000.
verbose: A logical value indicating whether to print progress information during model building. Default value is FALSE.
parallel: A logical value indicating whether to use parallel processing for tree construction. When TRUE, uses multiple CPU cores for faster computation. When FALSE, processes trees sequentially. Default value is TRUE. Only used for SCE.
Model Building Process:
SCA (Single Tree):
Input validation (data types, missing values, sample size requirements)
Data preparation (conversion to matrix format, parameter initialization)
Tree construction (recursive partitioning based on Wilks' Lambda)
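To make the role of Wilks' Lambda in the tree construction step concrete, the sketch below computes it for one candidate two-group split of the predictants. Lambda is the ratio of the determinants of the within-group and total sums-of-squares-and-cross-products matrices; values near 0 indicate well-separated groups. The F transformation shown is the standard exact form for two groups. Whether the package applies exactly this test internally is an assumption, and the helper names `wilks_lambda` and `split_f_test` are hypothetical.

```r
# Wilks' Lambda for splitting the rows of a predictant matrix Y into two groups.
# Helper names are hypothetical; this is not the package's internal code.
wilks_lambda <- function(Y, in_left) {
  Y <- as.matrix(Y)
  # Centered sums of squares and cross-products of a matrix
  sscp <- function(M) crossprod(sweep(M, 2, colMeans(M)))
  W <- sscp(Y[in_left, , drop = FALSE]) + sscp(Y[!in_left, , drop = FALSE])
  Tot <- sscp(Y)
  det(W) / det(Tot)
}

# F test for the two-group case: exact transformation of Lambda
split_f_test <- function(Y, in_left) {
  Y <- as.matrix(Y)
  n <- nrow(Y)   # total number of samples in both groups
  d <- ncol(Y)   # number of predictants
  lambda <- wilks_lambda(Y, in_left)
  f_stat <- (1 - lambda) / lambda * (n - d - 1) / d
  p_value <- pf(f_stat, d, n - d - 1, lower.tail = FALSE)
  list(lambda = lambda, f_stat = f_stat, p_value = p_value)
}

# Synthetic data: one predictor with two clearly separated regimes
set.seed(1)
x <- c(runif(20, 0, 1), runif(20, 2, 3))
y <- cbind(y1 = 2 * x + rnorm(40, sd = 0.1),
           y2 = -x + rnorm(40, sd = 0.1))

res <- split_f_test(y, in_left = x < 1.5)
```

Under this reading, a candidate cut would be accepted when `res$p_value` falls below `alpha`, and two sub-clusters would be merged when the same test finds no significant difference between them.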
SCE (Ensemble):
Input validation (data types, missing values, sample size requirements)
Data preparation (conversion to matrix format, parameter initialization)
Tree construction (bootstrap samples, random feature selection, parallel SCA tree building)
Model evaluation (OOB error calculation, tree weighting)
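The ensemble steps above follow the general bagging pattern, which can be sketched in base R. This is a sketch under stated assumptions, not the package's implementation: `lm()` stands in for an SCA tree so that the example runs without the package, and the weighting rule (out-of-bag R-squared truncated at zero and normalised to sum to one) is an assumed scheme chosen for illustration.

```r
set.seed(42)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n, sd = 0.5)

X <- c("x1", "x2", "x3", "x4")
mfeature <- round(0.5 * length(X))  # recommended setting from the arguments
Ntree <- 25

members <- lapply(seq_len(Ntree), function(i) {
  in_bag <- sample(n, n, replace = TRUE)   # bootstrap sample of row indices
  oob <- setdiff(seq_len(n), in_bag)       # rows never drawn: out-of-bag
  feats <- sample(X, mfeature)             # random feature subset for this member
  # Stand-in learner (assumption): a linear model instead of an SCA tree
  fit <- lm(reformulate(feats, response = "y"), data = dat[in_bag, ])
  oob_pred <- predict(fit, newdata = dat[oob, ])
  oob_r2 <- 1 - sum((dat$y[oob] - oob_pred)^2) /
    sum((dat$y[oob] - mean(dat$y[oob]))^2)
  list(fit = fit, feats = feats, oob = oob, oob_r2 = oob_r2)
})

# Assumed weighting rule: non-negative OOB R-squared, normalised to sum to one
oob_r2 <- vapply(members, function(m) m$oob_r2, numeric(1))
w <- pmax(oob_r2, 0)
w <- w / sum(w)

# Weighted ensemble prediction: one column of predictions per member
preds <- vapply(members, function(m) predict(m$fit, newdata = dat), numeric(n))
ensemble_pred <- as.vector(preds %*% w)
```

Because each member is scored only on rows it never saw during fitting, the out-of-bag score gives an internal validation estimate without a separate hold-out set.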
Key Differences:
SCA: Single tree, deterministic, faster training, potentially less robust
SCE: Multiple trees, ensemble approach, improved accuracy, OOB validation, parallel processing
When to Use:
SCA: Quick exploration, simple relationships, limited computational resources
SCE: Production models, complex relationships, when accuracy is critical
For SCA: An S3 object of class "SCA" containing:
Tree: The SCA tree structure
Map: Mapping information for predictions
XName: Names of predictors used
YName: Names of predictants
type: Mapping type (currently "mean")
totalNodes: Total number of nodes in the tree
leafNodes: Number of leaf nodes
cuttingActions: Number of cutting actions performed
mergingActions: Number of merging actions performed
call: Function call
For SCE: An S3 object of class "SCE" containing the ensemble model with the following components:
trees: A list of SCA tree models, each containing:
  Tree: The SCA tree structure
  Map: Mapping information
  XName: Names of predictors used
  YName: Names of predictants
  type: Mapping type
  totalNodes: Total number of nodes
  leafNodes: Number of leaf nodes
  cuttingActions: Number of cutting actions
  mergingActions: Number of merging actions
  OOB_error: Out-of-bag R-squared error
  OOB_sim: Out-of-bag predictions
  Sample: Bootstrap sample indices
  Tree_Info: Tree-specific information
  Training_data: Training data used for the tree
  weight: Tree weight based on OOB performance
predictors: Names of predictor variables
predictants: Names of predictant variables
parameters: Model parameters
call: Function call
Both objects support S3 methods: print(), summary(), predict(), importance(), and evaluate().
Xiuquan Wang <xxwang@upei.ca> (original SCA)
Kailong Li <lkl98509509@gmail.com> (Resolution-search-based SCA and SCE ensemble)
Li, Kailong, Guohe Huang, and Brian Baetz. Development of a Wilks feature importance method with improved variable rankings for supporting hydrological inference and modelling. Hydrology and Earth System Sciences 25.9 (2021): 4947-4966.
Wang, X., G. Huang, Q. Lin, X. Nie, G. Cheng, Y. Fan, Z. Li, Y. Yao, and M. Suo (2013), A stepwise cluster analysis approach for downscaled climate projection - A Canadian case study. Environmental Modelling & Software, 49, 141-151.
Huang, G. (1992). A stepwise cluster analysis method for predicting air quality in an urban environment. Atmospheric Environment (Part B. Urban Atmosphere), 26(3): 349-357.
Liu, Y. Y. and Y. L. Wang (1979). Application of stepwise cluster analysis in medical research. Scientia Sinica, 22(9): 1082-1094.
predict, importance, evaluate for S3 methods, RFE_SCE for recursive feature elimination
## Load SCE package
library(SCE)
## Load training and testing data
data("Streamflow_training_10var")
data("Streamflow_testing_10var")
## Define independent (x) and dependent (y) variables
Predictors <- c("Prcp","SRad","Tmax","Tmin","VP","smlt","swvl1","swvl2","swvl3","swvl4")
Predictants <- c("Flow")
## Example 1: Build SCA model (single tree)
sca_model <- SCA(
  Training_data = Streamflow_training_10var,
  X = Predictors,
  Y = Predictants,
  Nmin = 5,
  alpha = 0.05,
  resolution = 1000
)
## Use S3 methods for SCA model inspection
print(sca_model)
summary(sca_model)
## Make predictions using S3 method
sca_predictions <- predict(sca_model, Streamflow_testing_10var)
## Calculate variable importance using S3 method
sca_importance <- importance(sca_model)
## Evaluate SCA model performance using S3 method
sca_evaluation <- evaluate(
  object = sca_model,
  Testing_data = Streamflow_testing_10var,
  Predictant = Predictants
)
## Example 2: Build SCE model (ensemble)
sce_model <- SCE(
  Training_data = Streamflow_training_10var,
  X = Predictors,
  Y = Predictants,
  mfeature = round(0.5 * length(Predictors)),
  Nmin = 5,
  Ntree = 48,
  alpha = 0.05,
  resolution = 1000,
  parallel = FALSE
)
## Use S3 methods for SCE model inspection
print(sce_model)
summary(sce_model)
## Generate predictions using S3 method
sce_predictions <- predict(sce_model, Streamflow_testing_10var)
## Access different prediction components
training_predictions <- sce_predictions$Training
validation_predictions <- sce_predictions$Validation
testing_predictions <- sce_predictions$Testing
## Calculate variable importance using S3 method
sce_importance <- importance(sce_model)
## Evaluate SCE model performance using S3 method
sce_evaluation <- evaluate(
  object = sce_model,
  Testing_data = Streamflow_testing_10var,
  Training_data = Streamflow_training_10var,
  Predictant = Predictants
)
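As a cross-check on an evaluation, common goodness-of-fit measures can also be computed by hand from observed and predicted values. The sketch below uses made-up numbers and a hypothetical helper, `goodness_of_fit`; it makes no claim about which metrics evaluate() reports or how its output is structured.

```r
# Hypothetical helper: basic goodness-of-fit measures for one predictant
goodness_of_fit <- function(obs, sim) {
  stopifnot(length(obs) == length(sim))
  rmse <- sqrt(mean((obs - sim)^2))                            # root mean squared error
  nse <- 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)     # Nash-Sutcliffe efficiency
  r2 <- cor(obs, sim)^2                                        # squared Pearson correlation
  c(RMSE = rmse, NSE = nse, R2 = r2)
}

# Made-up observed and simulated values, for illustration only
obs <- c(10, 12, 15, 11, 9)
sim <- c(11, 12, 14, 10, 9)
gof <- goodness_of_fit(obs, sim)
```

In practice `obs` would be the observed predictant column of the testing data and `sim` the matching predictions from the model.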