
SCE	R Documentation

Stepwise Clustered Ensemble (SCE) and Stepwise Cluster Analysis (SCA) Models

Description

This package provides two main modeling approaches:

SCA (Stepwise Cluster Analysis): A single tree model that recursively partitions the data space based on Wilks' Lambda statistic, creating a tree structure for prediction.

SCE (Stepwise Clustered Ensemble): An ensemble of SCA trees, each built on a bootstrap sample with a randomly selected subset of features, intended to improve prediction accuracy and robustness over a single tree.

Both functions validate their inputs (data types, missing values, and sample size requirements) and support single or multiple predictants.

Usage

SCA(Training_data, X, Y, Nmin, alpha = 0.05, resolution = 1000, verbose = FALSE)

SCE(Training_data, X, Y, mfeature, Nmin, Ntree,
	alpha = 0.05, resolution = 1000, verbose = FALSE, parallel = TRUE)

Arguments

`Training_data`	A data.frame or matrix containing the training data. Must include all specified predictors and predictants. Must not contain missing values.
`X`	A character vector specifying the names of independent variables (e.g., c("Prcp","SRad","Tmax")). Must be present in Training_data. All variables must be numeric.
`Y`	A character vector specifying the name(s) of dependent variable(s) (e.g., c("swvl3","swvl4")). Must be present in Training_data. All variables must be numeric.
`Nmin`	Integer specifying the minimum number of samples in a leaf node for cutting. Must be positive and less than the sample size.
`mfeature`	An integer specifying how many features will be randomly selected for each tree. Recommended value is round(0.5 * length(X)). Only used for SCE.
`Ntree`	An integer specifying how many trees (ensemble members) will be built. Recommended values range from 50 to 500 depending on data complexity. Only used for SCE.
`alpha`	Numeric significance level for clustering, between 0 and 1. Default value is 0.05.
`resolution`	Numeric value specifying the resolution for splitting. Controls the granularity of the search for optimal split points. Default value is 1000.
`verbose`	A logical value indicating whether to print progress information during model building. Default value is FALSE.
`parallel`	A logical value indicating whether to use parallel processing for tree construction. When TRUE, uses multiple CPU cores for faster computation. When FALSE, processes trees sequentially. Default value is TRUE. Only used for SCE.
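
The constraints listed above can be checked before calling SCA() or SCE(). The following is a minimal sketch in base R; check_sca_inputs is a hypothetical helper written for illustration and is not part of the SCE package, whose own validation may differ in detail.

```r
# Hypothetical pre-flight check mirroring the documented argument constraints.
# Not part of the SCE package; the package performs its own validation.
check_sca_inputs <- function(Training_data, X, Y, Nmin, alpha = 0.05) {
  stopifnot(is.data.frame(Training_data) || is.matrix(Training_data))
  stopifnot(is.character(X), is.character(Y))

  # All predictors and predictants must be columns of Training_data
  missing_cols <- setdiff(c(X, Y), colnames(Training_data))
  if (length(missing_cols) > 0) {
    stop("Columns not found in Training_data: ",
         paste(missing_cols, collapse = ", "))
  }

  # All selected columns must be numeric and free of missing values
  selected <- as.data.frame(Training_data)[, c(X, Y), drop = FALSE]
  if (!all(vapply(selected, is.numeric, logical(1)))) {
    stop("All X and Y variables must be numeric")
  }
  if (anyNA(selected)) {
    stop("Training_data must not contain missing values")
  }

  # Nmin must be positive and smaller than the sample size
  if (!is.numeric(Nmin) || Nmin <= 0 || Nmin >= nrow(selected)) {
    stop("Nmin must be positive and less than the sample size")
  }

  # alpha is a significance level strictly between 0 and 1
  if (!is.numeric(alpha) || alpha <= 0 || alpha >= 1) {
    stop("alpha must be between 0 and 1")
  }
  invisible(TRUE)
}

# Toy data for illustration only
toy <- data.frame(Prcp = c(1.2, 0.0, 3.4, 2.1),
                  Tmax = c(15, 18, 12, 14),
                  Flow = c(10, 8, 20, 13))
check_sca_inputs(toy, X = c("Prcp", "Tmax"), Y = "Flow", Nmin = 2)
```

Failing early with a specific message is usually easier to debug than an error raised deep inside tree construction.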

Details

Model Building Process:

SCA (Single Tree):

Input validation (data types, missing values, sample size requirements)
Data preparation (conversion to matrix format, parameter initialization)
Tree construction (recursive partitioning based on Wilks' Lambda)
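
To make the splitting criterion concrete, the sketch below computes Wilks' Lambda for one candidate split of a node into two sub-clusters. It uses the standard definition, Lambda = det(W) / det(T), where W is the within-group and T the total sums-of-squares-and-cross-products matrix, together with the exact F transformation for two groups. The function name wilks_lambda_split and the toy data are assumptions made for illustration; the package's internal implementation may differ in detail.

```r
# Wilks' Lambda for splitting the rows of Y into two groups.
# Y:    numeric matrix of predictants (n rows, p columns)
# left: logical vector marking the rows that fall in the first sub-cluster
wilks_lambda_split <- function(Y, left) {
  Y <- as.matrix(Y)
  n <- nrow(Y)
  p <- ncol(Y)

  # Sums of squares and cross-products about the column means
  sscp <- function(M) crossprod(scale(M, center = TRUE, scale = FALSE))

  W <- sscp(Y[left, , drop = FALSE]) + sscp(Y[!left, , drop = FALSE])
  T_total <- sscp(Y)

  lambda <- det(W) / det(T_total)

  # Exact F statistic for two groups, with (p, n - p - 1) degrees of freedom
  f_stat  <- (1 - lambda) / lambda * (n - p - 1) / p
  p_value <- pf(f_stat, p, n - p - 1, lower.tail = FALSE)

  list(lambda = lambda, f_stat = f_stat, p_value = p_value)
}

# Toy data: one predictor with two regimes, one predictant
set.seed(1)
x <- c(rnorm(20, mean = 0), rnorm(20, mean = 5))
Y <- cbind(Flow = 2 * x + rnorm(40, sd = 0.5))

res <- wilks_lambda_split(Y, left = x < 2.5)
res$lambda          # values near 0 indicate well-separated sub-clusters
res$p_value < 0.05  # TRUE means the difference is significant at alpha = 0.05
```

A Lambda close to 1 means the two sub-clusters are hardly distinguishable in the predictants, so the comparison of the p-value with alpha is what decides whether a candidate cut is statistically justified.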

SCE (Ensemble):

Input validation (data types, missing values, sample size requirements)
Data preparation (conversion to matrix format, parameter initialization)
Tree construction (bootstrap samples, random feature selection, parallel SCA tree building)
Model evaluation (OOB error calculation, tree weighting)
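
The resampling behind each ensemble member can be pictured as follows. This is a minimal base-R sketch of bootstrap sampling, out-of-bag (OOB) row identification, and random feature selection for a single tree; the variable names and the sample size are illustrative assumptions, and the package's internals (including how OOB scores are converted into tree weights) are not reproduced here.

```r
set.seed(42)

Predictors <- c("Prcp", "SRad", "Tmax", "Tmin", "VP",
                "smlt", "swvl1", "swvl2", "swvl3", "swvl4")
n_samples <- 200                               # assumed training-set size
mfeature  <- round(0.5 * length(Predictors))   # recommended value from Arguments

# Bootstrap sample: draw n row indices with replacement
boot_idx <- sample.int(n_samples, size = n_samples, replace = TRUE)

# Out-of-bag rows: never drawn, so they can validate this tree
oob_idx <- setdiff(seq_len(n_samples), boot_idx)

# Random feature subset used to grow this tree (drawn without replacement)
tree_features <- sample(Predictors, size = mfeature)

length(oob_idx) / n_samples   # on average close to exp(-1), about 0.37
tree_features
```

Because roughly a third of the rows are left out of each bootstrap sample, every tree gets its own validation set without a separate hold-out split.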

Key Differences:

SCA: Single tree, deterministic, faster training, potentially less robust
SCE: Multiple trees, ensemble approach, improved accuracy, OOB validation, parallel processing

When to Use:

SCA: Quick exploration, simple relationships, limited computational resources
SCE: Production models, complex relationships, when accuracy is critical

Value

For SCA: An S3 object of class "SCA" containing:

Tree: The SCA tree structure
Map: Mapping information for predictions
XName: Names of predictors used
YName: Names of predictants
type: Mapping type (currently "mean")
totalNodes: Total number of nodes in the tree
leafNodes: Number of leaf nodes
cuttingActions: Number of cutting actions performed
mergingActions: Number of merging actions performed
call: Function call

For SCE: An S3 object of class "SCE" containing the ensemble model with the following components:

trees: A list of SCA tree models, each containing:
- Tree: The SCA tree structure
- Map: Mapping information
- XName: Names of predictors used
- YName: Names of predictants
- type: Mapping type
- totalNodes: Total number of nodes
- leafNodes: Number of leaf nodes
- cuttingActions: Number of cutting actions
- mergingActions: Number of merging actions
- OOB_error: Out-of-bag R-squared error
- OOB_sim: Out-of-bag predictions
- Sample: Bootstrap sample indices
- Tree_Info: Tree-specific information
- Training_data: Training data used for the tree
- weight: Tree weight based on OOB performance
predictors: Names of predictor variables
predictants: Names of predictant variables
parameters: Model parameters
call: Function call

Both objects support S3 methods: print(), summary(), predict(), importance(), and evaluate().
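
Because the returned objects are lists, their components can be read with `$`. The sketch below uses a hand-built mock object that only mimics the documented layout of an SCE result (a trees list whose elements carry leafNodes, OOB_error, and weight). All numbers are invented for illustration, and a real model returned by SCE() should be substituted for mock_sce.

```r
# Mock object mimicking the documented structure; all values are invented.
mock_sce <- structure(
  list(
    trees = list(
      list(leafNodes = 12, OOB_error = 0.71, weight = 0.40),
      list(leafNodes = 9,  OOB_error = 0.64, weight = 0.35),
      list(leafNodes = 15, OOB_error = 0.52, weight = 0.25)
    ),
    predictors  = c("Prcp", "Tmax"),
    predictants = "Flow"
  ),
  class = "mock_SCE"   # deliberately not "SCE", so no package method is dispatched
)

# Collect per-tree diagnostics into one data frame
tree_summary <- data.frame(
  tree      = seq_along(mock_sce$trees),
  leafNodes = vapply(mock_sce$trees, function(tr) tr$leafNodes, numeric(1)),
  OOB_error = vapply(mock_sce$trees, function(tr) tr$OOB_error, numeric(1)),
  weight    = vapply(mock_sce$trees, function(tr) tr$weight, numeric(1))
)
tree_summary
```

A table like this makes it easy to spot ensemble members with unusually poor out-of-bag scores or very large trees.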

Author(s)

Xiuquan Wang <xxwang@upei.ca> (original SCA)
Kailong Li <lkl98509509@gmail.com> (resolution-search-based SCA and SCE ensemble)

References

Li, Kailong, Guohe Huang, and Brian Baetz. Development of a Wilks feature importance method with improved variable rankings for supporting hydrological inference and modelling. Hydrology and Earth System Sciences 25.9 (2021): 4947-4966.

Wang, X., G. Huang, Q. Lin, X. Nie, G. Cheng, Y. Fan, Z. Li, Y. Yao, and M. Suo (2013), A stepwise cluster analysis approach for downscaled climate projection - A Canadian case study. Environmental Modelling & Software, 49, 141-151.

Huang, G. (1992). A stepwise cluster analysis method for predicting air quality in an urban environment. Atmospheric Environment (Part B. Urban Atmosphere), 26(3): 349-357.

Liu, Y. Y. and Y. L. Wang (1979). Application of stepwise cluster analysis in medical research. Scientia Sinica, 22(9): 1082-1094.

Examples


	## Load SCE package
	library(SCE)

	## Load training and testing data
	data("Streamflow_training_10var")
	data("Streamflow_testing_10var")

	## Define independent (x) and dependent (y) variables
	Predictors <- c("Prcp","SRad","Tmax","Tmin","VP","smlt","swvl1","swvl2","swvl3","swvl4")
	Predictants <- c("Flow")

	## Example 1: Build SCA model (single tree)
	sca_model <- SCA(
		Training_data = Streamflow_training_10var,
		X = Predictors,
		Y = Predictants,
		Nmin = 5,
		alpha = 0.05,
		resolution = 1000
	)
	
	## Use S3 methods for SCA model inspection
	print(sca_model)
	summary(sca_model)
	
	## Make predictions using S3 method
	sca_predictions <- predict(sca_model, Streamflow_testing_10var)
	
	## Calculate variable importance using S3 method
	sca_importance <- importance(sca_model)
	
	## Evaluate SCA model performance using S3 method
	sca_evaluation <- evaluate(
		object = sca_model,
		Testing_data = Streamflow_testing_10var,
		Predictant = Predictants
	)

	## Example 2: Build SCE model (ensemble)
	sce_model <- SCE(
		Training_data = Streamflow_training_10var,
		X = Predictors,
		Y = Predictants,
		mfeature = round(0.5 * length(Predictors)),
		Nmin = 5,
		Ntree = 48,
		alpha = 0.05,
		resolution = 1000,
		parallel = FALSE
	)

	## Use S3 methods for SCE model inspection
	print(sce_model)
	summary(sce_model)

	## Generate predictions using S3 method
	sce_predictions <- predict(sce_model, Streamflow_testing_10var)

	## Access different prediction components
	training_predictions <- sce_predictions$Training
	validation_predictions <- sce_predictions$Validation
	testing_predictions <- sce_predictions$Testing

	## Calculate variable importance using S3 method
	sce_importance <- importance(sce_model)

	## Evaluate SCE model performance using S3 method
	sce_evaluation <- evaluate(
		object = sce_model,
		Testing_data = Streamflow_testing_10var,
		Training_data = Streamflow_training_10var,
		Predictant = Predictants
	)

SCE documentation built on July 2, 2025, 9:08 a.m.