CASMI.mineCombination: Discover Factor Combinations based on 'CASMI'
In CASMI: 'CASMI'-Based Functions

Description

The 'CASMI.mineCombination()' function suggests the combinations of factors that are most strongly associated with the outcome in a dataset. It builds in part on the 'CASMI.selectFeatures()' function. (In this document, "factor," "independent variable," "feature," and "predictor" are used interchangeably.)

Usage

CASMI.mineCombination(
  data,
  NumOfVar = NULL,
  NA.handle = "stepwise",
  alpha = 0.05,
  alpha.ind = 0.1,
  intermediate.steps = FALSE,
  kappa.star.cap = 1,
  NumOfComb = 3
)

Arguments

`data`	data frame with variables as columns and observations as rows. The data MUST include at least one feature (a.k.a. independent variable, predictor, factor) and exactly one outcome variable (Y). The outcome variable MUST BE THE LAST COLUMN. Both the features and the outcome MUST be categorical or discrete. If variables are not naturally discrete, you may preprocess them using the 'autoBin.binary()' function in the same package.
`NumOfVar`	the number of variables in a combination (integer). This setting is optional. If NULL (default), the function suggests the number of variables automatically.
`NA.handle`	method for handling missing values. This parameter is inherited from the 'CASMI.selectFeatures()' function. There are three possible options: 'NA.handle = "stepwise"' (default), 'NA.handle = "na.omit"', or 'NA.handle = "NA as a category"'. Check the 'CASMI.selectFeatures()' documentation for more details.
`alpha`	level of significance used for the confidence intervals in the results; the default is 0.05.
`alpha.ind`	level of significance used for the initial screening of features based on a test of independence; the default is 0.1. This parameter is also used in the 'CASMI.selectFeatures()' function; check the 'CASMI.selectFeatures()' documentation for more details.
`intermediate.steps`	setting for outputting intermediate steps while awaiting the final results. There are two possible settings: 'intermediate.steps = TRUE' or 'intermediate.steps = FALSE' (default).
`kappa.star.cap`	threshold of 'kappa*' for halting the feature selection process; the default is 1. This parameter is inherited from the 'CASMI.selectFeatures()' function; check the 'CASMI.selectFeatures()' documentation for more details. This setting is applicable only when 'NumOfVar' is set to NULL (default).
`NumOfComb`	the number of top combinations to be returned; the default is 3. This setting is used only when a 'NumOfVar' value is defined (not NULL); if 'NumOfVar' is NULL, only the automatically suggested combination will be returned.
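
The layout requirements for `data` (outcome in the last column, every column categorical or discrete) can be arranged before the call. The following is a minimal base-R sketch, not part of the 'CASMI' package: the helper name `prepare_casmi_input` and the toy data frame `raw` are hypothetical, and converting every column to a factor is a convenience assumed here rather than a documented requirement. Continuous columns would still need to be discretized first (for example, with 'autoBin.binary()').

```r
# Hypothetical helper (not part of the CASMI package): reorder the columns so
# that the outcome comes last, and store every column as a factor.
prepare_casmi_input <- function(df, outcome) {
  stopifnot(is.data.frame(df), outcome %in% names(df), ncol(df) >= 2)

  # Features keep their original order; the outcome moves to the last column.
  features <- setdiff(names(df), outcome)
  df <- df[, c(features, outcome), drop = FALSE]

  # Assumption: factors are used to make each column explicitly categorical.
  # Continuous columns should be binned before this step.
  df[] <- lapply(df, factor)
  df
}

# Toy input in which the outcome is NOT yet the last column.
raw <- data.frame(
  y  = c("yes", "no", "yes", "no"),
  x1 = c("A", "B", "A", "C"),
  x2 = c("W", "W", "X", "Z")
)

prepared <- prepare_casmi_input(raw, outcome = "y")
names(prepared)  # column order after the reordering
```

The resulting data frame could then be passed as the `data` argument.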

Value

'CASMI.mineCombination()' returns the following components:

`Outcome`: Name of the outcome variable (last column) in the input dataset.
`Conf.Level`: Confidence level used for the results.
`NumOfVar`: The number of variables in each combination.
`TopResults`: A results data frame. The number of combinations (rows) returned depends on the 'NumOfComb' setting.
`Comb.Idx`: Indices of the variables in the combination.
`n`: Number of observations used in the analysis.
`kappa*`: A comprehensive score reflecting the association between the factor combination and the outcome. A larger 'kappa*' indicates that the factor combination has a stronger association with the outcome. For more information about 'kappa*', please refer to the paper: Shi, J., Zhang, J. and Ge, Y. (2019), "CASMI—An Entropic Feature Selection Method in Turing’s Perspective" <doi:10.3390/e21121179>
`kappa*.low`: Lower bound of the confidence interval for 'kappa*'.
`kappa*.upr`: Upper bound of the confidence interval for 'kappa*'.
`SMIz`: Standardized Mutual Information (SMI) (using the z-estimator) between the factor combination and the outcome.
`SMIz.low`: Lower bound of the confidence interval for 'SMIz'.
`SMIz.upr`: Upper bound of the confidence interval for 'SMIz'.
`p.MIz`: P-value from the mutual information test of independence (based on the z-estimator) between the factor combination and the outcome.
`Var.Name`: Names of the variables in the combination.

Examples

# ---- Generate a toy dataset for usage examples: "data" ----
set.seed(123)
n <- 200
x1 <- sample(c("A", "B", "C", "D"), size = n, replace = TRUE, prob = c(0.1, 0.2, 0.3, 0.4))
x2 <- sample(c("W", "X", "Y", "Z"), size = n, replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.1))
x3 <- sample(c("E", "F", "G", "H", "I"), size = n,
             replace = TRUE, prob = c(0.2, 0.3, 0.2, 0.2, 0.1))
x4 <- sample(c("A", "B", "C", "D"), size = n, replace = TRUE)
x5 <- sample(c("L", "M", "N"), size = n, replace = TRUE)
x6 <- sample(c("E", "F", "G", "H", "I"), size = n, replace = TRUE)

# Generate y variable dependent on x1 to x3
x1_num <- as.numeric(factor(x1, levels = c("A", "B", "C", "D")))
x2_num <- as.numeric(factor(x2, levels = c("W", "X", "Y", "Z")))
x3_num <- as.numeric(factor(x3, levels = c("E", "F", "G", "H", "I")))
# Calculate y with added noise
y_numeric <- 3 * x1_num + 2 * x2_num - 2 * x3_num + rnorm(n, mean = 0, sd = 2)
# Discretize y into categories
y <- cut(y_numeric, breaks = 10, labels = paste0("Category", 1:10))

# Combine into a dataframe
data <- data.frame(x1, x2, x3, x4, x5, x6, y)

# The outcome of the toy dataset is dependent on x1, x2, and x3
# but is independent of x4, x5, and x6.
head(data)


# ---- Usage Examples ----

## Return the suggested combination with the default settings:
CASMI.mineCombination(data)

## Return combinations when the number of variables to be included
## in each combination is specified (e.g., NumOfVar = 2):
CASMI.mineCombination(data, NumOfVar = 2)

## Return combinations when the number of variables to be included
## in each combination is specified (e.g., NumOfVar = 2),
## while the number of top combinations to return is specified
## (e.g., NumOfComb = 2):
CASMI.mineCombination(data,
                     NumOfVar = 2,
                     NumOfComb = 2)

CASMI documentation built on April 3, 2025, 10:56 p.m.