D2MCS: Data Driven Multiple Classifier System.

D2MCS R Documentation

Data Driven Multiple Classifier System.

Description

The class is responsible for managing the whole process. Specifically, it builds the M.L. models (optimizing their hyperparameters), selects the best M.L. model for each cluster and executes the classification stage.

Methods

Public methods


Method new()

The function is used to initialize all parameters needed to build a Multiple Classifier System.

Usage
D2MCS$new(
  dir.path,
  num.cores = NULL,
  socket.type = "PSOCK",
  outfile = NULL,
  serialize = FALSE
)
Arguments
dir.path

A character value defining the location where the trained models should be saved.

num.cores

An optional numeric value specifying the number of CPU cores used for training the models (only if parallelization is allowed). If not defined, (number of available cores - 2) cores will be used.

socket.type

A character value defining the type of socket used to communicate with the workers. The default type, "PSOCK", calls makePSOCKcluster. Type "FORK" calls makeForkCluster. For more information see makeCluster.

outfile

Where to direct the stdout and stderr connection output from the workers. "" indicates no redirection (which may only be useful for workers on the local machine). Defaults to '/dev/null'.

serialize

A logical value (FALSE by default, as shown in Usage). If TRUE, serialization will use XDR. Where large amounts of data are to be transferred and all the nodes are little-endian, communication may be substantially faster if this is set to FALSE.
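
The socket.type, outfile and serialize arguments resemble options of makeCluster from the parallel package. The following is a minimal sketch of how such arguments could be forwarded. It is not taken from the D2MCS source: the helper make_workers, the mapping of serialize to useXDR, and the fallback of "detected cores minus 2" are assumptions made for illustration.

```r
library(parallel)

# Hypothetical helper (not part of D2MCS): forwards constructor-style
# arguments to parallel::makeCluster().
make_workers <- function(num.cores = NULL, socket.type = "PSOCK",
                         outfile = "", serialize = FALSE) {
  if (is.null(num.cores)) {
    detected <- detectCores()
    # detectCores() may return NA; always keep at least one worker.
    num.cores <- if (is.na(detected)) 1L else max(1L, detected - 2L)
  }
  # Assumption: 'serialize' corresponds to the 'useXDR' cluster option.
  makeCluster(num.cores, type = socket.type,
              outfile = outfile, useXDR = serialize)
}

# Start two local PSOCK workers, run a trivial job and shut them down.
cl <- make_workers(num.cores = 2)
squares <- parLapply(cl, 1:4, function(x) x^2)
stopCluster(cl)
```

Calling stopCluster() once the work is done releases the worker processes. "FORK" clusters are not available on Windows, which is one reason to keep "PSOCK" as the default.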


Method train()

The function is responsible for performing the M.L. model training stage.

Usage
D2MCS$train(
  train.set,
  train.function,
  num.clusters = NULL,
  model.recipe = DefaultModelFit$new(),
  ex.classifiers = c(),
  ig.classifiers = c(),
  metrics = NULL,
  saveAllModels = FALSE
)
Arguments
train.set

A Trainset object used as training input for the M.L. models.

train.function

A TrainFunction defining the training configuration options.

num.clusters

A numeric value used to define the number of clusters from the Trainset that should be used during the training stage. If not defined, all clusters will be taken into account for training.

model.recipe

An unprepared recipe object inherited from GenericModelFit class.

ex.classifiers

A character vector containing the names of the M.L. models used in the training stage. See getModelInfo and https://topepo.github.io/caret/available-models.html for more information about all the available models.

ig.classifiers

A character vector containing the names of the M.L. models that should be ignored when performing the training stage. See getModelInfo and https://topepo.github.io/caret/available-models.html for more information about all the available models.

metrics

A character vector containing the metrics used to perform the M.L. model hyperparameter optimization during the training stage. See SummaryFunction, UseProbability and NoProbability for more information.

saveAllModels

A logical parameter. TRUE saves all trained models, while FALSE saves only the M.L. model achieving the best performance on each cluster.

Returns

A TrainOutput object containing all the information computed during the training stage.
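
One way to read the ex.classifiers and ig.classifiers arguments is as an include list and an exclude list applied to the available models. The base R sketch below illustrates that reading only. The helper select_models and its rule (start from ex.classifiers when given, otherwise from every available model, then drop ig.classifiers) are assumptions and may differ from what D2MCS actually does.

```r
# Hypothetical helper, not part of D2MCS: resolves which model names to train.
select_models <- function(available, ex.classifiers = c(), ig.classifiers = c()) {
  # Start from the explicit list when provided, otherwise from all models.
  chosen <- if (length(ex.classifiers) > 0) {
    intersect(ex.classifiers, available)
  } else {
    available
  }
  # Remove every model the caller asked to ignore.
  setdiff(chosen, ig.classifiers)
}

# Made-up list of available model names, for illustration only.
available <- c("ranger", "lda", "lda2", "rf", "knn")

only.listed <- select_models(available, ex.classifiers = c("ranger", "lda", "lda2"))
all.but.ignored <- select_models(available, ig.classifiers = c("rf", "knn"))
```

In practice the valid names are those reported by getAvailableModels() or by caret's getModelInfo.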


Method classify()

The function is responsible for executing the classification stage.

Usage
D2MCS$classify(train.output, subset, voting.types, positive.class = NULL)
Arguments
train.output

The TrainOutput object computed in the train stage.

subset

A Subset containing the data to be classified.

voting.types

A list containing SingleVoting or CombinedVoting objects.

positive.class

An optional character parameter used to define the positive class value.

Returns

A ClassificationOutput with all the values computed during the classification stage.
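
The classification stage combines the predictions of the best model of each cluster through voting schemes. The base R sketch below shows the general idea behind majority and weighted voting. It illustrates the concept only and is not the code behind ClassMajorityVoting or ClassWeightedVoting: the prediction matrix, the weights and both helper functions are made up for the example.

```r
# Illustrative data: rows are instances, columns are per-cluster predictions.
cluster.preds <- matrix(c("1", "0", "1",
                          "0", "1", "1",
                          "0", "0", "1"),
                        nrow = 3, byrow = TRUE,
                        dimnames = list(NULL, c("cluster1", "cluster2", "cluster3")))

# Hypothetical per-cluster weights, e.g. a performance value of each model.
weights <- c(cluster1 = 0.9, cluster2 = 0.3, cluster3 = 0.4)

# Majority voting: the class predicted by most clusters wins.
# Ties resolve to the first class in sorted order.
majority_vote <- function(preds) {
  apply(preds, 1, function(row) names(which.max(table(row))))
}

# Weighted voting: each cluster adds its weight to the class it predicts.
weighted_vote <- function(preds, weights) {
  apply(preds, 1, function(row) names(which.max(tapply(weights, row, sum))))
}

majority <- majority_vote(cluster.preds)
weighted <- weighted_vote(cluster.preds, weights)
```

For the second instance the two schemes disagree: two of three clusters predict "1", but the single cluster predicting "0" carries more weight than the other two combined.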


Method getAvailableModels()

The function obtains all the available M.L. models.

Usage
D2MCS$getAvailableModels()
Returns

A data.frame containing the information of the available M.L. models.


Method clone()

The objects of this class are cloneable with this method.

Usage
D2MCS$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

See Also

Dataset, Subset, Trainset

Examples


# Specify the random number generation
set.seed(1234)

## Create Dataset Handler object.
loader <- DatasetLoader$new()

## Load 'hcc-data-complete-balanced.csv' dataset file.
data <- loader$load(filepath = system.file(file.path("examples",
                                                     "hcc-data-complete-balanced.csv"),
                                           package = "D2MCS"),
                    header = TRUE, normalize.names = TRUE)
## Get column names
data$getColumnNames()

## Split data into 4 partitions keeping balance ratio of 'Class' column.
data$createPartitions(num.folds = 4, class.balance = "Class")

## Create a subset comprising the first 2 partitions for clustering purposes.
cluster.subset <- data$createSubset(num.folds = c(1, 2), class.index = "Class",
                                    positive.class = "1")

## Create a subset comprising the second and third partitions for training purposes.
train.subset <- data$createSubset(num.folds = c(2, 3), class.index = "Class",
                                  positive.class = "1")

## Create a subset comprising the last partition for testing purposes.
test.subset <- data$createSubset(num.folds = 4, class.index = "Class",
                                 positive.class = "1")

## Distribute the features into clusters using MCC heuristic.
distribution <- SimpleStrategy$new(subset = cluster.subset,
                                   heuristic = MCCHeuristic$new())
distribution$execute()

## Get the best achieved distribution
distribution$getBestClusterDistribution()

## Create a train set from the computed clustering distribution
train.set <- distribution$createTrain(subset = train.subset)

## Not run: 

## Initialization of D2MCS configuration parameters.
##  - Defining training operation.
##    + 10-fold cross-validation
##    + Use only 1 CPU core.
##    + Seed was set to ensure straightforward reproducibility of experiments.
trFunction <- TwoClass$new(method = "cv", number = 10, savePredictions = "final",
                           classProbs = TRUE, allowParallel = TRUE,
                           verboseIter = FALSE, seed = 1234)

##  - Specify the models to be trained
ex.classifiers <- c("ranger", "lda", "lda2")

## Initialize D2MCS
d2mcs <- D2MCS$new(dir.path = tempdir(),
                   num.cores = 1)

## Execute training stage using 'MCC' and 'PPV' measures to optimize model hyperparameters.
trained.models <- d2mcs$train(train.set = train.set,
                              train.function = trFunction,
                              ex.classifiers = ex.classifiers,
                              metrics = c("MCC", "PPV"))

## Execute classification stage using two different voting schemes
predictions <- d2mcs$classify(train.output = trained.models,
                              subset = test.subset,
                              voting.types = c(
                                    SingleVoting$new(voting.schemes = c(ClassMajorityVoting$new(),
                                                                        ClassWeightedVoting$new()),
                                                     metrics = c("MCC", "PPV"))))

## Compute the performance of each voting scheme using PPV and MCC measures.
predictions$getPerformances(test.subset, measures = list(MCC$new(), PPV$new()))

## Execute classification stage using multiple voting schemes (simple and combined)
predictions <- d2mcs$classify(train.output = trained.models,
                              subset = test.subset,
                              voting.types = c(
                                    SingleVoting$new(voting.schemes = c(ClassMajorityVoting$new(),
                                                                         ClassWeightedVoting$new()),
                                                      metrics = c("MCC", "PPV")),
                                    CombinedVoting$new(voting.schemes = ClassMajorityVoting$new(),
                                                        combined.metrics = MinimizeFP$new(),
                                                        methodology = ProbBasedMethodology$new(),
                                                        metrics = c("MCC", "PPV"))))

## Compute the performance of each voting scheme using PPV and MCC measures.
predictions$getPerformances(test.subset, measures = list(MCC$new(), PPV$new()))

## End(Not run)
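
The example above splits the data into four partitions while keeping the balance ratio of the 'Class' column. The base R sketch below shows what class-balanced (stratified) partitioning means in general. It is not the createPartitions implementation: the helper stratified_folds and the synthetic labels are assumptions made for illustration.

```r
set.seed(1234)

# Hypothetical helper, not the D2MCS implementation: assigns each row to one
# of num.folds partitions so that every class is spread evenly across folds.
stratified_folds <- function(y, num.folds) {
  folds <- integer(length(y))
  for (lvl in unique(y)) {
    idx <- which(y == lvl)
    # Recycle fold labels 1..num.folds over the class, then shuffle them.
    folds[idx] <- sample(rep_len(seq_len(num.folds), length(idx)))
  }
  folds
}

# Synthetic labels: 40 instances of class "0" and 20 of class "1".
y <- rep(c("0", "1"), times = c(40, 20))
folds <- stratified_folds(y, num.folds = 4)

# Cross-tabulate fold membership against class to inspect the balance.
fold.table <- table(folds, y)
```

Each fold keeps the overall 2:1 class ratio, so a subset built from any combination of folds has the same class balance as the full data.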



D2MCS documentation built on Aug. 23, 2022, 5:07 p.m.