gaOptimizeSegmentationParams: Optimization of segmentation parameters using genetic...

View source: R/SOptim_OptimizationFunctions.R

gaOptimizeSegmentationParams    R Documentation

Optimization of segmentation parameters using genetic algorithms

Description

This function makes some data checks and then performs the optimization of segmentation parameters using genetic algorithms.

Usage

gaOptimizeSegmentationParams(
  rstFeatures,
  trainData,
  segmentMethod,
  trainThresh = 0.5,
  segmStatsFuns = c("mean", "sd"),
  bylayer = FALSE,
  tiles = NULL,
  classificationMethod = "RF",
  classificationMethodParams = NULL,
  balanceTrainData = TRUE,
  balanceMethod = "ubUnder",
  evalMethod = "5FCV",
  trainPerc = 0.8,
  nRounds = 10,
  evalMetric = "Kappa",
  minTrainCases = 30,
  minCasesByClassTrain = 10,
  minCasesByClassTest = 10,
  minImgSegm = 30,
  ndigits = 2,
  verbose = TRUE,
  ...,
  lower,
  upper,
  population = GA::gaControl("real-valued")$population,
  selection = GA::gaControl("real-valued")$selection,
  crossover = GA::gaControl("real-valued")$crossover,
  mutation = GA::gaControl("real-valued")$mutation,
  popSize = 20,
  pcrossover = 0.8,
  pmutation = 0.1,
  elitism = base::max(1, round(popSize * 0.05)),
  maxiter = 100,
  run = 20,
  maxFitness = 1,
  keepBest = TRUE,
  parallel = FALSE,
  seed = NULL
)

Arguments

rstFeatures

Features used for supervised classification (typically a multi-layer SpatRaster with one feature per band). May be defined as a string with the path to a raster dataset or a RasterStack object.

trainData

Input train data used for supervised classification. It must be a SpatRaster containing train areas (in raster format).

segmentMethod

Character string used to define the segmentation method. Available options are:

  • "SAGA_SRG" - SAGA Simple Region Growing;

  • "GRASS_RG" - GRASS Region Growing;

  • "ArcGIS_MShift" - ArcGIS Mean Shift algorithm;

  • "Terralib_Baatz" - TerraLib Baatz algorithm;

  • "Terralib_MRGrow" - TerraLib Mean Region Growing;

  • "RSGISLib_Shep" - RSGISLib Shepherd algorithm;

  • "OTB_LSMS" - OTB Large Scale Mean Shift algorithm;

  • "OTB_LSMS2" - OTB Large Scale Mean Shift algorithm with two separate sets of parameters, one for mean-shift smoothing and another for large-scale segmentation step;

trainThresh

A threshold value in the interval ]0, 1] defining the minimum proportion of the segment that must be covered by a certain class for the segment to be considered a training case. This threshold only applies if x is a RasterLayer, which means you are using train areas/pixels. If you are running a "single-class" problem, this threshold only applies to the class of interest (coded as 1's): a segment whose proportion cover of that class is higher than thresh is considered a train case. In contrast, for the background class (coded as 0's), only segments/objects totally covered by that class are considered train cases. If you are running a "multi-class" problem, thresh is applied differently: first, the train class is determined by a majority rule; then, if that class covers more than the value specified in thresh, the case is kept in the train data, otherwise it is filtered out. See also useThresh.
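The labelling rule described for this threshold can be sketched in base R. This is only an illustration of the rule as worded here, not the SegOptim implementation: the function name labelSegment, its arguments, and the representation of a segment as a named vector of class cover proportions are all hypothetical.

```r
# Illustrative sketch only (hypothetical helper, not part of SegOptim).
# classProps: named numeric vector with the proportion of one segment
#             covered by each class (names are the class codes).
labelSegment <- function(classProps, thresh = 0.5, singleClass = TRUE) {
  if (singleClass) {
    p1 <- unname(classProps["1"])  # class of interest
    p0 <- unname(classProps["0"])  # background class
    # Class of interest: cover must be higher than thresh
    if (!is.na(p1) && p1 > thresh) return("1")
    # Background: only segments totally covered by that class
    if (!is.na(p0) && p0 == 1) return("0")
    return(NA_character_)          # segment is not used as a train case
  }
  # Multi-class: majority rule first, then the threshold
  top <- which.max(classProps)
  if (classProps[[top]] > thresh) names(classProps)[top] else NA_character_
}

# Single-class problem
labelSegment(c("0" = 0.4, "1" = 0.6), thresh = 0.5)
labelSegment(c("0" = 1.0, "1" = 0.0), thresh = 0.5)
labelSegment(c("0" = 0.7, "1" = 0.3), thresh = 0.5)

# Multi-class problem
labelSegment(c(a = 0.2, b = 0.7, c = 0.1), thresh = 0.5, singleClass = FALSE)
labelSegment(c(a = 0.4, b = 0.35, c = 0.25), thresh = 0.5, singleClass = FALSE)
```

In this sketch a returned NA means the segment would be dropped from the train data.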

segmStatsFuns

Aggregation function(s) (e.g., mean) applied to the elements within each segment. Either a function object or function name(s); the default is c("mean", "sd").

bylayer

Calculate statistics layer by layer instead of all at once? (slightly increases computation time but spares memory load; default: FALSE).

tiles

Number of times to slice the SpatRaster across row and column direction. The total number of tiles will be given by: N_{tiles} = nd^{2}, where nd is the number of slices in each direction.
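As a quick worked example of the tile count, assuming nd denotes the value passed in tiles (the variable names below are illustrative only):

```r
# Total number of tiles for a few slicing values (nd slices per direction)
nd <- 1:4
tileCounts <- data.frame(nd = nd, nTiles = nd^2)
print(tileCounts)
```

The count grows quadratically, so small increases in tiles produce many more (and smaller) tiles.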

classificationMethod

An input string defining the classification algorithm to be used. Available options are: "RF" (random forests), "GBM" (generalized boosted models), "SVM" (support vector machines), "KNN" (k-nearest neighbour), and "FDA" (flexible discriminant analysis).

classificationMethodParams

A list object with a customized set of parameters to be used for the classification algorithms (default = NULL). See also generateDefaultClassifierParams to see which parameters can be changed and how to structure the list object.

balanceTrainData

Defines if data balancing is to be used (only available for single-class problems; default: TRUE).

balanceMethod

A character string used to set the data balancing method. Available methods are based on under-sampling "ubUnder" or over-sampling "ubOver" the target class.

evalMethod

A character string defining the evaluation method. The available methods are "10FCV" (10-fold cross-validation), "5FCV" (5-fold cross-validation; the default in this function), "HOCV" (holdout cross-validation with the training percentage defined by trainPerc and the number of rounds defined in nRounds), and "OOB" (out-of-bag evaluation; only applicable to random forests).

trainPerc

A decimal number defining the training proportion (default: 0.8; if "HOCV" is used).

nRounds

Number of training rounds used for holdout cross-validation (default: 10; if "HOCV" is used).

evalMetric

A character string setting the evaluation metric, or a function that calculates the performance score from two vectors, one for observed and the other for predicted values (see below for more details). This option defines the outcome value of the genetic algorithm fitness function and the output of grid or random search optimization routines. Check evalPerformanceGeneric for available options. When runFullCalibration=TRUE this metric will be calculated; however, other evaluation metrics can be quantified using evalPerformanceClassifier.
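A user-defined metric of the kind described above could look like the following sketch. The function name overallAccuracy and the argument names and order (observed first, predicted second) are assumptions made for illustration; check evalPerformanceGeneric for the signature the package actually expects.

```r
# Hypothetical custom metric: overall accuracy.
# Assumption: the function receives observed values first, predicted second.
overallAccuracy <- function(obs, pred) {
  stopifnot(length(obs) == length(pred))
  mean(obs == pred)  # proportion of cases predicted correctly
}

# Toy vectors: 4 of the 5 cases agree
obs  <- c(1, 0, 1, 1, 0)
pred <- c(1, 0, 0, 1, 0)
overallAccuracy(obs, pred)
```

A function like this returns a single score where higher is better, which is what a fitness value to be maximized requires.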

minTrainCases

The minimum number of training cases used for calibration (default: 30). If the number of rows in x is below this number then calibrateClassifier will not run.

minCasesByClassTrain

Minimum number of cases by class for each train data split so that the classifier is able to run.

minCasesByClassTest

Minimum number of cases by class for each test data split so that the classifier is able to run.

minImgSegm

Minimum number of image segments/objects necessary to generate train data.

ndigits

Number of decimal places to consider when rounding the fitness function output. For example, if ndigits=2 then only improvements of 0.01 (in the rounded value) will be considered by the genetic algorithm.
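The effect of rounding can be seen with a few made-up fitness values (the numbers and variable names below are illustrative only):

```r
ndigits <- 2

fitnessA <- 0.8412
fitnessB <- 0.8449  # slightly better than A before rounding
fitnessC <- 0.8451  # crosses the next rounding boundary

# A and B are indistinguishable after rounding, so B is not an improvement
sameAfterRounding <- round(fitnessA, ndigits) == round(fitnessB, ndigits)

# C rounds up to the next step, so it counts as an improvement over A
improvedAfterRounding <- round(fitnessC, ndigits) > round(fitnessA, ndigits)

print(c(sameAfterRounding = sameAfterRounding,
        improvedAfterRounding = improvedAfterRounding))
```

Because comparisons happen on the rounded values, a larger ndigits makes the search sensitive to smaller gains, which interacts with the run stopping criterion.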

verbose

Print output messages? (default: TRUE).

...

Additional parameters passed to the segmentation functions that will not be optimized (see also: segmentationGeneric). It must also contain the input segmentation data (typically a multi-layer SpatRaster dataset with one input feature per band) depending on the algorithm selected.

lower

a vector of length equal to the decision variables providing the lower bounds of the search space in case of real-valued or permutation encoded optimizations. Formerly this argument was named min; its usage is allowed but deprecated.

upper

a vector of length equal to the decision variables providing the upper bounds of the search space in case of real-valued or permutation encoded optimizations. Formerly this argument was named max; its usage is allowed but deprecated.

population

an R function for randomly generating an initial population. See ga_Population for available functions.

selection

an R function performing selection, i.e. a function which generates a new population of individuals from the current population probabilistically according to individual fitness. See ga_Selection for available functions.

crossover

an R function performing crossover, i.e. a function which forms offsprings by combining part of the genetic information from their parents. See ga_Crossover for available functions.

mutation

an R function performing mutation, i.e. a function which randomly alters the values of some genes in a parent chromosome. See ga_Mutation for available functions.

popSize

the population size.

pcrossover

the probability of crossover between pairs of chromosomes. Typically this is a large value and by default is set to 0.8.

pmutation

the probability of mutation in a parent chromosome. Usually mutation occurs with a small probability, and by default is set to 0.1.

elitism

the number of best fitness individuals to survive at each generation. By default the top 5% individuals will survive at each iteration.
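The default expression from the Usage section, max(1, round(popSize * 0.05)), can be evaluated for a few population sizes to see how many individuals survive (the popSizes vector is illustrative only):

```r
# Default elitism for several population sizes, using the expression
# shown in the Usage section
popSizes <- c(10, 20, 40, 100)
elitismDefault <- sapply(popSizes, function(p) base::max(1, round(p * 0.05)))
print(data.frame(popSize = popSizes, elitism = elitismDefault))
```

The max(1, ...) guard means at least one individual always survives, even for small populations where 5% rounds to zero.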

maxiter

the maximum number of iterations to run before the GA search is halted.

run

the number of consecutive generations without any improvement in the best fitness value before the GA is stopped.

maxFitness

the upper bound on the fitness function after which the GA search is interrupted.

keepBest

a logical argument specifying if best solutions at each iteration should be saved in a slot called bestSol. See ga-class.

parallel

An optional argument which specifies whether the Genetic Algorithm should be run sequentially or in parallel.

For a single machine with multiple cores, possible values are:

  • a logical value specifying if parallel computing should be used (TRUE) or not (FALSE, default) for evaluating the fitness function;

  • a numerical value which gives the number of cores to employ. By default, this is obtained from the function detectCores;

  • a character string specifying the type of parallelisation to use. This depends on system OS: on Windows OS only "snow" type functionality is available, while on Unix/Linux/Mac OSX both "snow" and "multicore" (default) functionalities are available.

In all the cases described above, at the end of the search the cluster is automatically stopped by shutting down the workers.

If a cluster of multiple machines is available, evaluation of the fitness function can be executed in parallel using all, or a subset of, the cores available to the machines belonging to the cluster. However, this option requires more work from the user, who needs to set up and register a parallel back end. In this case the cluster must be explicitly stopped with stopCluster.

seed

an integer value containing the random number generator state. This argument can be used to replicate the results of a GA search. Note that if parallel computing is required, the doRNG package must be installed.

Details

– INTRODUCTION –

Genetic algorithms (GAs) are stochastic search algorithms inspired by the basic principles of biological evolution and natural selection. GAs simulate the evolution of living organisms, where the fittest individuals dominate over the weaker ones, by mimicking the biological mechanisms of evolution, such as selection, crossover and mutation. The GA package is a collection of general purpose functions that provide a flexible set of tools for applying a wide range of genetic algorithm methods (from package GA). By default SegOptim uses genetic algorithm optimization for "real-valued" type, i.e., optimization problems where the decision variables (i.e., segmentation parameters) are floating-point representations of real numbers.
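To make the "real-valued" setting concrete, the sketch below runs GA::ga directly on a toy fitness function with two decision variables. The fitness function and bounds are invented for illustration and have nothing to do with image segmentation; in SegOptim the fitness is instead a classification score obtained after segmenting the image. The call is guarded so that it only runs if the GA package is installed.

```r
if (requireNamespace("GA", quietly = TRUE)) {
  # Toy fitness with a single maximum of 1 at (2, -1)
  toyFitness <- function(x) 1 - ((x[1] - 2)^2 + (x[2] + 1)^2) / 100

  res <- GA::ga(
    type    = "real-valued",
    fitness = toyFitness,
    lower   = c(-5, -5),  # lower bounds, one per decision variable
    upper   = c(5, 5),    # upper bounds, one per decision variable
    popSize = 20,
    maxiter = 100,        # hard limit on the number of generations
    run     = 20,         # stop after 20 generations without improvement
    seed    = 42,
    monitor = FALSE       # suppress per-generation console output
  )

  print(res@solution)      # best decision variables found
  print(res@fitnessValue)  # best fitness value
  print(res@iter)          # generations actually run
}
```

The lower and upper vectors play the same role as the arguments of the same name in this function: one entry per parameter being optimized.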

– INPUT DATA PREPARATION –

TODO: ...

– COMPUTING TIME AND COMPLEXITY –

Depending on the size of the raster dataset and the number of segmentation features/layers used, running this function may take a long time. It is therefore crucial to use only a relevant subset (or subsets) of your data to run this procedure. Choosing an appropriate parameterization of the genetic algorithm is also key to decreasing computing time. For example, if popSize is set to 30 and maxiter to 100, then up to 30 \times 100 = 3000 image segmentation runs may be required before the optimization stops (running the segmentation is usually the most time-consuming task of the optimization procedure). However, if run is set to 20, the optimization stops as soon as 20 consecutive iterations record no improvement in fitness (i.e., the classification score) and returns the best set of parameters found. The bottom line is that setting these parameters appropriately is fundamental to getting good results in an acceptable amount of time.

On the other hand, classification algorithms also run in the background of the fitness function. Using the previous example, and assuming that evalMethod is set to "HOCV" and nRounds to 20, classification would run a maximum of 3000 \times 20 = 60000 times. Keep this in mind when setting nRounds, or select a more conservative cross-validation method such as 10- or 5-fold CV. The time required for training the classifier also depends on the input data, so a segmentation solution with a larger number of objects will take longer. The number of classification features (or variables) also affects computation time: a higher number of features makes classification algorithms run slower.
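The budget arithmetic from this section can be reproduced before launching a run. The variable names below are illustrative; the figures are upper bounds only, and the early-stop figure assumes the run criterion triggers at the earliest possible generation.

```r
# Settings from the example in the text
popSize <- 30
maxiter <- 100
run     <- 20
nRounds <- 20  # holdout rounds when evalMethod = "HOCV"

# Upper bound on image segmentation runs
maxSegmentationRuns <- popSize * maxiter

# Upper bound on classifier runs with holdout cross-validation
maxClassifierRuns <- maxSegmentationRuns * nRounds

# Upper bound if the search stalls from the start and stops at generation 'run'
earlyStopSegmentationRuns <- popSize * run

print(c(maxSegmentationRuns = maxSegmentationRuns,
        maxClassifierRuns = maxClassifierRuns,
        earlyStopSegmentationRuns = earlyStopSegmentationRuns))
```

Multiplying these counts by the measured time of a single segmentation and a single classifier fit on a data subset gives a rough worst-case runtime estimate.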

– SETTING PARAMETERS FOR OPTIMIZATION –

TODO: Controlling for 'biased' classification performance due to class imbalance

TODO: Setting appropriate ranges for segmentation algorithms

TODO: Setting appropriate genetic algorithm parametrization

Value

An object with GA optimization results. See ga-class for a description of available slots information.

Note

Do not use the parallel option when performing image segmentation with the OTB LSMS algorithm! Since the software uses a parallel implementation, this will probably freeze the system by consuming all CPU resources.

References

Scrucca L. (2013). GA: A Package for Genetic Algorithms in R. Journal of Statistical Software, 53(4), 1-37, http://www.jstatsoft.org/v53/i04/.

Scrucca L. (2016). On some extensions to GA package: hybrid optimisation, parallelisation and islands evolution. Submitted to R Journal. Pre-print available at: http://arxiv.org/abs/1605.01931.

See Also

  • Genetic algorithms: ga

  • Control parameters in GA: gaControl

  • Fitness function: fitFuncGeneric (this function controls much of the processes behind the optimization)


joaofgoncalves/SegOptim documentation built on Feb. 5, 2024, 11:10 p.m.