select: Variable Selection by Genetic Algorithm.
In zhrlin/GA: What the Package Does (One Line, Title Case)



select performs variable selection in regression problems (lm/glm) using a genetic algorithm. The algorithm is based on section 3.4 of Computational Statistics by Givens and Hoeting.

select(formula, data, fitness = NULL, family = gaussian, m = 0.01, gap = 1)

`formula`	an object of class `formula` (or one that can be coerced to that class) that specifies the regression model that needs variable selection.
`data`	a data frame (or one that can be coerced to that class) containing the variables for regression.
`fitness`	an optional function to be used as the fitness function; negative AIC is used by default. A higher value should indicate greater fitness.
`family`	an optional family function that gives the error distribution and link function to be used in the model. Gaussian by default if none is specified.
`m`	an optional number between 0 and 1 that specifies the mutation rate (the probability that mutation occurs) to be used in the process. The default value is 0.01.
`gap`	an optional number specifying the generation gap; the default value is 1. Note that when `gap = 1`, all individuals in each generation are replaced by the generated offspring. Thus, `gap = 1` corresponds to distinct and nonoverlapping generations. (The details of the generation gap are given under 'Details'.)

The genetic algorithm used can be broken down into five components:

the first generation (initial population)
fitness
selection of parents
crossover
mutation

The first generation is created at random using binary values. This can be thought of as encoding the genes in a chromosome. Specifically, the length of each chromosome C equals the number of terms in the formula response~terms, while the generation size P is taken to be twice this length.
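As a rough illustration of this encoding, the sketch below builds a random binary population for a formula. The helper name `init_population` is hypothetical and is not part of the package; the only details taken from the description above are the chromosome length C and the generation size P = 2C.

```r
# Hypothetical helper (not the package's code): random initial population.
init_population <- function(formula) {
  # One gene per term on the right-hand side of the formula
  term_labels <- attr(terms(formula), "term.labels")
  C <- length(term_labels)  # chromosome length
  P <- 2 * C                # generation size
  # Each row is a chromosome: 1 keeps the term, 0 drops it
  pop <- matrix(sample(0:1, P * C, replace = TRUE), nrow = P, ncol = C)
  colnames(pop) <- term_labels
  pop
}

set.seed(1)
pop <- init_population(mpg ~ cyl + disp + hp + wt)
dim(pop)  # 8 rows (P = 2C) by 4 columns (C)
```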

The fitness function measures how fit an individual is. A fitness function should return a greater value when an individual is fitter. By default, select uses the negative AIC to determine the fitness of each individual. Additionally, to avoid premature convergence when fitness values vary widely or the form of the function causes other difficulties, select adopts a rank-based method in which selectivity is based on relative fitness. A higher fitness rank within the population then means a higher probability of being chosen for reproduction. Specifically, probability = rank/sum(rank).
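The rank-based probabilities can be sketched as follows. Both helper names are hypothetical, and the fitness helper assumes a plain `lm` fit for simplicity, whereas select also accepts a `family` for `glm`; treat this as an illustration of the idea rather than the package's implementation.

```r
# Hypothetical helper: fitness of one chromosome as the negative AIC
# of the model that keeps only the terms switched on in `chrom`.
chromosome_fitness <- function(chrom, term_labels, response, data) {
  kept <- term_labels[chrom == 1]
  # Assumption: an all-zero chromosome maps to the intercept-only model
  rhs <- if (length(kept) > 0) kept else "1"
  fit <- lm(reformulate(rhs, response = response), data = data)
  -AIC(fit)
}

# Rank-based selection probabilities: rank / sum(rank)
rank_probs <- function(fitness_values) {
  r <- rank(fitness_values)  # the fittest individual gets the largest rank
  r / sum(r)
}

term_labels <- c("cyl", "disp", "hp", "wt")
pop <- rbind(c(1, 0, 0, 1), c(1, 1, 1, 1), c(0, 0, 1, 0))
fit_vals <- apply(pop, 1, chromosome_fitness,
                  term_labels = term_labels, response = "mpg", data = mtcars)
probs <- rank_probs(fit_vals)
```

Because only ranks enter the probabilities, a single chromosome with an extreme fitness value cannot dominate the draw, which is the motivation given above for the rank-based approach.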

Parents are selected according to their fitness rank, the idea being to let the fittest individuals pass their genes to the next generation. select adopts a simple scheme: for each pair of parents, one is selected with probability proportional to its fitness rank, using the method described in the paragraph above, while the other is selected completely at random. This encourages variation among offspring and mimics natural selection. select repeats this selection process 100 times to produce a final population that in theory displays greater fitness than the first generation. Note also that select does not allow duplicate individuals in the population, as duplicates distort the parent selection criterion by inflating the probability that a duplicated chromosome produces offspring. Specifically, select eliminates duplicates by comparing each new offspring to the previously generated offspring as well as to the individuals that will move on to the next generation.
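A minimal sketch of the two ideas in this paragraph, under stated assumptions: `select_parents` and `drop_duplicates` are hypothetical names, the two parents are not forced to be distinct (the text does not say whether they must be), and duplicates are detected by comparing whole rows.

```r
# Hypothetical helper: draw one parent by rank-based probability
# and the other uniformly at random. Returns two row indices.
select_parents <- function(probs) {
  P <- length(probs)
  first <- sample.int(P, size = 1, prob = probs)  # rank-weighted draw
  second <- sample.int(P, size = 1)               # uniform draw
  c(first, second)
}

# Hypothetical helper: keep only offspring rows that duplicate neither
# an existing individual nor an earlier offspring.
drop_duplicates <- function(offspring, existing) {
  combined <- rbind(existing, offspring)
  is_dup <- duplicated(combined)  # TRUE for repeats of an earlier row
  offspring[!tail(is_dup, nrow(offspring)), , drop = FALSE]
}

set.seed(42)
parents <- select_parents(c(0.1, 0.2, 0.7))

existing <- rbind(c(1, 0, 1))
offspring <- rbind(c(1, 0, 1), c(0, 1, 1), c(0, 1, 1))
kept <- drop_duplicates(offspring, existing)  # only the first 0 1 1 row survives
```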

Populations can be partially updated by specifying the generation gap. The generation gap is the proportion of the generation to be replaced by the generated offspring. When gap = 1, each generation is distinct, though users should be aware that the fittest chromosomes of previous generations may disappear even if the subsequent offspring show no improvement in fitness. gap < 1 avoids this possibility, though it may reduce variability in the population.
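One way to picture a partial update is sketched below. The helper `apply_gap` is hypothetical, and the choice to replace the least fit individuals first is an assumption made for illustration; the description above does not state which individuals select replaces.

```r
# Hypothetical helper: replace a proportion `gap` of the population.
# Assumption: the least fit individuals are the ones replaced.
apply_gap <- function(pop, fitness_values, offspring, gap = 1) {
  P <- nrow(pop)
  n_replace <- min(ceiling(gap * P), nrow(offspring))
  if (n_replace == 0) return(pop)
  worst <- order(fitness_values)[seq_len(n_replace)]  # lowest fitness first
  pop[worst, ] <- offspring[seq_len(n_replace), , drop = FALSE]
  pop
}

pop <- rbind(c(1, 1, 1), c(0, 0, 1), c(1, 0, 0), c(0, 1, 0))
fitness_values <- c(4, 1, 3, 2)
offspring <- matrix(0, nrow = 4, ncol = 3)

half <- apply_gap(pop, fitness_values, offspring, gap = 0.5)  # rows 2 and 4 replaced
full <- apply_gap(pop, fitness_values, offspring, gap = 1)    # every row replaced
```

With gap = 0.5 the two fittest rows survive unchanged, which illustrates how a gap below 1 protects good chromosomes at the cost of less turnover.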

Crossover is the fundamental genetic operator in the genetic algorithm. Here, it is done by the simplest method: select a random position between two adjacent loci and split both parent chromosomes at this position, then glue the left chromosome segment from one parent to the right segment from the other parent to form an offspring chromosome. For example, if the two parents are 100110001 and 110100110, and the random split point is between the third and fourth loci, then the potential offspring are 100100110 and 110110001. During crossover, select keeps only one offspring chromosome for each pair of parents; the second offspring, formed by combining the remaining segments, is discarded.
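The worked example above can be reproduced with a short sketch. The function name `crossover` and its `split` argument are hypothetical, and keeping the offspring that takes its left segment from the first parent is an assumption; the text only says that one of the two offspring is kept.

```r
# Hypothetical helper: single-point crossover that keeps one offspring.
crossover <- function(parent1, parent2, split = NULL) {
  C <- length(parent1)
  # The split falls after locus `split`, so it ranges over 1 .. C - 1
  if (is.null(split)) split <- sample.int(C - 1, size = 1)
  c(parent1[seq_len(split)], parent2[(split + 1):C])
}

p1 <- c(1, 0, 0, 1, 1, 0, 0, 0, 1)
p2 <- c(1, 1, 0, 1, 0, 0, 1, 1, 0)
child <- crossover(p1, p2, split = 3)  # 1 0 0 1 0 0 1 1 0
```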

Mutation, another important genetic operator, ensures diversity within the population. Here, it is done by changing an offspring chromosome: one or more alleles are randomly introduced in loci where those alleles are not seen in the corresponding loci of either parent chromosome. This implies that some of the bits could be flipped. The default probability of mutation is 0.01, though users can specify their own mutation rate as desired. When choosing the mutation rate, keep in mind that an overly high rate disturbs the fitness selectivity, while an overly low rate discourages evolution and misses potential improvements.
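A bit-flip mutation can be sketched as below. The name `mutate` is hypothetical, and applying the rate `m` independently at every locus is an assumption; the description does not say whether select applies the rate per locus or per chromosome.

```r
# Hypothetical helper: flip each bit independently with probability m.
mutate <- function(chrom, m = 0.01) {
  flip <- runif(length(chrom)) < m  # TRUE where a locus mutates
  chrom[flip] <- 1 - chrom[flip]    # 0 becomes 1 and 1 becomes 0
  chrom
}

set.seed(7)
child <- c(1, 0, 0, 1, 0, 0, 1, 1, 0)
mutated <- mutate(child, m = 0.05)
```

Setting m = 0 leaves every chromosome untouched and m = 1 flips every bit, which makes the two extremes discussed above easy to check.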

The algorithm terminates when it reaches the maximum number of iterations. select uses n = 100 iterations. If the formula has two or fewer independent variables, no iterations are performed.

select returns a list of the selection results.

Specifically, the returned list contains four elements:

last: a matrix that contains the chromosomes generated in the final iteration
Neg: a 100 by 2C matrix that contains the negative fitness (negative AIC if default adopted) of each chromosome in each iteration
selected: the selected independent variables
fitness: the corresponding fitness for the selected variables

Givens, Geof H., and Jennifer A. Hoeting. Computational Statistics. Wiley, 2013.

# An example with the built-in mtcars dataset
formula <- mpg~cyl+disp+hp+cyl:disp+wt+gear+carb
result <- select(formula, mtcars)
selected_vars <- result$selected
max_negAIC <- result$fitness


# A plotting example with the built-in airquality dataset
formula <- Ozone~Solar.R+Wind+Temp+Month+Wind:Temp
result <- select(formula, airquality, m = 0.05)
AIC <- -result$Neg
averaged_AIC <- apply(AIC, 1, mean)
plot(averaged_AIC, type = 'l')