All code and related materials can be found on Github at https://github.berkeley.edu/jake-anderson/GA. This code has been formatted as an R package. It may be used by running the following command:
devtools::install_github("jakemanderson/GA")
We implemented the genetic feature selection algorithm in this project. The code is all written in functional form and vectorized to improve computational efficiency. For any given dataset with N columns (features), the function will select features under the metric given by the user and output the names of the selected columns.
The code has 3 helper functions. First is the Fitness Assignment which builds a linear model (or generalized linear model if user specifies) using the chromosome vector and output the fitting score. In our implementation, the chromosome vector is a vector with 0 or 1 as entries. 1 means the corresponding feature is selected, 0 otherwise. The second part is called Selection by Ranking; for each chromosome vector, it will assign a score to it and rank them according to the score. The Fitness Assignment function will be called repeatedly for each computation. The third step is called "crossover". Top score vectors will have the ability to generate their children following the "one child policy". This cross over process follows Darwen assumption. E.g. [1,0,1] and [1,0,0] would have children [1,0,1] and [1,0,0] with the same probability. Mutation could happen in this process with certain rate prespecified.
The main function will call each three functions above repeatedly i.e. each child vector will fit the model again and be ranked by their fitting scores. The process will end until the score converges.
Yinuo Sun drew up a color-coded flowchart for the algorithm, researched related algorithms, and at our first meeting, suggested a plan of action and timeline for the workflow. Yinuo wrote the main functions found in select.R
, and ran the function on some examples. Zhongyao Ma found additional examples to run the function on, and carried out several iterations of improvement on the select function, including vectorizing appropriate functions. Jake composed this document, wrote tests, created the package, maintained the github throughout version changes, debugged through the final itertion of the package, and compiled the final product.
# Initialize a chromosome vector initialize_gene <- function(chromo_length){ chromo <- sample(c(TRUE, FALSE), chromo_length, replace = TRUE) } #### Fitness assignment #### fitness_assignment <- function(chromo, x, y, fit_method = 'lm', metric = 'aic'){ mini_df <- data.frame(y, x[, chromo]) if (fit_method == 'lm'){ linearMod <- lm(y ~ ., mini_df) } else if(fit_method =='glm'){ linearMod <- glm(y ~ ., mini_df, family = gaussian) } else{ print("Method not defined.") exit() } if(class(metric) == "character"){ residual <- residuals(linearMod) if (metric == 'aic' || metric == 'bic'){ score <- eval(parse(text=paste0(toupper(metric), "(linearMod)"))) } else if(metric == 'rsquare_negative'){ score <- -eval(parse(text=paste0("summary(linearMod)$r.squared"))) } else if(metric == 'rmse'){ score <- sqrt(mean(residual^2, na.rm = TRUE)) } else if(metric == 'mae'){ score <- mean(abs(residual), na.rm = TRUE) } } else if(class(metric) == "function"){ score <- eval(parse(text=paste0("metric", "(linearMod, mini_df)"))) } else{ print("Method not defined.") exit() } return(score) } #### Selection by ranking #### selection <- function(gene_mat, x, y, fit_method = 'lm', metric = 'aic'){ # rank matrix has P rows, 2 columns [index, score] P <- dim(gene_mat)[1] P_parent <- ceiling(P / 2) rank_mat <- matrix(c(1 : P, rep(0, P)), ncol = 2, nrow = P) rank_mat[, 2] <- t(apply(gene_mat, 1, function(k) fitness_assignment(k, x, y, fit_method, metric))) rank_mat <- rank_mat[order(rank_mat[, 2], decreasing = FALSE), ] print(mean(rank_mat[, 2])) parant_gene <- gene_mat[rank_mat[1 : P_parent, 1], ] return(list(parant_gene, mean(rank_mat[, 2]))) } #### Cross over #### cross_over_two_chromo <- function(chromo_1, chromo_2){ C <- length(chromo_1) mutation_rate <- 1 / C selection_sampler <- sample(c(TRUE, FALSE), C, replace = TRUE) result <- (chromo_1 * selection_sampler + chromo_2 * (1 - selection_sampler) == 1) ## mutation ## mutation_sampler <- (runif(C) < mutation_rate) result <- !(result == mutation_sampler) return(result) } # Single child policy cross_over <- function(parant_gene, P){ P_parent <- ceiling(P / 2) C <- dim(parant_gene)[2] gene_mat <- matrix(rep(FALSE, C * P), ncol = C, nrow = P) p_mat <- matrix(sample.int(2 * P, n = P_parent, replace = T), ncol = 2, nrow = P) gene_mat <- t(apply(p_mat, 1, function(x) cross_over_two_chromo(parant_gene[x[1], ], parant_gene[x[2], ]))) return(gene_mat) } select <- function(data, target, fit_method = 'lm', metric = 'aic'){ y <- data[, target] x <- data[, names(data) != target] #### Initialization #### # Parameters C <- dim(x)[2] # chromosome_length (# of features) P <- C * 10 # population size mutation_rate <- 1 / C repetition <- 30 gene_mat <- matrix(rep(initialize_gene(C), P), ncol = C, nrow = P) # For graph error_record = rep(NA, repetition) feature_count = rep(NA, repetition) # Start for (k in 1 : repetition){ temp <- selection(gene_mat, x, y, fit_method, metric) parent_gene <- temp[[1]] gene_mat <- cross_over(parent_gene, P) # For graph error_record[k] <- temp[[2]] feature_count[k] <- mean(parent_gene[1, ]) } result_column_boolean <- parent_gene[1, ] result_column_names <- names(x)[result_column_boolean] # For graph return(list(result_column_names, error_record, repetition, feature_count)) }
Here is Example 5:
# Example 5 # Basic data set loading and test of function lm() # Load a build in data set. BostonHousing library(mlbench) data(BostonHousing) res <- select(BostonHousing, 'medv', fit_method = 'lm', metric = 'aic')
Here is the trend of errors with regards to generation:
# For graph library(ggplot2) ggplot(data.frame(Generation=c(1:res[[3]]), Feature_Selection_Ratio=res[[4]]), aes(x=Generation, y=Feature_Selection_Ratio)) + geom_point() + geom_line()
This is a graph for illustration of the genetic algorithm.
# For illustration # install.packages("Gmisc") library(grid) suppressMessages(library(Gmisc)) grid.newpage() # create boxes (b1 <- boxGrob("Algorithm Initialization", x=.4, y=.9, box_gp = gpar(fill = "lightgreen"), width = .25)) (b2 <- boxGrob("Step 1: Fitting", x=.4, y=.78, box_gp = gpar(fill = "lightblue"), width = .25)) (b3 <- boxGrob("Step 2: Selection", x=.4, y=.66, box_gp = gpar(fill = "wheat"), width = .25)) (b4 <- boxGrob("Step 3: Crossover", x=.4, y=.54, box_gp = gpar(fill = "lightpink"), width = .25)) (b5 <- boxGrob("Step 4: Mutation", x=.4, y=.42, box_gp = gpar(fill = "yellow"), width = .25)) (b6 <- boxGrob("Check Convergence", x=.2, y=.28, box_gp = gpar(fill = "lightgrey"), width = .25)) (b7 <- boxGrob("Algorithm Output", x=.4, y=.15, box_gp = gpar(fill = "lightgreen"), width = .25)) (b8 <- boxGrob("No", x=.2, y=.35, box_gp = gpar(fill = "lightgrey"), width = .05)) (b9 <- boxGrob("Yes", x=.2, y=.21, box_gp = gpar(fill = "lightgrey"), width = .05)) (b10 <- boxGrob("Randomly generate samples", x=.75, y=.78, box_gp = gpar(fill = "lightgrey"), width = .3)) (b11 <- boxGrob("Randomly select samples ", x=.75, y=.66, box_gp = gpar(fill = "lightgrey"), width = .3)) (b12 <- boxGrob("Recombine samples and filter", x=.75, y=.54, box_gp = gpar(fill = "lightgrey"), width = .3)) (b13 <- boxGrob("Randomly modify samples", x=.75, y=.42, box_gp = gpar(fill = "lightgrey"), width = .3)) (b14 <- boxGrob("Feature selection indicator\n [ 0 1 1 1 0 1 0 0 0 0 1 1 0 1 ]", x=.75, y=.15, box_gp = gpar(fill = "lightgrey"), width = .3)) (b15 <- boxGrob("Fitness metrics:\n AIC, BIC, RMSE, R2, MAE ...", x=.75, y=.28, box_gp = gpar(fill = "lightgrey"), width = .3)) # connect boxes connectGrob(b1, b2, "v") connectGrob(b2, b3, "v") connectGrob(b3, b4, "v") connectGrob(b4, b5, "v") connectGrob(b5, b6, "L") connectGrob(b6, b2, "L") connectGrob(b6, b7, "L")
Additional examples of running the code successfully can be found in /vignettes/Examples.R
. We feel seeing the code line by line with different data is often most useful to understanding the procedure.
Gomez, F. and Quesada, A. Genetic algorithms for feature selection Retrieved from
https://www.neuraldesigner.com/blog/genetic_algorithms_for_feature_selection
Givens, H. G. and Hoeting, A. J. Computational Statistics, Second Edition, Section 3.4, John Wiley & Sons, Inc, 2012
Kuhn. M. Feature Selection using Genetic Algorithms. Retrieved from
https://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.