Authors

Jake Anderson, Yinuo Sun, and Zhongyao Ma

Code

All code and related materials can be found on GitHub at https://github.berkeley.edu/jake-anderson/GA. The code is organized as an R package and can be installed by running the following command:

devtools::install_github("jakemanderson/GA")

Introduction

In this project we implemented a genetic algorithm for feature selection. The code is written in a functional style and vectorized to improve computational efficiency. Given a dataset with N columns (features), the function selects a subset of features according to a user-specified fitness metric and returns the names of the selected columns.
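As a minimal sketch of the intended interface (assuming the package has been installed as above and attached with library(GA); the built-in mtcars data is used only for illustration):

# Minimal usage sketch: select predictors of mpg from mtcars by AIC
library(GA)
res <- select(mtcars, 'mpg', fit_method = 'lm', metric = 'aic')
res[[1]]  # names of the selected feature columns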

Code Structure

The code has three helper functions. The first, fitness assignment, builds a linear model (or a generalized linear model, if the user specifies one) from the features indicated by a chromosome vector and returns the resulting fitness score. In our implementation a chromosome is a vector with 0 or 1 entries: a 1 means the corresponding feature is selected, a 0 means it is not. The second, selection by ranking, assigns a score to each chromosome in the population and ranks the chromosomes by that score; the fitness assignment function is called once for each of these computations. The third step is crossover: the top-scoring vectors generate children under a "one child policy", meaning each pair of parents produces a single child. This crossover follows a Darwinian scheme in which each gene of the child is inherited from either parent with equal probability; for example, parents [1,0,1] and [1,0,0] differ only in the last position, so their child is [1,0,1] or [1,0,0] with equal probability. Mutation can also occur during this step, at a prespecified rate. The sketch below makes the encoding and crossover concrete.
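The following illustrative sketch is not part of the package code; it only pictures the chromosome encoding and the uniform crossover step described above:

# Illustrative sketch of the chromosome encoding and uniform crossover
parent_1 <- c(TRUE, FALSE, TRUE)    # i.e. [1,0,1]: features 1 and 3 selected
parent_2 <- c(TRUE, FALSE, FALSE)   # i.e. [1,0,0]: only feature 1 selected
# each gene of the child is inherited from either parent with equal probability
pick  <- sample(c(TRUE, FALSE), length(parent_1), replace = TRUE)
child <- ifelse(pick, parent_1, parent_2)
child  # either c(TRUE, FALSE, TRUE) or c(TRUE, FALSE, FALSE), with equal probability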

The main function calls the three helper functions above repeatedly, i.e. each generation of child vectors is fitted and ranked again by its fitness scores. The process ends once the score converges; in the implementation below this is approximated by running a fixed number of generations.
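A convergence check along the following lines could replace the fixed repetition count. This is a hypothetical sketch; the function name and tolerance are ours and are not part of the package:

# Hypothetical stopping rule (not part of the package): stop once the mean
# fitness score changes by less than a small tolerance between generations
converged <- function(error_record, k, tol = 1e-6){
  k > 1 && !is.na(error_record[k - 1]) &&
    abs(error_record[k] - error_record[k - 1]) < tol
}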

Division of Labor

Yinuo Sun drew up a color-coded flowchart for the algorithm, researched related algorithms, and at our first meeting suggested a plan of action and timeline for the workflow. Yinuo also wrote the main functions found in select.R and ran them on several examples. Zhongyao Ma found additional examples to run the function on and carried out several rounds of improvement on the select function, including vectorizing the appropriate helpers. Jake composed this document, wrote tests, created the package, maintained the GitHub repository through version changes, debugged the final iteration of the package, and compiled the final product.

Implementation

# Initialize a chromosome vector: each gene is TRUE (feature included)
# or FALSE (feature excluded) with equal probability
initialize_gene <- function(chromo_length){
  chromo <- sample(c(TRUE, FALSE), chromo_length, replace = TRUE)
  return(chromo)
}


#### Fitness assignment ####
# Fit a model using only the features flagged by the chromosome and
# return the fitness score under the requested metric
fitness_assignment <- function(chromo, x, y, fit_method = 'lm', metric = 'aic'){
  mini_df <- data.frame(y, x[, chromo])
  if (fit_method == 'lm'){
    linearMod <- lm(y ~ ., mini_df)
  } else if (fit_method == 'glm'){
    linearMod <- glm(y ~ ., mini_df, family = gaussian)
  } else {
    stop("Fitting method not defined.")
  }

  if (is.character(metric)){
    residual <- residuals(linearMod)
    if (metric == 'aic'){
      score <- AIC(linearMod)
    } else if (metric == 'bic'){
      score <- BIC(linearMod)
    } else if (metric == 'rsquare_negative'){
      score <- -summary(linearMod)$r.squared
    } else if (metric == 'rmse'){
      score <- sqrt(mean(residual^2, na.rm = TRUE))
    } else if (metric == 'mae'){
      score <- mean(abs(residual), na.rm = TRUE)
    } else {
      stop("Metric not defined.")
    }
  } else if (is.function(metric)){
    # a user-supplied metric receives the fitted model and the data used to fit it
    score <- metric(linearMod, mini_df)
  } else {
    stop("Metric not defined.")
  }
  return(score)
}
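
# For illustration only (not part of the package code): a standalone call on a
# small synthetic data set, scoring a chromosome that keeps features 1 and 3
#   x_toy <- data.frame(a = rnorm(20), b = rnorm(20), c = rnorm(20))
#   y_toy <- x_toy$a + 2 * x_toy$c + rnorm(20)
#   fitness_assignment(c(TRUE, FALSE, TRUE), x_toy, y_toy, 'lm', 'aic')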


#### Selection by ranking ####
# Score every chromosome, rank the population by score, and keep the
# best half as parents for the next generation
selection <- function(gene_mat, x, y, fit_method = 'lm', metric = 'aic'){
  # rank matrix has P rows and 2 columns: [index, score]
  P <- dim(gene_mat)[1]
  P_parent <- ceiling(P / 2)
  rank_mat <- matrix(c(1 : P, rep(0, P)), ncol = 2, nrow = P)
  rank_mat[, 2] <- apply(gene_mat, 1, function(k) fitness_assignment(k, x, y, fit_method, metric))
  rank_mat <- rank_mat[order(rank_mat[, 2], decreasing = FALSE), ]
  print(mean(rank_mat[, 2]))  # report the mean score of the current generation
  parent_gene <- gene_mat[rank_mat[1 : P_parent, 1], ]
  return(list(parent_gene, mean(rank_mat[, 2])))
}


#### Cross over ####
# Produce one child from two parents by uniform crossover, then mutate:
# each gene flips independently with probability 1 / C
cross_over_two_chromo <- function(chromo_1, chromo_2){
  C <- length(chromo_1)
  mutation_rate <- 1 / C
  # each gene of the child is inherited from either parent with equal probability
  selection_sampler <- sample(c(TRUE, FALSE), C, replace = TRUE)
  result <- (chromo_1 * selection_sampler + chromo_2 * (1 - selection_sampler) == 1)

  ## mutation: flip the genes picked by the mutation sampler ##
  mutation_sampler <- (runif(C) < mutation_rate)
  result <- xor(result, mutation_sampler)
  return(result)
}
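
# For illustration only: crossing two length-5 parents yields one child of the
# same length, with each gene taken from either parent (and possibly mutated)
#   cross_over_two_chromo(c(TRUE, TRUE, FALSE, FALSE, TRUE),
#                         c(FALSE, TRUE, TRUE, FALSE, FALSE))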

# Single child policy: each randomly chosen pair of parents produces one child
cross_over <- function(parent_gene, P){
  P_parent <- ceiling(P / 2)
  C <- dim(parent_gene)[2]
  # draw 2 * P parent indices, i.e. P random (parent 1, parent 2) pairs
  p_mat <- matrix(sample.int(P_parent, 2 * P, replace = TRUE), ncol = 2, nrow = P)
  gene_mat <- t(apply(p_mat, 1, function(x) cross_over_two_chromo(parent_gene[x[1], ], parent_gene[x[2], ])))
  return(gene_mat)
}

select <- function(data, target, fit_method = 'lm', metric = 'aic'){
  y <- data[, target]
  x <- data[, names(data) != target]
  #### Initialization ####
  # Parameters
  C <- dim(x)[2]          # chromosome length (# of candidate features)
  P <- C * 10             # population size
  mutation_rate <- 1 / C  # (also set inside cross_over_two_chromo)
  repetition <- 30        # number of generations
  # each row of gene_mat is an independently generated random chromosome
  gene_mat <- t(replicate(P, initialize_gene(C)))

  # For graph
  error_record <- rep(NA, repetition)
  feature_count <- rep(NA, repetition)

  # Start
  for (k in 1 : repetition){
    temp <- selection(gene_mat, x, y, fit_method, metric)
    parent_gene <- temp[[1]]
    gene_mat <- cross_over(parent_gene, P)

    # For graph
    error_record[k] <- temp[[2]]
    feature_count[k] <- mean(parent_gene[1, ])
  }
  # the top-ranked chromosome of the final generation gives the selected features
  result_column_boolean <- parent_gene[1, ]
  result_column_names <- names(x)[result_column_boolean]

  # For graph
  return(list(result_column_names, error_record, repetition, feature_count))
}

Here is Example 5:

# Example 5
# Basic data set loading and test of function lm()
# Load a built-in data set: BostonHousing
library(mlbench)
data(BostonHousing)
res <- select(BostonHousing, 'medv', fit_method = 'lm', metric = 'aic')
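
The first element of the returned list contains the names of the selected features; because the algorithm is stochastic, the exact set can vary between runs:

# Selected feature names for the BostonHousing model
res[[1]]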

Here is the trend of the feature selection ratio (the fraction of features kept by the top-ranked chromosome) across generations:

# For graph
library(ggplot2)
ggplot(data.frame(Generation=c(1:res[[3]]), Feature_Selection_Ratio=res[[4]]), 
       aes(x=Generation, y=Feature_Selection_Ratio)) + 
  geom_point() + geom_line()
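
The mean fitness score of each generation is stored in res[[2]], so the corresponding error trend can be plotted in the same way:

# For graph: mean fitness score (here, AIC) per generation
ggplot(data.frame(Generation = c(1:res[[3]]), Mean_Score = res[[2]]),
       aes(x = Generation, y = Mean_Score)) +
  geom_point() + geom_line()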

The following flowchart illustrates the structure of the genetic algorithm.

# For illustration
# install.packages("Gmisc")
library(grid)
suppressMessages(library(Gmisc))
grid.newpage()
# create boxes
(b1 <- boxGrob("Algorithm Initialization", x=.4, y=.9, 
               box_gp = gpar(fill = "lightgreen"), width = .25))
(b2 <- boxGrob("Step 1: Fitting", x=.4, y=.78, 
               box_gp = gpar(fill = "lightblue"), width = .25))
(b3 <- boxGrob("Step 2: Selection", x=.4, y=.66, 
               box_gp = gpar(fill = "wheat"), width = .25))
(b4 <- boxGrob("Step 3: Crossover", x=.4, y=.54, 
               box_gp = gpar(fill = "lightpink"), width = .25))
(b5 <- boxGrob("Step 4: Mutation", x=.4, y=.42, 
               box_gp = gpar(fill = "yellow"), width = .25))
(b6 <- boxGrob("Check Convergence", x=.2, y=.28, 
               box_gp = gpar(fill = "lightgrey"), width = .25))
(b7 <- boxGrob("Algorithm Output", x=.4, y=.15, 
               box_gp = gpar(fill = "lightgreen"), width = .25))
(b8 <- boxGrob("No", x=.2, y=.35, 
               box_gp = gpar(fill = "lightgrey"), width = .05))
(b9 <- boxGrob("Yes", x=.2, y=.21, 
               box_gp = gpar(fill = "lightgrey"), width = .05))
(b10 <- boxGrob("Randomly generate samples", x=.75, y=.78, 
                box_gp = gpar(fill = "lightgrey"), width = .3))
(b11 <- boxGrob("Randomly select samples ", x=.75, y=.66, 
                box_gp = gpar(fill = "lightgrey"), width = .3))
(b12 <- boxGrob("Recombine samples and filter", x=.75, y=.54, 
                box_gp = gpar(fill = "lightgrey"), width = .3))
(b13 <- boxGrob("Randomly modify samples", x=.75, y=.42, 
                box_gp = gpar(fill = "lightgrey"), width = .3))
(b14 <- boxGrob("Feature selection indicator\n [ 0 1 1 1 0 1 0 0 0 0 1 1 0 1 ]",
                x=.75, y=.15, 
                box_gp = gpar(fill = "lightgrey"), width = .3))
(b15 <- boxGrob("Fitness metrics:\n AIC, BIC, RMSE, R2, MAE ...",
                x=.75, y=.28, 
                box_gp = gpar(fill = "lightgrey"), width = .3))
# connect boxes
connectGrob(b1, b2, "v")
connectGrob(b2, b3, "v")
connectGrob(b3, b4, "v")
connectGrob(b4, b5, "v")
connectGrob(b5, b6, "L")
connectGrob(b6, b2, "L")
connectGrob(b6, b7, "L")

Additional Examples

Additional examples of running the code successfully can be found in /vignettes/Examples.R. We feel that stepping through the code line by line with different data sets is often the most useful way to understand the procedure.

References

Gomez, F. and Quesada, A. Genetic Algorithms for Feature Selection. Retrieved from https://www.neuraldesigner.com/blog/genetic_algorithms_for_feature_selection

Givens, G. H. and Hoeting, J. A. Computational Statistics, Second Edition, Section 3.4, John Wiley & Sons, Inc., 2012

Kuhn, M. Feature Selection using Genetic Algorithms. Retrieved from https://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html


