select: Genetic algorithm for variable selection
In jakemanderson/GA: Genetic Algorithm


View source: R/select.R

This function implements a genetic algorithm for variable selection in linear regression and GLMs. A genetic algorithm is an optimization method. For feature selection, it uses the given fitness function (e.g. AIC) as the objective function and runs multiple rounds of an update process to approach the optimal solution. It first generates a population of candidate solutions, each encoding one possible subset of features. The best candidates in this population are then selected according to the objective function, and from these parents a new population of the same size is randomly generated. After many iterations, the best solutions in the population approach the optimal solution, which is a binary string indicating which independent variables are selected.
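The loop described above can be sketched in a few lines of base R. This is an illustration only, not the package's implementation: the names `fitness_aic` and `ga_sketch`, the one-point crossover, the mutation rate, and the "keep the better half" parent rule are all assumptions made for the example, and the built-in `mtcars` data stands in for real input.

```r
set.seed(42)  # reproducible illustration

# Objective: AIC of a linear model on the predictors flagged by `chrom` (1 = keep).
fitness_aic <- function(chrom, data, target, predictors) {
  chosen <- predictors[chrom == 1]
  rhs <- if (length(chosen) == 0) "1" else paste(chosen, collapse = " + ")
  AIC(lm(as.formula(paste(target, "~", rhs)), data = data))
}

# Hypothetical helper; assumes at least two predictors with syntactic column names.
ga_sketch <- function(data, target, pop_size = 20, n_iter = 15, mut_rate = 0.05) {
  predictors <- setdiff(names(data), target)
  p <- length(predictors)
  score <- function(pop) {
    apply(pop, 1, fitness_aic, data = data, target = target, predictors = predictors)
  }
  # Initial population: each row is a random binary string of length p.
  pop <- matrix(rbinom(pop_size * p, 1, 0.5), nrow = pop_size)
  for (iter in seq_len(n_iter)) {
    # Select the better half as parents (lower AIC is better).
    parents <- pop[order(score(pop))[seq_len(pop_size %/% 2)], , drop = FALSE]
    # Breed a new population of the same size from randomly paired parents.
    pop <- t(replicate(pop_size, {
      pair <- parents[sample(nrow(parents), 2), , drop = FALSE]
      cut_point <- sample(p - 1, 1)                      # one-point crossover
      child <- c(pair[1, seq_len(cut_point)], pair[2, (cut_point + 1):p])
      flip <- runif(p) < mut_rate                        # random bit-flip mutation
      child[flip] <- 1 - child[flip]
      child
    }))
  }
  # Decode the best binary string of the final population into column names.
  best <- pop[which.min(score(pop)), ]
  predictors[best == 1]
}

selected <- ga_sketch(mtcars, "mpg")
print(selected)
```

The result is a character vector of column names, mirroring the return type documented for `select`. Because the search is stochastic, different seeds may give different subsets.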

select(data, target, fit_method = "lm", metric = "aic")

`data`	A data frame with one response variable and an arbitrary number of independent variables. Column order does not matter.
`target`	Column name of the response variable in `data`. Parameter type should be numeric or character.
`fit_method`	Regression method, either `lm` or `glm`. Parameter type should be character.
`metric`	Objective function; the default is `aic`. The function also supports `bic`, `rmse`, `mae`, `rsquare_negative`, and user-defined functions. Since this is a minimization problem, R-squared is negated (`rsquare_negative`). Users can define an arbitrary error function to minimize: its inputs should be `model` and `data`, and its output should be the user-defined error value. Pass such a function directly as a function object; for the built-in metrics, the parameter type should be character.
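As an illustration of the user-defined `metric` contract (inputs `model` and `data`, output a single error value to minimize), here is a minimal sketch. The function name `median_abs_error` is hypothetical, and the way it recovers the response column from the model formula is an assumption of this example, not something the package prescribes.

```r
# Hypothetical custom metric: median absolute error of a fitted model.
# Assumes `model` is a fitted lm/glm whose response is a plain column of `data`.
median_abs_error <- function(model, data) {
  response <- all.vars(formula(model))[1]      # first variable in the formula
  predicted <- predict(model, newdata = data)
  median(abs(data[[response]] - predicted))
}

# Stand-alone check on a built-in data set.
fit <- lm(mpg ~ wt + hp, data = mtcars)
err <- median_abs_error(fit, mtcars)
print(err)

# Passing it to select() as a function object (requires the package to be loaded):
# select(mtcars, "mpg", fit_method = "lm", metric = median_abs_error)
```

Lower values are treated as better, so a score where higher is better should be negated before it is returned.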

A character vector of the selected column names from the input data frame `data`.

# Example 1
# Setup:
# https://www.kaggle.com/c/house-prices-advanced-regression-techniques
# House Prices: Advanced Regression Techniques
# Predict sales prices
dt_house <- read.csv("../data/data_house.csv")
dt_house <- dt_house[, c("MSSubClass", "MSZoning", "LotArea", "LotShape", "Alley", "LandContour", "LotConfig", "LandSlope", "Neighborhood", "BldgType", "WoodDeckSF", "OpenPorchSF", "HouseStyle", "OverallQual", "OverallCond","SaleType", "SaleCondition", "LotFrontage", "MoSold", "SalePrice")]
dt_house[, "MSSubClass"] <- as.factor(dt_house[, "MSSubClass"])
dt_house[, "MoSold"] <- as.factor(dt_house[, "MoSold"])
dt_house[, "LotArea"] <- as.numeric(dt_house[, "LotArea"])
dt_house[, "LotShape"] <- as.factor(dt_house[, "LotShape"])
dt_house[, "Alley"] <- as.factor(dt_house[, "Alley"])
dt_house[, "LandContour"] <- as.factor(dt_house[, "LandContour"])
dt_house[, "LotConfig"] <- as.factor(dt_house[, "LotConfig"])
dt_house[, "LandSlope"] <- as.factor(dt_house[, "LandSlope"])
dt_house[, "Neighborhood"] <- as.factor(dt_house[, "Neighborhood"])
dt_house[, "BldgType"] <- as.factor(dt_house[, "BldgType"])
dt_house[, "WoodDeckSF"] <- as.numeric(dt_house[, "WoodDeckSF"])
dt_house[, "OpenPorchSF"] <- as.numeric(dt_house[, "OpenPorchSF"])
dt_house[, "HouseStyle"] <- as.factor(dt_house[, "HouseStyle"])
dt_house[, "OverallQual"] <- as.numeric(dt_house[, "OverallQual"])
dt_house[, "OverallCond"] <- as.numeric(dt_house[, "OverallCond"])
dt_house[, "SaleType"] <- as.factor(dt_house[, "SaleType"])
dt_house[, "SaleCondition"] <- as.factor(dt_house[, "SaleCondition"])
dt_house[, "LotFrontage"] <- as.numeric(dt_house[, "LotFrontage"])
dt_house[, "SalePrice"] <- as.numeric(dt_house[, "SalePrice"])
# Execution:
select(dt_house, 'SalePrice', fit_method = 'lm', metric = 'aic')


# Example 2
# Setup:
# https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009
# Red Wine Quality
dt_wine <- read.csv("../data/data_wine.csv")
dt_wine[, "quality"] <- as.numeric(dt_wine[, "quality"])
# Execution:
select(dt_wine, 'quality', fit_method = 'lm', metric = 'aic')


# Example 3
# Setup:
# https://www.kaggle.com/kumarajarshi/life-expectancy-who
# Life Expectancy (WHO)
# Statistical Analysis on factors influencing Life Expectancy
dt_life <- read.csv("./data/data_life.csv")
dt_life[, "Country"] <- as.factor(dt_life[, "Country"])
dt_life[, "Year"] <- as.numeric(dt_life[, "Year"])
dt_life[, "Status"] <- as.factor(dt_life[, "Status"])
dt_life[, "Life.expectancy"] <- as.numeric(dt_life[, "Life.expectancy"])
for(i in 5:dim(dt_life)[2]){ dt_life[, i] <- as.numeric(dt_life[, i]) }
# Execution:
select(dt_life, 'Life.expectancy', fit_method = 'lm', metric = 'aic')


# Example 4
# Setup:
# Bike sharing dataset
dt_bike <- read.csv("./data/data_bike.csv")
dt_bike[, 'dteday'] <- as.numeric(as.Date(dt_bike[, 'dteday']))
dt_bike[, 'yr'] <- as.factor(dt_bike[, 'yr'])
dt_bike[, 'mnth'] <- as.factor(dt_bike[, 'mnth'])
dt_bike[, 'holiday'] <- as.factor(dt_bike[, 'holiday'])
dt_bike[, 'workingday'] <- as.factor(dt_bike[, 'workingday'])
dt_bike[, 'weathersit'] <- as.factor(dt_bike[, 'weathersit'])
dt_bike$instant <- NULL
dt_bike$registered <- NULL
dt_bike$casual <- NULL
# Execution:
select(dt_bike, 'cnt', fit_method = 'lm', metric = 'aic')


# Example 5
# Setup:
# Basic data set loading and test of function lm()
# Load the built-in data set BostonHousing
library(mlbench)
data(BostonHousing)
# Execution:
select(BostonHousing, 'medv', fit_method = 'lm', metric = 'aic')
# Fragment of a repeated-run loop: `repetition`, `k`, and `temp` are not defined in this example.
score_value <- rep(0, repetition)
score_value[k] <- temp[[2]]
plot(score_value)