select: Genetic algorithm for variable selection


View source: R/select.R

Description

This function implements a genetic algorithm for variable selection in linear regression and GLMs. A genetic algorithm is a heuristic optimization method. For feature selection, it uses the given fitness function (e.g. AIC) as the objective function and runs multiple rounds of an update process to approach the optimal solution. It first generates a population of candidate solutions, each encoding one possible subset of features. The best members of this population are then selected according to the objective function, and from these parents a new population of the same size is randomly generated. After many iterations, the best solutions in the population approach the optimal solution, which is a binary string indicating which independent variables are selected.
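
The loop described above can be sketched in a few lines of R. This is only an illustration under stated assumptions, not the package's implementation: the function name ga_select_sketch, its parameters, the keep-the-better-half selection rule, single-point crossover, and the mutation rate are all hypothetical choices, and the built-in mtcars data set stands in for real data. It assumes at least two predictors and a population of at least four.

```r
# Minimal GA sketch for variable selection. All names and tuning choices here
# are hypothetical and are not taken from the package source.
set.seed(1)

ga_select_sketch <- function(data, target, pop_size = 20, n_iter = 30,
                             mutation_rate = 0.05) {
  predictors <- setdiff(names(data), target)
  n_var <- length(predictors)
  stopifnot(n_var >= 2, pop_size >= 4)

  # Fitness: AIC of the linear model built from the variables flagged in `chrom`
  fitness <- function(chrom) {
    if (!any(chrom)) return(Inf)  # an empty subset is never preferred
    f <- reformulate(predictors[chrom], response = target)
    AIC(lm(f, data = data))
  }

  # Initial population: one random logical chromosome per row
  pop <- matrix(runif(pop_size * n_var) < 0.5, nrow = pop_size)

  best_chrom <- NULL
  best_score <- Inf
  for (iter in seq_len(n_iter)) {
    scores <- apply(pop, 1, fitness)

    # Remember the best chromosome seen so far
    if (min(scores) < best_score) {
      best_score <- min(scores)
      best_chrom <- pop[which.min(scores), ]
    }

    # Selection: keep the better half of the population as parents
    parents <- pop[order(scores)[seq_len(pop_size %/% 2)], , drop = FALSE]

    # Crossover and mutation: build a new population of the same size
    for (i in seq_len(pop_size)) {
      pair <- parents[sample(nrow(parents), 2), , drop = FALSE]
      cut <- sample(n_var - 1, 1)  # single crossover point
      child <- c(pair[1, seq_len(cut)], pair[2, (cut + 1):n_var])
      flip <- runif(n_var) < mutation_rate
      child[flip] <- !child[flip]
      pop[i, ] <- child
    }
  }
  list(selected = predictors[best_chrom], score = best_score)
}

result <- ga_select_sketch(mtcars, "mpg")
```

Here result$selected is a character vector of predictor names and result$score is the AIC of the corresponding model. Because the search is random, different seeds may return different subsets.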

Usage

select(data, target, fit_method = "lm", metric = "aic")

Arguments

data

A data frame with one response variable and an arbitrary number of independent variables. Column order does not matter.

target

Column name of the response variable in data. Parameter type should be numeric or character.

fit_method

Regression method, either lm or glm. Parameter type should be character.

metric

Objective function; the default is aic. The function also supports bic, rmse, mae, rsquare_negative and user-defined functions. Because this is a minimization problem, R-squared is negated. Users can define an arbitrary error function to minimize: its inputs should be the model and the data, and its output should be the user-defined error. Pass such a function directly as a function object. For the built-in metrics, the parameter type should be character.
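
As an illustration of a user-defined metric, the sketch below defines two error functions that take a fitted model and a data frame and return a single number to minimize. The function names are hypothetical, and the argument order (model, data) is an assumption based on the description above rather than on the package source. The built-in mtcars data set is used only for demonstration.

```r
# Hypothetical user-defined metric: median absolute error.
# Assumes the metric is called as metric(model, data).
median_abs_error <- function(model, data) {
  target <- all.vars(formula(model))[1]  # response variable name
  observed <- data[[target]]
  predicted <- predict(model, newdata = data)
  median(abs(observed - predicted), na.rm = TRUE)
}

# Hypothetical negated R-squared (lm fits only): a larger R-squared gives a
# smaller value, so it can be minimized like an error.
neg_rsquare <- function(model, data) {
  -summary(model)$r.squared
}

fit <- lm(mpg ~ wt + hp, data = mtcars)
err <- median_abs_error(fit, mtcars)
nr2 <- neg_rsquare(fit, mtcars)

# With the package loaded, the function object would be passed directly:
# select(mtcars, "mpg", fit_method = "lm", metric = median_abs_error)
```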

Value

A vector of selected column names in the input data frame data. Return type is character vector.
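
Because the result is a plain character vector, it can be turned into a model formula for refitting. The sketch below uses a made-up vector in place of real output from select(), together with the built-in mtcars data set; the variable names are assumptions for illustration only.

```r
# Hypothetical stand-in for the character vector returned by select();
# this is not real output from the function.
selected <- c("wt", "hp")

# Build mpg ~ wt + hp from the selected names and refit the model
f <- reformulate(selected, response = "mpg")
final_fit <- lm(f, data = mtcars)
```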

Examples

# Example 1
# Setup:
# https://www.kaggle.com/c/house-prices-advanced-regression-techniques
# House Prices: Advanced Regression Techniques
# Predict sales prices
dt_house <- read.csv("../data/data_house.csv")
dt_house <- dt_house[, c("MSSubClass", "MSZoning", "LotArea", "LotShape", "Alley", "LandContour", "LotConfig", "LandSlope", "Neighborhood", "BldgType", "WoodDeckSF", "OpenPorchSF", "HouseStyle", "OverallQual", "OverallCond","SaleType", "SaleCondition", "LotFrontage", "MoSold", "SalePrice")]
dt_house[, "MSSubClass"] <- as.factor(dt_house[, "MSSubClass"])
dt_house[, "MoSold"] <- as.factor(dt_house[, "MoSold"])
dt_house[, "LotArea"] <- as.numeric(dt_house[, "LotArea"])
dt_house[, "LotShape"] <- as.factor(dt_house[, "LotShape"])
dt_house[, "Alley"] <- as.factor(dt_house[, "Alley"])
dt_house[, "LandContour"] <- as.factor(dt_house[, "LandContour"])
dt_house[, "LotConfig"] <- as.factor(dt_house[, "LotConfig"])
dt_house[, "LandSlope"] <- as.factor(dt_house[, "LandSlope"])
dt_house[, "Neighborhood"] <- as.factor(dt_house[, "Neighborhood"])
dt_house[, "BldgType"] <- as.factor(dt_house[, "BldgType"])
dt_house[, "WoodDeckSF"] <- as.numeric(dt_house[, "WoodDeckSF"])
dt_house[, "OpenPorchSF"] <- as.numeric(dt_house[, "OpenPorchSF"])
dt_house[, "HouseStyle"] <- as.factor(dt_house[, "HouseStyle"])
dt_house[, "OverallQual"] <- as.numeric(dt_house[, "OverallQual"])
dt_house[, "OverallCond"] <- as.numeric(dt_house[, "OverallCond"])
dt_house[, "SaleType"] <- as.factor(dt_house[, "SaleType"])
dt_house[, "SaleCondition"] <- as.factor(dt_house[, "SaleCondition"])
dt_house[, "LotFrontage"] <- as.numeric(dt_house[, "LotFrontage"])
dt_house[, "SalePrice"] <- as.numeric(dt_house[, "SalePrice"])
# Execution:
select(dt_house, 'SalePrice', fit_method = 'lm', metric = 'aic')


# Example 2
# Setup:
# https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009
# Red Wine Quality
dt_wine <- read.csv("../data/data_wine.csv")
dt_wine[, "quality"] <- as.numeric(dt_wine[, "quality"])
# Execution:
select(dt_wine, 'quality', fit_method = 'lm', metric = 'aic')


# Example 3
# Setup:
# https://www.kaggle.com/kumarajarshi/life-expectancy-who
# Life Expectancy (WHO)
# Statistical Analysis on factors influencing Life Expectancy
dt_life <- read.csv("./data/data_life.csv")
dt_life[, "Country"] <- as.factor(dt_life[, "Country"])
dt_life[, "Year"] <- as.numeric(dt_life[, "Year"])
dt_life[, "Status"] <- as.factor(dt_life[, "Status"])
dt_life[, "Life.expectancy"] <- as.numeric(dt_life[, "Life.expectancy"])
for (i in 5:ncol(dt_life)) { dt_life[, i] <- as.numeric(dt_life[, i]) }
# Execution:
select(dt_life, 'Life.expectancy', fit_method = 'lm', metric = 'aic')


# Example 4
# Setup:
# Bike sharing dataset
dt_bike <- read.csv("./data/data_bike.csv")
dt_bike[, 'dteday'] <- as.numeric(as.Date(dt_bike[, 'dteday']))
dt_bike[, 'yr'] <- as.factor(dt_bike[, 'yr'])
dt_bike[, 'mnth'] <- as.factor(dt_bike[, 'mnth'])
dt_bike[, 'holiday'] <- as.factor(dt_bike[, 'holiday'])
dt_bike[, 'workingday'] <- as.factor(dt_bike[, 'workingday'])
dt_bike[, 'weathersit'] <- as.factor(dt_bike[, 'weathersit'])
dt_bike$instant <- NULL
dt_bike$registered <- NULL
dt_bike$casual <- NULL
# Execution:
select(dt_bike, 'cnt', fit_method = 'lm', metric = 'aic')


# Example 5
# Setup:
# Basic data set loading and test of function lm()
# Load the BostonHousing data set that ships with mlbench
library(mlbench)
data(BostonHousing)
# Execution:
select(BostonHousing, 'medv', fit_method = 'lm', metric = 'aic')
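# Fragment of a repeated-run score plot; `repetition`, `k` and `temp` are not defined in this example:
# score_value <- rep(0, repetition)
# score_value[k] <- temp[[2]]
# plot(score_value)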

jakemanderson/GA documentation built on Jan. 1, 2020, 1:03 p.m.