# GenAlgForSubsetSelection: Genetic algorithm for subset selection

In STPGA: Selection of Training Populations by Genetic Algorithm

## Description

Uses a genetic algorithm to select n_{Train} individuals from the candidate set so that the chosen optimality criterion is minimized.
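The selection problem can be written compactly. The notation below is introduced for illustration only and does not come from the package documentation: $T$ denotes a training set and $f$ the criterion named by `errorstat`.

$$
\hat{T} \;=\; \underset{T \subseteq \text{Candidates},\; |T| = n_{Train}}{\arg\min}\; f(T, \text{Test})
$$

The genetic algorithm searches this space heuristically, so the returned set approximates $\hat{T}$ rather than being guaranteed to equal it.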

## Usage

```
GenAlgForSubsetSelection(P, Candidates, Test, ntoselect, npop = 100,
  nelite = 5, keepbest = TRUE, tabu = TRUE, tabumemsize = 1,
  mutprob = 0.8, mutintensity = 1, niterations = 500,
  minitbefstop = 200, niterreg = 5, lambda = 1e-06,
  plotiters = FALSE, plottype = 1, errorstat = "PEVMEAN2", C = NULL,
  mc.cores = 1, InitPop = NULL, tolconv = 1e-07, Vg = NULL, Ve = NULL,
  Fedorov = FALSE)
```

## Arguments

| Argument | Description |
|---|---|
| `P` | Depending on the criterion, either a numeric data matrix or a symmetric similarity matrix. The identifiers of the candidate (and test) individuals should be set as row names (and also as column names in the case of a similarity matrix). For methods using the relationships, this is the inverse of the relationship matrix, with the identifiers of the candidate (and test) individuals as row and column names. |
| `Candidates` | Vector of identifiers for the individuals in the candidate set. |
| `Test` | Vector of identifiers for the individuals in the test set. |
| `ntoselect` | n_{Train}: number of individuals to select for the training set. |
| `npop` | Genetic algorithm parameter: number of solutions at each iteration. |
| `nelite` | Genetic algorithm parameter: number of solutions selected as elite parents, which generate the next set of solutions. |
| `keepbest` | Genetic algorithm parameter, TRUE or FALSE. If TRUE, the best solution is always kept in the next generation of solutions (elitism). |
| `tabu` | Genetic algorithm parameter, TRUE or FALSE. If TRUE, solutions saved in tabu memory are not retried. |
| `tabumemsize` | Genetic algorithm parameter, integer > 0. Number of generations to hold in tabu memory. |
| `mutprob` | Genetic algorithm parameter: probability of mutation for each generated solution. |
| `mutintensity` | Mean of the Poisson variable used to decide the number of mutations for each cross. |
| `niterations` | Genetic algorithm parameter: number of iterations. |
| `minitbefstop` | Genetic algorithm parameter: number of iterations before stopping if no change is observed in the criterion value. |
| `niterreg` | Genetic algorithm parameter: number of iterations to use regressions; an integer with a minimum value of 1. |
| `lambda` | Scalar shrinkage parameter (λ>0). |
| `plotiters` | Plot the convergence: TRUE or FALSE. Default is FALSE (as given in Usage). |
| `plottype` | Type of plot. Possible values are 1, 2, 3; default is 1. |
| `errorstat` | Name of the optimality criterion to minimize. Default is "PEVMEAN2" (as given in Usage). User-defined functions can also be used. |
| `mc.cores` | Number of cores to use. |
| `InitPop` | A list of initial solutions. |
| `tolconv` | If the algorithm cannot improve `errorstat` by more than `tolconv` over the last `minitbefstop` iterations, it stops. |
| `C` | Contrast matrix. |
| `Vg` | Covariance matrix between traits generated by the relationship K (only for the multi-trait version of PEVMEANMM). |
| `Ve` | Residual covariance matrix for the traits (only for the multi-trait version of PEVMEANMM). |
| `Fedorov` | Whether Fedorov's exchange algorithm from the `AlgDesign` package should be used for initial solutions. |
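To make the roles of `npop`, `nelite`, `mutprob`, `mutintensity`, `minitbefstop` and `tolconv` concrete, here is a minimal toy genetic algorithm in base R. It is a sketch under stated assumptions, not the STPGA implementation: the function name `toy_ga_subset`, the crossover rule (sampling from the union of two elite parents) and the toy criterion are all hypothetical, and tabu memory, regression steps and parallelism are left out.

```r
# Toy GA for subset selection: an illustration only, NOT the STPGA code.
# `criterion` is any function mapping a subset (character vector) to a
# number to be minimized. Assumes character identifiers and nelite >= 2.
toy_ga_subset <- function(candidates, ntoselect, criterion,
                          npop = 20, nelite = 5, mutprob = 0.8,
                          mutintensity = 1, niterations = 100,
                          minitbefstop = 20, tolconv = 1e-7) {
  # Start from npop random subsets of size ntoselect.
  pop <- replicate(npop, sample(candidates, ntoselect), simplify = FALSE)
  best_values <- numeric(0)
  for (iter in seq_len(niterations)) {
    scores <- vapply(pop, criterion, numeric(1))
    elite <- pop[order(scores)[seq_len(nelite)]]  # best nelite, ascending
    best_values <- c(best_values, min(scores))
    # Stop if the best value improved by at most tolconv
    # over the last minitbefstop iterations.
    if (iter > minitbefstop &&
        best_values[iter - minitbefstop] - best_values[iter] <= tolconv) break
    children <- lapply(seq_len(npop - 1), function(k) {
      parents <- elite[sample.int(nelite, 2)]
      # Crossover (assumed rule): draw the child from the parents' union.
      child <- sample(union(parents[[1]], parents[[2]]), ntoselect)
      # Mutation: with probability mutprob, swap a Poisson-distributed
      # number of members for candidates not currently in the child.
      outside <- setdiff(candidates, child)
      if (runif(1) < mutprob) {
        nmut <- min(rpois(1, mutintensity), ntoselect, length(outside))
        if (nmut > 0) {
          child[sample.int(ntoselect, nmut)] <- sample(outside, nmut)
        }
      }
      child
    })
    pop <- c(elite[1], children)  # elitism: carry the best solution over
  }
  c(elite, list(best_values))     # nelite subsets, then the criterion trace
}

# Usage with a hypothetical criterion: prefer subsets with small id sums.
set.seed(1)
ids <- as.character(1:30)
res <- toy_ga_subset(ids, ntoselect = 5,
                     criterion = function(s) sum(as.integer(s)))
res[[1]]  # best subset found in this run
```

Because the best solution is always carried over, the stored per-iteration minimum never increases, which is what makes the `tolconv` stopping rule meaningful.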

## Value

A list of length nelite+1. The first nelite elements are optimized training samples of size n_{Train}, listed in increasing order of the optimization criterion. The last element is a vector that stores the minimum value of the objective function at each iteration.
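The structure of the returned list can be illustrated with a small mock object. The identifiers and criterion values below are hypothetical and only mimic the documented shape for `nelite = 2`; no STPGA function is called.

```r
# Mock result with the documented shape, assuming nelite = 2.
# All identifiers and numbers are made up for illustration.
nelite <- 2
result <- list(
  c("id3", "id7", "id9"),     # best training set (lowest criterion)
  c("id1", "id7", "id8"),     # second-best training set
  c(0.91, 0.74, 0.74, 0.69)   # minimum criterion value at each iteration
)

best_train   <- result[[1]]              # the single best training sample
elite_sets   <- result[seq_len(nelite)]  # all nelite training samples
crit_by_iter <- result[[nelite + 1]]     # convergence trace
```

With a real result, `result[[1]]` is the sample to use for model training, and plotting `crit_by_iter` against the iteration index gives a quick convergence check.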

## Note

The GA does not guarantee convergence to globally optimal solutions, and it is highly recommended that the algorithm be replicated to obtain "good" training samples.
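One simple way to replicate the algorithm is to run it several times and keep the run with the lowest criterion value. The helper below is a sketch: `best_of_runs` and `fake_run` are hypothetical names, and `fake_run` only stands in for a real call so that the snippet runs without the package.

```r
# run_once: a zero-argument function returning a list shaped like the
# Value section describes (elite subsets first, per-iteration minima last).
best_of_runs <- function(run_once, nreps) {
  runs <- lapply(seq_len(nreps), function(r) run_once())
  # Lowest criterion value reached in each run.
  final <- vapply(runs, function(res) min(res[[length(res)]]), numeric(1))
  runs[[which.min(final)]]
}

# Stand-in for a real GA call (hypothetical data, illustration only).
fake_run <- function() {
  crit <- sort(runif(5), decreasing = TRUE)
  list(sample(letters, 3), sample(letters, 3), crit)
}

set.seed(42)
best <- best_of_runs(fake_run, nreps = 10)
best[[1]]  # best subset from the best of the 10 runs
```

In practice `run_once` would wrap a call such as `function() GenAlgForSubsetSelection(P = ..., Candidates = ..., Test = ..., ntoselect = ...)` with the arguments of interest filled in.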

Deniz Akdemir

## Examples

```
## Not run:
####################################
library(EMMREML)
library(STPGA)

data(WheatData)

## first five principal components of the kinship matrix
svdWheat <- svd(Wheat.K, nu = 5, nv = 5)
PC50WHeat <- Wheat.K %*% svdWheat$v
plot(PC50WHeat[, 1], PC50WHeat[, 2])
rownames(PC50WHeat) <- rownames(Wheat.K)

## cluster the lines and use cluster 4 as the test set
DistWheat <- dist(PC50WHeat)
TreeWheat <- hclust(DistWheat)
TreeWheat <- cutree(TreeWheat, k = 4)
Test <- rownames(PC50WHeat)[TreeWheat == 4]
length(Test)
Candidates <- setdiff(rownames(PC50WHeat), Test)

### Instead of using the algorithm directly, use a wrapper that
### runs the genetic algorithm from multiple starting points.
repeatgenalg <- function(numrepsouter, numrepsinner) {
  StartingPopulation2 <- NULL
  for (i in 1:numrepsouter) {
    print("Rep:")
    print(i)
    ## short runs, each seeded with the best solutions found so far
    StartingPopulation <- lapply(1:numrepsinner, function(x) {
      GenAlgForSubsetSelection(P = PC50WHeat, Candidates = Candidates,
        Test = Test, ntoselect = 50, InitPop = StartingPopulation2,
        npop = 50, nelite = 5, mutprob = .5, mutintensity = rpois(1, 4),
        niterations = 10, minitbefstop = 5, tabumemsize = 2,
        plotiters = TRUE, lambda = 1e-9, errorstat = "CDMEAN",
        mc.cores = 1)
    })
    ## keep the best solution of each inner run as a starting point
    StartingPopulation2 <- vector(mode = "list", length = numrepsouter * 1)
    ij <- 1
    for (i in 1:numrepsinner) {
      for (j in 1:1) {
        StartingPopulation2[[ij]] <- StartingPopulation[[i]][[j]]
        ij <- ij + 1
      }
    }
  }
  ## final, longer run starting from the collected solutions
  ListTrain <- GenAlgForSubsetSelection(P = PC50WHeat,
    Candidates = Candidates, Test = Test, ntoselect = 50,
    InitPop = StartingPopulation2, npop = 100, nelite = 10,
    mutprob = .5, mutintensity = 1, niterations = 300,
    minitbefstop = 100, tabumemsize = 1, plotiters = TRUE,
    lambda = 1e-9, errorstat = "CDMEAN", mc.cores = 1)
  return(ListTrain)
}

ListTrain <- repeatgenalg(20, 3)

### test sample
deptestopt <- Wheat.Y[Wheat.Y$id %in% Test, ]

## predictions by optimized sample
deptrainopt <- Wheat.Y[(Wheat.Y$id %in% ListTrain[[1]]), ]

Ztrain <- model.matrix(~ -1 + deptrainopt$id)
Ztest <- model.matrix(~ -1 + deptestopt$id)

modelopt <- emmreml(y = deptrainopt$plant.height,
  X = matrix(1, nrow = nrow(deptrainopt), ncol = 1),
  Z = Ztrain, K = Wheat.K)
predictopt <- Ztest %*% modelopt$uhat

corvecrs <- c()
for (rep in 1:300) {
  ### predictions by a random sample of the same size
  rs <- sample(Candidates, 50)

  deptestrs <- Wheat.Y[Wheat.Y$id %in% Test, ]
  deptrainrs <- Wheat.Y[(Wheat.Y$id %in% rs), ]

  Ztrain <- model.matrix(~ -1 + deptrainrs$id)
  Ztest <- model.matrix(~ -1 + deptestrs$id)

  modelrs <- emmreml(y = deptrainrs$plant.height,
    X = matrix(1, nrow = nrow(deptrainrs), ncol = 1),
    Z = Ztrain, K = Wheat.K)
  predictrs <- Ztest %*% modelrs$uhat
  corvecrs <- c(corvecrs, cor(predictrs, deptestrs$plant.height))
}

## accuracy: mean over random samples vs. the optimized sample
mean(corvecrs)
cor(predictopt, deptestopt$plant.height)

plot(PC50WHeat[, 1], PC50WHeat[, 2],
  col = rownames(PC50WHeat) %in% ListTrain[[1]] + 1,
  pch = 2 * rownames(PC50WHeat) %in% Test + 1,
  xlab = "pc1", ylab = "pc2")

## End(Not run)
```

