GenAlgForSubsetSelection: Genetic algorithm for subset selection
In STPGA: Selection of Training Populations by Genetic Algorithm



Uses a genetic algorithm to select n_{Train} individuals so that the optimality criterion is minimized.

GenAlgForSubsetSelection(P, Candidates, Test, ntoselect, npop = 100, nelite =
                 5, keepbest = TRUE, tabu = TRUE, tabumemsize = 1, mutprob
                 = 0.8, mutintensity = 1, niterations = 500,
                 minitbefstop = 200, niterreg = 5, lambda = 1e-06,
                 plotiters = FALSE, plottype=1,errorstat = "PEVMEAN2", C = NULL,
                 mc.cores = 1, InitPop = NULL, tolconv = 1e-07, Vg =
                 NULL, Ve = NULL, Fedorov=FALSE)

`P`	depending on the criterion, this is either a numeric data matrix or a symmetric similarity matrix. When it is a data matrix, the union of the identifiers of the candidate (and test) individuals should be given as row names (and also as column names in the case of a similarity matrix). For methods using relationships, this is the inverse of the relationship matrix, with the identifiers of the candidate (and test) individuals as row and column names.
`Candidates`	vector of identifiers for the individuals in the candidate set.
`Test`	vector of identifiers for the individuals in the test set.
`ntoselect`	n_{Train}: number of individuals to select in the training set.
`npop`	genetic algorithm parameter, number of solutions at each iteration
`nelite`	genetic algorithm parameter, number of solutions selected as elite parents which will generate the next set of solutions.
`keepbest`	genetic algorithm parameter, TRUE or FALSE. If TRUE then the best solution is always kept in the next generation of solutions (elitism).
`tabu`	genetic algorithm parameter, TRUE or FALSE. If TRUE then the solutions that are saved in tabu memory will not be retried.
`tabumemsize`	genetic algorithm parameter, integer>0. Number of generations to hold in tabu memory.
`mutprob`	genetic algorithm parameter, probability of mutation for each generated solution.
`mutintensity`	genetic algorithm parameter, mean of the Poisson variable that is used to decide the number of mutations for each cross.
`niterations`	genetic algorithm parameter, number of iterations.
`minitbefstop`	genetic algorithm parameter, number of iterations before stopping if no change is observed in criterion value.
`niterreg`	genetic algorithm parameter, number of iterations to use regressions; an integer with a minimum value of 1.
`lambda`	scalar shrinkage parameter (λ>0).
`plotiters`	plot the convergence: TRUE or FALSE. Default is FALSE.
`plottype`	type of plot. Possible values are 1, 2, 3. Default is 1.
`errorstat`	optimality criterion: one of the available optimality criteria. Default is "PEVMEAN2". It is possible to use user-defined functions as shown in the examples.
`mc.cores`	number of cores to use.
`InitPop`	a list of initial solutions
`tolconv`	if the algorithm cannot improve the errorstat more than tolconv for the last minitbefstop iterations it will stop.
`C`	Contrast Matrix.
`Vg`	covariance matrix between traits generated by the relationship K (only for multi-trait version of PEVMEANMM).
`Ve`	residual covariance matrix for the traits (only for multi-trait version of PEVMEANMM).
`Fedorov`	whether Fedorov's exchange algorithm from the `AlgDesign` package should be used to generate initial solutions. TRUE or FALSE.
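
The roles of `mutprob`, `mutintensity`, `minitbefstop` and `tolconv` described above can be illustrated with a small sketch. The helpers `mutate_solution` and `should_stop` are hypothetical names written for this illustration only. They are not part of STPGA, and the package's internal logic may differ in detail.

```r
# Hypothetical helper (not an STPGA function): mutate one solution, given as a
# vector of selected identifiers. With probability mutprob, draw the number of
# swaps from Poisson(mutintensity) and exchange that many selected identifiers
# for identifiers that are currently not selected.
mutate_solution <- function(solution, candidates, mutprob, mutintensity) {
  if (runif(1) >= mutprob) return(solution)
  unused <- setdiff(candidates, solution)
  n_swaps <- min(rpois(1, mutintensity), length(solution), length(unused))
  if (n_swaps == 0) return(solution)
  drop_idx <- sample(length(solution), n_swaps)
  solution[drop_idx] <- sample(unused, n_swaps)
  solution
}

# Hypothetical helper (not an STPGA function): TRUE when the best criterion
# value improved by no more than tolconv over the last minitbefstop iterations.
# `trace` is assumed to hold the minimum criterion value at each iteration.
should_stop <- function(trace, minitbefstop, tolconv) {
  n <- length(trace)
  if (n <= minitbefstop) return(FALSE)
  improvement <- trace[n - minitbefstop] - trace[n]
  improvement <= tolconv
}

set.seed(42)
candidates <- paste0("id", 1:20)
solution <- candidates[1:5]
mutated <- mutate_solution(solution, candidates, mutprob = 1, mutintensity = 2)
unchanged <- mutate_solution(solution, candidates, mutprob = 0, mutintensity = 2)

trace <- c(1, 0.8, 0.7, 0.7, 0.7, 0.7)
stop_short <- should_stop(trace, minitbefstop = 3, tolconv = 1e-07)
stop_long <- should_stop(trace, minitbefstop = 4, tolconv = 1e-07)
```

In this sketch a mutation keeps the solution size fixed at `ntoselect`, since every removed identifier is replaced by an unused one.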

A list of length nelite+1. The first nelite elements of the list are optimized training samples of size n_{Train}, listed in increasing order of the optimization criterion, so the first element is the best sample found. The last element of the list is a vector that stores the minimum value of the objective function at each iteration.
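
As a minimal sketch of how a list of this shape might be unpacked, the block below builds a mock object with nelite sample vectors followed by one numeric vector of per-iteration minima. The object `mock_result`, its identifiers and its criterion values are invented for illustration and are not output from the package.

```r
# Mock object with the documented shape: nelite = 3 samples plus a trace.
# Identifiers and criterion values are made up for this illustration.
nelite <- 3
mock_result <- list(
  c("id1", "id4", "id7"),    # best sample (lowest criterion value)
  c("id1", "id4", "id9"),
  c("id2", "id4", "id7"),
  c(0.90, 0.75, 0.75, 0.70)  # minimum criterion value at each iteration
)

best_train <- mock_result[[1]]                 # first element: best sample
elite_sets <- mock_result[seq_len(nelite)]     # all elite samples
trace <- mock_result[[length(mock_result)]]    # last element: convergence trace
final_value <- trace[length(trace)]            # criterion value at the last iteration
```

Plotting `trace` against the iteration index gives a quick view of convergence when `plotiters` was left at FALSE.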

The GA does not guarantee convergence to a globally optimal solution. It is highly recommended to run the algorithm several times from different starting points to obtain "good" training samples.
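
The replication advice can be sketched as follows. To keep the block self-contained, `toy_run` is a hypothetical stand-in that does a plain random search on a made-up criterion. It is not the package's genetic algorithm and only mimics the idea that each run returns a subset together with its criterion value.

```r
# Toy stand-in for one optimizer run: random search over subsets of size n.
# This is NOT the STPGA algorithm; scores and the criterion are invented.
toy_run <- function(scores, n, n_tries = 50) {
  best_set <- NULL
  best_val <- Inf
  for (k in seq_len(n_tries)) {
    cand <- sample(names(scores), n)
    val <- -sum(scores[cand])        # toy criterion to minimize
    if (val < best_val) {
      best_val <- val
      best_set <- cand
    }
  }
  list(set = best_set, value = best_val)
}

set.seed(1)
scores <- setNames(runif(30), paste0("id", 1:30))

# Replicate the search from independent random starts and keep the best run.
runs <- lapply(1:10, function(r) toy_run(scores, n = 5))
values <- vapply(runs, function(r) r$value, numeric(1))
best_run <- runs[[which.min(values)]]
```

With the real function, the same pattern would compare the final entry of each run's convergence vector, assuming all runs use the same `errorstat`.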

Deniz Akdemir

	## Not run: 
####################################
library(EMMREML)
library(STPGA)
data(WheatData)

svdWheat<-svd(Wheat.K, nu=5, nv=5)
PC50WHeat<-Wheat.K%*%svdWheat$v
plot(PC50WHeat[,1],PC50WHeat[,2])
rownames(PC50WHeat)<-rownames(Wheat.K)
DistWheat<-dist(PC50WHeat)
TreeWheat<-hclust(DistWheat)
TreeWheat<-cutree(TreeWheat, k=4)

Test<-rownames(PC50WHeat)[TreeWheat==4]
length(Test)
Candidates<-setdiff(rownames(PC50WHeat), Test)


###instead of calling the algorithm directly, use a wrapper that
###runs the genetic algorithm from multiple starting points.
repeatgenalg<-function(numrepsouter,numrepsinner){
  StartingPopulation2=NULL 
  for (i in 1:numrepsouter){
    print("Rep:")
    print(i)
    StartingPopulation<-lapply(1:numrepsinner, function(x){
      GenAlgForSubsetSelection(P=PC50WHeat, Candidates=Candidates,
        Test=Test, ntoselect=50, InitPop=StartingPopulation2,
        npop=50, nelite=5, mutprob=.5, mutintensity=rpois(1,4),
        niterations=10, minitbefstop=5, tabumemsize=2, plotiters=TRUE,
        lambda=1e-9, errorstat="CDMEAN", mc.cores=1)})
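    ###keep the best solution of each inner run as a starting point;
    ###one slot per inner run, so no empty entries are passed to InitPop
    StartingPopulation2<-vector(mode="list", length = numrepsinner*1)
    ij=1
    for (k in 1:numrepsinner){
      for (j in 1:1){
        StartingPopulation2[[ij]]<-StartingPopulation[[k]][[j]]
        ij=ij+1
      }
    }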
  }
  ListTrain<-GenAlgForSubsetSelection(P=PC50WHeat, Candidates=Candidates,
    Test=Test, ntoselect=50, InitPop=StartingPopulation2, npop=100,
    nelite=10, mutprob=.5, mutintensity=1, niterations=300,
    minitbefstop=100, tabumemsize=1, plotiters=TRUE,
    lambda=1e-9, errorstat="CDMEAN", mc.cores=1)
  return(ListTrain)
}


ListTrain<-repeatgenalg(20, 3)

###test sample
deptestopt<-Wheat.Y[Wheat.Y$id%in%Test,]

##predictions by optimized sample
deptrainopt<-Wheat.Y[(Wheat.Y$id%in%ListTrain[[1]]),]

Ztrain<-model.matrix(~-1+deptrainopt$id)
Ztest<-model.matrix(~-1+deptestopt$id)

modelopt<-emmreml(y=deptrainopt$plant.height,X=matrix(1, nrow=nrow(deptrainopt), ncol=1), 
Z=Ztrain, K=Wheat.K)
predictopt<-Ztest%*%modelopt$uhat

corvecrs<-c()
for (rep in 1:300){
###predictions by a random sample of the same size
  rs<-sample(Candidates, 50)
  
  deptestrs<-Wheat.Y[Wheat.Y$id%in%Test,]
  
  deptrainrs<-Wheat.Y[(Wheat.Y$id%in%rs),]
  
  Ztrain<-model.matrix(~-1+deptrainrs$id)
  Ztest<-model.matrix(~-1+deptestrs$id)
  
  ###EMMREML is already attached at the top of the example
  modelrs<-emmreml(y=deptrainrs$plant.height,X=matrix(1, nrow=nrow(deptrainrs), ncol=1), 
  Z=Ztrain, K=Wheat.K)
  predictrs<-Ztest%*%modelrs$uhat
  corvecrs<-c(corvecrs,cor(predictrs, deptestrs$plant.height))
}
mean(corvecrs)
cor(predictopt, deptestopt$plant.height)


plot(PC50WHeat[,1],PC50WHeat[,2], col=rownames(PC50WHeat)%in%ListTrain[[1]]+1,
pch=2*rownames(PC50WHeat)%in%Test+1, xlab="pc1", ylab="pc2")

## End(Not run)