HIest: Find the joint maximum likelihood estimates of ancestry and...
In HIest: Hybrid index estimation

Description Usage Arguments Details Value Author(s) References See Also Examples

HIest provides approaches to find maximum likelihood estimates of S and H, using the likelihood functions described by Fitzpatrick (2012).

1 2	HIest(G, P, type, method = "SANN", iterations = 1000, Cscale = NULL, surf = FALSE, startgrid = 99, start = c(.5,.5), control = list(fnscale = -1, maxit = iterations))

`G`	A matrix or data frame of genetic marker data. Each column is a locus. For `type="dominant"` or `type="allele.count"`, there should be one row per individual. For `type="codominant"`, each individual is to be represented in consecutive rows (one for each allele).
`P`	A matrix or data frame with the following columns (order is important!): Locus name, Allele name, P1 allele frequency, P2 allele frequency. For `type="dominant"` or `type="allele.count"`, there should be one row per locus, giving the frequencies of the dominant or "1" allele. For `type="codominant"` there should be a separate row for each allele AND the Allele names should match the data in `G`.
`type`	A string representing the data type. The options are `"codominant"`, `"dominant"`, and `"allele.count"`.
`method`	Optimization method to search for maximum likelihood estimates of ancestry and heterozygosity. Alternatives are `"SANN"`, `"L-BFGS-B"`, `"surf"`, and `"mcmc"`. See details.
`iterations`	The desired number of MCMC steps to perform when `method="mcmc" or "SANN"`.
`Cscale`	An integer, controlling the the proposal distribution for `method = "SANN"` or `"mcmc"`. The default value is 100. Smaller values will cause the algorithm to search more broadly, but could make the search inefficient. See details.
`surf`	Logical: Should the function find starting values by evaluating likelihoods on a grid?
`startgrid`	Integer. This controls the size of the grid for `method="surf"`. It is the same as the argument `size` in the function `HIsurf`.
`start`	A vector including the starting values of S and H.
`control`	A list of options to be passed to `control` in the `optim` function. Whatever else is chosen, be sure `fnscale` is negative to make `optim` search for a maximum rather than a minimum.

Given two ancestral species or parental populations (P1 and P2), the ancestry index (S) is the proportion of an individual's alleles descending from alleles in the P1 population and the interclass heterozygosity (H) is the proportion of an individual's loci that have one allele from each ancestral population (Lynch 1991). The likelihood functions are described in Fitzpatrick (2012). The likelihood functions take advantage of the correspondence between ancestry and heterozygosity and the genomic proportions: p_{11} = proportion of one's genome that is homozygous for alleles inherited from parental lineage 1, p_{12} = H = proportion one's genome that is heterozygous for alleles inhertied from each parental lineage, and p_{22} = proportion of one's genome that is homozygous for alleles inherited from parental lineage 2.

Currently, the function provides four methods for searching the likelihood surface. method = "SANN" is probably the best; it uses the general purpose optimization function optim with its simulated annealing algorithm. For this estimation problem, a custom proposal function is passed to the option gr. This proposal function draws new genomic proportions (p_{11},p_{12},p_{22}) from a three dimensional Dirichlet distribution centered on the old genomic proportions. The concentration of the proposal distribution is controlled by Cscale; the larger this value, the more the proposal distribution is concentrated near the current state.

method = "mcmc" uses a Markov-Chain Monte Carlo with Metropolis-Hastings sampling to explore the likelihood surface. It also uses the Dirichlet proposal distribution, and could be useful (with some modification of the code) for generating posterior distributions. method = "SANN" is probaby superior for simply finding the MLE.

method = "L-BFGS-B" also uses optim, but with a quasi-Newton likelihood search algorithm to look for the maximum likelihood. This method is relatively fast, but it can miss the MLE if it is near the edge of the sample space.

"surf", finds all likelihoods on a grid defined by startgrid and chooses the maximum. This is not going to find the MLE unless the MLE happens to be one of the grid points. However, using the option surf = TRUE with the SANN or mcmc methods can improve efficiency by initiating the search at the grid point nearest the MLE.

A data frame with one row for each individual and three columns:

`S`	The maximim likelihood estimate of the ancestry index
`H`	The maximum likelihood estimate of the interclass heterozygosity
`logLik`	The maximum log-likelihood

Ben Fitzpatrick

Fitzpatrick, B. M. 2008. Hybrid dysfunction: Population genetic and quantitative genetic perspectives. American Naturalist 171:491-198.

Fitzpatrick, B. M. 2012. Estimating ancestry and heterozygosity of hybrids using molecular markers. BMC Evolutionary Biology 12:131. http://www.biomedcentral.com/1471-2148/12/131

Fitzpatrick, B. M., J. R. Johnson, D. K. Kump, H. B. Shaffer, J. J. Smith, and S. R. Voss. 2009. Rapid fixation of non-native alleles revealed by genome-wide SNP analysis of hybrid tiger salamanders. BMC Evolutionary Biology 9:176. http://www.biomedcentral.com/1471-2148/9/176

Lynch, M. 1991. The genetic interpretation of inbreeding depression and outbreeding depression. Evolution 45:622-629.

HIsurf for a likelihood surface, HIclass for likelihoods of early generation hybrid classes, HItest to compare the classification to the maximum likelihood, HILL for the basic likelihood function.

##-- A random codominant data set of 5 individuals and 5 markers with three alleles each
L <- 10
P <- data.frame(Locus=rep(1:L,each=3),Allele=rep(1:3,L),
	P1=as.vector(rmultinom(L,10,c(.7,.2,.1)))/10, 
	P2=as.vector(rmultinom(L,10,c(.1,.2,.7)))/10)
G <- matrix(nrow=10,ncol=L)
for(i in 1:L){
	G[,i] <- sample(c(1,2,3),size=10,replace=TRUE,prob=rowMeans(P[P$Locus==i,3:4]))
	}
	
HI.cod <- HIest(G,P,type="codominant",iterations=99,surf=TRUE,startgrid=20)

# this is unlikely to converge on the MLE: increase iterations and/r startgrid.

# # optional plot
# plot(c(0,.5,1,0),c(0,1,0,0),type="l",xlab=expression(italic(S)),
	# ylab=expression(italic(H[I])),lwd=2,cex.lab=1.5,cex.axis=1.5,bty="n")
# points(HI.cod$S,HI.cod$H,cex=1.5,lwd=2)
# axis(1,labels=FALSE,lwd=2);axis(2,labels=FALSE,lwd=2)

# # other examples

# ##-- Make it into allele count data (count "3" alleles)
# P.c <- P[seq(from=3,to=dim(P)[2],by=3),]
# G.c <- matrix(nrow=5,ncol=L)
# for(i in 1:5){
	# G.c[i,] <- colSums(G[c(i*2-1,i*2),]==3)
	# }

# HI.ac <- HIest(G.c,P.c,type="allele.count",iterations=500,surf=TRUE,startgrid=50)

# ##-- Make it into dominant data where allele 3 is dominant
# G.d <- replace(G.c,G.c==2,1)

# HI.dom <- HIest(G.d,P.c,type="dominant",iterations=500,surf=TRUE,startgrid=50)

# ## -- A real dataset (Fitzpatrick et al. 2009)
# data(Bluestone)
# Bluestone <- replace(Bluestone,is.na(Bluestone),-9)
# # parental allele frequencies (assumed diagnostic)
# BS.P <- data.frame(Locus=names(Bluestone),Allele="BTS",P1=1,P2=0)

# # estimate ancestry and heterozygosity
# # BS.est <-HIest(Bluestone,BS.P,type="allele.count")
# ## shortcut for diagnostic markers
# BS.est <- HIC(Bluestone)

# # calculate likelihoods for early generation hybrid classes
# BS.class <- HIclass(Bluestone,BS.P,type="allele.count")

# # compare classification with maximum likelihood estimates
# BS.test <- HItest(BS.class,BS.est)

# table(BS.test$c1)
# # all 41 are TRUE, meaning the best classification is at least 2 log-likelihood units
# # better than the next best

# table(BS.test$c2)
# # 2 are TRUE, meaning the MLE S and H are within 2 log-likelihood units of the best
# # classification, i.e., the simple classification is rejected in all but 2 cases

# table(BS.test$Best.class,BS.test$c2)
# # individuals were classified as F2-like (class 3) or backcross to CTS (class 4), but
# # only two of the F2's were credible 

# BS.test[BS.test$c2,]
# # in only one case was the F2 classification a better fit (based on AIC) than the
# # continuous model.