Description Usage Arguments Details Value Author(s) References See Also Examples
HIest
provides approaches to find maximum likelihood estimates of S and H, using the likelihood functions described by Fitzpatrick (2012).
1 2 |
G |
A matrix or data frame of genetic marker data. Each column is a locus. For |
P |
A matrix or data frame with the following columns (order is important!): Locus name, Allele name, P1 allele frequency, P2 allele frequency. For |
type |
A string representing the data type. The options are |
method |
Optimization method to search for maximum likelihood estimates of ancestry and heterozygosity. Alternatives are |
iterations |
The desired number of MCMC steps to perform when |
Cscale |
An integer, controlling the the proposal distribution for |
surf |
Logical: Should the function find starting values by evaluating likelihoods on a grid? |
startgrid |
Integer.
This controls the size of the grid for |
start |
A vector including the starting values of S and H. |
control |
A list of options to be passed to |
Given two ancestral species or parental populations (P1 and P2), the ancestry index (S) is the proportion of an individual's alleles descending from alleles in the P1 population and the interclass heterozygosity (H) is the proportion of an individual's loci that have one allele from each ancestral population (Lynch 1991). The likelihood functions are described in Fitzpatrick (2012). The likelihood functions take advantage of the correspondence between ancestry and heterozygosity and the genomic proportions: p_{11} = proportion of one's genome that is homozygous for alleles inherited from parental lineage 1, p_{12} = H = proportion one's genome that is heterozygous for alleles inhertied from each parental lineage, and p_{22} = proportion of one's genome that is homozygous for alleles inherited from parental lineage 2.
Currently, the function provides four methods for searching the likelihood surface. method = "SANN"
is probably the best; it uses the general purpose optimization function optim
with its simulated annealing algorithm. For this estimation problem, a custom proposal function is passed to the option gr
. This proposal function draws new genomic proportions (p_{11},p_{12},p_{22}) from a three dimensional Dirichlet distribution centered on the old genomic proportions. The concentration of the proposal distribution is controlled by Cscale
; the larger this value, the more the proposal distribution is concentrated near the current state.
method = "mcmc"
uses a Markov-Chain Monte Carlo with Metropolis-Hastings sampling to explore the likelihood surface. It also uses the Dirichlet proposal distribution, and could be useful (with some modification of the code) for generating posterior distributions. method = "SANN"
is probaby superior for simply finding the MLE.
method = "L-BFGS-B"
also uses optim
, but with a quasi-Newton likelihood search algorithm to look for the maximum likelihood. This method is relatively fast, but it can miss the MLE if it is near the edge of the sample space.
"surf"
, finds all likelihoods on a grid defined by startgrid
and chooses the maximum. This is not going to find the MLE unless the MLE happens to be one of the grid points. However, using the option surf = TRUE
with the SANN
or mcmc
methods can improve efficiency by initiating the search at the grid point nearest the MLE.
A data frame with one row for each individual and three columns:
S |
The maximim likelihood estimate of the ancestry index |
H |
The maximum likelihood estimate of the interclass heterozygosity |
logLik |
The maximum log-likelihood |
Ben Fitzpatrick
Fitzpatrick, B. M. 2008. Hybrid dysfunction: Population genetic and quantitative genetic perspectives. American Naturalist 171:491-198.
Fitzpatrick, B. M. 2012. Estimating ancestry and heterozygosity of hybrids using molecular markers. BMC Evolutionary Biology 12:131. http://www.biomedcentral.com/1471-2148/12/131
Fitzpatrick, B. M., J. R. Johnson, D. K. Kump, H. B. Shaffer, J. J. Smith, and S. R. Voss. 2009. Rapid fixation of non-native alleles revealed by genome-wide SNP analysis of hybrid tiger salamanders. BMC Evolutionary Biology 9:176. http://www.biomedcentral.com/1471-2148/9/176
Lynch, M. 1991. The genetic interpretation of inbreeding depression and outbreeding depression. Evolution 45:622-629.
HIsurf
for a likelihood surface, HIclass
for likelihoods of early generation hybrid classes, HItest
to compare the classification to the maximum likelihood, HILL
for the basic likelihood function.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 | ##-- A random codominant data set of 5 individuals and 5 markers with three alleles each
L <- 10
P <- data.frame(Locus=rep(1:L,each=3),Allele=rep(1:3,L),
P1=as.vector(rmultinom(L,10,c(.7,.2,.1)))/10,
P2=as.vector(rmultinom(L,10,c(.1,.2,.7)))/10)
G <- matrix(nrow=10,ncol=L)
for(i in 1:L){
G[,i] <- sample(c(1,2,3),size=10,replace=TRUE,prob=rowMeans(P[P$Locus==i,3:4]))
}
HI.cod <- HIest(G,P,type="codominant",iterations=99,surf=TRUE,startgrid=20)
# this is unlikely to converge on the MLE: increase iterations and/r startgrid.
# # optional plot
# plot(c(0,.5,1,0),c(0,1,0,0),type="l",xlab=expression(italic(S)),
# ylab=expression(italic(H[I])),lwd=2,cex.lab=1.5,cex.axis=1.5,bty="n")
# points(HI.cod$S,HI.cod$H,cex=1.5,lwd=2)
# axis(1,labels=FALSE,lwd=2);axis(2,labels=FALSE,lwd=2)
# # other examples
# ##-- Make it into allele count data (count "3" alleles)
# P.c <- P[seq(from=3,to=dim(P)[2],by=3),]
# G.c <- matrix(nrow=5,ncol=L)
# for(i in 1:5){
# G.c[i,] <- colSums(G[c(i*2-1,i*2),]==3)
# }
# HI.ac <- HIest(G.c,P.c,type="allele.count",iterations=500,surf=TRUE,startgrid=50)
# ##-- Make it into dominant data where allele 3 is dominant
# G.d <- replace(G.c,G.c==2,1)
# HI.dom <- HIest(G.d,P.c,type="dominant",iterations=500,surf=TRUE,startgrid=50)
# ## -- A real dataset (Fitzpatrick et al. 2009)
# data(Bluestone)
# Bluestone <- replace(Bluestone,is.na(Bluestone),-9)
# # parental allele frequencies (assumed diagnostic)
# BS.P <- data.frame(Locus=names(Bluestone),Allele="BTS",P1=1,P2=0)
# # estimate ancestry and heterozygosity
# # BS.est <-HIest(Bluestone,BS.P,type="allele.count")
# ## shortcut for diagnostic markers
# BS.est <- HIC(Bluestone)
# # calculate likelihoods for early generation hybrid classes
# BS.class <- HIclass(Bluestone,BS.P,type="allele.count")
# # compare classification with maximum likelihood estimates
# BS.test <- HItest(BS.class,BS.est)
# table(BS.test$c1)
# # all 41 are TRUE, meaning the best classification is at least 2 log-likelihood units
# # better than the next best
# table(BS.test$c2)
# # 2 are TRUE, meaning the MLE S and H are within 2 log-likelihood units of the best
# # classification, i.e., the simple classification is rejected in all but 2 cases
# table(BS.test$Best.class,BS.test$c2)
# # individuals were classified as F2-like (class 3) or backcross to CTS (class 4), but
# # only two of the F2's were credible
# BS.test[BS.test$c2,]
# # in only one case was the F2 classification a better fit (based on AIC) than the
# # continuous model.
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.