esth: Bayesian hybrid index estimation.
In ribailey/gghybrid: Evolutionary Analysis of Hybrids and Hybrid Zones

View source: R/esth.r

esth	R Documentation

Bayesian hybrid index estimation.

Description

Bayesian hybrid index estimation.

Usage

esth(
  data.prep.object,
  read.data.precols,
  test.subject = "INDLABEL",
  test.subject.compare = NULL,
  include.Source = TRUE,
  return.likmeans = FALSE,
  fix.subject = FALSE,
  fix.value = FALSE,
  plot.ind = NULL,
  plot.col = NULL,
  nitt,
  burnin,
  init.var = 0.002,
  init.var2 = NULL,
  prior = c(0.5, 0.5),
  print.k = 50
)

Arguments

`data.prep.object`	Name of the `data.prep` object produced by the `data.prep` function.
`read.data.precols`	A `precols` object produced by `read.data`, or a custom character vector with names of all pre-marker columns in the `data`.
`test.subject`	Character string. Name of the field identifying subjects for hybrid index estimation. Default “INDLABEL”.
`test.subject.compare`	Character string. Name of another field in the data analysis table, such as ‘POPID’, with which you wish to compare the fit of the current test subject using either AIC or waic. Used to compare model fits at differing resolutions, such as individual-level versus population-level, with the subsequent comparison being made at the lower resolution. Only needs to be declared when running the higher resolution model, and produces a warning if the declared field is at a higher resolution than the test subject. A second run of `esth` will then be needed, with `test.subject.compare` in the current model declared as `test.subject`. The two models can then be compared using either or both of the downstream functions `compare.models` (for waic) or `calc_AIC`.
`include.Source`	Logical. Whether to estimate hybrid indices for test subjects declared as parental reference. Default is `TRUE`. Ignored if there are no parental reference samples.
`return.likmeans`	Logical. Whether to return a table of likelihood calculations. Required for `compare.models`. Default is `FALSE`.
`fix.subject`	Logical default, or a character vector. Which, if any, test subjects to fix at a specific hybrid index value. If set `TRUE`, the entry for `fix.value` should be a single value to be applied to all test subjects. Alternatively, a character vector including the names of all or a subset of test subjects can be entered, in which case the same set of test subjects can be assigned unique fixed values using `fix.value`. Potentially useful for model comparison with `compare.models`.
`fix.value`	Logical default, otherwise a numeric vector of length 1 or the same length and order as the character vector declared using `fix.subject`. The fixed hybrid index value(s). Note that if all test subjects are fixed, only `nitt=2` and `burnin=0` are required.
`plot.ind`	Character vector. List of up to 6 test subjects for which you wish to plot the posterior distribution values in real time. Default `NULL`.
`plot.col`	Character vector. List of up to 6 colours to plot the `plot.ind` test subjects. Default `NULL`.
`nitt`	Numeric. The total number of MCMC iterations including burnin. At least 2000, preferably 3000, is recommended.
`burnin`	Numeric. The number of burnin MCMC iterations. Typically, 1000 is sufficient.
`init.var`	Numeric. Starting value for the variance of the parameter proposal distribution, applied for the first 100 iterations. Default is 0.002, which works well with tested data sets.
`init.var2`	Numeric. Variance of the proposal distribution to be applied after the first 100 iterations, by which time the mcmc should be sampling the posterior distribution. The default entry is `NULL`, in which case the value is calculated internally using the mean number of allele copies per test subject. The internal formula is: (1/(mean(N allele copies per test subject)/2))/10, which works well with a big variety of tested numbers of loci per test subject, up to several million.
`prior`	Numeric vector. Values of `shape1` and `shape2` parameters for the beta-distributed prior. Default is `shape1=shape2=0.5` (Jeffrey's prior). Doesn't currently allow a unique prior for each test subject. If set to `c(0,0)`, the prior is excluded from the posterior likelihood calculation, in which case the best posterior estimate is the mean (‘beta_mean’) rather than the mode (‘h_posterior_mode’), the latter of which is the best estimate in the presence of a prior.
`print.k`	Numeric. The iteration is printed to screen at multiples of this number. Default is `50`.

Details

esth estimates hybrid index using the likelihood formulae of Buerkle (2005), with the addition of a prior. The default is Jeffrey's prior (beta distribution with shape1=shape2=0.5), which testing suggests is an improvement over a uniform prior (shape1=shape2=1).

esth excludes any pre-marker columns from the output with finer resolution than the declared test.subject, and hence has the same number of rows as the number of unique test.subject values.

Set return.likmeans=TRUE if you intend to carry out model comparison based on the widely applicable information criterion (waic).

Value

list containing hi, test.subject, init.var2, and likmeans (optional). init.var2 is numeric and is included in case the user wants to try a different, possibly smaller, value, for test subjects for which mcmc failed. likmeans contains the information needed to calculate waic. hi is a data.table and data.frame containing all pre-marker columns that do not have finer resolution than the declared test.subject, as well as the following fields:

`Source`	whether the sample is from source 0 ('S0'), source 1 ('S1') or is a test individual ('TEST').
`h_posterior_mode`	the hybrid index estimate (mode of the beta-distributed posterior).
`h_cred_int_lower`	lower 95 percent credible interval (2.5 percent quantile of the posterior beta distribution).
`h_cred_int_upper`	upper 95 percent credible interval (97.5 percent quantile of the posterior beta distribution).
`beta_mean`	mean of the posterior beta distribution (included for convenience).
`beta_var`	variance of the posterior beta distribution (included for convenience).
`beta_shape1`	beta shape1 parameter estimate for the posterior beta distribution (included for convenience).
`beta_shape2`	beta shape2 parameter estimate for the posterior beta distribution (included for convenience).
`npar_compare`	see compare.models. Absent unless test.subject.compare is set.

Author(s)

Richard Ian Bailey richardianbailey@gmail.com

References

Buerkle, C. A. (2005). Maximum likelihood estimation of a hybrid index based on molecular markers. Molecular Ecology Notes, 5(3), 684-687.

Examples


## Not run: 
#Use the sparrow data with loci filtered by parental allele frequency credible interval overlap.
dat=read.data("RB_Italy_ggcline_precol_headers_haploidND2.csv",nprecol=2,MISSINGVAL=NA,NUMINDS=569)
prepdata=data.prep(data=dat$data,loci=dat$loci,alleles=dat$alleles,S0=c("Kralove","Oslo"),S1="LesinaSPANISH",precols=dat$precols,AF.CIoverlap = FALSE)

###############################################
#In the presence of parental reference samples#
###############################################
hindlabel=esth(
 data.prep.object=prepdata$data.prep,
 read.data.precols=dat$precols,
 include.Source=TRUE,	         #Leave at default TRUE if you want hybrid indices for the parental reference individuals#
 plot.ind = c("PD07-254","PD07-160","PD07-159","PI07-110","PI08-498","PH08-285"),  #posterior sampling for up to 6 test subjects can be plotted in real time#
  plot.col = c("blue","green","cyan","purple","magenta","red"),
nitt=3000,                                                  #Testing suggests nitt=3000,burnin=1000 are plenty for accurate posterior estimation#
 burnin=1000
)

#There will be a warning if any NAs are produced for the best posterior estimate (h_posterior_mode).

###################################################
#esth in the absence of parental reference samples#
###################################################

#This requires a marker.info file with correct column names and containing parental allele frequencies. 
#First create this file.
dat=read.data("RB_Italy_ggcline_precol_headers_haploidND2.csv",nprecol=2,MISSINGVAL=NA,NUMINDS=569);
prepdata=data.prep(data=dat$data,loci=dat$loci,alleles=dat$alleles,S0=c("Kralove","Oslo"),S1="LesinaSPANISH",
 precols=dat$precols,AF.CIoverlap = FALSE,return.locus.table = TRUE);
sparrowMarkers=prepdata$locus.data[,.(locus,S0_allele,S1_allele,S0.prop_0,S1.prop_0)];
setnames(sparrowMarkers,c("S0_allele","S1_allele","S0.prop_0","S1.prop_0"),c("refAllele","alternateAllele","S0.prop_r","S1.prop_r"));
#Save it to the working directory.
fwrite(sparrowMarkers,"sparrowMarkers.csv")#By default data.table's "fwrite" saves as a comma-separated file, whatever the file extension#

#Now run the workflow without specifying parental reference samples.
dat=read.data(file="RB_Italy_ggcline_precol_headers_haploidND2.csv",nprecol=2,NUMINDS=569,MISSINGVAL=NA,marker.info.file="sparrowMarkers.csv",sourceAbsent=TRUE);

#Run data.prep filtering for allele frequency difference.
prepdata= data.prep(
 data=dat$data,
 loci=dat$loci,
 sourceAbsent = TRUE,        #***Required when no parental reference samples are declared***#
 marker.info=dat$marker.info,
 alleles=dat$alleles,
 precols=dat$precols,
 min.diff = 0.2
)

#esth can now be run as above.
hindlabelx=esth(data.prep.object=prepdata$data.prep,read.data.precols=dat$precols,nitt=3000,burnin=1000)

#################################################################
#Estimating a hybrid index per locus across all TEST individuals#
#################################################################

#Also known as locus-specific ancestry.

#Use a dataset including parental reference samples, and then run per-locus hybrid index estimation on the 'TEST' 
#individuals only (i.e. excluding S0 and S1). No plots this time.
dat=read.data("RB_Italy_ggcline_precol_headers_haploidND2.csv",nprecol=2,MISSINGVAL=NA,NUMINDS=569);
prepdata=data.prep(data=dat$data,loci=dat$loci,alleles=dat$alleles,S0=c("Kralove","Oslo"),S1="LesinaSPANISH",precols=dat$precols,AF.CIoverlap = FALSE);
hindlabel=esth(
 data.prep.object=prepdata$data.prep,
 read.data.precols=dat$precols,
 test.subject="locus",             #Switch from estimating h-index per individual (INDLABEL, the default) to per locus#
 include.Source=FALSE,	          #Exclude the parental reference individuals (which are included by default) to get the h-index per locus for TEST individuals only#
 nitt=3000,
 burnin=1000
)

##############################################################################
#Running esth on one of the files produced by the 'split_data_prep' function##
##############################################################################

dat=read.data("RB_Italy_ggcline_precol_headers_haploidND2.csv",nprecol=2,MISSINGVAL=NA,NUMINDS=569);
prepdata=data.prep(data=dat$data,loci=dat$loci,alleles=dat$alleles,S0=c("Kralove","Oslo"),S1="LesinaSPANISH",precols=dat$precols,AF.CIoverlap = FALSE);

#Produce one data analysis file per individual and save to WD.
split_data_prep(data.prep.object=prepdata$data.prep,splitBy="INDLABEL",keepN=1);

#Read in the first sub-file and estimate hybrid index.
prep=fread("prepdata_INDLABEL_1.csv");
hindlabel=esth(data.prep.object=prep,read.data.precols=dat$precols,nitt=3000,burnin=1000)

## End(Not run)

ribailey/gghybrid documentation built on Feb. 2, 2024, 12:53 a.m.