esth: Bayesian hybrid index estimation.

View source: R/esth.r

esthR Documentation

Bayesian hybrid index estimation.

Description

Bayesian hybrid index estimation.

Usage

esth(
  data.prep.object,
  read.data.precols,
  test.subject = "INDLABEL",
  test.subject.compare = NULL,
  include.Source = TRUE,
  return.likmeans = FALSE,
  fix.subject = FALSE,
  fix.value = FALSE,
  plot.ind = NULL,
  plot.col = NULL,
  nitt,
  burnin,
  init.var = 0.002,
  init.var2 = NULL,
  prior = c(0.5, 0.5),
  print.k = 50
)

Arguments

data.prep.object

Name of the data.prep object produced by the data.prep function.

read.data.precols

A precols object produced by read.data, or a custom character vector with names of all pre-marker columns in the data.

test.subject

Character string. Name of the field identifying subjects for hybrid index estimation. Default “INDLABEL”.

test.subject.compare

Character string. Name of another field in the data analysis table, such as ‘POPID’, with which you wish to compare the fit of the current test subject using either AIC or waic. Used to compare model fits at differing resolutions, such as individual-level versus population-level, with the subsequent comparison being made at the lower resolution. Only needs to be declared when running the higher resolution model, and produces a warning if the declared field is at a higher resolution than the test subject. A second run of esth will then be needed, with test.subject.compare in the current model declared as test.subject. The two models can then be compared using either or both of the downstream functions compare.models (for waic) or calc_AIC.

include.Source

Logical. Whether to estimate hybrid indices for test subjects declared as parental reference. Default is TRUE. Ignored if there are no parental reference samples.

return.likmeans

Logical. Whether to return a table of likelihood calculations. Required for compare.models. Default is FALSE.

fix.subject

Logical default, or a character vector. Which, if any, test subjects to fix at a specific hybrid index value. If set TRUE, the entry for fix.value should be a single value to be applied to all test subjects. Alternatively, a character vector including the names of all or a subset of test subjects can be entered, in which case the same set of test subjects can be assigned unique fixed values using fix.value. Potentially useful for model comparison with compare.models.

fix.value

Logical default, otherwise a numeric vector of length 1 or the same length and order as the character vector declared using fix.subject. The fixed hybrid index value(s). Note that if all test subjects are fixed, only nitt=2 and burnin=0 are required.

plot.ind

Character vector. List of up to 6 test subjects for which you wish to plot the posterior distribution values in real time. Default NULL.

plot.col

Character vector. List of up to 6 colours to plot the plot.ind test subjects. Default NULL.

nitt

Numeric. The total number of MCMC iterations including burnin. At least 2000, preferably 3000, is recommended.

burnin

Numeric. The number of burnin MCMC iterations. Typically, 1000 is sufficient.

init.var

Numeric. Starting value for the variance of the parameter proposal distribution, applied for the first 100 iterations. Default is 0.002, which works well with tested data sets.

init.var2

Numeric. Variance of the proposal distribution to be applied after the first 100 iterations, by which time the mcmc should be sampling the posterior distribution. The default entry is NULL, in which case the value is calculated internally using the mean number of allele copies per test subject. The internal formula is: (1/(mean(N allele copies per test subject)/2))/10, which works well with a big variety of tested numbers of loci per test subject, up to several million.

prior

Numeric vector. Values of shape1 and shape2 parameters for the beta-distributed prior. Default is shape1=shape2=0.5 (Jeffrey's prior). Doesn't currently allow a unique prior for each test subject. If set to c(0,0), the prior is excluded from the posterior likelihood calculation, in which case the best posterior estimate is the mean (‘beta_mean’) rather than the mode (‘h_posterior_mode’), the latter of which is the best estimate in the presence of a prior.

print.k

Numeric. The iteration is printed to screen at multiples of this number. Default is 50.

Details

esth estimates hybrid index using the likelihood formulae of Buerkle (2005), with the addition of a prior. The default is Jeffrey's prior (beta distribution with shape1=shape2=0.5), which testing suggests is an improvement over a uniform prior (shape1=shape2=1).

esth excludes any pre-marker columns from the output with finer resolution than the declared test.subject, and hence has the same number of rows as the number of unique test.subject values.

Set return.likmeans=TRUE if you intend to carry out model comparison based on the widely applicable information criterion (waic).

Value

list containing hi, test.subject, init.var2, and likmeans (optional). init.var2 is numeric and is included in case the user wants to try a different, possibly smaller, value, for test subjects for which mcmc failed. likmeans contains the information needed to calculate waic. hi is a data.table and data.frame containing all pre-marker columns that do not have finer resolution than the declared test.subject, as well as the following fields:

Source

whether the sample is from source 0 ('S0'), source 1 ('S1') or is a test individual ('TEST').

h_posterior_mode

the hybrid index estimate (mode of the beta-distributed posterior).

h_cred_int_lower

lower 95 percent credible interval (2.5 percent quantile of the posterior beta distribution).

h_cred_int_upper

upper 95 percent credible interval (97.5 percent quantile of the posterior beta distribution).

beta_mean

mean of the posterior beta distribution (included for convenience).

beta_var

variance of the posterior beta distribution (included for convenience).

beta_shape1

beta shape1 parameter estimate for the posterior beta distribution (included for convenience).

beta_shape2

beta shape2 parameter estimate for the posterior beta distribution (included for convenience).

npar_compare

see compare.models. Absent unless test.subject.compare is set.

Author(s)

Richard Ian Bailey richardianbailey@gmail.com

References

Buerkle, C. A. (2005). Maximum likelihood estimation of a hybrid index based on molecular markers. Molecular Ecology Notes, 5(3), 684-687.

Examples


## Not run: 
#Use the sparrow data with loci filtered by parental allele frequency credible interval overlap.
dat=read.data("RB_Italy_ggcline_precol_headers_haploidND2.csv",nprecol=2,MISSINGVAL=NA,NUMINDS=569)
prepdata=data.prep(data=dat$data,loci=dat$loci,alleles=dat$alleles,S0=c("Kralove","Oslo"),S1="LesinaSPANISH",precols=dat$precols,AF.CIoverlap = FALSE)

###############################################
#In the presence of parental reference samples#
###############################################
hindlabel=esth(
 data.prep.object=prepdata$data.prep,
 read.data.precols=dat$precols,
 include.Source=TRUE,	         #Leave at default TRUE if you want hybrid indices for the parental reference individuals#
 plot.ind = c("PD07-254","PD07-160","PD07-159","PI07-110","PI08-498","PH08-285"),  #posterior sampling for up to 6 test subjects can be plotted in real time#
  plot.col = c("blue","green","cyan","purple","magenta","red"),
nitt=3000,                                                  #Testing suggests nitt=3000,burnin=1000 are plenty for accurate posterior estimation#
 burnin=1000
)

#There will be a warning if any NAs are produced for the best posterior estimate (h_posterior_mode).

###################################################
#esth in the absence of parental reference samples#
###################################################

#This requires a marker.info file with correct column names and containing parental allele frequencies. 
#First create this file.
dat=read.data("RB_Italy_ggcline_precol_headers_haploidND2.csv",nprecol=2,MISSINGVAL=NA,NUMINDS=569);
prepdata=data.prep(data=dat$data,loci=dat$loci,alleles=dat$alleles,S0=c("Kralove","Oslo"),S1="LesinaSPANISH",
 precols=dat$precols,AF.CIoverlap = FALSE,return.locus.table = TRUE);
sparrowMarkers=prepdata$locus.data[,.(locus,S0_allele,S1_allele,S0.prop_0,S1.prop_0)];
setnames(sparrowMarkers,c("S0_allele","S1_allele","S0.prop_0","S1.prop_0"),c("refAllele","alternateAllele","S0.prop_r","S1.prop_r"));
#Save it to the working directory.
fwrite(sparrowMarkers,"sparrowMarkers.csv")#By default data.table's "fwrite" saves as a comma-separated file, whatever the file extension#

#Now run the workflow without specifying parental reference samples.
dat=read.data(file="RB_Italy_ggcline_precol_headers_haploidND2.csv",nprecol=2,NUMINDS=569,MISSINGVAL=NA,marker.info.file="sparrowMarkers.csv",sourceAbsent=TRUE);

#Run data.prep filtering for allele frequency difference.
prepdata= data.prep(
 data=dat$data,
 loci=dat$loci,
 sourceAbsent = TRUE,        #***Required when no parental reference samples are declared***#
 marker.info=dat$marker.info,
 alleles=dat$alleles,
 precols=dat$precols,
 min.diff = 0.2
)

#esth can now be run as above.
hindlabelx=esth(data.prep.object=prepdata$data.prep,read.data.precols=dat$precols,nitt=3000,burnin=1000)

#################################################################
#Estimating a hybrid index per locus across all TEST individuals#
#################################################################

#Also known as locus-specific ancestry.

#Use a dataset including parental reference samples, and then run per-locus hybrid index estimation on the 'TEST' 
#individuals only (i.e. excluding S0 and S1). No plots this time.
dat=read.data("RB_Italy_ggcline_precol_headers_haploidND2.csv",nprecol=2,MISSINGVAL=NA,NUMINDS=569);
prepdata=data.prep(data=dat$data,loci=dat$loci,alleles=dat$alleles,S0=c("Kralove","Oslo"),S1="LesinaSPANISH",precols=dat$precols,AF.CIoverlap = FALSE);
hindlabel=esth(
 data.prep.object=prepdata$data.prep,
 read.data.precols=dat$precols,
 test.subject="locus",             #Switch from estimating h-index per individual (INDLABEL, the default) to per locus#
 include.Source=FALSE,	          #Exclude the parental reference individuals (which are included by default) to get the h-index per locus for TEST individuals only#
 nitt=3000,
 burnin=1000
)

##############################################################################
#Running esth on one of the files produced by the 'split_data_prep' function##
##############################################################################

dat=read.data("RB_Italy_ggcline_precol_headers_haploidND2.csv",nprecol=2,MISSINGVAL=NA,NUMINDS=569);
prepdata=data.prep(data=dat$data,loci=dat$loci,alleles=dat$alleles,S0=c("Kralove","Oslo"),S1="LesinaSPANISH",precols=dat$precols,AF.CIoverlap = FALSE);

#Produce one data analysis file per individual and save to WD.
split_data_prep(data.prep.object=prepdata$data.prep,splitBy="INDLABEL",keepN=1);

#Read in the first sub-file and estimate hybrid index.
prep=fread("prepdata_INDLABEL_1.csv");
hindlabel=esth(data.prep.object=prep,read.data.precols=dat$precols,nitt=3000,burnin=1000)

## End(Not run)

ribailey/gghybrid documentation built on Feb. 2, 2024, 12:53 a.m.