ggcline: Estimate genomic cline parameters, credible intervals and p...
In ribailey/gghybrid: Evolutionary Analysis of Hybrids and Hybrid Zones

ggcline

R Documentation

Estimate genomic cline parameters, credible intervals and p values.

Description

Estimate genomic cline parameters, credible intervals and p values.

Usage

ggcline(
  data.prep.object,
  esth.object,
  esth.colname = "h_posterior_mode",
  test.subject = "locus",
  poolv = FALSE,
  poolcentre = FALSE,
  include.Source = FALSE,
  return.likmeans = FALSE,
  read.data.precols,
  fix.subject.v = FALSE,
  fix.value.v,
  fix.subject.centre = FALSE,
  fix.value.centre,
  plot.test.subject = NULL,
  plot.col = NULL,
  plot.ylim = c(-5, 5),
  plot.pch.v.centre = c(1, 3),
  prior.logitcentre = c(0, sqrt(50)),
  prior.logv = c(0, sqrt(10)),
  nitt,
  burnin,
  start.v = NULL,
  start.centre = NULL,
  init.var.v = NULL,
  init.var.centre = NULL,
  init.cov.vcentre = NULL,
  print.k = 50
)

Arguments

`data.prep.object`	Name of the `data.prep` object produced by the `data.prep` function.
`esth.object`	Name of the `esth` object.
`esth.colname`	Name of the column in the `hi` object within `esth` containing the best posterior estimates for hybrid index. Default is ‘h_posterior_mode’. If hybrid index was estimated in the absence of a prior, use ‘beta_mean’ instead.
`test.subject`	Character string. Name of the field identifying subjects for genomic cline estimation. Default “locus”.
`poolv`	Logical. Whether to pool the rate parameter `v` across all test subjects. Default `FALSE`.
`poolcentre`	Logical. Whether to pool the `centre` parameter across all test subjects. Default `FALSE`.
`include.Source`	Logical. Whether to include the parental reference individuals in genomic cline estimation. Default `FALSE`.
`return.likmeans`	Logical. Whether to return a table of likelihood calculations. Required for applying the widely applicable information criterion to model comparison using `compare.models`. Default is `FALSE`.
`read.data.precols`	A `precols` object produced by `read.data`, or a custom character vector with names of all pre-marker columns in the `data`.
`fix.subject.v`	Logical default, otherwise a character vector. List of test subjects (normally locus names) for which you wish to fix rather than estimate the rate parameter `v`. Used for model comparison with `compare.models`. Set to `TRUE` to set all test subjects to the same value. Default is `FALSE`.
`fix.value.v`	Logical default, otherwise a numeric vector of length 1 or the same length as `fix.subject.v`. The fixed `v` value(s) for the test subject(s) listed in `fix.subject.v`.
`fix.subject.centre`	Logical default, otherwise a character vector. List of test subjects (normally locus names) for which you wish to fix rather than estimate the `centre` parameter . Used for model comparison with `compare.models`. Set to `TRUE` to set all test subjects to the same value. Default is `FALSE`.
`fix.value.centre`	Logical default, otherwise a numeric vector of length 1 or the same length as `fix.subject.centre`. The fixed `centre` value(s) for the test subject(s) listed in `fix.subject.centre`.
`plot.test.subject`	Character vector. List of up to 3 test subjects for which you wish to plot the posterior distribution values in real time. Rate parameter `v` is plotted with open circles, and logit(`centre`) with plus signs, both the same colour for the same test subject. Default `NULL`.
`plot.col`	Character vector. List of up to 3 colours to plot the `plot.test.subject` test subjects. Default `NULL`.
`plot.ylim`	Numeric vector. Upper and lower plot limits of the y axis. Default `c(-5,5)`.
`plot.pch.v.centre`	Numeric vector. Size of data points for the two parameters. Default `c(1,3)`.
`prior.logitcentre`	Numeric vector. Mean and sd of the normally distributed prior for logit(`centre`). Default `c(0,sqrt(50))`.
`prior.logv`	Numeric vector. Mean and sd of the normally distributed prior for ln(`v`). Default `c(0,sqrt(10))`.
`nitt`	The total number of MCMC iterations including burnin. 5,000 is recommended, 10,000 if one parameter is pooled.
`burnin`	Numeric. The number of burnin MCMC iterations. 2,000 is recommended, 5,000 if one parameter is pooled.
`start.v`	Numeric vector. Optional starting `v` values for the MCMC, which are otherwise sampled randomly. Default `NULL`.
`start.centre`	Numeric vector. Optional starting `centre` values for the MCMC, which are otherwise sampled randomly. Default `NULL`.
`init.var.v`	Numeric. Optional starting variance of the log(`v`) parameter proposal distribution. Internal value works well on tested data sets. Default `NULL`.
`init.var.centre`	Optional starting variance of the logit(`centre`) parameter proposal distribution. Internal value works well on tested data sets. Default `NULL`.
`init.cov.vcentre`	Optional starting covariance of the log(`v`) and logit(`centre`) multivariate normal proposal distribution. Internal value (0) works well on tested data sets. Default `NULL`.
`print.k`	The iteration is printed to screen on multiples of this number. Default is `50`.

Details

ggcline assumes the posterior of ln(v) is normally distributed, and the results for this parameter are the inverse logs of the estimated mean and upper and lower 95 percent credible intervals of ln(v). v is always positive and higher values for v indicate steeper clines, with v=1 being the null value. centre represents the hybrid index at which allele frequencies are half way between those of the parents. It ranges between 0 and 1 and the null value is 0.5. It is estimated as logit(centre), and the posterior mean and credible intervals are inverse logit-transformed back to the original scale.

While Fitzpatrick (2013) describes parameters, v (the relative cline slope) and u (the relative cline position), the parameter u is difficult to interpret as its range scales to v. Noting that for the hybrid index itself, u=0 and v=1, the cline centre (hybrid index value for which allele frequencies are half way between those of the parents: m in Fitzpatrick's notation) for individual loci has the relationship logit(centre)=u/v. centre is easier to interpret, and estimating it rather than u improves MCMC efficiency; hence I estimate centre rather than u.

If both parameters are fixed to the null or other values, only nitt=2 and burnin=0 are required.

For poolv and poolcentre, the options are no pooling (FALSE) or pooling across all samples in the data set (TRUE). So if for example you wish to obtain a single pooled estimate of v across two transects for each locus individually (i.e. not the same parameter estimate across all loci), the loci must be run individually by subsetting the data.prep.object to only include one locus in each run, using split_data_prep.

Value

list with at least two components, gc, a data.table and data.frame with the estimated genomic cline parameters, and test.subject, a character scalar. The optional output likmeans is needed downstream when model comparison is carried out with compare.models.

gc contains the test.subject and following fields:

`locus`	the locus name, or another chosen test.subject.
`S0.prop_1`	allele frequency of the S1 allele in the S0 source population.
`S1.prop_1`	allele frequency of the S1 allele in the S1 source population.
`exp_mean_log_v`	the best posterior estimate of v (inverse-logged mean of the posterior normal distribution for ln(v)).
`v_lower_95`	v lower 95 percent credible interval (inverse-logged 2.5 percent quantile of the posterior normal distribution for ln(v)).
`v_upper_95`	v upper 95 percent credible interval (inverse-logged 97.5 percent quantile of the posterior normal distribution for ln(v)).
`v_pvalue`	Bayesian p value for v (2*Quantile of the null value (log(1)) given the posterior normal distribution for ln(v)).
`invlogit_mean_logit_centre`	best posterior estimate of centre (inverse logit-transformed mean of the posterior normal distribution for logit(centre)).
`centre_lower_95`	centre lower 95 percent credible interval (inverse logit-transformed 2.5 percent quantile of posterior logit(centre)).
`centre_upper_95`	centre upper 95 percent credible interval (inverse logit-transformed 97.5 percent quantile of posterior logit(centre)).
`centre_pvalue`	Bayesian p value for centre (2*Quantile of the null value (logit(0.5)) given the posterior normal distribution for logit(centre)).
`mean_log_v`	posterior mean v on the normally distributed latent scale (ln(v)).
`var_log_v`	posterior variance of v on the normally distributed latent scale (ln(v)).
`mean_logit_centre`	posterior mean centre on the normally distributed latent scale (logit(centre)).
`var_logit_centre`	posterior variance of centre on the normally distributed latent scale (logit(centre)).
`cov_log_v_logit_centre`	posterior covariance between the multivariate normally distributed ln(v) and (logit(centre)).
`npar`	number of parameters for each test subject.
`sum_npar`	sum number of parameters across all test subjects.

Author(s)

Richard Ian Bailey richardianbailey@gmail.com

References

Fitzpatrick, B. M. (2013). Alternative forms for genomic clines. Ecology and evolution, 3(7), 1951-1966.

Examples


## Not run: 
###########################################################
#Read in and prepare data, and run hybrid index estimation#
###########################################################
dat=read.data("RB_Italy_ggcline_precol_headers_haploidND2.csv",nprecol=2,MISSINGVAL=NA,NUMINDS=569);
prepdata=data.prep(data=dat$data,loci=dat$loci,alleles=dat$alleles,S0=c("Kralove","Oslo"),S1="LesinaSPANISH",
 precols=dat$precols,AF.CIoverlap = FALSE);
hindlabel=esth(data.prep.object=prepdata$data.prep,read.data.precols=dat$precols,include.Source=TRUE,
 nitt=3000,burnin=1000);

#################################################################################################################
#The full set of arguments for ggcline including all defaults (comments indicate which entries are not defaults)#
#################################################################################################################
gc=ggcline(
  data.prep.object=prepdata$data.prep,    #Needs an entry#
  esth.object=hindlabel,                  #Needs an entry#
  esth.colname="h_posterior_mode",          #Default. Can be replaced with "beta_mean" from the esth object#
  test.subject = "locus",
  poolv = FALSE,
  poolcentre = FALSE,
  include.Source = TRUE,                  #Default is FALSE#
  return.likmeans = TRUE,                 #Default is FALSE#
  read.data.precols=dat$precols,          #Needs an entry#
  fix.subject.v = FALSE,
  fix.value.v,
  fix.subject.centre = FALSE,
  fix.value.centre,
  plot.test.subject = c("A2ML1_SNP1","GTF2H2"), #Plots are just to check the MCMC is working#
  plot.col = c("orange","cyan"),                #Plots are just to check the MCMC is working#
  plot.ylim = c(-3, 5),                         #Plots are just to check the MCMC is working#
  plot.pch.v.centre = c(1, 3),                  #Plots are just to check the MCMC is working#
  prior.logitcentre = c(0, sqrt(50)),         #Default#
  prior.logv = c(0, sqrt(10)),                #Default#
  nitt=5000,                                     #Needs an entry, 5000 should be sufficient for standard per-locus estimates#
  burnin=2000,                                   #Needs an entry, 2000 should be sufficient for standard per-locus estimates#
  start.v = NULL,
  start.centre = NULL,
  init.var.v = NULL,
  init.var.centre = NULL,
  init.cov.vcentre = NULL,
  print.k = 50
)

#############################################################################################
#Estimate a single v parameter across a whole chromosome, but a unique centre for each locus#
#############################################################################################

#First, add a column to prepdata$data.prep indicating the chromosome for each marker.
chrom=fread("markerchr.csv");#This contains only the loci retained in prepdata$data.prep after filtering#
setkey(prepdata$data.prep,locus);setkey(chrom,locus);
prepdata$data.prep=prepdata$data.prep[chrom];

#Make a data.prep object for the Z chromosome markers only.
prepZ=prepdata$data.prep[chrom=="Z"]

#First run ggcline for chrZ-only without pooling, for later model comparison.
gc_unpooled_z=ggcline(data.prep.object=prepZ,esth.object=hindlabel,include.Source = TRUE,
 return.likmeans = TRUE,               #***Important to set this to TRUE waic for model comparison***#
 read.data.precols=dat$precols, nitt=5000,burnin=2000)

#Run again but pooling v across all loci. The MCMC is less efficient with pooling, so using longer nitt and burnin.
gc_poolv_Z=ggcline(data.prep.object=prepZ,esth.object=hindlabel,              
 poolv = TRUE,                         #***Set to TRUE***#
 include.Source = TRUE,
 return.likmeans = TRUE,               #***Important to set this to TRUE for model comparison***#
 read.data.precols=dat$precols,
 nitt=10000,                           #Use a longer post-burnin (5000 in this case)#
 burnin=5000                           #Use a longer burnin#
)

########################################################################
#Set up model comparisons by fixing one or both parameters (no pooling)#
########################################################################

#Run ggcline with both parameters fixed to their null values. This can then be compared with the first run above (gc).
gc_v_centre_fixed=ggcline(data.prep.object=prepdata$data.prep,esth.object=hindlabel,include.Source = TRUE,
 return.likmeans = TRUE,               #***Important to set this to TRUE for model comparison***#
 read.data.precols=dat$precols,
 fix.subject.v = gc$gc$locus,          #***Character vector of locus names, here taken from the first ggcline run above***#
 fix.value.v = 1,                      #***The null parameter value on the data scale***#
 fix.subject.centre = gc$gc$locus,     #***Vector of locus names***#
 fix.value.centre = 0.5,               #***The null parameter value on the data scale***#
 nitt=2,                               #MCMC estimation not needed when both parameters are fixed for all test subjects, only nitt=2 is necessary to avoid errors#
 burnin=0                              #MCMC estimation not needed when both parameters are fixed for all test subjects#
)

#The comparison would be cleaner if we only fixed one parameter. For example, v.
gc_v_fixed=ggcline(data.prep.object=prepdata$data.prep,esth.object=hindlabel,include.Source = TRUE,
 return.likmeans = TRUE,               #***Important to set this to TRUE for model comparison***#
 read.data.precols=dat$precols,
 fix.subject.v = gc$gc$locus,          #***Vector of locus names***#
 fix.value.v = 1,                      #***The null parameter value on the data scale (could also be fixed to some other value)***#
 #fix.subject.centre = gc$gc$locus,    #Not fixing centre this time#
 #fix.value.centre = 0.5,              #Not fixing centre this time#
 nitt=5000,
 burnin=2000
)

#This can now be compared with either gc or gc_v_centre_fixed.

##################################################################
#Prepare to compare parameters estimated at differing resolutions#
##################################################################

#Run ggcline using chromosome as test.subject rather than locus. We added the column "chrom" to prepdata$data.prep earlier.
gc_by_chr=ggcline(data.prep.object=prepdata$data.prep,esth.object=hindlabel,
 test.subject = "chrom",               #***Changed from default***#
 include.Source = TRUE,return.likmeans = TRUE,read.data.precols=dat$precols,nitt=5000,burnin=2000)

#The column "chrom" is absent from the 'gc' results object created above, and must first be added to both gc$gc and gc$likmeans.
chrom=fread("markerchr.csv");#This contains only the loci retained in prepdata$data.prep#
setkey(gc$gc,locus);setkey(chrom,locus);setkey(gc$likmeans,locus);
gc$gc=gc$gc[chrom];
gc$likmeans=gc$likmeans[chrom];

#Now model comparison can be carried out at the chromosome level, between gc and gc_by_chr.

## End(Not run)

ribailey/gghybrid documentation built on Feb. 2, 2024, 12:53 a.m.