ggcline: Estimate genomic cline parameters, credible intervals and p...

View source: R/ggcline.r

ggclineR Documentation

Estimate genomic cline parameters, credible intervals and p values.

Description

Estimate genomic cline parameters, credible intervals and p values.

Usage

ggcline(
  data.prep.object,
  esth.object,
  esth.colname = "h_posterior_mode",
  test.subject = "locus",
  poolv = FALSE,
  poolcentre = FALSE,
  include.Source = FALSE,
  return.likmeans = FALSE,
  read.data.precols,
  fix.subject.v = FALSE,
  fix.value.v,
  fix.subject.centre = FALSE,
  fix.value.centre,
  plot.test.subject = NULL,
  plot.col = NULL,
  plot.ylim = c(-5, 5),
  plot.pch.v.centre = c(1, 3),
  prior.logitcentre = c(0, sqrt(50)),
  prior.logv = c(0, sqrt(10)),
  nitt,
  burnin,
  start.v = NULL,
  start.centre = NULL,
  init.var.v = NULL,
  init.var.centre = NULL,
  init.cov.vcentre = NULL,
  print.k = 50
)

Arguments

data.prep.object

Name of the data.prep object produced by the data.prep function.

esth.object

Name of the esth object.

esth.colname

Name of the column in the hi object within esth containing the best posterior estimates for hybrid index. Default is ‘h_posterior_mode’. If hybrid index was estimated in the absence of a prior, use ‘beta_mean’ instead.

test.subject

Character string. Name of the field identifying subjects for genomic cline estimation. Default “locus”.

poolv

Logical. Whether to pool the rate parameter v across all test subjects. Default FALSE.

poolcentre

Logical. Whether to pool the centre parameter across all test subjects. Default FALSE.

include.Source

Logical. Whether to include the parental reference individuals in genomic cline estimation. Default FALSE.

return.likmeans

Logical. Whether to return a table of likelihood calculations. Required for applying the widely applicable information criterion to model comparison using compare.models. Default is FALSE.

read.data.precols

A precols object produced by read.data, or a custom character vector with names of all pre-marker columns in the data.

fix.subject.v

Logical default, otherwise a character vector. List of test subjects (normally locus names) for which you wish to fix rather than estimate the rate parameter v. Used for model comparison with compare.models. Set to TRUE to set all test subjects to the same value. Default is FALSE.

fix.value.v

Logical default, otherwise a numeric vector of length 1 or the same length as fix.subject.v. The fixed v value(s) for the test subject(s) listed in fix.subject.v.

fix.subject.centre

Logical default, otherwise a character vector. List of test subjects (normally locus names) for which you wish to fix rather than estimate the centre parameter . Used for model comparison with compare.models. Set to TRUE to set all test subjects to the same value. Default is FALSE.

fix.value.centre

Logical default, otherwise a numeric vector of length 1 or the same length as fix.subject.centre. The fixed centre value(s) for the test subject(s) listed in fix.subject.centre.

plot.test.subject

Character vector. List of up to 3 test subjects for which you wish to plot the posterior distribution values in real time. Rate parameter v is plotted with open circles, and logit(centre) with plus signs, both the same colour for the same test subject. Default NULL.

plot.col

Character vector. List of up to 3 colours to plot the plot.test.subject test subjects. Default NULL.

plot.ylim

Numeric vector. Upper and lower plot limits of the y axis. Default c(-5,5).

plot.pch.v.centre

Numeric vector. Size of data points for the two parameters. Default c(1,3).

prior.logitcentre

Numeric vector. Mean and sd of the normally distributed prior for logit(centre). Default c(0,sqrt(50)).

prior.logv

Numeric vector. Mean and sd of the normally distributed prior for ln(v). Default c(0,sqrt(10)).

nitt

The total number of MCMC iterations including burnin. 5,000 is recommended, 10,000 if one parameter is pooled.

burnin

Numeric. The number of burnin MCMC iterations. 2,000 is recommended, 5,000 if one parameter is pooled.

start.v

Numeric vector. Optional starting v values for the MCMC, which are otherwise sampled randomly. Default NULL.

start.centre

Numeric vector. Optional starting centre values for the MCMC, which are otherwise sampled randomly. Default NULL.

init.var.v

Numeric. Optional starting variance of the log(v) parameter proposal distribution. Internal value works well on tested data sets. Default NULL.

init.var.centre

Optional starting variance of the logit(centre) parameter proposal distribution. Internal value works well on tested data sets. Default NULL.

init.cov.vcentre

Optional starting covariance of the log(v) and logit(centre) multivariate normal proposal distribution. Internal value (0) works well on tested data sets. Default NULL.

print.k

The iteration is printed to screen on multiples of this number. Default is 50.

Details

ggcline assumes the posterior of ln(v) is normally distributed, and the results for this parameter are the inverse logs of the estimated mean and upper and lower 95 percent credible intervals of ln(v). v is always positive and higher values for v indicate steeper clines, with v=1 being the null value. centre represents the hybrid index at which allele frequencies are half way between those of the parents. It ranges between 0 and 1 and the null value is 0.5. It is estimated as logit(centre), and the posterior mean and credible intervals are inverse logit-transformed back to the original scale.

While Fitzpatrick (2013) describes parameters, v (the relative cline slope) and u (the relative cline position), the parameter u is difficult to interpret as its range scales to v. Noting that for the hybrid index itself, u=0 and v=1, the cline centre (hybrid index value for which allele frequencies are half way between those of the parents: m in Fitzpatrick's notation) for individual loci has the relationship logit(centre)=u/v. centre is easier to interpret, and estimating it rather than u improves MCMC efficiency; hence I estimate centre rather than u.

If both parameters are fixed to the null or other values, only nitt=2 and burnin=0 are required.

For poolv and poolcentre, the options are no pooling (FALSE) or pooling across all samples in the data set (TRUE). So if for example you wish to obtain a single pooled estimate of v across two transects for each locus individually (i.e. not the same parameter estimate across all loci), the loci must be run individually by subsetting the data.prep.object to only include one locus in each run, using split_data_prep.

Value

list with at least two components, gc, a data.table and data.frame with the estimated genomic cline parameters, and test.subject, a character scalar. The optional output likmeans is needed downstream when model comparison is carried out with compare.models.

gc contains the test.subject and following fields:

locus

the locus name, or another chosen test.subject.

S0.prop_1

allele frequency of the S1 allele in the S0 source population.

S1.prop_1

allele frequency of the S1 allele in the S1 source population.

exp_mean_log_v

the best posterior estimate of v (inverse-logged mean of the posterior normal distribution for ln(v)).

v_lower_95

v lower 95 percent credible interval (inverse-logged 2.5 percent quantile of the posterior normal distribution for ln(v)).

v_upper_95

v upper 95 percent credible interval (inverse-logged 97.5 percent quantile of the posterior normal distribution for ln(v)).

v_pvalue

Bayesian p value for v (2*Quantile of the null value (log(1)) given the posterior normal distribution for ln(v)).

invlogit_mean_logit_centre

best posterior estimate of centre (inverse logit-transformed mean of the posterior normal distribution for logit(centre)).

centre_lower_95

centre lower 95 percent credible interval (inverse logit-transformed 2.5 percent quantile of posterior logit(centre)).

centre_upper_95

centre upper 95 percent credible interval (inverse logit-transformed 97.5 percent quantile of posterior logit(centre)).

centre_pvalue

Bayesian p value for centre (2*Quantile of the null value (logit(0.5)) given the posterior normal distribution for logit(centre)).

mean_log_v

posterior mean v on the normally distributed latent scale (ln(v)).

var_log_v

posterior variance of v on the normally distributed latent scale (ln(v)).

mean_logit_centre

posterior mean centre on the normally distributed latent scale (logit(centre)).

var_logit_centre

posterior variance of centre on the normally distributed latent scale (logit(centre)).

cov_log_v_logit_centre

posterior covariance between the multivariate normally distributed ln(v) and (logit(centre)).

npar

number of parameters for each test subject.

sum_npar

sum number of parameters across all test subjects.

Author(s)

Richard Ian Bailey richardianbailey@gmail.com

References

Fitzpatrick, B. M. (2013). Alternative forms for genomic clines. Ecology and evolution, 3(7), 1951-1966.

Examples


## Not run: 
###########################################################
#Read in and prepare data, and run hybrid index estimation#
###########################################################
dat=read.data("RB_Italy_ggcline_precol_headers_haploidND2.csv",nprecol=2,MISSINGVAL=NA,NUMINDS=569);
prepdata=data.prep(data=dat$data,loci=dat$loci,alleles=dat$alleles,S0=c("Kralove","Oslo"),S1="LesinaSPANISH",
 precols=dat$precols,AF.CIoverlap = FALSE);
hindlabel=esth(data.prep.object=prepdata$data.prep,read.data.precols=dat$precols,include.Source=TRUE,
 nitt=3000,burnin=1000);

#################################################################################################################
#The full set of arguments for ggcline including all defaults (comments indicate which entries are not defaults)#
#################################################################################################################
gc=ggcline(
  data.prep.object=prepdata$data.prep,    #Needs an entry#
  esth.object=hindlabel,                  #Needs an entry#
  esth.colname="h_posterior_mode",          #Default. Can be replaced with "beta_mean" from the esth object#
  test.subject = "locus",
  poolv = FALSE,
  poolcentre = FALSE,
  include.Source = TRUE,                  #Default is FALSE#
  return.likmeans = TRUE,                 #Default is FALSE#
  read.data.precols=dat$precols,          #Needs an entry#
  fix.subject.v = FALSE,
  fix.value.v,
  fix.subject.centre = FALSE,
  fix.value.centre,
  plot.test.subject = c("A2ML1_SNP1","GTF2H2"), #Plots are just to check the MCMC is working#
  plot.col = c("orange","cyan"),                #Plots are just to check the MCMC is working#
  plot.ylim = c(-3, 5),                         #Plots are just to check the MCMC is working#
  plot.pch.v.centre = c(1, 3),                  #Plots are just to check the MCMC is working#
  prior.logitcentre = c(0, sqrt(50)),         #Default#
  prior.logv = c(0, sqrt(10)),                #Default#
  nitt=5000,                                     #Needs an entry, 5000 should be sufficient for standard per-locus estimates#
  burnin=2000,                                   #Needs an entry, 2000 should be sufficient for standard per-locus estimates#
  start.v = NULL,
  start.centre = NULL,
  init.var.v = NULL,
  init.var.centre = NULL,
  init.cov.vcentre = NULL,
  print.k = 50
)

#############################################################################################
#Estimate a single v parameter across a whole chromosome, but a unique centre for each locus#
#############################################################################################

#First, add a column to prepdata$data.prep indicating the chromosome for each marker.
chrom=fread("markerchr.csv");#This contains only the loci retained in prepdata$data.prep after filtering#
setkey(prepdata$data.prep,locus);setkey(chrom,locus);
prepdata$data.prep=prepdata$data.prep[chrom];

#Make a data.prep object for the Z chromosome markers only.
prepZ=prepdata$data.prep[chrom=="Z"]

#First run ggcline for chrZ-only without pooling, for later model comparison.
gc_unpooled_z=ggcline(data.prep.object=prepZ,esth.object=hindlabel,include.Source = TRUE,
 return.likmeans = TRUE,               #***Important to set this to TRUE waic for model comparison***#
 read.data.precols=dat$precols, nitt=5000,burnin=2000)

#Run again but pooling v across all loci. The MCMC is less efficient with pooling, so using longer nitt and burnin.
gc_poolv_Z=ggcline(data.prep.object=prepZ,esth.object=hindlabel,              
 poolv = TRUE,                         #***Set to TRUE***#
 include.Source = TRUE,
 return.likmeans = TRUE,               #***Important to set this to TRUE for model comparison***#
 read.data.precols=dat$precols,
 nitt=10000,                           #Use a longer post-burnin (5000 in this case)#
 burnin=5000                           #Use a longer burnin#
)

########################################################################
#Set up model comparisons by fixing one or both parameters (no pooling)#
########################################################################

#Run ggcline with both parameters fixed to their null values. This can then be compared with the first run above (gc).
gc_v_centre_fixed=ggcline(data.prep.object=prepdata$data.prep,esth.object=hindlabel,include.Source = TRUE,
 return.likmeans = TRUE,               #***Important to set this to TRUE for model comparison***#
 read.data.precols=dat$precols,
 fix.subject.v = gc$gc$locus,          #***Character vector of locus names, here taken from the first ggcline run above***#
 fix.value.v = 1,                      #***The null parameter value on the data scale***#
 fix.subject.centre = gc$gc$locus,     #***Vector of locus names***#
 fix.value.centre = 0.5,               #***The null parameter value on the data scale***#
 nitt=2,                               #MCMC estimation not needed when both parameters are fixed for all test subjects, only nitt=2 is necessary to avoid errors#
 burnin=0                              #MCMC estimation not needed when both parameters are fixed for all test subjects#
)

#The comparison would be cleaner if we only fixed one parameter. For example, v.
gc_v_fixed=ggcline(data.prep.object=prepdata$data.prep,esth.object=hindlabel,include.Source = TRUE,
 return.likmeans = TRUE,               #***Important to set this to TRUE for model comparison***#
 read.data.precols=dat$precols,
 fix.subject.v = gc$gc$locus,          #***Vector of locus names***#
 fix.value.v = 1,                      #***The null parameter value on the data scale (could also be fixed to some other value)***#
 #fix.subject.centre = gc$gc$locus,    #Not fixing centre this time#
 #fix.value.centre = 0.5,              #Not fixing centre this time#
 nitt=5000,
 burnin=2000
)

#This can now be compared with either gc or gc_v_centre_fixed.

##################################################################
#Prepare to compare parameters estimated at differing resolutions#
##################################################################

#Run ggcline using chromosome as test.subject rather than locus. We added the column "chrom" to prepdata$data.prep earlier.
gc_by_chr=ggcline(data.prep.object=prepdata$data.prep,esth.object=hindlabel,
 test.subject = "chrom",               #***Changed from default***#
 include.Source = TRUE,return.likmeans = TRUE,read.data.precols=dat$precols,nitt=5000,burnin=2000)

#The column "chrom" is absent from the 'gc' results object created above, and must first be added to both gc$gc and gc$likmeans.
chrom=fread("markerchr.csv");#This contains only the loci retained in prepdata$data.prep#
setkey(gc$gc,locus);setkey(chrom,locus);setkey(gc$likmeans,locus);
gc$gc=gc$gc[chrom];
gc$likmeans=gc$likmeans[chrom];

#Now model comparison can be carried out at the chromosome level, between gc and gc_by_chr.

## End(Not run)

ribailey/gghybrid documentation built on Feb. 2, 2024, 12:53 a.m.