README.md

CoExpansionValidation

An R package for validating hABC co-expansion inference

Tests of synchronous community expansion are commonly conducted using hierarchical Approximate Bayesian Computation (hABC), a statistical framework for inferring the degree of concordance across species. However, this method has demonstrated limited power to accurately detect co-expansion across parameter-space and rigorous validation testing is recommended for empirical studies. This package makes such validation testing of hABC broadly accessible and accommodates multi-locus datasets.

Below we will show how to use this CoExpansionValidation R package to assess hABC performance for detecting co-expansion events across species for a given genetic dataset. It will go through an example step-by-step, showing how to generate one pseudo-observed dataset and use hABC to "infer" the number of co-expansion events. To apply this type of performance assessment in an empirical setting, many replicates of pseudo-observed datasets need to be generated for inference.

Installation

External programs that need to be installed before running:

BayeSSC is used for simulation, and msReject is used for inference. While functions in this package do not directly use hBayeSSC, steps of hyperstat calculation and running msReject are set to be identical to hBayeSSC. Future development will expand on other types of summary statistics.

To install this package, use devtools:

devtools::install_github("huatengh/CoExpansionValidation",build_vignettes = TRUE)

```{r setup} library(CoExpansionValidation) options(stringsAsFactors = F)




## __1.Generating a pseudo-observed dataset__  

### i) Determine the number of co-expansion events  

Two options:  

* user specifies the number of events and assigns species to events randomly or evenly:  

```{r specify the number of co-expansion events}
species<-1:10 # or a vector of species names, can be characters
coevents<-2
#assigning species randomly
species.assignment<-assign_species_to_events(species = species,nco.events = coevents,even = F)
species.assignment
#assigning species evenly
species.assignment<-assign_species_to_events(species = species,nco.events = coevents,even = T)
species.assignment

```{r simulate the number of co-expansion events} species<-1:10 alpha<-1.1 species.assignment<-generate_coevent_number_dirichlet(species = species,alpha = alpha,maxevent=4) species.assignment coevents<-length(species.assignment)


### ii) Sampling the time for co-expansion events

User can provide the time for each event (time from present), or randomly sample the time from a range. Buffer can be added around co-expansion events, such that they are at least, or exactly, a certain distance away from each other. No requirement on the unit of time yet--below user can provide species' generation time to turn them into number of generations.

```{r generate the expansion time}
# if user provides the times, make sure its length equals 
# the coevents specified above
exp.time<-c(30000,50000) 


#randomly draw times from a range with a set distance apart
#make sure the time range is appropriate for the number of events (specified above) and buffer (specified below)
time.range<-c(30000,50000)
buffer<-5000
first.event<-runif(min=time.range[1],max=time.range[2]-buffer*(coevents-1),n=1)
exp.time<-first.event+buffer*c(0:(coevents-1))


#randomly draw times from a range with a minimum distance apart
time.range<-c(30000,50000)
buffer<-5000
exp.time<-generate_cotime_with_buffer(time.range = time.range,nco.events =coevents,buffer =  buffer)
exp.time

User can get a vector of the sampled time for each species, making it easier to write out to a file

```{r get the expansion time for each species} x<-species_exp_time(species.assignment,exp.time) x


### iii) Simulate the "observed" summary statistics with BayeSSC 

Here, we provide a wrapper function for running BayeSCC from R. BayeSSC itself has very detail instructions on [their webpage](https://web.stanford.edu/group/hadlylab/ssc/index.html#IO).   User need to give:

* the path to the BayeSCC executable (absolute or relative to current working directory; including the executable name)

And then, there are two options:   

a) a configuration file: a table (TAB delimited file) consists of columns with following header names

__Column name__     |   __Description__
------------------- | ---------------
species(1)          | Name of species
locinum             | Number of loci
nsam                  | Number of haplotypes to be simulated
nsites              | Number of base pairs
tstv                  | % transitions
gamma                 | Gamma distribution for substitution rate heterogeneity, two numbers separated by space
mutation rate(2)      | The locus mutation rate per generation
gen(3)                | The generation length
Ne                  | The Effective population size
popratio            | The ratio between ancestral and current population size, <1 for expansion and >1 for shrinkage 
eventtime.generation(4) | The time of the historical event in unit of generation(optional)

  (1): species can appear in multiple rows if they have multiple types of loci; using one row for one locus at a time works as well. However, all loci in one species share the same generation time, population size and expansion/shrinkage history. For these columns, function below will only read  from the first row of a species.  
  (2): mutation rate (and population size, and other number columns) can take BayeSCC-style prior distributions. For example, {U:1,299} for uniform distribution between 1 and 299. see [BayeSCC webpage](https://web.stanford.edu/group/hadlylab/ssc/index.html#IO) for details.  
  (3): generation length needs to be in the same unit as the time range and buffer provided before. That is, if MYA was used, then here the generation time would need to be in millions of years ago.  
  (4): optional, if not provided, it will be calculated automatically with the generation time and previously simulated events' times.

```{r run BayeSCC from a configuration file}
path_to_bayessc<-"BayeSSC.exe" #windows version, linux or mac starts with ./
#check the conf data frame --an example of the configuration table-- included in the package
head(conf)
#read your own configuration file:
#configurefile<-"test.conf"
#conf<-read.table(configurefile,header=T,sep="\t",stringsAsFactors = F)

#code below will simulate summary statistics with BayeSSC,see the function's help for details 
simulatedobs<-runbayeSSC_with_conf(BayeSSCallocation = path_to_bayessc,conf = conf,prefix = 'temp',species.assignment = species.assignment,exp.time = exp.time,intern=F)

#the package comes with a simulatedobs data frame, user can check what it looks like
colnames(simulatedobs)
head(simulatedobs[,1:5])

b) if all loci in all species can use the same .par file for BayeSCC (all loci in a species are of the same type)

Below is a template file, already scanned in as part of the package data, templateparfile

BayeSSC//Number of population samples 1 //Population sizes {U:500000,500000} //Sample sizes 40 //Growth rates 0 //Number of migration matrices : If 0 : No migration between demes 0 //Historical event format: 1 historical event 399137 0 0 1 0.01 0 0 //mutation rate {U:0.00008,0.00008} //Number of independent loci 800 //Data type, tvts DNA 0.33 //Gamma distribution for mutation 0 0

```{r run BayeSCC from par file} path_to_bayessc<-"BayeSSC.exe" #windows version, linux or mac starts with ./

show the templateparfile included in this package

templateparfile

you can write this template out to a file and edit it with any text editor

cat(paste(templateparfile,sep = '',collapse = '\n'),"\n",sep='',file="test.par") bayessc_par_file<-'test.par' nloci<-10 gen<-1 # or can be a vector with generation time for each species

extract information from the par file to a configuration table

conf<-par_to_config(bayessc_par_file,species,nloci,gen)

run BayeSSC with the configuration table

simulatedobs<-runbayeSSC_with_conf(BayeSSCallocation = path_to_bayessc,conf = conf,prefix = 'partemp',species.assignment = species.assignment,exp.time = exp.time,intern=TRUE)

### iv) calculate hyperstats (hBaySSC) 

We calculate hyperstats from the "observed" statistics. In our paper, this step was performed with [hBayeSSC](https://github.com/UH-Bioinformatics/hBayeSSC), but it depends on python 2.X which is no longer supported. This package provides a conversion of the code to R. Here, we need


* the simulated summary statistics from previous step

Following code will calcualte hyperstats and write it out to a file that can be used for the third step of running msReject.


```{r hyperstats from hBayeSSC}
#if you wrote the observed summary stats into a file previously
obsfile<-"temp_obs_file"
simulatedobs<-read.table(obsfile,sep="\t",header=T,stringsAsFactors = F)

#calculate hyperstat across all loci
hyperstat<-calculate_hyperstat(simulatedobs)

#make a vector containing these hyperstats, the 'real' number of events,
#and expansion time for each species 
a<-rep(0,3)
a[1]<-'temp' #or any name you want to give to this pseudo-observed dataset
a[2]<-length(exp.time) #true number of events
a[3]<-length(species) #total number of species
names(a)<-c("uid","nevent","nspecies")
species.time<-species_exp_time(species.assignment = species.assignment,exp.time = exp.time)
a<-c(a,species.time)
hyperstat<-c(a,hyperstat)

#write the hyperstats to a file
outfile<-"temp_hyperstat_file"
cat(paste(hyperstat,sep='',collapse = "\t"),"\n",sep='',file = outfile)

2.ABC simulation with BayeSSC

ABC simulation is very similar to generating test data. The biggest difference is that the event time should be autogenerated according to a prior rather than directly specified by user.

For the prior on the degree of concordance, hBayeSSC used a flat, uniform distribution. Here, we adopted the PyMsbayes-style prior, using a gamma distribution for the alpha parameter of the dirichelet process. User needs to decide on the two parameters for the gamma distribution: concentrationShape and consentrationScale. Check prior selection in PyMsbayes for how to select these two parameters.

Similar to step 1, this step needs:

```{r ABC simulation with configuration file} path_to_bayessc="BayeSSC.exe" concentrationShape=20.0 concentrationscale=0.15 time.range<-c(30000,50000) buffer<-500 npod<-10 conf<-conf

running ABC simulation

reference.table<-ABC_simulation_with_conf(npod=npod,conf=conf,time.range=time.range,buffer=buffer,concentrationscale=concentrationscale,concentrationShape=concentrationShape,BayeSSCallocation=path_to_bayessc,prefix='temp',do.parallel=2,write.reference.file = F)

can set the write.reference.file = T, simulated sum stats will be append to a file named prefix_reference_table

useful for simulating large number of replicates

note that if write.reference.file = T, the function does not return the simulated data

just the path to the result file

the package comes with a toy example of reference.table

colnames(reference.table) head(reference.table[,1:5])


```{r ABC simulation from par file}
path_to_bayessc="BayeSSC.exe"
concentrationShape=20.0
concentrationscale=0.15
time.range<-c(30000,50000)
buffer<-500
npod<-100
# you can write the template out to a file and edit it with any text editor
cat(paste(templateparfile,sep = '',collapse = '\n'),"\n",sep='',file="test.par")
bayessc_par_file<-'test.par'
nloci<-10
species<-1:10
gen<-1 # or it can be a vector with one generation time for each species
conf<-par_to_config(bayessc_par_file,species,nloci,gen)
reference.table<-ABC_simulation_with_conf(npod=npod,conf=conf,time.range=time.range,buffer=buffer,concentrationscale=concentrationscale,concentrationShape=concentrationShape,BayeSSCallocation=path_to_bayessc,prefix='temp',do.parallel=2)

To simulate with uniform prior on the number of co-expansion events

```{r ABC simulation with uniform prior on the number of co-expansion events} path_to_bayessc="BayeSSC.exe" time.range<-c(30000,50000) buffer<-500 npod<-10 conf<-conf

running ABC simulation

reference.table<-ABC_simulation_uniform(npod=npod,conf=conf,time.range=time.range,buffer=buffer,BayeSSCallocation=path_to_bayessc,prefix='temp',do.parallel=2,write.reference.file = F)


## __3.Generating Inference for Test Data__  

### __i) Running msReject to get posterior samples__  

For installing/compiling msReject see [msbayes webpage](http://msbayes.sourceforge.net/). User needs to provide
* the "observed" hyperstats  
* the simulated reference data table  
* the path to msReject  
* sampling tolerance  
* prefix to the posterior file  

```{r run MsReject}
path_to_msreject<-"./msReject"
samplingtolerance<-0.5 #just a toy example, should be much smaller for real runs
prefix<-'temp'
posterior<-run_msreject(hyperstat=hyperstat,reference.table=reference.table,MsRejectallocation=path_to_msreject,samplingtolerance=samplingtolerance,prefix=prefix)

ii) Infer the number of co-expansion events

With the posterior file after running msReject, we can use the VGAM, abc and locfit package to do the final acceptance and parameter estimation, some scripts are on the hBayeSSC webpage. Here is some code for looking at the mode of event number in the posterior samples.

{r the mode of number of events} getmode <- function(v) { uniqv <- unique(v) uniqv[which.max(tabulate(match(v, uniqv)))] } getmode(posterior$nevent)



huatengh/CoExpansionValidation documentation built on Nov. 19, 2021, 11:07 a.m.