knitr::opts_chunk$set(message = FALSE,warning = FALSE)
The package GDS evolutionary analysis in R or geaR
is designed to aid with evolutionary and population genetic analysis across whole genome genotype data in the GDS
format.
Analysis can be carried out using classical approaches such as sliding windows, however the main strength of geaR is the ability to include only certain types of features in the analysis.
VCF files can be converted using the R package SeqArray
This command will convert the sample VCF file into GDS format to disk keeping only the Genotype field using 2 threads.
library(SeqArray) packageDir <- system.file("extdata", package = "geaR") seqVCF2GDS(vcf.fn = paste0(packageDir, "/sampleVCF.vcf.gz"), parallel = 2, out.fn = paste0(packageDir, "/sampleVCF.gds"), fmt.import = c("GT"), storage.option = "ZIP_RA")
A common analysis in population genetics is to subdivide the genome into equally sized windows before analysis. Below we will construct loci for four different window types.
Windows of a given size can be generated by using the windowMaker()
or SnpWindowMaker()
functions.
In order to do this, first the package is loaded and scaffold/contig length must be input.
If scaffold information is not known the necessary data.frame
can be made using the reference genome's fasta index (.fai)
library(geaR) library(readr) library(tibble) fileDir <- system.file("extdata", package = "geaR") file <- file.path(fileDir, "/sampleReferenceGenome.fa.fai") df <- readr::read_tsv(file, col_names = FALSE) x <- tibble::tibble(ID = df[[1]], length = df[[2]]) x
We can then use the contig/scaffold metadata to construct loci for a tiled or sliding window analysis.
This will construct windows using scaffold positions, however the number of genotypes in each window is likely to differ due to missing data.
Windows can then be generated by setting arguments windowSize
and stepSize
. Optionally, nCores
can also be set to split the jobs over parallel processors.
### generate 10kb tiled windows loci <- windowMaker(x = x, windowSize = 10000, stepSize = 0) loci ### generate 10kb windows sliding by 2kb loci <- windowMaker(x = x, windowSize = 10000, stepSize = 2000) loci
The R package also allows for windows to be constructed based on genotype position in the GDS ensuring the same number of genotypes in each window.
This will require the GDS to be passed to the function so first we must open a connaection to the GDS using seqOpen()
from the SeqArray package.
library(SeqArray) # open pointer to GDS on disk GDS <- seqOpen(paste0(fileDir, "/sampleVCF.gds"), allow.duplicate = TRUE) ### generate windows with the same number of SNPs in each loci <- snpWindowMaker(x = x, GDS = GDS, windowSize = 1000, stepSize = 0) loci ### generate windows with the same number of SNPs in each loci <- snpWindowMaker(x = x, GDS = GDS, windowSize = 1000, stepSize = 200)
As the analysis will be calculated across a population, we need to define the individuals and which population they belong to.
To do this, we first point to the GDS
file on disk and extract sample names as a vector.
The population data frame can then be constructed by dividing individuals into populaitons.
We will do this arbitrarily with 2 individuals in 10 populations.
# Extract sample names from GDS object samples <- seqGetData(gdsfile = GDS, var.name = "sample.id") # Define populations pops <- tibble::tibble(Sample = samples[1:20], Population = rep(paste0("P", 1:10), each = 2)) head(pops)
We will then begin constructing cog objects. Cogs form the backbone of an analysis in \code{geaR}. For this example we will carry out all analyses: Genetic diversity, Admixture, Loci output and Distance tree output Genetic diversity will calculate
Admixture will calculate
Output Loci will write loci to fasta files in a supplied directory. Output Trees will write distance trees to newick files in a supplied directory
GRanges
# Change this to your desired output directory outDir <- "~/Desktop/setupGear/" # general arguments for the analysis argCog <- makeCog(analysisType = "args", ploidy = 2, nCores = 2, minSites = 0.01, pairwiseDeletion = TRUE, removeIndels = TRUE) # set up diverstity analysis divCog <- makeCog(analysisType = "diversityFULL", stats = "all") # set up admixture cog admixCog <- makeCog(analysisType = "admixture", threePop = list(), fourPop = list(c("P1", "P2", "P3", "P10"))) # set up output loci cog outloci <- makeCog(analysisType = "outputLoci", outputDirectory = outDir, alleles = "seperate", removeIndels = TRUE) # set up output tree cog outTrees <- makeCog(analysisType = "outputTrees", outputDirectory = outDir, alleles = "seperate", removeIndels = TRUE)
We will then use a list of cogs to build a \code{gear} object for the analysis
Now you can initialise a geaR object and run an analysis
gear10kb <- makeGear(loci, populations = pops, outgroup = "P10", cogs = list(argCog,admixCog, divCog,outloci,outTrees)) gear10kb <- analyzeGear(GDS, gear = gear10kb)
library(kableExtra) kable(gear10kb@DiversityStatsFULL) %>% kable_styling() %>% scroll_box(width = "800px", height = "200px")
kable(gear10kb@AdmixtureStats) %>% kable_styling() %>% scroll_box(width = "800px", height = "200px")
gear10kb@OutputLoci gear10kb@OutputTrees
Users may wish to investigate different features across the genome. Below we will read in coding sequence from a gff file, select 4-fold degenerate codons, and incorperate into our windows.
cds <- makeFeatures(gffName = paste0(fileDir, "/sampleGFF.gff3"), feature = "gene:cds", longestIsoform = TRUE, geneIDField = "Name") kable(head(as.data.frame(cds))) %>% kable_styling() %>% scroll_box(width = "800px", height = "200px")
We can then use the extracted coding sequence to generate a database containing codon positions and residues. After this is generated we will then extract the 3rd position of 4-fold degenerate codons. However, it is not sufficient just to filter the data as there may be variation within the test dataset causing an amino acid substitution. To counter this we will validate the presence of 4-fold degenerate codons and return only those that match.
library(Biostrings) # Import genome and build the codon database genome <- readDNAStringSet(paste0(fileDir, "/sampleReferenceGenome.fa")) codons <- buildCodonDB(genome, exons = cds) # Calidate codons fourFold <- validate4FoldCodons(GDS, input = codons, pops = pops, nCores = 1) kable(head(as.data.frame(codons))) %>% kable_styling() %>% scroll_box(width = "800px", height = "200px")
These ranges can then be incorperated into out 10kb tiled windows.
In order to do this we use the mergeLoci
function specifying our windows from before and the cds we just generated.
# additive merge the two loci cdsLoci <- mergeLoci(x = loci, y = cds, overlap = "+", nCores = 1)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.