knitr::opts_chunk$set(message = FALSE,warning = FALSE)

Introduction

The package GDS evolutionary analysis in R or geaR is designed to aid with evolutionary and population genetic analysis across whole genome genotype data in the GDS format.
Analysis can be carried out using classical approaches such as sliding windows, however the main strength of geaR is the ability to include only certain types of features in the analysis.

Conversion of VCF to GDS

VCF files can be converted using the R package SeqArray

This command will convert the sample VCF file into GDS format to disk keeping only the Genotype field using 2 threads.

library(SeqArray)

packageDir <- system.file("extdata", package = "geaR")

seqVCF2GDS(vcf.fn = paste0(packageDir, "/sampleVCF.vcf.gz"),
           parallel = 2, out.fn = paste0(packageDir, "/sampleVCF.gds"),
           fmt.import = c("GT"), storage.option = "ZIP_RA")

Carrying out analysis on different window types

A common analysis in population genetics is to subdivide the genome into equally sized windows before analysis. Below we will construct loci for four different window types.

  1. Tiled window
  2. Sliding window
  3. Tiled SNP windows
  4. Sliding SNP windows

Windows of a given size can be generated by using the windowMaker() or SnpWindowMaker() functions. In order to do this, first the package is loaded and scaffold/contig length must be input. If scaffold information is not known the necessary data.frame can be made using the reference genome's fasta index (.fai)

Set up scaffold/contig meta data

library(geaR)
library(readr)
library(tibble)


fileDir <- system.file("extdata", package = "geaR")

file <- file.path(fileDir, "/sampleReferenceGenome.fa.fai")

df <- readr::read_tsv(file, col_names = FALSE)
x <- tibble::tibble(ID = df[[1]], length = df[[2]])

x

We can then use the contig/scaffold metadata to construct loci for a tiled or sliding window analysis. This will construct windows using scaffold positions, however the number of genotypes in each window is likely to differ due to missing data. Windows can then be generated by setting arguments windowSize and stepSize. Optionally, nCores can also be set to split the jobs over parallel processors.

Constructing tiled and sliding windows

### generate 10kb tiled windows
loci <- windowMaker(x = x, windowSize = 10000, stepSize = 0)

loci

### generate 10kb windows sliding by 2kb
loci <- windowMaker(x = x, windowSize = 10000, stepSize = 2000)

loci

The R package also allows for windows to be constructed based on genotype position in the GDS ensuring the same number of genotypes in each window.

This will require the GDS to be passed to the function so first we must open a connaection to the GDS using seqOpen() from the SeqArray package.

Constructing tiled and sliding snp windows

library(SeqArray)

# open pointer to GDS on disk
GDS <- seqOpen(paste0(fileDir, "/sampleVCF.gds"), allow.duplicate = TRUE) 

### generate windows with the same number of SNPs in each
loci <- snpWindowMaker(x = x, GDS = GDS, windowSize = 1000, stepSize = 0)

loci

### generate windows with the same number of SNPs in each
loci <- snpWindowMaker(x = x, GDS = GDS, windowSize = 1000, stepSize = 200)

As the analysis will be calculated across a population, we need to define the individuals and which population they belong to. To do this, we first point to the GDS file on disk and extract sample names as a vector. The population data frame can then be constructed by dividing individuals into populaitons. We will do this arbitrarily with 2 individuals in 10 populations.

Get samples from GDS then construct population metadata

# Extract sample names from GDS object
samples <- seqGetData(gdsfile = GDS, var.name = "sample.id")

# Define populations
pops <- tibble::tibble(Sample = samples[1:20], 
                       Population = rep(paste0("P", 1:10), each = 2))

head(pops)

Carrying out an analysis

We will then begin constructing cog objects. Cogs form the backbone of an analysis in \code{geaR}. For this example we will carry out all analyses: Genetic diversity, Admixture, Loci output and Distance tree output Genetic diversity will calculate

Admixture will calculate

Output Loci will write loci to fasta files in a supplied directory. Output Trees will write distance trees to newick files in a supplied directory

Calculate default statistics on window GRanges

# Change this to your desired output directory
outDir <- "~/Desktop/setupGear/"

# general arguments for the analysis
argCog <- makeCog(analysisType = "args", 
                  ploidy = 2, nCores = 2, minSites = 0.01, 
                  pairwiseDeletion = TRUE, 
                  removeIndels = TRUE)

# set up diverstity analysis
divCog <- makeCog(analysisType = "diversityFULL", 
                  stats = "all")

# set up admixture cog
admixCog <- makeCog(analysisType = "admixture", 
                    threePop = list(),
                    fourPop = list(c("P1", "P2", "P3", "P10")))

# set up output loci cog
outloci <-  makeCog(analysisType = "outputLoci", 
                    outputDirectory = outDir, 
                    alleles = "seperate", 
                    removeIndels = TRUE)

# set up output tree cog
outTrees <-  makeCog(analysisType = "outputTrees", outputDirectory = outDir, alleles = "seperate", removeIndels = TRUE)

We will then use a list of cogs to build a \code{gear} object for the analysis

Constructing the geaR object and running the analysis

Now you can initialise a geaR object and run an analysis

gear10kb <- makeGear(loci, 
                     populations = pops, 
                     outgroup = "P10", 
                     cogs = list(argCog,admixCog, divCog,outloci,outTrees))

gear10kb <- analyzeGear(GDS, gear = gear10kb)

Retreiving stats from the geaR object

library(kableExtra)
kable(gear10kb@DiversityStatsFULL) %>%
  kable_styling() %>%
  scroll_box(width = "800px", height = "200px")
kable(gear10kb@AdmixtureStats) %>%
  kable_styling() %>%
  scroll_box(width = "800px", height = "200px")
gear10kb@OutputLoci
gear10kb@OutputTrees

Constructing loci using a gff

Users may wish to investigate different features across the genome. Below we will read in coding sequence from a gff file, select 4-fold degenerate codons, and incorperate into our windows.

Setting up loci for cds using a GDS

cds <- makeFeatures(gffName = paste0(fileDir, "/sampleGFF.gff3"), 
                    feature = "gene:cds", 
                    longestIsoform = TRUE, 
                    geneIDField = "Name")

kable(head(as.data.frame(cds))) %>%
  kable_styling() %>%
  scroll_box(width = "800px", height = "200px")

We can then use the extracted coding sequence to generate a database containing codon positions and residues. After this is generated we will then extract the 3rd position of 4-fold degenerate codons. However, it is not sufficient just to filter the data as there may be variation within the test dataset causing an amino acid substitution. To counter this we will validate the presence of 4-fold degenerate codons and return only those that match.

Build codon database then validate 4-fold degenerate codons

library(Biostrings)

# Import genome and build the codon database
genome <- readDNAStringSet(paste0(fileDir, "/sampleReferenceGenome.fa"))
codons <- buildCodonDB(genome, exons = cds)

# Calidate codons
fourFold <- validate4FoldCodons(GDS, input = codons, pops = pops, nCores = 1)


kable(head(as.data.frame(codons))) %>%
  kable_styling() %>%
  scroll_box(width = "800px", height = "200px")

These ranges can then be incorperated into out 10kb tiled windows. In order to do this we use the mergeLoci function specifying our windows from before and the cds we just generated.

Combining multiple loci

# additive merge the two loci 
cdsLoci <- mergeLoci(x = loci, y = cds, overlap = "+", nCores = 1)


CMWbio/geaR documentation built on April 22, 2023, 6:23 a.m.