knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

The bacsnp package is a simple tool for the analysis of single nucleotide positions in viruses from the family Baculoviridae (https://talk.ictvonline.org/ictv-reports/ictv_online_report/dsdna-viruses/w/baculoviridae). Baculoviruses have a circular-closed and double-stranded DNA genome, which are lacking the presence of introns and exons. Baculovirus species and isolates are often sequenced by whole genome sequencing techniques, e.g. Illumina, and assembled subsequently to consensus sequences, which itself is then used for the detection of single nucleotide polymorphisms (SNP). The package bacsnp is intended to help analyzing the detected varialbe positions by loading them into R and processing them for later downstream analysis and visualisation.

Worklow for SNP calling

The idea behind bacsnp is the handling of multiple Illumina (short read) datasets that come from different sequenced baculovirus isolates or samples that have been mapped against a common reference sequence. By using a common reference, all SNPs found are directly related to each other and specificities can also be assigned to the individual positions. Detected SNP positions can be either isolate specific or specific for a group of isolates. Determine the specificity of SNP position is a feature of bacsnp.

The bacsnp package starts after calling variable positions using MPileup and sequencing data should be processed under the following requirements:

For exact worklow description and full list of parameters see Wennmann et al. (submitted).

For running the bacsnp code some R packages are recommended and loaded.

library(VariantAnnotation)
library(IRanges)
library(bacsnp)

Loading VCF file data

The bacsnp package requires to import the VCF file (v4.2) by the VariantAnnotation packge using readVcf function.

bacdata <- readVcf("bac.vcf")

Transforming VCF file data

For a easier presentation and analysis of the data, the VCF file is transformed to a data frame format.

bac.df <- bacsnp.transformation(bacdata)
head(bac.df)

The following information is stored in the individual columns:

Accessing basic SNP information

After the transformation of the VCF data into a data frame, information about the experimental setup can be assessed by standard R commands.

unique(bac.df$ISO)

The names of the sequences isolates (Iso1 and Iso2) are preceded by the letter "i". This is important later so that the names of the isolatese are not confused with the specificities of the individual positions.

The number of unique SNP positions found ban also be determined.

length(unique(bac.df$POS))

The individual SNP positions are listed below.

unique(bac.df$POS)

Filtering SNP data frame

The data can be filtered by three different criteria.

In this example the a minimal absolute read depth of 100 is required in order to consider a detected SNP position. In addition the minimal read depth of the alternative is required to be above 10 and the relative frequency of an alternative nucleotdie must be higher than 0.05 (5 percent.

bac.f <- bacsnp.filter(bac.df, min.abs.cov = 100, min.abs.alt = 10, min.rel.alt = 0.05)

Determination of SNP specificity

One of the most improtant stepts is to determine the specificities that were already mentioned above. To do this, the function bacsnp.specificity is called. It requires the data set as input and a vector that contains all isolates that are to be used for the determination of the specificities. Not all isolates need to be specified here. A smaller number is also sufficient for which the SNP specificities are determined. The specificities are then transferred to the isolates that are not taken into account.

bac.spec <- bacsnp.specificity(bac.f, c("iIso1", "iIso2"), which.rel = "REL.ALT1")

The function snp_specificity provides an element of type list with two major information. First, all combinations with SNP specificity were identified.

bac.spec$combinations

In this example 251 positions were found to be specific for the second isolate. 44 positions were variable in both isolates and only one position is determined to be specific for isolat 1 only.

The specificities are added to the data frame (GROUP.ID, NT.SPEC and SPEC).

head(bac.spec$data[, c("POS", "REF", "ALT","REL.ALT1", "SPEC", "GROUP.ID", "NT.SPEC")]) 

Visualization of SNP frequencies with bacsnp.plot

The function bacsnp.pot is based on ggplot2 can be used for visualtization of SNP positions and their frequencies.

i1 <- bac.spec$data[bac.spec$data$ISO == "iIso1",]
i2 <- bac.spec$data[bac.spec$data$ISO == "iIso2",]

bacsnp.plot(i1, col = "SPEC", genome.length = 123193, mark.repeats = FALSE)#, mark.lowGQ = 120, mark.lowQUAL = 400)
bacsnp.plot(i2, col = "SPEC", genome.length = 123193, mark.repeats = FALSE)#, mark.lowGQ = 120, mark.lowQUAL = 400)


wennj/bacsnp documentation built on Oct. 20, 2022, 11:17 p.m.