vcfR2DNAbin: Convert vcfR to DNAbin
In knausb/vcfR: Manipulate and Visualize VCF Data

vcfR2DNAbin

R Documentation

Convert vcfR to DNAbin

Description

Convert objects of class vcfR to objects of class ape::DNAbin

Usage

vcfR2DNAbin(
  x,
  extract.indels = TRUE,
  consensus = FALSE,
  extract.haps = TRUE,
  unphased_as_NA = TRUE,
  asterisk_as_del = FALSE,
  ref.seq = NULL,
  start.pos = NULL,
  verbose = TRUE
)

Arguments

`x`	an object of class chromR or vcfR
`extract.indels`	logical indicating to remove indels (TRUE) or to include them while retaining alignment
`consensus`	logical, indicates whether an IUPAC ambiguity code should be used for diploid heterozygotes
`extract.haps`	logical specifying whether to separate each genotype into alleles based on a delimiting character
`unphased_as_NA`	logical indicating how to handle alleles in unphased genotypes
`asterisk_as_del`	logical indicating that the asterisk allele should be converted to a deletion (TRUE) or NA (FALSE)
`ref.seq`	reference sequence (DNAbin) for the region being converted
`start.pos`	chromosomal position for the start of the ref.seq
`verbose`	logical specifying whether to produce verbose output

Details

Objects of class DNAbin, from the package ape, store nucleotide sequence information. Typically, nucleotide sequence information contains all the nucleotides within a region, for example, a gene. Because most sites are typically invariant, this results in a large amount of redundant data. This is why files in the vcf format only contain information on variant sites, it results in a smaller file. Nucleotide sequences can be generated which only contain variant sites. However, some applications require the invariant sites. For example, inference of phylogeny based on maximum likelihood or Bayesian methods requires invariant sites. The function vcfR2DNAbin therefore includes a number of options in attempt to accomodate various scenarios.

The presence of indels (insertions or deletions)in a sequence typically presents a data analysis problem. Mutation models typically do not accomodate this data well. For now, the only option is for indels to be omitted from the conversion of vcfR to DNAbin objects. The option extract.indels was included to remind us of this, and to provide a placeholder in case we wish to address this in the future.

The ploidy of the samples is inferred from the first non-missing genotype. All samples and all variants within each sample are assumed to be of the same ploid.

Conversion of haploid data is fairly straight forward. The options consensus and extract.haps are not relevant here. When vcfR2DNAbin encounters missing data in the vcf data (NA) it is coded as an ambiguous nucleotide (n) in the DNAbin object. When no reference sequence is provided (option ref.seq), a DNAbin object consisting only of variant sites is created. When a reference sequence and a starting position are provided the entire sequence, including invariant sites, is returned. The reference sequence is used as a starting point and variable sitees are added to this. Because the data in the vcfR object will be using a chromosomal coordinate system, we need to tell the function where on this chromosome the reference sequence begins.

Conversion of diploid data presents a number of scenarios. When the option consensus is TRUE and extract.haps is FALSE, each genotype is split into two alleles and the two alleles are converted into their IUPAC ambiguity code. This results in one sequence for each diploid sample. This may be an appropriate path when you have unphased data. Note that functions called downstream of this choice may handle IUPAC ambiguity codes in unexpected manners. When extract.haps is set to TRUE, each genotype is split into two alleles. These alleles are inserted into two sequences. This results in two sequences per diploid sample. Note that this really only makes sense if you have phased data. The options ref.seq and start.pos are used as in halpoid data.

When a variant overlaps a deletion it may be encoded by an asterisk allele (*). The GATK site covers this in a post on Spanning or overlapping deletions ]. This is handled in vcfR by allowing the user to decide how it is handled with the paramenter asterisk_as_del. When asterisk_as_del is TRUE this allele is converted into a deletion ('-'). When asterisk_as_del is FALSE the asterisk allele is converted to NA. If extract.indels is set to FALSE it should override this decision.

Conversion of polyploid data is currently not supported. However, I have made some attempts at accomodating polyploid data. If you have polyploid data and are interested in giving this a try, feel free. But be prepared to scrutinize the output to make sure it appears reasonable.

Creation of DNAbin objects from large chromosomal regions may result in objects which occupy large amounts of memory. If in doubt, begin by subsetting your data and the scale up to ensure you do not run out of memory.

Examples

library(ape)
data(vcfR_test)

# Create an example reference sequence.
nucs <- c('a','c','g','t')
set.seed(9)
myRef <- as.DNAbin(matrix(nucs[round(runif(n=20, min=0.5, max=4.5))], nrow=1))

# Recode the POS data for a smaller example.
set.seed(99)
vcfR_test@fix[,'POS'] <- sort(sample(10:20, size=length(getPOS(vcfR_test))))

# Just vcfR
myDNA <- vcfR2DNAbin(vcfR_test)
seg.sites(myDNA)
image(myDNA)

# ref.seq, no start.pos
myDNA <- vcfR2DNAbin(vcfR_test, ref.seq = myRef)
seg.sites(myDNA)
image(myDNA)

# ref.seq, start.pos = 4.
# Note that this is the same as the previous example but the variants are shifted.
myDNA <- vcfR2DNAbin(vcfR_test, ref.seq = myRef, start.pos = 4)
seg.sites(myDNA)
image(myDNA)

# ref.seq, no start.pos, unphased_as_NA = FALSE
myDNA <- vcfR2DNAbin(vcfR_test, unphased_as_NA = FALSE, ref.seq = myRef)
seg.sites(myDNA)
image(myDNA)

knausb/vcfR documentation built on Feb. 25, 2025, 1:28 p.m.