options(digits = 2)
library(strataG)

There are several functions for working with sequence data in strataG. They will either take a haploid gtypes object that contains sequences or some format that can be converted to a DNAbin (in the ape package), or multidna (in the apex package) object.

Leading and trailing N's can be removed from all sequences like this:

library(ape)
data(dolph.seqs)

i <- sample(1:10, 1)
j <- sample(1:10, 1)
x <- c(rep("n", i), dolph.seqs[[1]], rep("n", j))
x
x.trimmed <- trimNs(as.DNAbin(x))
as.character(as.list(x.trimmed))

Base frequencies for a sequence are calculated with the baseFreqs function:

bf <- baseFreqs(dolph.seqs)
bf$site.freqs[, 1:8]
bf$base.freqs

One can also identify which sites are fixed and which are variable:

fs <- fixedSites(dolph.seqs)
fs[1:20]

vs <- variableSites(dolph.seqs)
vs

Both functions take an optional set of bases to consider when evaluating whether a site is fixed or variable. For fixedSites, the function will only count those sites that are fixed in the listed bases argument. For variableSites the site is considered variable if it has those bases and is not fixed for them:

fs <- fixedSites(dolph.seqs, bases = c("c", "t"))
fs[1:20]

vs <- variableSites(dolph.seqs, bases = c("c", "t"))
vs

There are also functions to compare bases against IUPAC ambiguity codes. One can calculate the appropriate IUPAC code for a vector of nucleotides:

iupacCode(c("c", "t", "t", "c", "c"))
iupacCode(c("c", "t", "a", "c", "c"))
iupacCode(c("g", "t", "a", "c", "c"))

One can also calculate all IUPAC codes that apply to a vector of nucleotides:

validIupacCodes(c("c", "t", "t", "c", "c"))
validIupacCodes(c("c", "t", "a", "c", "c"))
validIupacCodes(c("g", "t", "a", "c", "c"))

A consensus sequence can also be easily generated:

createConsensus(dolph.seqs)

Nucleotide diversity for each site is calculaed with:

nd <- nucleotideDiversity(dolph.seqs)
head(nd)

For a stratified gtypes object, one can calculate net nucleotide divergence (Nei's dA), and distributions of between- and within-strata divergence:

# create gtypes
data(dolph.seqs)
data(dolph.strata)
rownames(dolph.strata) <- dolph.strata$id
dloop <- df2gtypes(dolph.strata[, c("id", "fine", "id")], ploidy = 1,
             schemes = dolph.strata[, c("fine", "broad")], sequences = dolph.seqs)
dloop <- labelHaplotypes(dloop, "Hap.")

# calculate divergence
nucleotideDivergence(dloop)

For stratified gtypes, one can also identify fixed differences between strata:

fixedDifferences(dloop)

Two functions have been provided to select a subset of representative sequences. The first selects the most distant sequences in order to capture the full distribution of variation. For example:

x <- as.DNAbin(dolph.seqs)
mostDistantSequences(x, num.seqs = 5)

The other function selects the most representative sequences by first clustering the sequences and selecting the sequences closest to the center of each cluster:

mostRepresentativeSequences(x, num.seqs = 5)


EricArcher/strataG documentation built on Feb. 12, 2023, 4:11 a.m.