In EricArcher/strataG: Summaries and Population Structure Analyses of Genetic Data

options(digits = 2)
library(strataG)

There are several functions for working with sequence data in strataG. They will either take a haploid gtypes object that contains sequences or some format that can be converted to a DNAbin (in the ape package), or multidna (in the apex package) object.

Leading and trailing N's can be removed from all sequences like this:

library(ape)
data(dolph.seqs)

i <- sample(1:10, 1)
j <- sample(1:10, 1)
x <- c(rep("n", i), dolph.seqs[[1]], rep("n", j))
x
x.trimmed <- trimNs(as.DNAbin(x))
as.character(as.list(x.trimmed))

Base frequencies for a sequence are calculated with the baseFreqs function:

bf <- baseFreqs(dolph.seqs)
bf$site.freqs[, 1:8]
bf$base.freqs

One can also identify which sites are fixed and which are variable:

fs <- fixedSites(dolph.seqs)
fs[1:20]

vs <- variableSites(dolph.seqs)
vs

Both functions take an optional set of bases to consider when evaluating whether a site is fixed or variable. For fixedSites, the function will only count those sites that are fixed in the listed bases argument. For variableSites the site is considered variable if it has those bases and is not fixed for them:

fs <- fixedSites(dolph.seqs, bases = c("c", "t"))
fs[1:20]

vs <- variableSites(dolph.seqs, bases = c("c", "t"))
vs

There are also functions to compare bases against IUPAC ambiguity codes. One can calculate the appropriate IUPAC code for a vector of nucleotides:

iupacCode(c("c", "t", "t", "c", "c"))
iupacCode(c("c", "t", "a", "c", "c"))
iupacCode(c("g", "t", "a", "c", "c"))

One can also calculate all IUPAC codes that apply to a vector of nucleotides:

validIupacCodes(c("c", "t", "t", "c", "c"))
validIupacCodes(c("c", "t", "a", "c", "c"))
validIupacCodes(c("g", "t", "a", "c", "c"))

A consensus sequence can also be easily generated:

createConsensus(dolph.seqs)

Nucleotide diversity for each site is calculaed with:

nd <- nucleotideDiversity(dolph.seqs)
head(nd)

For a stratified gtypes object, one can calculate net nucleotide divergence (Nei's dA), and distributions of between- and within-strata divergence:

# create gtypes
data(dolph.seqs)
data(dolph.strata)
rownames(dolph.strata) <- dolph.strata$id
dloop <- df2gtypes(dolph.strata[, c("id", "fine", "id")], ploidy = 1,
             schemes = dolph.strata[, c("fine", "broad")], sequences = dolph.seqs)
dloop <- labelHaplotypes(dloop, "Hap.")

# calculate divergence
nucleotideDivergence(dloop)

For stratified gtypes, one can also identify fixed differences between strata:

fixedDifferences(dloop)

Two functions have been provided to select a subset of representative sequences. The first selects the most distant sequences in order to capture the full distribution of variation. For example:

x <- as.DNAbin(dolph.seqs)
mostDistantSequences(x, num.seqs = 5)

The other function selects the most representative sequences by first clustering the sequences and selecting the sequences closest to the center of each cluster:

mostRepresentativeSequences(x, num.seqs = 5)

EricArcher/strataG documentation built on June 8, 2025, 2:12 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

EricArcher/strataG
Summaries and Population Structure Analyses of Genetic Data

In EricArcher/strataG: Summaries and Population Structure Analyses of Genetic Data

R Package Documentation

Browse R Packages

We want your feedback!

EricArcher/strataG Summaries and Population Structure Analyses of Genetic Data

In EricArcher/strataG: Summaries and Population Structure Analyses of Genetic Data

R Package Documentation

Browse R Packages

We want your feedback!

EricArcher/strataG
Summaries and Population Structure Analyses of Genetic Data