Genotype and sequence summaries
In strataG: Summaries and Population Structure Analyses of Genetic Data

options(digits = 2)
library(strataG)

There are several by-locus summary functions available for gtypes objects. Given some sample microsatellite data:

data(msats.g)
msats <- stratify(msats.g, "broad")
msats <- msats[, getLociNames(msats)[1:4], ]

One can calculate the following summaries:

The number of alleles at each locus:

numAlleles(msats)

The number of samples with missing data at each locus:

numMissing(msats)

which can also be expressed as a proportion of samples with missing data:

numMissing(msats, prop = TRUE)

The allelic richness, or the average number of alleles per sample:

allelicRichness(msats)
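
To make the first three summaries concrete, here is a minimal base-R sketch on a hypothetical toy data set. It does not use strataG and is not its implementation. The object names (`toy`, `loci`, `locus_alleles`) are invented for illustration, and the richness line assumes "alleles per sample" means the number of distinct alleles divided by the number of genotyped samples.

```r
# Hypothetical diploid genotypes: two allele columns per locus, NA = missing
toy <- data.frame(
  id   = c("s1", "s2", "s3", "s4"),
  L1.1 = c(100, 100, 102, NA),
  L1.2 = c(102, 100, 104, NA),
  L2.1 = c(200, 200, 200, 202),
  L2.2 = c(200, 202, 200, 202)
)
loci <- c("L1", "L2")

# Pool both allele columns of one locus into a single vector
locus_alleles <- function(df, locus) {
  c(df[[paste0(locus, ".1")]], df[[paste0(locus, ".2")]])
}

# Number of distinct alleles observed at each locus
num_alleles <- sapply(loci, function(l) {
  x <- locus_alleles(toy, l)
  length(unique(x[!is.na(x)]))
})

# Number of samples with a missing genotype at each locus
num_missing <- sapply(loci, function(l) {
  sum(is.na(toy[[paste0(l, ".1")]]) | is.na(toy[[paste0(l, ".2")]]))
})

# The same count expressed as a proportion of all samples
prop_missing <- num_missing / nrow(toy)

# Assumed definition: distinct alleles per genotyped sample
richness <- num_alleles / (nrow(toy) - num_missing)

print(rbind(num_alleles, num_missing, prop_missing, richness))
```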

The observed and expected heterozygosity:

# observed
heterozygosity(msats, type = "observed")

# expected
heterozygosity(msats, type = "expected")
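
The two quantities can be sketched by hand for a single locus. This base-R example uses hypothetical data and textbook definitions: observed heterozygosity as the proportion of genotyped samples that are heterozygous, and expected heterozygosity as one minus the sum of squared allele frequencies. Whether strataG applies a small-sample correction is not stated here, so the corrected value is shown separately as an assumption.

```r
# Hypothetical genotypes at one locus; NA marks a missing genotype
a1 <- c("A", "A", "B", "C", NA)
a2 <- c("B", "A", "B", "A", NA)

genotyped <- !is.na(a1) & !is.na(a2)

# Observed: proportion of genotyped samples carrying two different alleles
h_obs <- mean(a1[genotyped] != a2[genotyped])

# Expected: 1 - sum(p^2), with p the allele frequencies
alleles <- c(a1[genotyped], a2[genotyped])
p <- table(alleles) / length(alleles)
h_exp <- 1 - sum(p^2)

# Assumption: Nei's (1978) unbiased estimator, scaling by 2n / (2n - 1),
# where 2n is the number of sampled allele copies
n_copies <- length(alleles)
h_exp_unbiased <- h_exp * n_copies / (n_copies - 1)

print(c(observed = h_obs, expected = h_exp, unbiased = h_exp_unbiased))
```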

The proportion of alleles that are unique (present in only one sample):

propUniqueAlleles(msats)
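
As an illustration of what "present in only one sample" means, the following base-R sketch counts the samples that carry each allele at one hypothetical locus. It is a sketch of the idea, not strataG's code. Note that a homozygote for a private allele still counts as a single carrier.

```r
# Hypothetical genotypes at one locus for four samples
ids <- c("s1", "s2", "s3", "s4")
a1  <- c("A", "A", "B", "D")
a2  <- c("B", "A", "C", "D")

# Long format: one row per allele copy
long <- data.frame(
  id = rep(ids, 2),
  allele = c(a1, a2),
  stringsAsFactors = FALSE
)

# Number of distinct samples carrying each allele
carriers <- tapply(long$id, long$allele, function(x) length(unique(x)))

# Proportion of alleles carried by exactly one sample
prop_unique <- mean(carriers == 1)

print(carriers)
print(prop_unique)
```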

The value of theta based on heterozygosity:

theta(msats)
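
The text does not say which mutation model relates heterozygosity to theta. As a hedged illustration, the sketch below inverts the two standard equilibrium relationships: the infinite alleles model, H = theta / (1 + theta), and the stepwise mutation model, H = 1 - 1 / sqrt(1 + 2 * theta). The heterozygosity values are hypothetical, and which formula strataG uses should be checked against its documentation.

```r
# Hypothetical expected heterozygosities for two loci
h <- c(L1 = 0.50, L2 = 0.75)

# Infinite alleles model: H = theta / (1 + theta)
theta_iam <- h / (1 - h)

# Stepwise mutation model: H = 1 - 1 / sqrt(1 + 2 * theta)
theta_smm <- (1 / (1 - h)^2 - 1) / 2

print(rbind(theta_iam, theta_smm))
```

For the same heterozygosity, the stepwise model gives the larger theta, so the choice of model matters for microsatellite data.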

These measures are all calculated by the summarizeLoci function and returned as a matrix. Setting by.strata = TRUE calculates the measures for each stratum separately and returns a list with one element per stratum:

summarizeLoci(msats)
summarizeLoci(msats, by.strata = TRUE)

One can also obtain the allele frequencies for each locus, overall and by stratum, with:

alleleFreqs(msats)
alleleFreqs(msats, by.strata = TRUE)
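
The overall and by-stratum tabulations can be sketched in base R with `table` and `prop.table`. The data and the stratum names below are hypothetical, and the output layout is not meant to match what alleleFreqs returns.

```r
# Hypothetical genotypes at one locus, with a stratum for each sample
strata <- c("coastal", "coastal", "offshore", "offshore")
a1 <- c("A", "A", "B", "B")
a2 <- c("B", "A", "B", "C")

# One entry per allele copy, with the stratum repeated to match
alleles <- c(a1, a2)
stratum <- rep(strata, 2)

# Overall allele counts and proportions
overall <- table(alleles)
overall_prop <- prop.table(overall)

# Counts by stratum, and proportions within each stratum (columns sum to 1)
by_strata <- table(allele = alleles, stratum = stratum)
by_strata_prop <- prop.table(by_strata, margin = 2)

print(overall_prop)
print(by_strata_prop)
```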

The dupGenotypes function identifies samples that have the same or nearly the same genotypes. The number (or proportion) of loci that two samples must share to be considered duplicates is set by the num.shared argument. The returned data.frame lists the loci at which the two samples mismatch so they can be reviewed.

# Find samples that share alleles at 2/3rds of the loci
dupGenotypes(msats, num.shared = 0.66)
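
The pairwise comparison behind this kind of check can be sketched in base R. The example below is a simplified illustration on hypothetical data, not strataG's algorithm: it treats a locus as shared only when the two genotype strings are identical (alleles are assumed to be written in sorted order), and it reports the mismatching loci for each pair.

```r
# Hypothetical genotypes, one row per sample, alleles sorted within a genotype
geno <- rbind(
  s1 = c(L1 = "100/102", L2 = "200/200", L3 = "150/152"),
  s2 = c(L1 = "100/102", L2 = "200/200", L3 = "150/154"),
  s3 = c(L1 = "104/104", L2 = "200/202", L3 = "156/158")
)
min_shared <- 0.66  # proportion of loci that must match

# Every unordered pair of samples
pairs <- t(combn(rownames(geno), 2))
result <- data.frame(
  id1 = pairs[, 1],
  id2 = pairs[, 2],
  prop_shared = NA_real_,
  mismatches = NA_character_,
  stringsAsFactors = FALSE
)

for (i in seq_len(nrow(pairs))) {
  same <- geno[pairs[i, 1], ] == geno[pairs[i, 2], ]
  result$prop_shared[i] <- mean(same)
  result$mismatches[i] <- paste(names(same)[!same], collapse = ", ")
}

# Keep only the pairs that meet the threshold
dups <- result[result$prop_shared >= min_shared, ]
print(dups)
```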

For each sequence, the start and end positions and the numbers of N's and indels can be generated with the summarizeSeqs function:

library(ape)
data(dolph.seqs)
seq.smry <- summarizeSeqs(as.DNAbin(dolph.seqs))
head(seq.smry)
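
To show what such a per-sequence summary involves, here is a base-R sketch on two hypothetical aligned sequences stored as plain character strings. It assumes that leading and trailing gaps are padding, so N's and indels are counted only between the first and last base. The function name and the column names are invented for illustration and need not match the output of summarizeSeqs.

```r
# Hypothetical aligned sequences; "-" is a gap, "n" an ambiguous base
seqs <- c(
  h1 = "--acgtnacgt-a--",
  h2 = "acgtacgtacgtacg"
)

summarize_one <- function(s) {
  chars <- strsplit(s, "")[[1]]
  bases <- which(chars != "-")
  start <- min(bases)          # first non-gap position
  end <- max(bases)            # last non-gap position
  inner <- chars[start:end]    # ignore leading and trailing padding
  c(
    start = start,
    end = end,
    length = end - start + 1,
    num.ns = sum(inner == "n"),
    num.indels = sum(inner == "-")
  )
}

# One row per sequence
smry <- t(sapply(seqs, summarize_one))
print(smry)
```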

Base frequencies can be generated with baseFreqs:

bf <- baseFreqs(as.DNAbin(dolph.seqs))

# nucleotide frequencies by site
bf$site.freq[, 1:15]

# overall nucleotide frequencies
bf$base.freqs
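
The two tables can be reproduced in miniature with base R. This sketch uses three hypothetical aligned sequences and counts only the four unambiguous bases. It illustrates the by-site and overall tabulations and is not baseFreqs itself.

```r
# Hypothetical aligned sequences of equal length
seqs <- c(h1 = "acgt", h2 = "acga", h3 = "atga")
bases <- c("a", "c", "g", "t")

# Character matrix: one row per sequence, one column per site
mat <- do.call(rbind, strsplit(seqs, ""))

# Base counts at each site (rows = bases, columns = sites)
site_freq <- vapply(
  seq_len(ncol(mat)),
  function(j) as.integer(table(factor(mat[, j], levels = bases))),
  integer(length(bases))
)
dimnames(site_freq) <- list(base = bases, site = seq_len(ncol(mat)))

# Overall base counts and proportions across all sites
overall <- table(factor(mat, levels = bases))
overall_prop <- overall / sum(overall)

print(site_freq)
print(overall_prop)
```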

Sequences can be scanned for low-frequency substitutions with lowFreqSubs:

lowFreqSubs(as.DNAbin(dolph.seqs), min.freq = 2)
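
The idea of flagging rare bases at a site can be sketched as follows. This base-R example uses hypothetical sequences and assumes that a base is flagged when its count at a site is below the threshold, so a threshold of 2 flags singletons. The exact comparison lowFreqSubs applies to min.freq is not stated here and should be checked in its documentation.

```r
# Hypothetical aligned sequences; h4 carries a singleton "t" at site 2
seqs <- c(h1 = "acgt", h2 = "acgt", h3 = "acgt", h4 = "atgt")
mat <- do.call(rbind, strsplit(seqs, ""))
min_freq <- 2  # assumed rule: flag bases seen fewer than min_freq times

flagged <- do.call(rbind, lapply(seq_len(ncol(mat)), function(site) {
  counts <- table(mat[, site])
  rare <- names(counts)[counts < min_freq]
  hits <- which(mat[, site] %in% rare)
  if (length(hits) == 0) return(NULL)   # nothing rare at this site
  data.frame(
    id = rownames(mat)[hits],
    site = site,
    base = mat[hits, site],
    freq = as.integer(counts[mat[hits, site]]),
    stringsAsFactors = FALSE
  )
}))

print(flagged)
```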

Unusual sequences can be identified by plotting likelihoods based on pairwise distances:

data(dolph.haps)
sequenceLikelihoods(as.DNAbin(dolph.haps))

All of the above functions can be run at once with the qaqc function. Only those functions appropriate to the data type (haploid or diploid) will be run. A file is written for each output, labelled either with the @description slot of the gtypes object or with the optional label argument of the function.