In EricArcher/strataG: Summaries and Population Structure Analyses of Genetic Data

library(strataG)

Raw data

Raw genotype data is most easily loaded from disk when it is stored as a text file, typically in comma-delimited (.csv) format. The standard R functions read.table or read.csv can do this. For .csv files, strataG also provides the readGenData function, a wrapper for read.csv that sets commonly used values for missing data and removes blank lines.

gen.data <- readGenData("msats.csv")
str(gen.data)
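
To give a rough idea of what such a wrapper does, here is a minimal base R sketch. The file contents and the particular set of missing-data codes are assumptions made for illustration; see ?readGenData for the values the function actually uses.

```r
# Write a tiny hypothetical genotype file to a temporary location
csv.file <- tempfile(fileext = ".csv")
writeLines(c(
  "id,strata,LocA.1,LocA.2",
  "ind1,north,120,124",
  "",
  "ind2,south,,122",
  "ind3,south,NA,?"
), csv.file)

# Plain read.csv with explicit missing-data codes (an assumed set)
# and blank lines skipped
raw.df <- read.csv(
  csv.file,
  na.strings = c("NA", "", "?"),
  blank.lines.skip = TRUE,
  stringsAsFactors = FALSE
)
str(raw.df)
```

The blank line is dropped and the empty, "NA", and "?" cells are all read as NA, which is the kind of cleanup the wrapper is described as providing.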

For sequence data stored in FASTA format, the read.fasta function is available, which is a wrapper for the read.dna function in the ape package with standard FASTA arguments set. This will create a DNAbin object in the workspace:

fname <- system.file("extdata/dolph.seqs.fasta", package = "strataG")
x <- read.fasta(fname) 
x

For sequences stored in other formats, read.dna should be used directly.

Construction

For most functions in strataG, you will need to load your data into a gtypes object. gtypes is an R S4 class with several slots that are fully described in ?gtypes.
The easiest way to create a gtypes object is with the df2gtypes() function. This function assumes that you have a matrix or data.frame with columns for individual ids, stratification, and locus data. You then specify the columns in the data.frame where this information can be found. df2gtypes() can be used for data with multiple alleles per locus, like this:

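# load the example stratification and microsatellite data
data(dolph.strata)
data(dolph.msats)

# create a single data.frame with the msat data and stratification
msats.merge <- merge(dolph.strata, dolph.msats, all.y = TRUE)
str(msats.merge)

# create the gtypes object (the description belongs here, not in merge())
msats.fine <- df2gtypes(msats.merge, ploidy = 2, id.col = 1, strata.col = 3, loc.col = 5, description = date())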
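
Before calling df2gtypes(), it can help to confirm that the column positions line up with the data. The sketch below uses a small hypothetical data.frame (not one of the package datasets) and assumes that, for diploid data, the alleles of each locus sit in consecutive columns starting at loc.col.

```r
# Hypothetical diploid data: two loci, each stored in two consecutive columns
geno.df <- data.frame(
  id     = c("ind1", "ind2", "ind3"),
  strata = c("north", "north", "south"),
  LocA.1 = c(120, 122, 120),
  LocA.2 = c(124, 122, 126),
  LocB.1 = c(88, 90, 88),
  LocB.2 = c(90, 90, 92)
)

ploidy  <- 2
loc.col <- 3  # first allele column of the first locus

# Allele columns should come in complete sets of `ploidy` columns per locus
n.allele.cols <- ncol(geno.df) - loc.col + 1
stopifnot(n.allele.cols %% ploidy == 0)
n.loci <- n.allele.cols / ploidy
n.loci
```

With this layout, the matching call would be df2gtypes(geno.df, ploidy = 2, id.col = 1, strata.col = 2, loc.col = 3).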

...or for haploid data, like this:

data(dolph.seqs)

seq.df <- dolph.strata[c("id", "broad", "id")]
colnames(seq.df)[3] <- "D-loop"
dl.g <- df2gtypes(seq.df, ploidy = 1, sequences = dolph.seqs)
dl.g

Note that since each sequence in dolph.seqs is for a given individual, the num.ind and num.haplotypes values are the same for both strata. In order to convert the sequences to unique haplotypes, use the labelHaplotypes() function:

dl.haps <- labelHaplotypes(dl.g)
dl.haps
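
The idea behind collapsing per-individual sequences into haplotypes can be sketched in base R. This is only an illustration with hypothetical sequences and exact string matching; it is not how labelHaplotypes() is implemented, and the real function may treat ambiguous bases differently.

```r
# Hypothetical sequences, one per individual (ind1 and ind3 are identical)
ind.seqs <- c(
  ind1 = "ACGTTA",
  ind2 = "ACGTCA",
  ind3 = "ACGTTA",
  ind4 = "ACGACA"
)

# Unique sequences become haplotypes; each individual gets a haplotype label
hap.seqs <- unique(ind.seqs)
names(hap.seqs) <- paste0("Hap.", seq_along(hap.seqs))
ind.haps <- names(hap.seqs)[match(ind.seqs, hap.seqs)]
names(ind.haps) <- names(ind.seqs)

length(ind.seqs)  # number of individuals
length(hap.seqs)  # number of distinct haplotypes
ind.haps
```

After this step the number of haplotypes can be smaller than the number of individuals, which is the difference visible when comparing dl.g with dl.haps.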

sequence2gtypes - Convert DNA sequences

The sequence2gtypes() function creates an unstratified gtypes object with just a set of DNA sequences:

data(dolph.haps)

haps.g <- sequence2gtypes(dolph.haps)
haps.g

If you have a vector that identifies strata designations for the sequences, that can be supplied as well:

# extract and name the stratification scheme
strata <- dolph.strata$fine
names(strata) <- dolph.strata$id

# create the gtypes object
dloop.fine <- sequence2gtypes(dolph.seqs, strata, seq.names = "dLoop",
  description = "dLoop: fine-scale stratification")
dloop.fine

Note that stratification is generally provided for individuals, thus if you want to stratify the resulting gtypes object from sequence2gtypes(), one sequence for each individual should be provided, rather than just a set of unique haplotypes.
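
Because the strata vector in the example is named by individual id, a quick check that those ids agree with the sequence names can catch mismatches early. The ids below are hypothetical and the check uses base R only.

```r
# Hypothetical sequence names and a strata vector named by individual id
seq.ids <- c("ind1", "ind2", "ind3", "ind4")
strata.vec <- c(ind1 = "Coastal", ind2 = "Coastal", ind4 = "Offshore")

# Individuals with a sequence but no stratum
no.stratum <- setdiff(seq.ids, names(strata.vec))
no.stratum

# Individuals with a stratum but no sequence
no.sequence <- setdiff(names(strata.vec), seq.ids)
no.sequence
```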

Conversions from other packages

There are conversion functions for data objects from several other popular R packages, such as adegenet (genind), pegas (loci), and phangorn (phyDat).

library(adegenet)
# from example(df2genind)
df <- data.frame(locusA=c("11","11","12","32"),
                 locusB=c(NA,"34","55","15"),
                 locusC=c("22","22","21","22"))
row.names(df) <- .genlab("genotype",4)
obj <- df2genind(df, ploidy=2, ncode=1)
obj

# convert to gtypes
gi.g <- genind2gtypes(obj)
gi.g

Accessor functions

There are several functions for getting basic information from a gtypes object (see ?accessors):

getNumInd(g): The number of individuals.
getNumLoci(g): The number of loci.
getNumStrata(g): The number of strata in the current scheme.
getIndNames(g): The names of the individuals.
getLociNames(g): The names of the loci or genes.
getStrataNames(g): The names of the strata in the current scheme.
getPloidy(g): The ploidy of each locus.
getStrata(g): The current strata to which each individual belongs.
getSchemes(g): A data frame of potential stratification schemes.
getSequences(g): The sequences stored in a haploid object.
getDescription(g): The text description of the object.
getOther(g): The list used to store other information about the object.

Some functions are available for modifying values in the object as well. These are replacement functions, used in the form setStrata(g) <- value:

setStrata(g) <- value: Replace the vector of strata assignments.
setSchemes(g) <- value: Replace the data.frame of potential stratification schemes.
setDescription(g) <- value: Replace the label describing the object.
setOther(g) <- value: Replace the optional data stored in the @other slot.

Subsetting/Indexing

A gtypes object can be subset using the standard R '[' indexing operation with three indices: [i, j, k]. The first (i) specifies the desired individuals, the second (j) the loci to return, and the third (k) the strata. All standard R indexing operations involving numerical, character, or logical vectors work for each argument. For example, to return 10 random individuals:

sub.msats <- msats.fine[sample(getNumInd(msats.fine), 10), , ]
sub.msats

...or to return specific loci:

sub.msats <- sub.msats[, c("D11t", "EV37", "TV7"), ]
sub.msats

...or some loci in a specific stratum:

sub.msats <- msats.fine[, c("Ttr11", "D11t"), "Coastal"]
sub.msats

Summary

Several functions defined for gtypes objects provide summaries for individuals (summarizeInds()), loci (summarizeLoci()), and sequences (summarizeSeqs()):

summarizeLoci(msats.fine)
summarizeInds(msats.fine)

Stratifying samples

You can specify the stratification scheme when creating a gtypes object as in the examples above. Once a gtypes object has been created, you can also change the stratification scheme, either by supplying a new vector for the @strata slot:

# randomly stratify individuals to two populations
data(msats.g)
msats <- msats.g
new.strata <- sample(c("Pop1", "Pop2"), getNumInd(msats), replace = TRUE)
names(new.strata) <- getIndNames(msats)
setStrata(msats) <- new.strata
msats

or, if there is a stratification scheme data.frame in the @schemes slot, by using the stratify function to choose a stratification scheme:

# choose "broad" stratification scheme
msats <- stratify(msats, "broad")
msats

You can update the @schemes slot with a data.frame like this:

new.schemes <- getSchemes(msats)
new.schemes$ran.pop <- sample(c("Pop5", "Pop6"), getNumInd(msats), replace = TRUE)
setSchemes(msats) <- new.schemes

NOTE: Filling or changing the @schemes slot does not affect the current stratification of the samples. You must then select a new stratification scheme or fill the @strata slot as above.

stratify(msats, "ran.pop")

If some samples should be unstratified (excluded from any stratified analyses), they should have NAs in the appropriate position in the @strata slot. For example:

# unstratify a random 10 samples
x <- getStrata(msats)
x[sample(getIndNames(msats), 10)] <- NA
# assign the modified vector back so the change takes effect
setStrata(msats) <- x
msats

You can also randomly permute the current stratification scheme using the permuteStrata() function like this:

msats <- stratify(msats, "fine")

# original
msats

# permuted
ran.msats <- permuteStrata(msats)
ran.msats

NOTE: Only samples assigned to strata are permuted with permuteStrata. Those not assigned (NAs) remain unassigned.
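
The behaviour described in the note can be illustrated with a short base R sketch. The strata vector below is hypothetical, and this is only an approximation of the idea, not the code inside permuteStrata().

```r
set.seed(1)  # for a reproducible illustration

# Hypothetical strata assignments; ind3 is unstratified
strata.vec <- c(ind1 = "A", ind2 = "A", ind3 = NA, ind4 = "B", ind5 = "B")

# Shuffle labels among assigned individuals only
assigned <- !is.na(strata.vec)
perm.vec <- strata.vec
perm.vec[assigned] <- sample(strata.vec[assigned])
perm.vec
```

The unassigned individual stays NA and the number of samples per stratum is unchanged; only which individuals carry which label can differ.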

Exporting

The allelic data in a gtypes object can be converted back to a matrix or data frame with as.matrix() and as.data.frame():

gen.mat <- as.matrix(msats)
head(gen.mat)

By default, each allele is placed in its own column. To get a matrix with one locus per column and alleles separated by a specified character, set the one.col argument to TRUE:

gen.mat <- as.matrix(msats, one.col = TRUE)
head(gen.mat)
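
The difference between the two layouts can be seen in a small base R sketch. The allele matrix and the "/" separator are assumptions chosen for illustration; they are not taken from the as.matrix() method itself.

```r
# Hypothetical two-column-per-locus allele matrix for a single locus
allele.mat <- matrix(
  c("120", "124",
    "122", "122",
    NA,    NA),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("ind1", "ind2", "ind3"), c("LocA.1", "LocA.2"))
)

# Collapse the two allele columns into one genotype column,
# keeping missing genotypes as NA
sep <- "/"
genotype <- ifelse(
  is.na(allele.mat[, 1]) | is.na(allele.mat[, 2]),
  NA,
  paste(allele.mat[, 1], allele.mat[, 2], sep = sep)
)
one.col.mat <- cbind(LocA = genotype)
one.col.mat
```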

The contents of a gtypes object can be written to a file with the writeGtypes() function. This will write a .csv file with the allelic information and a .fasta file for any sequence data if it exists.

EricArcher/strataG documentation built on Feb. 12, 2023, 4:11 a.m.
