Data import and manipulation in poppr version `r packageVersion('poppr')`

knitr::opts_chunk$set(message = FALSE, warning = FALSE, tidy = FALSE)
knitr::opts_chunk$set(fig.align = "center", fig.show = 'asis')
knitr::opts_knit$set(eval.after = 'fig.cap')
print_command <- function(funk){
  fargs <- formals(funk)

  lapply(names(fargs), function(arg_name, fargs){
    arg <- utils::capture.output(print(fargs[[arg_name]]))
    if (arg == ""){
      fargs[[arg_name]] <<- as.symbol(arg_name)
      names(fargs)[names(fargs) == arg_name] <<- ""
    }
  }, fargs)
  fargs$call <- as.symbol(funk)
  fargs <- fargs[c(length(fargs), 1:(length(fargs) - 1))]
  return(as.call(fargs))
}

Abstract

Poppr provides open-source, cross-platform tools for quick analysis of population genetic data enabling focus on data analysis and interpretation. While there are a plethora of packages for population genetic analysis, few are able to offer quick and easy analysis of populations with mixed reproductive modes. Poppr's main advantage is the ease of use and integration with other packages such as adegenet and vegan, including support for novel methods such as clone correction, multilocus genotype analysis, calculation of Bruvo's distance, and the index of association. New features in version 2.0 include generation of minimum spanning networks with reticulation, calculation of the index of association for genomic data, and filtering multilocus genotypes based on genetic distance.





Introduction {#intro}

Purpose {#intro:purpose}

Poppr is an R package with convenient functions for analysis of genetic data with mixed modes of reproduction including sexual and clonal reproduction. While there are many R packages in CRAN and other repositories with tools for population genetic analyses, few are appropriate for populations with mixed modes of reproduction. There are several stand-alone programs that can handle these types of data sets, but they are often platform specific and often only accept specific data types. Furthermore, a typical analysis often involves switching between many programs, and converting data to each specific format.

Poppr is designed to make analysis of populations with mixed reproductive modes more streamlined and user friendly so that the researcher using it can focus on data analysis and interpretation. Poppr allows analysis of haploid and diploid dominant/co-dominant marker data including microsatellites, Single Nucleotide Polymorphisms (SNP), and Amplified Fragment Length Polymorphisms (AFLP). To avoid creating yet another file format that is specific to a program, poppr was created on the backbone of the popular R package adegenet and can take all the file formats that adegenet can take (Genpop, Genetix, Fstat, and Structure) and newly introduces compatibility with GenAlEx formatted files (exported to CSV). This means that anything you can analyze in adegenet can be further analyzed with poppr.

The real power of poppr is in the data manipulation and analytic tools. Poppr has the ability to bootstrap, clone-censor, and subset data sets. With poppr you can also quickly calculate Bruvo's distance, the index of association, and easily determine which multilocus genotypes are shared across populations.

In version 2.0, tools for genomic data were introduced [@kamvar2015novel]. These tools allow researchers to define what it means to be a clone lost in a sea of genomic data, generate bootstrapped dendrograms with any genetic distance, and calculate minimum spanning networks with reticulations to reveal the underlying population structure of your clonal data.

Resources

This vignette will cover all of the material you need to know to efficiently analyze data in poppr. For information on methods of analysis (eg. index of association, distance measures, AMOVA, ...), please read the manual pages provided for each function.

As poppr expanded from version 1.0, the vignette also expanded to be 80+ pages. As a result, it became clear that over 22,000 words was less of a manual and more of a novella with a terrible plot. To remedy this, this vignette will focus only on data manipulation and a separate vignette, "algo", has been written to give algorithmic details of analyses introduced with poppr.

As of spring 2014, Drs. Niklaus J. Gr├╝nwald, Sydney E. Everhart, and I have co- authored a primer on using R for population genetic analysis. It is located at http://grunwaldlab.github.io//Population_Genetics_in_R and the source code can be found on our github site.

Getting Help

If you have any questions or feedback, feel free to send a message to the poppr forum at http://groups.google.com/group/poppr. You can submit bug reports there or on our github site: https://github.com/grunwaldlab/poppr/issues

Acknowledgments

Much thanks goes to Sydney E. Everhart for alpha testing, beta testing, feature requests, proofreading, data contribution, and moral support throughout the writing of this package and manual. Thanks also to Brian Knaus, Ignazio Carbone, David Cooke, Corine Schoebel, Jane Stewart, and Zaid Abdo for beta testing and feedback.

The following data sets are included in poppr:

Citation

To cite poppr, please type in your R console:

citation(package = "poppr")

The formal publication for the first version of poppr was published in the journal PeerJ: http://peerj.com/articles/281/ [@kamvar2014poppr]. The new features in version 2.0 were published in Frontiers https://doi.org/10.3389/fgene.2015.00208 [@kamvar2015novel].

Installation and Quick Start

Installation {#intro:install}

This manual assumes you have installed R. If you have not, please refer to The CRAN home page at https://cran.r-project.org/. We also recommend the Rstudio IDE (http://www.rstudio.com/), which allows the user to view the R console, environment, scripts, and plots in a single window.

From CRAN

To install poppr from CRAN, select "Package Installer" from the menu "Packages & Data" in the GUI or type:

install.packages("poppr", repos = "http://cloud.r-project.org/")

All dependencies will also be installed. In the unfortunate case this does not work, please consult https://cran.r-project.org/doc/manuals/R-admin.html#Installing-packages.

From GitHub

GitHub is a repository where you can find all stable and development versions of poppr.

Since poppr contains C code, it needs to be compiled, which means that you need a working C compiler. If you are on Linux, you should have that, but if you are on Windows or OSX, you might need to download some special tools:

Windows

: Download Rtools: https://cran.r-project.org/bin/windows/Rtools/

OSX

: Download Xcode: https://developer.apple.com/xcode

To install from GitHub, you do not need to download the tarball since there is a package called devtools that will download and install the package for you directly from GitHub. After you have installed all dependencies (see above section), you should download devtools:

install.packages("devtools")

Now you can execute the command install_github() with the user and repository name:

devtools::install_github("grunwaldlab/poppr")

Quick start {#intro:qstart}

The author assumes that if you have reached this point in the manual, then you have successfully installed R and poppr. Before proceeding, you should be aware that R is case sensitive. This means that the words "Case" and "case" are different. You should also know where your R package library is located. In this section, you will learn how to locate a file, import it to R, and make a first analysis using the poppr() function.

What or where is my R package library?

R is as powerful as it is through a community of people who submit extra code called "packages" to help it do specific things. These packages live in a certain place on your computer called an R library. You can find out where this library is by typing .libPaths()

Importing a file into R involves you knowing the path to your file and then typing that into R's console. getfile() will help provide a point and click interface for selecting a file. This is simply a tool to help you get started. As you get better with R, you might feel that you don"t need it at all.

First, tell your computer to search R's library, load the poppr package, and use getfile():

x <- list(files = "/path/to/R/poppr/files/rootrot.csv", path = "/path/to/R/poppr/files")
library("poppr")
x <- getfile()

A pop up window will appear like this[^1]:

knitr::include_graphics(path = "getfile.png")

You should navigate to your R library and select the file called "rootrot.csv". If you don"t know where your the poppr package lives, you can find it by typing find.package("poppr") into the R command line. Once we select a file, the file name and its path will be stored in the variable, x. We can confirm that by typing x into R"s command line.

x

Here we can see that x is a list with two entries: $files shows the files you selected and $path shows the path to those files.

We will use x$files to access the file. The file is in the GenAlEx format, so we will import it using read.genalex() and then analyze it with the function poppr() to get a table of diversity indices per population.

myData  <- read.genalex(x$files)
myData
options(width=90)
myData <- read.genalex(system.file("files/rootrot.csv", package="poppr"))
popdata <- poppr(myData)

The output of poppr() was assigned to the variable popdata, so let"s look at the data.

popdata

The fields you see in the output include:

These fields are further described in the function poppr. You can access the help page for poppr by typing ?poppr in your R console.

One thing to note about this output is the NaN in the column labeled SE. In R, NaN means "Not a number". This is produced from calculation of a standard error based on rarefaction analysis. Occasionally, this calculation will encounter a situation in which it must attempt to take a square root of a negative number. Since the root of any negative number is not defined in the set of real numbers, it must therefore have an imaginary component, $i$. Unfortunately, R will not represent the imaginary components of numbers unless you specifically tell it to do so. By default, R represents these as NaN.

Import and Data types

Importing data into poppr {Get out of my dreams and into my R} {#intro:import}

How does R represent data? {Objective: data}

Working with data in R means that these data have to be stored inside an "object", which is stored in the computer"s memory. Because of this, it"s important to know the difference between a file and an object. When anyone talking about importing a file into R, they are referring to a spreadsheet or text file that lives in a folder on your hard drive. Spreadsheet files (saved as *.csv files) are normally imported through the R function read.table(). The output of read.table() is a data.frame. A data.frame is an object represented in your computer's memory. This means that it only exists for as long as R is running.

The good thing about having objects stored in memory is that you can manipulate them in any way and not affect the source of those data. Since R is a scripted language (instead of point-and-click), any of your manipulations can be saved in a separate R file that can be easily adapted to new data. Of course, most data are not going to be entered into R manually. Usually they will be formatted in a manner that can be read by popular population genetics programs.

As previously mentioned, since poppr is based on adegenet, it's possible to read in the following file formats into a genind object with the function import2genind():

Here, we introduce a new way of importing data into a genind or genclone object from a GenAlEx formatted file.

Function: read.genalex {#intro:import:read.genalex}

A very popular program for population genetics is GenAlEx (http://biology.anu.edu.au/GenAlEx/Welcome.html) [@Peakall:2012; @Peakall:2006]. GenAlEx runs within the Excel environment and can be very powerful in its analyses. Poppr has added the ability to read *.CSV files[^2] produced in the GenAlEx format. It can handle data types containing regions and geographic coordinates, but currently cannot import allelic frequency data from GenAlEx. Using the poppr function read.genalex() will import your data into adegenet"s genind object or poppr"s genclone object (more information on that below). For ways of formatting a GenAlEx file, see the manual here: http://biology.anu.edu.au/GenAlEx/Download_files/GenAlEx%206.5%20Guide.pdf

Below is an example of the GenAlEx format. We will use the data set called microbov from the adegenet package to generate it. The data contains three demographic factors: Country, Species and Breed contained within the @other slot (detailed in the 'other' slot). We will first set these as the population strata, define the population as the combination of the strata, and then save a file to the desktop.

library("poppr")
data(microbov)
strata(microbov) <- data.frame(other(microbov)) # set the strata
microbov
genind2genalex(microbov, file = "~/Desktop/microbov.csv")
cat("Extracting the table ... Writing the table to ~/Desktop/microbov.csv ... Done.")
mictext <- paste(
  "The first 15 individuals and 4 loci of the microbov data set. The",
  "first column contains the individual names, the second column contains",
  "the population names, and each subsequent column represents",
  "microsatellite genetic data. Highlighted in red is a list of populations",
  "and their relative sizes."
  )
knitr::include_graphics(path = "unmod_dat.png")

The GenAlEx format contains individuals in rows and loci in columns. Individual data begins at row 4. Column A always contains individual names and column B defines the population of each individual. Notice here that the three demographic factors from the data have been concatenated with a "_". This allows us to import more than one population factor to use as hierarchical levels in a genclone object.

The First three rows contain information pertaining to the global data set. The only important information for poppr is the information contained in row 3 and the first three columns of row 1.

      **A**        **B**               **C**               **D**

1 # of Loci # of Individuals # of Populations Pop1 Size ... 2 - - - Pop1 Name ... 3 - - Locus 1 ...

Highlighted in red in figure above are definitions of the number of populations and their respective sizes. As this is redundant information, we can remove it. Below is an example of a valid data set that can be imported into poppr.

mictext <- paste(
  "The first 15 individuals and 4 loci of the microbov data set. This is the same",
  "figure as above, however the populations and counts have been removed from the",
  "header row and the third number in the header has been replaced by 1" 
  )
knitr::include_graphics(path = "mod_dat.png")

All GenAlEx formatted data can be imported with the command read.genalex(), detailed below:


funk <- "read.genalex"
print_command(funk)

Note that regional data and geographic data are not mutually exclusive. You can have both in one file, just make sure that they are on the same sheet and that the geographic data is always placed after all genetic and demographic data.

We have a short example of GenAlEx formatted data with no geographic or regional formatting. We will first see where the data is using the command system.file()

system.file("files/rootrot.csv", package="poppr")
paste("/path/to/R/library/poppr/files/rootrot.csv")

Now import the data into poppr like so:

rootrot <- read.genalex(system.file("files/rootrot.csv", package="poppr"))

Executing rootrot shows that this file is now in genclone format and can be used with any function in poppr and adegenet

rootrot

Other ways of importing data {#intro:import:other}

Adegenet already supports the import of FSTAT, Structure, Genpop, and Genetix formatted files, so if you have data in those formats, you can import them using the function import2genind. For sequence data, check if you can use read.dna() from the ape package to import your data. If you can, then you can use the adegenet function DNAbin2genind(). If you don"t have any of these formats handy, you can still import your data using R"s read.table along with df2genind() from adegenet. For more information, see adegenet's "Getting Started" vignette.

Function: genind2genalex (exporting data) {#intro:import:genind2genalex}

Of course, being able to export data is just as useful as being able to import it, so we have this handy little function that will write a GenAlEx formatted file to wherever you desire.\ WARNING: This will overwrite any file that exists with the same name.


funk <- "genind2genalex"
print_command(funk)

First, a simple example for the rootrot data we demonstrated in an earlier section:

genind2genalex(rootrot, "~/Desktop/rootrot.csv")
cat("Extracting the table ... Writing the table to ~/Desktop/rootrot.csv ... Done.\n")

Here's an example of exporting the nancycats data set into GenAlEx format with geographic information. If we look at the nancycats geographic information, we can see it"s coordinates for each population, but not each individual:

data(nancycats)
other(nancycats)$xy

To export it:

genind2genalex(nancycats, "~/Desktop/nancycats_pop_xy.csv", geo = TRUE)
cat("Extracting the table ... Writing the table to ~/Desktop/nancycats_pop_xy.csv ... Done.\n")

If we wanted to assign a geographic coordinate to each individual, we can use this trick knowing that there are 17 populations in the data set:

nan2           <- nancycats
other(nan2)$xy <- other(nan2)$xy[pop(nan2), ]
tail(other(nan2)$xy)

Now we can export it to a different file.

genind2genalex(nan2, "~/Desktop/nancycats_inds_xy.csv", geo = TRUE)
cat("Extracting the table ... Writing the table to ~/Desktop/nancycats_inds_xy.csv ... Done.\n")

Getting to know adegenet"s genind object {#intro:genind}

Since poppr was built around adegenet"s framework, it is important to know how adegenet stores data in the genind object, as that is the object used by poppr. To create a genind object, adegenet takes a data frame of genotypes (rows) across multiple loci (columns) and converts them into a matrix of individual allelic counts at each locus [@Jombart:2008].

For example, Let's say we had data with 3 diploid individuals each with 3 loci that had 3, 4, and 5 allelic states respectively:

  locus1    locus2    locus3

1 101/101   201/201   301/302
2 102/103   202/203   301/303
3 102/102   203/204   304/305

The resulting genind object would contain a matrix that has 3 rows and 12 columns. Below is a schematic of what that would look like. Each column represents a separate allele, each row represents an individual and each color represents a different locus.

library("adegenet")
df <- data.frame(list(locus1=c("101/101", "102/103", "102/102"),
                      locus2=c("201/201", "202/203", "203/204"),
                      locus3=c("301/302", "301/303", "304/305")))
dat <- tab(df2genind(df, sep="/"))
tdat <- dat
tdat[] <- 1
x <- barplot(tdat, axes = FALSE, axisnames = FALSE)
barplot(rep(3, 12), col = rep(rainbow(3, alpha = 0.5), 3:5), axes = FALSE, add = TRUE)
axis(2, at = 1:3 - 0.5, labels = 1:3, tick = FALSE)
axis(3, at = x, labels = colnames(tdat), tick = FALSE)
axis(1, at = c(2, 6.125, 11.5), labels = names(df), tick = FALSE)

When we look at the data derived from table above, we see that we have a matrix of individual allele counts at each locus.

dat

At each locus, the allele counts for each individual sum to the ploidy, $p$. Homozygotes are denoted as having an allele count of $p$ at a single allele within a locus, while heterozygotes have their allele counts represented as $<p$ where $p$ = ploidy across multiple alleles in a single locus. Along with this matrix, are elements that define the names of the individuals, loci, alleles, and populations. If you wish to know more, see the adegenet "Getting Started" manual.

The other slot {#intro:genind:other}

The other slot is a place in the genind object that can be used to store useful information about the data. We saw earlier that it could store demographic information, now let's explore a different example. Bruvo's distance is based off of a stepwise mutation model for microsatellites. This requires us to know the length of the repeat of each locus. We could store the repeat lengths in a separate variable in our R environment, but we are at risk of losing that. One way to prevent it from being lost would be to place it in the "other" slot. For the purpose of this example, we will use the "nancycats" data set from the adegenet package and assume that it has di-nucleotide repeats at all of its loci.

data(nancycats) # Load the data
other(nancycats) # geographical coordinates
repeats <- rep(2, nLoc(nancycats)) #nLoc = number of loci
repeats
other(nancycats)$repeat_lengths <- repeats
other(nancycats) # two items named xy and repeat_lengths

The genclone object {send in the clones} {#intro:genclone}

The genclone class was defined in order to make working with clonal organisms more intuitive. It is built off of the genind object and has dedicated slots for defined multilocus genotypes. The name genclone refers to the fact that it has the ability to handle genotypes of clonal organisms (but it is also used for sexual populations).

In previous versions of poppr, the genclone object contained a hierarchy slot as well. This slot was moved to adegenet and its name was changed to "strata". This slot allows you to carry around several definitions for populations in the same data set.

The function as.genclone allows the user to convert a genind object to a genclone object. The following example will demonstrate that the genclone object is an extension of the genind object as well as the advantages of having populations pre-defined in your data set.

Function: as.genclone {#intro:genclone:as.genclone}


funk <- "as.genclone"
print_command(funk)

Let"s show an example of a genclone object. First, we will take an existing genind object and convert it using the function as.genclone (We can also use the function read.genalex()to import as genclone or genind objects). We will use the Aeut data set because it is a clonal data set that has a simple population strata [@Grunwald:2006]. The data set is here: https://doi.org/10.6084/m9.figshare.877104 and it is AFLP data of the root rot pathogen Aphanomyces euteiches collected from two different fields in NW Oregon and W Washington, USA. These fields were divided up into subplots from which samples were collected. The fields represent the population and the subplots represent the subpopulation. Let"s take a look at what the genind object looks like:

library("poppr")
data(Aeut)
Aeut

We can see that there is a data frame in the @other slot called "population_hierarchy". This is left over from poppr version 1.x behavior. Since the genind object now has a @strata slot, we can use it to set the stratification (previously called "hierarchy").

strata(Aeut) <- other(Aeut)$population_hierarchy[-1]
Aeut

Now we can convert this to a genclone object, which will store information about our multilocus genotypes for us.

agc <- as.genclone(Aeut)
agc

If we wanted to, we could also convert it back to a genind object.

genclone2genind(agc)

About polyploid data {#intro:import:polyploid}

WARNING

Treat polyploid data with care. Please read this section carefully and consult the help pages for all functions mentioned here.

With diploid or haploid data, genotypes are unambiguous. It is often clear when it is homozygous or heterozygous. With polyploid data, genotypes can be ambiguous. For example, a tetraploid individual with the apparent genotype of A/B could actually have one of three genotypes: A/A/A/B, A/A/B/B, or A/B/B/B. This ambiguity prevents a researcher from accurately calling all alleles present. In adegenet, it was previously difficult to import polyploid data because of this ambiguity as data was required to be unambiguous or missing.

A solution to this problem is to code missing alleles as "0". An example of this is found within the Pinf data set in poppr [@goss2014irish]. First, we will look at where we have polyploid allele calls.

data(Pinf)
Pinf
ptab <- info_table(Pinf, type = "ploidy", plot = TRUE)

We look at the last six samples over two loci, Pi63 (3 alleles, triploid) and Pi70 (3 alleles, diploid) to examine how the data is represented.

Pinf[loc = locNames(Pinf)[9:10]] %>% tab() %>% tail()

Each column in this data represents a different allele at a particular locus. Pi63.148 is the allele 148 at locus Pi63. Each row is an individual. The numbers represent the fraction of a given allele that makes up the individual genotype at a particular locus. What we can see here is that the number of columns is 8 when we expect only 6 (2 loci x 3 alleles). The first allele at each locus is 000. Let's take a look at the data in a human-readable format.

Pinfdf <- genind2df(Pinf, sep = "/")
tail(Pinfdf[10:11])

It"s more clear now that we have a data set of tetraploid individuals where some genotypes appear diploid (000/000/157/157) and some appear triploid (000/148/151/157). The tetraploid genotype is padded with zeroes to make up the difference in ploidy.

This method allows Bruvo's Distance [@Bruvo:2004] and the Index of Association [@Brown:1980; @Smith:1993; @Agapow:2001] to work with polyploids as they specifically recognize the zeroes as being missing data. A side effect, unfortunately is that the extra zeroes appear as extra alleles. As this affects all frequency-based statistics (except for the ones noted above), the user should reformat their data set with the function recode_polyploids, which will remove the zeroes.

Pinf_rc <- recode_polyploids(Pinf, newploidy = TRUE)
Pinf_rc # Notice that the new ploidy is accounted for.
Pinf_rc[loc = locNames(Pinf_rc)[9:10]] %>% tab() %>% tail

Below, we show the observed genotypes:

Pinfrcdf <- genind2df(Pinf_rc, sep = "/")
tail(Pinfrcdf[10:11])

If you have imported your data as recoded polyploid data, you can use the argument "addzero" to fill out the ploidy:

Pinf_rc[loc = locNames(Pinf_rc)[9:10]] %>%
  recode_polyploids(addzero = TRUE) %>%
  tab() %>%
  tail()

Data Manipulation {#data.manip}

One tedious aspect of population genetic analysis is the need for repeated data manipulation. Poppr includes novel functions for clone- censoring your data sets, removing genotypes or loci with missing data, removing uninformative loci, and shuffling populations.

Replace or remove missing data {Inside the golden days of missing data} {#data.manip:missing}

A data set without missing data is always ideal, but often not achievable. The poppr function missingno exists to handle missing data. Missing data can mean different things based on your data type. For microsatellites, missing data might represent any source of error that could cause a PCR product to not amplify in gel electrophoresis, which may or may not be biologically relevant. For a DNA alignment, missing data could mean something as simple as an insertion or deletion, which is biologically relevant. The choice to exclude or estimate data has very different implications for the type of data you have.

Treatment of Missing data is a non-trivial task. You should understand the nature of missing data in your data set before treatment.


Function: missingno {#data.manip:missing:missingno}

missingno is a function that serves as a way to exclude specific areas that contain systematic missing data. There are four methods available,

funk <- "missingno"
print_command(funk)

pop - a genind object.

type - This could be one of five options:

(@) "loci" This is to be used for a data set that has systematic problems with certain loci that contain null alleles or simply failed to amplify. This will remove loci with a defined threshold of missing data from the data set.

(@) "geno" This is to be used for genotypes (individuals) in your data set where many null alleles are present. Individuals with a defined threshold missing data will be removed.

(@) "asis" This argument will retain missing data in your data set. It"s useful for functions that utilize missingno internally, such as mlg.filter(), poppr.amova(), or aboot().

(@) "mean" This replaces missing data with the mean allele frequencies in the entire data set .

(@) "zero" or "0" This replaces missing data with zero, signifying a new allele .

cutoff - This is a numeric value from 0 to 1 indicating the percent allowable missing data for either loci or genotypes. If you have, for example, two loci containing missing 5% and 10% missing data, respectively and you set cutoff = 0.05, missingno will remove the second locus. Percent missing data for genotypes is considered the percent missing loci over number of total loci.

quiet - When this is set to FALSE, the number of missing values replaced will be printed to screen if the method is "zero" or "mean". It will print the number of loci or individuals removed if the method is "loci" or "geno".

freq - This is used for compatibility with the tab() method for genind object. It will convert counts of alleles to frequencies of alleles, rendering an object that will return warnings.


Let's take a look at what this does by focusing in on areas with missing data. We"ll use the data set nancycats as an example. Using the poppr function info_table, we can assess missing data within populations.

library("poppr")
data(nancycats)
info_table(nancycats, plot = TRUE)

We can see that locus fca8 has a lot of missing data. To demonstrate the function missingno(), we will zoom into the first five individuals at the first locus.

tab(nancycats)[1:5, 8:13]

When looking at this data set, recall how a genind object is formatted. You have a matrix representing counts of alleles. For diploids, if you see 1, that means it is heterozygous at that allele, and a 2 means it"s homozygous. Here we see three heterozygotes and two individuals with missing data (indicated by NA). Let"s look at what happens when we exclude loci and genotypes with >5% missing data).

nanloci <-  missingno(nancycats, "loci")
nangeno <-  missingno(nancycats, "geno")
tab(nanloci)[1:5, 8:13]

Notice how we now see columns named fca23.128 and fca23.130. This is showing us another locus because we have removed the first. Recall from the summary table that the first locus had 16 alleles, and the second had 11. Now that we've removed loci containing missing data, all others have shifted over.
Let's look at the loci names and number of individuals.

nInd(nanloci)     # Individuals
locNames(nanloci) # Names of the loci

You can see that the number of individuals stayed the same but the loci "fca8", "fca45", and "fca96" were removed. Let's look at what happened when we removed individuals.

tab(nangeno)[1:5, 8:13]
nInd(nangeno)     # Individuals
locNames(nangeno) # Names of the loci

We can see here that the number of individuals decreased, yet we have the same number of loci. Notice how the frequency matrix changes in both scenarios? In the scenario with "loci", we removed several columns of the data set, and so with our sub-setting, we see alleles from the second locus. In the scenario with "geno", we removed several rows of the data set so we see other individuals in our sub-setting.

Extract populations {Divide (populations) and conquer (your analysis)} {#data.manip:divide}

This poppr function popsub makes subsetting genind or genlight objects by population easier:

Function: popsub {#data.manip:divide:popsub}

The command popsub is powerful in that it allows you to choose exactly what populations you choose to include or exclude from your analyses. As with many R functions, you can also use this within a function to avoid creating a new variable to keep track of.


funk <- "popsub"
print_command(funk)

To demonstrate this tool, we'll use the H3N2 virus data set provided in adegenet. It contains a data frame in the "other" slot called "x" that contains information about the year of epidemic, country, etc.

data("H3N2", package = "adegenet")
strata(H3N2) <- data.frame(other(H3N2)$x)
H3N2

We will demonstrate the popsub functionality by setting the population factor to "country". Note, in this section, I am naming the variables staring with "v" indicating "virus".

setPop(H3N2) <- ~country
popNames(H3N2) # Only two countries from North America.
v_na <- popsub(H3N2, sublist = c("USA", "Canada"))
popNames(v_na)

If we want to see the population size, we can use the adegenet function nInd():

c(NorthAmerica = nInd(v_na), Total = nInd(H3N2))

You can see that the population factors are correct and that the size of the data set is considerably smaller. Let's see the data set without the North American countries.

v_na_minus <- popsub(H3N2, blacklist = c("USA", "Canada"))
popNames(v_na_minus)

Let"s make sure that the number of individuals in both data sets is equal to the number of individuals in our original data set:

(nInd(v_na_minus) + nInd(v_na)) == nInd(H3N2)

Now we have data sets with and without North America. Let's try something a bit more challenging. Let's say that we want the first 10 populations in alphabetical order, but we know that we still don"t want any countries in North America. We can easily do this by using the base function sort.

vsort <- sort(popNames(H3N2))[1:10]
vsort
valph <- popsub(H3N2, sublist = vsort, blacklist = c("USA", "Canada"))
popNames(valph)

And that, is how you subset your data with poppr!

Clone-censor data sets {Attack of the clone correction} {#data.manip:cc}

Clone correction refers to the ability of keeping one observation of each MLG in a given population (or sub-population). Clone correcting can be hazardous if its done by hand (even on small data sets) and it requires a defined population hierarchy to get relevant results. Poppr has a clone correcting function that that will correct down to the lowest level of any defined population hierarchy. Note that clone correction in poppr is sensitive to missing data, as it treats all missing data as a single extra allele.

This function will create new data sets, but it is also utilized by the functions poppr() and poppr.amova() natively.

Function: clonecorrect {#data.manip:cc:clonecorrect}

This function will return a clone corrected data set corrected for the lowest population level. Population levels are specified with the hier flag. You can choose to combine the population hierarchy to analyze at the lowest population level by choosing combine = TRUE.


funk <- "clonecorrect"
print_command(funk)

Let"s look at ways to clone-correct our data. We"ll look at our A. euteichies data that we used in an earlier section since that data set is known to include clonal populations [@Grunwald:2006]. Try playing around with the data and see what different combinations of the strata, and keep flags produce. Below, I will give three examples of clone corrections at the sample level with respect to field, at the field level, and finally, at the level of the entire data set.

First, we will examine the original data set.

data(Aeut)
strata(Aeut) <- data.frame(other(Aeut)$population_hierarchy[-1])
Aeut

When you read in data with read.genalex(), the default is to represent it in a genclone object. Since the clonecorrect() function works on multilocus genotype definitions, It's more efficient to convert it to a genclone object first. We will also rename the strata to "field" and "sample" to make the biological relevance of the data clearer.

aphan <- as.genclone(Aeut)
nameStrata(Aeut) <- ~field/sample

Now we correct by sample with respect to field and keep the field as the population.

clonecorrect(aphan,  strata = ~Pop/Subpop)
# Your turn: Use the same stratification and use combine = TRUE and then
# keep = 1:2. Is there any difference?

Correcting by field. Notice how the number of MLG is much closer to our census.

clonecorrect(aphan, strata = ~Pop)

Correcting over whole data set. Our MLG is equal to our census.

clonecorrect(aphan, strata = NA)

Permutations and bootstrap resampling {every day I'm shuffling (data sets)} {#data.manip:shuffle}

A common null hypothesis for populations with mixed reproductive modes is panmixia, or to put it simply: lots of sex. Poppr randomly shuffles data sets in order to calculate P-values for the index of association ($I_A$ and $\bar{r} _d$) [@Agapow:2001] using 4 different methods:

method strategy      units sampled

     1 permutation   alleles
     2 simulation    alleles
     3 simulation    alleles
     4 permutation   genotypes

These methods are detailed below. We will create a dummy data set to be shuffled by each example below. Let"s assume a single diploid locus with four alleles (1, 2, 3, 4) with the frequencies of 0.1, 0.2, 0.3, and 0.4, respectively:

   A1/A2

1 4/4 2 4/1 3 4/3 4 2/2 5 3/3

: Original

The 4 methods are detailed below.


Function: shufflepop {#data.manip:shuffle:shufflepop}

funk <- "shufflepop"
print_command(funk)

These shuffling schemes have been implemented for the index of association, but there may be other summary statistics you can use shufflepop for. All you have to do is use the function replicate. Let"s use average Bruvo"s distance with the first population of the data set nancycats as an example:

data(nancycats)
nan1 <- popsub(nancycats, 1)
reps <- rep(2, 9) # Assuming dinucleotide repeats.
observed <- mean(bruvo.dist(nan1, replen = reps))
observed

You could use this method to replicate the resampling 999 times and then create a histogram to visualize a distribution of what would happen under different assumptions of panmixia.

set.seed(9999)
bd.test <- replicate(999, mean(bruvo.dist(shufflepop(nan1, method = 2), replen = reps)))
load("nancybruvo.rda")
hist(bd.test, xlab = "Bruvo's Distance", main = "Average Bruvo's distance over 999 randomizations")
abline(v = observed, col = "red")
legend('topleft', legend="observed", col="red", lty = 1)

Removing uninformative loci {Cut It Out!} {#data.manip:informloci}

Phylogenetically uninformative loci are those that have only one sample differentiating from the rest. This can lead to biased results when using multilocus analyses such as the index of association [@Brown:1980; @Smith:1993]. These nuisance loci can be removed with the following function.

Function: informloci {#data.manip:informloci:informloci}


funk <- "informloci"
print_command(funk)

Here's a quick example using the H3N2 virus SNP data set from an earlier section. We will only retain loci that have a minor allele frequency of $\geq 5\%$

H.five <- informloci(H3N2, cutoff = 0.05)
msg <- readLines("msg.txt")
message(paste(msg, collapse = "\n"))

Now what happens when you have all informative loci. We"ll use the nancycats data set, which has microsatellite loci. It is important to note that this is searching for loci with a specified genotype frequency as fixed heterozygous sites are also uninformative:

data(nancycats)
naninform <- informloci(nancycats, cutoff = 0.05)

Multilocus Genotype Analysis {#mlg}

In populations with mixed sexual and clonal reproduction, it common to have multiple samples from the same population that have the same set of alleles at all loci. Here, we introduce tools for tracking MLGs within and across populations in genind, and genlight objects from the adegenet package. We will be using the H3N2 data set containing SNP data from isolates of the H3N2 virus from 2002 to 2006. Note that genclone and snpclone objects are optimal for these analyses. For a more in-depth document on methods for multilocus genotypes in poppr, see the "Multilocus Genotype Analysis" vignette by typing

vignette("mlg", package = "poppr")

How many multilocus genotypes are in our data set? {Just a peek} {#mlg:mlg}

Counting the number of MLGs in a population is the first step for these analyses as they allow us to see how many clones exist. With the genclone object, This information is already displayed when we view the object.

H3N2

If we need to store the number of MLGs as a variable, we can simply run the mlg() command.

H3N2_mlg <- mlg(H3N2)
H3N2_mlg

Since the number of individuals exceeds the number of multilocus genotypes, we conclude that this data set contains clones. Let"s examine what populations these clones belong to.

MLGs across populations {clone-ing around} {#mlg:cross}

Since you have the ability to define hierarchical levels of your data set freely, it is quite possible to see some of the same MLGs across different populations. Tracking them by hand can be a nightmare with large data sets. Luckily, mlg.crosspop has you covered in that regard.

Function: mlg.crosspop {#mlg:cross:mlg.crosspop}

Analyze the MLGs that cross populations within your data set. This has three output modes. The default one gives a list of MLGs, and for each MLG, it gives a named numeric vector indicating the abundance of that MLG in each population. Alternate outputs are described with indexreturn and df.


funk <- "mlg.crosspop"
print_command(funk)

We can see what MLGs cross different populations and then give a vector that shows how many populations each one of those MLGs crosses.

setPop(H3N2) <- ~country
v.dup <- mlg.crosspop(H3N2, quiet=TRUE)

Here is a snippet of what the output looks like when quiet is FALSE. It will print out the MLG name, the total number of individuals that make up that MLG, and the populations where that MLG can be found.

setPop(H3N2) <- ~country
v.dup <- structure(list(MLG.3 = structure(c(4L, 8L), .Names = c("USA",
"Denmark")), MLG.9 = structure(c(1L, 13L, 1L, 1L), .Names = c("Japan",
"USA", "Finland", "Denmark")), MLG.31 = structure(c(2L, 7L), .Names = c("Japan",
"Canada")), MLG.75 = structure(c(2L, 8L, 2L, 1L, 6L, 2L, 1L,
1L), .Names = c("Japan", "USA", "Finland", "Norway", "Denmark",
"Austria", "Russia", "Ireland")), MLG.80 = structure(c(1L, 1L
), .Names = c("USA", "Denmark")), MLG.86 = structure(3:4, .Names = c("Denmark",
"Austria")), MLG.95 = structure(c(1L, 1L), .Names = c("USA",
"Bangladesh")), MLG.97 = structure(c(1L, 5L, 1L, 1L), .Names = c("USA",
"Austria", "Bangladesh", "Romania")), MLG.104 = structure(1:2, .Names = c("USA",
"France")), MLG.110 = structure(c(2L, 3L, 11L), .Names = c("Japan",
"USA", "China"))), .Names = c("MLG.3", "MLG.9", "MLG.31", "MLG.75",
"MLG.80", "MLG.86", "MLG.95", "MLG.97", "MLG.104", "MLG.110"))
printthings <- function(ind, x){
  cat(paste0(names(x)[ind], ":"),
      paste0("(", sum(x[[ind]])," inds)"),
      names(x[[ind]]),
      "\n")
}
invisible(lapply(1:10, printthings, v.dup))

The output of this function is a list of MLGs, each containing a vector indicating the number of copies in each population. We"ll count the number of populations each MLG crosses using the function sapply to loop over the data with the function length.

head(v.dup)
v.num <- sapply(v.dup, length) # count the number of populations each MLG crosses.
head(v.num)

Producing MLG tables and graphs {bringing something to the table} {#mlg:table}

We can also create a table of MLGs per population as well as bar graphs to give us a visual representation of the data. This is achieved through the function mlg.table

Function: mlg.table {#mlg:table:mlg.table}

This function will produce a matrix containing counts of MLGs (columns) per population (rows). If there are not populations defined in your data set, a vector will be produced instead.


funk <- "mlg.table"
print_command(funk)

v.tab <- mlg.table(H3N2, plot = TRUE)
v.tab[1:10, 1:10] # Showing the first 10 columns and rows of the table.
v.tab <- mlg.table(H3N2, plot = FALSE)
v.tab[1:10, 1:10] # Showing the first 10 columns and rows of the table.
mlg.table(H3N2, sublist = "Norway", plot = TRUE)

The MLG table is not limited to use with poppr. In fact, one of the main advantages of mlg.table() is that it allows easy access to diversity functions present in the package vegan [@vegan]. One example is to create a rarefaction curve for each population in your data set giving the number of expected MLGs for a given sample size. For more information, type help(diversity, package=vegan) in your R console.

For the sake of this example, instead of drawing a curve for each of the 37 countries represented in this sample, let"s set the hierarchical level to year.

setPop(H3N2) <- ~year
summary(H3N2) # Check the data to make sure it's correct.
setPop(H3N2) <- ~year
res <- "
// Number of individuals: 1903
// Group sizes: 158 415 399 469 462
// Number of alleles per locus: 3 3 4 2 4 2 3 2 4 3 4 2 4 3 2 2 3 3 2 2 3 3 3 2 2 2 2 2 2 2 2 2 2 4 4 3 3 3 4 2 2 2 4 3 2 3 4 2 3 2 3 2 2 2 4 2 2 2 2 2 2 2 4 4 4 3 3 2 3 4 3 2 3 3 3 3 2 3 2 4 2 3 2 2 3 3 3 3 2 2 2 2 3 2 3 2 3 2 3 2 3 3 2 2 2 3 2 2 2 3 3 3 2 2 3 3 3 3 4 2 3 3 4 3 2
// Number of alleles per group: 203 255 232 262 240
// Percentage of missing data: 2.36 %
// Observed heterozygosity: 0"
cat(res, sep = "\n")
library("vegan")
H.year <- mlg.table(H3N2, plot = FALSE)
rarecurve(H.year, ylab="Number of expected MLGs", sample=min(rowSums(H.year)),
          border = NA, fill = NA, font = 2, cex = 1, col = "blue")
# library("vegan")
H.year <- mlg.table(H3N2, plot = FALSE)
vegan::rarecurve(H.year, ylab = "Number of expected MLGs", col = "blue",
    sample = min(rowSums(H.year)), border = NA, fill = NA, font = 2, cex = 1)

The minimum value from the base function rowSums() of the table represents the minimum common sample size of all populations defined in the table. Setting the "sample" flag draws the horizontal and vertical lines you see on the graph. The intersections of these lines correspond to the numbers you would find if you ran the function poppr on this data set (under the column "eMLG").

Combining MLG functions {getting into the mix} {#mlg:mix}

Alone, the different functionalities are neat. Combined, we can create interesting data sets. Let"s say we wanted to know which MLGs were duplicated across the regions of the United Kingdom, Germany, Netherlands, and Norway. All we have to do is use the sublist flag in the function:

setPop(H3N2) <- ~country
UGNN.list <- c("United Kingdom", "Germany", "Netherlands", "Norway")
UGNN <- mlg.crosspop(H3N2, sublist=UGNN.list, indexreturn=TRUE)

OK, the output tells us that there are three MLGs that are crossing between these populations, but we do not know how many are in each. We can easily find that out if we subset our original table, v.tab.

UGNN # Note that we have three numbers here. This will index the columns for us.
UGNN.list # And let's not forget that we have the population names.
v.tab[UGNN.list, UGNN]

Now we can see that Norway has a higher incidence of nearly all of these MLGs. We can investigate the incidence of these MLGs throughout our data set. One thing that the genclone object keeps track of is a single vector defining the unique multilocus genotypes within the data. These are represented as integers and can be accessed with mlg.vector. This is useful for finding MLGs that correspond to certain individuals or populations. Let"s use mlg.vector to find individuals corresponding to the MLGs. First we"ll investigate what the output of this function looks like.

v.vec <- mlg.vector(H3N2)
str(v.vec) # Analyze the structure.

The integers produced are the MLG assignment of each individual in the same order as the data set. This means that the first two individuals have the exact same set of alleles at each locus, so they have the same MLG: 605. If we look at the number of unique integers in the vector, it corresponds to the number of observed multilocus genotypes:

length(unique(v.vec)) # count the number of MLGs
H3N2 # equal to the first number in this output.

We will take UGNN (MLGs crossing UK, Germany, Netherlands, and Norway) and compare its elements to the MLG vector (v.vec) to see where else they occur.

UGNN # Show what we are looking for
UGNN_match <- v.vec %in% UGNN
table(UGNN_match) # How many individuals matched to those three MLGs?

22 individuals matched to those three MLGs. We can use this vector to show us the 22 individuals.

indNames(H3N2)[UGNN_match]

Note that there is an alternative way to list individuals matching specific MLGs using the function mlg.id. This function will return a list where each element represents a unique MLG. You can use this data to find out which individuals correspond to specific MLGs. Each element in the list is named with the MLG, but the index does not necessarily match up, so it is important to convert your query MLGs to strings:

H3N2.id <- mlg.id(H3N2)
H3N2.id[as.character(UGNN)]

We can also use the vector of MLGs to subset mlg.table() with the mlgsub flag.

mlg.table(H3N2, mlgsub = UGNN)

That showed us exactly which populations these three MLGs came from in our data set.

mlg.table(H3N2, mlgsub = UGNN)

Appendix

Manipulating Graphics

Poppr utilizes ggplot2 to produce many of its graphs. One advantage it gives the user is the ability to manipulate these graphs. With base R graphs, the only manipulation that can be performed is by adding elements to the graph. It is a static image. The ggplot graphs are actually represented as objects in your R environment. We can use the function last_plot() from ggplot2 to be able to grab the plot that was plotted last in our window. Let's illustrate this using a MLG bar graph from the Aeut data set.

library("poppr")
library("ggplot2")
data(Aeut)
Aeut.tab <- mlg.table(Aeut)
p <- last_plot()

We've captured our plot using last_plot() and now we can manipulate it. One common need is to change the title. We can easily do that with the function ggtitle(). Let"s say we wanted to label it "Aphanomyces euteiches multilocus genotype distribution". We would use ggtitle(Aphanomyces euteiches multilocus genotype distribution). Unfortunately, we need italics for a latin binomial. One way to acheive this is by using the expression() function and declaring which text needs to be italicized.

myTitle <- expression(paste(italic("Aphanomyces euteiches"), 
                      " multilocus genotype distribution"))
(pt <- p +
   ggtitle(myTitle) +
   xlab("Multilocus genotype")) # We can label the x axis, too

Let"s say we wanted to remove the grey background. We could use the theme() function to do this, or we could use a theme already implemented in ggplot2 called theme_bw().

(ptt <- pt + theme_bw())

Uh-oh. The x axis labels are now horizontal when they should be vertical. Since it"s the overall distribution we're interested in, we don't really need them anyways. We can remove them with axis.text.x and axis.ticks.x (we'll also remove the x axis gridlines because they're ugly).

(ptta <- ptt + theme(axis.text.x = element_blank()) +
  theme(axis.ticks.x = element_blank()) +
  theme(panel.grid.major.x = element_blank()))

And, if for some bizarre reason, you liked the color gradient in poppr version 1, you can get that back by adding the fill aesthetic:

(ptttaf <- ptta + aes(fill = count))

This allows you to produce publication quality graphs directly in R. Please see Hadley Wickham"s ggplot2 package for more details [@ggplot2]. Note that if you don"t like using ggplot2, you can access the data in the ggplot2 object and plot the data yourself:

head(p$data)

Exporting Graphics {#appendix:graphics}

R has the ability to produce nice graphics from most any type of data, but to get these graphics into a report, presentation, or manuscript can be a bit challenging. It"s no secret that the R Documentation pages are a little difficult to interpret, so I will give the reader here a short example on how to export graphics from R. Note that any code here that will produce images will also be present in other places in this vignette. The default installation of the R GUI is quite minimal, and for an easy way to manage your plots and code, I strongly encourage the user to use Rstudio http://www.rstudio.com/.

Basics {#appendix:graphics:basics}

Before you export graphics, you have to ask yourself what they will be used for. If you want to use the graphic for a website, you might want to opt for a low-resolution image so that it can load quickly. With printing, you"ll want to make sure that you have a scalable or at least a very high resolution image. Here, I will give some general guidelines for graphics (note that these are merely suggestions, not defined rules).

Image Editors {#appendix:graphics:editors}

Often times, fine details such as labels on networks need to be tweaked by hand. Luckily, there are a wide variety of programs that can help you do that. Here is a short list of image editors (both free and for a price) that you can use to edit your graphics.

Exporting ggplot2 graphics {#appendix:graphics:ggplot2}

ggplot2 is a fantastic package that poppr uses to produce graphs for the mlg.table(), poppr(), and ia() functions. Saving a plot with ggplot2 is performed with one command after your plot has rendered:

data(nancycats) # Load the data set.
poppr(nancycats, sample = 999) # Produce a single plot.
ggsave("nancycats.pdf")

Note that you can name the file anything, and ggsave will save it in that format for you. The details are in the documentation and you can access it by typing help(ggsave) in your R console. The important things to note are that you can set a width, height, and unit. The only downside to this function is that you can only save one plot at a time. If you want to be able to save multiple plots, read on to the next section.

Exporting any graphics {#appendix:graphics:export}

Some of the functions that poppr offers will give you multiple plots, and if you want to save them all, using ggsave will require a lot of tedious typing and clicking. Luckily, R has Functions that will save any plot you generate in nearly any image format you want. You can save in raster images such as png, bpm, and jpeg. You can also save in vector based images such as svg, pdf, and postscript. The important thing to remember is that when you are saving in a raster format, the default units of measurement are "pixels", but you can change that by specifying your unit of choice and a resolution.

For raster images and svg files, you can only save your plots in multiple files, but pdf and postscript plots can be saved in one file as multiple pages. All of these functions have the same basic form. You call the function to specify the file type you want (eg. pdf(myfile.pdf)), create any graphs that you want to create, and then make sure to close the session with the function dev.off(). Let"s give an example saving to pdf and png files.

data(nancycats)

png("nancy_pair%02d.png", width = 14, height = 14, units = "in", res = 300)
poppairs <- lapply(seppop(nancycats), pair.ia, limits = c(-0.25, 1))
dev.off()

Since this data set is made up of 30 populations with more than 1 individual, this will save 30 files to your working directory named "nancy_pair_barchart01.png...nancy_pair_barchart30.png". The way R knows how to number these files is because of the %02d part of the command. That"s telling R to use a number that is two digits long in place of that expression. All of these files will be 14x14" and will have a resolution of 300 dots per inch. If you wanted to do the same thing, but place them all in one file, you should use the pdf option.

pdf("nancy_pair.png", width = 14, height = 14, compress = FALSE)
poppairs <- lapply(seppop(nancycats), pair.ia, limits = c(-0.25, 1))
dev.off()

Remember, it is important not to forget to type dev.off() when you are done making graphs. Note that I did not have to specify a resolution for this image since it is based off of vector graphics.

[^1]: This window sometimes appears behind your current session of R, depending on the GUI and you will have to toggle to this window

[^2]: *.CSV files are comma separated files that are easily machine readable.

References



Try the poppr package in your browser

Any scripts or data that you put into this service are public.

poppr documentation built on Aug. 22, 2018, 5:03 p.m.