Next-generation sequencing techniques that reduce the size of the genome (e.g. genotype-by-sequencing (GBS) and restriction-site-associated DNA sequencing (RADseq)) produce huge numbers of markers that hold great potential and promises for assignment analysis. After hitting the bioinformatic wall with the different workflows you'll likely end up with several folders containing whitelist and blacklist of markers and individuals, data sets with various de novo and/or filtering parameters and missing data. This reality of GBS/RAD data is quite hard on GUI software traditionally used for assignment analysis. The end results is usually poor data exploration, constrained by time, and poor reproducibility.
assigner was tailored to make it easy to conduct assignment analysis using GBS/RAD data within R. Additionally, combining the use of tools like [RStudio] (https://www.rstudio.com) and [GitHub] (https://github.com) will make effortless documenting your workflows and pipelines.
This is the development page of the assigner package for the R software.
Use assigner to:
blacklist.genotype
argument.iteration.method
and/or iteration.subsample
arguments to resample markers or individuals to get statistics!whitelist.markers
argument.ggplot2
-based plotting to view assignment results and create publication-ready figuresYou can try out the dev version of assigner. Follow the 4 steps below:
Step 1 You will need the package devtools
if (!require("devtools")) install.packages("devtools") # to install
library(devtools) # to load
Step 2 Install assigner:
install_github("thierrygosselin/assigner") # to install
library(assigner) # to load
Step 3 For faster imputations, you need to install an OpenMP enabled randomForestSRC package website.
Option 1: From source (Linux & Mac OSX)
# Terminal
cd ~/Downloads
curl -O https://cran.r-project.org/src/contrib/randomForestSRC_2.0.7.tar.gz
tar -zxvf randomForestSRC_2.0.7.tar.gz
cd randomForestSRC
autoconf
# Back in R:
install.packages(pkgs = "~/Downloads/randomForestSRC", repos = NULL, type = "source")
Option 2: Use a pre-compiled binary (Mac OSX & Windows) [instructions here] (http://www.ccs.miami.edu/~hishwaran/rfsrc.html) or quick copy/paste solution below:
# Mac OSX
library("devtools")
install_url(url = "http://www.ccs.miami.edu/~hishwaran/rfsrc/randomForestSRC_2.0.7.tgz")
# Windows
library("devtools")
install_url(url = "http://www.ccs.miami.edu/~hishwaran/rfsrc/randomForestSRC_2.0.7.zip")
Step 4 Install [gsi_sim] (https://github.com/eriqande/gsi_sim):
assigner assumes that the command line version of [gsi_sim] (https://github.com/eriqande/gsi_sim) is properly installed and available on the command line, so it is executable from any directory. If you have no idea what i'm saying here, you might want to first read this short section of my tutorial on [GBS in the cloud] (http://gbs-cloud-tutorial.readthedocs.org/en/latest/03_computer_setup.html?highlight=bash_profile#save-time).
The fastest way is to put [gsi_sim] (https://github.com/eriqande/gsi_sim)
binary, the gsi_sim
executable, in the folder /usr/local/bin
.
# Mac OSX
# git the repo and submodules
cd ~/Downloads/ # or any directory
sudo git clone https://github.com/eriqande/gsi_sim.git
cd gsi_sim/
sudo git submodule init
sudo git submodule update
cd ..
sudo cp ~/Downloads/gsi_sim/gsi_sim-Darwin /usr/local/bin/gsi_sim
sudo rm -R ~/Downloads/gsi_sim
# Linux
cd ~/Downloads/ # or any directory
sudo git clone https://github.com/eriqande/gsi_sim.git
cd gsi_sim/
sudo git submodule init
sudo git submodule update
cd ..
sudo cp ~/Downloads/gsi_sim/gsi_sim-Linux /usr/local/bin/gsi_sim
sudo rm -R ~/Downloads/gsi_sim
Problems during installation:
Sometimes you'll get warnings while installing dependencies required for assigner or other R packages.
Warning: cannot remove prior installation of package ‘stringi’
To solve this problem:
Option 1. Delete the problematic packages manually and reinstall. On MAC computers, in the Finder, use the shortcut cmd+shift+g, or in the menu bar : GO -> Go to Folder, copy/paste the text below:
/Library/Frameworks/R.framework/Resources/library
#Delete the problematic packages.
Option 2. If you know your way around the terminal and understand the consequences of using sudo rm -R command, here something faster to remove problematic packages:
sudo rm -R /Library/Frameworks/R.framework/Resources/library/package_name
# Changing 'package_name' to the problematic package.
# Reinstall the package.
Dependencies
Here the list of packages that assigner is depending on:
if (!require("reshape2")) install.packages("reshape2")
if (!require("ggplot2")) install.packages("ggplot2")
if (!require("stringr")) install.packages("stringr")
if (!require("stringi")) install.packages("stringi")
if (!require("plyr")) install.packages("plyr")
if (!require("dplyr")) install.packages("dplyr")
if (!require("tidyr")) install.packages("tidyr")
if (!require("readr")) install.packages("readr")
if (!require("purrr")) install.packages("purrr")
if (!require("data.table")) install.packages("data.table")
if (!require("lazyeval")) install.packages("lazyeval")
if (!require("adegenet")) install.packages("adegenet")
if (!require("parallel")) install.packages("parallel")
If you don't have them, no worries, it's intalled automatically during assigner installation. If you have them, it's your job to update them, because i'm usually using the latest versions...
Most of the function in assigner were designed to be as fast as possible. Using computer with 16GB RAM is recommended. With more CPU and Memory comes faster computation time. If you decide to keep intermediate files during assignment analysis, you will need a large external drive (disk space is cheap). Solid State Drive and thunderbolt cables will provide fast input/output.
If disk space and computer power is an issue, cloud computing with [Google Cloud Compute Engine] (https://cloud.google.com/compute/) and [Amazon Elastic Cloud Compute] (https://aws.amazon.com/ec2/) is cheap and can be used easily.
A tutorial and pipeline along an Amazon Machine Image (AMI) are available in our [tutorial-workflow] (http://gbs-cloud-tutorial.readthedocs.org/en/latest/).
The AMI is preloaded with gsi_sim and the required R packages. Following a few steps: [link] (http://gbs-cloud-tutorial.readthedocs.org/en/latest/10_use_rstudio.html), you can have [RStudio server] (https://www.rstudio.com/) running and used through your web browser!
The Amazon image can be imported into Google Cloud Compute Engine to start a new compute engine virtual machine: [link] (https://cloud.google.com/compute/docs/creating-custom-image#import_an_ami_image).
v.0.1.8 * You can now opt between [gsi_sim] (https://github.com/eriqande/gsi_sim) or [adegenet] (https://github.com/thibautjombart/adegenet), a R package developed by Thibaul Jombart, to conduct the assignment analysis
v.0.1.7
New input file: Re-introduced the haplotype data frame file from stacks.
Argument name change: imputations
is now impute.method
.
* New argument: impute
with 2 options: impute = "genotype"
or impute = "allele"
.
v.0.1.6
Input file argument is now data
and covers the three types of files the
function can use: VCF file, PLINK tped/tfam or data frame of genotypes file.
Huge number of markers (> 50 000 markers) can now be imported in PLINK
tped/tfam format. The first 2 columns of the tfam
file will be used for the
strata
argument, unless a new one is provided. Columns 1, 3 and 4 of the
tped
are discarded. The remaining columns correspond to the genotype in the
format 01/04
where A = 01, C = 02, G = 03 and T = 04
. For A/T
format, use
PLINK or bash to convert. Use [VCFTOOLS] (http://vcftools.sourceforge.net/) with
--plink-tped
to convert very large VCF file. For .ped
file conversion to
.tped
use [PLINK] (http://pngu.mgh.harvard.edu/~purcell/plink/) with
--recode transpose
.
v.0.1.5
* bug fix in method = "random"
and imputation
v.0.1.4
Changed function name, from GBS_assignment
to assignment_ngs
. Stands for
assignment with next-generation sequencing data.
New argument df.file
if you don't have a VCF file. See documentation.
* New argument strata
if you don't have population id or other metadata info
in the individual name. See documentation.
v.0.1.3
Changed arguments THL
to thl
and snp.LD
to snp.ld
to follow convention.
iterations.subsample
changed to iteration.subsample
.
iterations
changed to iteration.method
to avoid confusion with other iteration arguments.
Removed baseline
and mixture
arguments from the function GBS_assignment
.
These options will be re-introduce later in a separate function.
* Using marker.number
higher than the number of markers in the data set was causing
problems. This could arise when using arguments that removed markers from the dataset
(e.g. snp.ld
, common.markers
, and maf
filters).
v.0.1.2
* new version to update with gsi_sim new install instruction for Linux and Mac.
After re-installing assigner package, follow the instruction to re-install
the new [gsi_sim] (https://github.com/eriqande/gsi_sim).
And delete the old binary 'gsisim' in the /usr/local/bin folder
with the following Terminal command: sudo rm /usr/local/bin/gsisim
This package has been developed in the open, and it wouldn’t be nearly as good without your contributions. There are a number of ways you can help me make this package even better: If you don’t understand something, please let me know. Your feedback on what is confusing or hard to understand is valuable. * If you spot a typo, feel free to edit the underlying page and send a pull request.
New to pull request on github ? The process is very easy: Click the edit this page on the sidebar. Make the changes using github’s in-page editor and save. Submit a pull request and include a brief description of your changes. “Fixing typos” is perfectly adequate.
Under construction
Here an example on how to read the GenoDive assignment output file in order to
use the arguments in the dlr
and the plot_assignment_dlr
functions correctly.
Vignettes are in development, check periodically for updates.
Anderson EC, Waples RS, Kalinowski ST (2008) An improved method for predicting the accuracy of genetic stock identification. Canadian Journal of Fisheries and Aquatic Sciences, 65, 1475–1486.
Anderson EC (2010) Assessing the power of informative subsets of loci for population assignment: standard methods are upwardly biased. Molecular Ecology Resources, 10, 701–710.
Catchen JM, Amores A, Hohenlohe PA et al. (2011) Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences. G3, 1, 171–182.
Catchen JM, Hohenlohe PA, Bassham S, Amores A, Cresko WA (2013) Stacks: an analysis tool set for population genomics. Molecular Ecology, 22, 3124–3140.
Danecek P, Auton A, Abecasis G et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156–2158.
Foll M, Gaggiotti O (2008) A Genome-Scan Method to Identify Selected Loci Appropriate for Both Dominant and Codominant Markers: A Bayesian Perspective. Genetics, 180, 977–993.
Ishwaran H. and Kogalur U.B. (2015). Random Forests for Survival, Regression and Classification (RF-SRC), R package version 1.6.1.
Jombart T (2008) adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics, 24, 1403–1405.
Jombart T, Ahmed I (2011) adegenet 1.3-1: new tools for the analysis of genome-wide SNP data. Bioinformatics, 27, 3070–3071.
Kavakiotis I, Triantafyllidis A, Ntelidou D et al. (2015) TRES: Identification of Discriminatory and Informative SNPs from Population Genomic Data. Journal of Heredity, 106, 672–676.
Meirmans PG, Van Tienderen PH (2004) genotype and genodive: two programs for the analysis of genetic diversity of asexual organisms. Molecular Ecology Notes, 4, 792-794.
Paetkau D, Slade R, Burden M, Estoup A (2004) Genetic assignment methods for the direct, real-time estimation of migration rate: a simulation-based exploration of accuracy and power. Molecular Ecology, 13, 55-65.
Paetkau D, Waits LP, Clarkson PL, Craighead L, Strobeck C (1997) An empirical evaluation of genetic distance statistics using microsatellite data from bear (Ursidae) populations. Genetics, 147, 1943-1957.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics. 2007; 81: 559–575. doi:10.1086/519795
Rosenberg NA, Li LM, Ward R, Pritchard JK (2003) Informativeness of genetic markers for inference of ancestry. American Journal of Human Genetics, 73, 1402–1422.
Shriver MD, Smith MW, Jin L et al. (1997) Ethnic-affiliation estimation by use of population-specific DNA markers. American Journal of Human Genetics, 60, 957.
Weir BS, Cockerham CC (1984) Estimating F-Statistics for the Analysis of Population Structure. Evolution, 38, 1358–1370.
Whitlock MC, Lotterhos KE (2015) Reliable Detection of Loci Responsible for Local Adaptation: Inference of a Null Model through Trimming the Distribution of FST*. The American Naturalist, S000–S000.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.