gsi_sim is a tool for doing and simulating genetic stock identification, developed by Eric C. Anderson.
The arguments of the assignment_ngs function were tailored to the
reality of GBS data for assignment analysis while
maintaining a reproducible workflow.
The input data is a VCF file produced by STACKS or a data frame. Individuals, populations and
markers can be filtered and/or selected in several ways using blacklist,
whitelist and other arguments. Map-independent imputation of missing genotypes
using Random Forest or the most frequent category is also available.
Markers can be randomly selected for a classic LOO (Leave-One-Out)
assignment or chosen based on ranked Fst for a thl
(Training, Holdout, Leave-one-out) assignment analysis.
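The "most frequent category" imputation mentioned above can be illustrated with a minimal base R sketch. This is not the code used by assignment_ngs: the helper name impute_most_frequent, the genotype coding and the toy data are assumptions made for illustration only.

```r
# Hypothetical helper (not part of the package): replace the missing
# genotypes of one marker with the most frequent observed genotype.
impute_most_frequent <- function(genotypes) {
  counts <- table(genotypes)                  # table() ignores NA by default
  if (length(counts) == 0) return(genotypes)  # nothing observed, nothing to impute
  # On ties, which.max() keeps the first category in sorted order
  most.frequent <- names(counts)[which.max(counts)]
  genotypes[is.na(genotypes)] <- most.frequent
  genotypes
}

# Toy genotypes for a single marker (made-up data)
marker <- c("A/A", "A/G", NA, "A/A", NA)
imputed <- impute_most_frequent(marker)
```

In this sketch the imputation is applied to one marker at a time; applying it within each population separately would only require splitting the vector by population first.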
data |
Options include a VCF file (1) or a haplotypes file (2) created in STACKS
( |
assignment.analysis |
Assignment analysis conducted with
|
whitelist.markers |
(optional) A whitelist containing CHROM (character
or integer) and/or LOCUS (integer) and/or
POS (integer) column headers, to filter by chromosome and/or locus and/or SNP.
The whitelist must be in the working directory (e.g. "whitelist.txt").
For de novo VCF, a CHROM column containing 'un' needs to be changed to 1.
Default |
monomorphic.out |
(optional) For PLINK files, should the monomorphic
markers present in the dataset be filtered out?
Default: |
blacklist.genotype |
(optional) Useful to erase genotypes of below-average
quality, e.g. genotypes with more than 2 alleles in diploids (likely
sequencing errors) or genotypes with poor genotype likelihood or coverage.
The blacklist has a minimum of 2 column headers (markers and individuals).
Markers can be 1 column (CHROM or LOCUS or POS),
a combination of 2 (e.g. CHROM and POS or CHROM and LOCUS or LOCUS and POS) or
all 3 (CHROM, LOCUS, POS). The marker columns must be designated: CHROM (character
or integer) and/or LOCUS (integer) and/or POS (integer). The id column must be
designated INDIVIDUALS (character). The blacklist must be in the working
directory (e.g. "blacklist.genotype.txt"). For de novo VCF, a CHROM column
containing 'un' needs to be changed to 1. Default |
snp.ld |
(optional) For VCF file only. With anonymous markers from
RADseq/GBS de novo discovery, you can minimize linkage disequilibrium (LD) by
choosing among these 3 options: |
common.markers |
(optional) Logical. Default = |
maf.thresholds |
(string, double, optional) String with
local/populations and global/overall maf thresholds, respectively.
Default: |
maf.pop.num.threshold |
(integer, optional) When maf thresholds are used,
this argument sets the number of populations required to pass the maf thresholds
to keep the locus. Default: |
maf.approach |
(character, optional). By |
maf.operator |
(character, optional) |
max.marker |
(integer, optional) Useful to subsample the number of markers in a
large PLINK file. Default: |
marker.number |
(integer, string of numbers, or "all") Run the calculations with
all markers or with fixed-size subsamples of your markers. Default = |
blacklist.id |
(optional) A blacklist with individual ID and a column header 'INDIVIDUALS'. The blacklist is in the working directory (e.g. "blacklist.txt"). |
sampling.method |
(character) Should the markers be randomly selected
|
thl |
(character, integer, proportion) For |
iteration.method |
With random marker selection, the iterations argument is
the number of iterations to repeat marker resampling. Default is |
folder |
(optional) The name of the folder created in the working directory to save the files/results. |
gsi_sim.filename |
(optional) The name of the file written to the directory.
Use the extension ".txt" at the end. Default |
keep.gsi.files |
(Boolean) Default |
pop.levels |
(required) A character string with your populations ordered. |
pop.labels |
(optional) A character string for your populations labels.
If you need to rename sampling sites in |
pop.id.start |
The start of your population id
in the name of your individual sample. If your individuals are identified
in this form: SPECIES-POPULATION-MATURITY-YEAR-ID = CHI-QUE-ADU-2014-020,
then, |
pop.id.end |
The end of your population id
in the name of your individual sample. If your individuals are identified
in this form: SPECIES-POPULATION-MATURITY-YEAR-ID = CHI-QUE-ADU-2014-020,
then, |
strata |
(optional) A tab delimited file with 2 columns with header:
|
pop.select |
(string) Conduct the assignment analysis on a
selected list of populations. Default = |
subsample |
(Integer or Proportion) Default is no subsampling, |
iteration.subsample |
(Integer) The number of iterations to repeat
subsampling, default: |
imputation.method |
Should map-independent imputation of markers be
computed? Available choices are: (1) |
impute |
(character) Imputation of missing genotypes
|
imputations.group |
|
num.tree |
The number of trees to grow in Random Forest. Default is 100. |
iteration.rf |
The number of iterations of missing data algorithm in Random Forest. Default is 10. |
split.number |
Non-negative integer value used to specify random splitting in Random Forest. Default is 100. |
verbose |
Logical. Should trace output be enabled on each iteration
in Random Forest? Default is |
parallel.core |
(optional) The number of cores for OpenMP shared-memory parallel
programming of Random Forest imputations. For more info on how to install the
OpenMP version see |
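For the whitelist.markers argument described above, a whitelist file could be prepared along these lines. This is only a sketch: the marker values are made up, and the tab-delimited layout is an assumption, since the documentation only specifies the column headers.

```r
# Made-up markers; CHROM is 'un', as can happen with a de novo VCF
whitelist <- data.frame(
  CHROM = c("un", "un", "un"),
  LOCUS = c(12L, 57L, 103L),
  POS = c(101L, 2043L, 5530L),
  stringsAsFactors = FALSE
)

# The documentation asks for 'un' to be changed to 1
whitelist$CHROM[whitelist$CHROM == "un"] <- "1"

# Assumption: tab-delimited text file. A temporary folder is used here to keep
# the sketch side-effect free; the function expects the file in the working directory.
whitelist.file <- file.path(tempdir(), "whitelist.txt")
write.table(whitelist, file = whitelist.file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read the file back to check the column headers
whitelist.check <- read.table(whitelist.file, header = TRUE, sep = "\t")
```

A blacklist of genotypes could be built the same way by adding an INDIVIDUALS column to the data frame.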
You need to have either the pop.id.start and pop.id.end arguments
or the strata argument to identify your populations.
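As a rough illustration of what pop.id.start and pop.id.end point to, the population id can be extracted from the sample names with base R. The sample names other than CHI-QUE-ADU-2014-020 are made up, and how assignment_ngs handles this step internally is an assumption.

```r
# Sample names in the form SPECIES-POPULATION-MATURITY-YEAR-ID
# (only the first one comes from the documentation, the others are made up)
individuals <- c("CHI-QUE-ADU-2014-020", "CHI-ONT-ADU-2014-003",
                 "CHI-QUE-JUV-2013-011")

# Character positions of the population id in the sample name
pop.id.start <- 5
pop.id.end <- 7

# Extract the population id and order it as pop.levels would
pop.id <- substr(individuals, pop.id.start, pop.id.end)
pop.id <- factor(pop.id, levels = c("QUE", "ONT"))
```

Running the same substr() call on a few of your own sample names is a quick way to check the two positions before launching a long analysis.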
Imputation using Random Forest requires more time to compute and can take from several minutes to hours, depending on the size of the dataset and the polymorphism of the species used. For example, for a taxon with low polymorphism, a data set containing 30% missing data, 5 000 haplotype loci and 500 individuals will require 15 min. The Fst is based on Weir and Cockerham 1984 equations.
Depending on the arguments selected, several files are written to your
working directory or folder.
The output in your global environment is a list. To view the assignment results, use
$assignment; to view the ggplot2 figure, use $plot.assignment.
See example below.
assignment_ngs assumes that the command line version of gsi_sim
is properly installed and available on the command line, so it is executable from
any directory (more info on how to do this here:
http://gbs-cloud-tutorial.readthedocs.org/en/latest/03_computer_setup.html?highlight=bash_profile#save-time).
The easiest way is to put the binary, the gsi_sim executable,
in the folder /usr/local/bin. To compile gsi_sim, follow the
instructions here: https://github.com/eriqande/gsi_sim.
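Whether gsi_sim can be found from R can be checked before running an analysis. The sketch below relies on base R's Sys.which(); the helper name command_available is hypothetical and is not part of the package.

```r
# Hypothetical helper: TRUE when `cmd` is found on the PATH seen by R
command_available <- function(cmd) {
  unname(nzchar(Sys.which(cmd)))
}

if (!command_available("gsi_sim")) {
  message("gsi_sim was not found on the PATH; ",
          "see the installation instructions above.")
}
```

Note that the PATH seen by R, in particular when R is started from a graphical interface, can differ from the PATH of your terminal.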
Thierry Gosselin thierrygosselin@icloud.com
Anderson, Eric C., Robin S. Waples, and Steven T. Kalinowski. (2008) An improved method for predicting the accuracy of genetic stock identification. Canadian Journal of Fisheries and Aquatic Sciences 65, 7:1475-1486.
Anderson, E. C. (2010) Assessing the power of informative subsets of loci for population assignment: standard methods are upwardly biased. Molecular ecology resources 10, 4:701-710.
Catchen JM, Amores A, Hohenlohe PA et al. (2011) Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences. G3, 1, 171-182.
Catchen JM, Hohenlohe PA, Bassham S, Amores A, Cresko WA (2013) Stacks: an analysis tool set for population genomics. Molecular Ecology, 22, 3124-3140.
Weir BS, Cockerham CC (1984) Estimating F-Statistics for the Analysis of Population Structure. Evolution, 38, 1358–1370.
Ishwaran H. and Kogalur U.B. (2015). Random Forests for Survival, Regression and Classification (RF-SRC), R package version 1.6.1.
Ishwaran H. and Kogalur U.B. (2007). Random survival forests for R. R News 7(2), 25-31.
Ishwaran H., Kogalur U.B., Blackstone E.H. and Lauer M.S. (2008). Random survival forests. Ann. Appl. Statist. 2(3), 841–860.
Danecek P, Auton A, Abecasis G et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156-2158.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics. 2007; 81: 559–575. doi:10.1086/519795
## Not run:
assignment.treefrog <- assignment_ngs(
data = "batch_1.vcf",
whitelist.markers = "whitelist.vcf.txt",
snp.ld = NULL,
common.markers = TRUE,
marker.number = c(500, 5000, "all"),
sampling.method = "ranked",
thl = 0.3,
blacklist.id = "blacklist.id.lobster.tsv",
subsample = 25,
iteration.subsample = 10,
gsi_sim.filename = "treefrog.txt",
keep.gsi.files = FALSE,
pop.levels = c("PAN", "COS"),
pop.id.start = 5, pop.id.end = 7,
imputation.method = FALSE,
parallel.core = 12
)
# Since the 'folder' argument is missing, it will be created automatically
# inside your working directory.
# To create a data frame with the assignment results:
assignment <- assignment.treefrog$assignment
# To plot the assignment using ggplot2 and facet
# (with subsample by current pop):
assignment.treefrog$plot.assignment + facet_grid(SUBSAMPLE~CURRENT)
# To save the plot:
ggsave("assignment.treefrog.THL.subsample.pdf", height = 35,
width = 60, dpi = 600, units = "cm", useDingbats = FALSE)
## End(Not run)