assignment_ngs: Assignment analysis in gsi_sim with GBS data produced by...

Description Arguments Details Value Note Author(s) References Examples

Description

gsi_sim is a tool for doing and simulating genetic stock identification and developed by Eric C. Anderson. The arguments in the assignment_ngs function were tailored for the reality of GBS data for assignment analysis while maintaining a reproducible workflow. The input data is a VCF file produced by STACKS or a data frame. Individuals, populations and markers can be filtered and/or selected in several ways using blacklist, whitelist and other arguments. Map-independent imputation of missing genotype using Random Forest or the most frequent category is also available. Markers can be randomly selected for a classic LOO (Leave-One-Out) assignment or chosen based on ranked Fst for a thl (Training, Holdout, Leave-one-out) assignment analysis.

Arguments

data

Options include the VCF (1) or an haplotype files (2) created in STACKS (data = "batch_1.vcf" and data = "batch_1.haplotypes.tsv", respectively) or a data frame (3) with tab separating the genotypes in columns (data = "data.assignment.tsv"). The 1st column is the POP_ID, 2nd colum the INDIVIDUALS and the remaining columns are the markers IDs containing genotypes in the format: 3 digits per allele and no space between alleles (e.g. 235240 : allele1 = 235 and allele2 = 240). Missing genotypes are coded 0 or 000000. Note that the POP_ID column can be any hierarchical grouping. See the argument strata for other means of controlling grouping used in the assignment. The last option for data input is a PLINK file in tped/tfam format (e.g. data = "data.assignment.tped"). The first 2 columns of the tfam file will be used for the strata argument below, unless a new one is provided. Columns 1, 3 and 4 of the tped are discarded. The remaining columns correspond to the genotype in the format 01/04 where A = 01, C = 02, G = 03 and T = 04. For A/T format, use PLINK or bash to convert. Use VCFTOOLS http://vcftools.sourceforge.net/ with --plink-tped to convert very large VCF file. For .ped file conversion to .tped use PLINK http://pngu.mgh.harvard.edu/~purcell/plink/ with --recode transpose.

assignment.analysis

Assignment analysis conducted with assignment.analysis = "gsi_sim" or assignment.analysis = "adegenet".

whitelist.markers

(optional) A whitelist containing CHROM (character or integer) and/or LOCUS (integer) and/or POS (integer) columns header. To filter by chromosome and/or locus and/or by snp. The whitelist is in the working directory (e.g. "whitelist.txt"). de novo CHROM column with 'un' need to be changed to 1. Default NULL for no whitelist of markers. In the VCF, the column ID is the LOCUS identification.

monomorphic.out

(optional) For PLINK file, should the monomorphic markers present in the dataset be filtered out ? Default: monomorphic.out = TRUE.

blacklist.genotype

(optional) Useful to erase genotype with below average quality, e.g. genotype with more than 2 alleles in diploid likely sequencing errors or genotypes with poor genotype likelihood or coverage. The blacklist as a minimum of 2 column headers (markers and individuals). Markers can be 1 column (CHROM or LOCUS or POS), a combination of 2 (e.g. CHROM and POS or CHROM and LOCUS or LOCUS and POS) or all 3 (CHROM, LOCUS, POS) The markers columns must be designated: CHROM (character or integer) and/or LOCUS (integer) and/or POS (integer). The id column designated INDIVIDUALS (character) columns header. The blacklist must be in the working directory (e.g. "blacklist.genotype.txt"). For de novo VCF, CHROM column with 'un' need to be changed to 1. Default NULL for no blacklist of genotypes to erase.

snp.ld

(optional) For VCF file only. With anonymous markers from RADseq/GBS de novo discovery, you can minimize linkage disequilibrium (LD) by choosing among these 3 options: "random" selection, "first" or "last" SNP on the same short read/haplotype. Default: snp.ld = NULL. Note that for other file type, use stackr package for haplotype file and create a whitelist, for plink and data frames, use PLINK linkage disequilibrium based SNP pruning option.

common.markers

(optional) Logical. Default = FALSE. With TRUE, will keep markers genotyped in all the populations.

maf.thresholds

(string, double, optional) String with local/populations and global/overall maf thresholds, respectively. Default: maf.thresholds = NULL. e.g. maf.thresholds = c(0.05, 0.1) for a local maf threshold of 0.05 and a global threshold of 0.1. Available for VCF, PLINK and data frame files. Use stackr for haplotypes files.

maf.pop.num.threshold

(integer, optional) When maf thresholds are used, this argument is for the number of pop required to pass the maf thresholds to keep the locus. Default: maf.pop.num.threshold = 1

maf.approach

(character, optional). By maf.approach = "SNP" or by maf.approach = "haplotype". The function will consider the SNP or ID/LOCUS/haplotype/read MAF statistics to filter the markers. Default is maf.approach = "SNP". The haplotype approach is restricted to VCF file.

maf.operator

(character, optional) maf.operator = "AND" or default maf.operator = "OR". When filtering over LOCUS or SNP, do you want the local "AND" global MAF to pass the thresholds, or ... you want the local "OR" global MAF to pass the thresholds, to keep the marker?

max.marker

An optional integer useful to subsample marker number in large PLINK file. Default: max.marker = NULL. e.g. if the data set contains 200 000 markers and max.marker = 10000 10000 markers are subsampled randomly from the 200000 markers. Use whitelist.markers to keep specific markers.

marker.number

(Integer or string of number or "all") Calculations with fixed or subsample of your markers. Default= "all". e.g. To test 500, 1000, 2000 and all the markers: marker.number = c(500, 1000, 2000, "all". To use only 500 makers marker.number = 500.

blacklist.id

(optional) A blacklist with individual ID and a column header 'INDIVIDUALS'. The blacklist is in the working directory (e.g. "blacklist.txt").

sampling.method

(character) Should the markers be randomly selected "random" for a classic Leave-One-Out (LOO) assignment or chosen based on ranked Fst "ranked", used in a Training-Holdout-Leave One Out (thl) assignment ?

thl

(character, integer, proportion) For sampling.method = "ranked" only. Default 1, 1 individual sample is used as holdout. This individual is not participating in the markers ranking. For each marker number, the analysis will be repeated with all the indiviuals in the data set (e.g. 500 individuals, 500 times 500, 1000, 2000 markers). If a proportion is used e.g. 0.15,= 15 populations are chosen randomly as holdout individuals. With thl = "all" all individuals are used for ranking (not good) and iteration.method argument below is set to 1 by default. For the other thl values, you can create different holdout individuals lists with the iteration.method argument below.

iteration.method

With random marker selection the iterations argument = the number of iterations to repeat marker resampling, default is 10 With marker.number = c(500, 1000) and default iterations setting, 500 markers will be randomly chosen 10 times and 1000 markers will be randomly chosen 10 times. For the ranked method, using thl = 1, the analysis will be repeated for each individuals in the data set for every marker.number selected. With a proportion argument thl = 0.15, 15 individuals and this process is reapeated the number of times chosen by the iteration.method value.

folder

(optional) The name of the folder created in the working directory to save the files/results.

gsi_sim.filename

(optional) The name of the file written to the directory. Use the extension ".txt" at the end. Default assignment_data.txt. The number of markers used will be appended to the name of the file.

keep.gsi.files

(Boolean) Default FALSE The input and output gsi_sim files will be deleted from the directory when finished processing. With TRUE, remember to allocate a large chunk of the disk space for the analysis.

pop.levels

(required) A character string with your populations ordered.

pop.labels

(optional) A character string for your populations labels. If you need to rename sampling sites in pop.levels or combined sites/pop into a different names, here is the place.

pop.id.start

The start of your population id in the name of your individual sample. Your individuals are identified in this form : SPECIES-POPULATION-MATURITY-YEAR-ID = CHI-QUE-ADU-2014-020, then, pop.id.start = 5. If you didn't name your individuals with the pop id in it, use the strata argument.

pop.id.end

The end of your population id in the name of your individual sample. Your individuals are identified in this form : SPECIES-POPULATION-MATURITY-YEAR-ID = CHI-QUE-ADU-2014-020, then, pop.id.end = 7. If you didn't name your individuals with the pop id in it, use the strata argument.

strata

(optional) A tab delimited file with 2 columns with header: INDIVIDUALS and STRATA. Default: strata = NULL. With a data frame of genotypes the strata is the INDIVIDUALS and POP_ID columns, with PLINK files, the tfam first 2 columns are used. If a strata file is specified, the strata file will have precedence. The STRATA column can be any hierarchical grouping.

pop.select

(string) Conduct the assignment analysis on a selected list of populations. Default = NULL for no selection and keep all population. e.g. pop.select = "QUE" to select QUE population samples. pop.select = c("QUE", "ONT") to select QUE and ONT population samples.

subsample

(Integer or Proportion) Default is no sumsampling, subsample = NULL. With a proportion argument subsample = 0.15, 15 percent of individuals in each populations are chosen randomly to represent the dataset. With subsample = 36, 36 individuals in each populations are chosen randomly to represent the dataset.

iteration.subsample

(Integer) The number of iterations to repeat subsampling, default: iteration.subsample = 1. With subsample = 20 and iteration.subsample = 10, 20 individuals/populations will be randomly chosen 10 times.

imputation.method

Should a map-independent imputations of markers be computed. Available choices are: (1) FALSE for no imputation. (2) "max" to use the most frequent category for imputations. (3) "rf" using Random Forest algorithm. Default: imputation.method = FALSE.

impute

(character) Imputation on missing genotype impute = "genotype" or alleles impute = "allele".

imputations.group

"global" or "populations". Should the imputations be computed globally or by populations. If you choose global, turn the verbose to TRUE, to see progress. Default = "populations".

num.tree

The number of trees to grow in Random Forest. Default is 100.

iteration.rf

The number of iterations of missing data algorithm in Random Forest. Default is 10.

split.number

Non-negative integer value used to specify random splitting in Random Forest. Default is 100.

verbose

Logical. Should trace output be enabled on each iteration in Random Forest ? Default is FALSE.

parallel.core

(optional) The number of core for OpenMP shared-memory parallel programming of Random Forest imputations. For more info on how to install the OpenMP version see randomForestSRC-package. If not selected detectCores()-1 is used as default.

Details

You need to have either the pop.id.start and pop.id.end or the strata argument, to identify your populations.

The imputations using Random Forest requires more time to compute and can take several minutes and hours depending on the size of the dataset and polymorphism of the species used. e.g. with a low polymorphic taxa, and a data set containing 30% missing data, 5 000 haplotypes loci and 500 individuals will require 15 min. The Fst is based on Weir and Cockerham 1984 equations.

Value

Depending on arguments selected, several files are written to the your working directory or folder The output in your global environment is a list. To view the assignment results $assignment to view the ggplot2 figure $plot.assignment. See example below.

Note

assignment_ngs assumes that the command line version of gsi_sim is properly installed and available on the command line, so it is executable from any directory (more info on how to do this, here http://gbs-cloud-tutorial.readthedocs.org/en/latest/03_computer_setup.html?highlight=bash_profile#save-time. The easiest way is to put the binary, the gsi_sim executable, in the folder /usr/local/bin. To compile gsi_sim, follow the instruction here: https://github.com/eriqande/gsi_sim.

Author(s)

Thierry Gosselin thierrygosselin@icloud.com

References

Anderson, Eric C., Robin S. Waples, and Steven T. Kalinowski. (2008) An improved method for predicting the accuracy of genetic stock identification. Canadian Journal of Fisheries and Aquatic Sciences 65, 7:1475-1486.

Anderson, E. C. (2010) Assessing the power of informative subsets of loci for population assignment: standard methods are upwardly biased. Molecular ecology resources 10, 4:701-710.

Catchen JM, Amores A, Hohenlohe PA et al. (2011) Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences. G3, 1, 171-182.

Catchen JM, Hohenlohe PA, Bassham S, Amores A, Cresko WA (2013) Stacks: an analysis tool set for population genomics. Molecular Ecology, 22, 3124-3140.

Weir BS, Cockerham CC (1984) Estimating F-Statistics for the Analysis of Population Structure. Evolution, 38, 1358–1370.

Ishwaran H. and Kogalur U.B. (2015). Random Forests for Survival, Regression and Classification (RF-SRC), R package version 1.6.1.

Ishwaran H. and Kogalur U.B. (2007). Random survival forests for R. R News 7(2), 25-31.

Ishwaran H., Kogalur U.B., Blackstone E.H. and Lauer M.S. (2008). Random survival forests. Ann. Appl. Statist. 2(3), 841–860.

Danecek P, Auton A, Abecasis G et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156-2158.

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics. 2007; 81: 559–575. doi:10.1086/519795

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
## Not run: 
assignment.treefrog <- assignment_ngs(
data = "batch_1.vcf",
whitelist.markers = "whitelist.vcf.txt",
snp.ld = NULL,
common.markers = TRUE,
marker.number = c(500, 5000, "all"),
sampling.method = "ranked",
thl = 0.3,
blacklist.id = "blacklist.id.lobster.tsv",
subsample = 25,
iteration.subsample = 10
gsi_sim.filename = "treefrog.txt",
keep.gsi.files = FALSE,
pop.levels = c("PAN", "COS")
pop.id.start = 5, pop.id.end = 7,
imputation.method = FALSE,
parallel.core = 12
)

Since the 'folder' argument is missing, it will be created automatically
inside your working directory.

To create a dataframe with the assignment results: 
assignment <- assignment.treefrog$assignment.

To plot the assignment using ggplot2 and facet 
(with subsample by current pop):
assignment.treefrog$plot.assignment + facet_grid(SUBSAMPLE~CURRENT).

To save the plot:
ggsave("assignment.treefrog.THL.subsample.pdf", height = 35, 
width = 60,dpi = 600, units = "cm", useDingbats = F)

## End(Not run)

eriqande/assigner documentation built on May 16, 2019, 8:45 a.m.