RADseq Data using R

Description Arguments Details Value Note Author(s) References Examples

gsi_sim is a tool for doing and simulating genetic stock identification and developed by Eric C. Anderson. The arguments in the assignment_ngs function were tailored for the reality of GBS data for assignment analysis while maintaining a reproducible workflow. The input data is a VCF file produced by STACKS or a data frame. Individuals, populations and markers can be filtered and/or selected in several ways using blacklist, whitelist and other arguments. Map-independent imputation of missing genotype using Random Forest or the most frequent category is also available. Markers can be randomly selected for a classic LOO (Leave-One-Out) assignment or chosen based on ranked Fst for a thl (Training, Holdout, Leave-one-out) assignment analysis.

`data`	Options include the VCF (1) or an haplotype files (2) created in STACKS (`data = "batch_1.vcf"` and `data = "batch_1.haplotypes.tsv"`, respectively) or a data frame (3) with tab separating the genotypes in columns (`data = "data.assignment.tsv"`). The 1st column is the `POP_ID`, 2nd colum the `INDIVIDUALS` and the remaining columns are the markers IDs containing genotypes in the format: 3 digits per allele and no space between alleles (e.g. 235240 : allele1 = 235 and allele2 = 240). Missing genotypes are coded `0` or `000000`. Note that the `POP_ID` column can be any hierarchical grouping. See the argument `strata` for other means of controlling grouping used in the assignment. The last option for data input is a PLINK file in `tped/tfam` format (e.g. `data = "data.assignment.tped"`). The first 2 columns of the `tfam` file will be used for the `strata` argument below, unless a new one is provided. Columns 1, 3 and 4 of the `tped` are discarded. The remaining columns correspond to the genotype in the format `01/04` where `A = 01, C = 02, G = 03 and T = 04`. For `A/T` format, use PLINK or bash to convert. Use VCFTOOLS http://vcftools.sourceforge.net/ with `--plink-tped` to convert very large VCF file. For `.ped` file conversion to `.tped` use PLINK http://pngu.mgh.harvard.edu/~purcell/plink/ with `--recode transpose`.
`assignment.analysis`	Assignment analysis conducted with `assignment.analysis = "gsi_sim"` or `assignment.analysis = "adegenet"`.
`whitelist.markers`	(optional) A whitelist containing CHROM (character or integer) and/or LOCUS (integer) and/or POS (integer) columns header. To filter by chromosome and/or locus and/or by snp. The whitelist is in the working directory (e.g. "whitelist.txt"). de novo CHROM column with 'un' need to be changed to 1. Default `NULL` for no whitelist of markers. In the VCF, the column ID is the LOCUS identification.
`monomorphic.out`	(optional) For PLINK file, should the monomorphic markers present in the dataset be filtered out ? Default: `monomorphic.out = TRUE`.
`blacklist.genotype`	(optional) Useful to erase genotype with below average quality, e.g. genotype with more than 2 alleles in diploid likely sequencing errors or genotypes with poor genotype likelihood or coverage. The blacklist as a minimum of 2 column headers (markers and individuals). Markers can be 1 column (CHROM or LOCUS or POS), a combination of 2 (e.g. CHROM and POS or CHROM and LOCUS or LOCUS and POS) or all 3 (CHROM, LOCUS, POS) The markers columns must be designated: CHROM (character or integer) and/or LOCUS (integer) and/or POS (integer). The id column designated INDIVIDUALS (character) columns header. The blacklist must be in the working directory (e.g. "blacklist.genotype.txt"). For de novo VCF, CHROM column with 'un' need to be changed to 1. Default `NULL` for no blacklist of genotypes to erase.
`snp.ld`	(optional) For VCF file only. With anonymous markers from RADseq/GBS de novo discovery, you can minimize linkage disequilibrium (LD) by choosing among these 3 options: `"random"` selection, `"first"` or `"last"` SNP on the same short read/haplotype. Default: `snp.ld = NULL`. Note that for other file type, use stackr package for haplotype file and create a whitelist, for plink and data frames, use PLINK linkage disequilibrium based SNP pruning option.
`common.markers`	(optional) Logical. Default = `FALSE`. With `TRUE`, will keep markers genotyped in all the populations.
`maf.thresholds`	(string, double, optional) String with local/populations and global/overall maf thresholds, respectively. Default: `maf.thresholds = NULL`. e.g. `maf.thresholds = c(0.05, 0.1)` for a local maf threshold of 0.05 and a global threshold of 0.1. Available for VCF, PLINK and data frame files. Use stackr for haplotypes files.
`maf.pop.num.threshold`	(integer, optional) When maf thresholds are used, this argument is for the number of pop required to pass the maf thresholds to keep the locus. Default: `maf.pop.num.threshold = 1`
`maf.approach`	(character, optional). By `maf.approach = "SNP"` or by `maf.approach = "haplotype"`. The function will consider the SNP or ID/LOCUS/haplotype/read MAF statistics to filter the markers. Default is `maf.approach = "SNP"`. The `haplotype` approach is restricted to VCF file.
`maf.operator`	(character, optional) `maf.operator = "AND"` or default `maf.operator = "OR"`. When filtering over LOCUS or SNP, do you want the local `"AND"` global MAF to pass the thresholds, or ... you want the local `"OR"` global MAF to pass the thresholds, to keep the marker?
`max.marker`	An optional integer useful to subsample marker number in large PLINK file. Default: `max.marker = NULL`. e.g. if the data set contains 200 000 markers and `max.marker = 10000` 10000 markers are subsampled randomly from the 200000 markers. Use `whitelist.markers` to keep specific markers.
`marker.number`	(Integer or string of number or "all") Calculations with fixed or subsample of your markers. Default= `"all"`. e.g. To test 500, 1000, 2000 and all the markers: `marker.number = c(500, 1000, 2000, "all"`. To use only 500 makers `marker.number = 500`.
`blacklist.id`	(optional) A blacklist with individual ID and a column header 'INDIVIDUALS'. The blacklist is in the working directory (e.g. "blacklist.txt").
`sampling.method`	(character) Should the markers be randomly selected `"random"` for a classic Leave-One-Out (LOO) assignment or chosen based on ranked Fst `"ranked"`, used in a Training-Holdout-Leave One Out (thl) assignment ?
`thl`	(character, integer, proportion) For `sampling.method = "ranked"` only. Default `1`, 1 individual sample is used as holdout. This individual is not participating in the markers ranking. For each marker number, the analysis will be repeated with all the indiviuals in the data set (e.g. 500 individuals, 500 times 500, 1000, 2000 markers). If a proportion is used e.g. `0.15`,= 15 populations are chosen randomly as holdout individuals. With `thl = "all"` all individuals are used for ranking (not good) and `iteration.method` argument below is set to `1` by default. For the other thl values, you can create different holdout individuals lists with the `iteration.method` argument below.
`iteration.method`	With random marker selection the iterations argument = the number of iterations to repeat marker resampling, default is `10` With `marker.number = c(500, 1000)` and default iterations setting, 500 markers will be randomly chosen 10 times and 1000 markers will be randomly chosen 10 times. For the ranked method, using `thl = 1`, the analysis will be repeated for each individuals in the data set for every `marker.number` selected. With a proportion argument `thl = 0.15`, 15 individuals and this process is reapeated the number of times chosen by the `iteration.method` value.
`folder`	(optional) The name of the folder created in the working directory to save the files/results.
`gsi_sim.filename`	(optional) The name of the file written to the directory. Use the extension ".txt" at the end. Default `assignment_data.txt`. The number of markers used will be appended to the name of the file.
`keep.gsi.files`	(Boolean) Default `FALSE` The input and output gsi_sim files will be deleted from the directory when finished processing. With `TRUE`, remember to allocate a large chunk of the disk space for the analysis.
`pop.levels`	(required) A character string with your populations ordered.
`pop.labels`	(optional) A character string for your populations labels. If you need to rename sampling sites in `pop.levels` or combined sites/pop into a different names, here is the place.
`pop.id.start`	The start of your population id in the name of your individual sample. Your individuals are identified in this form : SPECIES-POPULATION-MATURITY-YEAR-ID = CHI-QUE-ADU-2014-020, then, `pop.id.start` = 5. If you didn't name your individuals with the pop id in it, use the `strata` argument.
`pop.id.end`	The end of your population id in the name of your individual sample. Your individuals are identified in this form : SPECIES-POPULATION-MATURITY-YEAR-ID = CHI-QUE-ADU-2014-020, then, `pop.id.end` = 7. If you didn't name your individuals with the pop id in it, use the `strata` argument.
`strata`	(optional) A tab delimited file with 2 columns with header: `INDIVIDUALS` and `STRATA`. Default: `strata = NULL`. With a data frame of genotypes the strata is the INDIVIDUALS and POP_ID columns, with PLINK files, the `tfam` first 2 columns are used. If a `strata` file is specified, the strata file will have precedence. The `STRATA` column can be any hierarchical grouping.
`pop.select`	(string) Conduct the assignment analysis on a selected list of populations. Default = `NULL` for no selection and keep all population. e.g. `pop.select = "QUE"` to select QUE population samples. `pop.select = c("QUE", "ONT")` to select QUE and ONT population samples.
`subsample`	(Integer or Proportion) Default is no sumsampling, `subsample = NULL`. With a proportion argument `subsample = 0.15`, 15 percent of individuals in each populations are chosen randomly to represent the dataset. With `subsample = 36`, 36 individuals in each populations are chosen randomly to represent the dataset.
`iteration.subsample`	(Integer) The number of iterations to repeat subsampling, default: `iteration.subsample = 1`. With `subsample = 20` and `iteration.subsample = 10`, 20 individuals/populations will be randomly chosen 10 times.
`imputation.method`	Should a map-independent imputations of markers be computed. Available choices are: (1) `FALSE` for no imputation. (2) `"max"` to use the most frequent category for imputations. (3) `"rf"` using Random Forest algorithm. Default: `imputation.method = FALSE`.
`impute`	(character) Imputation on missing genotype `impute = "genotype"` or alleles `impute = "allele"`.
`imputations.group`	`"global"` or `"populations"`. Should the imputations be computed globally or by populations. If you choose global, turn the verbose to `TRUE`, to see progress. Default = `"populations"`.
`num.tree`	The number of trees to grow in Random Forest. Default is 100.
`iteration.rf`	The number of iterations of missing data algorithm in Random Forest. Default is 10.
`split.number`	Non-negative integer value used to specify random splitting in Random Forest. Default is 100.
`verbose`	Logical. Should trace output be enabled on each iteration in Random Forest ? Default is `FALSE`.
`parallel.core`	(optional) The number of core for OpenMP shared-memory parallel programming of Random Forest imputations. For more info on how to install the OpenMP version see `randomForestSRC-package`. If not selected `detectCores()-1` is used as default.

You need to have either the pop.id.start and pop.id.end or the strata argument, to identify your populations.

The imputations using Random Forest requires more time to compute and can take several minutes and hours depending on the size of the dataset and polymorphism of the species used. e.g. with a low polymorphic taxa, and a data set containing 30% missing data, 5 000 haplotypes loci and 500 individuals will require 15 min. The Fst is based on Weir and Cockerham 1984 equations.

Depending on arguments selected, several files are written to the your working directory or folder The output in your global environment is a list. To view the assignment results $assignment to view the ggplot2 figure $plot.assignment. See example below.

assignment_ngs assumes that the command line version of gsi_sim is properly installed and available on the command line, so it is executable from any directory (more info on how to do this, here http://gbs-cloud-tutorial.readthedocs.org/en/latest/03_computer_setup.html?highlight=bash_profile#save-time. The easiest way is to put the binary, the gsi_sim executable, in the folder /usr/local/bin. To compile gsi_sim, follow the instruction here: https://github.com/eriqande/gsi_sim.

Thierry Gosselin thierrygosselin@icloud.com

Anderson, Eric C., Robin S. Waples, and Steven T. Kalinowski. (2008) An improved method for predicting the accuracy of genetic stock identification. Canadian Journal of Fisheries and Aquatic Sciences 65, 7:1475-1486.

Anderson, E. C. (2010) Assessing the power of informative subsets of loci for population assignment: standard methods are upwardly biased. Molecular ecology resources 10, 4:701-710.

Catchen JM, Amores A, Hohenlohe PA et al. (2011) Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences. G3, 1, 171-182.

Catchen JM, Hohenlohe PA, Bassham S, Amores A, Cresko WA (2013) Stacks: an analysis tool set for population genomics. Molecular Ecology, 22, 3124-3140.

Weir BS, Cockerham CC (1984) Estimating F-Statistics for the Analysis of Population Structure. Evolution, 38, 1358–1370.

Ishwaran H. and Kogalur U.B. (2015). Random Forests for Survival, Regression and Classification (RF-SRC), R package version 1.6.1.

Ishwaran H. and Kogalur U.B. (2007). Random survival forests for R. R News 7(2), 25-31.

Ishwaran H., Kogalur U.B., Blackstone E.H. and Lauer M.S. (2008). Random survival forests. Ann. Appl. Statist. 2(3), 841–860.

Danecek P, Auton A, Abecasis G et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156-2158.

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics. 2007; 81: 559–575. doi:10.1086/519795

## Not run: 
assignment.treefrog <- assignment_ngs(
data = "batch_1.vcf",
whitelist.markers = "whitelist.vcf.txt",
snp.ld = NULL,
common.markers = TRUE,
marker.number = c(500, 5000, "all"),
sampling.method = "ranked",
thl = 0.3,
blacklist.id = "blacklist.id.lobster.tsv",
subsample = 25,
iteration.subsample = 10
gsi_sim.filename = "treefrog.txt",
keep.gsi.files = FALSE,
pop.levels = c("PAN", "COS")
pop.id.start = 5, pop.id.end = 7,
imputation.method = FALSE,
parallel.core = 12
)

Since the 'folder' argument is missing, it will be created automatically
inside your working directory.

To create a dataframe with the assignment results: 
assignment <- assignment.treefrog$assignment.

To plot the assignment using ggplot2 and facet 
(with subsample by current pop):
assignment.treefrog$plot.assignment + facet_grid(SUBSAMPLE~CURRENT).

To save the plot:
ggsave("assignment.treefrog.THL.subsample.pdf", height = 35, 
width = 60,dpi = 600, units = "cm", useDingbats = F)

## End(Not run)