View source: R/assignment_ngs.R
assignment_ngs | R Documentation |
The arguments in the assignment_ngs
function were tailored for the
reality of RADseq data for assignment analysis while
maintaining a reproducible workflow.
Assignment assumptions are listed in the section below.
Input file: various file format are supported (see data
argument below).
Cross-Validations: Markers can be randomly selected for a classic LOO (Leave-One-Out) assignment or chosen based on ranked Fst for a thl (Training, Holdout, Leave-one-out) assignment analysis.
Assignment analysis: conducted in gsi_sim, a tool for doing and simulating genetic stock identification and developed by Eric C. Anderson, or adegenet, an R package developed by Thibaul Jombart.
Parallel: The assignment can be conduncted on multiple CPUs. The R GUI is unstable with this functions, I recommend using RStudio.
assignment_ngs(
data,
strata = NULL,
pop.levels = NULL,
assignment.analysis = c("gsim_sim", "adegenet"),
markers.sampling = c("ranked", "random"),
marker.number = "all",
thl = 1,
iteration.method = 10,
subsample = NULL,
iteration.subsample = 1,
verbose = TRUE,
parallel.core = parallel::detectCores() - 1,
...
)
data |
Several input format are accepted. assigner uses radiator
|
strata |
(optional)
The strata file is a tab delimited file with a minimum of 2 columns headers:
|
pop.levels |
(optional, string) This refers to the levels in a factor. In this
case, the id of the pop.
Use this argument to have the pop ordered your way instead of the default
alphabetical or numerical order. e.g. |
assignment.analysis |
(character) Assignment analysis conducted with
|
markers.sampling |
(character) 2 options for markers selection:
|
marker.number |
(Integer or string of number or "all") The assignment
analysis can use all your markers (default) or a subset of your markers.
e.g. To test 500, 1000, 2000 and all the markers:
|
thl |
(character, integer, proportion) For
Different lists of holdout individuals are generated when the argument
|
iteration.method |
(integer)
With random markers selection method, the iterations argument =
the number of iterations to repeat marker resampling.
Default: With Notes: If all the markers are used, with With ranked makers selection method, using |
subsample |
(Integer or Character, optional)
This argument subsample individuals.
With |
iteration.subsample |
(Integer) The number of iterations to repeat
subsampling of individuals.
With |
verbose |
(optional, logical) When |
parallel.core |
(optional) The number of core used for parallel
execution during import.
Default: |
... |
(optional) To pass further argument for fine-tuning the function (see advanced section below). |
Using gsi_sim:
assignment_ngs
assumes that the command line version of
gsi_sim
is properly installed into file.path(system.file(package = "assigner"), "bin", "gsi_sim")
.
Things are set up so that it will try running gsi_sim, and if it does not find it, the
program will throw an error and ask the user to run install_gsi_sim
which will do its best to put a usable copy of gsi_sim where it is needed.
To do so, you must be connected to the internet. If that doesn't work, you will
need to compile the program yourself, or get it yourself, and the manually copy
it to file.path(system.file(package = "assigner"), "bin", "gsi_sim")
.
To compile gsi_sim, follow the
instruction here: https://github.com/eriqande/gsi_sim.
Depending on arguments selected, several folders and files are written:
01_radiator_tidy_genomic
this is the result of importing the data
with radiator import module tidy_genomic_data.
assigner_assignment_ngs_args_date@time.tsv
: For reproducibility,
the function call, arguments and values used along the default arguments.
assignment.plot.pdf
: The assignment figure.
assignment.results.summary.stats.tsv
: tibble
of summarized assignment statistics.
assignment.ranked.results.summary.stats.all.subsamples.tsv
:
When subsampling is used, this file contains the raw results of all subsample before
summarizing.
THL: Training, Holdout, Leave-one-out:
Intermediate files are written, you can inspect specific iterations and/or subsample:
assignment.ranked.results.iterations.raw.tsv
: tibble
with raw intermediate results of all iterations.
assignment.ranked.results.iterations.summary.tsv
: tibble
with intermediate summary over iterations.
holdout.individuals.tsv
: tibble with the holdout individuals and
associated iteration and random seed number.
LOO: Leave-one-out:
Intermediate files are written, you can inspect specific iterations and/or subsample:
assignment.random.results.iterations.raw.tsv
: tibble
with raw intermediate results of all iterations.
markers.random.tsv
: tibble with the random markers selected for
each iteration with associated random seed number.
Folders
The names have the different iterations i
starting with assignment_
i contains:
assignment_
i.tsv
: the assignment result, for the
iteration.
fst.ranked_
i.tsv
: for THL method, the ranked Fst
per markers, for the iteration.
gsi_sim_seeds
: the gsi_sim
random seeds when this program
is used, for the iteration.
The output in your global environment is a list. To view the assignment results
$assignment
to view the ggplot2 figure $assignment.plot
.
See example below.
Ideally, forget about this section. For advance users, through dots-dots-dots ... you can pass several arguments for fine-tuning the function:
adegenet.dapc.opt
(optional, character) Argument available only when
using:
assignment.analysis = "adegenet"
with
markers.sampling == "random"
.
The assignment using dapc can use the optimized alpha score
adegenet.dapc.opt == "optim.a.score"
or
cross-validation adegenet.dapc.opt == "xval"
for stability of group membership probabilities.
For fine tuning the trade-off between power of discrimination and over-fitting.
See adegenet documentation for more details.
adegenet.dapc.opt == "xval"
doesn't work with missing data, so it's
only available with full dataset or imputed dataset.
With non imputed data or the default: adegenet.dapc.opt == "optim.a.score"
.
adegenet.n.rep
: (optional, integer)
When adegenet.dapc.opt == "xval"
, the number of replicates to be
carried out at each level of PC retention.
Default: adegenet.n.rep = 30
.
See adegenet documentation for more details.
adegenet.training
: (optional, numeric)
When adegenet.dapc.opt == "xval"
, the proportion of data (individuals)
to be used for the training set.
Default: adegenet.training = 0.9
, if all groups have >= 10 members.
Otherwise, training.set scales automatically to the largest proportion
that still ensures all groups will be present in both training
and validation sets.
See adegenet documentation for more details.
folder
: (optional) The name of the folder created in the working
directory to save the files/results. Default: folder = NULL
will create
the folder for you with data and time.
filename
: (optional) The name of the file written to the directory.
Use the extension ".txt" at the end.
Several info will be appended to the name of the file.
Default assignment_data.txt
.
keep.gsi.files
: (logical, optional) With the default,
the intermediate input and output gsi_sim files will be deleted from the
directory when finished processing. I you decide to keep the files, with
keep.gsi.files = TRUE
, remember to allocate a large chunk of the disk
space for the analysis.
Default: keep.gsi.files = FALSE
random.seed
: (integer, optional) For reproducibility, set an integer
that will be used inside function that requires randomness. With default,
a random number is generated and printed in the appropriate output.
Default: random.seed = NULL
.
full.y.range
: (logical, optional) By default the y axis print
min and max values are determied automatically from the data. It might
be more usefull to see the full range of the scale from 0 to 100, in
this case use full.y.range = TRUE
. This can also be modified after
running the analysis. See example below.
Default: full.y.range = FALSE
.
Individuals QC: Bad individual QC will bias the assignment results.
remove duplicates samples: when found within the same strata, duplicates generate a false population signal, when they are found between strata (yes, I've seen it), it's generating noise and the core population signal is diluted.
remove individual with outlier heterozygosity: unchecked, outlier individuals based on heterozygosity will generate false population signal when the sample as lower heterozygosity and match against several strata (week assignment) when the sample as higher heterozygosity.
remove individuals with too many missing: these individuals will exacerbate or dilute the signal, depending on correlation with heterozygosity and presence of pattern of missingness.
Markers QC: Bad markers QC will bias the assignment results.
low MAC: improper Minor Allele Count filtering generate noise. The LOO and THL methods, both removes samples during model construction, if MAC is too low, the population core signature is greatly impacted at each iteration.
Linkage disequilibrium: remove highly linked markers.
HWE: remove markers in very strong Hardy-Weinberg disequilibrium likely artefactual and/or genotyping errors.
Strata: bad K selection will result in poor assingment results.
filtered data: Don't expect to filter your data here.
radiator was designed for this, and filter_rad
will handle
the issues mentioned above. By default, the function will only remove
monomorphic markers and keep markers in common between strata.
Map-independent imputation of missing genotype is avaible in my other R package called grur.
Use grur to :
Visualize your missing data: before imputing your genotypes, visualize your missing data. Several visual tools are available inside grur to help you decide the best strategy after.
Optimize: use grur imputation module and other functions to optimize the imputations of your dataset. You need to test arguments. Failing to conduct tests and adjust imputations arguments will generate artifacts and/or exacerbate bias. Using defaults is not optional here...
genomic_converter: use the output argument inside grur imputation module to generate the required formats for assigner (e.g. a tidy dataset)
Deprecated arguments:
sampling.method
: renamed markers.sampling
.
Thierry Gosselin thierrygosselin@icloud.com and Eric C. Anderson
Anderson, Eric C., Robin S. Waples, and Steven T. Kalinowski. (2008) An improved method for predicting the accuracy of genetic stock identification. Canadian Journal of Fisheries and Aquatic Sciences 65, 7:1475-1486.
Anderson, E. C. (2010) Assessing the power of informative subsets of loci for population assignment: standard methods are upwardly biased. Molecular ecology resources 10, 4:701-710.
Weir BS, Cockerham CC (1984) Estimating F-Statistics for the Analysis of Population Structure. Evolution, 38, 1358–1370.
Jombart T, Devillard S, Balloux F. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet. 2010:11: 94. doi:10.1186/1471-2156-11-94
Jombart T, Ahmed I. adegenet 1.3-1: new tools for the analysis of genome-wide SNP data. Bioinformatics. 2011:27: 3070–3071. doi:10.1093/bioinformatics/btr521
## Not run:
assignment.treefrog <- assignment_ngs(
data = "batch_1.vcf", strata = "strata.treefrog.tsv",
assignment.analysis = "gsi_sim",
marker.number = c(500, 5000, "all"),
markers.sampling = "ranked", thl = 0.3
)
# To create a dataframe with the assignment results:
assignment <- assignment.treefrog$assignment
# To plot the assignment using ggplot2 and facet
fig <- assignment.treefrog$assignment.plot
# To view the full range of y values = Assignment success(%):
fig + ggplot2::scale_y_continuous(limits = c(0,100))
# Or use the ... argument: full.y.range = TRUE
# If you want to remove underscore in population names that contained white space:
facet_names <- c(
`some_pop` = "Some POP",
`some_other_pop` = "This is what I want",
`OVERALL` = "Overall")
# use the labeller in the facet_grid or facet_wrap call:
fig +
ggplot2::facet_grid(
SUBSAMPLE ~ CURRENT,
ggplot2::labeller = ggplot2::as_labeller(facet_names)
) +
ggplot2::scale_y_continuous(limits = c(0,100))
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.