Objectives

With this vignette, you should be able to learn the basics in under 30 minutes.

Assumptions:

Prepare your workspace

library(assigner)

By default, the working directory is where the file for this vignette is. To change: setwd("new path here")

Data 1: high structure

For this exercise, we use the first dataset included in assigner. It's a simulated dataset generated with grur. Details of the dataset are available using: ?assigner::data_assigner_sim_01.

data <- data_assigner_sim_01

Analysis using THL with gsi_sim

The analysis below uses the cross-validation technique called Training, Holdout, Leave-one-out (THL):

gsi_sim is not distributed with assigner, follow the install instruction, here for my mac it's:

assigner::install_gsi_sim(fromSource = TRUE)

no subsampling

test1 <- assigner::assignment_ngs(
  data = data,
  assignment.analysis = "gsi_sim",
  markers.sampling = "ranked", 
  thl = 0.2,
  iteration.method = 5
)
#>################################################################################
#>########################## assigner::assignment_ngs ############################
#>################################################################################
#>Execution date/time: 20190501@1104
#>Assignment analysis with gsi_sim
#>Folder created: assignment_analysis_method_ranked_20190501@1104
#>Calibrating REF/ALT alleles...
#>Subsampling: not selected
#>Conducting Assignment analysis using Training, Holdout, Leave-one-out
#>Using training samples to rank markers based on Fst
#>Holdout samples saved in your folder
#>Starting parallel computations, for progress monitor activity in folder...
#>
#>Computation time, overall: 7 sec
#>########################## assignment_ngs completed ############################

By default, the function uses all the markers and all my CPUs - 1, to change this last behavior use the argument parallel.core.

The most important information that you'll probably want to see are both returned in the working directory and the object. Not much is kept in the object test1, because disk space is cheap, memory is not!

names(test1)
#>[1] "assignment"      "assignment.plot"

Written in the working directory under assignment_analysis_method_ranked_date@time, where date@time is the date and time the function was run:

# 01_radiator_tidy_genomic: folder
# assigner_assignment_ngs_args_20190501@1102.tsv: tibble, file
# assignment_1: folder
# assignment_2: folder
# assignment_3: folder
# assignment_4: folder
# assignment_5: folder
# assignment.plot.pdf: figure
# assignment.ranked.results.iterations.raw.tsv: tibble, file
# assignment.ranked.results.iterations.summary.tsv: tibble, file
# assignment.results.summary.stats.tsv: tibble, file
# holdout.individuals.tsv: tibble, file

These folders and files are detailed in the function documentation under value returned by the function.

To see the figure:

test1$plot.assignment

By default, the figure display the range of the data, to change this behavior and see the full range for the y axis:

test1$plot.assignment + ggplot2::scale_y_continuous(limits = c(0,100)) 

data %<>% 
  radiator::filter_monomorphic(data = .) %>%
  radiator::filter_common_markers(data = .)
#>Filter monomorphic markers
#>Number of individuals / strata / chrom / locus / SNP:
#>    Blacklisted: 0 / 0 / NA / NA / 3
#>
#>Filter common markers:
#>Number of individuals / strata / chrom / locus / SNP:
#>    Blacklisted: 0 / 0 / 0 / 0 / 0

with subsampling

For the second test, we will test several marker numbers and use the subsampling arguments to select 30 individuals in each strata (repeating this 3 times).

test2 <- assigner::assignment_ngs(
  data = data,
  assignment.analysis = "gsi_sim",
  markers.sampling = "ranked", 
  thl = 0.2,
  iteration.method = 5, 
  marker.number = c(100, 200, 300, 400, "all"),
  subsample = 30, 
  iteration.subsample = 3
)
#> ################################################################################
#> ########################## assigner::assignment_ngs ############################
#> ################################################################################
#> Execution date/time: 20190501@1158
#> Assignment analysis with gsi_sim
#> Folder created: assignment_analysis_method_ranked_20190501@1158
#> Calibrating REF/ALT alleles...
#> Subsampling: selected
#>     using subsample size of: 30
#> 
#> Analyzing subsample: 1
#> Conducting Assignment analysis using Training, Holdout, Leave-one-out
#> Using training samples to rank markers based on Fst
#> Holdout samples saved in your folder
#> Starting parallel computations, for progress monitor activity in folder...
#> 
#> Analyzing subsample: 2
#> Conducting Assignment analysis using Training, Holdout, Leave-one-out
#> Using training samples to rank markers based on Fst
#> Holdout samples saved in your folder
#> Starting parallel computations, for progress monitor activity in folder...
#> 
#> Analyzing subsample: 3
#> Conducting Assignment analysis using Training, Holdout, Leave-one-out
#> Using training samples to rank markers based on Fst
#> Holdout samples saved in your folder
#> Starting parallel computations, for progress monitor activity in folder...
#> 
#> Computation time, overall: 19 sec
#> ########################## assignment_ngs completed ############################

The object generated is similar to the analysis witout subsampling. The output folder is different. The subsample folders have the same content as the output of the analysis witout subsampling.

# 01_radiator_tidy_genomic: folder
# assigner_assignment_ngs_args_20190501@1540.tsv: tibble, file
# assignment.plot.pdf: figure
# assignment.ranked.results.summary.stats.all.subsamples.tsv: tibble, file
# assignment.results.summary.stats.tsv: tibble, file
# subsample_1: folder
# subsample_2: folder
# subsample_3: folder
# subsampling_individuals.tsv: tibble, file

To view the figure:

test2$plot.assignment + ggplot2::scale_y_continuous(limits = c(0,100)) 

This dataset as a high overall Fst value:

assigner::fst_WC84(data) %$% fst.overall$FST
#>[1] 0.39603

The look of the membership probabilities with adegenet dapc analysis would show something similar to this:

Data 2: low structure

Let's try the same analysis, but this time with a dataset with lower Fst. It's a simulated dataset generated with grur. Details of the dataset are available using: ?assigner::data_assigner_sim_02.

data <- data_assigner_sim_02

Analysis using THL with gsi_sim

test3 <- assigner::assignment_ngs(
  data = data,
  assignment.analysis = "gsi_sim",
  markers.sampling = "ranked", 
  thl = 0.2,
  iteration.method = 5, 
  marker.number = c(100, 200, 300, 400, "all"),
  subsample = 30, 
  iteration.subsample = 3
)

To view the figure:

test3$plot.assignment + ggplot2::scale_y_continuous(limits = c(0,100)) 
# <img src="assignment_thl_test3.png">: works
#![](assignment_thl_test3.png): works
#knitr::include_graphics("assignment_thl_test3.png"):works
knitr::include_graphics("assignment_thl_test3.png")

This is the overall Fst value:

assigner::fst_WC84(data) %$% fst.overall$FST
#>[1] 0.001320833

This is the membership probabilities with adegenet dapc analysis:

The populations are very admixed because of the high migration rate used during the simulations.

Analysis using LOO with gsi_sim

Let's try using the Leave-One-Out cross-validation technique with dataset2. With this method, there is no potential bias during marker selection, you can used all the markers with marker.number = "all" or a string of marker numbers like the example above, but here, there's no point in using less markers randomly.

The Leave-One-Out method means that the allele frequencies are calculated without the sample being assigned. This is repeated for each sample.

test4 <- assigner::assignment_ngs(
  data = data,
  assignment.analysis = "gsi_sim",
  markers.sampling = "random", 
  marker.number = "all"
)
#> ################################################################################
#> ########################## assigner::assignment_ngs ############################
#> ################################################################################
#> Execution date/time: 20190501@1317
#> Assignment analysis with gsi_sim
#> Folder created: assignment_analysis_method_random_20190501@1317
#> Calibrating REF/ALT alleles...
#> Subsampling: not selected
#> Conducting Assignment analysis with markers selected randomly
#> Making a list containing all the markers combinations
#> Starting parallel computations, for progress monitor activity in folder...
#> Summarizing the assignment analysis results by iterations and marker group
#> Compiling results
#> ########################## assignment_ngs completed ############################

To view the figure:

test4$plot.assignment + ggplot2::scale_y_continuous(limits = c(0,100)) 
knitr::include_graphics("assignment_loo_test4.png")

Conclusion: not much to gain here by using the LOO...



thierrygosselin/assigner documentation built on Nov. 9, 2024, 3:38 a.m.