knitr::opts_chunk$set(message = FALSE) start_time <- Sys.time()
library(tidyverse) library(HatcheryPedAgree) genotypes <- read_rds("../private/cvsh_genotypes.rds") metadata <- read_rds("../private/cvsh_metadata.rds")
miss_dsn <- genotypes %>% group_by(indiv) %>% summarise( num_all_loci = n() / 2, # check to make sure each gene copy is explicitly listed as NA num_non_miss_loci = sum(!is.na(allele_int)) / 2) count(miss_dsn, num_all_loci) # good. Everyone has 93 loci (even if some are missing) full_histo <- ggplot(miss_dsn, aes(x = num_non_miss_loci)) + geom_histogram(binwidth = 1) full_histo
Zoom in on that a bit:
full_histo + ylim(0, 1000)
How many fish would different cutoffs remove? Let's see what different choices would give us in terms of how many individuals are retained/removed:
miss_dsn %>% count(num_non_miss_loci) %>% arrange(desc(num_non_miss_loci)) %>% mutate( fract_retained = cumsum(n) / sum(n), fract_tossed = 1 - fract_retained )
Let's set the non-missing locus cutoff to >= 84 loci. That is about 10% missing, and means we toss 1.77% of our samples.
meta <- metadata %>% semi_join(miss_dsn %>% filter(num_non_miss_loci >= 84), by = "indiv") genos <- genotypes %>% semi_join(miss_dsn %>% filter(num_non_miss_loci >= 84), by = "indiv")
# first time through, just get the distribution of matching proportions for_histo <- find_matching_samples(genos, min_frac_matching = 0.8) # check the distribution for_histo$pairs %>% mutate(frac_match = num_match / num_non_miss) %>% ggplot(aes(x = frac_match)) + geom_histogram() # get the clusters of matching genotypes each indvidual belongs to for_real <- find_matching_samples(genos, min_frac_matching = 0.95, return_clusters = TRUE) # we will end up using the identified clusters head(for_real$clusters)
At this point, for_real$clusters
is the tibble we need to re-organize our genotypes and meta-data
for SNPPIT. For every cluster of matching samples we will use, as the genotype, the
sample with the least missing data. We will also use the sex of that individual (sometimes
there are mismatches in the sex of the matching genotypes). Sometimes there are mismatches in
the hatchery of the matching genotypes. In those cases, each separate hatchery gets its own
canonical individual named as the ID of the main canonical individual with the hatchery
name appended to it. The following function takes care of this and reorganizes both the
genotypes and also the meta data into snppit_genos
and snppit_meta
(as well as a few other list
components).
reorg <- reorganize_matching_samples( genotypes = genos, metadata = meta, clusters = for_real$clusters )
Let's have a look at some of the different components of that output.
matchers_metadata
This is the meta data for all the matching genotypes. Column original_id
shows what they
were named on input, and column new_id
shows the ID used to identify them now in the
SNPPIT-ready output.
reorg$matchers_metadata
snppit_meta
and snppit_genos
These are the tibbles that are ready to pass into prepare_snppit_infile()
. Multiple
years and spawner_groups of the matching individuals have been lumped into
comma-separated strings for the year and spawner group inputs to SNPPIT.
cross_hatchery_matches
A tibble that shows you which clusters of matching genotypes included fish from more than one hatchery.
reorg$cross_hatchery_matches
cross_sex_matches
A tibble that shows you which clusters of matching genotypes included fish with more than one reported sex
reorg$cross_sex_matches
We have rolled these two steps into a single run_snppit() function.
Internally, it calls prepare_snppit_infile()
to write the data,
and then it runs snppit
inside the system()
command.
Here is what that looks like:
snppit_dir1 <- run_snppit(reorg$snppit_genos, reorg$snppit_meta) snppit_dir1
Running the code and rendering this notebook required approximately this much time on a Mac laptop of middling speed:
Sys.time() - start_time
# here we call the "constrained" runs SAD for "sex_and_date" SAD_results <- slurp_snppit(snppit_dir1, reorg$snppit_meta) dir.create("outputs") write_rds(SAD_results, path = "outputs/CV_SAD_results.rds", compress = "xz")
snppit_dir_noSAD <- run_snppit( reorg$snppit_genos, reorg$snppit_meta, outdir = "../tmp_snppit_arena/CV_no_SAD_run", use_spawner_group = FALSE, use_sex = FALSE ) no_SAD_results <- slurp_snppit("../tmp_snppit_arena/CV_no_SAD_run", reorg$snppit_meta) write_rds(no_SAD_results, path = "outputs/CV_no_SAD_results.rds", compress = "xz")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.