knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "selecting-wifl-assays-figs/"
)

Load some packages

library(readr)
library(dplyr)
library(ggplot2)
library(stringr)
library(genoscapeRtools)

Introduction

This is a bit of a mess, but we have something of a way forward, having looked at many of the different comparisons. We are going to look at every relevant between-subspecies comparison and every relevant within-subspecies comparison, and just imagine taking the top 300 from each. Then we will filter by assayability and see how many we end up with.

I have wrapped a lot of the gritty work into a few functions.

Processing the "Divergence SNPs"

These are the SNPs that we are choosing because of allele freq differences between certain groups.

Get the SNPs

We are going to develop SNPs from amongst those that passed our "clean" filters (even though, later, we bring in more variation to make sure nothing is hiding in the flanking regions).

wifl_clean <- read_012("../cleaned_indv175_pos105000")
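In case the 012 format is unfamiliar, here is a toy example of the assumed layout (individuals in rows, "CHROM--POS" SNP names in columns, genotypes coded as counts of one allele, -1 for missing data). The names here are hypothetical, just to show the shape of what read_012() hands back:

```r
# a toy 012 matrix in the assumed layout: individuals in rows,
# "CHROM--POS" SNP names in columns, -1 = missing genotype
toy_012 <- matrix(
  c(0, 1,  2,
    1, 1, -1,
    2, 0,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("bird_01", "bird_02", "bird_03"),             # hypothetical individuals
    c("scaffA--100", "scaffA--250", "scaffB--999")  # hypothetical SNP names
  )
)
toy_012
```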

Get and process the groups

We also want the groups that we are putting the birds into. Here is the file of designations that Kristen has supplied:

grps <- read_delim("../development/data/WIFL_Cat_forAssayDesign.txt", delim = "\t")
grps

That is pretty straightforward. Let's just count up how many birds we have in each group and subgroup:

grps %>%
  group_by(Subspecies_Pure, Within_Subspecies) %>%
  tally()

This is good to have a look at. It makes it clear that these categories are not strictly nested. But we have plenty of birds to choose from, so that is good.

We need to get a list of the IDs of the individuals in each of those groups:

major_grps <- grps %>%
  filter(!is.na(Subspecies_Pure)) %>%
  split(.$Subspecies_Pure) %>%
  lapply(., function(x) x$Individual)

minor_grps <- grps %>%
  filter(!is.na(Within_Subspecies)) %>%
  split(.$Within_Subspecies) %>%
  lapply(., function(x) x$Individual)

# see what that looks like
minor_grps[1:2]

Computing allele frequencies in the groups

Now we can do that with a function from genoscapeRtools:

major_grp_freqs <- group_012_cnts(wifl_clean, major_grps)
minor_grp_freqs <- group_012_cnts(wifl_clean, minor_grps)
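For the curious, here is a rough sketch of the sort of computation involved. This is an assumption about what group_012_cnts() does, not its actual source; the idea is to tally, for one group, the number of observed gene copies and the allele frequency at each SNP:

```r
# hedged sketch of per-group allele frequencies from an 012 matrix
# (an assumption about group_012_cnts(), not its actual source)
group_freq_sketch <- function(m012, ids) {
  g <- m012[rownames(m012) %in% ids, , drop = FALSE]
  g[g == -1] <- NA                           # -1 codes a missing genotype
  ntot <- 2 * colSums(!is.na(g))             # gene copies actually observed
  freq <- colSums(g, na.rm = TRUE) / ntot    # frequency of the counted allele
  tibble::tibble(pos = colnames(m012), ntot = ntot, freq = freq)
}

# e.g., with the toy matrix from earlier:
# group_freq_sketch(toy_012, c("bird_01", "bird_03"))
```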

Doing the comparisons

Here we have some messy stuff to just pick out the comparisons that we want to look at, and then we actually make a big data frame of all those comparisons.

The "major" comparisons (between subspecies level groups)

comps <- combn(unique(major_grp_freqs$group), 2)

# keep a comparison only if fewer than two of its groups match "ext"
keep <- (str_detect(comps, "ext") %>%
  matrix(nrow = 2) %>%
  colSums()) < 2
comps <- comps[, keep]

major_comps <- lapply(1:ncol(comps), function(c) {
  comp_pair(major_grp_freqs, comps[1, c], comps[2, c])
}) %>%
  bind_rows()

# print out how many that is:
major_comps %>% group_by(comparison) %>% tally()
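I have not shown comp_pair()'s internals, but conceptually it pairs up the two groups' frequencies at each SNP so they can be ranked. A hedged sketch of that idea (the real function supplies the cap column we rank on below; abs_diff here is just a stand-in statistic):

```r
# conceptual sketch of a pairwise comparison (assumed behavior, not
# comp_pair()'s actual source): join two groups' frequencies and score
# each SNP by the absolute difference in allele frequency
comp_pair_sketch <- function(freqs, g1, g2) {
  inner_join(
    freqs %>% filter(group == g1) %>% select(pos, freq, ntot),
    freqs %>% filter(group == g2) %>% select(pos, freq, ntot),
    by = "pos", suffix = c(".x", ".y")
  ) %>%
    mutate(comparison = paste(g1, g2, sep = "--"),
           abs_diff = abs(freq.x - freq.y))  # stand-in for the real ranking statistic
}
```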

The "minor" comparison (within subspecies)

We will only look within subspecies in these cases.

comps2 <- combn(unique(minor_grp_freqs$group), 2)

# now restrict that to just the comparisons within subspp
pref <- str_replace_all(comps2, "_.*", "") %>%
  matrix(nrow = 2)
keep2 <- pref[1,] == pref[2,]

comps2 <- comps2[, keep2]

minor_comps <- lapply(1:ncol(comps2), function(c) {
  comp_pair(minor_grp_freqs, comps2[1, c], comps2[2, c])
}) %>%
  bind_rows()

# print out how many that is:
minor_comps %>% group_by(comparison) %>% tally()

Choosing the top 300 for each comparison

This is relatively easy; it is just filtering. We rank things and then take the top 300. That might yield slightly fewer than 300 if there are ties, but that is OK, I think. We will stick them together at the end.

top300 <- lapply(list(major = major_comps, minor = minor_comps), function(x) {
  x %>%
    group_by(comparison) %>%
    mutate(rank_full = rank(-cap)) %>%
    filter(rank_full <= 300)
}) %>%
  bind_rows(.id = "type_of_comparison")

# we print that out here so it can be seen
top300
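A tiny demonstration of why ties can leave us short of 300: rank()'s default ties.method = "average" gives tied values an averaged rank, so a strict cutoff can exclude every member of a tie:

```r
# why ties can leave fewer than the cutoff: tied values share an averaged rank
x <- c(10, 9, 9, 8)
rank(-x)            # 1.0 2.5 2.5 4.0
sum(rank(-x) <= 2)  # only 1 value makes the "top 2" cut
```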

There are `r nrow(top300)` rows in that data frame, but the total number of unique SNPs is smaller: `r length(unique(top300$pos))`. Methinks that would be a wonderful set to use if we were able to economically use baits and NGS to get the genotypes. That might be possible some day.

Whittling these down by assayability

For this, we are going to take top300 and just filter it by whether or not each SNP is assayable. We define a SNP as assayable if it has no other SNP with MAF > 0.02 within 20 bp of it (with that variation taken from our less-filtered data set, in which a SNP is called if it appears in at least 50% of the individuals).
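To make that rule concrete, here is a little illustration with made-up positions on a single scaffold. The real work is done by compile_assayable() below; this just shows the idea:

```r
# hedged illustration of the assayability rule (made-up positions;
# compile_assayable() does the real work on the full data set)
pos   <- c(100, 112, 180, 195, 400)        # variant positions on one scaffold
maf   <- c(0.30, 0.05, 0.25, 0.01, 0.40)   # minor allele frequencies
worry <- maf > 0.02                        # variation we must keep out of the flanks
flank <- 20

assayable <- sapply(seq_along(pos), function(i) {
  others <- which(worry & seq_along(pos) != i)
  all(abs(pos[others] - pos[i]) > flank)   # no worrisome variant within 20 bp
})
data.frame(pos, maf, assayable)
```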

First, get that "less filtered" data. This is from the 012 file with 350K SNPs:

wifl <- read_012(prefix = "~/Documents/UnsyncedData/WIFL_10-15-16/wifl-10-15-16", gz = TRUE)

And then we assess the assayability of all of these. The first thing we are going to do is just filter on MAF > 0.02 to get all the SNPs (all the variation, really, since we are going to ignore the chance of indels) that we might want to worry about as far as whether they fall in the flanking regions. While we are at this, we are going to write all those positions to a file so we can pull them out of the VCF file later to get what the REF and ALT alleles are.

# keep track of which of the SNPs we are going to "worry about"
# as far as whether they might be in the flanking region
snps_to_care_about <- compile_assayable(wifl, minMAF = 0.02, flank = 20)

# While we are at it, let's write out a tab-delimited file of these positions
# so we can easily pull them out of the VCF file later.
snps_to_care_about %>%
  select(snp) %>%
  tidyr::separate(snp, into = c("CHROM", "POS"), sep = "--") %>%
  write.table(., quote = FALSE, row.names = FALSE, col.names = FALSE, file = "snps-to-care-about.txt", sep = "\t")

Then keep track of which ones are assayable.

cassay <- snps_to_care_about %>%
  filter(assayable == TRUE)  # we will keep only the assayable ones (possibly on edges)
cassay

So, now we can filter top300 by whether they are assayable or not, and then join them:

top_assayable <- top300 %>%
  rename(snp = pos) %>%
  filter(snp %in% cassay$snp) %>%
  left_join(., cassay)

That gives us `r length(unique(top_assayable$snp))` unique SNPs to play around with. If we wanted to filter it even harder, we could throw out ones that were on the edges (i.e., no SNP within 1000 bp of it).

top_off_edge <- top_assayable %>%
  filter(on_left_edge == FALSE, on_right_edge == FALSE)

which gives us `r length(unique(top_off_edge$snp))` unique SNPs.

To give us a sense for what sorts of things we are looking at here, let's look at the 40 best from each of the comparisons. We also want to record how many of those top-40 lists each SNP appears in.

tmp <- top_off_edge %>%
  select(type_of_comparison, snp, leftpad, rightpad, comparison, freq.x, freq.y, ntot.x, ntot.y, cap, rank_full) %>%
  group_by(type_of_comparison, comparison) %>%
  top_n(40, -rank_full) %>%
  mutate(idx = 1:n()) %>%  # this and the next two lines are just so that we get exactly
  filter(idx <= 40) %>%    # 40---the ties usually occur when the sample size is small and so
  select(-idx) %>%         # it is OK to pitch them.
  ungroup()

num_occur <- tmp %>%
  group_by(snp) %>%
  summarise(num_comps = n())

top40 <- left_join(tmp, num_occur)

Just for fun---what does LFMM say about these?

Before we print that out, I want to join the LFMM scores onto it:

lfmm <- readr::read_delim("~/Documents/UnsyncedData/WIFL/WIFL_LFMMresults.txt", delim = "\t") %>%
  tidyr::gather(key = "variable", value = "fdr_p", -scaffold, -pos) %>%
  tidyr::unite(snp, scaffold, pos, sep = "--")

# and now filter it to the smallest p-value for each snp:
lfmm_mins <- lfmm %>%
  group_by(snp) %>%
  filter(fdr_p == min(fdr_p))

# and now we join that on there
top40_lfmm <- top40 %>%
  left_join(., lfmm_mins)

# then print it
top40_lfmm

It is possible to flip through that and pretty quickly see what is going on. There are `r length(unique(top40$snp))` unique SNPs there, and I think we can just look through them and pretty easily whittle that down to 192, or as many as we think we might want to develop into assays.

Just for fun, let's look at the distribution of all the min LFMM p-values in these SNPs compared to all the rest:

# first get a data frame of just one occurrence per SNP
one_each <- top40_lfmm %>%
  group_by(snp) %>%
  mutate(tmp = 1:n()) %>%
  filter(tmp == 1)

# then plot
ggplot() +
  geom_density(data = lfmm_mins, mapping = aes(x = fdr_p), alpha = 0.5, fill = "blue") +
  geom_density(data = one_each, mapping = aes(x = fdr_p), alpha = 0.5, fill = "orange")

Orange are the selected SNPs and blue are all SNPs. We see that there is a difference in the distributions, but that might just be because we have chosen SNPs that do not have very low minor allele frequencies.

Processing the "Climate SNPs"

While we are at it, we are going to include about 25 climate-associated SNPs, too. Kristen gave me this:

Here are the climatic variables with the highest loadings:

Bio11 = mean temperature of coldest quarter
Bio10 = mean temperature of warmest quarter
Bio5 = max temperature of warmest month
Bio9 = mean temperature of driest quarter
Bio17 = precipitation of driest quarter

We will take the top 5 most significant loci associated with each of these variables.

So, from the assayable SNPs, we are going to pull those out and include them as the first set, and then add more, non-duplicated ones.

So, since the most significant assayable locus is often shared between these climate variables, we end up taking the 8 most significant from each, which gives us 24 unique SNPs that we will call the "climate" SNPs.

climate_vars <- c("BIO_11", "BIO_10", "BIO_5", "BIO_9", "BIO_17")
climate_snps <- lfmm %>%
  filter(variable %in% climate_vars) %>%
  left_join(cassay %>% filter(on_left_edge == FALSE, on_right_edge == FALSE), .) %>%   # limit it to just the assayable ones THAT ARE NOT ON EDGES
  group_by(variable) %>%
  top_n(n = 8, wt = -fdr_p) %>%
  group_by(snp) %>%
  arrange(fdr_p) %>%
  mutate(comparison = paste(variable, sprintf("%.2e ", fdr_p), collapse = ", "),
         type_of_comparison = "climate") %>%
  select(-variable, -fdr_p) %>%
  filter(!duplicated(snp)) %>%
  ungroup() %>%
  select(-(num_gene_copies:rightpos), -(assayable:on_left_edge))

climate_snps

Getting >192 candidate SNPs from the Climate and Divergence SNPs

Selecting the 192 targets

"Divergence" SNPs

Kristen made up a data frame showing how many SNPs to take from the top of each pairwise comparison. Some pairwise comparisons were deemed unimportant and others had low sample sizes, so we don't take many from them, etc. That data frame looks like:

snp_pair_wts <- read_delim("inputs/wifl_snp_comparison_counts.txt", delim = "\t")
snp_pair_wts

Picking out SNPs

So, we can pick out an (expanded) number of SNPs from each comparison, as given above; then we add the climate SNPs on top of those, filter out any "divergence" SNPs that are duplicates of climate SNPs (or of other divergence SNPs), and see how many we end up with after eliminating duplicates and things that are too close together.

expand_it <- 1.8  # this is the factor by which to increase the number of SNPs from each comparison

div_candidates <- top40_lfmm %>%
  left_join(snp_pair_wts) %>%
  group_by(comparison) %>%
  arrange(desc(cap)) %>%
  mutate(idx = 1:n()) %>%
  filter(idx <= (number_markers) * expand_it) %>%  # keep only a certain number from each comparison
  ungroup() 

# now, add the climate SNPs in there, filter out the duplicates, and then mark those
# that are too close to their neighbors and count up about how many we have
candidate_snps <- bind_rows(climate_snps, div_candidates) %>%
  filter(!duplicated(snp))  %>%                      # filter out the duplicates
  tidyr::separate(snp, into = c("CHROM", "POS"), sep = "--", convert = TRUE, remove = FALSE) %>%  # get CHROM and POS back
  arrange(CHROM, POS) %>%
  group_by(CHROM) %>%
  mutate(dist = POS - lag(POS, 1),
         could_be_too_close = ifelse(!is.na(dist) & dist <= 10^4, TRUE, FALSE),       # flag them if closer than 10 KB
         toss_it = FALSE)  # we add the toss_it column since we will modify that by hand to do final thinning

# now return the approximate number after thinning
sum(!candidate_snps$could_be_too_close)

So, that has given us, potentially, 210 candidates. That is more than 192, but some of them might not be designable because of their GC content. Now we need to decide which ones to toss when they are too close together on the genome. There are only a few of these, and it will be good to get eyes on the data anyway. So, at this point we write out the file, and then Kristen chooses which ones to toss by putting TRUE in the toss_it column and saving the result, in exactly the same format, in a file named candidate_snps_tossed.csv.

write_csv(candidate_snps, path = "candidate_snps_untossed.csv")

Fetching Allelic types and Sequences to prepare the assay order

We can get our list of final candidates like this:

final_candidates <- read_csv("candidate_snps_tossed.csv") %>%
  filter(toss_it == FALSE)

Getting allelic types

Now, from those candidates, we want to make a tab-delimited file with the name of the SNP we are focusing on, along with a string that can be used (with tabix) to pick out any variation within 250 bp of that SNP.

We will make the LeftStartingPoint 250 bp to the left of POS and the RightEndPoint 250 bp to the right.

regions <- final_candidates %>% 
  mutate(left = POS - 250,
         right = POS + 250) %>%
  mutate(extractor = paste(CHROM, ":", left, "-", right, sep = ""))  %>%
  select(snp, extractor)

regions

write.table(regions, row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\t", file = "regions-for-candidates.txt")

For the final stage, we need to go back to the original VCF so that we know what the REF and ALT alleles are and can put together the assay order. We are going to continue to ignore the possibility of indels, which means we just have to pull the SNPs that we care about out of the VCF file. We don't need any of the individual genotype data for this, so I will just ask vcftools to retain info on one individual, NULL, that is not even in the file.

Doing this part on hoffman:

```{sh, eval=FALSE}
[kruegg@n2238 WIFL_10.15.16]$ pwd
/u/home/k/kruegg/nobackup-klohmuel/WIFL/WIFL_10.15.16
[kruegg@n2238 WIFL_10.15.16]$ module load vcftools
[kruegg@n2238 WIFL_10.15.16]$ module load bcftools

# filter it all down....
[kruegg@n2238 WIFL_10.15.16]$ vcftools --vcf WIFL_10.15.16.recode.vcf --out design-variants --recode --positions snps-to-care-about.txt --indv NULL

VCFtools - 0.1.14
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf WIFL_10.15.16.recode.vcf
	--out design-variants
	--positions snps-to-care-about.txt
	--recode
	--indv NULL

Keeping individuals in 'keep' list
After filtering, kept 0 out of 219 Individuals
Outputting VCF file...

[kruegg@n2238 WIFL_10.15.16]$ bgzip design-variants.recode.vcf
```

Now I can bring that file back to my laptop and tabix it:
```{sh, eval=FALSE}
2016-12-08 07:59 /WIFL/--% pwd
/Users/eriq/Documents/UnsyncedData/WIFL
2016-12-08 07:59 /WIFL/--% scp kruegg@hoffman2.idre.ucla.edu:/u/nobackup/klohmuel/kruegg/WIFL/WIFL_10.15.16/design-variants.recode.vcf.gz  ./ 

2016-12-08 08:00 /WIFL/--% tabix design-variants.recode.vcf.gz
```

Now, we can retrieve variants overlapping our candidate regions pretty easily, but we need to keep each such variant associated with the actual SNP, so there is some nasty shell programming here:

```{sh, eval=FALSE}
2016-12-09 06:12 /tutorials/--% (master) while read line; do array=($line); tabix ~/Documents/UnsyncedData/WIFL/design-variants.recode.vcf.gz "${array[1]}" | awk -v snp="${array[0]}" '{printf("%s\t%s\n", snp, $0);}'; done < regions-for-candidates.txt > relevant_variation.txt
2016-12-09 06:12 /tutorials/--% (master) pwd
/Users/eriq/Documents/git-repos/genoscapeRtools/tutorials
```

At any rate, the result of that looks like this:
```{sh, eval=FALSE}
scaffold1006|size252274--136397 scaffold1006|size252274 136158  .   C   T   35032.8 .   .   GT:AD:DP:GQ:PL
scaffold1006|size252274--136397 scaffold1006|size252274 136276  .   T   C   6472.29 .   .   GT:AD:DP:GQ:PL
scaffold1006|size252274--136397 scaffold1006|size252274 136285  .   C   T   23582   .   .   GT:AD:DP:GQ:PL
scaffold1006|size252274--136397 scaffold1006|size252274 136397  .   C   A   49764   .   .   GT:AD:DP:GQ:PL
scaffold1006|size252274--136397 scaffold1006|size252274 136418  .   A   G   158703  .   .   GT:AD:DP:GQ:PL
scaffold1006|size252274--136397 scaffold1006|size252274 136426  .   T   C   96229.6 .   .   GT:AD:DP:GQ:PL
scaffold1006|size252274--136397 scaffold1006|size252274 136580  .   G   A   20303.2 .   .   GT:AD:DP:GQ:PL
scaffold1006|size252274--136397 scaffold1006|size252274 136604  .   T   G   3768.48 .   .   GT:AD:DP:GQ:PL
scaffold101|size1699134--364887 scaffold101|size1699134 364752  .   T   C   64148.6 .   .   GT:AD:DP:GQ:PL
scaffold101|size1699134--364887 scaffold101|size1699134 364770  .   G   A   5528.3  .   .   GT:AD:DP:GQ:PL
```

And now we can read that into something that will be appropriate for snps2assays:

variation <- read_delim("relevant_variation.txt", delim = "\t", col_names = FALSE) %>%
  select(-X2, -X4, -(X7:X10)) %>%
  setNames(c("CHROM", "truePOS", "REF", "ALT")) %>%
  mutate(snppos = as.integer(str_replace(CHROM, "^.*--", ""))) %>%
  mutate(POS = (truePOS - snppos) + 251) %>%
  select(CHROM, POS, REF, ALT, truePOS)
variation

Note that we actually use the name of the SNP as if it were the "CHROM", because we will pull sequence out around just that spot. And POS is the position of each variant within that fragment (the focal SNP itself lands at position 251, because we put 250 bp of flanking region on each side).
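For example, taking the focal SNP scaffold1006|size252274--136397 and the flanking variant at true position 136418 from the excerpt above:

```r
# quick check of the coordinate shift for one flanking variant
snppos  <- 136397          # scaffold position of the focal SNP
truePOS <- 136418          # scaffold position of a nearby variant
(truePOS - snppos) + 251   # its POS within the 501 bp fragment: 272
```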

Getting sequences

We are close now. We just need to retrieve the consensus sequences 250 bp on either side of each SNP we want. This is a job for samtools faidx. I have indexed the fasta of the genome and now we just use an ugly bash script, much like with tabix, to get the consensus sequence and store it in a file along with the SNP name:

```{sh, eval=FALSE}
2016-12-09 06:35 /tutorials/--% (master) while read line; do array=($line); samtools faidx ~/Documents/UnsyncedData/WIFL_genome/WIFL.fa "${array[1]}" | awk -v snp="${array[0]}" '/^>/ {printf("%s\t", snp); next} {printf("%s", $1)} END {printf("\n")}'; done < regions-for-candidates.txt > consensus_sequences_of_snps.txt
2016-12-09 06:35 /tutorials/--% (master) pwd
/Users/eriq/Documents/git-repos/genoscapeRtools/tutorials
```

Then we can read those into a data frame:
```r
consSeq <- read_delim("consensus_sequences_of_snps.txt", delim = "\t", col_names = FALSE) %>%
  setNames(c("CHROM", "Seq"))
consSeq
```

Data frame of targets

These are just the SNPs that we want, with the CHROM and POS (where, again, CHROM is the name of the SNP).

targets <- final_candidates %>%
  select(snp, POS) %>%
  rename(CHROM = snp) %>%
  mutate(truePOS = POS,
         snppos = as.integer(str_replace(CHROM, "^.*--", ""))) %>%
  mutate(POS = (truePOS - snppos) + 251) %>%
  select(CHROM, POS)

targets

Running snps2assays

The moment we have all been waiting for---we want to see if this is going to work or not...

library(snps2assays)

yay_assays <- assayize(V = variation, targets = targets, consSeq = consSeq, allVar = FALSE)

write_csv(yay_assays, path = "yay_assays.csv")
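We noted earlier that some candidates might not be designable because of GC content. I have not reproduced snps2assays' internals here, but for intuition, a hedged sketch of the GC-fraction computation that presumably feeds such a check:

```r
# hedged sketch: GC fraction of a template sequence (the exact criteria
# and thresholds are snps2assays' business; this just shows the computation)
gc_fraction <- function(seq) {
  bases <- strsplit(toupper(seq), "")[[1]]
  mean(bases %in% c("G", "C"))
}
gc_fraction("ACGTGGCCAT")  # 0.6
```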

Then, Kristen wanted me to join back onto that a column or two that told us about why any particular SNP was chosen (i.e. major or minor comparison, and its rank within those). So, I should be able to do that with a little left join.

yay_assays_with_ranks <-  candidate_snps %>%
  ungroup() %>%
  select(-(CHROM:rightpad)) %>%
  rename(CHROM = snp) %>%
  left_join(yay_assays, .) 

write_csv(yay_assays_with_ranks, path = "yay_assays_with_ranks.csv")

Once those are done, they can be spot-checked using samtools tview with a nifty set of commands like this on hoffman:

```{sh, eval=FALSE}
REGION="scaffold1006|size252274:136147-136647"; \
SNP="scaffold1006|size252274:136397"; \
samtools view -q 42 -b -o region.bam /u/nobackup/klohmuel/rbay/WIFL/SNP/WIFL_merged.bam "$REGION"; \
samtools index region.bam; \
samtools tview -p "$SNP" region.bam /u/nobackup/klohmuel/rbay/References/WIFL/WIFL.fa
```

Though that is a huge hassle. Instead, I used samtools view to pull the 500 bp around each SNP out of the big merged bamfile (filtering on MAPQ > 40). I brought that to my laptop, sorted and indexed it, and then viewed the regions in IGV. I mostly wanted to verify that I had the SNP in the right spot. I spot-checked a good handful and they were all correct.


