knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "plot-freqs-figs/" )
This is just documenting some explorations I am making to see how we might want to try plotting allele freqs between different groups to get a sense for what sorts of variation we are going to be seeing. I am thinking that hexbin plots of the number of loci with SNP allele freqs would be good.
But we want to do it between all different pairs of populations. And we probably would be well-served by heavily filtering for overall minor allele frequency. Let's face it, we are interested in finding SNPs that very different in allele freq, so we may as well toss the rare ones, cuz they won't be too different between pops anyway.
Load some libraries:
library(dplyr) library(ggplot2) library(GGally) library(genoscapeRtools) library(readr) library(stringr)
Get the data that we want to work with. We will try the 105K SNPs.
wifl_clean <- read_012("../../cleaned_indv175_pos105000")
That is a big matrix, which is a fairly good way of keeping it for the moment.
Let's get the group designations of the individuals:
grps <- read_delim("../data/WIFL_Groups_forAlleleFreqCalc.txt", delim = "\t", col_names = FALSE) %>% setNames(c("ID", "location", "state", "supspp", "alle_grp"))
It appears that this will take forever in tidyr, so let's split the matrices up and do simple colSums.
keep_birds <- grps$ID[!is.na(grps$alle_grp)] keep_bird_grps <- grps$alle_grp[!is.na(grps$alle_grp)] split_birds <- split(keep_birds, keep_bird_grps) # now, let's just focus on groups with >=10 individuals split_birds2 <- split_birds #[sapply(split_birds, length) >= 10] wifl_clean[wifl_clean == -1] <- NA # first, dump anything that is low frequency wifl_thin <- wifl_clean[, colSums(wifl_clean, na.rm = TRUE) > 10] # now, lapply to get matrices for each group freqs <- lapply(split_birds2, function(x) { y <- wifl_thin[x,] frq <- colSums(y, na.rm = TRUE) / (2 * colSums(!is.na(y))) data_frame(pos = names(frq), freq = frq) }) %>% bind_rows(., .id = "grp") # and finally we make one column for each group frqmat <- tidyr::spread(data = freqs, key = grp, value = freq) names(frqmat)[-1] <- paste0("grp_", names(frqmat)[-1])
Now, we are in a place where we might be able to ggpairs these guys. Got some ideas from here
df <- frqmat[-1] p <- ggpairs(df, lower="blank") seq <- 1:ncol(df) for (x in seq) { for (y in seq) { if (y>x) { p <- putPlot(p, ggplot(df, aes_string(x=names(df)[x],y=names(df)[y])) + stat_binhex(bins=25) + scale_fill_gradientn(colours = rev(rainbow(9)), breaks = c(1, 2, 4, 32, 128, 1000, 10000, 100000), trans = "log10"), y,x) } } } p
This would go like this:
paddy <- data_frame(pos = colnames(wifl_clean)) %>% tidyr::separate(pos, into = c("scaff", "bp"), sep = "--", convert = TRUE) %>% mutate(length = as.numeric(str_replace(scaff, "scaff.*size", ""))) %>% group_by(scaff) %>% mutate(leftpos = c(0,bp)[-(n() + 1)], rightpos = c(bp, length[1])[-1]) %>% mutate(leftpad = bp - leftpos, rightpad = rightpos - bp) # see what that looks like paddy
Now we can filter on how many flanking basepairs there are free of variation.
paddy %>% filter(leftpad > 20 & rightpad > 20) %>% filter(leftpad < 1000 & rightpad < 1000)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.