In kfarleigh/HybridFindR: Testing for differential introgression

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Load HybridFindR and investigate data

# Load HybridFindR
library(HybridFindR)
library(data.table)
library(gghybrid)

# First, we load in our data
data("Anolis")
write.csv(Anolis, 'Data.csv', row.names = FALSE)

# Let's look at our data and see what we have going on
str(Anolis)

We see that the first two columns give us information about each sample and the remaining columns contain genotype data. HybridFindR expects the first column to hold the sample ID and the second to hold the population that each individual is assigned to. You need to know which labels correspond to the two source populations and to the hybrid population, because HybridFindR tests whether hybrids carry a greater proportion of S1 alleles than would be expected by chance. Here, P is the S0 population, H is the hybrid population, and K is the S1 population.
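Before going further, it can help to confirm that the population column contains exactly the labels you expect. The sketch below uses a small made-up data frame (`toy`, with hypothetical column names) that mirrors the layout described above; with the real data you would run the same check on the second column of Anolis.

```r
# Hypothetical toy data frame: sample ID, population label, then genotypes
toy <- data.frame(
  ID   = paste0("ind", 1:6),
  Pop  = c("P", "P", "H", "H", "K", "K"),
  SNP1 = c(0, 0, 1, 2, 2, 2),
  SNP2 = c(0, 1, 1, NA, 2, 2)
)

# Count the individuals carrying each population label (second column)
pop_counts <- table(toy[[2]])
print(pop_counts)

# Stop early if a source or hybrid label is missing from the data
expected_labels <- c("P", "H", "K")
stopifnot(all(expected_labels %in% names(pop_counts)))
```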

Prepare our data for analysis

Next, we prepare our data for analysis. We know that there are 2 columns before any genotype information (numprecol = 2), that missing data are coded as NA (missingval = 'NA'), that our data follow the onerow structure format (1 column per marker; onerow = 0), that we have 73 individuals in our dataset (numinds = 73), and which labels identify the source populations (S0.id = 'P' and S1.id = 'K').

Geno <- PrepareData('Data.csv', numprecol = 2, missingval = 'NA', onerow = 0, numinds = 73, S0.id = 'P', S1.id = 'K')
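Rather than typing values such as the number of individuals by hand, you can derive them from the data frame itself. The following is a minimal sketch on a made-up data frame; the names `toy`, `num_precol`, `num_inds`, `num_markers`, and `missing_rate` are illustrative and not part of HybridFindR. Note that write.csv writes missing values as the string NA by default, which is why missingval = 'NA' matches the file written earlier.

```r
# Hypothetical toy data frame with the same layout as the real data
toy <- data.frame(
  ID   = paste0("ind", 1:6),
  Pop  = c("P", "P", "H", "H", "K", "K"),
  SNP1 = c(0, 0, 1, 2, 2, 2),
  SNP2 = c(0, 1, 1, NA, 2, 2)
)

num_precol  <- 2                        # columns before the genotype data
num_inds    <- nrow(toy)                # one row per individual
num_markers <- ncol(toy) - num_precol   # remaining columns are markers

# Proportion of missing genotypes per marker
missing_rate <- colMeans(is.na(toy[, -seq_len(num_precol), drop = FALSE]))

print(c(individuals = num_inds, markers = num_markers))
print(missing_rate)
```

With the real data, nrow(Anolis) should agree with the value passed as numinds.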

Run differential introgression analysis

Now we get to look for SNPs that exhibit signals of differential introgression. We need the dataframe generated by the PrepareData function (Geno), the labels for each population (H.id = 'H', S0.id = 'P', S1.id = 'K'), the number of individuals (n.ind = 73), the number of permutations to run for significance testing (permutations = 10000), and the ploidy of our data (ploidy = 2). Remember, HybridFindR looks for a greater proportion of S1 alleles than would be expected due to random chance. So in this example we are looking for SNPs that have a greater proportion of K alleles in hybrids than we would expect by chance.

DI_test <- Differential_introgression(Geno, H.id = 'H', S0.id = 'P', S1.id = 'K', n.ind = 73, permutations = 10000, ploidy = 2)
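To build intuition for what a one-sided permutation test of this kind does, here is a minimal sketch on simulated data. It is not the algorithm used by Differential_introgression; the null model (shuffling each hybrid's genotypes across SNPs) and the names `geno` and `perm_pvalues` are assumptions made purely for illustration.

```r
set.seed(42)

# Hypothetical toy data: S1 allele counts (0, 1, or 2) for 20 diploid
# hybrids (rows) at 50 SNPs (columns)
n_hyb <- 20
n_snp <- 50
geno <- matrix(rbinom(n_hyb * n_snp, size = 2, prob = 0.5), nrow = n_hyb)
geno[, 1] <- 2  # make SNP 1 fixed for the S1 allele in every hybrid

perm_pvalues <- function(geno, n_perm = 999, ploidy = 2) {
  # Observed proportion of S1 alleles at each SNP
  obs <- colMeans(geno) / ploidy
  exceed <- numeric(ncol(geno))
  for (i in seq_len(n_perm)) {
    # Assumed null: shuffle each individual's genotypes across SNPs
    shuffled <- t(apply(geno, 1, sample))
    exceed <- exceed + (colMeans(shuffled) / ploidy >= obs)
  }
  # One-sided p-value with the usual +1 correction so it is never zero
  (exceed + 1) / (n_perm + 1)
}

raw_p <- perm_pvalues(geno)
head(raw_p)
```

In this sketch the smallest attainable p-value is 1 / (n_perm + 1), so the number of permutations limits how small a p-value can be and, in turn, what can survive a multiple-testing correction.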

Finally, we correct for multiple tests. Whether and how you correct is up to you, but it should certainly be considered, because every SNP is tested separately. Here, I show you how to adjust the p-values using both the Bonferroni correction and the Benjamini-Hochberg false discovery rate procedure.

# Bonferroni Method
DI_test$Bon <- p.adjust(DI_test$Raw.P, method = "bonferroni")
# Benjamini-Hochberg (false discovery rate)
DI_test$BH <- p.adjust(DI_test$Raw.P, method = "BH")

# Find candidate SNPs exhibiting signals of differential introgression
DI_candidates <- DI_test[which(DI_test$Bon < 0.05),]

# Look at the candidates
DI_candidates
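The two corrections can give quite different candidate lists. As a small illustration, the sketch below applies both to five made-up p-values (the vector `p` is hypothetical and not taken from DI_test$Raw.P).

```r
# Hypothetical raw p-values for five SNPs
p <- c(0.0004, 0.003, 0.012, 0.03, 0.6)

bon <- p.adjust(p, method = "bonferroni")  # controls the family-wise error rate
bh  <- p.adjust(p, method = "BH")          # controls the false discovery rate

print(data.frame(raw = p, bonferroni = bon, BH = bh))

# Number of SNPs that pass each correction at the 0.05 level
n_bon <- sum(bon < 0.05)
n_bh  <- sum(bh < 0.05)
print(c(bonferroni = n_bon, BH = n_bh))
```

Bonferroni is the more conservative of the two, so filtering on the BH column instead of Bon will generally retain at least as many candidates.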

Visualize candidate SNPs under differential introgression

It can also be useful to visualize the proportion of parental alleles at each candidate SNP (or at all of them). We will use the Plotprops function to do this. We need the names of the SNPs that we want to investigate. Here I only look at our candidate SNPs under differential introgression, but you could do this for your entire dataset if you wanted. We also need the dataframe generated by the PrepareData function (Geno), the ploidy of our dataset (2), and the colors that we want to use to visualize the S1 and S0 allele proportions.

# First we get the names of the SNPs that we want to investigate
Cand_SNPnames <- DI_candidates$SNP

# Then we make our plot
Plotprops(Data = Geno, SNP_names = Cand_SNPnames, ploidy = 2, S1_color = 'red', S0_color = 'blue')
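If you want to see what goes into a plot like this, the following sketch computes S1 and S0 allele proportions by hand and draws them with a base R stacked bar plot. It is only an illustration on made-up genotypes, not the code behind Plotprops; the data frame `geno` and its SNP names are hypothetical.

```r
# Hypothetical S1 allele counts (0, 1, or 2) for 10 diploid hybrids at 3 SNPs
geno <- data.frame(
  snp_a = c(2, 2, 2, 1, 2, 2, 2, 2, 1, 2),
  snp_b = c(1, 0, 1, 2, 1, 1, 0, 1, 2, 1),
  snp_c = c(0, 0, 1, 0, NA, 0, 1, 0, 0, 0)
)
ploidy <- 2

# S1 proportion = S1 alleles observed / total alleles genotyped (NAs skipped)
s1_prop <- colSums(geno, na.rm = TRUE) / (ploidy * colSums(!is.na(geno)))
props <- rbind(S1 = s1_prop, S0 = 1 - s1_prop)

# Stacked bars: one bar per SNP, S1 in red and S0 in blue
barplot(props, col = c("red", "blue"), ylab = "Allele proportion",
        legend.text = TRUE)
```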