knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(EFGLmh) library(tidyverse) options(tibble.max_extra_cols = 10)
This will walk through the functions in the EFGLmh
package. This package was written as a replacement for IDFGEN mainly motivated by the need to work with microhaps. It has been written to function for any codominant diploid marker, but is written and tested with SNPs and microhaps (SNPs are really just a subcategory of microhaps) in mind.
If you want to see all the options for a function, the manual has this information in a quicker-to-find format than this vignette.
One of the main differences between IDFGEN and EFGLmh
is that IDFGEN keeps data in a separate environment, as if the data is somewhere in the ether until using an IDFGEN function to access it. This (mostly) prevents users from accidentally modifying things, but it also causes some bugginess, makes it hard for users to purposefully modify objects, and can cause some issues when loading in data from multiple data files. EFGLmh
instead holds data as objects of a new class, called EFGLdata
. As such, your dataset will have a variable name associated with it.
Now, let's walk through the functions in the package.
First, install if needed, and load the package.
# install tidyverse if you haven't already install.packages("tidyverse") # install the package if needed devtools::install_github("delomast/EFGLmh") # and load the package library(EFGLmh) # and load the tidyverse for the examples here library(tidyverse) options(tibble.max_extra_cols = 10) # one of my preferred options for tidyverse
Now, let's first load in our data. We can either read our data into as a dataframe, matrix, or tibble, perhaps do some modifications, and then hand it over to EFGLmh
, or we can load it directly from a tab separated file into EFGLmh
.
If we read it in separately, the dataframe, matrix, or tibble (below, variable exampleData
) must have a column of population (pedigree) names, a column of unique individual names, optional metadata columns, and then genotype columns in a two column per call format. The pedigree and individual names can be anywhere, if their locations are specified. The genotype columns must be consecutive and on the right hand side. The start of the genotype columns can be specified, but if not, it will look for the first column with a column name ending in ".A1", ".a1", "-A1", or "-a1". Locus names are pulled from the first column for each locus.
# example of loading from file outside of EFGLmh exampleData <- readr::read_tsv("example_snp_mh.txt", guess_max = 1e4)
print(exampleData) # example of modifying in r, here we are just selecting a few populations t1 <- exampleData %>% filter(Pedigree %in% c("OmyOXBO19S", "OmyEFSW19S")) # or, without dplyr: # t1 <- t[t$Pedigree %in% c("OmyOXBO19S", "OmyEFSW19S"),] # and now we pass the data to EFGLmh data1 <- readInData(t1)
If we want to read data directly from a file, we just specify the file name as the first argument. The file must have the same structure as described above: a column of population (pedigree) names, a column of unique individual names, optional metadata columns, and then genotype columns in a two column per call format. The pedigree and individual names can be anywhere, if their locations are specified. The genotype columns must be consecutive and on the right hand side. The start of the genotype columns can be specified, but if not, it will look for the first column with a column name ending in ".A1", ".a1", "-A1", or "-a1". Locus names are pulled from the first column for each locus. This has been chosen to work directly with Progeny outputs.
data_direct_from_file <- readInData("example_snp_mh.txt") # metadata columns withs lots of blanks can give parsing errors. # this can usually be solved with a larger number for the guess_max argument: # data_direct_from_file <- readInData("example_snp_mh.txt", guess_max = 1e5)
There are many optional arguments for readInData
:
genotypeStart
: you can specify the first genotype column if you don't want it to be auto-detected.pedigreeColumn
: The column number that contains population (pedigree) names. Default is 1
nameColumn
: The column number that contains individual names. Default is 2
convertNames
: TRUE
to convert genotype and pedigree names in the same way that IDFGEN does (remove special characters from both and remove "." from genotype names). Default is TRUE
convertMetaDataNames
: TRUE
to remove special characters and spaces from metadata column names. This makes accessing them easier. Default is TRUE
missingAlleles
: a vector of values to treat as missing alleles. Default is c("0", "00", "000")
guess_max
: when reading from a file, this is passed to read_tsv
. Parsing errors due to column data types can sometimes be fixed by increasing this. Default is 1e4The readInData
function creates object of class EFGLdata
. When we print
them, we just see a list of the population names and the number of loci.
data1
The underlying structure of an EFGLdata
object is just a list of two tibbles. The first entry is named "genotypes" and contains... metadata! No, it contains population names, individual names, and genotypes in a two column per call format (missing genotype is NA
). The second entry is named "metadata" and contains population names, individual names, and metadata.
names(data1) data1$genotypes data1$metadata
This (hopefully) makes it simple for you to pull out or modify data manually if there is not a specially written function in the package addressing what you want to accomplish. Simply refer to data1$genotypes
or data1$metadata
and treat it as you would treat any dataframe or tibble. One thing to remember if you modify things manually: the genotypes and metadata tibbles in an EFGLdata
object should have the same individuals in the same order. You can check that you haven't messed things up by using the construct_EFGLdata
function (performs some basic checks).
# do some modification on data1 # ... # and then run some error checking data1 <- construct_EFGLdata(data1)
There are some common data summaries you may want from your data that EFGLmh has special functions to address. We'll walk through them here.
Get the names of all populations in an EFGLdata object
pop_names <- getPops(data1) pop_names
Get the names of all individuals, or all individuals in a subset of populations
# all inds in data1 all_inds <- getInds(data1) # just look at first 20 all_inds[1:20] # only inds in one pop subset_inds <- getInds(data1, pops = c("OmyOXBO19S")) # looking at first 20 subset_inds[1:20]
Get the locus names
loci <- getLoci(data1) # looking at first 20 loci[1:20]
Get the metadata column names
meta_names <- getMeta(data1) meta_names
Get the number of individuals in all, or a subset of populations
countInds <- numInds(data1) countInds countInds_1 <- numInds(data1, pops = c("OmyOXBO19S")) countInds_1
dumpTable from IDFGEN is also included, as it is commonly used
dumpTable(geno_success, "genotyping_success.txt")
Calculate allelic richness. I recommend doing this if you load in a dataset of just SNPs to make sure every locus has $\leq 2$ alleles, as expected.
allele_rich <- aRich(data1) # this returns a tibble allele_rich # if you want to see counts of loci by allelic richness allele_rich %>% count(aRich) # same thing, without tidyverse table(allele_rich$aRich)
Genotyping success of loci, as a proportion
loci_success <- lociSuccess(data1) loci_success
Genotyping success of individuals (more about removing failed individuals later)
geno_success <- genoSuccess(data1) # this returns a tibble with both success as a proportion, and as the number of missing genotypes geno_success # or with just a subset of loci subsetLoci <- getLoci(data1) subsetLoci <- subsetLoci[!grepl("SEX", subsetLoci)] geno_success_noSDY <- genoSuccess(data1, loci = subsetLoci) geno_success_noSDY
Calculate allele frequency. Note that only alleles present in a population are shown for that population. As such, if a locus is all missing genotypes in a population, it is omitted for that population.
allele_freq <- calcAF(data1) allele_freq
Calculate observed and expected heterozygosity within each population
heterozygosity <- calcHet(data1) heterozygosity
These functions handle some common operations we perform on our datasets.
Combining EFGLdata objects. Let's say we have two data files to read in (e.g. one mixture and one baseline, or multiple PBT baselines). We then want to combine them for filtering, analysis, export, etc.
# creating a second input datafile t2 <- exampleData %>% filter(!(Pedigree %in% c("OmyOXBO19S", "OmyEFSW19S"))) # and now creating a second EFGLdata object data2 <- readInData(t2) data2 data1
So we have two EFGLdata objects, one with 6 populations and one with 2. Now, we combine them:
all_data <- combineEFGLdata(data1, data2, genoComb = "intersect", metaComb = "intersect") all_data
We've now created a third EFGLdata object with all 8 populations. The arguments genoComb and metaComb tell the function how to combine loci and metadata if they are different. Options are "intersect" to only keep the loci or metadata that are in both, or "union" to keep the loci or metadata that are in either (missing loci/metadata will be given values of NA
). Note that you can combine an unlimited number of EFGLdata
objects in one command. For example, for four objects: all_data <- combineEFGLdata(data1, data2, data3, data4, genoComb = "intersect", metaComb = "intersect")
. Having many EFGLdata objects can use a lot of memory if you have large numbers of individuals/loci. So, if you are done with the original EFGLdata objects, you can remove them to free up some memory:
rm(data1) rm(data2)
Moving all individuals in a subset of populations to a different (new or existing) population.
# say we want to combine OmyDWOR19S and OmyEFSW19S, and call it "newPop" all_data <- movePops(all_data, pops = c("OmyDWOR19S", "OmyEFSW19S"), newName = "newPop") all_data
Moving a subset of individuals to a different (new or existing) population
numInds(all_data) toMove <- c("OmyWALL19S_0696", "OmyWALL19S_0697", "OmyWALL19S_0698") all_data <- moveInds(all_data, inds = toMove, newName = "specialInds") numInds(all_data)
Remove loci
toRemove <- c("OMS00079", "Omy_Omyclmk43896") all_data <- removeLoci(all_data, lociRemove = toRemove) all_data
Remove individuals
numInds(all_data) toRemove <- c("OmyWALL19S_0001", "OmyWALL19S_0002") all_data <- removeInds(all_data, inds = toRemove) numInds(all_data)
Remove populations
all_data all_data <- removePops(all_data, pops = c("OmyLSCR19S", "OmySAWT19S")) all_data
These functions export data in formats used by other packages and programs. They are listed here mainly so you can have a list of the export functions in one place and see an example. For a full explanation of the options within each of these functions, consult the manual. Most have options to subset loci and populations.
A rubias baseline
gsi_baseline <- exportRubias_baseline(all_data, pops = c("OmyOXBO19S", "newPop"), repunit = "Pop", collection = "Pop", loci = NULL)
A rubias mixture
gsi_mixture <- exportRubias_mixture(all_data, pops = c("OmyLYON19S", "specialInds"), collection = "Pop", loci = NULL)
A gRandma baseline or mixture
# baseline gma_baseline <- exportGrandma(all_data, pops = c("OmyOXBO19S", "newPop"), loci = NULL, baseline = TRUE) # mixture gma_mixture <- exportGrandma(all_data, pops = c("OmyLYON19S", "specialInds"), loci = NULL, baseline = FALSE)
In addition to exporting gRandma inputs, there is also a function to remove loci for gRandma inputs that either failed for all individuals, or have no variation.
# this creates a list with baseline and mixture cleanInput <- cleanGrandma(baseline = gma_baseline, mixture = gma_mixture) gma_baseline <- cleanInput$baseline gma_mixture <- cleanInput$mixture
A hierfstat input dataframe
hfstat_in <- exportHierFstat(all_data)
A CKMRsim allele frequency tibble. Note that loci with all missing genotypes will be removed. And all pops (or all pops specified with the pops
argument) are combined.
ckmr_af <- exportCKMRsimAF(all_data)
A long format tibble of genotypes. Meant for input into CKMRsim. Note that pop information is not included in the export.
ckmr_lg <- exportCKMRsimLG(all_data, pops = c("OmyOXBO19S"))
Write a GenePop file
exportGenePop(all_data, "genepop.txt", useIndNames = TRUE)
Write a GenAlEx file
exportGenAlEx(all_data, "genalexInput.txt")
Write a Structure file
exportStructure(all_data, "structureInput.txt")
Write a Colony file
# to write to file exportColony(all_data, filename = "ColonyInput.dat") # OR, if you want to modify a setting # to return as a character vector (each line an item), modify, and write out colonyInput <- exportColony(all_data, filename = NULL) colonyInput[1] <- "NewProjectName" # example of modification writeLines(colonyInput, "modifiedColonyInput.dat") # write it to a file
Write a "Progeny-export-style" file that can be later loaded back into EFGLmh with readInData()
. This file will have Pop, Ind, metadata, genotypes (2 column per call).
exportProgenyStyle(all_data, "progenyStyleFile.txt")
Write a SNPPIT input file (only biallelic markers used)
exportSNPPIT(all_data, "snppitInput.txt", baseline = c("OmyOXBO19S", "newPop"), mixture = c("OmyLYON19S", "specialInds"), errorRate = .005)
Write PLINK input files (only biallelic markers used)
exportPlink(all_data, "testPlink.ped", map = "testPlink.map")
First, we determine genotyping success. We're using all loci, but remember genoSuccess
can also use just a subset of loci if you input a vector of locus names.
geno_success <- genoSuccess(all_data) geno_success
Now we get a list of individual names to remove and then use the removeInds
function. We can filter by the proportion success or by the number of missing loci.
# identify any under 90% failedInds <- geno_success %>% filter(success < .9) %>% pull(Ind) # example of code to filter by number of missing loci # (remove any with more than 37 genotypes missing) # failedInds <- geno_success %>% filter(numFail < 37) %>% pull(Ind) # and remove sum(numInds(all_data)) all_data <- removeInds(all_data, inds = failedInds) sum(numInds(all_data))
And perhaps we want to save a list of the failed individuals and their populations
removeTable <- geno_success %>% filter(Ind %in% failedInds) dumpTable(removeTable, "failed_inds.txt") # or, to efficiently do it all in the tidyverse: # geno_success %>% filter(Ind %in% failedInds) %>% dumpTable("failed_inds.txt")
Just as we use GSI_sim to quickly identify duplicate samples, we can use rubias.
rubiasIn <- exportRubias_baseline(all_data, pops = c("OmyPAHH19S", "specialInds"), repunit = "Pop", collection = "Pop") # require 70% genotypes successful in both (but we've already filtered), and # requie 95% of genotypes to be the same library(rubias) dupTable <- close_matching_samples(rubiasIn, gen_start_col = 5, min_frac_non_miss = .7, min_frac_matching = .95) dupTable
So we've found two pairs of duplicates, now we want to keep the ones with more genotypes. We can use the genotyping success calculated earlier to choose which one to remove. There is a special function to identify the one with lower genotyping success.
toRemove <- whichLower(dupTable, geno_success) all_data <- removeInds(all_data, inds = toRemove)
And let's write a table of the duplicates
dupTable %>% select(1:6) %>% dumpTable("duplicates.txt")
A basic method of simulating F1 hybrids resulting from mating between two populations is available. Samples alleles based on allele frequencies in the two populations. If your data has a sex marker in it, you probably want to remove it prior to using this function.
# remove sex marker, then simulate 125 F1 hybrids data1_with_hyb <- all_data %>% removeLoci(lociRemove = "OmyY1_2SEXY") %>% createF1Hybrids(pop1 = "OmyWALL19S", pop2 = "OmyOXBO19S", newName = "fakeHybridsF1", n = 125, missingGenos = TRUE) data1_with_hyb
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.