knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Assume you have a Plink files and decided to impute the missing genotypes with fastPHASE 1.4.8 tool. The number of haplotype clusters (K) is required as an argument. One option is to leave the default value of 15. For the most cases it will probably work. But what if you have several populations in one file? Say, you search for the fingerprints of selection with hapFLK tool. Both hapFLK and fastPHASE are based on the haplotype cluster model.
How many clusters to choose for 4, 6, 8 or even more populations in a pool? My tests of datasets from 1000 Genomes project revealed that K influences the results of hapFLK output. If fastPHASE had an option to estimate the error of imputation, the life would be easier. We could apply the masked data analysis.
The idea of the analysis is as follow. Hide some genotypes, impute them, then count the error of imputation. Repeat the trick several times with different K and choose that one that minimizes the error of imputation.
Below is the implementation of the masked data analysis on the base of imputeqc R package. Let's list 4 main steps and then consider them a little bit closer.
Convert Plink files to fastPHASE *.inp files, using Plink tool.
Generate a few test files with make_test_files.R which is enclosed to the package. I think five test files each having 10% of masked genotypes are enough to start with.
Impute the missing genotypes in each test file with fastPHASE. Apply different K for each set of files.
Estimate the imputation quality with EstimateQuality() function and chose the K that minimizes the imputation error.
Say, you have chr1.{ped, map} files in the current directory. They contain the genotypes of individuals from one or several populations.
Check that alleles are coded as letters or numbers. The missing ones should be ?. If your data look differently, send me a chunk of them. I'll figure out what to do next.
To convert Plink files to fastPHASE, run from command line
plink --file chr1 --recode fastphase-1chr --out chr1
It will create chr1.recode.phase.inp ready for further manipulations.
install.packages("devtools") devtools::install_github("inzilico/imputeqc", build_vignettes = TRUE)
If you already have devtools package installed, skip the first command.
system.file("extdata", package="imputeqc")
Save the returned path in the environment variable in BASH:
$ export IMPQC="/path/to/imputeqc/extdata"
Let chr1.inp to contain the genotypes of a population from the chromosome one.
The following command will result in 5 test files with 10% percent of genotypes missed.
Rscript $IMPQC/make_test_files.R chr1.inp
The default parameters (-n 5 -p 0.1 -o test/test) are applied. Five test files named test.m{1:5}.inp are saved in /test subdirectory created by the script.
The following command will produce 3 test files with 5% of genotypes masked. The test files chr1.m{1:3}.inp are saved in /masked subdirectory.
Rscript $IMPQC/make_test_files.R -n 3 -p 0.05 -o masked/chr1 chr1.inp
If there are missing genotypes before running the script, the actual proportion of missing genotypes will be higher, since mask adds missing genotypes to those that exist in original data set. You will see the total proportion of missing genotypes on screen.
According to fastPHASE manual, we can adjust the following option arguments:
There are some flags I advice to apply:
I recommend to save the results of imputation in different folders for different K (/k10, /k15, etc. )
An example of command for imputation with K = 10:
for i in $(seq 1 5); do fastPHASE -T10 -C25 -K10 -H-1 -n -Z -ok10/chr1.m$i masked/chr1.m$i.inp; done
We assume 5 test files (chr1.m{1:5}.inp) being in folder /masked.
The code will produce 5 chr1.m{1:5}_genotypes.out files, where chr1 is your identifier, m{1:5} is added by the above command, and _genotypes.out is given by fastPHASE. The imputed data sets are saved in /k10 subdirectory.
The above code helps us to agree input/output from different stages of workflow.
Be aware this step takes a lot of computational time depending on the number of populations under consideration, the number markers, and, of course, the K. To speed up it, I use a GNU parallel tool. It allows to launch several processes in parallel on different CPUs.
Upon accomplishing, we can estimate the quality of imputation.
There are several metrics to estimate the imputation quality of genotypes (Chan et al, 2016). So far I applied the proportion of wrongly imputed genotypes. To compute it, use EstimateQuality() function. It returns a data frame with two columns: "discrepancy" and "id". The first one contains the proportion of genotypes that are not coincided with the original ones. The genotype is considered to be imputed wrongly if one or two alleles are not coincided.
The function returns the values for one set of test files. To distinguish different sets, use an optional argument id. It can be considered as an identification of different imputations.
# Count errors for one set of test files errors <- EstimateQuality(origin = "chr1.inp", mask = "masks.RDS", imputed = imputed, id = 15)
Here we assume that chr1.inp has original genotypes, masks.RDS contains the list of all masks (matrices) generated above by make_test_files.R, and imputed is a vector with the full paths to *_genotypes.out produced by fastPHASE. The order of files in imputed is the same as the order of masks applied upstream. All test files in a set were treated with fastPHASE applying K = 15.
imputeqc R package github
Scheet P, Stephens M. A Fast and Flexible Statistical Model for Large-Scale Population Genotype Data: Applications to Inferring Missing Genotypes and Haplotypic Phase. American Journal of Human Genetics. 2006;78(4):629-644. PubMed
fastPHASE 1.4.8. link
fastPHASE 1.4 manual. download
hapFLK software. link
hapFLK tutorial. link
Plink tool. link
Chan AW, Hamblin MT, Jannink J-L. Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data. Feltus FA, ed. PLoS ONE. 2016;11(8):e0160733. PubMed
GNU parallel link
Gennady Khvorykh. inzilico/imputeqc v1.0.0 (2018). GitHub repository, https://github.com/inzilico/imputeqc.
Gennady Khvorykh, a bioinformatician, inzilico.com
Suggestions, questions, and comments are open! Feel free to drop me the message.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.