BiocStyle::markdown()
knitr::opts_chunk$set(message=FALSE, error=FALSE, warning=FALSE, fig.width=10, fig.height=10)
options(width=80) 

About Gimpute

Background

Genome-wide association studies (GWAS) have become an essential tool in assisting the improvement of understanding of neuropathology of diseases. Moreover, the imputation of genotyping data with missing information can enhance the power of GWAS by using haplotype relationships between genetic variants [1,2]. In the meantime, the imputation can substantially reduce the cost of genotyping large number of samples. Furthermore, the comparison of significant findings across different cohorts and genotyping platforms is also achievable. Currently large amounts of genotyping data are produced at an unprecedented rate to be analyzed; however, the preparation of ready-to-use data sets needs to be of high quality. Thus the implementation of an efficient and reliable genetic data pre-processing and imputation pipeline is desirable.

Scope of the pipeline

In order to ensure the reliability and reproducibility of genome-wide association study (GWAS) data, we set up an efficient, automatic and comprehensive genotype data processing and imputation pipeline termed Gimpute. It consists of pre-processing (genetic variant information updating/matching/liftOver, quality control of genetic variants and study samples, the alignment of variants to the imputation references), pre-phasing, imputation and post-imputation quality control [3], as well as an extension to the Genipe framework in this vignette, which is easy-to-follow and user-friendly.

Getting started

Prerequisites

Gimpute runs on any 64-bit x86 Linux distribution and it requires the following tools [4,5,6,7]:

You can install the tools from the command line as the following examples:

## PLINK
wget https://www.cog-genomics.org/static/bin/plink180717/plink_linux_x86_64.zip
unzip plink_linux_x86_64.zip

## GCTA64
wget http://cnsgenomics.com/software/gcta/gcta_1.91.5beta.zip
unzip gcta_1.91.5beta.zip

## SHAPEIT
wget https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.v2.r904.glibcv2.17.linux.tar.gz
tar -zxvf shapeit.v2.r904.glibcv2.17.linux.tar.gz

## IMPUTE2
wget https://mathgen.stats.ox.ac.uk/impute/impute_v2.3.2_x86_64_static.tgz
tar -zxvf impute_v2.3.2_x86_64_static.tgz

## GTOOL
wget http://www.well.ox.ac.uk/~cfreeman/software/gwas/gtool_v0.7.5_x86_64.tgz
tar -zxvf gtool_v0.7.5_x86_64.tgz

## QCTOOLv2 
wget http://www.well.ox.ac.uk/~gav/resources/qctool_v2.0-CentOS6.8-x86_64.tgz
tar -zxvf qctool_v2.0-CentOS6.8-x86_64.tgz

Note that, IMPUTE4 can be downloaded only after applying for access. IMPUTE4 and QCTOOLv2 are not required if you are using IMPUTE2 for the imputation.

Installation

Development version from Github:

install.packages("devtools")
library("devtools")
install_github("transbioZI/Gimpute", build_vignettes=TRUE)
git clone https://github.com/transbioZI/Gimpute
R CMD build Gimpute
R CMD INSTALL Gimpute_*.tar.gz

Gimpute runs on any 64-bit x86 Linux distribution. Additional dependencies are described below.

Experimental data

PLINK format files (PED/MAP or BED/BIM/FAM) are required as the input genotyping data for our pipeline. Free datasets are available under ./extdata/ in our package. The way for loading these PLINK binary files is shown as below. Since sharing of genetic genotyping data is strictly prohibited, we did not upload any private genotyping data for this demonstration.

library(Gimpute)
bedFile <- system.file("extdata", "dataChr21.bed", package="Gimpute")
bimFile <- system.file("extdata", "dataChr21.bim", package="Gimpute") 
famFile <- system.file("extdata", "dataChr21.fam", package="Gimpute")  

Get it Setup

File naming conventions

The files processed and generated are named in a consistent way across all datasets as follows:

Directory naming conventions

The recommended directories are named as follows:

Required packages and tools

## Gimpute has the following dependencies:
library(lattice)
library(doParallel)
## Define required tools
toolDIR <- "/home/Gimpute/tools/"
plink <- paste0(toolDIR, "plink")
gcta <- paste0(toolDIR, "gcta64") 
shapeit <- paste0(toolDIR, "shapeit") 
gtool <- paste0(toolDIR, "gtool") 
qctool <- paste0(toolDIR, "qctool")
imputeTool <- "impute4" ## 
if (imputeTool == "impute2"){ 
    impute <- paste0(toolDIR, "impute2") 
} else if (imputeTool == "impute4"){ 
    impute <- paste0(toolDIR, "impute4.1_r291.2")
} 

Configuration

## Download imputation reference panels
## Reference panel 1: 1000Gphase1v3_macGT1
wget https://mathgen.stats.ox.ac.uk/impute/ALL_1000G_phase1integrated_v3_impute_macGT1.tgz
tar -zxvf ALL_1000G_phase1integrated_v3_impute_macGT1.tgz

## Reference panel 2: 1000Gphase3
wget https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.tgz
wget https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3_chrX.tgz
tar -zxvf 1000GP_Phase3.tgz
tar -zxvf 1000GP_Phase3_chrX.tgz

A string indicating the type of imputation reference panels is used: c("1000Gphase1v3_macGT1", "1000Gphase3"). This must be defined before running pipeline.

## Define the directory where you place the imputation reference files 
impRefDIR <- "/home/imputeRef/1000Gphase1v3_macGT1/"
referencePanel <- "1000Gphase1v3_macGT1" 
## Alternatively,
impRefDIR <- "/home/imputeRef/1000GP_Phase3/"
referencePanel <- "1000Gphase3"

Gimpute flowchart

Running pipeline

Genotype information update

All files named in this subsection are in the directory 1-genoUpdate/.

In this subsection, the function updateGenoInfo() is mandatory and updates genotype information of original genotype data by the use of metadata and configuration files.

Input files

Processing example

  1. Locate the original PLINK files and other required files either from our package or your side.
## Not run
bedFile <- system.file("extdata", "dataChr21.bed", package="Gimpute")
bimFile <- system.file("extdata", "dataChr21.bim", package="Gimpute") 
famFile <- system.file("extdata", "dataChr21.fam", package="Gimpute")

metadataFile <- system.file("extdata", "1_01_metaData.txt", package="Gimpute")
removedSampIDFile <- system.file("extdata", "excludedSampIDs.txt", 
                                 package="Gimpute")
excludedProbeIdsFile <- system.file("extdata", "excludedProbeIDs.txt", 
                                    package="Gimpute")
chipAnnoFile <- system.file("extdata", "coriellAffyChip.txt", 
                                 package="Gimpute")
inputPrefix <- "dataChr21"
ancestrySymbol <- "EUR"
outputPrefix <- "1_11_removedYMtSnp" 
metaDataFile <- "1_01_metaData.txt"
chipType <- "rsIDstudy"
updateGenoInfo(plink, inputPrefix, metaDataFile, removedSampIDFile,
               ancestrySymbol, excludedProbeIdsFile, chipAnnoFile,
               chipType, outputPrefix, keepInterFile=TRUE)

Individual steps

Note that, the output files of each step are the input of the next step.

  1. Use the raw genotype data and metadata information file to generate a metadata file. In the following example, the metadata file is already preprocessed.
  2. Output file: 1_01_metaData.txt
system(paste0("scp ", metadataFile, " ."))  
  1. Remove all excluded instances (subjects, samples) from the raw chip data. Duplicated, related or useless instances are suggested to be removed.
  2. Output files: 1_02_removedExclInst.#
inputPrefix <- "dataChr21"
outputPrefix2 <- "1_02_removedExclInst" 
removedSampIDFile <- system.file("extdata", "excludedSampIDs.txt", 
                                 package="Gimpute")
removeSampID(plink, removedSampIDFile, inputPrefix, 
             outputPrefix=outputPrefix2)
  1. Replace all values for affection (phenotype, group) and sex in the PLINK fam file by those values in the file 1_01_metaData.txt. In regards to this, the missing value for sex is represented by the value 0 (zero), the missing value for affection (group) is represented by the value -9.
  2. Output files: 1_03_replacedGroupAndSex.#
metaDataFile <- "1_01_metaData.txt" 
outputPrefix3 <- "1_03_replacedGroupAndSex"
updateGroupIdAndSex(plink, inputPrefix=outputPrefix2, 
                   metaDataFile, outputPrefix=outputPrefix3)
  1. Remove instances which have the missing affection value (i.e. the value -9).
  2. Output files: 1_04_removedNoGroupId.#
outputPrefix4 <- "1_04_removedNoGroupId"
removeNoGroupId(plink, inputPrefix=outputPrefix3, outputPrefix=outputPrefix4)
  1. Remove instances which have a value for ance (self identified ancestry) in the file 1_01_metaData.txt that is excluded from the dataset (for the excluded ancestry values).
  2. Output files: 1_05_removedWrongAnceInst.#
outputPrefix5 <- "1_05_removedWrongAnceInst"
removedWrongAnceInst(plink, inputPrefix=outputPrefix4, metaDataFile,  
                     ancestrySymbol, outputPrefix=outputPrefix5)
  1. Remove the excluded probes of the chip defined in advance. Improper SNP name often starts with unexpected format such as AFFX, cnvi.
  2. Output files: 1_06_removedExclProbe.#
outputPrefix6 <- "1_06_removedExclProbe"  
removedExclProbe(plink, inputPrefix=outputPrefix5, 
                 excludedProbeIdsFile, outputPrefix=outputPrefix6) 
  1. Remove probes which are not contained in the annotation file or which have missing values in the annotation file.
  2. Primary output files: 1_07_removedUnmapProbes.#
  3. Output file with the removed probes:1_07_probesUnmapped2ChipRef.txt
outputPrefix7 <- "1_07_removedUnmapProbes"   
outputSNPfile7 <- "1_07_probesUnmapped2ChipRef.txt"
removedUnmapProbes(plink, inputPrefix=outputPrefix6, chipAnnoFile,
                   outputPrefix=outputPrefix7, outputSNPfile=outputSNPfile7)
  1. Remove probes which belong to a SNP or have a position (i.e. a base pair position and chromosome) in the annotation file which is the same as that of at least one other probe.
  2. Primary output files: 1_08_removedDoubleProbes.#
  3. Output file with the removed probes:1_08_probesDouble.txt
outputSNPdupFile8 <- "1_08_probesDouble.txt"
outputPrefix8 <- "1_08_removedDoubleProbes"   
removedDoubleProbes(plink, inputPrefix=outputPrefix7, chipAnnoFile, 
                    chipType, outputSNPdupFile=outputSNPdupFile8, 
                    outputPrefix=outputPrefix8) 
  1. Use the annotation file to replace the probe names by SNP names, in order to update the SNP chromosomal location and the SNP base pair position, as well as to convert all alleles to the positive strand. From this step on, chromosome codes are mapped as: X=23, Y=24, XY (pseudo-autosomal region of X)=25, MT (mitochondrial)=26.
  2. Output files: 1_09_updatedSnpInfo.#
outputPrefix9 <- "1_09_updatedSnpInfo"   
updatedSnpInfo(plink, inputPrefix=outputPrefix8, 
               chipAnnoFile, chipType, outputPrefix=outputPrefix9)
  1. Correct the chromosome of the SNPs in the pseudoautosomal region by using the plink option --split-x with the name of the genome build of the annotation file.
  2. Output files: 1_10_changedXyChr.#
outputPrefix10 <- "1_10_splitXchr"
splitXchr(plink, inputPrefix=outputPrefix9, outputPrefix=outputPrefix10)
  1. Remove the SNPs of the chromosomes 24 (Y) and 26 (MT).
  2. Output files: 1_11_removedYMtSnp.#
outputPrefix11 <- "1_11_removedYMtSnp"
removedYMtSnp(plink, inputPrefix=outputPrefix10, outputPrefix=outputPrefix11)

Quality control

All files named in this subsection are in the directory 2-genoQC/.

In this subsection, the function genoQC() is mandatory and performs QC on the genotype data. removedSnpHetX(), removedMaleHetX(), and removeOutlierByPCs() are optional and can be used part of QC . removedSnpHetX() and removedMaleHetX() exclude heterozygous markers in male, and males with haploid heterozygous SNPs on chromosome X. Our default QC procedure does not remove any population outliers before imputation. However, if your study data includes significant population outliers and this may affect the qualtiy of subquent imputation analysis. For this, it is recommended to remove them by the function removeOutlierByPCs() after looking at PCA plot generated by plotPCA4plink().

Input files

Processing example

This step does QC in a single function, including all required substeps to ensure that the QC-ed genotype data is of high quality prior to imputation.

inputPrefix <- "2_02_removedInstHetX"
outputPrefix <- "2_13_removedSnpHweFemaleX" 
genoQC(plink, inputPrefix, 
       snpMissCutOffpre=0.05, 
       sampleMissCutOff=0.02, 
       Fhet=0.2, cutoffSubject=0.05, cutoffSNP=0.1,
       snpMissCutOffpost=0.02, 
       snpMissDifCutOff=0.02,
       femaleChrXmissCutoff=0.05, 
       pval4autoCtl=0.000001, 
       pval4femaleXctl=0.000001, outputPrefix, keepInterFile=TRUE)

Individual steps

Alternatively, we elaborate step by step instructions of QC procedure in genoQC() for better understanding. Note that, the output files of each step are the input of the next step.

  1. Determine for each SNP of the chromosome 23 from the genotype data the number of male instances which have the value one as the minor allele count for that SNP. Then, remove all SNPs which number is higher than 0.5 % of the number of male instances. Note that 0.5% is just an example, this should be dataset specific.
  2. Primary output files: 2_01_removedSnpHetX.#
  3. Output file with the number of instances with heterozygous alleles for each SNP of the chromosome 23 before SNP removal (each line contains a SNP name and the respective number, lines are sorted descending by number): 2_01_snpHHfreqAll.txt
  4. Output file with the number of instances with heterozygous alleles for each SNP of the chromosome 23 after SNP removal:2_01_snpHHfreqRetained.txt
inputPrefix <- "1_11_removedYMtSnp"
hhCutOff <- 0.005 ##  can be tuned
outputPrefix <- "2_01_removedSnpHetX" 
outputHetSNPfile <- "2_01_snpHHfreqAll.txt"
outputRetainSNPfile <- "2_01_snpHHfreqRetained.txt"
removedSnpHetX(plink, inputPrefix, hhCutOff, outputPrefix, 
               outputHetSNPfile, outputRetainSNPfile)
  1. Determine for each male instance the number of SNPs of the chromosome 23 which have the value one as the minor allele count for that instance. Next, remove all instances which number is higher than 15. Note that 15 is just an example, this should be dataset specific.
  2. Primary output files: 2_02_removedInstHetX.#
  3. Output file stores male subjects that have heterozygous SNPs with their frequency (if any): 2_02_instHetXfreqAll.txt
  4. Output file stores male subjects that have heterozygous SNPs with their frequency after 'improper' subject removal (if any):2_02_instHetXfreqRetained.txt
  5. Output file stores all heterozygous SNPs with their frequency (the number of males for this SNP): 2_02_snpHHfreqAll.txt
inputPrefix <- "2_01_removedSnpHetX"
hhSubjCutOff <- 15
outputPrefix <- "2_02_removedInstHetX"
outputSubjHetFile <- "2_02_instHetXfreqAll.txt" 
outputRetainSubjectFile <- "2_02_instHetXfreqRetained.txt"  
outputHetSNPfile <- "2_02_snpHHfreqAll.txt"
removedMaleHetX(plink, inputPrefix, hhSubjCutOff, 
                outputPrefix, outputSubjHetFile, 
                outputRetainSubjectFile, outputHetSNPfile) 

These two steps are automatically skipped if you are working on the autosomal data.

  1. Set all heterozygous alleles of SNPs of the chromosome 23 for males (i.e. when the SNP has the value one as the minor allele count) as missing.
  2. Output files: 2_03_setHeteroHaploMissing.#
outputPrefix3 <- "2_03_setHeteroHaploMissing" 
setHeteroHaploMissing(plink, inputPrefix, outputPrefix=outputPrefix3) 
  1. Remove SNPs with missingness >= 0.05 (before instance removal).
  2. Output files: 2_04_removedSnpMissPre.#
outputPrefix4 <- "2_04_removedSnpMissPre" 
removedSnpMiss(plink, snpMissCutOff=0.05, 
               inputPrefix=outputPrefix3, outputPrefix=outputPrefix4)  
  1. Remove instances with missingness >= 0.02.
  2. Output files: 2_05_removedInstMiss.#
outputPrefix5 <- "2_05_removedInstMiss" 
removedInstMiss(plink, sampleMissCutOff=0.02,  
                inputPrefix=outputPrefix4, outputPrefix=outputPrefix5)
  1. Remove instances with great autosomal heterozygosity deviation, i.e. with |Fhet| >= 0.2
  2. Output files: 2_06_removedInstFhet.#
outputPrefix6 <- "2_06_removedInstFhet" 
removedInstFhet(plink, Fhet=0.2, 
                inputPrefix=outputPrefix5, outputPrefix=outputPrefix6)
  1. Replace the paternal ID and maternal ID of instances (childs) by the value zero if the paternal ID and the maternal ID do not belong to any instance (parent) with the same family ID as the child.
  2. Output files: 2_07_removedParentIdsMiss.#
outputPrefix7 <- "2_07_removedParentIdsMiss" 
removedParentIdsMiss(plink, inputPrefix=outputPrefix6, 
                     outputPrefix=outputPrefix7)
  1. Exclude subjects and/or genetic variants (SNPs) based on Mendel errors for the family data (trio/duo). (Optional, if no family data)
  2. Output files: 2_08_removedMendelErr.#
outputPrefix8 <- "2_08_removedMendelErr" 
removedMendelErr(plink, inputPrefix=outputPrefix7, cutoffSubject, 
                 cutoffSNP, outputPrefix=outputPrefix8)
  1. Remove SNPs with missingness >= 0.02 (after instance removal).
  2. Output files: 2_09_removedSnpMissPost.#
outputPrefix9 <- "2_09_removedSnpMissPost" 
removedSnpMiss(plink, snpMissCutOff=0.02, 
               inputPrefix=outputPrefix8, outputPrefix=outputPrefix9)
  1. Remove SNPs with difference >= 0.02 of SNP missingness between cases and controls.
  2. Output files: 2_10_removedSnpMissDiff.#
outputPrefix10 <- "2_10_removedSnpMissDiff"
groupLabel <- getGroupLabel(inputFAMfile=paste0(outputPrefix8, ".fam"))
removedSnpMissDiff(plink, inputPrefix=outputPrefix9, snpMissDifCutOff=0.02, 
                   outputPrefix=outputPrefix10, groupLabel)  
  1. Remove chrX SNPs with missingness >= 0.05 in females. (Optional, if no chrX data)
  2. Output files: 2_11_removedSnpFemaleChrXmiss.#
outputPrefix11 <- "2_11_removedSnpFemaleChrXmiss" 
removedSnpFemaleChrXmiss(plink, femaleChrXmissCutoff=0.05, 
                         inputPrefix=outputPrefix10, 
                         outputPrefix=outputPrefix11) 
  1. Remove autosomal SNPs with Hardy-Weinberg equilibrium p < 10-6 in controls.
  2. Primary output files: 2_12_removedSnpHweAuto.#
  3. Output file with HWE p-value for the autosomal SNPs before removing, sorted ascending by the p-values: 2_12_snpHwePvalAuto.txt
  4. Output file with the removed SNP names: 2_12_snpRemovedHweAuto.txt
outputPvalFile <- "2_12_snpHwePvalAuto.txt" 
outputSNPfile <-  "2_12_snpRemovedHweAuto.txt" 
outputPrefix12 <- "2_12_removedSnpHweAuto" 
if (groupLabel == "control" | groupLabel == "caseControl"){
    removedSnpHWEauto(groupLabel="control", plink, 
                      inputPrefix=outputPrefix11, 
                      pval=0.000001, outputPvalFile, 
                      outputSNPfile, outputPrefix=outputPrefix12)
}
  1. Remove chrX SNPs with HWE p < 10-6 in female controls. (Optional, if no chrX data)
  2. Primary output files: 2_13_removedSnpHweFemaleXct.#
  3. Output file with HWE p-value for the female chrX SNPs before removing, sorted ascending by the p-values:2_13_snpHwePvalfemaleXct.txt
  4. Output file with the removed SNP names:2_13_snpRemovedHweFemaleXct.txt
outputPrefix <- "2_13_removedSnpHweFemaleX"  
outputPvalFile <- "2_13_snpHwePvalfemaleXct.txt" 
outputSNPfile <- "2_13_snpRemovedHweFemaleXct.txt"  
removedSnpFemaleChrXhweControl(plink, inputPrefix=outputPrefix12, 
                               pval=0.000001, outputPvalFile,
                               outputSNPfile, outputPrefix)
  1. Calculate the PCA (with plots) for 2_13_removedSnpHweFemaleXct.# and remove the outliers from that dataset.
  2. Primary output files: 2_14_removedOutliers.#
  3. Output file with the IDs of the instances and the values of their eigenvectors from the PCA after removal of instances: 2_14_eigenvalAfterQC.txt
  4. Output file with the plot of the first two PCs from the PCA before removal of instances: 2_14_eigenvalAfterQC.png
  5. Output file with the plot of the first two PCs from the PCA after removal of instances: 2_14_removedOutliers.png
  6. Output file with the IDs of the removed instances and the values of their eigenvectors, sorted ascending by the PC values: 2_14_eigenval4outliers.txt
inputPrefix <- "2_13_removedSnpHweFemaleX" 
outputPC4subjFile <- "2_14_eigenvalAfterQC.txt"
outputPCplotFile <- "2_14_eigenvalAfterQC.png" 
plotPCA4plink(gcta, inputPrefix, nThread=20, outputPC4subjFile, outputPCplotFile)  

cutoff <-  c(-0.4, 0.2)
cutoffSign <- "greater" ## not used if cutoff == NULL, and with two values 
inputPC4subjFile <- "2_14_eigenvalAfterQC.txt"
outputPC4outlierFile <- "2_14_eigenval4outliers.txt"
outputPCplotFile <- "2_14_removedOutliers.png"
outputPrefix <- "2_14_removedOutliers" 
removeOutlierByPCs(plink, gcta, inputPrefix, nThread=20, 
                   cutoff, cutoffSign, inputPC4subjFile, 
                   outputPC4outlierFile, outputPCplotFile, outputPrefix) 

Alignment with the reference

All files named in this subsection are in the directory 3-checkAlign/.

The function checkAlign2ref() performs the alignment against a given reference
panel by considering the following parameters: variant name, genomic position and
the allele presentation. Output files are generated in order.

Input files

Detailed steps

  1. If the genome build of already QC-ed is different from the genome build of the imputation reference files, then lift the data from the first genome build to the second. If the genome build is the same, then just copy the input files.

  2. Output files: 3_1_QCdata.#

  3. Remove SNPs for which the name has a different position (i.e. combination of base pair position and chromosome) in the imputation reference files.

  4. Primary output files: 3_2_removedSnpDiffNamePos.#

  5. Output file with the SNPs names for which a different position is contained in the imputation reference files: 3_2_snpDiffNamePos.txt

  6. Remove SNPs for which the combination of base pair position and chromosome is not contained in the imputation reference files (ignoring the SNP name).

  7. Primary output files: 3_3_removedSnpMissPos.#
  8. Output file with the SNPs names for which the combination of base pair position and chromosome is not contained in the imputation reference files: 3_3_snpMissPos.txt

  9. Remove SNPs from which the combination of base pair position and chromosome (ignoring the SNP name) has an allele which is not in the imputation reference files for that combination of base pair position and chromosome.

  10. Primary output files: 3_4_removedSnpDiffAlleles.#

  11. Output file with the removed SNP names: 3_4_snpDiffAlleles.txt
  12. Output file with the retained SNPs: 3_4_snpImpRefAlleles.txt

Processing example

renamePlinkBFile(inputPrefix="2_14_removedOutliers", 
                 outputPrefix="3_1_QCdata", action="move")
inputFile <- paste0(impRefDIR,"*.legend.gz")  
## To recode the intermediate disk usage 
## keep in the same directory.
bimReferenceFile <- "bimImputeRef.txt"
.prepareLegend2bim(inputFile, referencePanel, 
                   outputFile=bimReferenceFile, ncore=nCores) 

inputPrefix <- "3_1_QCdata" 
out2.snp <- "3_2_snpSameNameDiffPos"
out2 <- "3_2_removedSnpSameNameDiffPos"
out3 <- "3_3_removedSnpMissPos"
out3.snp <- "3_3_snpMissPos"
out4 <- "3_4_removedSnpDiffAlleles"
out4.snp <- "3_4_snpDiffAlleles"
out4.snpRetained <- "3_4_snpImpRefAlleles"
nCores <- detectCores() 
checkAlign2ref(plink, inputPrefix, referencePanel, bimReferenceFile, 
               out2, out2.snp, out3, out3.snp, 
               out4, out4.snp, out4.snpRetained, nCore=nCores)

Imputation

All files named in this subsection are in the directory 4-imputation/.

In this subsection, three major functions removedMonoSnp(), phaseImpute() and postImpQC() are used for the purpose of excluding monomorphic SNPs, phasing, imputation and post imputation QC.

Input files

Detailed steps and processing examples

  1. Remove monomorphic SNPs from the well aligned genotyping data.
  2. Primary output files: 4_1_removedMonoSnp.#
  3. Output file with the removed monomorphic SNPs: 4_1_snpMonoRemoved.txt
outputPrefix <- "4_1_removedMonoSnp"
outputMonoSNPfile <- "4_1_snpMonoRemoved.txt"  
removedMonoSnp(plink, inputPrefix="3_4_removedSnpDiffAlleles", 
                outputPrefix, outputSNPfile=outputMonoSNPfile)
  1. Use the imputation reference files to generate pre-phase haplotypes by SHAPEIT and then do the imputation by using IMPUTE2. GTOOL is used for format conversion. In the case of using IMPUTE4, QCTOOL is used instead of GTOOL.
  2. Output files: 4_2_imputedDataset.#
  3. Output file with the info scores of all SNPs, which consists of two columns, separated by whitespaces, the first the SNPs and the second the info scores: 4_2_snpImputedInfoScore.txt. This is also the output from last step: impute2infoUpdated.txt.
inputPrefix <- "4_1_removedMonoSnp"  
outputPrefix <- "4_2_imputedDataset"   
outputInfoFile <- "4_2_snpImputedInfoScore.txt"
tmpImputeDir <- paste0(imputeTool, "_", referencePanel)
nCores <- detectCores()
phaseImpute(inputPrefix, outputPrefix, autosome=TRUE,
            plink, shapeit, imputeTool, impute, qctool, gtool, 
            windowSize=3000000, effectiveSize=20000, 
            nCore=nCores, threshold=0.9, outputInfoFile, SNP=TRUE, 
            referencePanel, impRefDIR, tmpImputeDir, keepTmpDir=TRUE)
  1. Remove imputed SNPs with (info < 0.6) where the value info comes from the files *.impute2_info from the imputation.
  2. Output files: 4_3_wellImputeData.#

  3. Remove all SNPs (if any) which have the same position as a SNP in 4_1_snpMonoRemoved.txt has in aligned instance data.

  4. Output files: 4_4_removedMonoSnpAfter.#

  5. Add previous identified monomorphic SNPs (if any) in 4_1_snpMonoRemoved.txt with their values from the lifted instance data.

  6. Output files: 4_5_addedMonoSnpAfter.#

  7. Remove SNPs which have a non missing value for less then 20 instances.

  8. Primary output files: 4_6_removedSnpMissPostImp.#
  9. Output file with the removed SNP names: 4_6_snpRemovedMissPostImp.txt
  10. Output file with the remaining SNP names: 4_6_snpRetainMissPostImp.txt
inputPrefix <- "4_2_imputedDataset" 
out1 <- "4_3_wellImputeData"
out2 <- "4_4_removedMonoSnpAfter"
out3 <- "4_5_addedMonoSnpAfter"
out4 <- "4_6_removedSnpMissPostImp" 
outputInfoFile <- "4_2_snpImputedInfoScore.txt"
outputMonoSNPfile <- "4_1_snpMonoRemoved.txt" 
prefixAlign2ref <- "3_4_removedSnpDiffAlleles" 
outRemovedSNPfile <- "4_6_snpRemovedMissPostImp.txt"
outRetainSNPfile <- "4_6_snpRetainMissPostImp.txt"

postImpQC(plink, inputPrefix, out1, out2, out3, out4,
          outputInfoFile, infoScore=0.6, outputMonoSNPfile, 
          prefixAlign2ref, missCutoff=20, outRemovedSNPfile, 
          outRetainSNPfile, referencePanel)

Reduction and expansion

All files named in this subsection are in the directory 5-reductAndExpand/.

reductExpand() function extracts the genotypes from well imputed data so that PLINK data with only SNPs before imputation are remained. On the basis of extracted genotypes, a subset of non-imputed genotypes that are different from the imputation reference panel are then added, including SNPs with different alleles, missing positions, and different positions.

Input files

Detailed steps

  1. Firt of all, copy all the input files into the directory ./5-reductAndExpand/. Then do the reduction of the imputed dataset to the SNPs before imputation.
  2. Output files: 5_1_reducedToSpecific.#

  3. Add the SNPs (if any) with different alleles from the dataset before removing SNPs.

  4. Output files: 5_2_specificDiffAllele.#

  5. Add the SNPs (if any) with missing positions from the dataset before removing SNPs.

  6. Output files: 5_3_specificMissPos.#

  7. Add the SNPs (if any) with different positions from the dataset before removing SNPs.

  8. Output files: 5_4_specificDiffPos.#

Processing example

inputPrefix <- "4_6_removedSnpMissPostImp"
inputQCprefix <- "3_1_QCdata"
snpRefAlleleFile <- "3_4_snpImpRefAlleles.txt"
snpDiffAlleleFile <- "3_4_snpDiffAlleles.txt"
snpMissPosFile <- "3_3_snpMissPos.txt"
snpSameNameDifPosFile <- "3_2_snpSameNameDiffPos.txt" 

dir5 <- "./5-reductAndExpand/"
## imputed dataset
system(paste0("scp ./4-imputation/", inputPrefix, ".* ", dir5)) 
system(paste0("scp ./3-checkAlign/", inputQCprefix, ".* ", dir5))
system(paste0("scp ./3-checkAlign/", snpRefAlleleFile, " ", dir5))
system(paste0("scp ./3-checkAlign/", snpDiffAlleleFile, " ", dir5))
system(paste0("scp ./3-checkAlign/", snpMissPosFile, " ", dir5))
system(paste0("scp ./3-checkAlign/", snpSameNameDifPosFile, " ", dir5))

setwd(dir5)
reducedToSpecificfn <- "5_1_reducedToSpecific"
specificDiffAllelefn <- "5_2_specificDiffAllele"
specificMissPosfn <- "5_3_specificMissPos"
specificDiffPosfn <- "5_4_specificDiffPos" 

reductExpand(plink, referencePanel, inputPrefix, inputQCprefix, 
             snpRefAlleleFile, snpDiffAlleleFile, 
             snpMissPosFile, snpSameNameDifPosFile, 
             reducedToSpecificfn, specificDiffAllelefn, 
             specificMissPosfn, specificDiffPosfn)
setwd("..")

Final results

Copy and rename the results of interest to the directory ./6-finalResults.

The metadata information file, the final well imputed and a subset of imputed dataset (dataset-specific) are essential for the downstream analysis.

dir6 <- "./6-finalResults/"
system(paste0("scp ./1-genoUpdate/1_01_metaData.txt ", dir6))
system(paste0("scp ./4-imputation/4_6_removedSnpMissPostImp.* ", dir6))  
system(paste0("scp ./5-reductAndExpand/5_4_specificDiffPos.* ", dir6))
setwd(dir6)
renamePlinkBFile(inputPrefix="4_6_removedSnpMissPostImp", 
                 outputPrefix="imputedSnpsDataset", action="move")
renamePlinkBFile(inputPrefix="5_4_specificDiffPos", 
                 outputPrefix="specificSnpsDataset", action="move")

Extending pipelines

Gimpute's modular structure allows the integration of other existing imputation workflows, allowing users to choose their preferred imputation strategy. For demonstration, we have embedded Genipe as an external imputation tool [8].

Integrate with Genipe

Installation

To set up Genipe, please refer to Genipe's installation page for more details.

Additional configuration files

The imputation reference can be downloaded directly from IMPUTE2's reference page: 1000 Genome phase 3 or Genipe's page.

This reference genome is used for the strand check, you can download from Genipe's page.

Imputation with Genipe

  1. Prepare additional configuration files.
## example
impRefDIR4genipe <- "../1000GP_Phase3/"
fastaFile <- "../hg19/hg19.fasta"
  1. Imputing genotypes using Genipe.
chrs =22
inputPrefix <- "3_4_removedSnpDiffAlleles"
thread4impute2 <- 20 ## tune by yourself
thread4shapeit <- 30
segmentSize <- 3000000
imputeTool <- "impute2" ## only impute2 
imputedByGenipe(chrs, impRefDIR4genipe, inputPrefix, shapeit, impute, plink, fastaFile, segmentSize, thread4impute2, thread4shapeit) 
  1. Merge chunk-specific region using Genipe.
chr <- 2 
inputImpute2 <- 'chr2.33000001_36000000.impute2'
probability <- 0.9
completionRate <- 0.98
# info <- 0.6
outputPrefix <- paste0('imputedChr', chr)
mergeByGenipe(inputImpute2, chr, probability, completionRate, info, outputPrefix)
  1. Extract imputed results chromosome-wise by Genipe.
## extract imputed markers using Genipe 
chr <- 3
inputImpute2 <- paste0('chr', chr,'.imputed.impute2')
inputMAP <- paste0('chr', chr,'.imputed.map')
format <- 'bed'
prob <- 0.9
outputPrefix <- paste0('imputedChr', chr)  
extractByGenipe(inputImpute2, inputMAP, outputPrefix, format, prob)
## sessionInfo()

# R version 3.5.0 (2018-04-23)
# Platform: x86_64-redhat-linux-gnu (64-bit)
# Running under: CentOS Linux 7 (Core)

# Matrix products: default
# BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

# locale:
#  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
# [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

# attached base packages:
# [1] parallel  stats     graphics  grDevices utils     datasets  methods  
# [8] base     

# other attached packages:
# [1] doParallel_1.0.11 iterators_1.0.10  foreach_1.4.4     lattice_0.20-35  
# [5] Gimpute_0.99.9   

# loaded via a namespace (and not attached):
# [1] compiler_3.5.0   codetools_0.2-15 grid_3.5.0      

References

[1] Marchini, J., et al. (2010). Genotype imputation for genome-wide association studies. Nat Rev Genet 11(7): 499-511.

[2] Marchini, J., et al. (2007). A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet 39(7): 906-913.

[3] Schizophrenia Working Group of the Psychiatric Genomics, C. (2014). Biological insights from 108 schizophrenia-associated genetic loci. Nature 511(7510): 421-427.

[4] Purcell, Shaun, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics 81.3 (2007): 559-575.

[5] Howie, B., et al. (2012). Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet 44(8): 955-959.

[6] Howie, B. N., et al. (2009). A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 5(6): e1000529.

[7] Bycroft, Clare, Colin Freeman, Desislava Petkova, Gavin Band, Lloyd T. Elliott, Kevin Sharp, Allan Motyer et al. Genome-wide genetic data on~ 500,000 UK Biobank participants. BioRxiv (2017): 166298.

[8] Lemieux Perreault, L. P., et al. (2016). genipe: an automated genome-wide imputation pipeline with automatic reporting and statistical tools. Bioinformatics, 32(23), 3661-3663.

Trouble-shooting

If you have any trouble or comment ? --> Contact: junfang.chen@zi-mannheim.de



transbioZI/Gimpute documentation built on April 10, 2022, 4:20 a.m.