In transbioZI/Gimpute: Gimpute: An efficient genetic data processing and imputation pipeline

BiocStyle::markdown()

knitr::opts_chunk$set(message=FALSE, error=FALSE, warning=FALSE, fig.width=10, fig.height=10)
options(width=80)

About Gimpute

Background

Genome-wide association studies (GWAS) have become an essential tool in assisting the improvement of understanding of neuropathology of diseases. Moreover, the imputation of genotyping data with missing information can enhance the power of GWAS by using haplotype relationships between genetic variants [1,2]. In the meantime, the imputation can substantially reduce the cost of genotyping large number of samples. Furthermore, the comparison of significant findings across different cohorts and genotyping platforms is also achievable. Currently large amounts of genotyping data are produced at an unprecedented rate to be analyzed; however, the preparation of ready-to-use data sets needs to be of high quality. Thus the implementation of an efficient and reliable genetic data pre-processing and imputation pipeline is desirable.

Scope of the pipeline

In order to ensure the reliability and reproducibility of genome-wide association study (GWAS) data, we set up an efficient, automatic and comprehensive genotype data processing and imputation pipeline termed Gimpute. It consists of pre-processing (genetic variant information updating/matching/liftOver, quality control of genetic variants and study samples, the alignment of variants to the imputation references), pre-phasing, imputation and post-imputation quality control [3], as well as an extension to the Genipe framework in this vignette, which is easy-to-follow and user-friendly.

Getting started

Prerequisites

Gimpute runs on any 64-bit x86 Linux distribution and it requires the following tools [4,5,6,7]:

You can install the tools from the command line as the following examples:

## PLINK
wget https://www.cog-genomics.org/static/bin/plink180717/plink_linux_x86_64.zip
unzip plink_linux_x86_64.zip

## GCTA64
wget http://cnsgenomics.com/software/gcta/gcta_1.91.5beta.zip
unzip gcta_1.91.5beta.zip

## SHAPEIT
wget https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.v2.r904.glibcv2.17.linux.tar.gz
tar -zxvf shapeit.v2.r904.glibcv2.17.linux.tar.gz

## IMPUTE2
wget https://mathgen.stats.ox.ac.uk/impute/impute_v2.3.2_x86_64_static.tgz
tar -zxvf impute_v2.3.2_x86_64_static.tgz

## GTOOL
wget http://www.well.ox.ac.uk/~cfreeman/software/gwas/gtool_v0.7.5_x86_64.tgz
tar -zxvf gtool_v0.7.5_x86_64.tgz

## QCTOOLv2 
wget http://www.well.ox.ac.uk/~gav/resources/qctool_v2.0-CentOS6.8-x86_64.tgz
tar -zxvf qctool_v2.0-CentOS6.8-x86_64.tgz

Note that, IMPUTE4 can be downloaded only after applying for access. IMPUTE4 and QCTOOLv2 are not required if you are using IMPUTE2 for the imputation.

Installation

Development version from Github:

Install Gimpute in R (Recommended)

install.packages("devtools")
library("devtools")
install_github("transbioZI/Gimpute", build_vignettes=TRUE)

Install Gimpute from the command line

git clone https://github.com/transbioZI/Gimpute
R CMD build Gimpute
R CMD INSTALL Gimpute_*.tar.gz

Gimpute runs on any 64-bit x86 Linux distribution. Additional dependencies are described below.

Experimental data

PLINK format files (PED/MAP or BED/BIM/FAM) are required as the input genotyping data for our pipeline. Free datasets are available under ./extdata/ in our package. The way for loading these PLINK binary files is shown as below. Since sharing of genetic genotyping data is strictly prohibited, we did not upload any private genotyping data for this demonstration.

library(Gimpute)
bedFile <- system.file("extdata", "dataChr21.bed", package="Gimpute")
bimFile <- system.file("extdata", "dataChr21.bim", package="Gimpute") 
famFile <- system.file("extdata", "dataChr21.fam", package="Gimpute")

Get it Setup

File naming conventions

The files processed and generated are named in a consistent way across all datasets as follows:

filename.# represents a set of the three PLINK files, i.e. the .bed, .fam and .bim file. Binary PLINK files are recommended due to their smaller file size and fast processing.
filename.txt denotes a file with additional information, such as metadata information, SNP names, in pure text format. The names of the output files begin with the number of the step which produces the respective file.

Directory naming conventions

The recommended directories are named as follows:

1-genoUpdate: Updated genotype information related files are generated in this directory.
2-genoQC: Quality controlled genotype data related files are generated in this directory.
3-checkAlign: Aligned with the imputation reference related files are generated in this directory.
4-imputation: All imputation related files are generated in this directory.
5-reductAndExpand: Filtered imputation data and expanded to the original non-imputed genotype files are generated in this directory.
6-finalResults: Final processed and filtered imputed genotype files are generated in this directory.

Required packages and tools

## Gimpute has the following dependencies:
library(lattice)
library(doParallel)
## Define required tools
toolDIR <- "/home/Gimpute/tools/"
plink <- paste0(toolDIR, "plink")
gcta <- paste0(toolDIR, "gcta64") 
shapeit <- paste0(toolDIR, "shapeit") 
gtool <- paste0(toolDIR, "gtool") 
qctool <- paste0(toolDIR, "qctool")
imputeTool <- "impute4" ## 
if (imputeTool == "impute2"){ 
    impute <- paste0(toolDIR, "impute2") 
} else if (imputeTool == "impute4"){ 
    impute <- paste0(toolDIR, "impute4.1_r291.2")
}

Configuration

Self-prepared configuration files
- metadataFile: A text file with at least the following content (column names are in parentheses): family ID in the PLINK files (FID), individual ID in the PLINK files (IID), ID in the description files (descID), self identified ancestry (ance; e.g. AFR: African, AMR: Ad Mixed American, EAS: East Asian, EUR: European, SAS: South Asian), sex (sex; 1 = male, 2 = female, 0 = missing), age (age), group (group; 0 = control/unaffected, 1 = case/affected). All unknown and missing values are represented by the value NA. Lines with a missing value for FID or IID are not contained.
- removedSampIDFile: A text file that stores sample IDs, each ID per line, which will be excluded. For instance, samples with missing sex information can be included in this file.
- excludedProbeIdsFile: A text file that stores probe IDs, each ID per line, which will be excluded.
Genotyping chip annotation file: Most of common chip annotation files can be directly downloaded from Dr. William Rayner'page. This annotation file is particularly important for strand checking and genomic information of genotypes on a variety of genome builds. Gimpute requires the chip annotation file that is organized into two different types:
- If the snp name of your study genotype data starts with "SNP_", then the chip type "SNPIDstudy" is used; Usually, Affymetrix chip belongs to this category. The prepared output annotation file must consist of at least the following column names : SNPIDstudy, rs, chr, pos, strand. The column "strand" must only have two kinds of values "-" and "+". Variants with all other values are excluded.
- If the snp name of your study genotype data starts with "rs", then the chip type "rsIDstudy" is used; The annotation file should be prepared in advance and must consist of at least the following column names : rsIDstudy, chr, pos, strand. Illumina chip is often specified in this format. The column "strand" must only have two kinds of values "-" and "+". Variants with all other values are excluded.
Imputation reference files: Publicly avaiable reference datasets can be downloaded from IMPUTE2's reference page. The reference files from "1000 Genome Project Phase1v3" are the default panel used in our package. 1000 Genomes Phase 3 is an alternative reference panel.

## Download imputation reference panels
## Reference panel 1: 1000Gphase1v3_macGT1
wget https://mathgen.stats.ox.ac.uk/impute/ALL_1000G_phase1integrated_v3_impute_macGT1.tgz
tar -zxvf ALL_1000G_phase1integrated_v3_impute_macGT1.tgz

## Reference panel 2: 1000Gphase3
wget https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.tgz
wget https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3_chrX.tgz
tar -zxvf 1000GP_Phase3.tgz
tar -zxvf 1000GP_Phase3_chrX.tgz

A string indicating the type of imputation reference panels is used: c("1000Gphase1v3_macGT1", "1000Gphase3"). This must be defined before running pipeline.

## Define the directory where you place the imputation reference files 
impRefDIR <- "/home/imputeRef/1000Gphase1v3_macGT1/"
referencePanel <- "1000Gphase1v3_macGT1" 
## Alternatively,
impRefDIR <- "/home/imputeRef/1000GP_Phase3/"
referencePanel <- "1000Gphase3"

Global parameters
- ancestrySymbol: A string symbol for self-identified ancestry (AFR: African, AMR: Ad Mixed American, EAS: East Asian, EUR: European, SAS: South Asian, or NULL)
- chipType: A string symbol to indicate the type of chip annotation file: c("rsIDstudy", "SNPIDstudy")
- hhCutOff (optional): the cutoff for removing male haploid heterozygous SNPs on the chromosome X. The default is 0.005.
- hhSubjCutOff (optional): the cutoff for removing male subjects with haploid heterozygous SNPs on the chromosome X. The default is 15.
- cutoff (optional): the cutoff that distinguishes the outliers from ordinary population using PCA. The default is NULL meaning that there are no outliers or outliers are not required to be removed.
- referencePanel: A string symbol to indicate the type of input imputation reference panel: c("1000Gphase1v3_macGT1", "1000Gphase3")
- imputeTool: A string symbol to indicate the type of used imputation tool: c("impute2", "impute4")

Gimpute flowchart

Running pipeline

Genotype information update

All files named in this subsection are in the directory 1-genoUpdate/.

In this subsection, the function updateGenoInfo() is mandatory and updates genotype information of original genotype data by the use of metadata and configuration files.

Input files

Raw genotype data in PLINK format: dataChr21.#
Metadata information file about the instances: metadataFile.
Chip annotation file for genotype information mapping and strand check: chipAnnoFile.
The file with the IDs of the excluded instances: removedSampIDFile.
The file with the IDs of the excluded probes of the chip: excludedProbeIdsFile.

Processing example

Locate the original PLINK files and other required files either from our package or your side.

## Not run
bedFile <- system.file("extdata", "dataChr21.bed", package="Gimpute")
bimFile <- system.file("extdata", "dataChr21.bim", package="Gimpute") 
famFile <- system.file("extdata", "dataChr21.fam", package="Gimpute")

metadataFile <- system.file("extdata", "1_01_metaData.txt", package="Gimpute")
removedSampIDFile <- system.file("extdata", "excludedSampIDs.txt", 
                                 package="Gimpute")
excludedProbeIdsFile <- system.file("extdata", "excludedProbeIDs.txt", 
                                    package="Gimpute")
chipAnnoFile <- system.file("extdata", "coriellAffyChip.txt", 
                                 package="Gimpute")

inputPrefix <- "dataChr21"
ancestrySymbol <- "EUR"
outputPrefix <- "1_11_removedYMtSnp" 
metaDataFile <- "1_01_metaData.txt"
chipType <- "rsIDstudy"
updateGenoInfo(plink, inputPrefix, metaDataFile, removedSampIDFile,
               ancestrySymbol, excludedProbeIdsFile, chipAnnoFile,
               chipType, outputPrefix, keepInterFile=TRUE)

Individual steps

Note that, the output files of each step are the input of the next step.

Use the raw genotype data and metadata information file to generate a metadata file. In the following example, the metadata file is already preprocessed.
Output file: 1_01_metaData.txt

system(paste0("scp ", metadataFile, " ."))

Remove all excluded instances (subjects, samples) from the raw chip data. Duplicated, related or useless instances are suggested to be removed.
Output files: 1_02_removedExclInst.#

inputPrefix <- "dataChr21"
outputPrefix2 <- "1_02_removedExclInst" 
removedSampIDFile <- system.file("extdata", "excludedSampIDs.txt", 
                                 package="Gimpute")
removeSampID(plink, removedSampIDFile, inputPrefix, 
             outputPrefix=outputPrefix2)

Replace all values for affection (phenotype, group) and sex in the PLINK fam file by those values in the file 1_01_metaData.txt. In regards to this, the missing value for sex is represented by the value 0 (zero), the missing value for affection (group) is represented by the value -9.
Output files: 1_03_replacedGroupAndSex.#

metaDataFile <- "1_01_metaData.txt" 
outputPrefix3 <- "1_03_replacedGroupAndSex"
updateGroupIdAndSex(plink, inputPrefix=outputPrefix2, 
                   metaDataFile, outputPrefix=outputPrefix3)

Remove instances which have the missing affection value (i.e. the value -9).
Output files: 1_04_removedNoGroupId.#

outputPrefix4 <- "1_04_removedNoGroupId"
removeNoGroupId(plink, inputPrefix=outputPrefix3, outputPrefix=outputPrefix4)

Remove instances which have a value for ance (self identified ancestry) in the file 1_01_metaData.txt that is excluded from the dataset (for the excluded ancestry values).
Output files: 1_05_removedWrongAnceInst.#

outputPrefix5 <- "1_05_removedWrongAnceInst"
removedWrongAnceInst(plink, inputPrefix=outputPrefix4, metaDataFile,  
                     ancestrySymbol, outputPrefix=outputPrefix5)

Remove the excluded probes of the chip defined in advance. Improper SNP name often starts with unexpected format such as AFFX, cnvi.
Output files: 1_06_removedExclProbe.#

outputPrefix6 <- "1_06_removedExclProbe"  
removedExclProbe(plink, inputPrefix=outputPrefix5, 
                 excludedProbeIdsFile, outputPrefix=outputPrefix6)

Remove probes which are not contained in the annotation file or which have missing values in the annotation file.
Primary output files: 1_07_removedUnmapProbes.#
Output file with the removed probes:1_07_probesUnmapped2ChipRef.txt

outputPrefix7 <- "1_07_removedUnmapProbes"   
outputSNPfile7 <- "1_07_probesUnmapped2ChipRef.txt"
removedUnmapProbes(plink, inputPrefix=outputPrefix6, chipAnnoFile,
                   outputPrefix=outputPrefix7, outputSNPfile=outputSNPfile7)

Remove probes which belong to a SNP or have a position (i.e. a base pair position and chromosome) in the annotation file which is the same as that of at least one other probe.
Primary output files: 1_08_removedDoubleProbes.#
Output file with the removed probes:1_08_probesDouble.txt

outputSNPdupFile8 <- "1_08_probesDouble.txt"
outputPrefix8 <- "1_08_removedDoubleProbes"   
removedDoubleProbes(plink, inputPrefix=outputPrefix7, chipAnnoFile, 
                    chipType, outputSNPdupFile=outputSNPdupFile8, 
                    outputPrefix=outputPrefix8)

Use the annotation file to replace the probe names by SNP names, in order to update the SNP chromosomal location and the SNP base pair position, as well as to convert all alleles to the positive strand. From this step on, chromosome codes are mapped as: X=23, Y=24, XY (pseudo-autosomal region of X)=25, MT (mitochondrial)=26.
Output files: 1_09_updatedSnpInfo.#

outputPrefix9 <- "1_09_updatedSnpInfo"   
updatedSnpInfo(plink, inputPrefix=outputPrefix8, 
               chipAnnoFile, chipType, outputPrefix=outputPrefix9)

Correct the chromosome of the SNPs in the pseudoautosomal region by using the plink option --split-x with the name of the genome build of the annotation file.
Output files: 1_10_changedXyChr.#

outputPrefix10 <- "1_10_splitXchr"
splitXchr(plink, inputPrefix=outputPrefix9, outputPrefix=outputPrefix10)

Remove the SNPs of the chromosomes 24 (Y) and 26 (MT).
Output files: 1_11_removedYMtSnp.#

outputPrefix11 <- "1_11_removedYMtSnp"
removedYMtSnp(plink, inputPrefix=outputPrefix10, outputPrefix=outputPrefix11)

Quality control

All files named in this subsection are in the directory 2-genoQC/.

In this subsection, the function genoQC() is mandatory and performs QC on the genotype data. removedSnpHetX(), removedMaleHetX(), and removeOutlierByPCs() are optional and can be used part of QC . removedSnpHetX() and removedMaleHetX() exclude heterozygous markers in male, and males with haploid heterozygous SNPs on chromosome X. Our default QC procedure does not remove any population outliers before imputation. However, if your study data includes significant population outliers and this may affect the qualtiy of subquent imputation analysis. For this, it is recommended to remove them by the function removeOutlierByPCs() after looking at PCA plot generated by plotPCA4plink().

Input files

Genotype data with updated genotype information:../1-genoUpdate/1_11_removedYMtSnp.#

Processing example

This step does QC in a single function, including all required substeps to ensure that the QC-ed genotype data is of high quality prior to imputation.

inputPrefix <- "2_02_removedInstHetX"
outputPrefix <- "2_13_removedSnpHweFemaleX" 
genoQC(plink, inputPrefix, 
       snpMissCutOffpre=0.05, 
       sampleMissCutOff=0.02, 
       Fhet=0.2, cutoffSubject=0.05, cutoffSNP=0.1,
       snpMissCutOffpost=0.02, 
       snpMissDifCutOff=0.02,
       femaleChrXmissCutoff=0.05, 
       pval4autoCtl=0.000001, 
       pval4femaleXctl=0.000001, outputPrefix, keepInterFile=TRUE)

Individual steps

Alternatively, we elaborate step by step instructions of QC procedure in genoQC() for better understanding. Note that, the output files of each step are the input of the next step.

Determine for each SNP of the chromosome 23 from the genotype data the number of male instances which have the value one as the minor allele count for that SNP. Then, remove all SNPs which number is higher than 0.5 % of the number of male instances. Note that 0.5% is just an example, this should be dataset specific.
Primary output files: 2_01_removedSnpHetX.#
Output file with the number of instances with heterozygous alleles for each SNP of the chromosome 23 before SNP removal (each line contains a SNP name and the respective number, lines are sorted descending by number): 2_01_snpHHfreqAll.txt
Output file with the number of instances with heterozygous alleles for each SNP of the chromosome 23 after SNP removal:2_01_snpHHfreqRetained.txt

inputPrefix <- "1_11_removedYMtSnp"
hhCutOff <- 0.005 ##  can be tuned
outputPrefix <- "2_01_removedSnpHetX" 
outputHetSNPfile <- "2_01_snpHHfreqAll.txt"
outputRetainSNPfile <- "2_01_snpHHfreqRetained.txt"
removedSnpHetX(plink, inputPrefix, hhCutOff, outputPrefix, 
               outputHetSNPfile, outputRetainSNPfile)

Determine for each male instance the number of SNPs of the chromosome 23 which have the value one as the minor allele count for that instance. Next, remove all instances which number is higher than 15. Note that 15 is just an example, this should be dataset specific.
Primary output files: 2_02_removedInstHetX.#
Output file stores male subjects that have heterozygous SNPs with their frequency (if any): 2_02_instHetXfreqAll.txt
Output file stores male subjects that have heterozygous SNPs with their frequency after 'improper' subject removal (if any):2_02_instHetXfreqRetained.txt
Output file stores all heterozygous SNPs with their frequency (the number of males for this SNP): 2_02_snpHHfreqAll.txt

inputPrefix <- "2_01_removedSnpHetX"
hhSubjCutOff <- 15
outputPrefix <- "2_02_removedInstHetX"
outputSubjHetFile <- "2_02_instHetXfreqAll.txt" 
outputRetainSubjectFile <- "2_02_instHetXfreqRetained.txt"  
outputHetSNPfile <- "2_02_snpHHfreqAll.txt"
removedMaleHetX(plink, inputPrefix, hhSubjCutOff, 
                outputPrefix, outputSubjHetFile, 
                outputRetainSubjectFile, outputHetSNPfile)

These two steps are automatically skipped if you are working on the autosomal data.

Set all heterozygous alleles of SNPs of the chromosome 23 for males (i.e. when the SNP has the value one as the minor allele count) as missing.
Output files: 2_03_setHeteroHaploMissing.#

outputPrefix3 <- "2_03_setHeteroHaploMissing" 
setHeteroHaploMissing(plink, inputPrefix, outputPrefix=outputPrefix3)

Remove SNPs with missingness >= 0.05 (before instance removal).
Output files: 2_04_removedSnpMissPre.#

outputPrefix4 <- "2_04_removedSnpMissPre" 
removedSnpMiss(plink, snpMissCutOff=0.05, 
               inputPrefix=outputPrefix3, outputPrefix=outputPrefix4)

Remove instances with missingness >= 0.02.
Output files: 2_05_removedInstMiss.#

outputPrefix5 <- "2_05_removedInstMiss" 
removedInstMiss(plink, sampleMissCutOff=0.02,  
                inputPrefix=outputPrefix4, outputPrefix=outputPrefix5)

Remove instances with great autosomal heterozygosity deviation, i.e. with |Fhet| >= 0.2
Output files: 2_06_removedInstFhet.#

outputPrefix6 <- "2_06_removedInstFhet" 
removedInstFhet(plink, Fhet=0.2, 
                inputPrefix=outputPrefix5, outputPrefix=outputPrefix6)

Replace the paternal ID and maternal ID of instances (childs) by the value zero if the paternal ID and the maternal ID do not belong to any instance (parent) with the same family ID as the child.
Output files: 2_07_removedParentIdsMiss.#

outputPrefix7 <- "2_07_removedParentIdsMiss" 
removedParentIdsMiss(plink, inputPrefix=outputPrefix6, 
                     outputPrefix=outputPrefix7)

Exclude subjects and/or genetic variants (SNPs) based on Mendel errors for the family data (trio/duo). (Optional, if no family data)
Output files: 2_08_removedMendelErr.#

outputPrefix8 <- "2_08_removedMendelErr" 
removedMendelErr(plink, inputPrefix=outputPrefix7, cutoffSubject, 
                 cutoffSNP, outputPrefix=outputPrefix8)

Remove SNPs with missingness >= 0.02 (after instance removal).
Output files: 2_09_removedSnpMissPost.#

outputPrefix9 <- "2_09_removedSnpMissPost" 
removedSnpMiss(plink, snpMissCutOff=0.02, 
               inputPrefix=outputPrefix8, outputPrefix=outputPrefix9)

Remove SNPs with difference >= 0.02 of SNP missingness between cases and controls.
Output files: 2_10_removedSnpMissDiff.#

outputPrefix10 <- "2_10_removedSnpMissDiff"
groupLabel <- getGroupLabel(inputFAMfile=paste0(outputPrefix8, ".fam"))
removedSnpMissDiff(plink, inputPrefix=outputPrefix9, snpMissDifCutOff=0.02, 
                   outputPrefix=outputPrefix10, groupLabel)

Remove chrX SNPs with missingness >= 0.05 in females. (Optional, if no chrX data)
Output files: 2_11_removedSnpFemaleChrXmiss.#

outputPrefix11 <- "2_11_removedSnpFemaleChrXmiss" 
removedSnpFemaleChrXmiss(plink, femaleChrXmissCutoff=0.05, 
                         inputPrefix=outputPrefix10, 
                         outputPrefix=outputPrefix11)

Remove autosomal SNPs with Hardy-Weinberg equilibrium p < 10-6 in controls.
Primary output files: 2_12_removedSnpHweAuto.#
Output file with HWE p-value for the autosomal SNPs before removing, sorted ascending by the p-values: 2_12_snpHwePvalAuto.txt
Output file with the removed SNP names: 2_12_snpRemovedHweAuto.txt

outputPvalFile <- "2_12_snpHwePvalAuto.txt" 
outputSNPfile <-  "2_12_snpRemovedHweAuto.txt" 
outputPrefix12 <- "2_12_removedSnpHweAuto" 
if (groupLabel == "control" | groupLabel == "caseControl"){
    removedSnpHWEauto(groupLabel="control", plink, 
                      inputPrefix=outputPrefix11, 
                      pval=0.000001, outputPvalFile, 
                      outputSNPfile, outputPrefix=outputPrefix12)
}

Remove chrX SNPs with HWE p < 10-6 in female controls. (Optional, if no chrX data)
Primary output files: 2_13_removedSnpHweFemaleXct.#
Output file with HWE p-value for the female chrX SNPs before removing, sorted ascending by the p-values:2_13_snpHwePvalfemaleXct.txt
Output file with the removed SNP names:2_13_snpRemovedHweFemaleXct.txt

outputPrefix <- "2_13_removedSnpHweFemaleX"  
outputPvalFile <- "2_13_snpHwePvalfemaleXct.txt" 
outputSNPfile <- "2_13_snpRemovedHweFemaleXct.txt"  
removedSnpFemaleChrXhweControl(plink, inputPrefix=outputPrefix12, 
                               pval=0.000001, outputPvalFile,
                               outputSNPfile, outputPrefix)

Calculate the PCA (with plots) for 2_13_removedSnpHweFemaleXct.# and remove the outliers from that dataset.
Primary output files: 2_14_removedOutliers.#
Output file with the IDs of the instances and the values of their eigenvectors from the PCA after removal of instances: 2_14_eigenvalAfterQC.txt
Output file with the plot of the first two PCs from the PCA before removal of instances: 2_14_eigenvalAfterQC.png
Output file with the plot of the first two PCs from the PCA after removal of instances: 2_14_removedOutliers.png
Output file with the IDs of the removed instances and the values of their eigenvectors, sorted ascending by the PC values: 2_14_eigenval4outliers.txt

inputPrefix <- "2_13_removedSnpHweFemaleX" 
outputPC4subjFile <- "2_14_eigenvalAfterQC.txt"
outputPCplotFile <- "2_14_eigenvalAfterQC.png" 
plotPCA4plink(gcta, inputPrefix, nThread=20, outputPC4subjFile, outputPCplotFile)  

cutoff <-  c(-0.4, 0.2)
cutoffSign <- "greater" ## not used if cutoff == NULL, and with two values 
inputPC4subjFile <- "2_14_eigenvalAfterQC.txt"
outputPC4outlierFile <- "2_14_eigenval4outliers.txt"
outputPCplotFile <- "2_14_removedOutliers.png"
outputPrefix <- "2_14_removedOutliers" 
removeOutlierByPCs(plink, gcta, inputPrefix, nThread=20, 
                   cutoff, cutoffSign, inputPC4subjFile, 
                   outputPC4outlierFile, outputPCplotFile, outputPrefix)

Alignment with the reference

All files named in this subsection are in the directory 3-checkAlign/.

The function checkAlign2ref() performs the alignment against a given reference
panel by considering the following parameters: variant name, genomic position and
the allele presentation. Output files are generated in order.

Input files

PLINK files after quality control and population outlier detection: 2_14_removedOutliers.#
Imputation reference files.

Detailed steps

If the genome build of already QC-ed is different from the genome build of the imputation reference files, then lift the data from the first genome build to the second. If the genome build is the same, then just copy the input files.
Output files: 3_1_QCdata.#
Remove SNPs for which the name has a different position (i.e. combination of base pair position and chromosome) in the imputation reference files.
Primary output files: 3_2_removedSnpDiffNamePos.#
Output file with the SNPs names for which a different position is contained in the imputation reference files: 3_2_snpDiffNamePos.txt
Remove SNPs for which the combination of base pair position and chromosome is not contained in the imputation reference files (ignoring the SNP name).
Primary output files: 3_3_removedSnpMissPos.#
Output file with the SNPs names for which the combination of base pair position and chromosome is not contained in the imputation reference files: 3_3_snpMissPos.txt
Remove SNPs from which the combination of base pair position and chromosome (ignoring the SNP name) has an allele which is not in the imputation reference files for that combination of base pair position and chromosome.
Primary output files: 3_4_removedSnpDiffAlleles.#
Output file with the removed SNP names: 3_4_snpDiffAlleles.txt
Output file with the retained SNPs: 3_4_snpImpRefAlleles.txt

Processing example

renamePlinkBFile(inputPrefix="2_14_removedOutliers", 
                 outputPrefix="3_1_QCdata", action="move")
inputFile <- paste0(impRefDIR,"*.legend.gz")  
## To recode the intermediate disk usage 
## keep in the same directory.
bimReferenceFile <- "bimImputeRef.txt"
.prepareLegend2bim(inputFile, referencePanel, 
                   outputFile=bimReferenceFile, ncore=nCores) 

inputPrefix <- "3_1_QCdata" 
out2.snp <- "3_2_snpSameNameDiffPos"
out2 <- "3_2_removedSnpSameNameDiffPos"
out3 <- "3_3_removedSnpMissPos"
out3.snp <- "3_3_snpMissPos"
out4 <- "3_4_removedSnpDiffAlleles"
out4.snp <- "3_4_snpDiffAlleles"
out4.snpRetained <- "3_4_snpImpRefAlleles"
nCores <- detectCores() 
checkAlign2ref(plink, inputPrefix, referencePanel, bimReferenceFile, 
               out2, out2.snp, out3, out3.snp, 
               out4, out4.snp, out4.snpRetained, nCore=nCores)

Imputation

All files named in this subsection are in the directory 4-imputation/.

In this subsection, three major functions removedMonoSnp(), phaseImpute() and postImpQC() are used for the purpose of excluding monomorphic SNPs, phasing, imputation and post imputation QC.

Input files

PLINK genotyping data after QC and the alignment with the imputation reference files: ../3-checkAlign/3_4_removedSnpDiffAlleles.#
Imputation reference files.

Detailed steps and processing examples

Remove monomorphic SNPs from the well aligned genotyping data.
Primary output files: 4_1_removedMonoSnp.#
Output file with the removed monomorphic SNPs: 4_1_snpMonoRemoved.txt

outputPrefix <- "4_1_removedMonoSnp"
outputMonoSNPfile <- "4_1_snpMonoRemoved.txt"  
removedMonoSnp(plink, inputPrefix="3_4_removedSnpDiffAlleles", 
                outputPrefix, outputSNPfile=outputMonoSNPfile)

Use the imputation reference files to generate pre-phase haplotypes by SHAPEIT and then do the imputation by using IMPUTE2. GTOOL is used for format conversion. In the case of using IMPUTE4, QCTOOL is used instead of GTOOL.
Output files: 4_2_imputedDataset.#
Output file with the info scores of all SNPs, which consists of two columns, separated by whitespaces, the first the SNPs and the second the info scores: 4_2_snpImputedInfoScore.txt. This is also the output from last step: impute2infoUpdated.txt.

inputPrefix <- "4_1_removedMonoSnp"  
outputPrefix <- "4_2_imputedDataset"   
outputInfoFile <- "4_2_snpImputedInfoScore.txt"
tmpImputeDir <- paste0(imputeTool, "_", referencePanel)
nCores <- detectCores()
phaseImpute(inputPrefix, outputPrefix, autosome=TRUE,
            plink, shapeit, imputeTool, impute, qctool, gtool, 
            windowSize=3000000, effectiveSize=20000, 
            nCore=nCores, threshold=0.9, outputInfoFile, SNP=TRUE, 
            referencePanel, impRefDIR, tmpImputeDir, keepTmpDir=TRUE)

Remove imputed SNPs with (info < 0.6) where the value info comes from the files *.impute2_info from the imputation.
Output files: 4_3_wellImputeData.#
Remove all SNPs (if any) which have the same position as a SNP in 4_1_snpMonoRemoved.txt has in aligned instance data.
Output files: 4_4_removedMonoSnpAfter.#
Add previous identified monomorphic SNPs (if any) in 4_1_snpMonoRemoved.txt with their values from the lifted instance data.
Output files: 4_5_addedMonoSnpAfter.#
Remove SNPs which have a non missing value for less then 20 instances.
Primary output files: 4_6_removedSnpMissPostImp.#
Output file with the removed SNP names: 4_6_snpRemovedMissPostImp.txt
Output file with the remaining SNP names: 4_6_snpRetainMissPostImp.txt

inputPrefix <- "4_2_imputedDataset" 
out1 <- "4_3_wellImputeData"
out2 <- "4_4_removedMonoSnpAfter"
out3 <- "4_5_addedMonoSnpAfter"
out4 <- "4_6_removedSnpMissPostImp" 
outputInfoFile <- "4_2_snpImputedInfoScore.txt"
outputMonoSNPfile <- "4_1_snpMonoRemoved.txt" 
prefixAlign2ref <- "3_4_removedSnpDiffAlleles" 
outRemovedSNPfile <- "4_6_snpRemovedMissPostImp.txt"
outRetainSNPfile <- "4_6_snpRetainMissPostImp.txt"

postImpQC(plink, inputPrefix, out1, out2, out3, out4,
          outputInfoFile, infoScore=0.6, outputMonoSNPfile, 
          prefixAlign2ref, missCutoff=20, outRemovedSNPfile, 
          outRetainSNPfile, referencePanel)

Reduction and expansion

All files named in this subsection are in the directory 5-reductAndExpand/.

reductExpand() function extracts the genotypes from well imputed data so that PLINK data with only SNPs before imputation are remained. On the basis of extracted genotypes, a subset of non-imputed genotypes that are different from the imputation reference panel are then added, including SNPs with different alleles, missing positions, and different positions.

Input files

The imputed dataset:../4-imputation/4_6_removedSnpMissPostImp.#
The SNPs before imputation:../3-checkAlign/3_4_snpImpRefAlleles.txt
The file with the SNPs with different alleles:../3-checkAlign/3_4_snpDiffAlleles.txt
The SNPs with missing positions: ../3-checkAlign/3_3_snpMissPos.txt
The SNPs with different positions: ../3-checkAlign/3_2_snpSameNameDiffPos.txt
The dataset before removing SNPs for the alignment with the imputation reference: ../3-checkAlign/3_1_QCdata.#

Detailed steps

Firt of all, copy all the input files into the directory ./5-reductAndExpand/. Then do the reduction of the imputed dataset to the SNPs before imputation.
Output files: 5_1_reducedToSpecific.#
Add the SNPs (if any) with different alleles from the dataset before removing SNPs.
Output files: 5_2_specificDiffAllele.#
Add the SNPs (if any) with missing positions from the dataset before removing SNPs.
Output files: 5_3_specificMissPos.#
Add the SNPs (if any) with different positions from the dataset before removing SNPs.
Output files: 5_4_specificDiffPos.#

Processing example

inputPrefix <- "4_6_removedSnpMissPostImp"
inputQCprefix <- "3_1_QCdata"
snpRefAlleleFile <- "3_4_snpImpRefAlleles.txt"
snpDiffAlleleFile <- "3_4_snpDiffAlleles.txt"
snpMissPosFile <- "3_3_snpMissPos.txt"
snpSameNameDifPosFile <- "3_2_snpSameNameDiffPos.txt" 

dir5 <- "./5-reductAndExpand/"
## imputed dataset
system(paste0("scp ./4-imputation/", inputPrefix, ".* ", dir5)) 
system(paste0("scp ./3-checkAlign/", inputQCprefix, ".* ", dir5))
system(paste0("scp ./3-checkAlign/", snpRefAlleleFile, " ", dir5))
system(paste0("scp ./3-checkAlign/", snpDiffAlleleFile, " ", dir5))
system(paste0("scp ./3-checkAlign/", snpMissPosFile, " ", dir5))
system(paste0("scp ./3-checkAlign/", snpSameNameDifPosFile, " ", dir5))

setwd(dir5)
reducedToSpecificfn <- "5_1_reducedToSpecific"
specificDiffAllelefn <- "5_2_specificDiffAllele"
specificMissPosfn <- "5_3_specificMissPos"
specificDiffPosfn <- "5_4_specificDiffPos" 

reductExpand(plink, referencePanel, inputPrefix, inputQCprefix, 
             snpRefAlleleFile, snpDiffAlleleFile, 
             snpMissPosFile, snpSameNameDifPosFile, 
             reducedToSpecificfn, specificDiffAllelefn, 
             specificMissPosfn, specificDiffPosfn)
setwd("..")

Final results

Copy and rename the results of interest to the directory ./6-finalResults.

The metadata information file, the final well imputed and a subset of imputed dataset (dataset-specific) are essential for the downstream analysis.

dir6 <- "./6-finalResults/"
system(paste0("scp ./1-genoUpdate/1_01_metaData.txt ", dir6))
system(paste0("scp ./4-imputation/4_6_removedSnpMissPostImp.* ", dir6))  
system(paste0("scp ./5-reductAndExpand/5_4_specificDiffPos.* ", dir6))
setwd(dir6)
renamePlinkBFile(inputPrefix="4_6_removedSnpMissPostImp", 
                 outputPrefix="imputedSnpsDataset", action="move")
renamePlinkBFile(inputPrefix="5_4_specificDiffPos", 
                 outputPrefix="specificSnpsDataset", action="move")

Extending pipelines

Gimpute's modular structure allows the integration of other existing imputation workflows, allowing users to choose their preferred imputation strategy. For demonstration, we have embedded Genipe as an external imputation tool [8].

Integrate with Genipe

Installation

To set up Genipe, please refer to Genipe's installation page for more details.

Additional configuration files

Imputation reference panels

The imputation reference can be downloaded directly from IMPUTE2's reference page: 1000 Genome phase 3 or Genipe's page.

The human reference files

This reference genome is used for the strand check, you can download from Genipe's page.

Imputation with Genipe

Prepare additional configuration files.

## example
impRefDIR4genipe <- "../1000GP_Phase3/"
fastaFile <- "../hg19/hg19.fasta"

Imputing genotypes using Genipe.

chrs =22
inputPrefix <- "3_4_removedSnpDiffAlleles"
thread4impute2 <- 20 ## tune by yourself
thread4shapeit <- 30
segmentSize <- 3000000
imputeTool <- "impute2" ## only impute2 
imputedByGenipe(chrs, impRefDIR4genipe, inputPrefix, shapeit, impute, plink, fastaFile, segmentSize, thread4impute2, thread4shapeit)

Merge chunk-specific region using Genipe.

chr <- 2 
inputImpute2 <- 'chr2.33000001_36000000.impute2'
probability <- 0.9
completionRate <- 0.98
# info <- 0.6
outputPrefix <- paste0('imputedChr', chr)
mergeByGenipe(inputImpute2, chr, probability, completionRate, info, outputPrefix)

Extract imputed results chromosome-wise by Genipe.

## extract imputed markers using Genipe 
chr <- 3
inputImpute2 <- paste0('chr', chr,'.imputed.impute2')
inputMAP <- paste0('chr', chr,'.imputed.map')
format <- 'bed'
prob <- 0.9
outputPrefix <- paste0('imputedChr', chr)  
extractByGenipe(inputImpute2, inputMAP, outputPrefix, format, prob)

## sessionInfo()

# R version 3.5.0 (2018-04-23)
# Platform: x86_64-redhat-linux-gnu (64-bit)
# Running under: CentOS Linux 7 (Core)

# Matrix products: default
# BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

# locale:
#  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
# [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

# attached base packages:
# [1] parallel  stats     graphics  grDevices utils     datasets  methods  
# [8] base     

# other attached packages:
# [1] doParallel_1.0.11 iterators_1.0.10  foreach_1.4.4     lattice_0.20-35  
# [5] Gimpute_0.99.9   

# loaded via a namespace (and not attached):
# [1] compiler_3.5.0   codetools_0.2-15 grid_3.5.0

References

[1] Marchini, J., et al. (2010). Genotype imputation for genome-wide association studies. Nat Rev Genet 11(7): 499-511.

[2] Marchini, J., et al. (2007). A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet 39(7): 906-913.

[3] Schizophrenia Working Group of the Psychiatric Genomics, C. (2014). Biological insights from 108 schizophrenia-associated genetic loci. Nature 511(7510): 421-427.

[4] Purcell, Shaun, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics 81.3 (2007): 559-575.

[5] Howie, B., et al. (2012). Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet 44(8): 955-959.

[6] Howie, B. N., et al. (2009). A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 5(6): e1000529.

[7] Bycroft, Clare, Colin Freeman, Desislava Petkova, Gavin Band, Lloyd T. Elliott, Kevin Sharp, Allan Motyer et al. Genome-wide genetic data on~ 500,000 UK Biobank participants. BioRxiv (2017): 166298.

[8] Lemieux Perreault, L. P., et al. (2016). genipe: an automated genome-wide imputation pipeline with automatic reporting and statistical tools. Bioinformatics, 32(23), 3661-3663.

Trouble-shooting

If you have any trouble or comment ? --> Contact: junfang.chen@zi-mannheim.de

transbioZI/Gimpute documentation built on April 10, 2022, 4:20 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

transbioZI/Gimpute Gimpute: An efficient genetic data processing and imputation pipeline

In transbioZI/Gimpute: Gimpute: An efficient genetic data processing and imputation pipeline

About Gimpute

Background

Scope of the pipeline

Getting started

Prerequisites

Installation

Experimental data

Get it Setup

File naming conventions

Directory naming conventions

Required packages and tools

Configuration

Gimpute flowchart

Running pipeline

Genotype information update

Input files

Processing example

Individual steps

Quality control

Input files

Processing example

Individual steps

Alignment with the reference

Input files

Detailed steps

Processing example

Imputation

Input files

Detailed steps and processing examples

Reduction and expansion

Input files

Detailed steps

Processing example

Final results

Extending pipelines

Integrate with Genipe

Installation

Additional configuration files

Imputation with Genipe

References

Trouble-shooting

R Package Documentation

Browse R Packages

We want your feedback!

transbioZI/Gimpute
Gimpute: An efficient genetic data processing and imputation pipeline