mig: Memory-efficient Implementation of Gabriel et al. 2002...

Description Usage Arguments Haplotype Blocks Modeling the D' Distribution Search Space Pruning Output File Author(s) References Examples

Description

Function for the efficient whole-genome haplotype block partitioning. Haplotype blocks are defined based on D' coefficient of linkage disequilibrium (Gabriel et al., 2002).

Usage

1
2
3
4
	mig(phase_file, output_file, phase_file_format = "VCF", 
	map_file = NULL, region = NULL, maf = 0.0, ci_method = "WP",
	l_density = 100, ld_ci = c(0.7, 0.98), ehr_ci = 0.9, 
	ld_fraction = 0.95, pruning_method = "MIG++", window = NULL)

Arguments

phase_file

Name of the input file with phased genotypes in VCF, HAPMAP2 or IMPUTE2 format.

output_file

Name of the output file where to store the haplotype blocks.

phase_file_format

Format of the phase_file: VCF (default), HAPMAP2 or IMPUTE2. If VCF, then only SNPs with "PASS" or "." in the FILTER field are considered.

map_file

Name of the map file with base-pair positions of each SNP. Mandatory when file_format = HAPMAP2.

region

Numeric vector with start and end positions (in base-pairs) of the chromosomal region to be partitioned. If NULL (default), then the whole chromosome is processed.

maf

Minor Allele Frequency (MAF) threshold: SNPs with MAF <= maf will not be considered. The threshold may vary from 0 (default) to 0.5.

ci_method

Confidence interval (CI) estimation method. Supported methods are WP (default) = Wall and Pritchard (2003) method; AV = approximate variance estimator by Zapata et al. (1997).

l_density

Number of points at which to evaluate the likelihood (applies only to the WP method). Default is 100. The higher the number the longer the runtime. The lower the number the lower the precision.

ld_ci

Numeric vector with 2 values: thresholds for the lower bound (CL) and upper bound (CU) of the 90% CI of D'. Following Gabriel et al. (2002), default is c(0.7, 0.98).

ehr_ci

Threshold value for the evidence of historical recombination. Following Gabriel et al. (2002), default is 0.9.

ld_fraction

Fraction of strong LD SNP pairs over all informative pairs that is needed to classify a sequence of SNP as a haplotype block. Following Gabriel et al. (2002), default is 0.95.

pruning_method

Name of a search space pruning method. Supported methods are MIG, MIG+ and MIG++ (default).

window

Number of SNPs within the window in MIG++ search space pruning method. If NULL (default), it is calculated on the fly based on the region length and ld_fraction.

Haplotype Blocks

The haplotype blocks are defined based on D' coefficient of linkage disequilibrium (LD) between a pair of SNPs (Gabriel et al., 2002).

SNP Pairs Classification

According to the lower (CL) and upper (CU) bounds of the 90% confidence interval (CI) of the |D'| statistic, LD between any SNP pair is classified as follows:
(1) strong LD if CL >= 0.7 and CU >= 0.98;
(2) strong evidence of historical recombination (strong EHR) if CU < 0.9.

All SNP pairs satisfying conditions (1) and (2) are called informative.

Haplotype Block Definition

A chromosomal region is a haplotype block if:
(a) the two outer-most SNPs are in strong LD;
(b) at least 95% of informative SNP pairs are in strong LD.

Modeling the D' Distribution

Wall and Pritchard (WP) Method

This method was described in Wall and Pritchard (2003) and is implemented in the Haploview (Barrett et al., 2005). The true allele frequencies of each SNP are assumed to be equal to the observed allele frequencies. The likelihood of the data in the four-fold table obtained by crossing any SNP pair, conditional to the |D'| value, can be expressed as l = P(data | |D'|). It is evaluated at each value of |D'| = 0.01 * p, with p = 0, 1, ..., 100. The CL is defined as the largest value of |D'| such that ∑_{i=0}^{p-1} l(i) / ∑_{i=0}^{100} l(i) ≤ α. Similarly, the CU is defined as the smallest value of |D'| such that ∑_{i=p+1}^{1000} l(i) / ∑_{i=0}^{100} l(i) ≤ α.

Approximate Variance (AV) Method

The CI of D' is computed based on the approximate variance of D' (Zapata et al., 1997). When D' is equal to 1 or -1, then the approximate variance V(\hat{D'}) = 0. Otherwise, the 1 - α CI of D' is equal to \hat{D'} \pm Z_{α / 2} √{V(\hat{D'})}, where Z_{α / 2} is the 1 - α / 2 percentile of the standard normal distribution.

Search Space Pruning

MIG

The MIG is a memory efficient implementation of Gabriel et al. 2002 haplotype block definition. It has extremely low memory requirements (linear memory complexity) and is able to represent LD structure of the entire chromosome (millions of SNPs) in the main memory. The MIG doesn't prune the search space i.e. it computes LD for all possible SNP pairs. Therefore, its runtime is equivalent to the runtime of Haploview (Barrett et al., 2005) algorithm.

MIG+

The MIG+ identifies and omits computations in regions that can never become haplotype blocks. The chromosome is processed from left to right. Assuming all unprocessed SNP pairs to the right being in strong LD, an early termination point is determined using already known LD structure to the left. It is theoretically guaranteed, that haplotype blocks produced by MIG and MIG+ are identical. The MIG+ has low memory requirements and improved runtime compared to MIG.

MIG++

The MIG++ is the improved version of MIG+ search space pruning method. The chromosome is processed in several iterations. At every iteration it is scanned from the very start to the very end and the computations are restricted by a user-defined window. The window is expressed as a maximal number of SNPs that will be considered when computing LD with the currently considered SNP. In every new iteration, the partially calculated LD structure across the whole chromosome allows to predict termination points better. It is theoretically guaranteed, that haplotype blocks produced by MIG, MIG+ and MIG++ are identical. The MIG++ has low memory requirements and significantly better runtime compared to MIG and MIG+.

Output File

The output file consists of the following columns:

BLOCK_NAME Generated unique block name
FIRST_SNP Name of the first SNP in block
LAST_SNP Name of the last SNP in block
FIRST_SNP_ID Index of the first SNP in block with respect to the filtered SNPs
LAST_SNP_ID Index of the last SNP in block with respect to the filtered SNPs
START_BP The base-pair position of the first SNP in block
END_BP The base-pair position of the last SNP in block
N_SNPS Number of SNPs in block
N_HAPS Number of haplotypes in block
N_UNIQUE_HAPS Number of unique haplotypes in block
N_COMMON_HAPS Number of common (which appear more than once) haplotypes in block
N_HAPS_DIVERSITY The haplotype diversity in block (Patil et al., 2001). 1 - low diversity, 0 - high diversity.

Author(s)

Daniel Taliun, Johann Gamper, Cristian Pattaro

References

Zapata, C., Alvarez, G., Carollo, C. (1997) Approximate variance of the standardized measure of gametic disequilibrium D'. American Journal of Human Genetics, 61(3), 771–774.

Patil, N. et al. (2001) Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21. Science, 294(5547), 1719–1723.

Gabriel, S. B. et al. (2002) The Structure of Haplotype Blocks in the Human Genome. Science, 296(5576), 2225–2229.

Wall, J. D. and Pritchard, J. K. (2003) Assessing the performance of the haplotype block model of linkage disequilibrium. American Journal of Human Genetics, 73(3), 502–515.

Barrett, J. C. et al. (2005) Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics, 21(2), 263–265.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
	
    # load LDExplorer library
    library(LDExplorer)
	
    # run mig() function on HapMap phase II CEU data with default arguments.
    mig(
     phase_file = "HapMapII_r22_CEU_chr20_31767872_33700401.phase.gz", 
     output_file = "HapMapII_r22_CEU_chr20_31767872_33700401.blocks.txt",
     phase_file_format = "HAPMAP2",
     map_file = "HapMapII_r22_CEU_chr20_31767872_33700401.legend.txt.gz"
    )
	
    # show contents of the output file
    file.show(
     "HapMapII_r22_CEU_chr20_31767872_33700401.blocks.txt",
     title="HapMapII_r22_CEU_chr20_31767872_33700401.blocks.txt"
    )
	
    # run mig() function on 1000 Genomes Project CEU data with default arguments.
    mig(
     phase_file = "1000G_phase1_v3_20101123_CEU_chr2_89153688_89307566.vcf.gz", 
     output_file = "1000G_phase1_v3_20101123_CEU_chr2_89153688_89307566.blocks.txt"
    )
	
    # show contents of the output file
    file.show(
     "1000G_phase1_v3_20101123_CEU_chr2_89153688_89307566.blocks.txt",
     title="1000G_phase1_v3_20101123_CEU_chr2_89153688_89307566.blocks.txt"
    )
	

LDExplorer documentation built on May 2, 2019, 5:54 p.m.

Related to mig in LDExplorer...