Home

/

R-Forge

/

LDExplorer

/

mig: Memory-efficient Implementation of Gabriel et al. 2002...

mig: Memory-efficient Implementation of Gabriel et al. 2002...
In LDExplorer: Efficient Whole-Genome LD Based Haplotpe Block Recognition

Description Usage Arguments Haplotype Blocks Modeling the D' Distribution Search Space Pruning Output File Author(s) References Examples

Function for the efficient whole-genome haplotype block partitioning. Haplotype blocks are defined based on D' coefficient of linkage disequilibrium (Gabriel et al., 2002).

	mig(phase_file, output_file, phase_file_format = "VCF", 
	map_file = NULL, region = NULL, maf = 0.0, ci_method = "WP",
	l_density = 100, ld_ci = c(0.7, 0.98), ehr_ci = 0.9, 
	ld_fraction = 0.95, pruning_method = "MIG++", window = NULL)

`phase_file`	Name of the input file with phased genotypes in VCF, HAPMAP2 or IMPUTE2 format.
`output_file`	Name of the output file where to store the haplotype blocks.
`phase_file_format`	Format of the phase_file: VCF (default), HAPMAP2 or IMPUTE2. If VCF, then only SNPs with "PASS" or "." in the FILTER field are considered.
`map_file`	Name of the map file with base-pair positions of each SNP. Mandatory when file_format = HAPMAP2.
`region`	Numeric vector with start and end positions (in base-pairs) of the chromosomal region to be partitioned. If NULL (default), then the whole chromosome is processed.
`maf`	Minor Allele Frequency (MAF) threshold: SNPs with MAF <= maf will not be considered. The threshold may vary from 0 (default) to 0.5.
`ci_method`	Confidence interval (CI) estimation method. Supported methods are WP (default) = Wall and Pritchard (2003) method; AV = approximate variance estimator by Zapata et al. (1997).
`l_density`	Number of points at which to evaluate the likelihood (applies only to the WP method). Default is 100. The higher the number the longer the runtime. The lower the number the lower the precision.
`ld_ci`	Numeric vector with 2 values: thresholds for the lower bound (CL) and upper bound (CU) of the 90% CI of D'. Following Gabriel et al. (2002), default is c(0.7, 0.98).
`ehr_ci`	Threshold value for the evidence of historical recombination. Following Gabriel et al. (2002), default is 0.9.
`ld_fraction`	Fraction of strong LD SNP pairs over all informative pairs that is needed to classify a sequence of SNP as a haplotype block. Following Gabriel et al. (2002), default is 0.95.
`pruning_method`	Name of a search space pruning method. Supported methods are MIG, MIG+ and MIG++ (default).
`window`	Number of SNPs within the window in MIG++ search space pruning method. If NULL (default), it is calculated on the fly based on the region length and ld_fraction.

The haplotype blocks are defined based on D' coefficient of linkage disequilibrium (LD) between a pair of SNPs (Gabriel et al., 2002).

SNP Pairs Classification

According to the lower (CL) and upper (CU) bounds of the 90% confidence interval (CI) of the |D'| statistic, LD between any SNP pair is classified as follows:
(1) strong LD if CL >= 0.7 and CU >= 0.98;
(2) strong evidence of historical recombination (strong EHR) if CU < 0.9.

All SNP pairs satisfying conditions (1) and (2) are called informative.

Haplotype Block Definition

A chromosomal region is a haplotype block if:
(a) the two outer-most SNPs are in strong LD;
(b) at least 95% of informative SNP pairs are in strong LD.

Wall and Pritchard (WP) Method

This method was described in Wall and Pritchard (2003) and is implemented in the Haploview (Barrett et al., 2005). The true allele frequencies of each SNP are assumed to be equal to the observed allele frequencies. The likelihood of the data in the four-fold table obtained by crossing any SNP pair, conditional to the |D'| value, can be expressed as l = P(data | |D'|). It is evaluated at each value of |D'| = 0.01 * p, with p = 0, 1, ..., 100. The CL is defined as the largest value of |D'| such that ∑_{i=0}^{p-1} l(i) / ∑_{i=0}^{100} l(i) ≤ α. Similarly, the CU is defined as the smallest value of |D'| such that ∑_{i=p+1}^{1000} l(i) / ∑_{i=0}^{100} l(i) ≤ α.

Approximate Variance (AV) Method

The CI of D' is computed based on the approximate variance of D' (Zapata et al., 1997). When D' is equal to 1 or -1, then the approximate variance V(\hat{D'}) = 0. Otherwise, the 1 - α CI of D' is equal to \hat{D'} \pm Z_{α / 2} √{V(\hat{D'})}, where Z_{α / 2} is the 1 - α / 2 percentile of the standard normal distribution.

MIG

The MIG is a memory efficient implementation of Gabriel et al. 2002 haplotype block definition. It has extremely low memory requirements (linear memory complexity) and is able to represent LD structure of the entire chromosome (millions of SNPs) in the main memory. The MIG doesn't prune the search space i.e. it computes LD for all possible SNP pairs. Therefore, its runtime is equivalent to the runtime of Haploview (Barrett et al., 2005) algorithm.

MIG+

The MIG+ identifies and omits computations in regions that can never become haplotype blocks. The chromosome is processed from left to right. Assuming all unprocessed SNP pairs to the right being in strong LD, an early termination point is determined using already known LD structure to the left. It is theoretically guaranteed, that haplotype blocks produced by MIG and MIG+ are identical. The MIG+ has low memory requirements and improved runtime compared to MIG.

MIG++

The MIG++ is the improved version of MIG+ search space pruning method. The chromosome is processed in several iterations. At every iteration it is scanned from the very start to the very end and the computations are restricted by a user-defined window. The window is expressed as a maximal number of SNPs that will be considered when computing LD with the currently considered SNP. In every new iteration, the partially calculated LD structure across the whole chromosome allows to predict termination points better. It is theoretically guaranteed, that haplotype blocks produced by MIG, MIG+ and MIG++ are identical. The MIG++ has low memory requirements and significantly better runtime compared to MIG and MIG+.

The output file consists of the following columns:

BLOCK_NAME	Generated unique block name
FIRST_SNP	Name of the first SNP in block
LAST_SNP	Name of the last SNP in block
FIRST_SNP_ID	Index of the first SNP in block with respect to the filtered SNPs
LAST_SNP_ID	Index of the last SNP in block with respect to the filtered SNPs
START_BP	The base-pair position of the first SNP in block
END_BP	The base-pair position of the last SNP in block
N_SNPS	Number of SNPs in block
N_HAPS	Number of haplotypes in block
N_UNIQUE_HAPS	Number of unique haplotypes in block
N_COMMON_HAPS	Number of common (which appear more than once) haplotypes in block
N_HAPS_DIVERSITY	The haplotype diversity in block (Patil et al., 2001). 1 - low diversity, 0 - high diversity.

Daniel Taliun, Johann Gamper, Cristian Pattaro

Zapata, C., Alvarez, G., Carollo, C. (1997) Approximate variance of the standardized measure of gametic disequilibrium D'. American Journal of Human Genetics, 61(3), 771–774.

Patil, N. et al. (2001) Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21. Science, 294(5547), 1719–1723.

Gabriel, S. B. et al. (2002) The Structure of Haplotype Blocks in the Human Genome. Science, 296(5576), 2225–2229.

Wall, J. D. and Pritchard, J. K. (2003) Assessing the performance of the haplotype block model of linkage disequilibrium. American Journal of Human Genetics, 73(3), 502–515.

Barrett, J. C. et al. (2005) Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics, 21(2), 263–265.

	
    # load LDExplorer library
    library(LDExplorer)
	
    # run mig() function on HapMap phase II CEU data with default arguments.
    mig(
     phase_file = "HapMapII_r22_CEU_chr20_31767872_33700401.phase.gz", 
     output_file = "HapMapII_r22_CEU_chr20_31767872_33700401.blocks.txt",
     phase_file_format = "HAPMAP2",
     map_file = "HapMapII_r22_CEU_chr20_31767872_33700401.legend.txt.gz"
    )
	
    # show contents of the output file
    file.show(
     "HapMapII_r22_CEU_chr20_31767872_33700401.blocks.txt",
     title="HapMapII_r22_CEU_chr20_31767872_33700401.blocks.txt"
    )
	
    # run mig() function on 1000 Genomes Project CEU data with default arguments.
    mig(
     phase_file = "1000G_phase1_v3_20101123_CEU_chr2_89153688_89307566.vcf.gz", 
     output_file = "1000G_phase1_v3_20101123_CEU_chr2_89153688_89307566.blocks.txt"
    )
	
    # show contents of the output file
    file.show(
     "1000G_phase1_v3_20101123_CEU_chr2_89153688_89307566.blocks.txt",
     title="1000G_phase1_v3_20101123_CEU_chr2_89153688_89307566.blocks.txt"
    )

LDExplorer documentation built on May 2, 2019, 5:54 p.m.

LDExplorer index

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

LDExplorer
Efficient Whole-Genome LD Based Haplotpe Block Recognition

mig: Memory-efficient Implementation of Gabriel et al. 2002...
In LDExplorer: Efficient Whole-Genome LD Based Haplotpe Block Recognition

Description

Usage

Arguments

Haplotype Blocks

SNP Pairs Classification

Haplotype Block Definition

Modeling the D' Distribution

Wall and Pritchard (WP) Method

Approximate Variance (AV) Method

Search Space Pruning

MIG

MIG+

MIG++

Output File

Author(s)

References

Examples

Related to mig in LDExplorer...

R Package Documentation

Browse R Packages

We want your feedback!

LDExplorer Efficient Whole-Genome LD Based Haplotpe Block Recognition

mig: Memory-efficient Implementation of Gabriel et al. 2002... In LDExplorer: Efficient Whole-Genome LD Based Haplotpe Block Recognition

Description

Usage

Arguments

Haplotype Blocks

SNP Pairs Classification

Haplotype Block Definition

Modeling the D' Distribution

Wall and Pritchard (WP) Method

Approximate Variance (AV) Method

Search Space Pruning

MIG

MIG+

MIG++

Output File

Author(s)

References

Examples

Related to mig in LDExplorer...

R Package Documentation

Browse R Packages

We want your feedback!

LDExplorer
Efficient Whole-Genome LD Based Haplotpe Block Recognition

mig: Memory-efficient Implementation of Gabriel et al. 2002...
In LDExplorer: Efficient Whole-Genome LD Based Haplotpe Block Recognition