Description Usage Arguments Haplotype Blocks Search Space Pruning Output File Author(s) References Examples
Function for the efficient whole-genome haplotype block partitioning. Haplotype blocks are defined based on r^2 coefficient of linkage disequilibrium.
1 2 3 |
phase_file |
Name of the input file with phased genotypes in VCF, HAPMAP2 or IMPUTE2 format. |
output_file |
Name of the output file where to store the haplotype blocks. |
phase_file_format |
Format of the phase_file: VCF (default), HAPMAP2 or IMPUTE2. If VCF, then only SNPs with "PASS" or "." in the FILTER field are considered. |
map_file |
Name of the map file with base-pair positions of each SNP. Mandatory when file_format = HAPMAP2. |
region |
Numeric vector with start and end positions (in base-pairs) of the chromosomal region to be partitioned. If NULL (default), then the whole chromosome is processed. |
maf |
Minor Allele Frequency (MAF) threshold: SNPs with MAF <= maf will not be considered. The threshold may vary from 0 (default) to 0.5. |
weak_rsq |
r^2 threshold value for the SNP pairs in weak LD. A SNP pair is in weak LD if r^2 < weak_rsq. By default, weak_rsq = 0.5. |
strong_rsq |
r^2 threshold value for the SNP pairs in strong LD. A SNP pair is in strong LD if r^2 >= strong_rsq. By default, strong_rsq = 0.8. |
fraction |
Fraction of strong LD SNP pairs over all informative SNP pairs that is needed to classify a sequence of SNP as a haplotype block. Informative SNP pairs are those that are in strong LD or in weak LD. Default is 0.95. |
pruning_method |
Name of a search space pruning method. Supported methods are MIG, MIG+ and MIG++ (default). |
window |
Number of SNPs within the window in MIG++ search space pruning method. If NULL (default), it is calculated on the fly based on the region length and ld_fraction. |
The haplotype blocks are defined based on r^2 coefficient of linkage disequilibrium (LD) between a pair of SNPs following the logic suggested by Gabriel et al., 2002.
According to the r^2 statistic, LD between any SNP pair is classified as follows:
(1) strong LD if r^2 >= 0.8;
(2) weak LD if r^2 < 0.5;
All SNP pairs satisfying conditions (1) and (2) are called informative.
A chromosomal region is a haplotype block if:
(a) the two outer-most SNPs are in strong LD;
(b) at least 95% of informative SNP pairs within the region are in strong LD.
The MIG is a memory efficient implementation of Gabriel et al. 2002 haplotype block definition using the r^2 coefficient. It has extremely low memory requirements (linear memory complexity) and is able to represent LD structure of the entire chromosome (millions of SNPs) in the main memory. The MIG doesn't prune the search space i.e. it computes LD for all possible SNP pairs.
The MIG+ identifies and omits computations in regions that can never become haplotype blocks. The chromosome is processed from left to right. Assuming all unprocessed SNP pairs to the right being in strong LD, an early termination point is determined using already known LD structure to the left. It is theoretically guaranteed, that haplotype blocks produced by MIG and MIG+ are identical. The MIG+ has low memory requirements and improved runtime compared to MIG.
The MIG++ is the improved version of MIG+ search space pruning method. The chromosome is processed in several iterations. At every iteration it is scanned from the very start to the very end and the computations are restricted by a user-defined window. The window is expressed as a maximal number of SNPs that will be considered when computing LD with the currently considered SNP. In every new iteration, the partially calculated LD structure across the whole chromosome allows to predict termination points better. It is theoretically guaranteed, that haplotype blocks produced by MIG, MIG+ and MIG++ are identical. The MIG++ has low memory requirements and significantly better runtime compared to MIG and MIG+.
The output file consists of the following columns:
BLOCK_NAME | Generated unique block name |
FIRST_SNP | Name of the first SNP in block |
LAST_SNP | Name of the last SNP in block |
FIRST_SNP_ID | Index of the first SNP in block with respect to the filtered SNPs |
LAST_SNP_ID | Index of the last SNP in block with respect to the filtered SNPs |
START_BP | The base-pair position of the first SNP in block |
END_BP | The base-pair position of the last SNP in block |
N_SNPS | Number of SNPs in block |
N_HAPS | Number of haplotypes in block |
N_UNIQUE_HAPS | Number of unique haplotypes in block |
N_COMMON_HAPS | Number of common (which appear more than once) haplotypes in block |
N_HAPS_DIVERSITY | The haplotype diversity in block (Patil et al., 2001). 1 - low diversity, 0 - high diversity. |
Daniel Taliun, Johann Gamper, Cristian Pattaro
Patil, N. et al. (2001) Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21. Science, 294(5547), 1719–1723.
Gabriel, S. B. et al. (2002) The Structure of Haplotype Blocks in the Human Genome. Science, 296(5576), 2225–2229.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
# load LDExplorer library
library(LDExplorer)
# run mig_rsq() function on HapMap phase II CEU data with default arguments.
mig_rsq(
phase_file = "HapMapII_r22_CEU_chr20_31767872_33700401.phase.gz",
output_file = "HapMapII_r22_CEU_chr20_31767872_33700401.rsq_blocks.txt",
phase_file_format = "HAPMAP2",
map_file = "HapMapII_r22_CEU_chr20_31767872_33700401.legend.txt.gz"
)
# show contents of the output file
file.show(
"HapMapII_r22_CEU_chr20_31767872_33700401.rsq_blocks.txt",
title="HapMapII_r22_CEU_chr20_31767872_33700401.rsq_blocks.txt"
)
# run mig_rsq() function on 1000 Genomes Project CEU data with default arguments.
mig_rsq(
phase_file = "1000G_phase1_v3_20101123_CEU_chr2_89153688_89307566.vcf.gz",
output_file = "1000G_phase1_v3_20101123_CEU_chr2_89153688_89307566.rsq_blocks.txt"
)
# show contents of the output file
file.show(
"1000G_phase1_v3_20101123_CEU_chr2_89153688_89307566.rsq_blocks.txt",
title="1000G_phase1_v3_20101123_CEU_chr2_89153688_89307566.rsq_blocks.txt"
)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.