snpgdsLDpruning: Linkage Disequilibrium (LD) based SNP pruning

Description Usage Arguments Details Value Author(s) References See Also Examples

View source: R/LD.R

Description

Recursively removes SNPs within a sliding window

Usage

1
2
3
4
5
snpgdsLDpruning(gdsobj, sample.id=NULL, snp.id=NULL, autosome.only=TRUE,
    remove.monosnp=TRUE, maf=NaN, missing.rate=NaN,
    method=c("composite", "r", "dprime", "corr"), slide.max.bp=500000L,
    slide.max.n=NA, ld.threshold=0.2, start.pos=c("random", "first", "last"),
    num.thread=1L, verbose=TRUE)

Arguments

gdsobj

an object of class SNPGDSFileClass, a SNP GDS file

sample.id

a vector of sample id specifying selected samples; if NULL, all samples are used

snp.id

a vector of snp id specifying selected SNPs; if NULL, all SNPs are used

autosome.only

if TRUE, use autosomal SNPs only; if it is a numeric or character value, keep SNPs according to the specified chromosome

remove.monosnp

if TRUE, remove monomorphic SNPs

maf

to use the SNPs with ">= maf" only; if NaN, no MAF threshold

missing.rate

to use the SNPs with "<= missing.rate" only; if NaN, no missing threshold

method

"composite", "r", "dprime", "corr", see details

slide.max.bp

the maximum basepairs in the sliding window

slide.max.n

the maximum number of SNPs in the sliding window

ld.threshold

the LD threshold

start.pos

"random": a random starting position; "first": start from the first position; "last": start from the last position

num.thread

the number of (CPU) cores used; if NA, detect the number of cores automatically

verbose

if TRUE, show information

Details

The minor allele frequency and missing rate for each SNP passed in snp.id are calculated over all the samples in sample.id.

Four methods can be used to calculate linkage disequilibrium values: "composite" for LD composite measure, "r" for R coefficient (by EM algorithm assuming HWE, it could be negative), "dprime" for D', and "corr" for correlation coefficient. The method "corr" is equivalent to "composite", when SNP genotypes are coded as: 0 – BB, 1 – AB, 2 – AA. The argument ld.threshold is the absolute value of measurement.

It is useful to generate a pruned subset of SNPs that are in approximate linkage equilibrium with each other. The function snpgdsLDpruning recursively removes SNPs within a sliding window based on the pairwise genotypic correlation. SNP pruning is conducted chromosome by chromosome, since SNPs in a chromosome can be considered to be independent with the other chromosomes.

The pruning algorithm on a chromosome is described as follows (n is the total number of SNPs on that chromosome):

1) Randomly select a starting position i (start.pos="random"), i=1 if start.pos="first", or i=last if start.pos="last"; and let the current SNP set S={ i };

2) For each right position j from i+1 to n: if any LD between j and k is greater than ld.threshold, where k belongs to S, and both of j and k are in the sliding window, then skip j; otherwise, let S be S + { j };

3) For each left position j from i-1 to 1: if any LD between j and k is greater than ld.threshold, where k belongs to S, and both of j and k are in the sliding window, then skip j; otherwise, let S be S + { j };

4) Output S, the final selection of SNPs.

Value

Return a list of SNP IDs stratified by chromosomes.

Author(s)

Xiuwen Zheng

References

Weir B: Inferences about linkage disequilibrium. Biometrics 1979; 35: 235-254.

Weir B: Genetic Data Analysis II. Sunderland, MA: Sinauer Associates, 1996.

Weir BS, Cockerham CC: Complete characterization of disequilibrium at two loci; in Feldman MW (ed): Mathematical Evolutionary Theory. Princeton, NJ: Princeton University Press, 1989.

See Also

snpgdsLDMat, snpgdsLDpair

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
# open an example dataset (HapMap)
genofile <- snpgdsOpen(snpgdsExampleFileName())

set.seed(1000)
snpset <- snpgdsLDpruning(genofile)
str(snpset)
names(snpset)
#  [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9"
# [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
# ......

# get SNP ids
snp.id <- unlist(unname(snpset))

# close the genotype file
snpgdsClose(genofile)

Example output

Loading required package: gdsfmt
SNPRelate -- supported by Streaming SIMD Extensions 2 (SSE2)
SNP pruning based on LD:
Excluding 365 SNPs on non-autosomes
Excluding 1 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
    # of samples: 279
    # of SNPs: 8,722
    using 1 thread
    sliding window: 500,000 basepairs, Inf SNPs
    |LD| threshold: 0.2
    method: composite
Chromosome 1: 76.12%, 545/716
Chromosome 2: 72.78%, 540/742
Chromosome 3: 74.71%, 455/609
Chromosome 4: 73.49%, 413/562
Chromosome 5: 76.86%, 435/566
Chromosome 6: 75.75%, 428/565
Chromosome 7: 75.42%, 356/472
Chromosome 8: 71.11%, 347/488
Chromosome 9: 77.88%, 324/416
Chromosome 10: 74.12%, 358/483
Chromosome 11: 77.85%, 348/447
Chromosome 12: 76.81%, 328/427
Chromosome 13: 76.16%, 262/344
Chromosome 14: 76.60%, 216/282
Chromosome 15: 76.34%, 200/262
Chromosome 16: 72.66%, 202/278
Chromosome 17: 73.91%, 153/207
Chromosome 18: 73.68%, 196/266
Chromosome 19: 85.00%, 102/120
Chromosome 20: 71.62%, 164/229
Chromosome 21: 76.98%, 97/126
Chromosome 22: 75.86%, 88/116
6,557 markers are selected in total.
List of 22
 $ chr1 : int [1:545] 1 2 4 5 7 10 12 14 15 16 ...
 $ chr2 : int [1:540] 717 718 719 720 721 723 724 725 726 727 ...
 $ chr3 : int [1:455] 1459 1460 1461 1464 1466 1468 1469 1471 1472 1473 ...
 $ chr4 : int [1:413] 2068 2069 2070 2071 2072 2074 2075 2076 2077 2078 ...
 $ chr5 : int [1:435] 2630 2631 2633 2635 2636 2637 2638 2640 2642 2643 ...
 $ chr6 : int [1:428] 3196 3197 3198 3200 3201 3204 3205 3206 3207 3208 ...
 $ chr7 : int [1:356] 3761 3762 3763 3766 3767 3768 3770 3771 3772 3773 ...
 $ chr8 : int [1:347] 4233 4234 4235 4236 4237 4238 4239 4240 4241 4242 ...
 $ chr9 : int [1:324] 4721 4722 4724 4727 4728 4730 4731 4732 4733 4735 ...
 $ chr10: int [1:358] 5138 5139 5140 5143 5144 5145 5146 5147 5148 5149 ...
 $ chr11: int [1:348] 5620 5621 5623 5624 5625 5626 5628 5629 5630 5631 ...
 $ chr12: int [1:328] 6067 6068 6069 6070 6073 6074 6075 6077 6078 6079 ...
 $ chr13: int [1:262] 6494 6497 6498 6499 6500 6501 6503 6505 6507 6509 ...
 $ chr14: int [1:216] 6840 6841 6842 6843 6844 6845 6846 6847 6848 6850 ...
 $ chr15: int [1:200] 7120 7121 7122 7124 7125 7126 7127 7128 7129 7130 ...
 $ chr16: int [1:202] 7382 7383 7384 7385 7387 7388 7389 7391 7392 7394 ...
 $ chr17: int [1:153] 7660 7661 7662 7663 7664 7665 7666 7667 7668 7669 ...
 $ chr18: int [1:196] 7867 7868 7869 7870 7871 7872 7873 7874 7875 7877 ...
 $ chr19: int [1:102] 8133 8135 8136 8137 8138 8139 8140 8141 8142 8144 ...
 $ chr20: int [1:164] 8253 8254 8257 8258 8259 8260 8261 8262 8265 8266 ...
 $ chr21: int [1:97] 8482 8484 8485 8486 8487 8488 8489 8490 8491 8492 ...
 $ chr22: int [1:88] 8608 8609 8610 8612 8613 8614 8615 8617 8618 8619 ...
 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chr21" "chr22"

SNPRelate documentation built on Nov. 8, 2020, 5:31 p.m.