snpgdsLDpruning: Linkage Disequilibrium (LD) based SNP pruning



Description

Recursively removes SNPs within a sliding window, based on pairwise linkage disequilibrium (LD).

Usage

snpgdsLDpruning(gdsobj, sample.id = NULL, snp.id = NULL, autosome.only = TRUE,
	remove.monosnp = TRUE, maf = NaN, missing.rate = NaN,
	method = c("composite", "r", "dprime", "corr"), slide.max.bp = 500000,
	slide.max.n = NA, ld.threshold = 0.2, num.thread = 1, verbose = TRUE)

Arguments

gdsobj

a GDS file object (gds.class)

sample.id

a vector of sample IDs specifying the selected samples; if NULL, all samples are used

snp.id

a vector of SNP IDs specifying the selected SNPs; if NULL, all SNPs are used

autosome.only

if TRUE, use autosomal SNPs only

remove.monosnp

if TRUE, remove monomorphic SNPs

maf

use only the SNPs with minor allele frequency (MAF) >= maf; if NaN, no MAF threshold is applied

missing.rate

use only the SNPs with missing rate <= missing.rate; if NaN, no missing-rate threshold is applied

method

the LD measure: "composite", "r", "dprime" or "corr"; see Details

slide.max.bp

the maximum width of the sliding window, in base pairs

slide.max.n

the maximum number of SNPs in the sliding window

ld.threshold

the LD threshold, compared against the absolute value of the LD measure

num.thread

the number of CPU cores used

verbose

if TRUE, show information

Details

The minor allele frequency and missing rate for each SNP passed in snp.id are calculated over all the samples in sample.id.
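As a rough illustration of these two per-SNP statistics, the base R sketch below computes them from a toy genotype matrix. The matrix layout (samples in rows, SNPs in columns, entries counting copies of allele A, NA for missing), the helper name snp_summary and the threshold values are assumptions made for this example; they are not part of SNPRelate.

```r
# Toy genotype matrix (hypothetical data): rows = samples, columns = SNPs.
# Entries count copies of allele A (0, 1, 2); NA marks a missing genotype.
geno <- matrix(c(0, 1, 2, NA,    # snp1
                 0, 0, 1, 2,     # snp2
                 2, 2, 2, 2,     # snp3 (monomorphic)
                 1, NA, NA, 0),  # snp4
               nrow = 4,
               dimnames = list(paste0("s", 1:4), paste0("snp", 1:4)))

# Hypothetical helper: per-SNP MAF and missing rate over all samples in geno.
snp_summary <- function(geno) {
  freq_a <- colMeans(geno, na.rm = TRUE) / 2   # frequency of allele A
  data.frame(
    maf = pmin(freq_a, 1 - freq_a),            # frequency of the rarer allele
    missing.rate = colMeans(is.na(geno))       # fraction of missing genotypes
  )
}

stats <- snp_summary(geno)

# Keep SNPs with MAF >= 0.05 and missing rate <= 0.25 (illustrative thresholds).
keep <- stats$maf >= 0.05 & stats$missing.rate <= 0.25
colnames(geno)[keep]
```

Here snp3 is dropped because it is monomorphic (MAF of 0), and snp4 because half of its genotypes are missing.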

Four methods can be used to calculate linkage disequilibrium values: "composite" for the LD composite measure, "r" for the R coefficient (estimated by the EM algorithm assuming Hardy-Weinberg equilibrium; it can be negative), "dprime" for D', and "corr" for the correlation coefficient. The method "corr" is equivalent to "composite" when SNP genotypes are coded as BB = 0, AB = 1, AA = 2. The argument ld.threshold is compared against the absolute value of the chosen measure.
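To make the role of the absolute value concrete, the sketch below uses base R's cor() on genotypes coded as the number of A alleles. It only illustrates the idea behind a correlation-type measure and the threshold comparison; the genotype vectors are made up, and the code is not the package's internal computation.

```r
# Hypothetical genotypes for 8 samples at three SNPs, coded as the number of A alleles.
snp1 <- c(0, 1, 2, 1, 0, 2, 1, 0)
snp2 <- c(0, 1, 2, 2, 0, 2, 1, 0)   # differs from snp1 in one sample
snp3 <- 2 - snp1                    # same SNP with the allele labels swapped

ld_12 <- cor(snp1, snp2)   # strongly positive
ld_13 <- cor(snp1, snp3)   # strongly negative: swapping alleles flips the sign

# The threshold is applied to |LD|, so both pairs count as being in high LD.
ld.threshold <- 0.2
c(pair_12 = abs(ld_12) > ld.threshold,
  pair_13 = abs(ld_13) > ld.threshold)
```

Because a signed measure changes sign when the reference allele is swapped, comparing its absolute value to the threshold keeps the pruning decision independent of allele labelling.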

It is useful to generate a pruned subset of SNPs that are in approximate linkage equilibrium with each other. The function snpgdsLDpruning recursively removes SNPs within a sliding window based on the pairwise genotypic correlation. SNP pruning is conducted chromosome by chromosome, since SNPs on different chromosomes can be considered independent of each other.

The pruning algorithm on a chromosome is described as follows (n is the total number of SNPs on that chromosome):

1) Randomly select a starting position i, and let the current SNP set S = { i };

2) For each right position j from i+1 to n: if the absolute LD between j and any k in S exceeds ld.threshold, where both j and k are in the sliding window, then skip j; otherwise, let S be S + { j };

3) For each left position j from i-1 to 1: if the absolute LD between j and any k in S exceeds ld.threshold, where both j and k are in the sliding window, then skip j; otherwise, let S be S + { j };

4) Output S, the final selection of SNPs.
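The steps above can be sketched in base R as follows. This is a simplified illustration, not the SNPRelate implementation: the function name prune_chromosome and its start argument are hypothetical, LD is approximated by cor() on allele counts, "in the sliding window" is taken to mean that two SNPs lie within slide.max.bp of each other, and slide.max.n is ignored.

```r
# Sketch of the greedy pruning on one chromosome (hypothetical helper).
# geno:  samples x SNPs matrix of allele counts (0/1/2), columns ordered by position
# pos:   base-pair position of each SNP, in the same order as the columns
# start: starting index i; drawn at random when NULL (step 1)
prune_chromosome <- function(geno, pos, ld.threshold = 0.2,
                             slide.max.bp = 500000, start = NULL) {
  n <- ncol(geno)
  if (is.null(start)) start <- sample.int(n, 1)
  selected <- start

  try_add <- function(j, selected) {
    # Compare j only with already selected SNPs inside the window around j.
    in_window <- selected[abs(pos[selected] - pos[j]) <= slide.max.bp]
    for (k in in_window) {
      ld <- cor(geno[, j], geno[, k], use = "pairwise.complete.obs")
      if (!is.na(ld) && abs(ld) > ld.threshold) return(selected)  # skip j
    }
    c(selected, j)  # no high-LD partner found: S <- S + { j }
  }

  # Step 2: scan to the right of the start.
  if (start < n) for (j in (start + 1):n) selected <- try_add(j, selected)
  # Step 3: scan to the left of the start.
  if (start > 1) for (j in (start - 1):1) selected <- try_add(j, selected)

  sort(selected)  # step 4: indices of the retained SNPs
}

# Made-up data: SNPs a, b and d are identical; c is uncorrelated with them;
# d lies more than 500 kb away from the others.
g1 <- c(0, 1, 2, 1, 0, 2, 1, 0)
geno <- cbind(a = g1, b = g1, c = c(2, 1, 1, 1, 0, 1, 1, 1), d = g1)
pos <- c(1000, 2000, 3000, 900000)

from_1 <- prune_chromosome(geno, pos, start = 1)  # keeps a, c, d
from_2 <- prune_chromosome(geno, pos, start = 2)  # keeps b, c, d
```

The two calls show that the retained set depends on the starting position: whichever of the duplicated SNPs a and b is visited first is kept and the other is skipped, while d survives in both cases because it falls outside the window.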

Value

Returns a list of SNP IDs, with one element per chromosome.
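A result of this shape can be handled with ordinary list tools. The snippet below uses a small mock list with invented SNP IDs (not actual output of snpgdsLDpruning) to show how to count the retained SNPs per chromosome and flatten them into a single vector.

```r
# Mock result with the same structure as the return value (IDs are invented).
snpset <- list(chr1 = c(1L, 2L, 4L),
               chr2 = c(10L, 13L),
               chr3 = c(21L, 22L, 25L, 27L))

# Number of retained SNPs on each chromosome.
per_chr <- lengths(snpset)

# All retained SNP IDs as one plain vector, e.g. to pass as a snp.id argument.
snp.id <- unlist(snpset, use.names = FALSE)
length(snp.id)
```

Passing use.names = FALSE avoids the automatically generated names such as "chr11", "chr12" that unlist() would otherwise attach to each element.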

Author(s)

Xiuwen Zheng

References

Weir B: Inferences about linkage disequilibrium. Biometrics 1979; 35: 235-254.

Weir B: Genetic Data Analysis II. Sunderland, MA: Sinauer Associates, 1996.

Weir BS, Cockerham CC: Complete characterization of disequilibrium at two loci; in Feldman MW (ed): Mathematical Evolutionary Theory. Princeton, NJ: Princeton University Press, 1989.

See Also

snpgdsLDMat, snpgdsLDpair

Examples

# load the package and open an example dataset (HapMap)
library(SNPRelate)
genofile <- openfn.gds(snpgdsExampleFileName())

snpset <- snpgdsLDpruning(genofile)
names(snpset)
#  [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9"
# [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
# ......
head(snpset$chr1)
# e.g. [1]  1  2  4  5  7 10  (the selection depends on the random starting position)

# get SNP ids
snp.id <- unlist(snpset)

# close the genotype file
closefn.gds(genofile)

Example output

Loading required package: gdsfmt
SNPRelate -- supported by Streaming SIMD Extensions 2 (SSE2)
Hint: it is suggested to call `snpgdsOpen' to open a SNP GDS file instead of `openfn.gds'.
SNP pruning based on LD:
Excluding 365 SNPs on non-autosomes
Excluding 1 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
Working space: 279 samples, 8,722 SNPs
    using 1 (CPU) core
    sliding window: 500,000 basepairs, Inf SNPs
    |LD| threshold: 0.2
    method: composite
Chromosome 1: 75.98%, 544/716
Chromosome 2: 72.78%, 540/742
Chromosome 3: 75.21%, 458/609
Chromosome 4: 73.31%, 412/562
Chromosome 5: 76.86%, 435/566
Chromosome 6: 75.58%, 427/565
Chromosome 7: 75.42%, 356/472
Chromosome 8: 71.31%, 348/488
Chromosome 9: 77.64%, 323/416
Chromosome 10: 73.71%, 356/483
Chromosome 11: 77.85%, 348/447
Chromosome 12: 76.35%, 326/427
Chromosome 13: 76.45%, 263/344
Chromosome 14: 76.95%, 217/282
Chromosome 15: 76.34%, 200/262
Chromosome 16: 73.02%, 203/278
Chromosome 17: 73.91%, 153/207
Chromosome 18: 73.68%, 196/266
Chromosome 19: 85.00%, 102/120
Chromosome 20: 71.62%, 164/229
Chromosome 21: 76.19%, 96/126
Chromosome 22: 75.86%, 88/116
6,555 markers are selected in total.
 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chr21" "chr22"
[1]  1  2  4  5  7 10

SNPRelate documentation built on May 2, 2019, 4:56 p.m.