Description Details Author(s) References Examples
Genome-wide association studies are widely used to investigate the genetic basis of diseases and traits, but they pose many computational challenges. We developed SNPRelate (R package for multi-core symmetric multiprocessing computer architectures) to accelerate two key computations on SNP data: principal component analysis (PCA) and relatedness analysis using identity-by-descent measures. The kernels of our algorithms are written in C/C++ and highly optimized.
Package: | SNPRelate |
Type: | Package |
License: | GPL version 3 |
Depends: | gdsfmt (>= 1.0.4) |
The genotypes stored in GDS format can be analyzed by the R functions in SNPRelate, which utilize the multi-core feature of machine for a single computer.
Webpage: http://github.com/zhengxwen/SNPRelate, http://corearray.sourceforge.net/
Tutorial: http://corearray.sourceforge.net/tutorials/SNPRelate/
Xiuwen Zheng zhengxwen@gmail.com
Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS. A High-performance Computing Toolset for Relatedness and Principal Component Analysis of SNP Data. Bioinformatics (2012); doi: 10.1093/bioinformatics/bts610
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 | ####################################################################
# Convert the PLINK BED file to the GDS file
#
# PLINK BED files
bed.fn <- system.file("extdata", "plinkhapmap.bed.gz", package="SNPRelate")
fam.fn <- system.file("extdata", "plinkhapmap.fam.gz", package="SNPRelate")
bim.fn <- system.file("extdata", "plinkhapmap.bim.gz", package="SNPRelate")
# convert
snpgdsBED2GDS(bed.fn, fam.fn, bim.fn, "HapMap.gds")
####################################################################
# Principal Component Analysis
#
# open
genofile <- snpgdsOpen("HapMap.gds")
RV <- snpgdsPCA(genofile)
plot(RV$eigenvect[,2], RV$eigenvect[,1], xlab="PC 2", ylab="PC 1",
col=rgb(0,0,150, 50, maxColorValue=255), pch=19)
# close the file
snpgdsClose(genofile)
####################################################################
# Identity-By-Descent (IBD) Analysis
#
# open
genofile <- snpgdsOpen(snpgdsExampleFileName())
RV <- snpgdsIBDMoM(genofile)
flag <- lower.tri(RV$k0)
plot(RV$k0[flag], RV$k1[flag], xlab="k0", ylab="k1",
col=rgb(0,0,150, 50, maxColorValue=255), pch=19)
abline(1, -1, col="red", lty=4)
# close the file
snpgdsClose(genofile)
####################################################################
# Identity-By-State (IBS) Analysis
#
# open
genofile <- snpgdsOpen(snpgdsExampleFileName())
RV <- snpgdsIBS(genofile)
m <- 1 - RV$ibs
colnames(m) <- rownames(m) <- RV$sample.id
GeneticDistance <- as.dist(m[1:45, 1:45])
HC <- hclust(GeneticDistance, "ave")
plot(HC)
# close the file
snpgdsClose(genofile)
####################################################################
# Linkage Disequilibrium (LD) Analysis
#
# open an example dataset (HapMap)
genofile <- snpgdsOpen(snpgdsExampleFileName())
snpset <- read.gdsn(index.gdsn(genofile, "snp.id"))[1:200]
L1 <- snpgdsLDMat(genofile, snp.id=snpset, method="composite", slide=-1)
# plot
image(abs(L1$LD), col=terrain.colors(64))
# close the file
snpgdsClose(genofile)
|
Loading required package: gdsfmt
SNPRelate -- supported by Streaming SIMD Extensions 2 (SSE2)
Start snpgdsBED2GDS ...
BED file: "/usr/local/lib/R/site-library/SNPRelate/extdata/plinkhapmap.bed.gz" in the SNP-major mode (Sample X SNP)
FAM file: "/usr/local/lib/R/site-library/SNPRelate/extdata/plinkhapmap.fam.gz", DONE.
BIM file: "/usr/local/lib/R/site-library/SNPRelate/extdata/plinkhapmap.bim.gz", DONE.
Sun Feb 10 13:31:39 2019 store sample id, snp id, position, and chromosome.
start writing: 60 samples, 5000 SNPs ...
Sun Feb 10 13:31:39 2019 0%
Sun Feb 10 13:31:39 2019 100%
Sun Feb 10 13:31:39 2019 Done.
Optimize the access efficiency ...
Clean up the fragments of GDS file:
open the file 'HapMap.gds' (104.6K)
# of fragments: 38
save to 'HapMap.gds.tmp'
rename 'HapMap.gds.tmp' (104.4K, reduced: 240B)
# of fragments: 18
Principal Component Analysis (PCA) on genotypes:
Excluding 203 SNPs on non-autosomes
Excluding 28 SNPs (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
Working space: 60 samples, 4,769 SNPs
using 1 (CPU) core
PCA: the sum of all selected genotypes (0,1,2) = 124273
CPU capabilities: Double-Precision SSE2
Sun Feb 10 13:31:39 2019 (internal increment: 1908)
[..................................................] 0%, ETC: ---
[==================================================] 100%, completed in 0s
Sun Feb 10 13:31:39 2019 Begin (eigenvalues and eigenvectors)
Sun Feb 10 13:31:39 2019 Done.
IBD analysis (PLINK method of moment) on genotypes:
Excluding 365 SNPs on non-autosomes
Excluding 1 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
Working space: 279 samples, 8,722 SNPs
using 1 (CPU) core
PLINK IBD: the sum of all selected genotypes (0,1,2) = 2446510
Sun Feb 10 13:31:39 2019 (internal increment: 13056)
[..................................................] 0%, ETC: ---
[==================================================] 100%, completed in 0s
Sun Feb 10 13:31:39 2019 Done.
Identity-By-State (IBS) analysis on genotypes:
Excluding 365 SNPs on non-autosomes
Excluding 1 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
Working space: 279 samples, 8,722 SNPs
using 1 (CPU) core
IBS: the sum of all selected genotypes (0,1,2) = 2446510
Sun Feb 10 13:31:40 2019 (internal increment: 13056)
[..................................................] 0%, ETC: ---
[==================================================] 100%, completed in 0s
Sun Feb 10 13:31:40 2019 Done.
Linkage Disequilibrium (LD) estimation on genotypes:
Working space: 279 samples, 200 SNPs
using 1 (CPU) core.
method: composite
LD matrix: the sum of all selected genotypes (0,1,2) = 55417
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.