snpgdsPCA | R Documentation |
To calculate the eigenvectors and eigenvalues for principal component analysis in GWAS.
snpgdsPCA(gdsobj, sample.id=NULL, snp.id=NULL,
autosome.only=TRUE, remove.monosnp=TRUE, maf=NaN, missing.rate=NaN,
algorithm=c("exact", "randomized"),
eigen.cnt=ifelse(identical(algorithm, "randomized"), 16L, 32L),
num.thread=1L, bayesian=FALSE, need.genmat=FALSE,
genmat.only=FALSE, eigen.method=c("DSPEVX", "DSPEV"),
aux.dim=eigen.cnt*2L, iter.num=10L, verbose=TRUE)
## S3 method for class 'snpgdsPCAClass'
plot(x, eig=c(1L,2L), ...)
gdsobj |
an object of class |
sample.id |
a vector of sample id specifying selected samples; if NULL, all samples are used |
snp.id |
a vector of snp id specifying selected SNPs; if NULL, all SNPs are used |
autosome.only |
if |
remove.monosnp |
if TRUE, remove monomorphic SNPs |
maf |
to use the SNPs with ">= maf" only; if NaN, no MAF threshold |
missing.rate |
to use the SNPs with "<= missing.rate" only; if NaN, no missing threshold |
eigen.cnt |
output the number of eigenvectors; if eigen.cnt <= 0, then return all eigenvectors |
algorithm |
"exact", traditional exact calculation; "randomized", fast PCA with randomized algorithm introduced in Galinsky et al. 2016 |
num.thread |
the number of (CPU) cores used; if |
bayesian |
if TRUE, use bayesian normalization |
need.genmat |
if TRUE, return the genetic covariance matrix |
genmat.only |
return the genetic covariance matrix only, do not compute the eigenvalues and eigenvectors |
eigen.method |
"DSPEVX" – compute the top |
aux.dim |
auxiliary dimension used in fast randomized algorithm |
iter.num |
iteration number used in fast randomized algorithm |
verbose |
if TRUE, show information |
x |
a |
eig |
indices of eigenvectors, like |
... |
the arguments passed to or from other methods, like
|
The minor allele frequency and missing rate for each SNP passed in
snp.id
are calculated over all the samples in sample.id
.
Return a snpgdsPCAClass
object, and it is a list:
sample.id |
the sample ids used in the analysis |
snp.id |
the SNP ids used in the analysis |
eigenval |
eigenvalues |
eigenvect |
eigenvactors, "# of samples" x "eigen.cnt" |
varprop |
variance proportion for each principal component |
TraceXTX |
the trace of the genetic covariance matrix |
Bayesian |
whether use bayerisan normalization |
genmat |
the genetic covariance matrix |
Xiuwen Zheng
Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006 Dec;2(12):e190.
Galinsky KJ, Bhatia G, Loh PR, Georgiev S, Mukherjee S, Patterson NJ, Price AL. Fast Principal-Component Analysis Reveals Convergent Evolution of ADH1B in Europe and East Asia. Am J Hum Genet. 2016 Mar 3;98(3):456-72. doi: 10.1016/j.ajhg.2015.12.022. Epub 2016 Feb 25.
snpgdsPCACorr
, snpgdsPCASNPLoading
,
snpgdsPCASampLoading
, snpgdsAdmixProp
,
snpgdsEIGMIX
# open an example dataset (HapMap)
genofile <- snpgdsOpen(snpgdsExampleFileName())
# run PCA
RV <- snpgdsPCA(genofile)
RV
# eigenvalues
head(RV$eigenval)
# variance proportion (%)
head(round(RV$varprop*100, 2))
# [1] 12.23 5.84 1.01 0.95 0.84 0.74
# draw
plot(RV)
plot(RV, 1:4)
#### there is no population information ####
# make a data.frame
tab <- data.frame(sample.id = RV$sample.id,
EV1 = RV$eigenvect[,1], # the first eigenvector
EV2 = RV$eigenvect[,2], # the second eigenvector
stringsAsFactors = FALSE)
head(tab)
# sample.id EV1 EV2
# 1 NA19152 -0.08411287 -0.01226860
# 2 NA19139 -0.08360644 -0.01085849
# 3 NA18912 -0.08110808 -0.01184524
# 4 NA19160 -0.08680864 -0.01447106
# 5 NA07034 0.03109761 0.07709255
# 6 NA07055 0.03228450 0.08155730
# draw
plot(tab$EV2, tab$EV1, xlab="eigenvector 2", ylab="eigenvector 1")
#### there are population information ####
# get population information
# or pop_code <- scan("pop.txt", what=character())
# if it is stored in a text file "pop.txt"
pop_code <- read.gdsn(index.gdsn(genofile, "sample.annot/pop.group"))
# get sample id
samp.id <- read.gdsn(index.gdsn(genofile, "sample.id"))
# assume the order of sample IDs is as the same as population codes
cbind(samp.id, pop_code)
# samp.id pop_code
# [1,] "NA19152" "YRI"
# [2,] "NA19139" "YRI"
# [3,] "NA18912" "YRI"
# [4,] "NA19160" "YRI"
# [5,] "NA07034" "CEU"
# ...
# make a data.frame
tab <- data.frame(sample.id = RV$sample.id,
pop = factor(pop_code)[match(RV$sample.id, samp.id)],
EV1 = RV$eigenvect[,1], # the first eigenvector
EV2 = RV$eigenvect[,2], # the second eigenvector
stringsAsFactors = FALSE)
head(tab)
# sample.id pop EV1 EV2
# 1 NA19152 YRI -0.08411287 -0.01226860
# 2 NA19139 YRI -0.08360644 -0.01085849
# 3 NA18912 YRI -0.08110808 -0.01184524
# 4 NA19160 YRI -0.08680864 -0.01447106
# 5 NA07034 CEU 0.03109761 0.07709255
# 6 NA07055 CEU 0.03228450 0.08155730
# draw
plot(tab$EV2, tab$EV1, col=as.integer(tab$pop),
xlab="eigenvector 2", ylab="eigenvector 1")
legend("bottomright", legend=levels(tab$pop), pch="o", col=1:4)
# close the file
snpgdsClose(genofile)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.