In zhengxwen/SeqArray: Data Management of Large-Scale Whole-Genome Sequence Variant Calls

The SeqArray package is designed for R programming environment, and enables high-performance computing in the multi-core symmetric multiprocessing and loosely coupled computer cluster framework. The features of SeqArray are extended with other existing R packages for WGS data analyses, and the R codes for demonstration are available in the package vignette R Integration.

Figure 1: SeqArray framework and flowchart. The SeqArray format is built on top of the Genomic Data Structure (GDS) format, and GDS is a generic data container with hierarchical structure for storing multiple array-oriented data sets. A high-level R interface to GDS files is provided in the gdsfmt package with a C++ library, and the SeqArray package offers functionalities specific to sequencing data. At a minimum a SeqArray file contains sample and variant identifiers, position, chromosome, reference and alternate alleles for each variant. Parallel computing environments, like multi-core computer clusters, are enabled with SeqArray. The functionality of SeqArray is extended by SeqVarTools, SNPRelate, GENESIS and other R/Bioconductor packages for WGS analyses.

options(width=110)

library(SeqArray)

# open a SeqArray file in the package (1000 Genomes Phase1, chromosome 22)
file <- seqOpen(seqExampleFileName("KG_Phase1"))

seqSummary(file)

SeqArray Functions

Key R Functions

Table 1: The key functions in the SeqArray package.

| Function | Description | |:-------------|:-------------------------------------------| | seqVCF2GDS | Reformat VCF files. » | | seqSetFilter | Define a data subset of samples or variants. » | | seqGetData | Get data from a SeqArray file with a defined filter. » | | seqApply | Apply a user-defined function over array margins. » | | seqParallel | Apply functions in parallel. » |

Genotypic data and annotations are stored in an array-oriented manner, providing efficient data access using the R programming language. Table 1 lists five key functions provided in the SeqArray package and many data analyses can be done using just these functions.

seqVCF2GDS() converts VCF files to SeqArray format. Multiple cores in an SMP architecture within one or more compute nodes in a compute cluster can be used to simultaneously reformat the data. seqVCF2GDS() utilizes R's connection interface to read VCF files incrementally, and it is able to import data from http/ftp texts and the standard output of a command-line tool via a pipe.

seqSetFilter() and seqGetData() can be used together to retrieve data for a selected set of samples from a defined genomic region. GRanges and GRangesList objects defined in the Bioconductor core packages are supported via seqSetFilter() [@Gentleman:2004aa; @Lawrence2013].

seqApply() applies a user-defined function to array margins of genotypes and annotations. The function that is applied can be defined in R as is typical, or via C/C++ code using the Rcpp package [@eddelbuettel2011rcpp]. seqParallel() utilizes the facilities in the parallel package [@Rossini2007; @R2016] to perform calculations on a SeqArray file in parallel.

Calculating Allele Frequencies

We illustrate the SeqArray functions by implementing an example to calculate the frequency of reference allele across all chromosomes. If a genomic region is specified via seqSetFilter(), the calculation is performed within the region instead of using all variants. seqApply() enables applying a user-defined function to the margin of genotypes, and the R code is shown as follows:

af <- seqApply(file, "genotype", as.is="double", margin="by.variant",
    FUN=function(x) { mean(x==0L, na.rm=TRUE) })
head(af)

where file is a SeqArray file, as.is indicates returning a numeric vector, margin is specified for applying the function by variant. The variable x in the user-defined function is an allele-by-sample integer matrix at a site and 0L denotes the reference allele where the suffix L indicates the number is an integer.

The Rcpp package simplifies integration of compiled C++ code with R [@eddelbuettel2011rcpp], and the function can be dynamically defined with inlined C/C++ codes:

library(Rcpp)

cppFunction("
    double CalcAlleleFreq(IntegerVector x)
    {
        int len=x.size(), n=0, n0=0;
        for (int i=0; i < len; i++)
        {
            int g = x[i];
            if (g != NA_INTEGER)
            {
                n++;
                if (g == 0) n0++;
            }
        }
        return double(n0) / n;
    }")

where IntegerVector indicates the input variable x is an integer vector, NA_INTEGER is missing value and the function counts how many zeros and non-missing values for calculating frequency. The name CalcAlleleFreq can be passed to seqApply() directly:

af <- seqApply(file, "genotype", as.is="double", margin="by.variant", FUN=CalcAlleleFreq)
head(af)

The C++ integration is several times faster than the R implementation, suggesting an efficient approach with C/C++ when real-time performance is required.

It is able to run the calculation in parallel. The genotypes of a SeqArray file are automatically split into non-overlapping parts according to different variants or samples, and the results from client processes collected internally:

af <- seqApply(file, "genotype", as.is="double",
    margin="by.variant", FUN=function(x) { mean(x==0L, na.rm=TRUE) }, parallel=2)
head(af)

Here parallel specifies the number of cores.

PCA R Implementation

Principal Component Analysis (PCA) is a common tool used in exploratory data analysis for high-dimensional data. PCA is often involved with the calculation of covariance matrix, and the following R code implements the calculation proposed in [@Patterson:2006:PLoS-Genet:17194218]. The user-defined function computes the covariance matrix for each variant and adds up to a total matrix s. The argument .progress=TRUE enables the display of progress information during the calculation.

# covariance variable with an initial value
s <- 0

seqApply(file, "$dosage", function(x)
    {
        p <- 0.5 * mean(x, na.rm=TRUE)      # allele frequency
        g <- (x - 2*p) / sqrt(p*(1-p))      # normalized by allele frequency
        g[is.na(g)] <- 0                    # correct missing values
        s <<- s + (g %o% g)                 # update the cov matrix s in the parent environment
    }, margin="by.variant", .progress=TRUE)

# scaled by the number of samples over the trace
s <- s * (nrow(s) / sum(diag(s)))

# eigen-decomposition
eig <- eigen(s)

[..................................................]  0%, ETC: --- 
[==================>...............................] 36%, ETC: 4.1m
...
[==================================================] 100%, completed in 14.3m

# covariance variable with an initial value
s <- 0

seqBlockApply(file, "$dosage", function(x)
    {
        p <- 0.5 * colMeans(x, na.rm=TRUE)     # allele frequencies (a vector)
        g <- (t(x) - 2*p) / sqrt(p*(1-p))      # normalized by allele frequency
        g[is.na(g)] <- 0                       # correct missing values
        s <<- s + crossprod(g)                 # update the cov matrix s in the parent environment
    }, margin="by.variant", .progress=TRUE)

# scaled by the number of samples over the trace
s <- s * (nrow(s) / sum(diag(s)))

# eigen-decomposition
eig <- eigen(s, symmetric=TRUE)

seqParallel() utilizes the facilities offered by the R parallel package to perform calculations within a cluster or SMP environment, and the genotypes are automatically split into non-overlapping parts. The parallel implementation with R is shown as follows, and the C optimized function is also available in the SNPRelate package.

# the datasets are automatically split into four non-overlapping parts
genmat <- seqParallel(2, file, FUN = function(f)
    {
        s <- 0  # covariance variable with an initial value
        seqBlockApply(f, "$dosage", function(x)
            {
                p <- 0.5 * colMeans(x, na.rm=TRUE)     # allele frequencies (a vector)
                g <- (t(x) - 2*p) / sqrt(p*(1-p))      # normalized by allele frequency
                g[is.na(g)] <- 0                       # correct missing values
                s <<- s + crossprod(g)                 # update the cov matrix s in the parent environment
            }, margin="by.variant")
        s  # output
    }, .combine = "+",    # sum "s" of different processes together
    split = "by.variant")

# scaled by the number of samples over the trace
genmat <- genmat * (nrow(genmat) / sum(diag(genmat)))

# eigen-decomposition
eig <- eigen(genmat, symmetric=TRUE)

# figure
plot(eig$vectors[,1], eig$vectors[,2], xlab="PC 1", ylab="PC 2")

More examples can be found: SeqArray Data Format and Access

Parallel Implementation

The default setting for the analysis functions in the SeqArray package is serial implementation, but users can setup a cluster computing environment manually via seqParallelSetup() and distribute the calculations to multiple cores or even more than 100 cluster nodes.

# use 2 cores for demonstration
seqParallelSetup(2)

# numbers of distinct alleles per site
table(seqNumAllele(file))

# reference allele frequencies
summary(seqAlleleFreq(file, ref.allele=0L))

# close the cluster environment
seqParallelSetup(FALSE)

Bioconductor Features

GRanges and GRangesList

In this section, we illustrate how to work with Bioconductor core packages for performing common queries to retrieve data from a SeqArray file. The GRanges and GRangesList classes manipulate genomic range data and can be used in the function seqSetFilter() to define a data subset. For example, the annotation information of each exon, the coding range and transcript ID are stored in the TxDb.Hsapiens.UCSC.hg19.knownGene object for the UCSC known gene annotations on hg19.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)

# get the exons grouped by gene
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txs <- exonsBy(txdb, "gene")

where exonsBy() returns a GRangesList object for all known genes in the database.

seqSetFilter(file, txs)  # define an exon filter

## # of selected variants: 1,050

# VCF export with exon variants
seqGDS2VCF(file, "exons.vcf.gz")

## Tue Jan  3 15:43:55 2017
## VCF Export: exons.vcf.gz
##     1,092 samples, 1,050 variants
##     INFO Field: <none>
##     FORMAT Field: <none>
## 
[..................................................]  0%, ETC: ---    
[==================================================] 100%, completed in 0s    
## Tue Jan  3 15:43:55 2017    Done.

If random-access memory is sufficiently large, users could load all exon variants via seqGetData(file, "genotype"); otherwise, data have to be loaded by chunk or a user-defined function is applied over variants by seqApply().

VariantAnnotation

SeqArray can also export data with selected variants and samples as a VCF object for use with the VariantAnnotation package [@Obenchain2014]:

library(VariantAnnotation)

# select a region [10Mb, 30Mb] on chromosome 22
seqSetFilterChrom(file, 22, from.bp=10000000, to.bp=30000000)

vcf <- seqAsVCF(file, chr.prefix="chr")
vcf

locateVariants(vcf, txdb, CodingVariants())

## GRanges object with 524 ranges and 9 metadata columns:
##       seqnames               ranges strand | LOCATION  LOCSTART    LOCEND   QUERYID        TXID         CDSID
##          <Rle>            <IRanges>  <Rle> | <factor> <integer> <integer> <integer> <character> <IntegerList>
##     1    chr22 [17071862, 17071862]      - |   coding      1579      1579       128       74436        216505
##     2    chr22 [17073170, 17073170]      - |   coding       271       271       129       74436        216505
##     3    chr22 [17589225, 17589225]      + |   coding      1116      1116       377       73481        214034
##     4    chr22 [17601466, 17601466]      - |   coding       552       552       382       74444        216522
##     5    chr22 [17629357, 17629357]      - |   coding       424       424       394       74446        216528
##   ...      ...                  ...    ... .      ...       ...       ...       ...         ...           ...
##   520    chr22 [29913278, 29913278]      - |   coding      1567      1567      7023       74771        217273
##   521    chr22 [29924156, 29924156]      - |   coding       977       977      7030       74768        217279
##   522    chr22 [29924156, 29924156]      - |   coding       977       977      7030       74769        217279
##   523    chr22 [29924156, 29924156]      - |   coding       977       977      7030       74770        217279
##   524    chr22 [29924156, 29924156]      - |   coding       977       977      7030       74771        217279
##            GENEID       PRECEDEID        FOLLOWID
##       <character> <CharacterList> <CharacterList>
##     1      150160                                
##     2      150160                                
##     3       23765                                
##     4       27439                                
##     5       27440                                
##   ...         ...             ...             ...
##   520        8563                                
##   521        8563                                
##   522        8563                                
##   523        8563                                
##   524        8563                                
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

unlink("exons.vcf.gz", force=TRUE)

Integration with SeqVarTools

The SeqVarTools package is available on Bioconductor, which defines S4 classes and methods for other common operations and analyses on SeqArray datasets. The vignette of SeqVarTools is http://www.bioconductor.org/packages/release/bioc/vignettes/SeqVarTools/inst/doc/SeqVarTools.pdf.

Linear Regression

The SeqVarTools package extends SeqArray by providing methods for many tasks common to quality control and analysis of sequence data. Methods include: transition/transversion ratio, heterozygosity and homozygosity rates, singleton counts, Hardy-Weinberg equilibrium, Mendelian error checking, and linear and logistic regression. Additionally, SeqVarTools defines a new class to link the information present in the SeqArray file to additional sample and variant annotation provided by the user, such as sex and phenotype. One could select a subset of samples in a file and run a linear regression on all variants:

library(Biobase)

library(SeqVarTools)

data(KG_P1_SampData)
KG_P1_SampData

head(pData(KG_P1_SampData))  # show KG_P1_SampData

# link sample data to SeqArray file
seqData <- SeqVarData(file, sample.data)

# set sample and variant filters
female <- sampleData(seqData)$sex == "female"
seqSetFilter(seqData, sample.sel=female) 

# run linear regression
res <- regression(seqData, outcome="phenotype", covar="age")
head(res)

##   variant.id   n      freq          Est        SE   Wald.Stat  Wald.Pval
## 1          1 567 0.6887125 -0.090555715 0.0699074 1.677974724 0.19519378
## 2          2 567 0.9400353  0.009685877 0.1321824 0.005369459 0.94158602
## 3          3 567 0.9991182 -0.378945215 1.0238102 0.136997920 0.71128392
## 4          4 567 1.0000000           NA        NA          NA         NA
## 5          5 567 0.9356261 -0.009732930 0.1281883 0.005764880 0.93947733
## 6          6 567 0.9982363 -1.379486822 0.7233651 3.636804912 0.05651529

Integration with SNPRelate

Parallel implementations of relatedness and principal component analysis with SeqArray format are enabled in the package SNPRelate, to detect and adjust for population structure and cryptic relatedness in association studies. The kernel of SNPRelate was optimized with SIMD instructions and multi-thread algorithms, and it was designed for bi-allelic SNP data originally [@Zheng2012]. In order to analyze sequence variant calls, and SNPRelate has been rewritten to take the dosages of reference alleles as an input genotype matrix with distinct integers 0, 1, 2 and NA for SeqArray files. Therefore no format conversion is required for WGS analyses.

Principal component analysis is implemented in the SNPRelate function snpgdsPCA(), and the exact and new randomized algorithms are both provided [@Patterson:2006:PLoS-Genet:17194218; @Galinsky2016]. The randomized matrix algorithm is designed to reduce the running time for large number of study individuals (e.g., greater than 10,000 samples). Relatedness analyses include PLINK method of moment (MoM), KING robust methods, GCTA genetic relationship matrix (GRM) and individual-perspective beta estimator [@Purcell:2007:Am-J-Hum-Genet:17701901; @Manichaikul2010; @Yang2011; @weir2015snps], and these algorithms are computationally efficient and optimized with SIMD instructions. In addition, fixation index ($F_\text{st}$) has been widely used to measure the genetic difference between populations, and the calculations of moment estimators are available in the SNPRelate package with all variants or a sliding window [@Weir:1984:Evolution; @Weir:2002:Annu-Rev-Genet:12359738; @Weir2005].

LD-based Marker Pruning

It is suggested to perform marker pruning before running PCA and IBD analyses on WGS variant data, to reduce the influence of linkage disequilibrium and rare variants.

library(SNPRelate)

set.seed(1000)

# may try different LD thresholds for sensitivity analysis
snpset <- snpgdsLDpruning(file, ld.threshold=0.2, maf=0.01)
names(snpset)
head(snpset$chr22)  # variant.id

# get all selected variant id
snpset.id <- unlist(snpset)

Principal Component Analysis

# Run PCA
pca <- snpgdsPCA(file, snp.id=snpset.id, num.thread=2)

# variance proportion (%)
pc.percent <- pca$varprop*100
head(round(pc.percent, 2))

# plot the first 4 eigenvectors with character=20 and size=0.5
plot(pca, eig=1:4, pch=20, cex=0.5)

Population information are available:

pop.code <- factor(seqGetData(file, "sample.annotation/Population"))
head(pop.code)

popgroup <- list(
    EastAsia = c("CHB", "JPT", "CHS", "CDX", "KHV", "CHD"),
    European = c("CEU", "TSI", "GBR", "FIN", "IBS"),
    African  = c("ASW", "ACB", "YRI", "LWK", "GWD", "MSL", "ESN"),
    SouthAmerica = c("MXL", "PUR", "CLM", "PEL"),
    India = c("GIH", "PJL", "BEB", "STU", "ITU"))

colors <- sapply(levels(pop.code), function(x) {
    for (i in 1:length(popgroup)) {
        if (x %in% popgroup[[i]])
            return(names(popgroup)[i])
    }
    NA
    })
colors <- as.factor(colors)
legend.text <- sapply(levels(colors), function(x) paste(levels(pop.code)[colors==x], collapse=","))
legend.text

# make a data.frame
tab <- data.frame(sample.id = pca$sample.id,
    EV1 = pca$eigenvect[,1],    # the first eigenvector
    EV2 = pca$eigenvect[,2],    # the second eigenvector
    Population = pop.code, stringsAsFactors = FALSE)
head(tab)

# draw
plot(pca, pch=20, cex=0.75, main="1KG Phase 1, chromosome 22", col=colors[tab$Population])
legend("topright", legend=legend.text, col=1:length(legend.text), pch=19, cex=0.75)

Relatedness Analysis

For relatedness analysis, Identity-By-Descent (IBD) estimation in SNPRelate can be done by the method of moments (MoM) [@Purcell:2007:Am-J-Hum-Genet:17701901].

# YRI samples
sample.id <- seqGetData(file, "sample.id")
CEU.id <- sample.id[pop.code == "CEU"]

# Estimate IBD coefficients
ibd <- snpgdsIBDMoM(file, sample.id=CEU.id, snp.id=snpset.id, num.thread=2)

# Make a data.frame
ibd.coeff <- snpgdsIBDSelection(ibd)
head(ibd.coeff)

plot(ibd.coeff$k0, ibd.coeff$k1, xlim=c(0,1), ylim=c(0,1), xlab="k0", ylab="k1", main="CEU samples (MoM)")
lines(c(0,1), c(1,0), col="red", lty=2)

Identity-By-State Analysis

For $n$ study individuals, snpgdsIBS() can be used to create a $n \times n$ matrix of genome-wide average IBS pairwise identities. To perform cluster analysis on the $n \times n$ matrix of genome-wide IBS pairwise distances, and determine the groups by a permutation score:

set.seed(1000)
ibs.hc <- snpgdsHCluster(snpgdsIBS(file, snp.id=snpset.id, num.thread=2))

Here is the population information we have known:

# Determine groups of individuals by population information
rv <- snpgdsCutTree(ibs.hc, samp.group=as.factor(colors[pop.code]))

plot(rv$dendrogram, leaflab="none", main="1KG Phase 1, chromosome 22",
    edgePar=list(col=rgb(0.5,0.5,0.5,0.75), t.col="black"))
legend("bottomleft", legend=legend.text, col=1:length(legend.text), pch=19, cex=0.75, ncol=4)

Fixation Index ($F_\text{st}$)

Fixation index (Fst) has been widely used to measure the genetic difference between populations, and the calculations of moment estimators are available in the SNPRelate package with all variants or a sliding window.

# sliding windows (window size: 500kb)
sw <- snpgdsSlidingWindow(file, winsize=500000, shift=100000,
    FUN="snpgdsFst", as.is="numeric", population=pop.code)

plot(sw$chr22.pos/1000, sw$chr22.val, xlab="genome coordinate (kb)", ylab="population-average Fst",
    main="1KG Phase 1, chromosome 22")
abline(h=mean(sw$chr22.val), lty=3, col="red", lwd=2)

GENESIS

The GENESIS package offers methodology for estimating and accounting for population and pedigree structure in genetic analyses. The current implementation provides functions to perform PC-AiR and PC-Relate [@Conomos2015; @Conomos2016]. PC-AiR performs PCA on whole-genome genotypes taking into account known or cryptic relatedness in the study sample. PC-Relate uses ancestry representative principal components to estimate measures of recent genetic relatedness. In addition, GENESIS includes support for SeqArray files in mixed model association testing and aggregate tests of rare variants like burden and SKAT tests.

Resources

gdsfmt R package: https://github.com/zhengxwen/gdsfmt, http://bioconductor.org/packages/gdsfmt
SeqArray R package: https://github.com/zhengxwen/SeqArray, http://bioconductor.org/packages/SeqArray
SeqVarTools R package: http://bioconductor.org/packages/SeqVarTools
SNPRelate R package: https://github.com/zhengxwen/SNPRelate, http://bioconductor.org/packages/SNPRelate
GENESIS R package: http://bioconductor.org/packages/GENESIS

Session Information

seqClose(file)

sessionInfo()

References

zhengxwen/SeqArray documentation built on Feb. 13, 2025, 5:43 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

zhengxwen/SeqArray
Data Management of Large-Scale Whole-Genome Sequence Variant Calls

In zhengxwen/SeqArray: Data Management of Large-Scale Whole-Genome Sequence Variant Calls

SeqArray Functions

Key R Functions

Calculating Allele Frequencies

PCA R Implementation

Parallel Implementation

Bioconductor Features

GRanges and GRangesList

VariantAnnotation

Integration with SeqVarTools

Linear Regression

Integration with SNPRelate

LD-based Marker Pruning

Principal Component Analysis

Relatedness Analysis

Identity-By-State Analysis

Fixation Index ($F_\text{st}$)

GENESIS

Resources

Session Information

References

R Package Documentation

Browse R Packages

We want your feedback!

zhengxwen/SeqArray Data Management of Large-Scale Whole-Genome Sequence Variant Calls

In zhengxwen/SeqArray: Data Management of Large-Scale Whole-Genome Sequence Variant Calls

SeqArray Functions

Key R Functions

Calculating Allele Frequencies

PCA R Implementation

Parallel Implementation

Bioconductor Features

GRanges and GRangesList

VariantAnnotation

Integration with SeqVarTools

Linear Regression

Integration with SNPRelate

LD-based Marker Pruning

Principal Component Analysis

Relatedness Analysis

Identity-By-State Analysis

Fixation Index ($F_\text{st}$)

GENESIS

Resources

Session Information

References

R Package Documentation

Browse R Packages

We want your feedback!

zhengxwen/SeqArray
Data Management of Large-Scale Whole-Genome Sequence Variant Calls