BGData-package: A Suite of Packages for Analysis of Big Genomic Data

BGData-package    R Documentation

A Suite of Packages for Analysis of Big Genomic Data

Description

Modern genomic datasets are big (large n), high-dimensional (large p), and multi-layered. Such datasets pose two main challenges: memory requirements and computational demands. Our goal is to develop software that enables researchers to carry out analyses with big genomic data within the R environment.

Details

We have identified several approaches to tackle those challenges within R:

  • File-backed matrices: The data is stored on the hard drive, and users can read smaller chunks into memory when they are needed.

  • Linked arrays: For very large datasets a single file-backed array may be insufficient or inconvenient. A linked array is an array whose content is distributed over multiple file-backed nodes.

  • Multiple dispatch: Methods are provided so that users can work with these arrays much as they would with ordinary in-memory (RAM) arrays.

  • Multi-level parallelism: Exploit multi-core and multi-node computing.

  • Inputs: Users can create these arrays from standard formats (e.g., PLINK .bed).
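
The file-backed idea in the list above can be illustrated with base R alone. The following is a minimal sketch, not the mechanism used by BEDMatrix or any other package in the suite: it assumes a matrix of 8-byte doubles stored in column-major order, and read_columns is a hypothetical helper introduced only for this illustration.

```r
# Write a small numeric matrix to disk in column-major order (8-byte doubles).
n_rows <- 5L
n_cols <- 4L
x <- matrix(as.numeric(seq_len(n_rows * n_cols)), nrow = n_rows, ncol = n_cols)
path <- tempfile(fileext = ".bin")
writeBin(as.vector(x), path, size = 8L)

# Hypothetical helper: read only columns `from`..`to` from the backing file,
# so that the full matrix never has to be held in memory.
read_columns <- function(path, n_rows, from, to) {
    con <- file(path, open = "rb")
    on.exit(close(con))
    # Skip the columns before `from`; each value occupies 8 bytes.
    seek(con, where = (from - 1) * n_rows * 8, origin = "start")
    values <- readBin(con, what = "double", n = (to - from + 1) * n_rows, size = 8L)
    matrix(values, nrow = n_rows)
}

# Load a two-column chunk and check it against the in-memory original.
chunk <- read_columns(path, n_rows, from = 2L, to = 3L)
identical(chunk, x[, 2:3])
```

Reading whole columns keeps each chunk contiguous on disk under this layout, which is why the sketch seeks once and reads once per chunk.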

The BGData package is an umbrella package that comprises several packages: BEDMatrix, LinkedMatrix, and symDMatrix. It features scalable and efficient computational methods for analyzing large genomic datasets, such as genome-wide association studies (GWAS) and the computation of genomic relationship matrices (G matrix). It also contains a container class called BGData that holds genotypes, sample information, and variant information.

Example dataset

The extdata folder contains example files that were generated from the 250k SNP and phenotype data in Atwell et al. (2010). Only the first 300 SNPs of chromosomes 1, 2, and 3 were included to keep the size of the example dataset small. PLINK was used to convert the data to .bed and .raw files. FT10 has been chosen as a phenotype and is provided as an alternate phenotype file. The rows of this file are intentionally shuffled to demonstrate that the additional phenotypes are put in the same order as the rest of the phenotypes.
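
As a rough sketch of how the example dataset might be used, the code below links the three chromosomes into one matrix, wraps them in a BGData object together with the shuffled FT10 phenotype file, and runs a GWAS and a G matrix computation. The file names chr1.bed, chr2.bed, chr3.bed, and pheno.txt are assumptions about the contents of the extdata folder, and the geno() accessor is assumed to be available in the installed version of BGData; check the installed package before relying on either.

```r
library(BGData)

# Locate the example files shipped in the package's extdata folder.
path <- system.file("extdata", package = "BGData")

# One file-backed matrix per chromosome, linked column-wise into one matrix.
# The .bed file names are assumed, not taken from this help page.
geno_all <- LinkedMatrix::ColumnLinkedMatrix(
    BEDMatrix::BEDMatrix(file.path(path, "chr1.bed")),
    BEDMatrix::BEDMatrix(file.path(path, "chr2.bed")),
    BEDMatrix::BEDMatrix(file.path(path, "chr3.bed"))
)

# Wrap the genotypes in a BGData container and merge the shuffled FT10
# phenotype file (file name assumed) into the sample information.
bg <- as.BGData(geno_all, alternatePhenotypeFile = file.path(path, "pheno.txt"))

# Single-marker regressions of FT10 on each SNP, with an intercept only.
gwas <- GWAS(formula = FT10 ~ 1, data = bg)
head(gwas)

# Genomic relationship matrix computed from the linked genotypes.
G <- getG(geno(bg))
dim(G)
```

Because the genotypes stay on disk and are read in chunks, the same calls are intended to scale to datasets that do not fit in memory.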

See Also

BEDMatrix-package, LinkedMatrix-package, and symDMatrix-package for an introduction to the respective packages.

file-backed-matrices for more information on file-backed matrices. multi-level-parallelism for more information on multi-level parallelism.


QuantGen/BGData documentation built on Sept. 30, 2023, 1:01 p.m.