bed: traverse variants in a PLINK1 BED fileset

scanBED R Documentation

traverse variants in a PLINK1 BED fileset

Description

Sequentially visits variants in a PLINK1 BED fileset with a stepping window matrix, and processes each window matrix with a user script in either function or expression form. Meant for data too big to fit in memory.

To read the entire BED into an R matrix, use readBED() instead.

Usage

scanBED(
  pfx,
  FUN,
  ...,
  win = 1,
  iid = 1,
  vid = 1,
  vfr = NULL,
  vto = NULL,
  buf = 2^24,
  simplify = TRUE
)

loopBED(
  pfx,
  EXP,
  GVR = "g",
  win = 1,
  iid = 1,
  vid = 1,
  vfr = NULL,
  vto = NULL,
  buf = 2^24,
  simplify = TRUE
)

Arguments

pfx

prefix of PLINK BED.

FUN

a function to process each window of variants;

...

additional arguments for FUN when scanBED is used.

win

reading window size (def=1 variant per window, as in the usage above)

iid

option to read N IID as row names (def=1, see readIID()).

vid

option to read P VID as col names (def=1, see readVID()).

vfr

variant-wise, from where to read (number/proportion, def=1)?

vto

variant-wise, to where then stop (number/proportion, def=P)?

buf

buffer size in bytes (def=2^24, or 16 MB).

simplify

try simplifying the results into an array, or leave them in a list, or specify a function to simplify the said list.

EXP

an R expression to evaluate with each window of variants;

GVR

an R variable name to assign the window to (def="g").

Value

results of all windows processed by the user script.

Functions

  • scanBED(): apply a function to variants in a PLINK1 BED fileset

    Traverses P variants via a sliding window, calling a function on each window of variants without side effects on the calling environment, mimicking the R apply utilities.

  • loopBED(): evaluate an expression on variants in a PLINK1 BED

    Traverses P variants via a sliding window and evaluates an R expression on each window of variants, with side effects on the calling environment, mimicking the syntax of an R for loop.
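The difference between the function form and the expression form can be illustrated in base R alone. The sketch below is not the package's implementation; toy_windows, toy_scan, and toy_loop are hypothetical helpers that work on an in-memory matrix and only mimic how results are collected (returned list versus side effects in the caller's environment).

```r
# Toy genotype matrix: 3 samples (rows) x 4 variants (columns).
g_all <- matrix(c(0, 1, 2,  NA, 1, 1,  2, 0, 0,  2, 1, 0), nrow = 3)

# Hypothetical helper: column indices for each window of `win` variants.
toy_windows <- function(p, win) split(seq_len(p), ceiling(seq_len(p) / win))

# Function form (in the spirit of scanBED): results come back in a list and
# the caller's environment is left untouched.
toy_scan <- function(g_all, FUN, ..., win = 2) {
    lapply(toy_windows(ncol(g_all), win),
           function(i) FUN(g_all[, i, drop = FALSE], ...))
}

# Expression form (in the spirit of loopBED): each window is assigned to the
# variable named by GVR in the caller's environment, where EXP is evaluated.
toy_loop <- function(g_all, EXP, GVR = "g", win = 2) {
    expr <- substitute(EXP)          # capture the unevaluated expression
    env <- parent.frame()            # the caller's environment
    for (i in toy_windows(ncol(g_all), win)) {
        assign(GVR, g_all[, i, drop = FALSE], envir = env)
        eval(expr, env)
    }
    invisible(NULL)
}

# Function form: per-window column means returned as a list of two vectors.
res <- toy_scan(g_all, colMeans, na.rm = TRUE)

# Expression form: the expression appends to `acc` as a side effect.
acc <- numeric(0)
toy_loop(g_all, acc <- c(acc, colMeans(gt, na.rm = TRUE)), GVR = "gt")
```

With the function form the caller only sees the returned value; with the expression form the accumulator acc (and the window variable gt) live in the caller's environment after the loop ends.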

BED PLINK1 Binary Pedigree fileset

A popular format to store biallelic dosage genotypes, consisting of three files:

  • pfx.fam: text table for N individuals, detailed in readFAM;

  • pfx.bim: text table for P variants, detailed in readBIM;

  • pfx.bed: transposed genotype matrix (P x N) in binary format.

The triplet is commonly referred to by the shared prefix (pfx), e.g., the X chromosome represented by "chrX.bed", "chrX.fam", and "chrX.bim" is referred to by "chrX".

The binary file "pfx.bed" represents each dosage value with two bits, just enough to encode all four possibilities: 0, 1, or 2 alleles, or missing.

The number of variants (P) and samples (N) equal the number of lines in the text files "pfx.bim" and "pfx.fam", respectively.

For the detailed specification of the PLINK1 BED genotype format, see the legacy PLINK v1.07 page at: https://zzz.bwh.harvard.edu/plink/binary.shtml.

For the modern use and management of PLINK1 BED, see the PLINK v1.9 page: https://www.cog-genomics.org/plink/1.9/input#bed.
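As an illustration of the two-bit encoding, the base R sketch below decodes one byte of a variant block into four genotype calls. The bit layout (lowest-order pair first; 00 = homozygous first allele, 01 = missing, 10 = heterozygous, 11 = homozygous second allele) follows the PLINK specification linked in this section. The helper decode_byte is hypothetical, and expressing dosage as the number of copies of the first allele in the .bim file is an assumption made for this example; it may differ from the coding returned by this package.

```r
# Hypothetical helper: decode one .bed byte into four genotype calls.
decode_byte <- function(byte) {
    byte <- as.integer(byte)
    # Extract the four two-bit codes, lowest-order pair (first sample) first.
    code <- (byte %/% c(1L, 4L, 16L, 64L)) %% 4L
    # Lookup table indexed by code + 1:
    #   00 -> 2 copies of the first allele, 01 -> missing,
    #   10 -> 1 copy (heterozygous),        11 -> 0 copies.
    c(2L, NA, 1L, 0L)[code + 1L]
}

# 0xDC is 11011100 in binary; read in pairs from the right: 00, 11, 01, 11.
decode_byte(0xDC)
```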
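A quick sanity check on a fileset follows from these counts. Per the PLINK1 specification, a variant-major "pfx.bed" starts with three magic bytes and then stores each variant in ceiling(N / 4) bytes. The sketch below writes a toy .fam and .bim to a temporary directory (made-up content, for illustration only) and derives the size a matching .bed would have.

```r
# Toy fileset prefix in a temporary directory (illustrative content only).
pfx <- file.path(tempdir(), "toy")
writeLines(sprintf("F%d I%d 0 0 0 -9", 1:5, 1:5), paste0(pfx, ".fam"))
writeLines(sprintf("1 rs%d 0 %d A G", 1:3, 1:3 * 100L), paste0(pfx, ".bim"))

# N samples and P variants are the line counts of the two text files.
N <- length(readLines(paste0(pfx, ".fam")))
P <- length(readLines(paste0(pfx, ".bim")))

# Four genotypes per byte; the last byte of each variant is padded.
bytes_per_variant <- ceiling(N / 4)

# Three magic bytes followed by P variant blocks.
bed_size <- 3 + P * bytes_per_variant
```

For a real fileset, comparing bed_size with file.size(paste0(pfx, ".bed")) is a cheap way to detect a mismatched or truncated triplet before a long scan.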

detailed arguments

  • win: visiting window size.

    the number of variants per window, that is, the number of columns in each window matrix passed to the user script.

    For example, a window of size one means the user script deals with only one variant at a time, received as a single-column matrix, in a manner similar to genome-wide association analysis (GWAS). However, a larger, multi-variant window coupled with the R language's vector and matrix syntax can significantly boost efficiency.

    The default size is 1 variant / column per window, as in the usage above.

  • buf: buffer size in bytes

    a large buffer reduces the frequency of hard disk visits when traversing a PLINK1 BED file, which in turn reduces non-computation overhead.

    The default size is 2^24 bytes, or 16 MB.

  • simplify:

    when FALSE, the results of the user script processing each window of variants are returned in a list;

    when TRUE, simplify2array is used to put the results into an array; if that fails, fall back and return a list;

    when a function is specified, it is used to simplify the results; if an exception is thrown, fall back and return a list.

    e.g., if the window script returns a data frame of estimate, standard error, t-statistic, and p-value for each variant, use simplify = rbind to combine the results of all windows into one data frame of P rows and four columns of statistics.
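The interplay of win and simplify can be previewed in base R without touching any file. The sketch below assumes that the last window simply holds the remaining variants and that a simplify function is applied to the per-window results roughly like do.call(); both are assumptions about behavior, not statements about the package's internals.

```r
P <- 10     # total number of variants (toy value)
win <- 4    # variants per window

# Variant indices per window: sizes 4, 4, and 2 under the stated assumption.
idx <- split(seq_len(P), ceiling(seq_len(P) / win))
W <- length(idx)

# A stand-in for the user script: one row of statistics per variant.
res <- lapply(idx, function(i) data.frame(idx = i, stat = i / P))

# simplify = FALSE: keep the per-window results as a list of W data frames.
as_list <- res

# simplify = rbind: stack all windows into one data frame of P rows.
combined <- do.call(rbind, res)

# simplify = TRUE: simplify2array() yields an array only when the per-window
# results are conformable, e.g. two windows each returning 4 numbers.
arr <- simplify2array(lapply(idx[1:2], function(i) i * 2))
```

Here combined has 10 rows and 2 columns, while arr is a 4 x 2 matrix with one column per window.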

genotype context

Context information, such as the number of variants and samples, is updated in the window processing environment to ease user scripting. It includes:

  • .i: indices of variants in the current visiting window;

  • .p: number of variants in the current visiting window.

  • .P: total number of variants;

  • .w: index of the current window;

  • .W: total number of windows to go through;

  • .N: number of individuals.

  • .b: index of the current buffer.

  • .B: number of buffers to be swapped.

e.g. (1) print percentage progress with print(.w / .W * 100);

e.g. (2) use inf <- readBIM(pfx) to read the table of variants before the window visits, then use inf[.i, ] to access meta-data for the variants in each window.
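Both uses can be rehearsed on a mock-up before running a real scan. In the sketch below the context variables .i, .w, and .W are set by hand in an ordinary for loop, and inf is a made-up variant table with hypothetical column names standing in for the output of readBIM(pfx); inside scanBED or loopBED these variables would instead be provided by the window processing environment.

```r
# Made-up variant table standing in for inf <- readBIM(pfx).
inf <- data.frame(chr = 1L, id = sprintf("rs%d", 1:7), pos = 1:7 * 1000L,
                  stringsAsFactors = FALSE)

win <- 3
idx <- split(seq_len(nrow(inf)), ceiling(seq_len(nrow(inf)) / win))
.W <- length(idx)                          # total number of windows

pct <- numeric(0)
ids <- list()
for (.w in seq_len(.W)) {                  # index of the current window
    .i <- idx[[.w]]                        # variant indices in this window
    pct[.w] <- round(.w / .W * 100, 2)     # e.g. (1): percentage progress
    ids[[.w]] <- inf[.i, "id"]             # e.g. (2): meta-data lookup
}
```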

See Also

readBED

Examples

## traverse genotype, apply R function without side effects
pfx <- file.path(system.file("extdata", package="plinkFile"), "000")
ret <- scanBED(pfx, function(g)
{
    .af <- colMeans(g, na.rm=TRUE) / 2
    maf <- pmin(.af, 1 - .af)
    mis <- colSums(is.na(g)) / .N
    pct <- round(.w / .W * 100, 2)
    cbind(buf=.b, wnd=.w, idx=.i, MAF=maf, MIS=mis, PCT=pct)
},
vfr=NULL, vto=NULL, win=13, simplify=rbind, buf=2^18)
head(ret)
tail(ret)

## traverse genotype, evaluate R expression with side effects
pfx <- file.path(system.file("extdata", package="plinkFile"), "000.bed")
ret <- list() # use side effect to keep the result of each window.
loopBED(pfx,
{
    af <- colMeans(gt, na.rm=TRUE) / 2
    sg <- af * (1 - af)
    ret[[.w]] <- cbind(wnd=.w, alf=af, var=sg)
},
win=13, GVR="gt", vid=3, buf=2^18)
head(ret)
tail(ret)


plinkFile documentation built on Nov. 24, 2023, 5:10 p.m.
