scanBED | R Documentation |
Sequentially visits variants in a PLINK1 BED fileset with a stepping window matrix, and processes each window matrix with a user script in either function or expression form; meant for data too big to fit in memory.
To read the entire BED into an R matrix, use [readBED]()
instead.
scanBED(
pfx,
FUN,
...,
win = 1,
iid = 1,
vid = 1,
vfr = NULL,
vto = NULL,
buf = 2^24,
simplify = TRUE
)
loopBED(
pfx,
EXP,
GVR = "g",
win = 1,
iid = 1,
vid = 1,
vfr = NULL,
vto = NULL,
buf = 2^24,
simplify = TRUE
)
pfx |
prefix of PLINK BED. |
FUN |
a function to process each window of variants; |
... |
additional arguments for FUN. |
win |
reading window size, in variants per window (def=1) |
iid |
option to read |
vid |
option to read |
vfr |
variant-wise, from where to read (number/proportion, def=1)? |
vto |
variant-wise, where to stop reading (number/proportion, def=P)? |
buf |
buffer size in bytes (def=2^24, or 16 MB). |
simplify |
try simplifying the results into an array, or leave them in a list, or specify a function to simplify the said list. |
EXP |
an R expression to evaluate with each window of variants; |
GVR |
an R variable name to assign the window to (def="g"). |
results of all windows processed by the user script.
scanBED()
: apply a function to variants in a PLINK1 BED fileset
Traverse P
variants via a sliding window while calling a function on each
window of variants without side effects on the calling environment, mimicking
various R apply
utilities.
loopBED()
: evaluate an expression on variants in a PLINK1 BED
Traverse P
variants via a sliding window and evaluate an R expression given
each window of variants, with side effects on the calling environment,
mimicking the syntax of R for
loop.
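The difference between the two forms can be sketched in base R without the package. `mock_loop()` below is a hypothetical stand-in, not plinkFile's implementation: it assumes consecutive, non-overlapping windows and only mimics how an expression evaluated in the calling environment can leave results behind, as described for loopBED().

```r
# Hypothetical stand-in for the expression form (not plinkFile code):
# evaluate EXP in the caller's environment once per window of columns.
mock_loop <- function(g, EXP, GVR = "g", win = 1) {
    EXP <- substitute(EXP)                 # capture the expression unevaluated
    env <- parent.frame()                  # the calling environment
    idx <- split(seq_len(ncol(g)), ceiling(seq_len(ncol(g)) / win))
    for (w in seq_along(idx)) {
        assign(GVR, g[, idx[[w]], drop = FALSE], envir = env)  # window matrix
        assign(".w", w, envir = env)                           # window index
        eval(EXP, env)                     # side effects land in the caller
    }
    invisible(NULL)
}

# a small dosage matrix: 3 individuals (rows) x 4 variants (columns)
g <- matrix(c(0, 1, 2, 1, 0, 2, 2, 2, 1, 0, 0, 1), nrow = 3)

ret <- list()                              # filled in by side effect
mock_loop(g, { ret[[.w]] <- colMeans(gt) / 2 }, GVR = "gt", win = 2)
length(ret)                                # one entry per window
```

With a function instead of an expression, the same work would return its results rather than modify `ret`, which is the contrast the two entries above draw.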
A popular format to store biallelic dosage genotypes, consisting of three files:
pfx.fam: text table for N
individuals, detailed in readFAM;
pfx.bim: text table for P
variants, detailed in readBIM;
pfx.bed: transposed genotype matrix (P
x N
) in binary format.
The triplet is commonly referred to by its shared prefix (pfx
), e.g., the X
chromosome represented by "chrX.bed", "chrX.fam", and "chrX.bim" is referred to
by "chrX"
.
The binary file "pfx.bed" represents each dosage value with two bits, just enough to encode all four possibilities: 0, 1, or 2 alleles, or missing.
The number of variants (P
) and samples (N
) equal the number of lines
in text file "pfx.bim" and "pfx.fam", respectively.
For the detailed specification of the PLINK1 BED genotype format, see the legacy PLINK v1.07 page at: https://zzz.bwh.harvard.edu/plink/binary.shtml.
For the modern use and management of PLINK1 BED, see the PLINK v1.9 page: https://www.cog-genomics.org/plink/1.9/input#bed.
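As a rough illustration of the two-bit encoding, the base R sketch below decodes one data byte of a "pfx.bed" file into four genotypes. The helper `decode_bed_byte()` is hypothetical and not part of the package. It follows the bit codes given in the PLINK specification linked above (00 homozygous first allele, 01 missing, 10 heterozygous, 11 homozygous second allele, lowest-order bits first), and it assumes, purely for illustration, that dosage counts copies of the first allele listed in the .bim file.

```r
# Hypothetical helper: decode one BED data byte (0..255) into 4 genotypes.
# Assumption: dosage = number of copies of the first allele in the .bim file.
decode_bed_byte <- function(byte) {
    # pull out the four 2-bit codes, lowest-order pair first
    code <- (byte %/% c(1L, 4L, 16L, 64L)) %% 4L
    # map codes 0, 1, 2, 3 to dosage 2, missing, 1, 0
    c(2L, NA_integer_, 1L, 0L)[code + 1L]
}

# 216 is 11011000 in binary: pairs read low to high are 00, 10, 01, 11
decode_bed_byte(216L)   # 2 1 NA 0
```

A real file also starts with a three-byte header and pads each variant's block to a whole number of bytes, both of which this sketch ignores.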
win
: visiting window size.
the number of variants per window, that is, the number of columns in each window matrix passed to the user script.
For example, a window of size one means the user script deals with only one variant at a time, received as a matrix of a single column, in a manner similar to genome wide association analysis (GWAS). However, a larger, multi-variant window coupled with the R language's vector and matrix syntax can significantly boost efficiency.
The default size is 1 variant / column per window.
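The effect of the window size can be tried out in base R on a simulated matrix. The sketch below is not the package's code: it assumes consecutive, non-overlapping windows with a possibly shorter last window, and all names in it (`g`, `idx`, `af`) are made up for the illustration.

```r
set.seed(1)
N <- 5; P <- 10; win <- 4

# simulated dosage matrix: N individuals in rows, P variants in columns
g <- matrix(sample(c(0:2, NA), N * P, replace = TRUE), N, P)

# split variant indices into consecutive windows of at most `win` columns
idx <- split(seq_len(P), ceiling(seq_len(P) / win))
W <- length(idx)                        # total number of windows

# one vectorized call per window instead of one call per variant
af <- lapply(idx, function(i) {
    w <- g[, i, drop = FALSE]           # window matrix, N x length(i)
    colMeans(w, na.rm = TRUE) / 2       # allele frequency of each column
})
af <- unlist(af, use.names = FALSE)     # P values, in variant order
```

With `win <- 1` the same loop makes `P` calls on single-column matrices; larger windows make fewer calls, each handling several columns at once.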
buf
: buffer size in bytes
a large buffer reduces the frequency of hard disk visits when traversing a PLINK1 BED file, which in turn reduces non-computation overhead.
The default size is 2^24
bytes, or 16 MB.
simplify
:
when FALSE: results of the user script processing each window of variants are returned in a list;
when TRUE: use simplify2array
to put the results into an array; if it
fails, fall back and return a list.
when a function is specified, it is used to simplify the results; if an exception is thrown, fall back and return a list.
e.g., when the window script returns a data frame of estimate, standard error,
t-statistic, and p-value for each variant, specify simplify = rbind
to combine
results of all windows into one data frame of P
rows and four columns of
statistics.
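The three modes can be mimicked in base R to see what shape each one yields. `simplify_results()` below is a hypothetical helper written for this illustration, not the package's code; in particular, it assumes that a user-supplied function receives the per-window results as separate arguments.

```r
# Hypothetical helper emulating the three documented modes of `simplify`.
simplify_results <- function(res, simplify = TRUE) {
    if (is.function(simplify)) {
        # assumption: window results are passed as separate arguments
        tryCatch(do.call(simplify, res), error = function(e) res)
    } else if (isTRUE(simplify)) {
        tryCatch(simplify2array(res), error = function(e) res)
    } else {
        res                              # leave the results in a list
    }
}

# mock results of three windows, each a 2-variant table of two columns
res <- lapply(1:3, function(w) cbind(wnd = rep(w, 2), est = c(0.1, 0.2) * w))

is.list(simplify_results(res, FALSE))    # results stay in a list
dim(simplify_results(res, TRUE))         # stacked into a 2 x 2 x 3 array
dim(simplify_results(res, rbind))        # bound into a 6 x 2 matrix

# a simplifier that throws falls back to the plain list
bad <- simplify_results(res, function(...) stop("cannot combine"))
is.list(bad)
```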
Context information, such as the number of variants and samples, is updated in the window processing environment to ease user scripting, including:
.i
: indices of variants in the current visiting window;
.p
: number of variants in the current visiting window.
.P
: total number of variants;
.w
: index of the current window;
.W
: total number of windows to go through;
.N
: number of individuals.
.b
: index of the current buffer.
.B
: number of buffers to be swapped.
e.g. (1) print percentage progress with print(.w / .W * 100)
;
e.g. (2) use inf <- readBIM(pfx)
to read the table of variants before the
window visits, later use inf[.i, ]
to access meta-data for variants in
each window.
[readBED]
## traverse genotype, apply R function without side effects
pfx <- file.path(system.file("extdata", package="plinkFile"), "000")
ret <- scanBED(pfx, function(g)
{
    .af <- colMeans(g, na.rm=TRUE) / 2
    maf <- pmin(.af, 1 - .af)
    mis <- colSums(is.na(g)) / .N
    pct <- round(.w / .W * 100, 2)
    cbind(buf=.b, wnd=.w, idx=.i, MAF=maf, MIS=mis, PCT=pct)
},
vfr=NULL, vto=NULL, win=13, simplify=rbind, buf=2^18)
head(ret)
tail(ret)
## traverse genotype, evaluate R expression with side effects
pfx <- file.path(system.file("extdata", package="plinkFile"), "000.bed")
ret <- list() # use side effect to keep the result of each window.
loopBED(pfx,
{
    af <- colMeans(gt, na.rm=TRUE) / 2
    sg <- af * (1 - af)
    ret[[.w]] <- cbind(wnd=.w, alf=af, var=sg)
},
win=13, GVR="gt", vid=3, buf=2^18)
head(ret)
tail(ret)