Nothing
### ======================================================================
### Testing Rplinkseq and VariantAnnotation
### ======================================================================
###
### February 2014
### 'rhino03' Ubuntu server, 387 Gb RAM
### 16 processors with the following configuration:
###
### vendor_id : GenuineIntel
### cpu family : 6
### model : 45
### model name : Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
### stepping : 7
### microcode : 0x70d
### cpu MHz : 2900.142
### cache size : 20480 KB
### physical id : 1
### siblings : 8
### core id : 0
### cpu cores : 8
### apicid : 32
### initial apicid : 32
### fpu : yes
### fpu_exception : yes
### cpuid level : 13
### wp : yes
### flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
### pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
### rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc
### aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr
### pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx
### lahf_lm ida arat xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
### bogomips : 5799.87
### clflush size : 64
### cache_alignment : 64
### address sizes : 46 bits physical, 48 bits virtual
### power management:
### --------------------------------------------------------------------------
### Objectives
### --------------------------------------------------------------------------
### Compare runtimes between Rplinkseq and VariantAnnotation packages.
###
###
### Test functions:
###
### load.vcf: Data are loaded from a vcf file into a list of lists
### and accessed with x.consensus* functions. (Rplinkseq)
###
### var.fetch: Data are loaded from a PLINK/Seq 'project' into a list of
### lists and accessed with x.consensus* functions. (Rplinkseq)
###
### meta.fetch: Data are loaded from a PLINK/Seq 'project' and parsed into
### a data.frame. (Rplinkseq)
###
### var.iterate: Applies a function to data from a PLINK/Seq 'project'
### while iterating. Result of function is returned.
### (Rplinkseq)
###
### scanVcf: Data are loaded from a vcf file. Info and geno fields are
### parsed into a list of lists; other 'core' fields are
### returned as a GRAnges object. (VariantAnnotation)
###
###
### Test cases:
###
### Test I: Query by range.
### Import data by randomly chosen genomic range.
### (Functions: load.vcf, var.fetch, scanVcf)
###
### Test II: Query by range and fields.
### Import data by genomic range and 4 randomly
### chosen info (2) and geno (2) fields.
### (Functions: meta.fetch, scanVcf)
###
### Test III. Iterate through file.
### A simple function is applied to all records in the
### file in an iterative fashion.
### (Functions: var.iterate, scanVcf)
###
### --------------------------------------------------------------------------
### Data
### --------------------------------------------------------------------------
### Test file:
### Size on disk: 1.8G compressed
### Contents: 494328 variants, 1092 samples, 22 INFO, and 3 GENO fields
### Download location:
### ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.chr22.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz
###
### Test subset:
### An arbitrary subset of data was chosen for testing because the full file
### requires >250G in memory. The subset range is 2e7 to 2.5e7 and contains
### 63088 records. This subset is the 'range' in the mask and param objects.
###
### PLINK/Seq 'project':
### From the raw vcf on disk we creted a PLINK/Seq 'project' so we could use
### the var.iterate function. A 'project' creates compressed, indexed, SQLite
### database from input files. This one took ~30 minutes to create.
### pseq proj new-project
### pseq proj load-vcf --vcf 'ALL.chr22.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz'
### --------------------------------------------------------------------------
### Set up
### --------------------------------------------------------------------------
library(microbenchmark)
library(VariantAnnotation)
library(Rplinkseq) ## version 0.08
fl <- "/loc/no-backup/vobencha/ALL.chr22.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz"
### Attach the 'project' to the R session:
pseq.project("proj")
### --------------------------------------------------------------------------
### Test I. Query by range
### --------------------------------------------------------------------------
### load.vcf:
mask <- "reg=22:20000000..25000000"
loadvcf_1 <- function()
load.vcf(fl, mask=mask, limit=200000)
### var.fetch:
varfetch_1 <- function()
var.fetch(mask=mask, limit=200000)
### scanVcf:
tfile <- TabixFile(fl)
which <- GRanges("22", IRanges(2e7, 2.5e7))
param <- ScanVcfParam(which=which)
scanvcf_1 <- function()
scanVcf(tfile, param=param)
micro <- microbenchmark(loadvcf_1(), varfetch_1(), scanvcf_1(), times=5)
micro
###Unit: seconds
### expr min lq median uq max neval
### loadvcf_1() 313.1729 335.5738 359.7979 368.4054 369.9397 5
### varfetch_1() 264.4101 283.2388 291.7533 305.7028 318.7879 5
### scanvcf_1() 300.5585 308.5507 359.0814 400.2049 678.2296 5
### --------------------------------------------------------------------------
### Test II. Query by range and fields
### --------------------------------------------------------------------------
### Randomly select 2 INFO and 2 GENO fields:
info_var <- c("RSQ", "THETA")
geno_var <- c("GT", "DS")
### Other fields parsed by scanVcf:
other_var <- c("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER")
### load.vcf: To the best of our knowledge, this function can import
### by range but not by select fields. To isolate a field,
### all fields are read in and parsed with the x.consensus
### functions. There is no equivalent comparision for
### load.vcf in this test case.
### meta.fetch:
mask <- "reg=22:20000000..25000000"
metafetch_2 <- function()
meta.fetch(c(info_var, geno_var, other_var), mask=mask)
### scanVcf: NOTE: Fields in 'other_var' are imported and parsed by
### scanVcf() under names of 'rowRanges' and 'fixed'.
tfile <- TabixFile(fl)
which <- GRanges("22", IRanges(2e7, 2.5e7))
param <- ScanVcfParam(which=which, info=info_var, geno=geno_var)
scanvcf_2 <- function()
scanVcf(tfile, param=param)
micro <- microbenchmark(metafetch_2(), scanvcf_2(), times=5)
micro
###Unit: seconds
### expr min lq median uq max neval
### metafetch_2() 118.14090 120.65009 120.87346 120.88615 121.13340 5
### scanvcf_2() 35.45512 35.46179 35.51305 35.59804 37.41809 5
### --------------------------------------------------------------------------
### Test III. Iterate through file
### --------------------------------------------------------------------------
### load.vcf: To the best of our knowledge Rplinkseq cannot iterate on
### raw vcf files. Iteration must be done on a 'project' with
### var.iterate. There is no equivalent comparision for load.vcf
### in this test case.
### var.iterate:
simple_ct <- function(v)
ct <<- ct + length(v$ID)
variterate_3 <- function()
{
ct <<- 0
var.iterate(simple_ct)
}
### scanVcf:
tfile <- TabixFile(fl, yieldSize=10000)
param <- ScanVcfParam(fixed=c("ALT"), info=NA, geno=NA)
scanvcf_3 <- function()
{
ct2 <<- 0
open(tfile)
while((len <- length(scanVcf(tfile, param=param)[[1]]$rowRanges)) > 0)
ct2 <<- ct2 + len
close(tfile)
}
micro <- microbenchmark(variterate_3(), scanvcf_3(), times=5)
micro
##Unit: seconds
## expr min lq median uq max neval
## variterate_3() 1552.88520 1576.55856 1583.13120 1585.65281 1591.96375 5
## scanvcf_3() 50.04566 50.06945 50.27194 50.56145 50.70112 5
### --------------------------------------------------------------------------
### Scaling with number of variants and samples (scanVcf only)
### --------------------------------------------------------------------------
### Linear scaling by variants:
tfile <- TabixFile(fl)
yieldSize <- c(100000, 200000, 300000, 400000, 500000)
param <- ScanVcfParam(info=NA, geno=NA)
fun0 <- function(tfile, param) scanVcf(tfile, param=param)
res <- lapply(yieldSize, function(i) {
tf <- TabixFile(fl, yieldSize=i)
microbenchmark(fun0(tf, param), times=5)})
res
### [[1]]
### Unit: seconds
### expr min lq median uq max neval
### fun0(tf, param) 9.883719 10.13574 10.15165 10.17318 10.70124 5
###
### [[2]]
### Unit: seconds
### expr min lq median uq max neval
### fun0(tf, param) 19.385 19.6715 19.71968 19.72206 20.23362 5
###
### [[3]]
### Unit: seconds
### expr min lq median uq max neval
### fun0(tf, param) 29.24268 29.33945 29.36927 29.38207 30.01012 5
###
### [[4]]
### Unit: seconds
### expr min lq median uq max neval
### fun0(tf, param) 38.71675 38.73608 38.77965 38.8696 39.27382 5
###
### [[5]]
### Unit: seconds
### expr min lq median uq max neval
### fun0(tf, param) 47.94605 48.76884 49.03272 49.03859 49.08618 5
### Linear scaling by sample:
tf <- TabixFile(fl)
ids <- samples(scanVcfHeader(fl))
sampleSize <- c(200, 400, 600, 800, 1000)
which <- GRanges("22", IRanges(2e7, 2.5e7)) ## 63088 records
fun0 <- function(tf, param) scanVcf(tf, param=param)
res <- lapply(sampleSize, function(i) {
param <- ScanVcfParam(which=which, samples=ids[1:i])
microbenchmark(fun0(tf, param), times=5)})
res
### [[1]]
### Unit: seconds
### expr min lq median uq max neval
### fun0(tf, param) 61.80377 61.91141 69.62993 70.10218 103.1949 5
###
### [[2]]
### Unit: seconds
### expr min lq median uq max neval
### fun0(tf, param) 128.3595 132.1819 135.7386 135.966 150.7588 5
###
### [[3]]
### Unit: seconds
### expr min lq median uq max neval
### fun0(tf, param) 204.0067 211.0546 219.2069 232.3388 249.188 5
###
### [[4]]
### Unit: seconds
### expr min lq median uq max neval
### fun0(tf, param) 266.7383 269.8461 273.7986 299.7805 330.6505 5
###
### [[5]]
### Unit: seconds
### expr min lq median uq max neval
### fun0(tf, param) 329.6348 337.1841 348.7968 359.5972 364.2781 5
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.