knitr::opts_chunk$set(echo = TRUE, fig.width = 7, fig.height = 5)
The motivation is to reduce memory usage. Matt told us that memory is his greatest obstacle, when running QTL browsers because he cannot add more R/plumber
workers. Also, if you have a cheap Windows laptop you cannot even open many DO .RData
projects simply because haplotype probability object is too big.
The solution would be to store data in some database and load just a slice you need at a time. I experimented with SQLite
but it was too slow. Then I try feather and it run fine (R script attached – crunching 1000 SNPs at a time). It just does not use qtl2scan
effectively because the package is not prepared for that.
Task: refactor this Rmd to use DOex and run some examples.
suppressPackageStartupMessages({ library(qtl2) library(qtl2feather) })
DOex <- read_cross2( file.path("https://raw.githubusercontent.com/rqtl", "qtl2data/master/DOex", "DOex.zip"))
Calculate genotype probabilities and convert to allele probabilities
pr <- calc_genoprob(DOex, error_prob=0.002) apr <- genoprob_to_alleleprob(pr)
Write feather database of pr.
tmpdir <- tempdir()
fpr <- feather_genoprob(pr, "pr", tmpdir) fapr <- feather_genoprob(apr, "apr", tmpdir)
list.files(tmpdir)
Methods names()
and length()
both work properly.
names(fpr)
length(fpr)
Methods dim()
and dimnames()
give dimensions across the chromosomes.
Notice that fpr
has different genotypes for the X
chromosome.
dim(fpr)
str(dimnames(fpr))
Selecting a chromosome causes reading from feather database and creation of an array.
dim(fapr[["X"]])
Alternate form.
dim(fapr$X)
Subsetting using subset(object,ind,chr,mar)
or [ind,chr,mar]
only adjusts the vector of
individuals, chromosomes and markers, but does not alter the feather database.
dim(subset(fapr, ind=1:20, chr=c("2","3")))
You can also use []
to subset.
dim(fapr[1:20, c("2","3")][["3"]])
f2 <- fapr[,"2"] f23 <- fapr[,c("2","3")] fx <- fapr[,"X"]
There is a third dimension for markers. However, be careful that if you select a subset of markers that excludes one or more chromosomes, those will be dropped.
dim(subset(fapr, mar=1:30))
dim(fapr[ , , dimnames(fapr)$mar$X[1:20]])
Binding by columns (chromosomes) or rows (individuals) may cause creation of a new feather database if input objects arose from different feather databases. However, if objects are subsets of the same feather_genoprob
object, then it reuses the one feather database. Further, if objects have the same directory and file basename for their feather databases, they will be combined without creation of any new feather databases.
See example(cbind.feather_genoprob)
and example(rbind.feather_genoprob)
with objects having distinct feather databases.
newf23 <- cbind(f2,f23)
Row bind.
f23a <- fapr[1:20, c("2","3")] f23b <- fapr[40:79, c("2","3")] f23 <- rbind(f23a, f23b)
Subset on markers. This way only extracts the selected markers
from feather database before creating the array.
markers <- dimnames(fapr$X)[[3]][1:10] dim(fapr[,,markers]$X)
This way extracts all markers on X
, creates the array, then subsets on selected markers
.
markers <- dimnames(fapr$X)[[3]][1:10] dim(fapr$X[,,markers])
Two feather_genoprob
objects using same area. Combine using cbind
. Notice that the order of chromosomes is reversed by joining fapr2
to fapr3
. Be sure to not overwrite existing feather databases!
fapr2 <- feather_genoprob(subset(apr, chr="2"), "aprx", tmpdir) fapr3 <- feather_genoprob(subset(apr, chr="3"), "aprx", tmpdir)
fapr32 <- cbind(fapr3,fapr2)
dim(fapr32)
list.files(tmpdir)
feather_genoprob
objectHere are the names of elements in a feather_genoprob
object:
names(unclass(fapr))
unclass(fapr)$feather
sapply(unclass(fapr)[c("ind","chr","mar")], length)
A feather_genoprob
object has all the original information. Thus, it is possible to restore the original object from a subset
(but not necessarily from a cbind
or rbind
). Here is an example.
fapr23 <- subset(fapr, chr=c("2","3")) dim(fapr23)
dim(feather_genoprob_restore(fapr23))
Compare times to create pr
once and to Read fpr
100 times.
system.time(pr <- calc_genoprob(DOex, error_prob=0.002))
system.time(fpr <- feather_genoprob(pr, "pr", tmpdir, verbose = FALSE))
tmpfile <- tempfile() saveRDS(pr, file = tmpfile) size_pr <- file.size(tmpfile) * 1e-6 unlink(tmpfile)
The pr
object is r format(object.size(pr), units = "MB")
in R
or r size_pr
Mb if saved as RDS file, while the fpr
object is r format(object.size(fpr), units = "MB")
in R with additional
r sum(file.size(unclass(fpr)$feather)) * 1e-6
Mb for saved feather databases.
Extract chromosome "2"
100 times.
system.time({ for(i in seq(100)) tmp <- fpr[["2"]] })
Extract first 50 markers from chromosome "2"
100 times.
markers <- dimnames(fpr)$mar[["2"]][1:50] system.time({ for(i in seq(100)) tmp <- fpr[["2"]] })
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.