knitr::opts_chunk$set(echo = TRUE, fig.width = 7, fig.height = 5)

The motivation is to reduce memory usage. Matt told us that memory is his greatest obstacle, when running QTL browsers because he cannot add more R/plumber workers. Also, if you have a cheap Windows laptop you cannot even open many DO .RData projects simply because haplotype probability object is too big.

The solution would be to store data in some database and load just a slice you need at a time. I experimented with SQLite but it was too slow. Then I try feather and it run fine (R script attached – crunching 1000 SNPs at a time). It just does not use qtl2scan effectively because the package is not prepared for that.

Task: refactor this Rmd to use DOex and run some examples.

suppressPackageStartupMessages({
  library(qtl2)
  library(qtl2feather)
})
DOex <- 
  read_cross2(
    file.path("https://raw.githubusercontent.com/rqtl",
              "qtl2data/master/DOex",
              "DOex.zip"))

Calculate genotype probabilities and convert to allele probabilities

pr <- calc_genoprob(DOex, error_prob=0.002)
apr <- genoprob_to_alleleprob(pr)

Write feather database of pr.

tmpdir <- tempdir()
fpr <- feather_genoprob(pr, "pr", tmpdir)
fapr <- feather_genoprob(apr, "apr", tmpdir)
list.files(tmpdir)

Methods names() and length() both work properly.

names(fpr)
length(fpr)

Methods dim() and dimnames() give dimensions across the chromosomes. Notice that fpr has different genotypes for the X chromosome.

dim(fpr)
str(dimnames(fpr))

Selecting one chromosome

Selecting a chromosome causes reading from feather database and creation of an array.

dim(fapr[["X"]])

Alternate form.

dim(fapr$X)

Subsetting by ind, chr, mar

Subsetting using subset(object,ind,chr,mar) or [ind,chr,mar] only adjusts the vector of individuals, chromosomes and markers, but does not alter the feather database.

dim(subset(fapr, ind=1:20, chr=c("2","3")))

You can also use [] to subset.

dim(fapr[1:20, c("2","3")][["3"]])
f2 <- fapr[,"2"]
f23 <- fapr[,c("2","3")]
fx <- fapr[,"X"]

There is a third dimension for markers. However, be careful that if you select a subset of markers that excludes one or more chromosomes, those will be dropped.

dim(subset(fapr, mar=1:30))
dim(fapr[ , , dimnames(fapr)$mar$X[1:20]])

Binding by columns or rows.

Binding by columns (chromosomes) or rows (individuals) may cause creation of a new feather database if input objects arose from different feather databases. However, if objects are subsets of the same feather_genoprob object, then it reuses the one feather database. Further, if objects have the same directory and file basename for their feather databases, they will be combined without creation of any new feather databases. See example(cbind.feather_genoprob) and example(rbind.feather_genoprob) with objects having distinct feather databases.

newf23 <- cbind(f2,f23)

Row bind.

f23a <- fapr[1:20, c("2","3")]
f23b <- fapr[40:79, c("2","3")]
f23 <- rbind(f23a, f23b)

Subset on markers. This way only extracts the selected markers from feather database before creating the array.

markers <- dimnames(fapr$X)[[3]][1:10]
dim(fapr[,,markers]$X)

This way extracts all markers on X, creates the array, then subsets on selected markers.

markers <- dimnames(fapr$X)[[3]][1:10]
dim(fapr$X[,,markers])

Two feather_genoprob objects using same area. Combine using cbind. Notice that the order of chromosomes is reversed by joining fapr2 to fapr3. Be sure to not overwrite existing feather databases!

fapr2 <- feather_genoprob(subset(apr, chr="2"), "aprx", tmpdir)
fapr3 <- feather_genoprob(subset(apr, chr="3"), "aprx", tmpdir)
fapr32 <- cbind(fapr3,fapr2)
dim(fapr32)
list.files(tmpdir)

Looking under the hood at feather_genoprob object

Here are the names of elements in a feather_genoprob object:

names(unclass(fapr))
unclass(fapr)$feather
sapply(unclass(fapr)[c("ind","chr","mar")], length)

A feather_genoprob object has all the original information. Thus, it is possible to restore the original object from a subset (but not necessarily from a cbind or rbind). Here is an example.

fapr23 <- subset(fapr, chr=c("2","3"))
dim(fapr23)
dim(feather_genoprob_restore(fapr23))

Time trial simulation

Compare times to create pr once and to Read fpr 100 times.

system.time(pr <- calc_genoprob(DOex, error_prob=0.002))
system.time(fpr <- feather_genoprob(pr, "pr", tmpdir, verbose = FALSE))
tmpfile <- tempfile()
saveRDS(pr, file = tmpfile)
size_pr <- file.size(tmpfile) * 1e-6
unlink(tmpfile)

The pr object is r format(object.size(pr), units = "MB") in R or r size_pr Mb if saved as RDS file, while the fpr object is r format(object.size(fpr), units = "MB") in R with additional r sum(file.size(unclass(fpr)$feather)) * 1e-6 Mb for saved feather databases.

Time to extract chromosome information.

Extract chromosome "2" 100 times.

system.time({
  for(i in seq(100))
    tmp <- fpr[["2"]]
})

Extract first 50 markers from chromosome "2" 100 times.

markers <- dimnames(fpr)$mar[["2"]][1:50]
system.time({
  for(i in seq(100))
    tmp <- fpr[["2"]]
})


byandell/qtl2feather documentation built on May 13, 2019, 9:30 a.m.