knitr::opts_chunk$set(echo = TRUE)
This gives a brief look at the compressed data used in qtl2shiny. See qtl2shinyData.Rmd for details about how the data are constructed. This document is ShinyCompress.Rmd.
The key compressed data files use SQL or FST compression. For details on FST, see
For more info on fst, which was developed for R, see
Assume here that ResearchDrive /Volumes/byandell
.
The data for this project are in /Volumes/byandell/byandell/qtl2shinyApp/
, which contains
app.R
: The shiny app fill to execute with shiny::runApp()
qtl2shinyData
: folder with dataCCmouse
: taxa-specific dataprojects.csv
: master file for data used in the shiny appIn the CCmouse
folder are the following files of import right here:
cc_variants.sqlite
: SQL of DNA variants (3G)mouse_genes.sqlite
: SQL of mouse genes (677M)Recla/genoprob
: folder of genotype probabilities (1.4G)Recla/genoprob/aprobs_fstindex.rds
: (69K)Recla/genoprob/aprobs_11.fst
: (6.3M)This is the small problem (Recla
). The larger problem (AttieDOv2
) is much larger
AttieDOv2/genoprob
: folder of genotype probabilities (11G)AttieDOv2/genoprob/aprobs_fstindex.rds
: (644K)AttieDOv2/genoprob/aprobs_11.fst
: (116M)datapath <- params$datapath dir(datapath)
project_name <- as.character(params$projectName) # "Recla" or "AttieDOv2" chri <- as.character(params$chr) start_val <- as.numeric(params$start) end_val <- as.numeric(params$end)
ccpath <- file.path(datapath, "qtl2shinyData", "CCmouse")
Size in GB of two project genoprob folders:
size_files <- NULL for(i in c("Recla","AttieDOv2")) { files<-list.files(file.path(ccpath, i, "genoprob"),full.names = T) vect_size <- sapply(files, file.size) size_files[i] <- sum(vect_size) } round(size_files / 1e9, 2)
Here we show several ways to access the genotype probabilities, which are stored
in FST files. We begin with a simple call to R/fst
, followed by a call using
R/qtl2fst
, then a call using R/qtl2pattern
.
map <- readRDS(file.path(file.path(ccpath, project_name), "pmap.rds")) # Select chromosome chri map <- map[chri] # Select markers between start_val and end_val wh <- which(map[[chri]] >= start_val & map[[chri]] <= end_val) (map[[chri]] <- map[[chri]][wh])
R/fst
The FST files are compressed 2D tables. Fof qtl2 data, the 3D genotype probabilities are stored with rows having genotypes and subjects, and the columns having the markers. Here is a raw R/fst
read of the file to pull out the r length(wh)
markers (columns), and the first genotype and first 8 subjects (rows):
fstfile <- file.path(ccpath, project_name ,"genoprob", paste0("aprobs_", chri, ".fst")) fst::read_fst(fstfile, columns = names(map[[chri]]), from = 1, to = 8)
R/RSQLite
dbfile <- file.path(ccpath, "cc_variants.sqlite") db <- RSQLite::dbConnect(RSQLite::SQLite(), dbfile) start <- round(start_val * 1e+06) end <- round(end_val * 1e+06) # Shorten distance to reduce time for demo ave <- (start + end) / 2 start <- ave + (start - ave) / 10 end <- ave + (end - ave) / 10 chrselect <- paste0("chr == '", chri, "'") startselect <- paste0("pos >= ", start) endselect <- paste0("pos <= ", end) query <- paste0("SELECT * FROM variants WHERE ", chrselect, " AND ", startselect, " AND ", endselect) variants <- RSQLite::dbGetQuery(db, query) RSQLite::dbDisconnect(db)
str(variants)
The following approaches subsets by chromosome r chri
and by markers between r start_val
and r end_val
positions for project names r project_name
.
R/qtl2fst
A R/qtl2fst
object has a master object typically stored in an RDS file. This does not read the genotype data compressed vai FST,
but includes the information about what is in those FST files.
It is good practice to replace the path to where data are stored.
For information, see https://kbroman.org/qtl2/assets/vignettes/qtl2fst.html.
probs <- readRDS(file.path(ccpath, project_name, "genoprob", "aprobs_fstindex.rds")) (probs <- qtl2fst::replace_path(probs, file.path(ccpath, project_name, "genoprob", "aprobs")))
print(object.size(probs), units = "MB")
We now subset by chromosome r chri
, and by a range of positions
on that chromosome. This changes the master object probs
but does not change the underlying FST files.
# Which markers in saved data agree with map. wh <- match(names(map[[chri]]), dimnames(probs)$mar[[chri]]) (probs <- qtl2fst::subset_fst_genoprob(probs, chr = chri, mar = wh))
print(object.size(probs), units = "MB")
This is the size of the FST master object in R, while the data still resides at `r project_name
/genoprob/aprobs_r chri
.fst`.
paste( round( file.size( file.path(ccpath, project_name, "genoprob", paste0("aprobs_", chri, ".fst"))) / 1e6, 2), "Mb")
qtl2fst
objectThe probs
object is a small metadata file with information about the genoprob data, such as dimensions and dimnames of list elements, but no data.
We can extract data from the object by accessing elements of the 3D array on chromosome chri
. This is generally what is done in the R/qtl2
package when using this class of genotype probabilities. The following shows the 3 markers, first genotype, and first 8 subjects:
probs[[chri]][1:8,1,]
It is possible to extract the calc_genoprob
from the metadata object and compressed FST file(s).
If you try to do this for the whole chromosome, or all chromosomes, it will take some time.
probsq <- qtl2fst::fst_extract(probs) class(probsq)
print(object.size(probsq), units = "MB")
Here is a fake (unevaluated) extraction of all of chromosome r
chri` to show the system time and object size.
probs <- qtl2fst::subset_fst_genoprob(probs, chr = chri) system.time(probsq <- qtl2fst::fst_extract(probs))
## user system elapsed ## 0.102 0.251 44.674
print(object.size(probsq), units = "MB")
## 6.3 Mb
R/qtl2pattern
The following use functions in package
qtl2pattern.
The basics is the function read_probs
to read the genotype probabilities.
It actually gets both the probs
and the map
.
pr <- qtl2pattern:::read_probs(chr=chri, start_val=start_val, end_val=end_val, file.path(ccpath, project_name), allele = TRUE, method = "fast", probdir = "genoprob")
names(pr)
Again, we need to replace the path to access the data properly
(pr$probs <- qtl2fst::replace_path( pr$probs, file.path(ccpath, project_name, "genoprob", "aprobs")))
print(object.size(pr$probs), units = "MB")
R/qtl2pattern
Query functions provide a way to embed details about the data in a local function, and just focues on identifying the chromosome and marker positions.
(query_probs <- qtl2pattern::create_probs_query_func(file.path(ccpath, project_name)))
pr <- query_probs(chri, start_val, end_val)
The query function sets up a call to read_probs
, which is an internal function in qtl2pattern package described earlier.
Internally, we use a query function that embeds the read_probs
function call:
(query_variants <- qtl2::create_variant_query_func(file.path(ccpath, "cc_variants.sqlite")))
ls(envir = environment(query_variants))
get("dbfile", envir = environment(query_variants))
# Shorten distance to reduce time for demo ave <- (start_val + end_val) / 2 start <- ave + (start_val - ave) / 10 end <- ave + (end_val - ave) / 10
variants <- query_variants(chri, start, end)
str(variants)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.