BiocStyle::markdown() suppressPackageStartupMessages({ library(rtracklayer) library(BiocParallel) library(GenomicFiles) library(TxDb.Hsapiens.UCSC.hg19.knownGene) })
Efficient R code
r Biocpkg("DESeq2")
,
r Biocpkg("GenomicRanges")
, r Biocpkg("Biostrings")
, ...,
r CRANpkg("dplyr")
, r CRANpkg("data.table")
, r CRANpkg("Rcpp")
system.time()
, Rprof()
, r CRANpkg("microbenchmark")
Iteration
open()
, read chunk(s), close()
.yieldSize
argument to Rsamtools::BamFile()
Restriction
Rsamtools::ScanBamParam()
Sampling
ShortRead::FastqSampler()
Parallel evaluation
lapply()
-like operations| Type | Example use | Name | Package |
|-------|-----------------------|-----------------------------|----------------------------------|
| .bed | Range annotations | BedFile()
| r Biocpkg("rtracklayer")
|
| .wig | Coverage | WigFile()
, BigWigFile()
| r Biocpkg("rtracklayer")
|
| .gtf | Transcript models | GTFFile()
| r Biocpkg("rtracklayer")
|
| | | makeTxDbFromGFF()
| r Biocpkg("GenomicFeatures")
|
| .2bit | Genomic Sequence | TwoBitFile()
| r Biocpkg("rtracklayer")
|
| .fastq | Reads & qualities | FastqFile()
| r Biocpkg("ShortRead")
|
| .bam | Aligned reads | BamFile()
| r Biocpkg("Rsamtools")
|
| .tbx | Indexed tab-delimited | TabixFile()
| r Biocpkg("Rsamtools")
|
| .vcf | Variant calls | VcfFile()
| r Biocpkg("VariantAnnotation")
|
## rtracklayer menagerie library(rtracklayer) names(getClass("RTLFile")@subclasses)
Notes
open()
, close()
, import()
/ yield()
/ read*()
.bai
, bam index);
selection ('columns'); restriction ('rows')*FileList()
classes
reduceByYield()
-- iterate through a single large filebplapply()
(r Biocpkg("BiocParallel")
) -- perform independent
operations on several files, in parallelGenomicFiles()
reduceByRange()
, reduceByFile()
: collapse maps into summary
representationStandardized interface for simple parallel evaluation.
bplapply()
instead of lapply()
Argument BPPARAM
influences how parallel evaluation occurs
MulticoreParam()
: threads on a single (non-Windows) machineSnowParam()
: processes on the same or different machinesBatchJobsParam()
: resource scheduler on a clusterOther resources
r Biocpkg("GoogleGenomics")
to interact with google compute cloud
and resources
Define following as a function.
f0 <- function(n) { ## inefficient! ans <- numeric() for (i in seq_len(n)) ans <- c(ans, exp(i)) ans }
Use system.time()
to explore how long this takes to execute as n
increases from 100 to 10000. Use identical()
and
r CRANpkg("microbenchmark")
to compare alternatives f1()
, f2()
, and
f3()
for both correctness and performance of these three different
functions. What strategies are these functions using?
f1 <- function(n) { ans <- numeric(n) for (i in seq_len(n)) ans[[i]] <- exp(i) ans } f2 <- function(n) sapply(seq_len(n), exp) f3 <- function(n) exp(seq_len(n))
Go to sleep for 1 second, then return i
. This takes 8 seconds.
library(BiocParallel) fun <- function(i) { Sys.sleep(1) i } ## serial f0 <- function(n) lapply(seq_len(n), fun) ## parallel f1 <- function(n) bplapply(seq_len(n), fun)
This exercise uses the following packages:
library(GenomicAlignments) library(GenomicFiles) library(BiocParallel) library(Rsamtools) library(GenomeInfoDb)
This is a re-implementation of the basic r Biocpkg("csaw")
binned
counts algorithm. It supposes that ChIP fragment lengths are 110 nt,
and that we bin coverage in windows of width 50. We focus on chr1.
frag.len <- 110 spacing <- 50 chr <- "chr1"
Here we point to the bam files, indicating that we'll process the files in chunks of size 1,000,000.
fls <- dir("~/UseBioconductor-data/ChIPSeq/NFYA/", ".BAM$", full=TRUE) names(fls) <- sub(".fastq.*", "", basename(fls)) bfl <- BamFileList(fls, yieldSize=1000000)
We'll creating the counting bins using tileGenome()
, focusing the
'standard' chromosomes'
len <- seqlengths(keepStandardChromosomes(seqinfo(bfl)))[chr] tiles <- tileGenome(len, tilewidth=spacing, cut.last.tile.in.chrom=TRUE)
We'll use reduceByYield()
to iterate through a single file. We read
to tell this function we'll YIELD
a chunk of the file, how we'll
MAP
the chunk from it's input representation to the per-window
counts, and finally how we'll REDUCE
successive chunks into a final
representation.
YIELD
is supposed to be a function that takes one argument, the
input source, and returns a chunk of records
yield <- function(x, ...) readGAlignments(x)
MAP
must take the output of yield and perhaps additional arguments,
and return a vector of counts. We'll resize the genomic ranges
describing the alignment so that they have a width equal to the
fragment length
map <- function(x, tiles, frag.len, ...) { gr <- keepStandardChromosomes(granges(x)) countOverlaps(tiles, resize(gr, frag.len)) }
REDUCE
takes two results from MAP
(in our case, vectors of counts)
and combines them into a single result. We simply add our vectors (+
is actually a function!)
reduce <- `+`
To process one file, we use reduceByYield()
, passing the file we
want to process, the yield, map, and reduce functions. Our 'wrapper'
function passes any additional arguments through to reduceByYield()
using ...
:
count1file <- function(bf, ...) reduceByYield(bf, yield, map, reduce, ...)
Using yieldSize
and reduceByYield()
means that we do not consume
too much memory processing each file, so that we can process files in
parallel using r Biocpkg("BiocParallel")
. The simplify2array()
function transforms a list-of-vectors to a matrix.
counts <- bplapply(bfl, count1file, tiles=tiles, frag.len=frag.len) counts <- simplify2array(counts) dim(counts) colSums(counts)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.