BiocStyle::markdown()
suppressPackageStartupMessages({
    library(rtracklayer)
    library(BiocParallel)
    library(GenomicFiles)
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
})

Scalable computing

Efficient R code

Iteration

Restriction

Sampling

Parallel evaluation

File management

File classes

| Type   | Example use           | Name                    | Package                        |
|--------|-----------------------|-------------------------|--------------------------------|
| .bed   | Range annotations     | BedFile()               | r Biocpkg("rtracklayer")       |
| .wig   | Coverage              | WigFile(), BigWigFile() | r Biocpkg("rtracklayer")       |
| .gtf   | Transcript models     | GTFFile()               | r Biocpkg("rtracklayer")       |
|        |                       | makeTxDbFromGFF()       | r Biocpkg("GenomicFeatures")   |
| .2bit  | Genomic Sequence      | TwoBitFile()            | r Biocpkg("rtracklayer")       |
| .fastq | Reads & qualities     | FastqFile()             | r Biocpkg("ShortRead")         |
| .bam   | Aligned reads         | BamFile()               | r Biocpkg("Rsamtools")         |
| .tbx   | Indexed tab-delimited | TabixFile()             | r Biocpkg("Rsamtools")         |
| .vcf   | Variant calls         | VcfFile()               | r Biocpkg("VariantAnnotation") |
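
These classes reference a file without reading its contents into memory; data are imported or queried on demand. A minimal sketch, assuming hypothetical files "regions.bed" and "example.bam" exist:

library(rtracklayer)
library(Rsamtools)

gr <- import("regions.bed")      # read range annotations into a GRanges
bam <- BamFile("example.bam")    # lightweight reference; alignments not yet read
seqinfo(bam)                     # query the header without reading alignments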

## rtracklayer menagerie
library(rtracklayer)
names(getClass("RTLFile")@subclasses)

Notes

Managing a collection of files

*FileList() classes

GenomicFiles()
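
A sketch of both approaches, assuming hypothetical BAM paths "a.bam" and "b.bam": the *FileList() classes bundle several files, while GenomicFiles() additionally pairs files with the ranges to be queried.

library(GenomicFiles)    # also attaches Rsamtools and GenomicRanges

fls <- c("a.bam", "b.bam")                   # hypothetical paths
bfl <- BamFileList(fls)                      # manage a collection of BAM files
gf <- GenomicFiles(rowRanges = GRanges("chr1", IRanges(1, 100000)),
                   files = fls)              # ranges-by-files container
gf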

Parallel evaluation with BiocParallel

Standardized interface for simple parallel evaluation.
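
A minimal sketch of the interface: register a backend, then use bplapply() as a drop-in replacement for lapply(). The choice of MulticoreParam() is an assumption and applies to Unix-alikes; SnowParam() is the usual choice on Windows.

library(BiocParallel)

register(MulticoreParam(workers = 2))   # assumed backend; SnowParam() on Windows
res <- bplapply(1:4, sqrt)              # same call pattern as lapply()
unlist(res)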

Other resources

Practical

Efficient code

Define the following as a function.

f0 <- function(n) {
    ## inefficient!
    ans <- numeric()
    for (i in seq_len(n))
        ans <- c(ans, exp(i))
    ans
}

Use system.time() to explore how long this takes to execute as n increases from 100 to 10000. Use identical() and r CRANpkg("microbenchmark") to compare the alternatives f1(), f2(), and f3() for both correctness and performance. What strategy does each function use? A sketch of the comparison follows the definitions below.

f1 <- function(n) {
    ans <- numeric(n)
    for (i in seq_len(n))
        ans[[i]] <- exp(i)
    ans
}

f2 <- function(n)
    sapply(seq_len(n), exp)

f3 <- function(n)
    exp(seq_len(n))
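
A sketch of the comparison described above (timings will vary by machine):

n <- 10000
system.time(f0(n))                 # f0() is the slow baseline
identical(f0(n), f1(n))
identical(f1(n), f2(n))
identical(f2(n), f3(n))

library(microbenchmark)
microbenchmark(f1(n), f2(n), f3(n), times = 10)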

Sleeping serially and in parallel

Write a function that sleeps for 1 second, then returns i. Evaluating it serially over 8 values takes about 8 seconds; the parallel version should take much less. A timing sketch follows the definitions below.

library(BiocParallel)

fun <- function(i) {
    Sys.sleep(1)
    i
}

## serial
f0 <- function(n)
    lapply(seq_len(n), fun)

## parallel
f1 <- function(n)
    bplapply(seq_len(n), fun)
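
A sketch of timing the two versions with 8 tasks; the serial version takes about 8 seconds, while the parallel version should take roughly 8 seconds divided by the number of workers in the registered backend.

system.time(f0(8))    # ~8 seconds: tasks sleep one after another
system.time(f1(8))    # much less: tasks sleep concurrently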

Reads overlapping windows

This exercise uses the following packages:

library(GenomicAlignments)
library(GenomicFiles)
library(BiocParallel)
library(Rsamtools)
library(GenomeInfoDb)

This is a re-implementation of the basic r Biocpkg("csaw") binned counts algorithm. It supposes that ChIP fragment lengths are 110 nt, and that we bin coverage in windows of width 50. We focus on chr1.

frag.len <- 110
spacing <- 50
chr <- "chr1"

Here we point to the BAM files, indicating that we'll process each file in chunks of 1,000,000 records (yieldSize).

fls <- dir("~/UseBioconductor-data/ChIPSeq/NFYA/", ".BAM$", full=TRUE)
names(fls) <- sub(".fastq.*", "", basename(fls))
bfl <- BamFileList(fls, yieldSize=1000000)

We'll create the counting bins using tileGenome(), focusing on the 'standard' chromosomes.

len <- seqlengths(keepStandardChromosomes(seqinfo(bfl)))[chr]
tiles <- tileGenome(len, tilewidth=spacing, cut.last.tile.in.chrom=TRUE)

We'll use reduceByYield() to iterate through a single file. We need to tell this function how we'll YIELD a chunk of the file, how we'll MAP the chunk from its input representation to the per-window counts, and finally how we'll REDUCE successive chunks into a final representation.

YIELD is a function that takes one argument, the input source, and returns a chunk of records:

yield <- function(x, ...)
    readGAlignments(x)

MAP must take the output of yield and perhaps additional arguments, and return a vector of counts. We'll resize the genomic ranges describing the alignments so that they have a width equal to the fragment length:

map <- function(x, tiles, frag.len, ...) {
    ## extend each alignment to the fragment length, then count per tile
    gr <- keepStandardChromosomes(granges(x))
    countOverlaps(tiles, resize(gr, frag.len))
}

REDUCE takes two results from MAP (in our case, vectors of counts) and combines them into a single result. We simply add our vectors (+ is actually a function!)

reduce <- `+`
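
To see that `+` really is an ordinary function that can be called with two arguments, just as REDUCE will be called on successive pairs of count vectors:

`+`(c(1, 2, 3), c(10, 20, 30))    # identical to c(1, 2, 3) + c(10, 20, 30)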

To process one file, we use reduceByYield(), passing the file we want to process, the yield, map, and reduce functions. Our 'wrapper' function passes any additional arguments through to reduceByYield() using ...:

count1file <- function(bf, ...)
    reduceByYield(bf, yield, map, reduce, ...)
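
As a sketch (assuming the NFYA BAM files referenced above are available), a single file can be counted directly; the tiles= and frag.len= arguments reach map() via ...:

res <- count1file(bfl[[1]], tiles = tiles, frag.len = frag.len)
length(res)    # one count per tile
sum(res)       # total tile-by-fragment overlaps on chr1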

Using yieldSize and reduceByYield() means that we do not consume too much memory processing each file, so that we can process files in parallel using r Biocpkg("BiocParallel"). The simplify2array() function transforms a list-of-vectors to a matrix.

counts <- bplapply(bfl, count1file, tiles=tiles, frag.len=frag.len)
counts <- simplify2array(counts)
dim(counts)
colSums(counts)

Resources


