BiocStyle::markdown()
options(width=100, max.print=1000)
knitr::opts_chunk$set(
    eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
suppressPackageStartupMessages({
    library(BiocEMBO2015)
    library(SummarizedExperiment)
    library(airway)
})

Version: r packageDescription("BiocEMBO2015")$Version
Compiled: r date()

Objectives

Time | Topic
------------- | ----- 09:15 - 10:15 | Sequencing work flows and file types 10:15 | Tea/Coffee break
10:30 - 12:30 | Introduction to R and Bioconductor 12:30 | Lunch
13:30 -14:00 | Scalable computing

Sequencing work flows

  1. Experimental design
    • Keep it simple, e.g., 'control' and 'treatment' groups
    • Replicate within treatments!
  2. Wet-lab sequence preparation (figure from http://rnaseq.uoregon.edu/)

    • Record covariates, including processing day -- likely 'batch effects'
  3. (Illumina) Sequencing (Bentley et al., 2008, doi:10.1038/nature07517

  4. Alignment

    • Choose to match task, e.g., Rsubread, Bowtie2 good for ChIPseq, some forms of RNAseq; BWA, GMAP better for variant calling
    • Primary output: BAM files of aligned reads
  5. Reduction
    • e.g., RNASeq 'count table' (simple spreadsheets), DNASeq called variants (VCF files), ChIPSeq peaks (BED, WIG files)
  6. Analysis
    • Differential expression, peak identification, ...
  7. Comprehension
    • Biological context

Alt Sequencing Ecosystem

Sequence data representations

DNA / amino acid sequences: FASTA files

Input & manipulation: Biostrings

>NM_078863_up_2000_chr2L_16764737_f chr2L:16764737-16766736
gttggtggcccaccagtgccaaaatacacaagaagaagaaacagcatctt
gacactaaaatgcaaaaattgctttgcgtcaatgactcaaaacgaaaatg
...
atgggtatcaagttgccccgtataaaaggcaagtttaccggttgcacggt
>NM_001201794_up_2000_chr2L_8382455_f chr2L:8382455-8384454
ttatttatgtaggcgcccgttcccgcagccaaagcactcagaattccggg
cgtgtagcgcaacgaccatctacaaggcaatattttgatcgcttgttagg
...

Whole genomes: 2bit and .fa formats: rtracklayer, Rsamtools; BSgenome

Reads: FASTQ files

Input & manipulation: ShortRead readFastq(), FastqStreamer(), FastqSampler()

@ERR127302.1703 HWI-EAS350_0441:1:1:1460:19184#0/1
CCTGAGTGAAGCTGATCTTGATCTACGAAGAGAGATAGATCTTGATCGTCGAGGAGATGCTGACCTTGACCT
+
HHGHHGHHHHHHHHDGG<GDGGE@GDGGD<?B8??ADAD<BE@EE8EGDGA3CB85*,77@>>CE?=896=:
@ERR127302.1704 HWI-EAS350_0441:1:1:1460:16861#0/1
GCGGTATGCTGGAAGGTGCTCGAATGGAGAGCGCCAGCGCCCCGGCGCTGAGCCGCAGCCTCAGGTCCGCCC
+
DE?DD>ED4>EEE>DE8EEEDE8B?EB<@3;BA79?,881B?@73;1?########################

Aligned reads: BAM files (e.g., ERR127306_chr14.bam)

Input & manipulation: 'low-level' Rsamtools, scanBam(), BamFile(); 'high-level' GenomicAlignments

Called variants: VCF files

Input and manipulation: VariantAnnotation readVcf(), readInfo(), readGeno() selectively with ScanVcfParam().

Genome annotations: BED, WIG, GTF, etc. files

Input: rtracklayer import()

R

Language and environment for statistical computing and graphics

Vector, class, object

Function, generic, method

Introspection

Help

Example

x <- rnorm(1000)                   # atomic vectors
y <- x + rnorm(1000, sd=.5)
df <- data.frame(x=x, y=y)         # object of class 'data.frame'
plot(y ~ x, df)                    # generic plot, method plot.formula
fit <- lm(y ~x, df)                # object of class 'lm'
methods(class=class(fit))          # introspection

Bioconductor

Overview

Analysis and comprehension of high-throughput genomic data

Packages, vignettes, work flows

Objects

Example

require(Biostrings)                     # Biological sequences
data(phiX174Phage)                      # sample data, see ?phiX174Phage
phiX174Phage
m <- consensusMatrix(phiX174Phage)[1:4,] # nucl. x position counts
polymorphic <- which(colSums(m != 0) > 1)
m[, polymorphic]
methods(class=class(phiX174Phage))
selectMethod(reverseComplement, class(phiX174Phage))

Alt Sequencing Ecosystem

A sequence analysis package tour

This very open-ended topic points to some of the most prominent Bioconductor packages for sequence analysis. Use the opportunity in this lab to explore the package vignettes and help pages highlighted below; many of the material will be covered in greater detail in subsequent labs and lectures.

Basics

Domain-specific analysis -- explore the landing pages, vignettes, and reference manuals of two or three of the following packages.

Working with sequences, alignments, common web file formats, and raw data; these packages rely very heavily on the IRanges / GenomicRanges infrastructure that we will encounter later in the course.

Visualization

DNA or amino acid sequences: Biostrings, ShortRead, BSgenome

Classes

Methods --

Related packages

Example

  require(BSgenome.Hsapiens.UCSC.hg19)
  chr14_range = GRanges("chr14", IRanges(1, seqlengths(Hsapiens)["chr14"]))
  chr14_dna <- getSeq(Hsapiens, chr14_range)
  letterFrequency(chr14_dna, "GC", as.prob=TRUE)

Ranges: GenomicRanges, IRanges

Ranges represent: - Data, e.g., aligned reads, ChIP peaks, SNPs, CpG islands, ... - Annotations, e.g., gene models, regulatory elements, methylated regions - Ranges are defined by chromosome, start, end, and strand - Often, metadata is associated with each range, e.g., quality of alignment, strength of ChIP peak

Many common biological questions are range-based - What reads overlap genes? - What genes are ChIP peaks nearest? - ...

The GenomicRanges package defines essential classes and methods

Alt

Alt

Range operations

Alt Ranges Algebra

Ranges - IRanges - start() / end() / width() - List-like -- length(), subset, etc. - 'metadata', mcols() - GRanges - 'seqnames' (chromosome), 'strand' - Seqinfo, including seqlevels and seqlengths

Intra-range methods - Independent of other ranges in the same object - GRanges variants strand-aware - shift(), narrow(), flank(), promoters(), resize(), restrict(), trim() - See ?"intra-range-methods"

Inter-range methods - Depends on other ranges in the same object - range(), reduce(), gaps(), disjoin() - coverage() (!) - see ?"inter-range-methods"

Between-range methods - Functions of two (or more) range objects - findOverlaps(), countOverlaps(), ..., %over%, %within%, %outside%; union(), intersect(), setdiff(), punion(), pintersect(), psetdiff()

Example

require(GenomicRanges)
gr <- GRanges("A", IRanges(c(10, 20, 22), width=5), "+")
shift(gr, 1)                            # 1-based coordinates!
range(gr)                               # intra-range
reduce(gr)                              # inter-range
coverage(gr)
setdiff(range(gr), gr)                  # 'introns'

IRangesList, GRangesList - List: all elements of the same type - Many *List-aware methods, but a common 'trick': apply a vectorized function to the unlisted representaion, then re-list

    grl <- GRangesList(...)
    orig_gr <- unlist(grl)
    transformed_gr <- FUN(orig)
    transformed_grl <- relist(, grl)

Reference

Aligned reads: GenomicAlignments, Rsamtools

Classes -- GenomicRanges-like behaivor

Methods

Example

require(GenomicRanges)
require(GenomicAlignments)
require(Rsamtools)

## our 'region of interest'
roi <- GRanges("chr14", IRanges(19653773, width=1)) 
## sample data
require('RNAseqData.HNRNPC.bam.chr14')
bf <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[[1]], asMates=TRUE)
## alignments, junctions, overlapping our roi
paln <- readGAlignmentsList(bf)
j <- summarizeJunctions(paln, with.revmap=TRUE)
j_overlap <- j[j %over% roi]

## supporting reads
paln[j_overlap$revmap[[1]]]

Called variants: VariantAnnotation, VariantFiltering

Classes -- GenomicRanges-like behavior

Functions and methods

Example

  ## input variants
  require(VariantAnnotation)
  fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
  vcf <- readVcf(fl, "hg19")
  seqlevels(vcf) <- "chr22"
  ## known gene model
  require(TxDb.Hsapiens.UCSC.hg19.knownGene)
  coding <- locateVariants(rowRanges(vcf),
      TxDb.Hsapiens.UCSC.hg19.knownGene,
      CodingVariants())
  head(coding)

Related packages

Reference

Integrated data representations: SummarizedExperiment

SummarizedExperiment

Annotation: org, TxDb, AnnotationHub, biomaRt, ...

Scalable computing

  1. Efficient R code
  2. Vectorize!
  3. Reuse others' work Know -- DESeq2, GenomicRanges, Biostrings, dplyr, data.table, Rcpp
  4. Iteration
  5. Chunk-wise
  6. open(), read chunk(s), close().
  7. e.g., yieldSize argument to Rsamtools::BamFile()
  8. Restriction
  9. Limit to columns and / or rows of interest
  10. Exploit domain-specific formats, e.g., BAM files and Rsamtools::ScanBamParam()
  11. Use a data base
  12. Sampling
  13. Iterate through large data, retaining a manageable sample, e.g., ShortRead::FastqSampler()
  14. Parallel evaluation
  15. After writing efficient code
  16. Typically, lapply()-like operations
  17. Cores on a single machine ('easy'); clusters (more tedious); clouds

Parallel evaluation in Bioconductor

Resources

R / Bioconductor

Publications (General Bioconductor)

Other

http://bioconductor.org/help/course-materials/2014/BioC2014/Lawrence_Talk.pdf


Bioconductor/BiocEMBO2015 documentation built on May 6, 2019, 7:48 a.m.