These notes were created during the course, and server as a transcript of topics covered.

Intro to sequencing

Workflow

  1. Experimental design
  2. Wet lab sample prep, etc
  3. Sequencing
    • FASTQ file of reads and their quality scores
    • Quality assessment (FASTQ program), trimming or removing contanimants, removing optical duplicates (FASTX, trimomatic)
    • Quality with respect to your research question
  4. Alignment / (assembly)
    • BAM file of aligned reads to a known reference genome
    • Aligners: vary from simple to use to hard to use, from 'good enough' alignments (for RNA-seq of known genes, ChIP-seq) to high-quality (e.g., DNA-seq calling variants)
    • Bowtie2 (easy, good enough), gmap (excellent, hard to use).
    • Purpose-built tools that align and reduce. E.g., RNA-seq known gene differential expression -- kalisto, sailfish
  5. Reduction
    • BED of called peaks in a ChIP-seq experiment (e.g., MACS, FindPeaks)
    • VCF of called variants (GATK, bcftools)
    • Count table (e.g., tsv) in an RNA-seq experiment (python htseq2; GenomicFeatures::summarizeOverlaps())
  6. (Statistical) analysis
    • Why statistical analysis? data is fundamentally huge; biological questions are framed in terms of classical statistics, e.g., designed experiments, hypothesis testing; technical and other artifacts, e.g., GC bias, mapability, batch effects
    • Appropriate tools: able to cope with statistics; access to advanced statistical methods; analysis has to be reproducible (some sort of scripting); processing large amounts of data is not the primary criterion.
    • R / Bioconductor is the best most awesome tool.
  7. Comprehension
    • .Rmd or similar documenting the work flow, including inputs, analysis steps, tables, figures, interpertation...

FASTQ and BAM files

View from the Linux command line...

... or within R / Bioconductor: fastq files

library(ShortRead)
strm = FastqStreamer("bigdata/SRR1039508_1.fastq.gz", 100000)
fq = yield(strm)
fq
sread(fq)
quality(fq)

R

x = rnorm(1000)
y = x + rnorm(1000, sd=.5)
df = data.frame(x=x, y=y)
plot(y ~ x, df)
fit = lm(y ~ x, df)
class(fit)
methods(class=class(fit))
methods("anova")

Help!

?log
?plot    # generic 'plot'
?plot.lm # plot for objects of class 'lm'

Bioconductor

Extensive use of 'S4' classes

library(ShortRead)
strm = FastqStreamer("bigdata/SRR1039508_1.fastq.gz", 100000)
fq = yield(strm)          # 'ShortReadQ' S4 class
class(fq)                 # introspection
methods(class=class(fq))  
reads = sread(fq)         # accessor -- get the reads
reads                     # 'DNAStringSet' S4 class
methods(class=class(reads))
gc = letterFrequency(reads, "GC", as.prob=TRUE)
hist(gc)

Help!

?DNAStringSet      # class, and often frequently used methods
?letterFrequency   # generic
methods("letterFrequency")
?"letterFrequency,XStringSet-method"

And...

Key software packages...

... and classes

Annotation

Strategies for working with big data

All material on the course materials page



Bioconductor/BiocEMBO2015 documentation built on May 6, 2019, 7:48 a.m.