In Bioconductor/BiocEMBO2015: Course material for EMBO Practical Course: Analysis of High-Throughput Sequencing Data, Hinxton, UK, October, 2015

These notes were created during the course, and serve as a transcript of topics covered.

Intro to sequencing

Workflow

Experimental design
Wet lab sample prep, etc
Sequencing
- FASTQ file of reads and their quality scores
- Quality assessment (FastQC program), trimming or removing contaminants, removing optical duplicates (FASTX, Trimmomatic)
- Quality with respect to your research question
Alignment / (assembly)
- BAM file of aligned reads to a known reference genome
- Aligners: vary from simple to use to hard to use, from 'good enough' alignments (for RNA-seq of known genes, ChIP-seq) to high-quality (e.g., DNA-seq calling variants)
- Bowtie2 (easy, good enough), gmap (excellent, hard to use).
- Purpose-built tools that align and reduce. E.g., RNA-seq known gene differential expression -- kallisto, sailfish
Reduction
- BED of called peaks in a ChIP-seq experiment (e.g., MACS, FindPeaks)
- VCF of called variants (GATK, bcftools)
- Count table (e.g., tsv) in an RNA-seq experiment (Python HTSeq; GenomicAlignments::summarizeOverlaps())
(Statistical) analysis
- Why statistical analysis? Data is fundamentally huge; biological questions are framed in terms of classical statistics, e.g., designed experiments, hypothesis testing; technical and other artifacts, e.g., GC bias, mappability, batch effects
- Appropriate tools: able to cope with statistics; access to advanced statistical methods; analysis has to be reproducible (some sort of scripting); processing large amounts of data is not the primary criterion.
- R / Bioconductor is the best most awesome tool.
Comprehension
- .Rmd or similar documenting the work flow, including inputs, analysis steps, tables, figures, interpretation...

FASTQ and BAM files

View from the Linux command line...

zcat *fastq.gz | less
samtools view -h *bam

... or within R / Bioconductor: fastq files

library(ShortRead)
strm = FastqStreamer("bigdata/SRR1039508_1.fastq.gz", 100000)
fq = yield(strm)
fq
sread(fq)
quality(fq)
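
To make the "reads and their quality scores" idea concrete, here is a minimal base R sketch that decodes the quality line of a single FASTQ record. The record itself is made up for illustration, and the sketch assumes Phred+33 encoding (ASCII code minus 33), which is common for current Illumina data but should be checked for your own files.

```r
# One FASTQ record is four lines: identifier, sequence, separator, qualities.
# This record is a made-up example, not taken from the course data.
record <- c("@read1", "ACGTN", "+", "II5+#")

# Assumption: Phred+33 encoding, so ASCII code minus 33 is the Phred score
phred <- utf8ToInt(record[4]) - 33L

# A Phred score Q corresponds to an error probability of 10^(-Q / 10)
p_error <- 10^(-phred / 10)

# One row per base call: the base, its score, and its error probability
calls <- data.frame(
    base = strsplit(record[2], "")[[1]],
    phred = phred,
    p_error = p_error
)
print(calls)
```

In practice quality(fq) from ShortRead holds the same encoded strings for every read, so this arithmetic is what sits behind the quality summaries produced by quality assessment tools.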

R

Statistical programming language
Vectorized (works efficiently on vectors; vector notation is very expressive and compact)
Objects help to coordinate management of related data
Introspection helps discover what can be done with objects.

x = rnorm(1000)
y = x + rnorm(1000, sd=.5)
df = data.frame(x=x, y=y)
plot(y ~ x, df)
fit = lm(y ~ x, df)
class(fit)
methods(class=class(fit))
methods("anova")
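
As a small illustration of what "vectorized" means, the sketch below computes the same quantity with an explicit loop and with a single vector expression. The variable names are illustrative only.

```r
set.seed(123)
x <- rnorm(1000)

# Loop version: pre-allocate the result, then fill it one element at a time
loop_result <- numeric(length(x))
for (i in seq_along(x)) {
    loop_result[i] <- x[i]^2 + 1
}

# Vectorized version: one expression operates on the whole vector
vec_result <- x^2 + 1

# Both approaches give the same answer
all.equal(loop_result, vec_result)
```

The vectorized form is shorter, harder to get wrong, and usually much faster on long vectors, because the iteration happens in compiled code rather than in the R interpreter.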

Help!

?log
?plot    # generic 'plot'
?plot.lm # plot for objects of class 'lm'

Bioconductor

Main web site, including biocViews
Package landing pages, e.g., ChIPseeker
The support forum
1100+ packages for analysis and comprehension of high-throughput genomic data: sequencing (RNA, ChIP, variants, ...), microarray (expression, methylation, copy number, etc), flow cytometry, proteomics, imaging, ...

Extensive use of 'S4' classes

fit (from lm()) is an example of an S3 class
sread(fq) returned a DNAStringSet, an example of an S4 class

library(ShortRead)
strm = FastqStreamer("bigdata/SRR1039508_1.fastq.gz", 100000)
fq = yield(strm)          # 'ShortReadQ' S4 class
class(fq)                 # introspection
methods(class=class(fq))  
reads = sread(fq)         # accessor -- get the reads
reads                     # 'DNAStringSet' S4 class
methods(class=class(reads))
gc = letterFrequency(reads, "GC", as.prob=TRUE)
hist(gc)
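
The difference between S3 and S4 can be sketched with toy classes in base R. The names describe, toyfit, ToyReads and seqs are hypothetical and exist only for this illustration; they are not part of any Bioconductor package.

```r
library(methods)

# S3: the class is just an attribute, and methods are ordinary functions
# named <generic>.<class>
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "some object"
describe.toyfit <- function(x, ...) paste("toy fit with slope", x$slope)

s3_obj <- structure(list(slope = 2), class = "toyfit")
s3_result <- describe(s3_obj)   # dispatches to describe.toyfit

# S4: the class is formally defined with typed slots, and users go through
# accessor generics rather than reaching into the object directly
setClass("ToyReads", representation(seqs = "character"))
setGeneric("seqs", function(x) standardGeneric("seqs"))
setMethod("seqs", "ToyReads", function(x) x@seqs)

s4_obj <- new("ToyReads", seqs = c("ACGT", "GGCC"))
s4_result <- seqs(s4_obj)       # accessor, analogous to sread(fq)

isS4(s3_obj)   # FALSE
isS4(s4_obj)   # TRUE
```

The formal definition is what lets S4 check that an object is valid when it is created, which matters when many packages exchange the same data structures.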

Help!

?DNAStringSet      # class, and often frequently used methods
?letterFrequency   # generic
methods("letterFrequency")
?"letterFrequency,XStringSet-method"

And...

Key software packages...

ShortRead for FASTQ files
GenomicAlignments for aligned reads
VariantAnnotation for VCF files
rtracklayer import() to import BED, WIG, GFF, GTF, ..., files
Gviz for visualization of genomic data; ReportingTools for reports; shiny for interactive visualizations

... and classes

DNAStringSet, DNAString for sequence data
GRanges, GRangesList for representing coordinates in genome space
SummarizedExperiment (ExpressionSet): integrated data container: rows x columns (features x samples)
- assays()
- rowRanges() for annotations on rows
- colData() for column annotations
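
A minimal sketch of how these pieces fit together, assuming the SummarizedExperiment package is installed. The counts, coordinates and sample annotations are made up for illustration.

```r
library(SummarizedExperiment)   # also attaches GenomicRanges

# Toy count matrix: 3 features (rows) x 2 samples (columns)
counts <- matrix(1:6, nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))

# Genomic coordinates of each feature (row annotations)
row_ranges <- GRanges(
    seqnames = c("chr1", "chr1", "chr2"),
    ranges = IRanges(start = c(100, 500, 200), width = 50),
    strand = c("+", "-", "+")
)
names(row_ranges) <- rownames(counts)

# Sample descriptions (column annotations)
col_data <- DataFrame(treatment = c("control", "treated"),
                      row.names = colnames(counts))

se <- SummarizedExperiment(assays = list(counts = counts),
                           rowRanges = row_ranges,
                           colData = col_data)

# Subsetting keeps assays, row ranges and column data coordinated
on_chr1 <- as.vector(seqnames(rowRanges(se)) == "chr1")
se_sub <- se[on_chr1, se$treatment == "treated"]
dim(se_sub)
assay(se_sub, "counts")
```

The point of the container is the coordinated subsetting on the last lines: selecting features or samples cannot leave the counts out of sync with their annotations.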

Annotation

Pure 'data' packages
Identifier mapping org.* packages
Gene models with TxDb.* packages
Whole genome sequences BSgenome.* packages
biomaRt for accessing ENSEMBL-based biomarts; AnnotationHub for genome-scale annotation resources

Strategies for working with big data

Write efficient R code -- vectorized
Process data in chunks, e.g., FastqStreamer(), Rsamtools::BamFile(..., yieldSize=1000000); GenomicFiles::reduceByYield() (see examples on ?reduceByYield)
Process in parallel BiocParallel
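
The chunk-wise strategy can be sketched with FastqStreamer() and yield(), assuming the ShortRead package is installed. To stay self-contained the sketch writes a tiny made-up FASTQ file to a temporary location; with real data the chunk size would be much larger (e.g., the 100000 used earlier in these notes).

```r
library(ShortRead)

# Write a tiny three-record FASTQ file (made-up reads) for the illustration
fl <- tempfile(fileext = ".fastq")
writeLines(c(
    "@read1", "ACGTACGT", "+", "IIIIIIII",
    "@read2", "GGCCGGCC", "+", "IIIIIIII",
    "@read3", "ATATATAT", "+", "IIIIIIII"
), fl)

strm <- FastqStreamer(fl, n = 2)   # yield at most 2 records per chunk
n_reads <- 0L
gc_total <- 0
repeat {
    fq <- yield(strm)              # next chunk of records
    if (length(fq) == 0L) break    # an empty chunk means end of file
    # Accumulate small summaries; only one chunk is in memory at a time
    n_reads <- n_reads + length(fq)
    gc_total <- gc_total + sum(letterFrequency(sread(fq), "GC"))
}
close(strm)

n_reads
gc_total
```

Because each chunk is reduced to a few numbers before the next one is read, memory use depends on the chunk size rather than on the size of the file. GenomicFiles::reduceByYield() packages the same iterate-and-reduce pattern.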

All material on the course materials page

Bioconductor/BiocEMBO2015 documentation built on May 6, 2019, 7:48 a.m.
