options(width=100)
knitr::opts_chunk$set(
    eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
suppressPackageStartupMessages({
    library(ENAR2016)
    library(ggplot2)
    library(airway)
    library(DESeq2)
    library(org.Hs.eg.db)
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    library(BSgenome.Hsapiens.UCSC.hg19)
    library(AnnotationHub)
})

Introduction

R

Functions

rnorm(10)     # sample 10 random normal deviates

Vectors

x <- rnorm(1000)
y <- x + rnorm(1000)
plot(y ~ x)

Objects: classes and methods

df <- data.frame(Y = y, X = x)
fit <- lm(Y ~ X, df)
plot(Y ~ X, df)
abline(fit, col="red", lwd=2)
anova(fit)

Packages

library(ggplot2)
ggplot(df, aes(x=X, y=Y)) + geom_point() + geom_smooth(color="blue") +
    geom_smooth(method="lm", color="red")

Help!

Bioconductor

Biological domains

Principles

Installation and use

Sequence Analysis

Six stages

Key packages: data access

Coordinated management: SummarizedExperiment

Case Study: RNA-Seq Differential Expression

This is derived from: RNA-Seq workflow: gene-level exploratory analysis and differential expression, by Michael Love, Simon Anders, Wolfgang Huber; modified by Martin Morgan, October 2015.

We walk through an end-to-end RNA-Seq differential expression workflow, using DESeq2 along with other Bioconductor packages.

The complete work flow starts from the FASTQ files, but we will start after reads have been aligned to a reference genome and reads overlapping known genes have been counted.

Our main focus is on differential gene expression analysis with DESeq2. Other Bioconductor packages are important in statistical inference of differential expression at the gene level, including Rsubread, edgeR, limma, BaySeq, and others.

The data are from an RNA-Seq experiment of airway smooth muscle cells treated with dexamethasone, a synthetic glucocorticoid steroid with anti-inflammatory effects. Glucocorticoids are used, for example, in asthma patients to prevent or reduce inflammation of the airways.

In the experiment, four primary human airway smooth muscle cell lines were treated with 1 micromolar dexamethasone for 18 hours. For each of the four cell lines, we have a treated and an untreated sample. The reference for the experiment is:

Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM, Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R Jr, Tantisira KG, Weiss ST, Lu Q. "RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells." PLoS One. 2014 Jun 13;9(6):e99625. PMID: 24926665. GEO: GSE52778.

Preparation

Counts

SummarizedExperiment

library(airway)
data("airway")
se <- airway

Assay data

Experimental design

Features (genes)

Analysis Pipeline

Create DESeqDataSet object

library(DESeq2)
dds <- DESeqDataSet(se, design = ~ cell + dex)

Run the pipeline

dds <- DESeq(dds)

Summarize results

res <- results(dds)
head(res)
mcols(res)                     # what does each column mean?
head(res[order(res$padj),])    # 'top table' by p-adj

Some Considerations

Experimental design

Keep it simple

Replicate

Avoid confounding experimental factors with other factors

Be aware of batch effects

HapMap samples from one facility, ordered by date of processing.

Wet-lab

Confounding factors

Artifacts of your particular protocols

Sequencing

Axes of variation

Application-specific, e.g.,

Alignment / Reduction

Alignment

Splice-aware aligners (and Bioconductor wrappers)

Reduction to 'count tables'

(Bowtie2 / tophat / Cufflinks / Cuffdiff / etc)

(kallisto / sailfish)

tx2gene links transcript IDs to gene IDs for summarization

tx2gene <- read.csv(file.path(dir, "tx2gene.csv"))

txi <- tximport(files, type="salmon", tx2gene=tx2gene)

Statistical Considerations

Unique statistical aspects

Summarization

Normalization

Appropriate error model

Pre-filtering

Borrowing information

Comprehension: placing differentially expressed regions in context

'Annotation' packages

Challenges & Solutions

Interoperability: Standard Representations

Genomic Ranges: GenomicRanges

Data / metadata integration: SummarizedExperiment

Scalability

Efficient R programming

Restriction and 'chunk' iteration

Parallel evaluation: BiocParallel

Reproducibility: From Script to Package

Scripts

Vignettes

Packages

Acknowledgments

Individuals

Annual conference: https://bioconductor.org/BioC2016

sessionInfo()






Bioconductor/ENAR2016 documentation built on May 6, 2019, 7:49 a.m.