In charles-plessy/CAGEr: Analysis of CAGE (Cap Analysis of Gene Expression) sequencing data for precise mapping of transcription start sites and promoterome mining

knitr::opts_chunk$set(cache  = TRUE)
knitr::opts_knit$set(verbose = TRUE)
# knitr::opts_chunk$set(dev=c('svg', 'png'))
options(width = 100)
ggplot2::theme_set(ggplot2::theme_bw())

Jump directly to the Analysis section if you are not interested in the details of the data processing.

Data load and QC in R

Metadata that varies from run to run:

LIBRARY       <- "<Insert library ID>"  # For example, the sequencing run ID.
PATH_TO_FILES <- "" # 
BS_GENOME     <- "<Insert name of a BSGenome>" # For example "BSgenome.Hsapiens.UCSC.hg38" or NULL

Load R libraries

requireNamespace("AnnotationHub") # Make sure the package is available; but do not load it.
requireNamespace("pheatmap")      # Make sure the package is available; but do not load it.
library("CAGEr")
library("ggplot2")
library("magrittr")
library("SummarizedExperiment")
library("vegan")

Create CAGEexp object and load data

pathsToInputFiles <- list.files(PATH_TO_FILES, "*.bed", full.names = TRUE) %>%
  Filter(f = function(file) file.size(file) > 0)
samplenames <- basename(pathsToInputFiles) %>% sub(".bed", "", .)
ce <-
  new( "CAGEexp"
     , colData = DataFrame( inputFiles     = pathsToInputFiles
                          , sampleLabels   = sampleLabels
                          , inputFilesType = "bed"
                          , row.names      = sampleLabels)
     , metadata = list(genomeName = "BS_GENOME"))
getCTSS(ce)
normalizeTagCount(ce, method = "simpleTpm")

Annotation with GENCODE

Here we use Bioconductor's annotation hub to retreive GENCODE, but any other annotation can be loaded directly from a file; see the Zv9_annot documentation page for an example of how an annotation file was created for zebrafish using Biomart.

The annotateCTSS and CTSStoGenes modify the CAGEexp object; see their documentation for details.

ah <- AnnotationHub::AnnotationHub()
# Search for the annotation you need.
# AnnotationHub::query(ah, c("Gencode", "gff", "human"))
ah["AH49556"]
gff <- ah[["AH49556"]]
gff

annotateCTSS(ce, gff)
CTSStoGenes(ce)

Annotation

plotAnnot(ce, 'counts')

Richness

ce$r100l1 <- rarefy(t(as.data.frame(CTSStagCountDF(ce))),100)
ce$r100l1[ce$counts < 100] <- NA
ce$r100g <- rarefy(t(data.frame(assay(GeneExpSE(ce)))),100)
ce$r100g[ce$counts < 100] <- NA

Correlation between runs

pheatmap::pheatmap( cor(data.frame(assay(GeneExpSE(ce))))
                  , annotation_col = data.frame(colData(ce))[,c("group"), drop = F])

Rarefaction (hanabi plot)

The purpose of these plots is to ease the comparison of the libraries given that they do not have exactly the same sequencing depth, and to evaluate if extra sequencing depth would be useful.

The data is rarefied with enough sampling points to give a smooth appearance to the curves. If it takes too much time, the output can be stored in an object saved on the hard drive. This way, if the commands have been run through knitr, the long computations here can be skipped if needed again in an interactive session.

For example:

rar1 <- hanabi(assay(l1), from = 0)
saveRDS(rar1, "rar1.rds")
# And later...
rar1 <- readRDS("rar1.rds")

Transcript discovery (with raw reads)

Would we detect more transcripts if we sequenced more raw reads ?

````r hanabiPlot( tapply(bed$score, bed$library, hanabi, from = 0) %>% structure(class = "hanabi") , ylab = 'number of molecules detected' , xlab = 'number of properly mapped reads' , main = paste("Transcript discovery for", LIBRARY) , GROUP = bed$library %>% levels %>% sub("_.*", "", .))

### Gene discovery

Would we detect more genes if we detected more transcripts ?

```r
hanabiPlot( hanabi(assay(GeneExpSE(ce))
          , group = ce$group
          , ylab  = "number of genes detected"
          , xlab  = "number of unique molecule counts"
          , main  = "Gene discovery"))

TSS discovery

Would we detect more TSS if we detected more transcripts ?

h <- hanabi(CTSStagCountDF(ce)) # Save results as computation may be slow.
hanabiPlot( h
          , group = ce$group
          , ylab  = "number of TSS detected"
          , xlab  = "number of unique molecule counts"
          , main  = "TSS discovery")