Bam Parameters - BamParam

High throughput sequencing RNA-Seq data comes in a multitude of flavours, i.e. even from a single provider, protocol - e.g. strand specific, paired-end - reads characteristics - e.g. read length - will vary.

The easyRNASeq simpleRNASeq method will infer these information based on excerpts sampled from the data. However, it is always best to provide these information, as 1. the inference is done on small excerpt and can fail 2. it is always good to document an analysis

By default easyRNASeq simpleRNASeq will keep the inferred parameters over the user-provided parameters if these do not agree and emit corresponding warnings. The choice to rely on inferred parameters over user-provided one is to enforce user to cross-validate their knowledge of the data characteristics, as these are crucial for an adequate processing. Remember GIGO[^2].

If the automatic inference does fail, please let me know, so that I optimise it. Meanwhile, you can use the override argument to enforce the use of user-passed parameters.

To reproduce the results from Robinson, Delhomme et al. [@Robinson:2014p6362], we first need to download an excerpt of the data.

We first retrieve the file listing and md5 codes

download.file(url=paste0("ftp://ftp.plantgenie.org/Tutorials/RnaSeqTutorial/",
                         "data/star/md5.txt"),
                  destfile="md5.txt")

In this part of the vignette, we will NOT process all the data, albeit it would be possible, but for the sake of brevity, we will only retrieve the six first datasets. We get these from the sample information contained within this package.

data(RobinsonDelhomme2014)
lapply(RobinsonDelhomme2014[1:6,"Filename"],function(f){
    # BAM files
    download.file(url=paste0("ftp://ftp.plantgenie.org/Tutorials/",
                             "RnaSeqTutorial/data/star/",f),
                  destfile=as.character(f))
    # BAM index files
    download.file(url=paste0("ftp://ftp.plantgenie.org/Tutorials/",
                             "RnaSeqTutorial/data/star/",f,".bai"),
                  destfile=as.character(paste0(f,".bai")))
})
# THIS IS A subset of the data (chr 19 only) used to speed up
# the vignette creation while still testing capabilities
data(RobinsonDelhomme2014)
lapply(RobinsonDelhomme2014[1:6,"Filename"],function(f){
    # BAM files
    file.copy(
        dir(vDir,pattern=paste0(as.character(f),"$"),full.names=TRUE)
        ,file.path(".",f))
    # BAM index files
    file.copy(
        dir(vDir,pattern=paste0(as.character(f),".bai"),full.names=TRUE)
        ,file.path(".",paste0(f,".bai")))
})

These six files - as the rest of the dataset - have been sequenced on an Illumina HiSeq 2500 in paired-end mode using a non-strand specific library protocol with a read length of 100 bp. The raw data have been processed as described in the aforementioned guidelines[^1] and as such have been filtered for rRNA sequences, trimmed for adapters and clipped for quality. The resulting reads (of length 50-100bp) have then been aligned using STAR [Dobin:2013p5293].

Using these information, we finally generate the BamParam object.

bamParam <- BamParam(paired = TRUE,
                     stranded = FALSE)

A third parameter yieldSize can be set to speed up the processing on multi-CPU or multi-core computers. It splits and processed the BAM files in chunk of size yieldSize with a default of 1M reads.

Finally, we create the list of BAM files we just retrieved.

bamFiles <- getBamFileList(dir(".","*\\.bam$"),
                           dir(".","*\\.bai$"))



Try the easyRNASeq package in your browser

Any scripts or data that you put into this service are public.

easyRNASeq documentation built on April 30, 2020, 2 a.m.