startQuest("Parallel processing")
    h1("Prelude")
    h2("Aim")

The following six chapters aim at taking a look "under the hood" of how feature summarisation works, i.e. how a count table is created by overlapping the read alignments with the genic annotation. At the same time, we will toy with computer optimisation (parallel processing and memory management). This will introduce you to efficient and practical R usage.

Before experimenting with the functionalities of the Bioconductor packages that were presented in the lecture, we will first consolidate the knowledge you have gathered so far by addressing the computational challenges faced when using HTS data, i.e. resource and time consumption.

In the lecture, the readGAlignments function from the GenomicAlignments package was introduced and used to extract a GAlignments object. However, most of the time, an experiment will NOT consist of a single sample (of only about a million reads!), and an obvious way to speed up the process is to parallelise. In the following three sections, we will see how to do this before ultimately discussing the pros and cons of the implemented method.

    h2("Creating GAlignment objects from BAM files")

First of all, locate the BAM files and implement a function to read them sequentially. Have a look at the lapply function man page for how to do so.

    library(GenomicAlignments)
    ## list the example BAM files; the pattern argument is a regular
    ## expression, hence the escaped dot
    bamfiles <- dir(file.path(extdata(),"BAM"),
                    pattern="\\.bam$",full.names=TRUE)
    ## read every BAM file sequentially into a list of GAlignments
    gAlns <- lapply(bamfiles,readGAlignments)

Nothing complicated so far, right? We process both files sequentially and get a list of GAlignments objects stored in the gAlns object. Apart from the coding enhancement - with one line, we can process all our samples - there are no other gains.

    h2("Processing the files in parallel")

Modern laptops commonly possess a multi-core CPU, whose cores can perform tasks independently. Computational servers usually have several CPUs (commonly 8), each having several cores. An obvious enhancement to our previous solution is to take advantage of this CPU architecture and to process our samples in parallel.
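
If you do not know how many cores your own machine has, the detectCores function from the parallel package will tell you (a quick aside, not part of the exercise):

    library(parallel)
    ## report the number of cores the current machine exposes
    detectCores()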

Have a look at the parallel package and in particular at the mclapply function to re-implement the previous function in a parallel manner.

    library(parallel)
    ## same call as before, but the files are now processed in
    ## parallel (two at a time by default)
    gAlns <- mclapply(bamfiles,readGAlignments)
    quest(1)
    endQuest()
    startQuest("Memory management")
    h2("Dealing with memory problems")

The fact that both files were processed simultaneously is NOT because there happened to be 2 files to process. mclapply has a number of default parameters - see ?mclapply for details - including mc.cores, which defaults to 2. If you want to process more samples in parallel, set that parameter value accordingly.
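
For instance - assuming your machine has at least four cores - the call could become:

    ## process up to four files simultaneously instead of the default two
    gAlns <- mclapply(bamfiles,readGAlignments,mc.cores=4L)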

This new implementation has the obvious advantage of being X times faster (with X being the number of CPUs used, or almost so, as parallelisation comes with a slight overhead cost), but it puts a different strain on the system. As several files are being processed in parallel, the memory requirements also increase by a factor of X (assuming files of roughly equivalent size are processed). This might be fine on a computational server, but given the constant increase in sequencing reads produced per run, it will eventually become a problem.
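
If you want to see that strain for yourself, R's built-in object.size function gives a rough idea of the memory footprint (a quick aside; note that it only approximates the size of complex S4 objects):

    ## rough memory footprint of the list of GAlignments objects
    print(object.size(gAlns), units="Mb")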

Can you think of a way this memory issue could be addressed? That is, what could we modify in the way we read/process the files to limit the memory required at any given moment?

No, buying more memory is usually not an option. And anyway, at the moment, the rate at which the number of reads sequenced per run increases is faster than the memory doubling time. So, let us just move on to the next section to have a go at addressing the issue.

    h2("Processing the files one chunk at a time")

To limit the memory required at any moment, one approach would be to process each file not as a whole, but chunk-wise. As we can assume that reads are stored independently in BAM files (or almost so; think of how paired-end data is stored!), we can simply decide to parse, e.g., 1,000,000 reads at a time. This will of course require a new way to represent a BAM file in R, i.e. not just as a character string as we have had until now in our bamfiles object.

The Rsamtools package again comes in handy. Look up the ?BamFile help page and try to figure out how we could take advantage of the BamFile or BamFileList classes for our purpose. As the files are small, use a yieldSize of 100,000 reads.

The yieldSize parameter of either class looks like exactly what we want. Let us recode our character object into a BamFileList.

    library(Rsamtools)
    bamFileList <- BamFileList(bamfiles,yieldSize=10^5)
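
As a quick check - a suggestion on my part, not part of the exercise - the yieldSize accessor reports what was recorded for a given file:

    ## confirm the yield size recorded for the first file
    yieldSize(bamFileList[[1]])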

Now that we have the BAM files described in a way that lets us process them chunk-wise, let us do so. The paradigm is as follows. Note that it is just that - a paradigm - so DO NOT RUN IT, but rather take inspiration from it.

    ## DO NOT RUN ME!
    ## I AM JUST AN EXAMPLE
    ## 'bamFile' stands for a single BamFile object
    open(bamFile)
    ## readGAlignments returns at most 'yieldSize' records per call;
    ## the loop stops once a call returns an empty object
    while(length(chunk <- readGAlignments(bamFile))){
        message(length(chunk))
    }
    close(bamFile)

In the paradigm above, we process one BAM file chunk-wise and report the sizes of the chunks. In our case these would be 0.1M reads each, apart from the last one, which would be smaller than or equal to 0.1M (it is unlikely that a sequencing file contains an exact multiple of our chunk size). For example, a file of 325,000 reads would yield chunks of 100,000, 100,000, 100,000 and 25,000 reads.

Now, try to implement the above paradigm in the function we wrote previously - remember the mclapply solution - so as to process both our BAM files in parallel and chunk-wise.

    gAlns <- mclapply(bamFileList,function(bamFile){
        open(bamFile)
        ## start from an empty GAlignments object
        gAln <- GAlignments()
        ## append every chunk until the file is exhausted
        while(length(chunk <- readGAlignments(bamFile))){
            gAln <- c(gAln,chunk)
        }
        close(bamFile)
        return(gAln)
    })
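
As a quick sanity check - my suggestion, not part of the exercise - you can verify that the chunk-wise, parallel run recovers the same number of alignments per sample as the sequential lapply solution did:

    ## number of alignments per sample; this should match the
    ## results of the sequential solution
    sapply(gAlns,length)
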
    h2("Pros and cons of the current solution")

Before reading my comments below, take the time to jot down what you think are the advantages and drawbacks of the method implemented above. My own comments are certainly not exhaustive, and I would be curious to hear any of yours that are not matched by mine.

    h3("Pros")
  1. We have written a streamlined piece of code, using up-to-date functionalities from other packages. Hence, it is both easily maintainable and updatable.

  2. With regard to time consumption, we have reduced it by a factor of 2, and it can be reduced further by using a computer with more CPUs, or even a compute farm - obviously, provided we have more than 2 samples to process.

  3. We have implemented the chunk-wise processing of the BAM files.

    h3("Cons")
  1. There is only one big con really: we have NOT addressed the memory requirement issue satisfactorily. We do process the BAM files by chunks, but then we simply aggregate these chunks without further processing, so we eventually end up using the same amount of memory (see the sketch below). This is the best we can do so far given the Bioconductor functionalities introduced, so let us move on to the next step in the pipeline, which will help us resolve this. First, though, we should recap the usage of the Bioconductor packages for loading and manipulating sequencing read information in R, which is the next chapter's topic.
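
To make that con concrete, here is a hypothetical sketch of what per-chunk reduction could look like: each chunk is immediately collapsed into per-chromosome counts and discarded, so memory use stays bounded by the yieldSize. The table-based tallying is purely illustrative; the proper Bioconductor machinery for counting is the subject of the coming chapters.

    ## Hypothetical sketch: reduce each chunk to per-chromosome counts
    ## instead of accumulating the alignments themselves
    gCounts <- mclapply(bamFileList,function(bamFile){
        open(bamFile)
        counts <- NULL
        while(length(chunk <- readGAlignments(bamFile))){
            ## tally the chunk, then let it be garbage collected
            chunkCounts <- table(seqnames(chunk))
            counts <- if(is.null(counts)) chunkCounts else counts + chunkCounts
        }
        close(bamFile)
        return(counts)
    })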
    quest(2)
    endQuest()

