R/combine_bam.R

Defines functions bamtolist

Documented in bamtolist

#' From BAM files to lists of data tables or GRangesList objects.
#'
#' This function reads one or multiple BAM files converting them into data
#' tables or GRanges objects, arranged in a list or a GRangesList, respectively.
#' In both cases the list elements contain, for each read: i) the name of the
#' corresponding reference sequence (i.e. of the transcript on which it aligns);
#' ii) its leftmost and rightmost position with respect to the 1st nucleotide of
#' the reference sequence; iii) its length (intended as the width of the
#' reference sequence region covered by the RNA fragment, see parameter
#' \code{indel_threshold} and \code{Details}); iv) the leftmost and rightmost
#' position of the annotated CDS of the reference sequence (if any) with respect
#' to its 1st nucleotide. Please note: start and stop codon positions for
#' transcripts without annotated CDS are set to 0.
#'
#' @param bamfolder Character string specifying the path to the folder storing
#'   BAM files.
#' @param annotation Data table as generated by \code{\link{create_annotation}}.
#'   Please make sure the name of reference transcripts in the annotation data
#'   table match those in the BAM files (see \code{refseq_sep}).
#' @param transcript_align Logical value whether BAM files in \code{bamfolder}
#'   come from a transcriptome alignment (intended as an alignment against
#'   reference transcript sequences, see \code{Details}). If TRUE (the default),
#'   reads mapping on the negative strand should not be present and, if any,
#'   they are automatically removed.
#' @param name_samples Named character string vector specifying the desired name
#'   for the elements of the output list of data tables. A character string for
#'   each BAM file in \code{bamfolder} is required. Plase be careful to name
#'   each element of the vector after the correct corresponding BAM file in
#'   \code{bamfolder}, leaving their path and extension out. No specific order
#'   is required. Default is NULL i.e. list elements are named after the name of
#'   the BAM files, leaving their path and extension out. See the example for
#'   additional details.
#' @param indel_threshold Positive integer value specifying the maximum number
#'   of indels (insertions + deletions) allowed for each read. All reads
#'   associated to more indels than specified are discarded (see
#'   \code{Details}). Default is 5.
#' @param refseq_sep Character specifying the separator between reference
#'   sequences' name and additional information to discard, stored in the same
#'   field (see \code{Details}). All characters before the first occurrence of
#'   the specified separator are kept. Default is NULL i.e. no string splitting
#'   is performed.
#' @param output_class Either "datatable" or "granges". It specifies the format
#'   of the output i.e. a list of data tables or a GRangesList object. Default
#'   is "datatable".
#' @return A list of data tables or a GRangesList object.
#' @details \strong{riboWaltz} only works for read alignments based on
#'   transcript coordinates. This choice is due to the main purpose of RiboSeq
#'   assays to study translational events through the isolation and sequencing
#'   of ribosome protected fragments. Most reads from RiboSeq are supposed to
#'   map on mRNAs and not on introns and intergenic regions. Nevertheless, BAM
#'   based on transcript coordinates can be generated in two ways: i) aligning
#'   directly against transcript sequences; ii) aligning against standard
#'   chromosome sequences, requiring the outputs to be translated in transcript
#'   coordinates. The first option can be easily handled by many aligners (e.g.
#'   Bowtie), given a reference FASTA file where each sequence represents a
#'   transcript, from the beginning of the 5' UTR to the end of the 3' UTR. The
#'   second procedure is based on reference FASTA files where each sequence
#'   represents a chromosome, usually coupled with comprehensive gene annotation
#'   files (GTF or GFF). The STAR aligner, with its option --quantMode
#'   TranscriptomeSAM (see Chapter 6 of its
#'   \href{http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STAR.posix/doc/STARmanual.pdf}{manual}),
#'   is an example of tool providing such a feature.
#'    
#'   \code{indel_threshold} is aimed at keeping only reads showing small
#'   differences between the width of the RNA fragment and the length of the
#'   reference sequence region covered after their alignment (for additional
#'   details please read
#'   \href{https://bioconductor.org/packages/release/bioc/manuals/GenomicAlignments/man/GenomicAlignments.pdf}{here}
#'   about \emph{qwidth} and \emph{width}, defined in the GAlignments-class of
#'   the GenomicAlignments package). Strong differences between these values are
#'   usually due to many indels which in turn are caused by loose thresholds
#'   used in the alignment process. High numbers of consecutive deletions
#'   (nucleotides present in the reference sequence but not in the read) may be
#'   an indication of reads mapping on exon-exon junctions. Despite this
#'   eventuality is not in accordance with alignments based on transcript
#'   coordinates, stretches  of deletions can be included by errors or
#'   inaccurate transcript annotation and should be discarded. In fact,
#'   \strong{riboWaltz} deliberately identifies the length of the reference
#'   sequences covered by the reads with the reads length, referring to it as
#'   \emph{length} in all data structures reporting reads or P-site offsets
#'   information (i.e. those generated by \code{bamtolist} itself,
#'   \code{\link{bamtobed}}, \code{\link{bedtolist}}, \code{\link{psite}} and
#'   \code{\link{psite_info}}). This value is indeed strongly connected to the
#'   identification of the P-site offsets since it determines the position of
#'   the 5' and 3' read extremities (columns \emph{end5} and \emph{end3} in many
#'   data structures generated by the package), used as starting points by
#'   \strong{riboWaltz}'s core algorithm.
#'    
#'   \code{refseq_sep} is intended to lighten the identifiers of the reference
#'   sequences included in the output list of data table or to modify them to
#'   match those in the annotation table. Many details about the reference
#'   sequence such as their version (usually dot-separated), their length, name
#'   variants, associated gene/transcript/protein names (usually pipe-separated)
#'   might indeed be stored in the FASTA file used for the alignment and
#'   automatically transferred in the BAM.
#' @examples 
#' ## ## Let's suppose there are two BAM files ("Samp1.bam" and "Samp2.bam") in
#' ## ## "path/to/BAM/files". We want to acquire them and assign to the 
#' ## ## corresponding data tables the names "Control" and "Treated",
#' ## ## respectively.
#' ## ## We first define the "name_samples" character string vector as follow:
#' ## name_of_bams <- c("Control", "Treated")
#' ## names(name_of_bams) <- c("Samp1", "Samp2")
#' ##
#' ## ## Then, we can acquire the two files: 
#' ## path_bam <- "path/to/BAM/files"
#' ## reads_list <- bamtolist(bamfolder = path_bam, name_samples = name_of_bams,
#' ##                           annotation = annotation_dt)
#' ##
#' ## ## read_list will be a list of two data tables, named "Control" (with
#' ## ## mapping reads from "Samp1.bam") and "Treated" (with mapping reads
#' ## ## from"Samp2.bam")
#' @import data.table
#' @export
bamtolist <- function(bamfolder, annotation, transcript_align = TRUE, 
                      name_samples = NULL, indel_threshold = 5,
                      refseq_sep = NULL, output_class = "datatable") {

  name_bams <- list.files(path = bamfolder, pattern = ".bam$")

  if (length(name_samples) == 0) {
    name_samples <- unlist(strsplit(name_bams, ".bam"))
    names(name_samples) <- name_samples
  } else {
    if(is.null(names(name_samples)) | any(names(name_samples) == "") ){
      stop("name_samples must be a named character vector or NULL.")
    } else {
      if( !all( sel <- names(name_samples) %in% unlist(strsplit(name_bams, ".bam")) ) ){
        if(sum(!sel) == 1){
          stop("file(s) not found in folder ", bamfolder, ": ",
               paste0(names(name_samples)[!sel], ".bam. "),
               "Please check name_samples\n")
        } else {
          stop("file(s) not found in folder ", bamfolder, ": ",
               paste0(names(name_samples)[!sel][1:(sum(!sel) - 1)], ".bam, "),
               paste0(names(name_samples)[!sel][sum(!sel)], ".bam. "),
               "Please check name_samples\n")
        }
      }
    }
  }

  sample_reads_list <- list()
  for(current_sample in names(name_samples)) {
    current_bam <- paste0(current_sample,".bam")

    cat(sprintf("Reading %s\n", current_bam))
    filename <- file.path(bamfolder, current_bam)
    dt <- as.data.table(GenomicAlignments::readGAlignments(filename))
    nreads <- nrow(dt)
    cat(sprintf("Input reads: %s M\n", format(round((nreads / 1000000), 3), nsmall = 3)))
    dt <- dt[, diff_width := qwidth - width
             ][abs(diff_width) <= indel_threshold]
    if(nreads != nrow(dt)){
      perc_nreads <- round(((nreads - nrow(dt)) / nreads) * 100, 3)
      if(perc_nreads >= 0.001){
        cat(sprintf("%s M  (%s %%) reads removed: exceeding indel_threshold.\n", 
                    format(round((nreads - nrow(dt)) / 1000000, 3), nsmall = 3), 
                    format(perc_nreads, nsmall = 3) ))
      } else {
        cat(sprintf("%s (< 0.001 %%) reads removed: exceeding indel_threshold.\n", 
                    format(round((nreads - nrow(dt)), 3), nsmall = 3)))
      }
    } else {
      cat("Good! Number of indel below indel_threshold for all reads. No reads removed.\n")
    }
    
    dt <- dt[, .(seqnames, start, end, width, strand)]
    setnames(dt, c("transcript", "end5", "end3", "length", "strand"))
    
    if(length(refseq_sep) != 0){
      dt <- dt[, transcript := tstrsplit(transcript, refseq_sep, fixed = TRUE, keep = 1)]
    }

    nreads <- nrow(dt)
    dt <- dt[as.character(transcript) %in% as.character(annotation$transcript)]
    if(nreads != nrow(dt)){
      if(nrow(dt) == 0){
        stop(sprintf("%s M  (%s %%) reads removed: reference transcript IDs not found in annotation table.\n\n",
                     format(round((nreads - nrow(dt)) / 1000000, 3), nsmall = 3), 
                     format(round(((nreads - nrow(dt)) / nreads) * 100, 3), nsmall = 3) ))
      } else{
        cat(sprintf("%s M  (%s %%) reads removed: reference transcript IDs not found in annotation table.\n", 
                    format(round((nreads - nrow(dt)) / 1000000, 3), nsmall = 3), 
                    format(round(((nreads - nrow(dt)) / nreads) * 100, 3), nsmall = 3) )) 
      }
    } else {
      cat("Great! All reads' reference transcript IDs were found in the annotation table. No reads removed.\n")
    }
    
    if(transcript_align == TRUE | transcript_align == T){
      nreads <- nrow(dt)
      dt <- dt[strand == "+"]
      
      if(nreads != nrow(dt)){
        perc_nreads <- round(((nreads - nrow(dt)) / nreads) * 100, 3)
        if(perc_nreads >= 0.001){
          cat(sprintf("%s M  (%s %%) reads removed: mapping on negative strand.\n", 
                      format(round((nreads - nrow(dt)) / 1000000, 3), nsmall = 3), 
                      format(perc_nreads, nsmall = 3) ))
        } else {
          cat(sprintf("%s (< 0.001 %%) reads removed: mapping on negative strand.\n", 
                      format(round((nreads - nrow(dt)), 3), nsmall = 3)))
        }
      } else {
        cat("Cool! All reads mapping on positive strand. No reads removed.\n")
      }
    }
    
    dt[annotation, on = 'transcript', c("cds_start", "cds_stop") := list(i.l_utr5 + 1, i.l_utr5 + i.l_cds)]
    dt[cds_start == 1 & cds_stop == 0, cds_start := 0]
    dt[, strand := NULL]
    
    nreads <- nrow(dt)
    cat(sprintf("Output reads: %s M\n", format(round((nreads / 1000000), 3), nsmall = 3)))
    
    if (output_class == "granges") {
      dt <- GenomicRanges::makeGRangesFromDataFrame(dt,
                                                    keep.extra.columns = TRUE,
                                                    ignore.strand = TRUE,
                                                    seqnames.field = c("transcript"),
                                                    start.field = "end5",
                                                    end.field = "end3",
                                                    strand.field = "strand",
                                                    starts.in.df.are.0based = FALSE)
      GenomicRanges::strand(dt) <- "+"
    }
    
    sample_reads_list[[name_samples[current_sample]]] <- dt
    cat("Done!", current_bam, "has been loaded as", name_samples[current_sample], "\n\n")
    
  }
  
  if (output_class == "granges") {
    sample_reads_list <- GenomicRanges::GRangesList(sample_reads_list)
  }
  
  return(sample_reads_list)
}

#' From BAM files to BED files.
#'
#' This function reads one or multiple BAM files converting them into BED files
#' that contain, for each read: i) the name of the corresponding reference
#' sequence (i.e. of the transcript on which it aligns); ii) its leftmost and
#' rightmost position with respect to the 1st nucleotide of the reference
#' sequence; iii) its length (intended as the width of the reference sequence
#' region covered by the RNA fragment. For further information about this choice
#' please refer to section \code{Details} of function \code{\link{bamtolist}});
#' iv) the strand on which it aligns. Please note: this function relies on the
#' \emph{bamtobed} utility of the BEDTools suite and can be only run on UNIX,
#' LINUX and Apple OS X operating systems. Moreover, to generate R data
#' structures containing reads information, the function \code{\link{bedtolist}}
#' must be run on the resulting BED files. For these reasons the authors suggest
#' the use of \code{\link{bamtolist}}.
#'
#' @param bamfolder Character string specifying the path to the folder storing
#'   BAM files. Please note: the function looks for BAM files recursively
#'   starting from the specified folder.
#' @param bedfolder Character string specifying the path to the directory where
#'   BED files shuold be stored. If the specified folder doesn't exist, it is
#'   automatically created. If NULL (the default), BED files are stored in a new
#'   subfolder of the working directory, called \emph{bed}.
#' @examples
#' ## path_bam <- "path/to/BAM/files"
#' ## path_bed <- "path/to/output/directory"
#' ## bamtobed(bamfolder = path_bam, bedfolder = path_bed)
#' @export
bamtobed <- function(bamfolder, bedfolder = NULL) {
  if (length(bedfolder) == 0) {
    bedfolder <- paste(bamfolder, "/bed", sep = "")
  }
  if (!dir.exists(bedfolder)) {
    dir.create(bedfolder)
  }
  bamtobed <- paste("bam_folder=\"", bamfolder, "\" && out_folder=\"", bedfolder, "\" && bam_files=$(find $bam_folder -type \"f\" -name \"*.bam\" | sort) && for name in $bam_files; do outname=$(echo $name | rev | cut -d'/' -f 1 | cut -d '.' -f 2- | rev); printf \"from \\t\\t $name \\t\\t to \\t\\t $out_folder/$outname.bed\\n\"; bedtools bamtobed -i $name | cut -f1,2,3,6 | awk -F'\\t' '{OFS = \"\\t\"; $5=$4; $2=$2+1; $4=$3-$2+1; print}' > $out_folder/$outname.bed; done",
                    sep = "")
  system(bamtobed)
}

#' From BED files to lists of data tables or GRangesList objects.
#'
#' This function reads one or multiple BED files, as generated by
#' \code{\link{bamtobed}}, converting them into data tables or GRanges objects,
#' arranged in a list or a GRangesList, respectively. In both cases two columns
#' are attached to the original data containing, for each read, the leftmost
#' and rightmost position of the annotated CDS of the reference sequence (if
#' any) with respect to its 1st nucleotide. Please note: start and stop codon
#' positions for transcripts without annotated CDS are set to 0.
#'
#' @param bedfolder Character string specifying the path to the folder storing
#'   BED files as generated by \code{\link{bamtobed}}.
#' @param annotation Data table as generated by \code{\link{create_annotation}}.
#'   Please make sure the name of reference transcripts in the annotation data
#'   table match those in the BED files (see also \code{refseq_sep}).
#' @param transcript_align Logical value whether BED files in \code{bedfolder}
#'   come from a transcriptome alignment (intended as an alignment against
#'   reference transcript sequences, see \code{Details}). If TRUE (the default),
#'   reads mapping on the negative strand should not be present and, if any,
#'   they are automatically removed.
#' @param name_samples Named character string vector specifying the desired name
#'   for the output list elements. A character string for each BED file in
#'   \code{bedfolder} is required. Plase be careful to name each element of the
#'   vector after the correct corresponding BED file in \code{bedfolder},
#'   leaving their path and extension out. No specific order is required.
#'   Default is NULL i.e. list elements are named after the name of the BED
#'   files, leaving their path and extension out.
#' @param refseq_sep Character specifying the separator between reference
#'   sequences' name and additional information to discard, stored in the same
#'   field (see \code{Details}). All characters before the first occurrence of
#'   the specified separator are kept. Default is NULL i.e. no string splitting
#'   is performed.
#' @param output_class Either "datatable" or "granges". It specifies the format
#'   of the output i.e. a list of data tables or a GRangesList object. Default
#'   is "datatable".
#' @details \strong{riboWaltz} only works for read alignments based on
#'   transcript coordinates. This choice is due to the main purpose of RiboSeq
#'   assays to study translational events through the isolation and sequencing
#'   of ribosome protected fragments. Most reads from RiboSeq are supposed to
#'   map on mRNAs and not on introns and intergenic regions. Nevertheless, BAM
#'   based on transcript coordinates can be generated in two ways: i) aligning
#'   directly against transcript sequences; ii) aligning against standard
#'   chromosome sequences, requiring the outputs to be translated in transcript
#'   coordinates. The first option can be easily handled by many aligners (e.g.
#'   Bowtie), given a reference FASTA file where each sequence represents a
#'   transcript, from the beginning of the 5' UTR to the end of the 3' UTR. The
#'   second procedure is based on reference FASTA files where each sequence
#'   represents a chromosome, usually coupled with comprehensive gene annotation
#'   files (GTF or GFF). The STAR aligner, with its option --quantMode
#'   TranscriptomeSAM (see Chapter 6 of its
#'   \href{http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STAR.posix/doc/STARmanual.pdf}{manual}),
#'    is an example of tool providing such a feature.
#'    
#'   \code{refseq_sep} is intended to lighten the identifiers of the reference
#'   sequences included in the final data table or to modify them to match those
#'   in the annotation table. Many details about the reference sequence such as
#'   their version (usually dot-separated), their length, name variants,
#'   associated gene/transcript/protein names (usually pipe-separated) might
#'   indeed be stored in the FASTA file used for the alignment and automatically
#'   transferred in the BAM.
#' @return A list of data tables or a GRangesList object.
#' @examples
#' ## path_bed <- "path/to/BED/files"
#' ## bedtolist(bedfolder = path_bed, annotation = mm81cdna)
#' @import data.table
#' @export
bedtolist <- function(bedfolder, annotation, transcript_align = TRUE,
                      name_samples = NULL, refseq_sep = FALSE, 
                      output_class = "datatable") {
  names <- list.files(path = bedfolder, pattern = ".bed$")
  if (length(name_samples) == 0) {
    name_samples <- unlist(strsplit(names, ".bed"))
    names(name_samples) <- unlist(strsplit(names, ".bed"))
  } else {
    if (length(name_samples) > length(names)) {
      cat("\n")
      stop("length of name_samples greater than number of files\n\n")
    }
    if (length(name_samples) < length(names)) {
      cat("\n")
      stop("length of name_samples smaller than number of files\n\n")
    }
  }

  i <- 0
  sample_reads_list <- list()
  for (n in names) {
    i <- i + 1
    cat(sprintf("Reading %s\n", n))
    filename <- paste(bedfolder, n, sep = "/")
    dt <- fread(filename, sep = "\t", header = FALSE)
    setnames(dt, c("transcript", "end5", "end3", "length", "strand"))
    
    if(length(refseq_sep) != 0){
      dt <- dt[, transcript := tstrsplit(transcript, refseq_sep, fixed = TRUE, keep = 1)]
    }
    
    nreads <- nrow(dt)
    cat(sprintf("Input reads: %s M\n", format(round((nreads / 1000000), 3), nsmall = 3)))
    dt <- dt[as.character(transcript) %in% as.character(annotation$transcript)]
    if(nreads != nrow(dt)){
      if(nrow(dt) == 0){
        stop(sprintf("%s M  (%s %%) reads removed: reference transcript IDs not found in annotation table.\n\n",
                     format(round((nreads - nrow(dt)) / 1000000, 3), nsmall = 3), 
                     format(round(((nreads - nrow(dt)) / nreads) * 100, 3), nsmall = 3) ))
      } else{
        cat(sprintf("%s M  (%s %%) reads removed: reference transcript IDs not found in annotation table.\n", 
                    format(round((nreads - nrow(dt)) / 1000000, 3), nsmall = 3), 
                    format(round(((nreads - nrow(dt)) / nreads) * 100, 3), nsmall = 3) )) 
      }
    } else {
      cat("Great! All reads' reference transcript IDs were found in annotation table. No reads removed.\n")
    }
    
    if(transcript_align == TRUE | transcript_align == T){
      nreads <- nrow(dt)
      dt <- dt[strand == "+"]
      
      if(nreads != nrow(dt)){
        perc_nreads <- round(((nreads - nrow(dt)) / nreads) * 100, 3)
        if(perc_nreads >= 0.001){
          cat(sprintf("%s M  (%s %%) reads removed: mapping on negative strand.\n", 
                      format(round((nreads - nrow(dt)) / 1000000, 3), nsmall = 3), 
                      format(perc_nreads, nsmall = 3) ))
        } else {
          cat(sprintf("%s (< 0.001 %%) reads removed: mapping on negative strand.\n", 
                      format(round((nreads - nrow(dt)), 3), nsmall = 3)))
        }
      } else {
        cat("Cool! All reads mapping on positive strand. No reads removed.\n")
      }
    }
    
    dt[annotation, on = 'transcript', c("cds_start", "cds_stop") := list(i.l_utr5 + 1, i.l_utr5 + i.l_cds)]
    dt[cds_start == 1 & cds_stop == 0, cds_start := 0]
    dt[, strand := NULL]
    
    nreads <- nrow(dt)
    cat(sprintf("Output reads: %s M\n", format(round((nreads / 1000000), 3), nsmall = 3)))
    
    if(output_class == "granges") {
      dt <- GenomicRanges::makeGRangesFromDataFrame(dt,
                                                    keep.extra.columns = TRUE,
                                                    ignore.strand = TRUE,
                                                    seqnames.field = c("transcript"),
                                                    start.field = "end5",
                                                    end.field = "end3",
                                                    strand.field = "strand",
                                                    starts.in.df.are.0based = FALSE)
      GenomicRanges::strand(dt) <- "+"
    }
    
    sample_reads_list[[name_samples[i]]] <- dt
  }
  
  if(output_class == "granges") {
    sample_reads_list <- GenomicRanges::GRangesList(sample_reads_list)
  }
  
  return(sample_reads_list)
}
LabTranslationalArchitectomics/riboWaltz documentation built on Jan. 17, 2024, 12:18 p.m.