bamtolist: From BAM files to lists of data tables or GRangesList...

View source: R/combine_bam.R

bamtolistR Documentation

From BAM files to lists of data tables or GRangesList objects.

Description

This function reads one or multiple BAM files converting them into data tables or GRanges objects, arranged in a list or a GRangesList, respectively. In both cases the list elements contain, for each read: i) the name of the corresponding reference sequence (i.e. of the transcript on which it aligns); ii) its leftmost and rightmost position with respect to the 1st nucleotide of the reference sequence; iii) its length (intended as the width of the reference sequence region covered by the RNA fragment, see parameter indel_threshold and Details); iv) the leftmost and rightmost position of the annotated CDS of the reference sequence (if any) with respect to its 1st nucleotide. Please note: start and stop codon positions for transcripts without annotated CDS are set to 0.

Usage

bamtolist(
  bamfolder,
  annotation,
  transcript_align = TRUE,
  name_samples = NULL,
  indel_threshold = 5,
  refseq_sep = NULL,
  output_class = "datatable"
)

Arguments

bamfolder

Character string specifying the path to the folder storing BAM files.

annotation

Data table as generated by create_annotation. Please make sure the name of reference transcripts in the annotation data table match those in the BAM files (see refseq_sep).

transcript_align

Logical value whether BAM files in bamfolder come from a transcriptome alignment (intended as an alignment against reference transcript sequences, see Details). If TRUE (the default), reads mapping on the negative strand should not be present and, if any, they are automatically removed.

name_samples

Named character string vector specifying the desired name for the elements of the output list of data tables. A character string for each BAM file in bamfolder is required. Plase be careful to name each element of the vector after the correct corresponding BAM file in bamfolder, leaving their path and extension out. No specific order is required. Default is NULL i.e. list elements are named after the name of the BAM files, leaving their path and extension out. See the example for additional details.

indel_threshold

Positive integer value specifying the maximum number of indels (insertions + deletions) allowed for each read. All reads associated to more indels than specified are discarded (see Details). Default is 5.

refseq_sep

Character specifying the separator between reference sequences' name and additional information to discard, stored in the same field (see Details). All characters before the first occurrence of the specified separator are kept. Default is NULL i.e. no string splitting is performed.

output_class

Either "datatable" or "granges". It specifies the format of the output i.e. a list of data tables or a GRangesList object. Default is "datatable".

Details

riboWaltz only works for read alignments based on transcript coordinates. This choice is due to the main purpose of RiboSeq assays to study translational events through the isolation and sequencing of ribosome protected fragments. Most reads from RiboSeq are supposed to map on mRNAs and not on introns and intergenic regions. Nevertheless, BAM based on transcript coordinates can be generated in two ways: i) aligning directly against transcript sequences; ii) aligning against standard chromosome sequences, requiring the outputs to be translated in transcript coordinates. The first option can be easily handled by many aligners (e.g. Bowtie), given a reference FASTA file where each sequence represents a transcript, from the beginning of the 5' UTR to the end of the 3' UTR. The second procedure is based on reference FASTA files where each sequence represents a chromosome, usually coupled with comprehensive gene annotation files (GTF or GFF). The STAR aligner, with its option –quantMode TranscriptomeSAM (see Chapter 6 of its manual), is an example of tool providing such a feature.

indel_threshold is aimed at keeping only reads showing small differences between the width of the RNA fragment and the length of the reference sequence region covered after their alignment (for additional details please read here about qwidth and width, defined in the GAlignments-class of the GenomicAlignments package). Strong differences between these values are usually due to many indels which in turn are caused by loose thresholds used in the alignment process. High numbers of consecutive deletions (nucleotides present in the reference sequence but not in the read) may be an indication of reads mapping on exon-exon junctions. Despite this eventuality is not in accordance with alignments based on transcript coordinates, stretches of deletions can be included by errors or inaccurate transcript annotation and should be discarded. In fact, riboWaltz deliberately identifies the length of the reference sequences covered by the reads with the reads length, referring to it as length in all data structures reporting reads or P-site offsets information (i.e. those generated by bamtolist itself, bamtobed, bedtolist, psite and psite_info). This value is indeed strongly connected to the identification of the P-site offsets since it determines the position of the 5' and 3' read extremities (columns end5 and end3 in many data structures generated by the package), used as starting points by riboWaltz's core algorithm.

refseq_sep is intended to lighten the identifiers of the reference sequences included in the output list of data table or to modify them to match those in the annotation table. Many details about the reference sequence such as their version (usually dot-separated), their length, name variants, associated gene/transcript/protein names (usually pipe-separated) might indeed be stored in the FASTA file used for the alignment and automatically transferred in the BAM.

Value

A list of data tables or a GRangesList object.

Examples

## ## Let's suppose there are two BAM files ("Samp1.bam" and "Samp2.bam") in
## ## "path/to/BAM/files". We want to acquire them and assign to the 
## ## corresponding data tables the names "Control" and "Treated",
## ## respectively.
## ## We first define the "name_samples" character string vector as follow:
## name_of_bams <- c("Control", "Treated")
## names(name_of_bams) <- c("Samp1", "Samp2")
##
## ## Then, we can acquire the two files: 
## path_bam <- "path/to/BAM/files"
## reads_list <- bamtolist(bamfolder = path_bam, name_samples = name_of_bams,
##                           annotation = annotation_dt)
##
## ## read_list will be a list of two data tables, named "Control" (with
## ## mapping reads from "Samp1.bam") and "Treated" (with mapping reads
## ## from"Samp2.bam")

LabTranslationalArchitectomics/riboWaltz documentation built on Jan. 17, 2024, 12:18 p.m.