bamtolist | R Documentation |
This function reads one or multiple BAM files converting them into data
tables or GRanges objects, arranged in a list or a GRangesList, respectively.
In both cases the list elements contain, for each read: i) the name of the
corresponding reference sequence (i.e. of the transcript on which it aligns);
ii) its leftmost and rightmost position with respect to the 1st nucleotide of
the reference sequence; iii) its length (intended as the width of the
reference sequence region covered by the RNA fragment, see parameter
indel_threshold
and Details
); iv) the leftmost and rightmost
position of the annotated CDS of the reference sequence (if any) with respect
to its 1st nucleotide. Please note: start and stop codon positions for
transcripts without annotated CDS are set to 0.
bamtolist(
bamfolder,
annotation,
transcript_align = TRUE,
name_samples = NULL,
indel_threshold = 5,
refseq_sep = NULL,
output_class = "datatable"
)
bamfolder |
Character string specifying the path to the folder storing BAM files. |
annotation |
Data table as generated by |
transcript_align |
Logical value whether BAM files in |
name_samples |
Named character string vector specifying the desired name
for the elements of the output list of data tables. A character string for
each BAM file in |
indel_threshold |
Positive integer value specifying the maximum number
of indels (insertions + deletions) allowed for each read. All reads
associated to more indels than specified are discarded (see
|
refseq_sep |
Character specifying the separator between reference
sequences' name and additional information to discard, stored in the same
field (see |
output_class |
Either "datatable" or "granges". It specifies the format of the output i.e. a list of data tables or a GRangesList object. Default is "datatable". |
riboWaltz only works for read alignments based on transcript coordinates. This choice is due to the main purpose of RiboSeq assays to study translational events through the isolation and sequencing of ribosome protected fragments. Most reads from RiboSeq are supposed to map on mRNAs and not on introns and intergenic regions. Nevertheless, BAM based on transcript coordinates can be generated in two ways: i) aligning directly against transcript sequences; ii) aligning against standard chromosome sequences, requiring the outputs to be translated in transcript coordinates. The first option can be easily handled by many aligners (e.g. Bowtie), given a reference FASTA file where each sequence represents a transcript, from the beginning of the 5' UTR to the end of the 3' UTR. The second procedure is based on reference FASTA files where each sequence represents a chromosome, usually coupled with comprehensive gene annotation files (GTF or GFF). The STAR aligner, with its option –quantMode TranscriptomeSAM (see Chapter 6 of its manual), is an example of tool providing such a feature.
indel_threshold
is aimed at keeping only reads showing small
differences between the width of the RNA fragment and the length of the
reference sequence region covered after their alignment (for additional
details please read
here
about qwidth and width, defined in the GAlignments-class of
the GenomicAlignments package). Strong differences between these values are
usually due to many indels which in turn are caused by loose thresholds
used in the alignment process. High numbers of consecutive deletions
(nucleotides present in the reference sequence but not in the read) may be
an indication of reads mapping on exon-exon junctions. Despite this
eventuality is not in accordance with alignments based on transcript
coordinates, stretches of deletions can be included by errors or
inaccurate transcript annotation and should be discarded. In fact,
riboWaltz deliberately identifies the length of the reference
sequences covered by the reads with the reads length, referring to it as
length in all data structures reporting reads or P-site offsets
information (i.e. those generated by bamtolist
itself,
bamtobed
, bedtolist
, psite
and
psite_info
). This value is indeed strongly connected to the
identification of the P-site offsets since it determines the position of
the 5' and 3' read extremities (columns end5 and end3 in many
data structures generated by the package), used as starting points by
riboWaltz's core algorithm.
refseq_sep
is intended to lighten the identifiers of the reference
sequences included in the output list of data table or to modify them to
match those in the annotation table. Many details about the reference
sequence such as their version (usually dot-separated), their length, name
variants, associated gene/transcript/protein names (usually pipe-separated)
might indeed be stored in the FASTA file used for the alignment and
automatically transferred in the BAM.
A list of data tables or a GRangesList object.
## ## Let's suppose there are two BAM files ("Samp1.bam" and "Samp2.bam") in
## ## "path/to/BAM/files". We want to acquire them and assign to the
## ## corresponding data tables the names "Control" and "Treated",
## ## respectively.
## ## We first define the "name_samples" character string vector as follow:
## name_of_bams <- c("Control", "Treated")
## names(name_of_bams) <- c("Samp1", "Samp2")
##
## ## Then, we can acquire the two files:
## path_bam <- "path/to/BAM/files"
## reads_list <- bamtolist(bamfolder = path_bam, name_samples = name_of_bams,
## annotation = annotation_dt)
##
## ## read_list will be a list of two data tables, named "Control" (with
## ## mapping reads from "Samp1.bam") and "Treated" (with mapping reads
## ## from"Samp2.bam")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.