psite_info: Update reads information according to the inferred P-sites.

View source: R/psites.R

psite_info R Documentation

Update reads information according to the inferred P-sites.

Description

This function provides additional reads information according to the position of the P-site identified by psite. It attaches to each data table in a list four columns reporting i) the P-site position with respect to the 1st nucleotide of the transcript, ii) the P-site position with respect to the start and the stop codon of the annotated coding sequence (if any) and iii) the region of the transcript (5' UTR, CDS, 3' UTR) that includes the P-site. Please note: 1) for transcripts not associated with any annotated CDS, the P-site position with respect to the start and the stop codon is set to NA; 2) P-sites of short reads (<20 nts) might be located very close to the 5' or 3' extremity of the read, which has no biological meaning and can cause downstream issues; for these reasons, all read lengths showing this feature are removed. Optionally, additional columns reporting the three nucleotides covered by the P-site, the A-site and the E-site are attached, based on FASTA files or BSgenome data packages containing the transcript nucleotide sequences.
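As a rough illustration of the four columns described above, the following base R sketch computes them for a handful of made-up reads. It is not the package's implementation: the column names, the toy offsets, the use of 0 to mark a missing CDS and the exact boundary rule for the region are all assumptions made for the example.

```r
# Toy reads; every value below is invented for illustration.
reads <- data.frame(
  transcript = c("tx1", "tx1", "tx1", "tx2"),
  end5       = c(5, 40, 118, 10),   # 1-based position of the read 5' end
  length     = c(28, 28, 29, 28),
  cds_start  = c(31, 31, 31, 0),    # assumption: 0 marks "no annotated CDS"
  cds_stop   = c(120, 120, 120, 0)
)

# Hypothetical per-length offsets from the 5' end (stand-in for the psite output).
offset_from_5 <- c("28" = 12, "29" = 12)

# i) P-site position with respect to the 1st nucleotide of the transcript.
reads$psite <- reads$end5 + unname(offset_from_5[as.character(reads$length)])

# ii) P-site position with respect to the start and the stop codon (NA without a CDS).
has_cds <- reads$cds_stop > 0
reads$psite_from_start <- ifelse(has_cds, reads$psite - reads$cds_start, NA)
reads$psite_from_stop  <- ifelse(has_cds, reads$psite - reads$cds_stop, NA)

# iii) Transcript region containing the P-site (boundary rule is an assumption).
reads$psite_region <- ifelse(
  !has_cds, NA_character_,
  ifelse(reads$psite_from_start < 0, "5utr",
         ifelse(reads$psite_from_stop > 0, "3utr", "cds"))
)

print(reads)
```

The last read belongs to a transcript without a CDS, so its start- and stop-relative positions and its region are all NA.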

Usage

psite_info(
  data,
  offset,
  site = NULL,
  fastapath = NULL,
  fasta_genome = TRUE,
  refseq_sep = NULL,
  bsgenome = NULL,
  gtfpath = NULL,
  txdb = NULL,
  dataSource = NA,
  organism = NA,
  output_class = "datatable"
)

Arguments

data

Either a list of data tables or a GRangesList object from bamtolist, bedtolist, duplicates_filter or length_filter.

offset

Data table from psite.

site

Either "psite", "asite", "esite" or a combination of these strings. It specifies if additional column(s) reporting the three nucleotides covered by the ribosome P-site ("psite"), A-site ("asite") and E-site ("esite") should be added. Note: either fastapath or bsgenome is required for this purpose. Default is NULL.

fastapath

Character string specifying the FASTA file used in the alignment step, including its path, name and extension. This file can contain reference nucleotide sequences either of genome assemblies (chromosome sequences) or of transcripts (see Details and fasta_genome). Please make sure the sequences derive from the same release of the annotation file used in the create_annotation function. Note: either fastapath or bsgenome is required to generate additional column(s) specified by site. Default is NULL.

fasta_genome

Logical value indicating whether the FASTA file specified by fastapath contains nucleotide sequences of genome assemblies (chromosome sequences). If TRUE (the default), an annotation object is required (see gtfpath and txdb). FALSE implies nucleotide sequences of transcripts are provided instead.

refseq_sep

Character specifying the separator between reference sequences' name and additional information to discard, stored in the headers of the FASTA file specified by fastapath (if any). It might be required for matching the reference sequences' identifiers reported in the input list of data tables. All characters before the first occurrence of the specified separator are kept. Default is NULL i.e. no string splitting is performed.

bsgenome

Character string specifying the BSgenome data package with the genome sequences to be loaded. If not already present in the system, it is automatically installed through the biocLite.R script (check the list of available BSgenome data packages by running the available.genomes function of the BSgenome package). This parameter must be coupled with an annotation object (see gtfpath and txdb). Please make sure the sequences included in the specified BSgenome data package are in agreement with the sequences used in the alignment step. Note: either fastapath or bsgenome is required to generate additional column(s) specified by site. Default is NULL.

gtfpath

Character string specifying the location of a GTF file, including its path, name and extension. Please make sure the GTF file and the sequences specified by fastapath or bsgenome derive from the same release. Note that either gtfpath or txdb is required if and only if nucleotide sequences of genome assemblies (chromosome sequences) are provided (see fastapath or bsgenome). Default is NULL.

txdb

Character string specifying the TxDb annotation package to be loaded. If not already present in the system, it is automatically installed through the biocLite.R script (check the list of available TxDb annotation packages). Please make sure the TxDb annotation package and the sequences specified by fastapath or bsgenome derive from the same release. Note that either gtfpath or txdb is required if and only if nucleotide sequences of genome assemblies (chromosome sequences) are provided (see fastapath or bsgenome). Default is NULL.

dataSource

Optional character string describing the origin of the GTF data file. This parameter is considered only if gtfpath is specified. For more information about this parameter please refer to the description of dataSource of the makeTxDbFromGFF function included in the GenomicFeatures package.

organism

Optional character string reporting the genus and species of the organism of the GTF data file. This parameter is considered only if gtfpath is specified. For more information about this parameter please refer to the description of organism of the makeTxDbFromGFF function included in the GenomicFeatures package.

output_class

Either "datatable" or "granges". It specifies the format of the output i.e. a list of data tables or a GRangesList object. Default is "datatable".
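The behaviour described for refseq_sep (keep all characters before the first occurrence of the separator) can be pictured with a small base R sketch. The helper name trim_header and the header strings are hypothetical, and the package may implement the trimming differently.

```r
# Hypothetical FASTA headers, shown without the leading ">".
headers <- c(
  "tx1|gene1|GeneA",
  "tx2 cdna chromosome:1",
  "tx3"
)

# Keep everything before the first occurrence of `sep`.
# fixed = TRUE treats the separator literally, so characters such as "|"
# or "." are not interpreted as regular expression syntax.
trim_header <- function(x, sep = NULL) {
  if (is.null(sep)) return(x)   # NULL: no string splitting is performed
  vapply(strsplit(x, sep, fixed = TRUE), function(p) p[1], character(1))
}

pipe_trimmed  <- trim_header(headers, sep = "|")
space_trimmed <- trim_header(headers, sep = " ")
print(pipe_trimmed)
print(space_trimmed)
```

A header that does not contain the separator is returned unchanged, which is why a wrongly chosen separator can leave identifiers that do not match those in the input data tables.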

Details

riboWaltz only works for read alignments based on transcript coordinates. This choice is due to the main purpose of RiboSeq assays, which is to study translational events through the isolation and sequencing of ribosome protected fragments. Most reads from RiboSeq are supposed to map on mRNAs and not on introns and intergenic regions. BAM files based on transcript coordinates can be generated in two ways: i) aligning directly against transcript sequences; ii) aligning against sequences of genome assemblies, i.e. standard chromosome sequences, thus requiring the outputs to be translated into transcript coordinates. The first option can be easily handled by many aligners (e.g. Bowtie), given a reference FASTA file where each sequence represents a transcript, from the beginning of the 5' UTR to the end of the 3' UTR. The second procedure is based on reference FASTA files where each sequence represents a chromosome, usually coupled with comprehensive gene annotation files (GTF or GFF). The STAR aligner, with its option --quantMode TranscriptomeSAM (see Chapter 6 of its manual), is an example of a tool providing such a feature.

Value

A list of data tables or a GRangesList object.
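When site is specified, the output also carries the nucleotide triplets covered by the requested ribosome sites. The base R sketch below shows the idea on an invented transcript. It assumes that the P-site position is the 1-based coordinate of the first nucleotide of the P-site codon, with the E-site codon immediately upstream and the A-site codon immediately downstream; the helper codon_at and the variable names are hypothetical.

```r
# Invented transcript: 6-nt 5' UTR, a short CDS, 2-nt 3' UTR.
transcript_seq <- paste0("GGCACC", "ATG", "GCT", "AAA", "TTT", "GGG", "TAA", "CC")

# Hypothetical helper: the three nucleotides starting at a 1-based position.
codon_at <- function(seq, start) substr(seq, start, start + 2)

# Assumed P-site position: first nucleotide of the P-site codon.
psite <- 13

p_codon <- codon_at(transcript_seq, psite)       # codon in the P-site
a_codon <- codon_at(transcript_seq, psite + 3)   # next codon downstream (A-site)
e_codon <- codon_at(transcript_seq, psite - 3)   # previous codon upstream (E-site)

print(c(E = e_codon, P = p_codon, A = a_codon))
```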

Examples

data(reads_list)
data(psite_offset)
data(mm81cdna)

reads_psite_list <- psite_info(reads_list, psite_offset)

LabTranslationalArchitectomics/riboWaltz documentation built on Jan. 17, 2024, 12:18 p.m.