CDS: Generate a CDS coordinates table
In EricEdwardBryant/iSTOP: iSTOP - Induced STOP Experiment Design


View source: R/CDS-coordinates.R

CDS takes transcript annotation tables in UCSC format and reshapes them so that the coordinates of each exon occupy their own row, rather than being collapsed into a comma-separated string in a single cell.
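The reshaping described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the toy table, its values, and the helper names `split_coords` and `reshape_exons` are hypothetical, and only the column names are borrowed from the `tx_cols` argument.

```r
# Toy transcript table in a UCSC-like layout: one row per transcript, with
# exon coordinates collapsed into comma-separated strings (hypothetical values).
tx_table <- data.frame(
  tx         = c("tx1", "tx2"),
  chr        = c("chrI", "chrII"),
  strand     = c("+", "-"),
  exon_start = c("100,300,500,", "1000,1400,"),
  exon_end   = c("200,400,600,", "1200,1500,"),
  stringsAsFactors = FALSE
)

# Split one comma-separated coordinate string into an integer vector.
# strsplit() drops the empty field after a trailing comma.
split_coords <- function(x) as.integer(strsplit(x, ",", fixed = TRUE)[[1]])

# Expand each transcript row into one row per exon.
reshape_exons <- function(tbl) {
  rows <- lapply(seq_len(nrow(tbl)), function(i) {
    starts <- split_coords(tbl$exon_start[i])
    ends   <- split_coords(tbl$exon_end[i])
    stopifnot(length(starts) == length(ends))
    data.frame(
      tx     = tbl$tx[i],
      chr    = tbl$chr[i],
      strand = tbl$strand[i],
      start  = starts,
      end    = ends,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

exons <- reshape_exons(tx_table)
print(exons)
```

The two input rows become five exon rows, three for `tx1` and two for `tx2`.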

CDS(tx, gene, tx_cols, gene_cols, shift_start = 1L, shift_end = 0L)

CDS_example()

CDS_Celegans_UCSC_ce11()

CDS_Dmelanogaster_UCSC_dm6()

CDS_Drerio_UCSC_danRer10()

CDS_Hsapiens_UCSC_hg38()

CDS_Mmusculus_UCSC_mm10()

CDS_Rnorvegicus_UCSC_rn6()

CDS_Scerevisiae_UCSC_sacCer3()

CDS_Athaliana_BioMart_plantsmart28()

`tx`	A URL to a genome's transcript reference file. This table must have tab-separated fields and contain identifiers for each transcript, along with chromosome, strand, CDS start/end, and exon start/end information. There should be only one row per transcript, and the exon start/end columns should contain comma-separated coordinates for each exon in the transcript. An example file can be found here.
`gene`	A URL to a tab-separated file that maps transcript identifiers to common gene names. An example file can be found here.
`tx_cols`	A character vector of expected column names for the known-gene reference file. Required columns: "tx", "chr", "strand", "cds_start", "cds_end", "exon_start" and "exon_end". All other columns will be ignored.
`gene_cols`	A character vector of expected column names for the cross-reference file. Required columns: "tx" and "gene". All other columns will be ignored.
`shift_start`	Number of bases to shift the start positions. Defaults to 1, as this is necessary for compatibility with Biostrings::getSeq, which includes the start position in the returned sequence and begins counting bases at 1.
`shift_end`	Number of bases to shift the end positions. Defaults to 0.
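The effect of the default shifts can be illustrated with a toy sequence. This sketch assumes the input follows the UCSC table convention of zero-based start positions and one-based (exclusive) end positions. The sequence, the coordinates, and the helper `shift_coords` are hypothetical, and base R's `substr` stands in for Biostrings::getSeq because both count bases from 1 and include the start position.

```r
# Toy chromosome; "ATGAAATAG" occupies 1-based positions 4 to 12.
chromosome <- "GGGATGAAATAGCCC"

# The same feature written with a zero-based start and an exclusive end.
ucsc_start <- 3L
ucsc_end   <- 12L

# Hypothetical helper mirroring the shift_start / shift_end arguments.
shift_coords <- function(start, end, shift_start = 1L, shift_end = 0L) {
  list(start = start + shift_start, end = end + shift_end)
}

shifted   <- shift_coords(ucsc_start, ucsc_end)
feature   <- substr(chromosome, shifted$start, shifted$end)
unshifted <- substr(chromosome, ucsc_start, ucsc_end)

print(feature)    # "ATGAAATAG"
print(unshifted)  # "GATGAAATAG", one extra leading base without the shift
```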

The output of CDS should meet the following standards: (1) each row should represent the coordinates of a single exon; (2) exons should be numbered in order with reference to the transcript's strand (e.g. the first exon should include the start codon), where the absolute numbering is unimportant so long as the order is correct; and (3) the first and last exon coordinates should begin with the start codon and end with the stop codon.
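One way to satisfy standards (2) and (3) is to trim exons to the coding region and then rank them along the direction of transcription. The following is a sketch under stated assumptions rather than the package's own code: the table, its values, and the helper `clip_and_rank` are hypothetical, and coordinates are assumed to be 1-based and inclusive already.

```r
# Hypothetical per-exon table with each transcript's CDS bounds repeated per row.
exons <- data.frame(
  tx        = c("tx1", "tx1", "tx1", "tx2", "tx2", "tx2"),
  strand    = c("+", "+", "+", "-", "-", "-"),
  start     = c(100L, 300L, 500L, 1000L, 1400L, 1800L),
  end       = c(200L, 400L, 600L, 1200L, 1500L, 1900L),
  cds_start = c(150L, 150L, 150L, 1100L, 1100L, 1100L),
  cds_end   = c(550L, 550L, 550L, 1450L, 1450L, 1450L),
  stringsAsFactors = FALSE
)

clip_and_rank <- function(tbl) {
  # Drop exons that lie entirely outside the coding region (pure UTR).
  tbl <- tbl[tbl$end >= tbl$cds_start & tbl$start <= tbl$cds_end, ]
  # Trim so the outermost exons begin and end at the CDS boundaries.
  tbl$start <- pmax(tbl$start, tbl$cds_start)
  tbl$end   <- pmin(tbl$end, tbl$cds_end)
  # Rank in transcription order: ascending on "+", descending on "-".
  key <- ifelse(tbl$strand == "+", tbl$start, -tbl$start)
  tbl$exon <- as.integer(ave(as.numeric(key), tbl$tx, FUN = rank))
  tbl <- tbl[order(tbl$tx, tbl$exon), c("tx", "exon", "strand", "start", "end")]
  rownames(tbl) <- NULL
  tbl
}

cds <- clip_and_rank(exons)
print(cds)
```

For the minus-strand transcript `tx2`, the exon with the highest coordinates receives rank 1, since that is where the start codon lies, and the exon at 1800 to 1900 is dropped because it falls outside the coding region.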

To save the trouble of looking up URLs, pre-defined CDS builders are provided. They are named CDS_<Species>_<data-source>_<genome-assembly-ID>().

A data.frame with the following columns where each row represents a single exon:

COLUMN-NAME  DATA-TYPE  DESCRIPTION
tx           chr        Transcript symbol
gene         chr        Gene symbol
exon         int        Exon rank in gene (lowest contains ATG, highest contains native Stop)
chr          chr        Chromosome
strand       chr        Strand (+/-)
start        int        CDS coordinate start (always <= end)
end          int        CDS coordinate end (always >= start)
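
A table can be checked against the column layout above with a small base R helper. This is only a sketch: `check_cds_table` and the example rows are hypothetical and are not part of iSTOP.

```r
# Hypothetical validator for the documented column names, types and invariants.
check_cds_table <- function(cds) {
  expected <- c(tx = "character", gene = "character", exon = "integer",
                chr = "character", strand = "character",
                start = "integer", end = "integer")
  missing_cols <- setdiff(names(expected), names(cds))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  # Compare the first class of each column with the documented type.
  actual <- vapply(cds[names(expected)], function(col) class(col)[1], character(1))
  bad <- names(expected)[actual != expected]
  if (length(bad) > 0) {
    stop("Unexpected type in: ", paste(bad, collapse = ", "))
  }
  # Documented invariants: strand is + or -, and start never exceeds end.
  stopifnot(all(cds$strand %in% c("+", "-")), all(cds$start <= cds$end))
  invisible(TRUE)
}

# Hypothetical two-exon transcript in the documented layout.
example_cds <- data.frame(
  tx = c("tx1", "tx1"), gene = c("GENE1", "GENE1"), exon = 1:2,
  chr = c("chrI", "chrI"), strand = c("+", "+"),
  start = c(150L, 300L), end = c(200L, 400L),
  stringsAsFactors = FALSE
)

check_cds_table(example_cds)
```

The helper returns `TRUE` invisibly when the table conforms and raises an error otherwise.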