makeTxDb: Making a TxDb object from user-supplied annotations
In jmacdon/GenomicFeatures: Conveniently import and query gene models


makeTxDb is a low-level constructor for making a TxDb object from user-supplied transcript annotations.

Note that the end user will rarely need to use makeTxDb directly but will typically use one of the high-level constructors makeTxDbFromUCSC, makeTxDbFromEnsembl, or makeTxDbFromGFF.


makeTxDb(transcripts, splicings, genes=NULL,
         chrominfo=NULL, metadata=NULL,
         reassign.ids=FALSE, on.foreign.transcripts=c("error", "drop"))

`transcripts`	Data frame containing the genomic locations of a set of transcripts.
`splicings`	Data frame containing the exon and CDS locations of a set of transcripts.
`genes`	Data frame containing the genes associated with a set of transcripts.
`chrominfo`	Data frame containing information about the chromosomes hosting the set of transcripts.
`metadata`	2-column data frame containing meta information about this set of transcripts, such as organism, genome, UCSC table, etc. The columns must be named `"name"` and `"value"` and must be of type character.
`reassign.ids`	`TRUE` or `FALSE`. Controls how internal ids are assigned for each type of feature, i.e. for transcripts, exons, and CDS. For each type, if `reassign.ids` is `FALSE` (the default) and the ids are supplied, they are used as the internal ids. Otherwise, the internal ids are assigned in a way that is compatible with ordering the features first by chromosome, then by strand, then by start, and finally by end.
`on.foreign.transcripts`	Controls what to do when the input contains foreign transcripts i.e. transcripts that are on sequences not in `chrominfo`. If set to `"error"` (the default)

The transcripts (required), splicings (required) and genes (optional) arguments must be data frames that describe a set of transcripts and the genomic features related to them (exons, CDS and genes at the moment). The chrominfo (optional) argument must be a data frame containing chromosome information like the length of each chromosome.
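As a minimal sketch of the metadata argument listed among the arguments, the snippet below builds a 2-column data frame and checks the two documented constraints (column names and character type). The entries themselves ("Organism", "Genome" and their values) are illustrative assumptions, not values required by makeTxDb.

```r
# Illustrative metadata table; the rows are made-up examples.
# stringsAsFactors = FALSE keeps both columns as character in R < 4.0 too.
metadata <- data.frame(
    name  = c("Organism", "Genome"),
    value = c("Homo sapiens", "hg38"),
    stringsAsFactors = FALSE
)

# Documented constraints: columns named "name" and "value", both character.
stopifnot(
    identical(colnames(metadata), c("name", "value")),
    is.character(metadata$name),
    is.character(metadata$value)
)
```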

transcripts must have 1 row per transcript and the following columns:

tx_id: Transcript ID. Integer vector. No NAs. No duplicates.
tx_chrom: Transcript chromosome. Character vector (or factor) with no NAs.
tx_strand: Transcript strand. Character vector (or factor) with no NAs where each element is either "+" or "-".
tx_start, tx_end: Transcript start and end. Integer vectors with no NAs.
tx_name: [optional] Transcript name. Character vector (or factor). NAs and/or duplicates are ok.
tx_type: [optional] Transcript type (e.g. mRNA, ncRNA, snoRNA, etc.). Character vector (or factor). NAs and/or duplicates are ok.
gene_id: [optional] Associated gene. Character vector (or factor). NAs and/or duplicates are ok.

Other columns, if any, are ignored (with a warning).
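The column requirements above can be illustrated with a small transcripts table that also carries the optional tx_name, tx_type and gene_id columns. This is only a sketch in base R: all values are made up, and the stopifnot() checks restate the documented constraints rather than reproduce the validation makeTxDb performs internally.

```r
# Hypothetical transcripts table; every value is made up for illustration.
transcripts <- data.frame(
    tx_id     = 1:3,
    tx_chrom  = "chr1",
    tx_strand = c("-", "+", "+"),
    tx_start  = c(1L, 2001L, 2001L),
    tx_end    = c(999L, 2199L, 2199L),
    tx_name   = c("txA", "txB", NA),          # NAs are allowed here
    tx_type   = c("mRNA", "mRNA", "ncRNA"),
    gene_id   = c("geneX", "geneY", "geneY"), # duplicates are allowed here
    stringsAsFactors = FALSE
)

# Checks restating the documented constraints on the required columns.
required <- c("tx_id", "tx_chrom", "tx_strand", "tx_start", "tx_end")
stopifnot(
    all(required %in% colnames(transcripts)),
    is.integer(transcripts$tx_id),
    !anyDuplicated(transcripts$tx_id),
    all(complete.cases(transcripts[required])),
    all(transcripts$tx_strand %in% c("+", "-")),
    is.integer(transcripts$tx_start),
    is.integer(transcripts$tx_end)
)
```

Because this table has a gene_id column, the genes argument would be left at its default of NULL when calling makeTxDb.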

splicings must have N rows per transcript, where N is the number of exons in the transcript. Each row describes an exon plus, optionally, the CDS contained in this exon. Its columns must be:

tx_id: Foreign key that links each row in the splicings data frame to a unique row in the transcripts data frame. Note that more than 1 row in splicings can be linked to the same row in transcripts (many-to-one relationship). Same type as transcripts$tx_id (integer vector). No NAs. All the values in this column must be present in transcripts$tx_id.
exon_rank: The rank of the exon in the transcript. Integer vector with no NAs. (tx_id, exon_rank) pairs must be unique.
exon_id: [optional] Exon ID. Integer vector with no NAs.
exon_name: [optional] Exon name. Character vector (or factor). NAs and/or duplicates are ok.
exon_chrom: [optional] Exon chromosome. Character vector (or factor) with no NAs. If missing then transcripts$tx_chrom is used. If present then exon_strand must also be present.
exon_strand: [optional] Exon strand. Character vector (or factor) with no NAs. If missing then transcripts$tx_strand is used and exon_chrom must also be missing.
exon_start, exon_end: Exon start and end. Integer vectors with no NAs.
cds_id: [optional] CDS ID. Integer vector. If present then cds_start and cds_end must also be present. NAs are allowed and must match those in cds_start and cds_end.
cds_name: [optional] CDS name. Character vector (or factor). If present then cds_start and cds_end must also be present. NAs and/or duplicates are ok. Must contain NAs at least where cds_start and cds_end contain them.
cds_start, cds_end: [optional] CDS start and end. Integer vectors. If one of the 2 columns is missing then all cds_* columns must be missing. NAs are allowed and must occur at the same positions in cds_start and cds_end.
cds_phase: [optional] CDS phase. Integer vector. If present then cds_start and cds_end must also be present. NAs are allowed and must match those in cds_start and cds_end.

Other columns, if any, are ignored (with a warning).
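The relationships described above (the tx_id foreign key, unique (tx_id, exon_rank) pairs, and NAs aligned across the cds_* columns) can be checked on a candidate splicings table before it is passed to makeTxDb. The sketch below uses base R only; known_tx_ids is a hypothetical stand-in for transcripts$tx_id, and the checks restate the documentation rather than the package's own validation code.

```r
# Candidate splicings table: one row per exon.
splicings <- data.frame(
    tx_id      = c(1L, 2L, 2L, 2L, 3L, 3L),
    exon_rank  = c(1L, 1L, 2L, 3L, 1L, 2L),
    exon_start = c(1L, 2001L, 2101L, 2131L, 2001L, 2131L),
    exon_end   = c(999L, 2085L, 2144L, 2199L, 2085L, 2199L),
    cds_start  = c(1L, 2022L, 2101L, 2131L, NA, NA),
    cds_end    = c(999L, 2085L, 2144L, 2193L, NA, NA),
    cds_phase  = c(0L, 0L, 2L, 0L, NA, NA)
)
known_tx_ids <- 1:3  # hypothetical stand-in for transcripts$tx_id

stopifnot(
    # Foreign key: every tx_id must be present in the transcripts table.
    all(splicings$tx_id %in% known_tx_ids),
    # (tx_id, exon_rank) pairs must be unique.
    !anyDuplicated(splicings[c("tx_id", "exon_rank")]),
    # NAs must occur at the same positions in the cds_* columns.
    identical(is.na(splicings$cds_start), is.na(splicings$cds_end)),
    identical(is.na(splicings$cds_phase), is.na(splicings$cds_start))
)

# Number of rows (exons) contributed by each transcript.
exons_per_tx <- table(splicings$tx_id)
```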

genes should not be supplied if transcripts has a gene_id column. If supplied, it must have N rows per transcript, where N is the number of genes linked to the transcript (N will be 1 most of the time). Its columns must be:

tx_id: [optional] genes must have either a tx_id or a tx_name column but not both. Like splicings$tx_id, this is a foreign key that links each row in the genes data frame to a unique row in the transcripts data frame.
tx_name: [optional] Can be used as an alternative to the genes$tx_id foreign key.
gene_id: Gene ID. Character vector (or factor). No NAs.

Other columns, if any, are ignored (with a warning).
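A genes table following the rules above might look like the sketch below. It is keyed by tx_id (tx_name could be used instead, but not both), and one transcript is linked to two genes to show the N-rows-per-transcript case. The gene identifiers and known_tx_ids are hypothetical.

```r
# Hypothetical genes table keyed by tx_id. Transcript 2 is linked to
# two genes, so it contributes two rows.
genes <- data.frame(
    tx_id   = c(1L, 2L, 2L, 3L),
    gene_id = c("geneX", "geneY", "geneZ", "geneY"),
    stringsAsFactors = FALSE
)
known_tx_ids <- 1:3  # hypothetical stand-in for transcripts$tx_id

stopifnot(
    # Exactly one of tx_id / tx_name must be present.
    xor("tx_id" %in% colnames(genes), "tx_name" %in% colnames(genes)),
    # Foreign key: every tx_id must be present in the transcripts table.
    all(genes$tx_id %in% known_tx_ids),
    # gene_id must not contain NAs.
    !anyNA(genes$gene_id)
)

# Genes linked to each transcript.
genes_per_tx <- split(genes$gene_id, genes$tx_id)
```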

chrominfo must have 1 row per chromosome and the following columns:

chrom: Chromosome name. Character vector (or factor) with no NAs and no duplicates.
length: Chromosome length. Integer vector with either all NAs or no NAs.
is_circular: [optional] Chromosome circularity flag. Logical vector. NAs are ok.

Other columns, if any, are ignored (with a warning).
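The sketch below builds a chrominfo table that satisfies the constraints above and then lists "foreign" sequences, i.e. transcript chromosomes that are absent from chrominfo, which is the situation the on.foreign.transcripts argument addresses. The two lengths are those of chr1 and chrM in the hg38 human assembly; the tx_chrom vector, including the sequence name "chrUn_hypothetical", is made up for illustration.

```r
# chrominfo with one row per chromosome.
chrominfo <- data.frame(
    chrom       = c("chr1", "chrM"),
    length      = c(248956422L, 16569L),
    is_circular = c(FALSE, TRUE),
    stringsAsFactors = FALSE
)

stopifnot(
    !anyNA(chrominfo$chrom),
    !anyDuplicated(chrominfo$chrom),
    # length must be either all NAs or contain no NAs.
    all(is.na(chrominfo$length)) || !anyNA(chrominfo$length)
)

# Hypothetical transcript chromosomes; the last one is not in chrominfo.
tx_chrom <- c("chr1", "chr1", "chrUn_hypothetical")

# Sequences that would make the corresponding transcripts "foreign".
foreign <- setdiff(tx_chrom, chrominfo$chrom)
```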

A TxDb object.

Hervé Pagès

makeTxDbFromUCSC, makeTxDbFromBiomart, and makeTxDbFromEnsembl, for making a TxDb object from online resources.
makeTxDbFromGRanges and makeTxDbFromGFF for making a TxDb object from a GRanges object, or from a GFF or GTF file.
The TxDb class.
saveDb and loadDb in the AnnotationDbi package for saving and loading a TxDb object as an SQLite file.

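library(GenomicFeatures)

## One row per transcript: 1 on the minus strand, 2 on the plus strand.
transcripts <- data.frame(
                   tx_id=1:3,
                   tx_chrom="chr1",
                   tx_strand=c("-", "+", "+"),
                   tx_start=c(1, 2001, 2001),
                   tx_end=c(999, 2199, 2199))

## One row per exon. Transcript 3 has no CDS, hence the NAs in the
## cds_* columns of its 2 exons.
splicings <- data.frame(
                   tx_id=c(1L, 2L, 2L, 2L, 3L, 3L),
                   exon_rank=c(1, 1, 2, 3, 1, 2),
                   exon_start=c(1, 2001, 2101, 2131, 2001, 2131),
                   exon_end=c(999, 2085, 2144, 2199, 2085, 2199),
                   cds_start=c(1, 2022, 2101, 2131, NA, NA),
                   cds_end=c(999, 2085, 2144, 2193, NA, NA),
                   cds_phase=c(0, 0, 2, 0, NA, NA))

## genes, chrominfo and metadata are left at their default (NULL).
txdb <- makeTxDb(transcripts, splicings)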
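The description of reassign.ids mentions an ordering of features by chromosome, then strand, then start, then end. The base R sketch below illustrates such an ordering on a made-up transcripts table and derives ids from it. It is not the package's internal code: sorting chromosome names alphabetically and placing "+" before "-" are assumptions made here for illustration only.

```r
# Made-up transcripts, deliberately not in sorted order.
transcripts <- data.frame(
    tx_chrom  = c("chr2", "chr1", "chr1", "chr1"),
    tx_strand = c("+", "-", "+", "+"),
    tx_start  = c(500L, 1L, 2001L, 2001L),
    tx_end    = c(900L, 999L, 2300L, 2199L),
    stringsAsFactors = FALSE
)

# Assumption: "+" sorts before "-"; the source does not state this.
strand_levels <- c("+", "-")

# Order by chromosome, then strand, then start, then end.
ord <- order(transcripts$tx_chrom,
             match(transcripts$tx_strand, strand_levels),
             transcripts$tx_start,
             transcripts$tx_end)

# Assign ids 1..n following that order.
transcripts$tx_id <- integer(nrow(transcripts))
transcripts$tx_id[ord] <- seq_along(ord)
```

With these values, the two chr1 plus-strand transcripts share a start, so the one with the smaller end receives the lower id.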