In monahton/GencoDymo: Data Extraction and Manipulation from the GENCODE Database

# This is required to load an Example Dataset. This image file must be present in the vignette folder.
  load("exampleData.RData")
  library(dplyr)

knitr::opts_chunk$set(echo = TRUE)

Introduction

Gene annotations for the human genome has been made available by several databases integrating data from systematic explorations of different experimental and computational resources. The GENCODE project [1], part of the ENCODE project [2], offers accurate annotations of the human and mouse genomes from the literature, primary data repositories, computational predictions, manual annotation, and experimental validation of genes and transcripts including the noncoding transcripts which are continously discovered and annotated. GENCODE is by large one of the most comprehensive and standardized database for gene annotations that is widely used by the community.\ The gene sets provided by GENCODE are comprehensive and include protein-coding and non-coding loci including alternatively spliced isoforms and pseudogenes. GENCODE gene annotations are constantly updated and regularly released as the Ensembl/GENCODE gene sets which are accessible via the official website (https://www.gencodegenes.org). GENCODE gene sets are released on average four times a year for mouse and two times a year for human [1].\ GencoDymo is an R package that interrogates different releases of GENCODE annotations from the human and the mouse genomes. GencoDymo extracts and manipulates GENCODE annotations and provides easily-accessible data regarding annotations statistics, releases comparison, in addition to introns and splice sites data. Moreover, GencoDymo can produce fasta files of donor and acceptor splice sites motifs that can be directly uploaded to the MaxEntScan tool [3] for the calculation of splice sites scores.

GencoDymo can be used to import and process any gtf formatted file. It automatizes and speeds up these data manipulation steps:

Extract introns data
Assign splice sites consensus sequence
Remove redundant exons and introns
Compile MaxEntScan splice sites motifs
Subset elements on the forward or reverse strand.
Calculate transcripts-per-gene frequency
Calculate exons-per-transcript frequency
Calculate spliced transcripts length
Compare annotations from different GENCODE releases.
Perform summary statistics analysis

Data Preparation

Download GENCODE annotations

In this workflow, we will be using the most recent release from GENCODE for the humann genome (release 32) as an example. The analysis will be performed on lncRNAs annotations gtf file and the basic annotations gtf file provided by GENCODE.\
Create a working directory to store these files.

 # download basic annotations gtf file
download.file(url="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.basic.annotation.gtf.gz", destfile = "gencode.v32.basic.annotation.gtf.gz", method = "wget")

 # download lncRNAs annotations gtf file
download.file(url="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.long_noncoding_RNAs.gtf.gz", destfile = "gencode.v32.long_noncoding_RNAs.gtf.gz", method = "wget")

Load a BSgenome object

GencoDymo requires a BSgenome object for the extraction of splice sites consensus and motifs sequences. For this workflow, we will be using the human genome sequence (hg38)

 # Load BSgenome object
   library(BSgenome.Hsapiens.UCSC.hg38)
   hg38 <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38

Import input files

Start by loading GencoDymo package

 # Load BSgenome object
   library(GencoDymo)

The load_gtf() function imports any gtf.gz file as a dataframe in the R environment.

 # Import basic annotations gtf as a dataframe
   gen32 <- load_gtf("gencode.v32.basic.annotation.gtf.gz")
 # Import lncRNAs annotations gtf as a dataframe
   gen32_lncRNA <- load_gtf("gencode.v32.long_noncoding_RNAs.gtf.gz")

Extract protein-coding genes

Basic annotation files in GENCODE are a subset of the corresponding comprehensive annotation, including only those transcripts tagged as 'basic' in every gene. This subset prioritises full-length protein coding transcripts over partial or non-protein coding transcripts within the same gene, and intends to highlight those transcripts that will be useful to the majority of users.\ The basic annotation file contains all gene classes annotated by GENCODE (lncRNAs, pseudogenes, tRNAs, scRNA...) in addition to the protein-coding genes.

The extract_pc() function subset the basic annotation dataframe extracting only data marked with the biotype "protein-coding". The function assign in the global environment a new dataframe called: pc_df that contains annotated protein-coding genes, transcripts and exons.

  # Extract protein-coding genes from basic annotation file
    pc_df <- extract_pc(gen32)

Introns data manipulation

Extract introns

The gtf files provided by GENCODE (for example the lncRNAs annotations gtf file) contains data regarding genes, transcripts, and exons only. Given the loaded data frame from the gtf files of GENCODE, the extract_introns() function produces a dataframe containing all the intronic regions in between all of those exons.

# Extract introns from the gtf file of lncRNAs
  gen32_lncRNA_introns <- extract_introns(gen32_lncRNA)

The obtained data frame includes the introns coordinates, their corresponding gene and transcripts ids, and the ordinal position of the intron.

Assign the donor and acceptor splice sites of GENCODE annotated introns

Introns always have two distinct nucleotides at either end. At the 5' end, the splice donor site includes an almost invariant sequence GT (GU in the RNA). The splice acceptor site at the 3' end of the intron terminates the intron with an almost invariant AG sequence.

Fig1. Donor and acceptor splice sites consensus. Image taken from (https://commons.wikimedia.org/wiki/File:Consensus_intron.jpg).

The majority of introns belong to the U2-type spliceosome and are flanked by GT– AG splice site dinucleotides. The most frequent exception to this rule are the U2-type GC–AG splice sites, comprising ∼0.8% of human splice sites and about ∼0.09% of the human splice sites belong to the U12-type which are processed by the minor spliceosome and are flanked by AT–AC dinucleotides [4].

The assign_ss() function adds two new columns to the introns data frame, assigning both the 5' and 3' splice sites. The sequences are retrieved from the loaded BSgenome object.

The first argument of the function is a dataframe containing intronic coordinates and a second argument should specify the genome sequence of the studied species (the human genome sequence hg38 is used by default).

intr_splice_sites <- assign_ss(gen32_lncRNA_introns, genome = hg38)

Extract splice sites motifs for MaxEntScan webtool

MaxEntScan is a webtool based on the approach for modeling the sequences of short sequence motifs such as those involved in RNA splicing which simultaneously accounts for non-adjacent as well as adjacent dependencies between positions. This method is based on the 'Maximum Entropy Principle' and generalizes most previous probabilistic models of sequence motifs such as weight matrix models and inhomogeneous Markov models [3]. \ The MaxEntScan::score5ss assign scores for donor splice sites according to four models (http://hollywood.mit.edu/burgelab/maxent/Xmaxentscan_scoreseq.html) by using 9-mer SEQUENCES motif as input in a .fa file. The 9-mers motif contain 3 bases in the exon and 6 bases in the intron. \ The MaxEntScan::score3ss assign scores for donor splice sites according to three models (http://hollywood.mit.edu/burgelab/maxent/Xmaxentscan_scoreseq_acc.html). It takes as input a .fa file of 23-mer SEQUENCES in which 20 bases are in the intron and 3 bases in the exon.

The extract_5ss_motif() and extract_3ss_motif() functions are used along with BSgenome objects of the studied species to retrieve MaxEntScan::score5ss and MaxEntScan::score3ss motif sequences respectively. \ The functions generate a fasta file in the working directory as an output, that contains either 9-mers or 23-mers, respectively for each 5' and 3' splice-sites, and a dataframe for the coordinates and IDs of the corresponding motifs. \ The generated fasta files can then be directly utilized by MaxEntScan tools.

Two arguments must be specified in each of the above functions. The first argument is a dataframe containing the intron coordinates and the second argument is a BSgenome object (the human genome sequence hg38 is used by default).

# Produce a fasta file with 5'ss 9-mers:
extract_5ss_motif(gen32_lncRNA_introns, BSgenome.Hsapiens.UCSC.hg38)
# Produce a fasta file with 3'ss 23-mers:
extract_3ss_motif(gen32_lncRNA_introns, BSgenome.Hsapiens.UCSC.hg38)

NOTE: When the above functions are used with the data frame of the entire introns of the GENCODE gtf files, they might take up to several minutes until the computation is done.

Data Manipulation of Exons and Introns

Extract the single-exon genes and transcripts ID

Some genes are composed by one single exon and have no introns.
The se_genes() function extract all the single-exon genes and returns in a dataframe their ensembl gene IDs.

# Extract single-Exon genes:
  se_genes(gen32)
  se_genes(gen32_lncRNA)

The se_transcripts() function extract all the single-exon transcripts and returns in a dataframe their ensembl transcript IDs.

# Extract single-Exon transcripts:
  se_transcripts(gen32)
  se_transcripts(gen32_lncRNA)

Assign the ordinal position of exons as single, first, inner, or last

The classify_exons() function adds a new column to the dataframe called "EXON_CLASSIFICATION", describing the position of each exon. Four categories are automatically generated: (first_exons), (inner_exons), (last_exons) and (single_exons).

# Annotate each exons as first, inner, last or single exons.
  classified_exons_df <- classify_exons(gen32_lncRNA)

This classification is informative and useful for subsequent analysis, considering that first and last exons often have peculiar regulative roles.

Redundant exons and introns

The eliminate_redundant_exons() function is used to retain unique exons only, removing all redundant ones. In fact, in genes with multiple transcripts, some exons are not alternatively spliced and will be represented multiple times within the gtf file. This function takes an exon-containing dataframe as input and returns a data frame with unique exons only.

# Remove redundant exons
eliminate_redundant_exons(gen32_lncRNA)

The eliminate_redundant_introns() function is used to retain unique introns only, removing redundant ones.

# Remove redundant introns
eliminate_redundant_introns(introns_df)

GENCODE data summary statistics

GencoDymo package provide a number of functions that gives a brief summary of data annotations in a particular GENCODE release regarding genes, transcripts, exons and introns of lncRNAs and protein-coding genes. Moreover, the package can compare different releases of GENCODE showing the variation in the number of annotations between them

Compare annotations from GENCODE different releases

As GENCODE regularly updates its annotations, different releases from GENCODE contain differing number of annotations. Some genes are added in new releases while others are classified into different gene classes. In the majority of cases, the ensembl gene ID of genes remains constant while the gene name might vary.

The gencode_compare() function compares two different releases of GENCODE annotations. \ The function takes three arguments: the first release data frame, the newer release data frame, and the type. The type represent the genetic element to be compared which can be "gene", "transcript", "exon" or "intron".

Let's load first in addition to the gtf file of lncRNAs annotations from release 32 (gen32_lncRNA), the gtf file of release 30.

# download gtf file of lncRNA annotations from release 30
  download.file(url="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/gencode.v30.long_noncoding_RNAs.gtf.gz", destfile = "gencode.v30.long_noncoding_RNAs.gtf.gz", method = "wget")

# load the downloaded gtf file as a data frame in r
  gen30_lncRNA <- load_gtf("gencode.v30.long_noncoding_RNAs.gtf.gz")
  gen30_lncRNA_introns <- extract_introns(gen30_lncRNA)

Now let's compare the different number of genes, transcripts, exons and introns between these two releases:

# Calculates the number of differing genetic elements
  gencode_compare(gen30_lncRNA, gen32_lncRNA, type = "gene")
  gencode_compare(gen30_lncRNA, gen32_lncRNA, type = "transcript")
  gencode_compare(gen30_lncRNA, gen32_lncRNA, type = "exon")
  gencode_compare(gen30_lncRNA_introns, gen32_lncRNA_introns, type = "intron")

The function produces as an outout a Delta value, that corresponds to the difference in number between the specified elements of the two inputs, in addition to its Percentage.

The output is summarised in Table1.

Table 1. Number and percentage of the different genetic elements between release 30 and release 32 of GENCODE for lncRNAs.

Genetic Element | Release 30 | Release 32 | Delta | Percentage (%) ---- | ----- | ---- | ---- | ---- Genes | 16193 | 17910 | 1717 | 9.58 Transcripts | 30369 | 48351 | 17982 | 37.19 Exons | 96215 | 176413 | 80198 | 45.46 Introns | 65846 | 128062 | 62216 | 48.58

Number of transcripts per gene

The genes_freq() function is used to produce a Transcripts per Gene frequency table. This table describes the number and the percentage of genes for which a given number of transcript isoforms exist.

# Transcripts per Gene frequency table:
  genes_freq_df <- genes_freq(gen32_lncRNA)

The output is a data frame composed of the following columns:

The first column represents the number of transcript isoforms per gene.
The second column describes how many genes correspond to such isoforms number.
The third column reports the percentage of such elements.

Number of exons per transcript

The trans_freq() function is used to produce an Exons per Transcript frequency table. This table describes the number and the percentage of transcripts that are composed by a certain number of exons.

# Exons per transcript frequency table:
  trans_freq_df <- trans_freq(gen32_lncRNA)

The output is a data frame composed of the following columns:

The first column represents the number of exons per transcript.
The second column describes how many transcripts contain the specified number of exons.
The third column reports the percentage of such elements.

Spliced transcripts length

The spliced_trans_length() function calculates the length of each transcript in its mature form (following the splicing of its introns).

# Length of spliced transcript  
  spliced_trans_length(gen32_lncRNA)

Exons length summary statistics

The stat_exon() function produces a data frame describing the mean, median, standard deviation and std error of exons' length. \ The function describes these parameters for exons taking into consideration their ordinal position.

# Calculates mean, median, sd and std error of exons
  stat_exon(gen32_lncRNA)

The output is generated for single exons, first exons, inner exons, and last exons. Length is reported in base pairs (bp).

Introns Statistics

The stat_intron() function produces a data frame describing the mean, median, standard deviation and std error of introns' length. \ The function describes the length of introns taking into consideration also whether they are first or inner, and from each category whether it harbors GT-AG, GC-AG or AT-AC splice sites. This classification could be of particular relevance as introns haroboring different splice sites have different length characteristics as reported in [5].

# Calculates mean, median, sd and std error of introns
stat_intron(introns_ss)

Length is reported in base pairs (bp).

Session information

The following provides the session information used when compiling this document.

   devtools::session_info()

References

Frankish A, Diekhans M, Ferreira AM, Johnson R, Jungreis I, Loveland J et al. “GENCODE reference annotation for the human and mouse genomes”. Nucleic Acids Res. 2019. 47: D766–773. doi: 10.1093/nar/gky955.
Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J et al. "GENCODE: producing a reference annotation for ENCODE". Genome Biol. 2006. 7 Suppl 1:S4.1-9. doi: 10.1186/gb-2006-7-s1-s4.
Yeo G and Burge CB. “Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals”. J Comput Biol. 2004. 11:377-394. doi: 10.1089/1066527041410418.
Parada GE, Munita R, Cerda CA and Gysling K. “A comprehensive survey of non-canonical splice sites in the human transcriptome”. Nucleic Acids Res. 2014. 42: 10564-10578. doi: 10.1093/nar/gku744.
Abou Alezz M, Celli L, Belotti G, Lisa A and Bione S. "Unexpected features of GC-AG introns in long non-coding and protein-coding genes suggest a new role as regulatory elements". bioRxiv. 2019. doi: 10.1101/683938

monahton/GencoDymo documentation built on July 4, 2025, 8:36 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

monahton/GencoDymo
Data Extraction and Manipulation from the GENCODE Database

In monahton/GencoDymo: Data Extraction and Manipulation from the GENCODE Database

Introduction

Data Preparation

Download GENCODE annotations

Load a BSgenome object

Import input files

Extract protein-coding genes

Introns data manipulation

Extract introns

Assign the donor and acceptor splice sites of GENCODE annotated introns

Extract splice sites motifs for MaxEntScan webtool

Data Manipulation of Exons and Introns

Extract the single-exon genes and transcripts ID

Assign the ordinal position of exons as single, first, inner, or last

Redundant exons and introns

GENCODE data summary statistics

Compare annotations from GENCODE different releases

Number of transcripts per gene

Number of exons per transcript

Spliced transcripts length

Exons length summary statistics

Introns Statistics

Session information

References

R Package Documentation

Browse R Packages

We want your feedback!

monahton/GencoDymo Data Extraction and Manipulation from the GENCODE Database

In monahton/GencoDymo: Data Extraction and Manipulation from the GENCODE Database

Introduction

Data Preparation

Download GENCODE annotations

Load a BSgenome object

Import input files

Extract protein-coding genes

Introns data manipulation

Extract introns

Assign the donor and acceptor splice sites of GENCODE annotated introns

Extract splice sites motifs for MaxEntScan webtool

Data Manipulation of Exons and Introns

Extract the single-exon genes and transcripts ID

Assign the ordinal position of exons as single, first, inner, or last

Redundant exons and introns

GENCODE data summary statistics

Compare annotations from GENCODE different releases

Number of transcripts per gene

Number of exons per transcript

Spliced transcripts length

Exons length summary statistics

Introns Statistics

Session information

References

R Package Documentation

Browse R Packages

We want your feedback!

monahton/GencoDymo
Data Extraction and Manipulation from the GENCODE Database