In uzh/ezRun: An R meta-package for the analysis of Next Generation Sequencing Data

library(ezRun)

Introduction

ezRun makes use of reference genomes stored in a canonical directory structure. Reference genomes can be built from a gtf file containing the gene annotations and a fasta file holding the genome sequence. With the file-based representation the reference genomes are available to external tools like read aligners and others.

This pipeline has been tested for Ensembl and GENCODE annotation.

To have necessary external tools available to R, activate the conda environment "ezRun"

library(ezRun)
library(reticulate)
use_condaenv("ezRun", conda = "/usr/local/ngseq/miniconda3/bin/conda",
             required = TRUE)
py_discover_config()

GENCODE

Here we give an exmaple of human reference annotation from GENCODE.

organism <- "Homo_sapiens"
db <- "GENCODE"
build <- "GRCh38.p13"
release <- "Release_35"

## We download the reference genome and gtf from GENCODE release 35
gtfURL <- "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/gencode.v35.primary_assembly.annotation.gtf.gz"
download.file(gtfURL, basename(gtfURL))
genomeURL <- "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/GRCh38.primary_assembly.genome.fa.gz"
download.file(genomeURL, basename(genomeURL))
featureFn <- basename(gtfURL)
genomeFn <- basename(genomeURL)

Other annotation information can either be fetched via BioMart in makeFeatAnnoEnsembl function or downloaded from BioMart. The required attributes are "Transcript stable ID", "Gene description", "GO term accession", "GO domain" in the manual download from BioMart.

refBuild <- file.path(organism, db, build, "Annotation",
                      str_c(release, "2020-10-30", sep="-"))
## The reference folder will be generated under `genomesRoot`, which is current working directory here.
param <- ezParam(list(refBuild=refBuild, genomesRoot="."))
buildRefDir(param$ezRef, genomeFile=genomeFn, genesFile=featureFn)
buildIgvGenome(param$ezRef)

makeFeatAnnoEnsembl(featureFile=file.path(dirname(param$ezRef@refFeatureFile),
                                          "features.gtf"),
                    genomeFile=param$ezRef@refFastaFile,
                    organism="hsapiens_gene_ensembl")

makeFeatAnnoEnsembl(featureFile=file.path(dirname(param$ezRef@refFeatureFile),
                                          "genes.gtf"),
                    genomeFile=param$ezRef@refFastaFile,
                    organism="hsapiens_gene_ensembl")

Structure of a reference genome folder

The reference genome folder is inspired by illumina's iGenome folders, but has differences with respect to handling different annotation versions.

A reference genome build must be in a folder with the path

<species> / <provider> / <build>

Build names should be unique within an installation. Within each build the sub-directories are:

- <provider>
  - <build>
    - Annotation
      - Genes (is a link to the latest version)
      - Release-<date>
        - Genes
          - genes.gtf                            # buildRefDir
          - genes.sorted.gtf                     # buildIgvGenome
          - genes.sorted.gtf.idx                 # buildIgvGenome
          - genes_annotation.txt                 # makeFeatAnnoEnsembl
          - genes_annotation_byGene.txt          # makeFeatAnnoEnsembl
          - genes_annotation_byTranscript.txt    # makeFeatAnnoEnsembl
          - features.gtf                         # buildRefDir
          - features_annotation.txt              # makeFeatAnnoEnsembl
          - features_annotation_byGene.txt       # makeFeatAnnoEnsembl 
          - features_annotation_byTranscript.txt # makeFeatAnnoEnsembl
    - Sequence
      - WholeGenomeFasta
        - genome-chromsizes.txt                  # ezParam
        - genome.fa                              # buildRefDir
        - genome.fa.fai                          # buildRefDir
        - genome.dict                            # buildRefDir
    - igv_build.genome                           # buildIgvGenome

Indices of the various aligners will be built on demand and also placed in the folder structure.

Description of annotation files

Genome annotation files

Genome annotation files that define features (e.g. gene loci) on the genome must be in gtf file format. If the input file is in gff format, it can be converted to gtf with the gffread utility from the cufflinks suite

gffread -E -T -o genes.gtf genes.gff

By convention the content of the feature files in the reference folders is:

genes.gtf -- covers all protein coding loci and transcribed noncoding regions in the reference (i.e. all loci where transcripts with a poly-A are generated); designed for poly-A RNA-seq
features.gtf -- all annotated features in the genome, including pseudogenes, etc.; meant for whole genome exploratory analyses. Note: the included pseudogenes may give rise to multi-mapping reads in RNA-seq approaches because of the inclusion of loci that are never transcribed.

Feature annotation file

Every file <name>.gtf must be accompanied by two annotation files <name>_annotation_byGene.txt and <name>_annotation_byTranscript.txt This annotation file must have rownames in the first column of the two files: gene_id and transcript_id, respectively. These ids should match the corresponding ids in the gtf file. Currently these annotation files contain the following columns:

transcript_id
gene_id
gene_name
type: more general types of gene_biotype: "pseudogene", "long_noncoding", "short_noncoding", "protein_coding", "rRNA", "tRNA"
strand
seqid
start
end
biotypes: gene biotype
description: gene description
gc
width
GO BP
GO MF
GO CC

Processing and checks of genome assemblies and gene annotation files

The buildRefDir function does:

from the fasta file remove anything everything after the first space on the sequence identifier lines
remove contigs where the contig-name includeds the word PATCH
verify that the gene annotation file uses the same chromosome names as the fasta file
remove chromosomes from the annotation files that are not present in the fasta file
create a fasta index file with

samtools faidx genome.fa

create the genome dict file with

java -Djava.io.tmpdir=. -jar picard.jar CreateSequenceDictionary R=genome.fa O=genome.dict

uzh/ezRun documentation built on June 14, 2025, 1:29 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

uzh/ezRun
An R meta-package for the analysis of Next Generation Sequencing Data

In uzh/ezRun: An R meta-package for the analysis of Next Generation Sequencing Data

Introduction

GENCODE

Structure of a reference genome folder

Description of annotation files

Genome annotation files

Feature annotation file

Processing and checks of genome assemblies and gene annotation files

R Package Documentation

Browse R Packages

We want your feedback!

uzh/ezRun An R meta-package for the analysis of Next Generation Sequencing Data

In uzh/ezRun: An R meta-package for the analysis of Next Generation Sequencing Data

Introduction

GENCODE

Structure of a reference genome folder

Description of annotation files

Genome annotation files

Feature annotation file

Processing and checks of genome assemblies and gene annotation files

R Package Documentation

Browse R Packages

We want your feedback!

uzh/ezRun
An R meta-package for the analysis of Next Generation Sequencing Data