library(ezRun)

Introduction

ezRun makes use of reference genomes stored in a canonical directory structure. Reference genomes can be built from a gtf file containing the gene annotations and a fasta file holding the genome sequence. With the file-based representation the reference genomes are available to external tools like read aligners and others.

This pipeline has been tested for Ensembl and GENCODE annotation.

To have necessary external tools available to R, activate the conda environment "ezRun"

library(ezRun)
library(reticulate)
use_condaenv("ezRun", conda = "/usr/local/ngseq/miniconda3/bin/conda",
             required = TRUE)
py_discover_config()

GENCODE

Here we give an exmaple of human reference annotation from GENCODE.

organism <- "Homo_sapiens"
db <- "GENCODE"
build <- "GRCh38.p13"
release <- "Release_35"
## We download the reference genome and gtf from GENCODE release 35
gtfURL <- "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/gencode.v35.primary_assembly.annotation.gtf.gz"
download.file(gtfURL, basename(gtfURL))
genomeURL <- "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/GRCh38.primary_assembly.genome.fa.gz"
download.file(genomeURL, basename(genomeURL))
featureFn <- basename(gtfURL)
genomeFn <- basename(genomeURL)

Other annotation information can either be fetched via BioMart in makeFeatAnnoEnsembl function or downloaded from BioMart. The required attributes are "Transcript stable ID", "Gene description", "GO term accession", "GO domain" in the manual download from BioMart.

refBuild <- file.path(organism, db, build, "Annotation",
                      str_c(release, "2020-10-30", sep="-"))
## The reference folder will be generated under `genomesRoot`, which is current working directory here.
param <- ezParam(list(refBuild=refBuild, genomesRoot="."))
buildRefDir(param$ezRef, genomeFile=genomeFn, genesFile=featureFn)
buildIgvGenome(param$ezRef)
makeFeatAnnoEnsembl(featureFile=file.path(dirname(param$ezRef@refFeatureFile),
                                          "features.gtf"),
                    genomeFile=param$ezRef@refFastaFile,
                    organism="hsapiens_gene_ensembl")

makeFeatAnnoEnsembl(featureFile=file.path(dirname(param$ezRef@refFeatureFile),
                                          "genes.gtf"),
                    genomeFile=param$ezRef@refFastaFile,
                    organism="hsapiens_gene_ensembl")

Structure of a reference genome folder

The reference genome folder is inspired by illumina's iGenome folders, but has differences with respect to handling different annotation versions.

A reference genome build must be in a folder with the path

<species> / <provider> / <build>

Build names should be unique within an installation. Within each build the sub-directories are:

- <provider>
  - <build>
    - Annotation
      - Genes (is a link to the latest version)
      - Release-<date>
        - Genes
          - genes.gtf                            # buildRefDir
          - genes.sorted.gtf                     # buildIgvGenome
          - genes.sorted.gtf.idx                 # buildIgvGenome
          - genes_annotation.txt                 # makeFeatAnnoEnsembl
          - genes_annotation_byGene.txt          # makeFeatAnnoEnsembl
          - genes_annotation_byTranscript.txt    # makeFeatAnnoEnsembl
          - features.gtf                         # buildRefDir
          - features_annotation.txt              # makeFeatAnnoEnsembl
          - features_annotation_byGene.txt       # makeFeatAnnoEnsembl 
          - features_annotation_byTranscript.txt # makeFeatAnnoEnsembl
    - Sequence
      - WholeGenomeFasta
        - genome-chromsizes.txt                  # ezParam
        - genome.fa                              # buildRefDir
        - genome.fa.fai                          # buildRefDir
        - genome.dict                            # buildRefDir
    - igv_build.genome                           # buildIgvGenome

Indices of the various aligners will be built on demand and also placed in the folder structure.

Description of annotation files

Genome annotation files

Genome annotation files that define features (e.g. gene loci) on the genome must be in gtf file format. If the input file is in gff format, it can be converted to gtf with the gffread utility from the cufflinks suite

gffread -E -T -o genes.gtf genes.gff 

By convention the content of the feature files in the reference folders is:

Feature annotation file

Every file <name>.gtf must be accompanied by two annotation files <name>_annotation_byGene.txt and <name>_annotation_byTranscript.txt This annotation file must have rownames in the first column of the two files: gene_id and transcript_id, respectively. These ids should match the corresponding ids in the gtf file. Currently these annotation files contain the following columns:

Processing and checks of genome assemblies and gene annotation files

The buildRefDir function does:

samtools faidx genome.fa
java -Djava.io.tmpdir=. -jar picard.jar CreateSequenceDictionary R=genome.fa O=genome.dict


uzh/ezRun documentation built on April 19, 2024, 8:25 a.m.