library(ezRun)
ezRun
makes use of reference genomes stored in a canonical directory structure.
Reference genomes can be built from a gtf file containing the gene annotations
and a fasta file holding the genome sequence.
With the file-based representation the reference genomes are available to external tools like read aligners and others.
This pipeline has been tested for Ensembl and GENCODE annotation.
To have necessary external tools available to R, activate the conda environment "ezRun"
library(ezRun) library(reticulate) use_condaenv("ezRun", conda = "/usr/local/ngseq/miniconda3/bin/conda", required = TRUE) py_discover_config()
Here we give an exmaple of human reference annotation from GENCODE.
organism <- "Homo_sapiens" db <- "GENCODE" build <- "GRCh38.p13" release <- "Release_35"
## We download the reference genome and gtf from GENCODE release 35 gtfURL <- "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/gencode.v35.primary_assembly.annotation.gtf.gz" download.file(gtfURL, basename(gtfURL)) genomeURL <- "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/GRCh38.primary_assembly.genome.fa.gz" download.file(genomeURL, basename(genomeURL)) featureFn <- basename(gtfURL) genomeFn <- basename(genomeURL)
Other annotation information can either be fetched via BioMart
in makeFeatAnnoEnsembl
function or downloaded from BioMart.
The required attributes are "Transcript stable ID", "Gene description",
"GO term accession", "GO domain" in the manual download from BioMart.
refBuild <- file.path(organism, db, build, "Annotation", str_c(release, "2020-10-30", sep="-")) ## The reference folder will be generated under `genomesRoot`, which is current working directory here. param <- ezParam(list(refBuild=refBuild, genomesRoot=".")) buildRefDir(param$ezRef, genomeFile=genomeFn, genesFile=featureFn) buildIgvGenome(param$ezRef)
makeFeatAnnoEnsembl(featureFile=file.path(dirname(param$ezRef@refFeatureFile), "features.gtf"), genomeFile=param$ezRef@refFastaFile, organism="hsapiens_gene_ensembl") makeFeatAnnoEnsembl(featureFile=file.path(dirname(param$ezRef@refFeatureFile), "genes.gtf"), genomeFile=param$ezRef@refFastaFile, organism="hsapiens_gene_ensembl")
The reference genome folder is inspired by illumina's iGenome folders, but has differences with respect to handling different annotation versions.
A reference genome build must be in a folder with the path
<species> / <provider> / <build>
Build names should be unique within an installation. Within each build the sub-directories are:
- <provider> - <build> - Annotation - Genes (is a link to the latest version) - Release-<date> - Genes - genes.gtf # buildRefDir - genes.sorted.gtf # buildIgvGenome - genes.sorted.gtf.idx # buildIgvGenome - genes_annotation.txt # makeFeatAnnoEnsembl - genes_annotation_byGene.txt # makeFeatAnnoEnsembl - genes_annotation_byTranscript.txt # makeFeatAnnoEnsembl - features.gtf # buildRefDir - features_annotation.txt # makeFeatAnnoEnsembl - features_annotation_byGene.txt # makeFeatAnnoEnsembl - features_annotation_byTranscript.txt # makeFeatAnnoEnsembl - Sequence - WholeGenomeFasta - genome-chromsizes.txt # ezParam - genome.fa # buildRefDir - genome.fa.fai # buildRefDir - genome.dict # buildRefDir - igv_build.genome # buildIgvGenome
Indices of the various aligners will be built on demand and also placed in the folder structure.
Genome annotation files that define features (e.g. gene loci) on the genome must be in gtf file format. If the input file is in gff format, it can be converted to gtf with the gffread
utility from the cufflinks suite
gffread -E -T -o genes.gtf genes.gff
By convention the content of the feature files in the reference folders is:
Every file <name>.gtf
must be accompanied by two annotation files <name>_annotation_byGene.txt
and <name>_annotation_byTranscript.txt
This annotation file must have rownames in the first column of the two files: gene_id
and transcript_id
, respectively.
These ids should match the corresponding ids in the gtf file.
Currently these annotation files contain the following columns:
The buildRefDir function does:
samtools faidx genome.fa
java -Djava.io.tmpdir=. -jar picard.jar CreateSequenceDictionary R=genome.fa O=genome.dict
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.