View source: R/genome_download_helper.R
get_phix_genome | R Documentation |
This function automatically downloads the genomes and contaminants
specified for genome alignment, unless the files already exist.
By default it uses the Ensembl reference.
Upon completion, the function stores
a file called file.path(output.dir, "outputs.rds")
containing
the output paths of your completed genome/annotation downloads.
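As a minimal sketch of how that stored file could be read back in a later session: this assumes the file holds the named character vector described under Value. The directory and file names below are hypothetical stand-ins, written out by hand only so the snippet runs without a download.

```r
# Hypothetical stand-in for a finished download; in real use the
# function itself writes outputs.rds into output.dir.
output.dir <- file.path(tempdir(), "reference_example")
dir.create(output.dir, showWarnings = FALSE, recursive = TRUE)
mock_paths <- c(gtf = file.path(output.dir, "annotation.gtf"),
                genome = file.path(output.dir, "genome.fasta"))
saveRDS(mock_paths, file.path(output.dir, "outputs.rds"))

# Later session: recover the recorded paths without downloading again
paths <- readRDS(file.path(output.dir, "outputs.rds"))
print(paths)
```

Reading the file back this way avoids re-running the download just to find out where the files were stored.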
For most non-model, non-vertebrate organisms, you need
my fork of biomartr for it to work:
remotes::install_github("Roleren/biomartr")
If you misspelled something or the run crashed, delete the wrong files and
run again.
Set remake = TRUE to do it all over again.
get_phix_genome(phix, output.dir, gunzip)
phix |
logical, default FALSE. Download the phiX sequence to filter
out Illumina control reads. ORFik defines phiX as a contaminant genome.
PhiX is used in Illumina sequencers for sequencing quality control.
Genome is: refseq, Escherichia phage phiX174.
If sequencing facility created fastq files with the command |
output.dir |
directory to save downloaded data |
gunzip |
logical, default TRUE. Uncompress downloaded files that arrive compressed; should be TRUE! |
Some files that are made after download:
- A fasta index for the genome
- A TxDb to speed up GTF/GFF reading
- Separate or merged contaminant files
Files that can be made:
- Gene symbols (hgnc, etc)
- Uniprot ids (For name of protein structures)
If you want to use a custom genome or gtf from your hard drive, assign the existing
paths like this:
annotation <- getGenomeAndAnnotation(GTF = "path/to/gtf.gtf",
genome = "path/to/genome.fasta")
A named character vector of paths to the downloaded genomes and gtf, and additional contaminants if used. If merge_contaminants is TRUE, the individual contaminant fasta files are not returned, only the merged one.
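A small sketch of working with such a return value, using a hand-made vector rather than a real download. The element names "gtf" and "genome" follow the examples on this page; the paths are hypothetical placeholders.

```r
# Hand-made stand-in for the returned vector (paths are placeholders)
annotation <- c(gtf = "path/to/gtf.gtf", genome = "path/to/genome.fasta")

# Pull out single paths by name; unname() drops the name attribute
gtf_path <- unname(annotation["gtf"])
genome_path <- unname(annotation["genome"])

# Report any recorded files that are not present on disk
missing <- names(annotation)[!file.exists(annotation)]
if (length(missing) > 0) {
  message("Missing files: ", paste(missing, collapse = ", "))
}
```

Checking for missing files before alignment is a design choice of this sketch, not something the function is documented to do.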
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4919035/
Other STAR: STAR.align.folder(), STAR.align.single(), STAR.allsteps.multiQC(), STAR.index(), STAR.install(), STAR.multiQC(), STAR.remove.crashed.genome(), install.fastp()
## Get Saccharomyces cerevisiae genome and gtf (create txdb for R)
#getGenomeAndAnnotation("Saccharomyces cerevisiae", tempdir(), assembly_type = "toplevel")
## Download and add pseudo 5' UTRs
#getGenomeAndAnnotation("Saccharomyces cerevisiae", tempdir(), assembly_type = "toplevel",
# pseudo_5UTRS_if_needed = 100)
## Get Danio rerio genome and gtf (create txdb for R)
#getGenomeAndAnnotation("Danio rerio", tempdir())
output.dir <- "/Bio_data/references/zebrafish"
## Get Danio rerio and phiX contaminants to deplete during alignment
#getGenomeAndAnnotation("Danio rerio", output.dir, phix = TRUE)
## Optimize for ORFik (speed up for large annotations like human or zebrafish)
#getGenomeAndAnnotation("Danio rerio", tempdir(), optimize = TRUE)
# Drosophila melanogaster (toplevel exists only)
#getGenomeAndAnnotation("drosophila melanogaster", output.dir = file.path(config["ref"],
# "Drosophila_melanogaster_BDGP6"), assembly_type = "toplevel")
## How to save malformed refseq gffs:
## First run function and let it crash:
#annotation <- getGenomeAndAnnotation(organism = "Arabidopsis thaliana",
# output.dir = "~/Desktop/test_plant/",
# assembly_type = "primary_assembly", db = "refseq")
## Then apply a fix (example for linux, too long rows):
# fixed_gff <- fix_malformed_gff("~/Desktop/test_plant/Arabidopsis_thaliana_genomic_refseq.gff")
## Then update the arguments:
# annotation <- c(fixed_gff, "~/Desktop/test_plant/Arabidopsis_thaliana_genomic_refseq.fna")
# names(annotation) <- c("gtf", "genome")
# Then make the txdb (for faster R use)
# makeTxdbFromGenome(annotation["gtf"], annotation["genome"], organism = "Arabidopsis thaliana")