Build-Reference-methods | R Documentation |
These functions build the reference required by the SpliceWiz engine, as well as alternative splicing annotation data for SpliceWiz. See the examples below for guides to making the SpliceWiz reference.
getResources(
  reference_path = "./Reference",
  fasta = "",
  gtf = "",
  overwrite = FALSE,
  force_download = FALSE,
  verbose = TRUE
)

buildRef(
  reference_path = "./Reference",
  fasta = "",
  gtf = "",
  overwrite = FALSE,
  force_download = FALSE,
  chromosome_aliases = NULL,
  genome_type = "",
  nonPolyARef = "",
  MappabilityRef = "",
  BlacklistRef = "",
  ontologySpecies = "",
  useExtendedTranscripts = TRUE,
  lowMemoryMode = TRUE,
  verbose = TRUE
)

buildFullRef(
  reference_path = "./Reference",
  fasta = "",
  gtf = "",
  use_STAR_mappability = FALSE,
  overwrite = FALSE,
  force_download = FALSE,
  chromosome_aliases = NULL,
  genome_type = "",
  nonPolyARef = "",
  MappabilityRef = "",
  BlacklistRef = "",
  ontologySpecies = "",
  useExtendedTranscripts = TRUE,
  verbose = TRUE,
  n_threads = 4,
  ...
)

getNonPolyARef(genome_type)

getAvailableGO(localHub = FALSE, ah = AnnotationHub(localHub = localHub))
reference_path |
(REQUIRED) The directory path to store the generated reference files |
fasta |
The file path or web link to the user-supplied genome
FASTA file. Alternatively, the name of the AnnotationHub record containing
the genome resource. May be omitted if |
gtf |
The file path or web link to the user-supplied transcript
GTF file (or gzipped GTF file). Alternatively, the name of the
AnnotationHub record containing the transcript GTF file. May be omitted if
|
overwrite |
(default |
force_download |
(default |
verbose |
(default |
chromosome_aliases |
(Highly optional) A 2-column data frame containing chromosome name conversions. If this is set, processBAM can parse BAM alignments to a genome whose chromosomes are named differently to the reference genome. The most common scenario is that Ensembl genomes typically use chromosomes "1", "2", ..., "X", "Y", whereas UCSC/Gencode genomes use "chr1", "chr2", ..., "chrX", "chrY". See example below. Refer to https://github.com/dpryan79/ChromosomeMappings for a list of chromosome alias resources. |
genome_type |
Allows |
nonPolyARef |
(Optional) A BED file of regions defining known
non-polyadenylated transcripts. This file is used for QC analysis
to measure Poly-A enrichment quality of samples. An RDS file (openable
using |
MappabilityRef |
(Optional) A BED file of low mappability regions due to
repeat elements in the genome. If omitted, the file generated by
|
BlacklistRef |
A BED file of regions to be otherwise excluded from IR
analysis. If omitted, a blacklist is not used (this is the default).
An RDS file (openable using |
ontologySpecies |
(default |
useExtendedTranscripts |
(default |
lowMemoryMode |
(default |
use_STAR_mappability |
(default FALSE) In |
n_threads |
The number of threads used to generate the STAR reference and mappability calculations. Multi-threading is not used for SpliceWiz reference generation (but multiple cores are utilised in data-table and fst file processing automatically, where available). See STAR-methods |
... |
For |
localHub |
(default |
ah |
For |
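As a minimal sketch of the chromosome_aliases argument described above, a 2-column data frame can be assembled by hand in base R. The column names, the column order (reference-style names first, following the Ensembl/UCSC example in this page) and the "MT" to "chrM" pairing are assumptions made for illustration only.

```r
# Ensembl-style chromosome names (assumed naming of the reference genome)
ensembl_names <- c(as.character(1:22), "X", "Y", "MT")

# UCSC-style equivalents: prefix with "chr"; the mitochondrial
# chromosome is named "chrM" in UCSC-style genomes
ucsc_names <- paste0("chr", ensembl_names)
ucsc_names[ensembl_names == "MT"] <- "chrM"

# 2-column data frame of chromosome name conversions
chrom_aliases <- data.frame(
    Ensembl = ensembl_names,
    UCSC = ucsc_names,
    stringsAsFactors = FALSE
)
head(chrom_aliases, 3)
```

A data frame like this could then be supplied as chromosome_aliases = chrom_aliases when calling buildRef().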
getResources() processes the files, downloads resources from web links or from AnnotationHub(), and saves a local copy in the "resource" subdirectory within the given reference_path. Resources are retrieved via either:
User-supplied FASTA and GTF file. This can be a file path, or a web link (e.g. 'http://', 'https://' or 'ftp://'). Use fasta and gtf to specify the files or web paths to use.
AnnotationHub genome and gene annotation (Ensembl): supply the names of the genome sequence and gene annotations to fasta and gtf.
buildRef() will first run getResources() if resources are not yet saved locally (i.e. getResources() has not already been run). Then, it creates the SpliceWiz references. Typical run-times are 5 to 10 minutes for human and mouse genomes (after resources are downloaded). NB: the parameters fasta and gtf can be omitted in buildRef() if getResources() has already been run.
buildFullRef() builds the STAR aligner reference alongside the SpliceWiz reference. The STAR reference will be located in the STAR subdirectory of the specified reference path. If use_STAR_mappability is set to TRUE, this function will empirically compute regions of low mappability. This function requires STAR to be installed on the system (STAR only runs on Linux-based systems).
getNonPolyARef() returns the path of the non-polyA reference file for the human and mouse genomes.
Typical usage involves running buildRef() for human and mouse genomes and specifying the genome_type to use the default MappabilityRef and nonPolyARef files for the specified genome. For non-human non-mouse genomes, use one of the following alternatives:
Create the SpliceWiz reference without using Mappability Exclusion regions. To do this, simply run buildRef() and omit MappabilityRef. This is acceptable assuming the introns assessed are short and do not contain intronic repeats.
Calculate Mappability Exclusion regions using the STAR aligner, and build the SpliceWiz reference. This can be done using the buildFullRef() function, on systems where STAR is installed.
Instead of using the STAR aligner, any splice-aware genome aligner could be used. See Mappability-methods for an example workflow using the Rsubread aligner. After producing the MappabilityExclusion.bed.gz file (in the Mappability subfolder), run buildRef() using this file as MappabilityRef (or simply leave MappabilityRef blank).
BED files are tab-separated text files containing 3 unnamed columns specifying chromosome, start and end coordinates. To view an example BED file, open the file specified in the path returned by getNonPolyARef("hg38").
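To illustrate the BED layout described above, the following base R sketch writes a 3-column, tab-separated file without a header and reads it back. The regions, file name and column labels are hypothetical and serve only to show the format.

```r
# Hypothetical regions, for illustration only
regions <- data.frame(
    chrom = c("chr1", "chr1", "chr2"),
    start = c(1000L, 5000L, 200L),
    end   = c(2000L, 5500L, 900L),
    stringsAsFactors = FALSE
)

# Write as tab-separated text with no header and no row names
bed_file <- file.path(tempdir(), "example_regions.bed")
write.table(
    regions, bed_file,
    sep = "\t", quote = FALSE,
    row.names = FALSE, col.names = FALSE
)

# Read the file back; column names are assigned here only for
# convenience, as the file itself has none
bed_in <- read.table(
    bed_file, sep = "\t", header = FALSE,
    col.names = c("chrom", "start", "end"),
    stringsAsFactors = FALSE
)
bed_in
```

A file written this way has the same column layout as the BED inputs accepted by nonPolyARef, MappabilityRef and BlacklistRef.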
If MappabilityRef, nonPolyARef and BlacklistRef are left blank, the following will be used (by priority):
The previously used Mappability, non-polyA and/or Blacklist file resource from a previous run, if available,
The resource implied by the genome_type parameter, if specified,
No resource is used.
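The priority order above can be sketched as a small helper. This is not SpliceWiz's internal code: resolve_resource() and its argument names are hypothetical, and the sketch only mirrors the documented order (user-supplied file, then previously used resource, then genome_type default, then nothing).

```r
# Hypothetical helper mirroring the documented priority order.
# Each argument is a file path, or "" when that source is unavailable.
resolve_resource <- function(user_supplied = "",
                             previous = "",
                             genome_default = "") {
    candidates <- c(user_supplied, previous, genome_default)
    chosen <- candidates[nzchar(candidates)]
    if (length(chosen) == 0) {
        return("")  # no resource is used
    }
    chosen[1]       # first non-blank candidate wins
}

# Blank user input: the previously used resource takes priority
resolve_resource("", "previous_run.bed", "hg38_default.bed")

# No previous run: fall back to the genome_type default
resolve_resource("", "", "hg38_default.bed")
```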
To rebuild a SpliceWiz reference using existing resources:
This is typically run when updating an old resource to a new SpliceWiz version. Simply run buildRef(), specifying the existing reference directory, leave the fasta and gtf parameters blank, and set overwrite = TRUE. SpliceWiz will use the previously-used resources to re-create the reference.
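The rebuild step just described might be wrapped as follows. The wrapper name rebuild_reference() and its pre-flight checks are assumptions added for illustration; only the buildRef() call with reference_path and overwrite = TRUE comes from this page.

```r
# Hypothetical convenience wrapper around buildRef() for rebuilding
# an existing reference from its previously downloaded resources
rebuild_reference <- function(reference_path) {
    # The "resource" subdirectory is where getResources() stores
    # the local genome and annotation copies
    resource_dir <- file.path(reference_path, "resource")
    if (!dir.exists(resource_dir)) {
        stop("No 'resource' subdirectory found in: ", reference_path)
    }
    if (!requireNamespace("SpliceWiz", quietly = TRUE)) {
        stop("The SpliceWiz package is required to rebuild the reference")
    }
    # fasta and gtf are left blank so existing resources are reused
    SpliceWiz::buildRef(
        reference_path = reference_path,
        overwrite = TRUE
    )
}
```

Calling rebuild_reference("./Reference") would then re-create the reference in place, provided that directory was produced by an earlier getResources() or buildRef() run.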
See examples below for common use cases.
For getResources(): creates the following local resources:
reference_path/resource/genome.2bit: Local copy of the genome sequences as a TwoBitFile.
reference_path/resource/transcripts.gtf.gz: Local copy of the gene annotation as a gzip-compressed file.
For buildRef() and buildFullRef(): creates a SpliceWiz reference which is written to the directory specified by reference_path. Files created include:
reference_path/settings.Rds: An RDS file containing parameters used to generate the SpliceWiz reference.
reference_path/SpliceWiz.ref.gz: A gzipped text file containing collated SpliceWiz reference files. This file is used by processBAM.
reference_path/fst/: Contains fst files for subsequent easy access to SpliceWiz generated references.
reference_path/cov_data.Rds: An RDS file containing data required to visualise genome / transcript tracks.
buildFullRef() also creates a STAR reference located in the STAR subdirectory inside the designated reference_path.
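One way to confirm that a reference directory contains the outputs listed above is a small base R check. The helper check_reference_files() is hypothetical and not part of SpliceWiz; it simply tests for the paths named in this section.

```r
# Hypothetical helper: report which documented outputs are present
check_reference_files <- function(reference_path) {
    expected <- c(
        "resource/genome.2bit",
        "resource/transcripts.gtf.gz",
        "settings.Rds",
        "SpliceWiz.ref.gz",
        "fst",
        "cov_data.Rds"
    )
    paths <- file.path(reference_path, expected)
    data.frame(
        file = expected,
        exists = file.exists(paths),  # TRUE for files and directories
        stringsAsFactors = FALSE
    )
}

# Demonstration on a mock directory containing only settings.Rds
demo_ref <- file.path(tempdir(), "demo_reference")
dir.create(demo_ref, showWarnings = FALSE)
saveRDS(list(), file.path(demo_ref, "settings.Rds"))
status <- check_reference_files(demo_ref)
status
```

Any row with exists equal to FALSE points to an output that has not (yet) been generated in that directory.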
For getNonPolyARef(): Returns the file path to the BED file for the nonPolyA loci for the specified genome.
For getAvailableGO(): Returns a vector containing names of species with supported gene ontology annotations.
getResources(): Processes / downloads a copy of the genome and gene annotations and stores this in the "resource" subdirectory of the given reference path.
buildRef(): First calls getResources() (if required). Afterwards creates the SpliceWiz reference in the given reference path.
buildFullRef(): One-step function that fetches resources, creates a STAR reference (including mappability calculations), then creates the SpliceWiz reference.
getNonPolyARef(): Returns the path to the BED file containing coordinates of known non-polyadenylated transcripts for genomes hg38, hg19, mm10 and mm9.
getAvailableGO(): Returns available species on Bioconductor's AnnotationHub. Currently, only Bioconductor's OrgDb/Ensembl gene ontology annotations are supported.
Mappability-methods for methods to calculate low mappability regions
STAR-methods for a list of STAR wrapper functions
AnnotationHub
https://github.com/alexchwong/SpliceWizResources for RDS files of Mappability Exclusion GRanges objects (for hg38, hg19, mm10 and mm9) that can be used as input files for MappabilityRef in buildRef(). These resources are intended for SpliceWiz users on older Bioconductor versions (3.13 or earlier).
# Quick runnable example: generate a reference using SpliceWiz's example genome

example_ref <- file.path(tempdir(), "Reference")
getResources(
    reference_path = example_ref,
    fasta = chrZ_genome(),
    gtf = chrZ_gtf()
)
buildRef(
    reference_path = example_ref
)

# NB: the above is equivalent to:

example_ref <- file.path(tempdir(), "Reference")
buildRef(
    reference_path = example_ref,
    fasta = chrZ_genome(),
    gtf = chrZ_gtf()
)

# Get the path to the Non-PolyA BED file for hg19

getNonPolyARef("hg19")

# View available species for AnnotationHub's Ensembl/orgDB-based GO resources

availSpecies <- getAvailableGO()

# Build example reference with `Homo sapiens` Ens/orgDB gene ontology

ont_ref <- file.path(tempdir(), "Reference_withGO")
buildRef(
    reference_path = ont_ref,
    fasta = chrZ_genome(),
    gtf = chrZ_gtf(),
    ontologySpecies = "Homo sapiens"
)
## Not run:
### Long examples ###
# Generate a SpliceWiz reference from user supplied FASTA and GTF files for a
# hg38-based genome:

buildRef(
    reference_path = "./Reference_user",
    fasta = "genome.fa", gtf = "transcripts.gtf",
    genome_type = "hg38"
)

# NB: Setting `genome_type = "hg38"` will automatically use the default
# nonPolyARef and MappabilityRef for `hg38`
# Reference generation from Ensembl's FTP links:

FTP <- "ftp://ftp.ensembl.org/pub/release-94/"
buildRef(
    reference_path = "./Reference_FTP",
    fasta = paste0(FTP, "fasta/homo_sapiens/dna/",
        "Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"),
    gtf = paste0(FTP, "gtf/homo_sapiens/",
        "Homo_sapiens.GRCh38.94.chr.gtf.gz"),
    genome_type = "hg38"
)

# Get AnnotationHub record names for Ensembl release-94:

# First, search for the relevant AnnotationHub record names:

ah <- AnnotationHub::AnnotationHub()
AnnotationHub::query(ah, c("Homo Sapiens", "release-94"))

buildRef(
    reference_path = "./Reference_AH",
    fasta = "AH65745",
    gtf = "AH64631",
    genome_type = "hg38"
)

# Build a SpliceWiz reference, setting chromosome aliases to allow
# this reference to process BAM files aligned to UCSC-style genomes:

chrom.df <- GenomeInfoDb::genomeStyles()$Homo_sapiens

buildRef(
    reference_path = "./Reference_UCSC",
    fasta = "AH65745",
    gtf = "AH64631",
    genome_type = "hg38",
    chromosome_aliases = chrom.df[, c("Ensembl", "UCSC")]
)

# One-step generation of SpliceWiz and STAR references, using 4 threads.
# NB1: requires a linux-based system with STAR installed.
# NB2: A STAR reference genome will be generated in the `STAR` subfolder
#      inside the given `reference_path`.
# NB3: A custom Mappability Exclusion file will be calculated using STAR
#      and will be used to generate the SpliceWiz reference.

buildFullRef(
    reference_path = "./Reference_with_STAR",
    fasta = "genome.fa", gtf = "transcripts.gtf",
    genome_type = "hg38",
    use_STAR_mappability = TRUE,
    n_threads = 4
)
# NB: the above is equivalent to running the following in sequence:

getResources(
    reference_path = "./Reference_with_STAR",
    fasta = "genome.fa", gtf = "transcripts.gtf"
)
# Use the same reference directory as in the getResources() call above
STAR_buildRef(
    reference_path = "./Reference_with_STAR",
    also_generate_mappability = TRUE,
    n_threads = 4
)
buildRef(
    reference_path = "./Reference_with_STAR",
    genome_type = ""
)
## End(Not run)