NxtIRFcore: Core Engine for NxtIRF: a User-Friendly Intron Retention and Alternative Splicing Analysis using the IRFinder Engine

BuildReference

R Documentation

Builds reference files used by IRFinder / NxtIRF.

Description

These function builds the reference required by the IRFinder engine, as well as alternative splicing annotation data for NxtIRF. See examples below for guides to making the NxtIRF reference.

Usage

GetReferenceResource(
  reference_path = "./Reference",
  fasta = "",
  gtf = "",
  overwrite = FALSE,
  force_download = FALSE
)

BuildReference(
  reference_path = "./Reference",
  fasta = "",
  gtf = "",
  overwrite = FALSE,
  force_download = FALSE,
  chromosome_aliases = NULL,
  genome_type = "",
  nonPolyARef = "",
  MappabilityRef = "",
  BlacklistRef = "",
  UseExtendedTranscripts = TRUE
)

GetNonPolyARef(genome_type)

BuildReference_Full(
  reference_path,
  fasta,
  gtf,
  chromosome_aliases = NULL,
  overwrite = FALSE,
  force_download = FALSE,
  genome_type = genome_type,
  use_STAR_mappability = FALSE,
  nonPolyARef = GetNonPolyARef(genome_type),
  BlacklistRef = "",
  UseExtendedTranscripts = TRUE,
  n_threads = 4
)

Arguments

`reference_path`	(REQUIRED) The directory path to store the generated reference files
`fasta`	The file path or web link to the user-supplied genome FASTA file. Alternatively, the name of the AnnotationHub record containing the genome resource. May be omitted if `GetReferenceResource()` has already been run using the same `reference_path`.
`gtf`	The file path or web link to the user-supplied transcript GTF file (or gzipped GTF file). Alternatively, the name of the AnnotationHub record containing the transcript GTF file. May be omitted if `GetReferenceResource()` has already been run using the same `reference_path`.
`overwrite`	(default `FALSE`) For `GetReferenceResource()`: if the genome FASTA and gene annotation GTF files already exist in the `resource` subdirectory, it will not be overwritten. For `BuildReference()` and `BuildReference_Full()`: the NxtIRF reference will not be overwritten if one already exist. A reference is considered to exist if the file `IRFinder.ref.gz` is present inside `reference_path`.
`force_download`	(default `FALSE`) When online resources are retrieved, a local copy is stored in the `NxtIRFcore` BiocFileCache. Subsequent calls to the web resource will fetch the local copy. Set `force_download` to `TRUE` will force the resource to be downloaded from the web. Set this to `TRUE` only if the web resource has been updated since the last retrieval.
`chromosome_aliases`	(Highly optional) A 2-column data frame containing chromosome name conversions. If this is set, allows IRFinder to parse BAM files where alignments are made to a genome whose chromosomes are named differently to the reference genome. The most common scenario is where Ensembl genome typically use chromosomes "1", "2", ..., "X", "Y", whereas UCSC/Gencode genome use "chr1", "chr2", ..., "chrX", "chrY". See example below. Refer to https://github.com/dpryan79/ChromosomeMappings for a list of chromosome alias resources.
`genome_type`	Allows `BuildReference()` to select default `nonPolyARef` and `MappabilityRef` for selected genomes. Allowed options are: `hg38`, `hg19`, `mm10`, and `mm9`.
`nonPolyARef`	(Optional) A BED file of regions defining known non-polyadenylated transcripts. This file is used for QC analysis of IRFinder-processed files to measure Poly-A enrichment quality of samples. If omitted, and `genome_type` is defined, the default for the specified genome will be used.
`MappabilityRef`	(Optional) A BED file of low mappability regions due to repeat elements in the genome. If omitted, the file generated by `Mappability_CalculateExclusions()` will be used where available, and if this is not, the default file for the specified `genome_type` will be used. If `genome_type` is not specified, `MappabilityRef` is not used. See details.
`BlacklistRef`	A BED file of regions to be otherwise excluded from IR analysis. If omitted, a blacklist is not used (this is the default).
`UseExtendedTranscripts`	(default `TRUE`) Should non-protein-coding transcripts such as anti-sense and lincRNA transcripts be included in searching for IR / AS events? Setting `FALSE` (vanilla IRFinder) will exclude transcripts other than `protein_coding` and `processed_transcript` transcripts from IR analysis.
`use_STAR_mappability`	(default FALSE) In `BuildReference_Full()`, whether to run STAR_Mappability to calculate low-mappability regions. We recommend setting this to `FALSE` for the common genomes (human and mouse), and to `TRUE` for genomes not supported by `genome_type`. When set to false, the MappabilityExclusion default file corresponding to `genome_type` will automatically be used.
`n_threads`	The number of threads used to generate the STAR reference and mappability calculations. Multi-threading is not used for NxtIRF reference generation (but multiple cores are utilised in data-table and fst file processing automatically, where available). See STAR-methods

Details

GetReferenceResource() processes the files, downloads resources from web links or from AnnotationHub(), and saves a local copy in the "resource" subdirectory within the given reference_path. Resources are retrieved via either:

User-supplied FASTA and GTF file. This can be a file path, or a web link (e.g. 'http://', 'https://' or 'ftp://'). Use fasta and gtf to specify the files or web paths to use.
AnnotationHub genome and gene annotation (Ensembl): supply the names of the genome sequence and gene annotations to fasta and gtf.

BuildReference() will first run GetReferenceResource() if resources are not yet saved locally (i.e. GetReferenceResource() is not already run). Then, it creates the NxtIRF / IRFinder references. Typical run-times are 5 to 10 minutes for human and mouse genomes (after resources are downloaded).

NB: the parameters fasta and gtf can be omitted in BuildReference() if GetReferenceResource() is already run.

Typical usage involves running BuildReference() for human and mouse genomes and specifying the genome_type to use the default MappabilityRef and nonPolyARef files for the specified genome. For non-human non-mouse genomes, use one of the following alternatives:

Create the NxtIRF reference without using Mappability Exclusion regions. To do this, simply run BuildReference() and omit MappabilityRef. This is acceptable assuming the introns assessed are short and do not contain intronic repeats
Calculating Mappability Exclusion regions using the STAR aligner, and building the NxtIRF reference. This can be done using the BuildReference_Full() function, on systems where STAR is installed
Instead of using the STAR aligner, any genome splice-aware aligner could be used. See Mappability-methods for details. After producing the MappabilityExclusion.bed.gz file (in the Mappability subfolder), run BuildReference() using this file (or simply leave it blank).

BED files are tab-separated text files containing 3 unnamed columns specifying chromosome, start and end coordinates. To view an example BED file, open the file specified in the path returned by GetNonPolyARef("hg38")

See examples below for common use cases.

Value

For GetReferenceResource: creates the following local resources:

reference_path/resource/genome.2bit: Local copy of the genome sequences as a TwoBitFile.
reference_path/resource/transcripts.gtf.gz: Local copy of the gene annotation as a gzip-compressed file. For BuildReference and BuildReference_Full: creates a NxtIRF reference which is written to the given directory specified by reference_path. Files created includes:
reference_path/settings.Rds: An RDS file containing parameters used to generate the NxtIRF reference
reference_path/IRFinder.ref.gz: A gzipped text file containing collated IRFinder reference files. This file is used by IRFinder
reference_path/fst/: Contains fst files for subsequent easy access to NxtIRF generated references
reference_path/cov_data.Rds: An RDS file containing data required to visualise genome / transcript tracks.

BuildReference_Full also creates a STAR reference located in the STAR subdirectory inside the designated reference_path

For GetNonPolyARef: Returns the file path to the BED file for the nonPolyA loci for the specified genome.

Functions

GetReferenceResource: Processes / downloads a copy of the genome and gene annotations and stores this in the "resource" subdirectory of the given reference path
BuildReference: First calls GetReferenceResource() (if required). Afterwards creates the NxtIRF reference in the given reference path
GetNonPolyARef: Returns the path to the BED file containing coordinates of known non-polyadenylated transcripts for genomes hg38, hg19, mm10 and mm9,
BuildReference_Full: One-step function that fetches resources, creates a STAR reference (including mappability calculations), then creates the NxtIRF reference

Examples

# Quick runnable example: generate a reference using NxtIRF's example genome

example_ref <- file.path(tempdir(), "Reference")
GetReferenceResource(
    reference_path = example_ref,
    fasta = chrZ_genome(),
    gtf = chrZ_gtf()
)
BuildReference(
    reference_path = example_ref
)

# NB: the above is equivalent to:

example_ref <- file.path(tempdir(), "Reference")
BuildReference(
    reference_path = example_ref,
    fasta = chrZ_genome(),
    gtf = chrZ_gtf()
)

# Get the path to the Non-PolyA BED file for hg19

GetNonPolyARef("hg19")
## Not run: 

### Long examples ###

# Generate a NxtIRF reference from user supplied FASTA and GTF files for a
# hg38-based genome:

BuildReference(
    reference_path = "./Reference_user",
    fasta = "genome.fa", gtf = "transcripts.gtf",
    genome_type = "hg38"
)

# NB: Setting `genome_type = hg38`, will automatically use default
# nonPolyARef and MappabilityRef for `hg38`

# Reference generation from Ensembl's FTP links:

FTP <- "ftp://ftp.ensembl.org/pub/release-94/"
BuildReference(
    reference_path = "./Reference_FTP",
    fasta = paste0(FTP, "fasta/homo_sapiens/dna/",
        "Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"),
    gtf = paste0(FTP, "gtf/homo_sapiens/",
        "Homo_sapiens.GRCh38.94.chr.gtf.gz"),
    genome_type = "hg38"
)

# Get AnnotationHub record names for Ensembl release-94:

# First, search for the relevant AnnotationHub record names:

ah <- AnnotationHub::AnnotationHub()
AnnotationHub::query(ah, c("Homo Sapiens", "release-94"))
# snapshotDate(): 2021-09-23
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: TwoBitFile, GRanges
# additional mcols(): taxonomyid, genome, description, coordinate_1_based,
#   maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH64628"]]'
#
#             title
#   AH64628 | Homo_sapiens.GRCh38.94.abinitio.gtf
#   AH64629 | Homo_sapiens.GRCh38.94.chr.gtf
#   AH64630 | Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf
#   AH64631 | Homo_sapiens.GRCh38.94.gtf
#   AH65744 | Homo_sapiens.GRCh38.cdna.all.2bit
#   AH65745 | Homo_sapiens.GRCh38.dna.primary_assembly.2bit
#   AH65746 | Homo_sapiens.GRCh38.dna_rm.primary_assembly.2bit
#   AH65747 | Homo_sapiens.GRCh38.dna_sm.primary_assembly.2bit
#   AH65748 | Homo_sapiens.GRCh38.ncrna.2bit

BuildReference(
    reference_path = "./Reference_AH",
    fasta = "AH65745",
    gtf = "AH64631",
    genome_type = "hg38"
)

# Build a NxtIRF reference, setting chromosome aliases to allow
# this reference to process BAM files aligned to UCSC-style genomes:

chrom.df <- GenomeInfoDb::genomeStyles()$Homo_sapiens

BuildReference(
    reference_path = "./Reference_UCSC",
    fasta = "AH65745",
    gtf = "AH64631",
    genome_type = "hg38",
    chromosome_aliases = chrom.df[, c("Ensembl", "UCSC")]
)

# One-step generation of NxtIRF and STAR references, using 4 threads.
# NB1: requires a linux-based system with STAR installed.
# NB2: A STAR reference genome will be generated in the `STAR` subfolder
#      inside the given `reference_path`.
# NB3: A custom Mappability Exclusion file will be calculated using STAR
#      and will be used to generate the NxtIRF reference.

BuildReference_Full(
    reference_path = "./Reference_with_STAR",
    fasta = "genome.fa", gtf = "transcripts.gtf",
    genome_type = "",
    use_STAR_mappability = TRUE,
    n_threads = 4
)

# NB: the above is equivalent to running the following in sequence:

GetReferenceResource(
    reference_path = "./Reference_with_STAR",
    fasta = "genome.fa", gtf = "transcripts.gtf"
)
STAR_buildRef(
    reference_path = reference_path,
    also_generate_mappability = TRUE,
    n_threads = 4
)
BuildReference(
    reference_path = "./Reference_with_STAR",
    genome_type = ""
)

## End(Not run)

alexchwong/NxtIRFcore documentation built on Oct. 31, 2022, 9:14 a.m.