STAR-methods: STAR wrappers for building reference for STAR, and aligning...

STAR-methodsR Documentation

STAR wrappers for building reference for STAR, and aligning RNA-sequencing

Description

These STAR helper / wrapper functions allow users to (1) create a STAR genome reference (with or without GTF), (2) align one or more RNA-seq samples, and (3) calculate regions of low mappability. STAR references can be created using one-step (genome and GTF), or two-step (genome first, then on-the-fly with injected GTF) approaches.

Usage

STAR_version()

STAR_buildRef(
  reference_path,
  STAR_ref_path = file.path(reference_path, "STAR"),
  n_threads = 4,
  overwrite = FALSE,
  sjdbOverhang = 100,
  sparsity = 1,
  also_generate_mappability = FALSE,
  map_depth_threshold = 4,
  additional_args = NULL,
  ...
)

STAR_alignExperiment(
  Experiment,
  STAR_ref_path,
  BAM_output_path,
  n_threads = 4,
  overwrite = FALSE,
  two_pass = FALSE,
  trim_adaptor = "AGATCGGAAG",
  additional_args = NULL
)

STAR_alignReads(
  fastq_1 = c("./sample_1.fastq"),
  fastq_2 = NULL,
  STAR_ref_path,
  BAM_output_path,
  n_threads = 4,
  overwrite = FALSE,
  two_pass = FALSE,
  trim_adaptor = "AGATCGGAAG",
  memory_mode = "NoSharedMemory",
  additional_args = NULL
)

STAR_buildGenome(
  reference_path,
  STAR_ref_path = file.path(reference_path, "STAR"),
  n_threads = 4,
  overwrite = FALSE,
  sparsity = 1,
  also_generate_mappability = FALSE,
  map_depth_threshold = 4,
  additional_args = NULL,
  ...
)

STAR_loadGenomeGTF(
  reference_path,
  STAR_ref_path,
  STARgenome_output = file.path(tempdir(), "STAR"),
  n_threads = 4,
  overwrite = FALSE,
  sjdbOverhang = 100,
  extraFASTA = "",
  additional_args = NULL
)

STAR_mappability(
  reference_path,
  STAR_ref_path = file.path(reference_path, "STAR"),
  map_depth_threshold = 4,
  n_threads = 4,
  ...
)

Arguments

reference_path

The path to the reference. getResources must first be run using this path as its reference_path

STAR_ref_path

(Default - the "STAR" subdirectory under reference_path) The directory containing the STAR reference to be used or to contain the newly-generated STAR reference

n_threads

The number of threads to run the STAR aligner.

overwrite

(default FALSE) For STAR_buildRef, STAR_buildGenome and STAR_loadGenomeGTF - if STAR genome already exists, should it be overwritten. For STAR_alignExperiment and STAR_alignReads - if BAM file already exists, should it be overwritten.

sjdbOverhang

(Default = 100) A STAR setting indicating the length of the donor / acceptor sequence on each side of the junctions. Ideally equal to (mate_length - 1). See the STAR aligner manual for details.

sparsity

(default 1) Sets STAR's --genomeSAsparseD option. For human (and mouse) genomes, set this to 2 to allow STAR to perform genome generation and mapping using < 16 Gb of RAM, albeit with slightly lower mapping rate (~ 0.1% lower, according to STAR's author). Setting this to higher values is experimental (and not tested)

also_generate_mappability

Whether STAR_buildRef() and STAR_buildGenome() also calculate Mappability Exclusion regions.

map_depth_threshold

(Default 4) The depth of mapped reads threshold at or below which Mappability exclusion regions are defined. See Mappability-methods. Ignored if also_generate_mappability = FALSE

additional_args

A character vector of additional arguments to be parsed into STAR. See examples below.

...

Additional arguments to be parsed into generateSyntheticReads(). See Mappability-methods.

Experiment

A two or three-column data frame with the columns denoting sample names, forward-FASTQ and reverse-FASTQ files. This can be conveniently generated using findFASTQ

BAM_output_path

The path under which STAR outputs the aligned BAM files. In STAR_alignExperiment(), STAR will output aligned BAMS inside subdirectories of this folder, named by sample names. In STAR_alignReads(), STAR will output directly into this path.

two_pass

Whether to use two-pass mapping. In STAR_alignExperiment(), STAR first-pass will align every sample to generate a list of splice junctions but not BAM files. The junctions are then given to STAR to generate a temporary genome containing information about novel junctions, thereby improving novel junction detection. In STAR_alignReads(), STAR will use ⁠--twopassMode Basic⁠

trim_adaptor

The sequence of the Illumina adaptor to trim via STAR's --clip3pAdapterSeq option

fastq_1, fastq_2

In STAR_alignReads: character vectors giving the path(s) of one or more FASTQ (or FASTA) files to be aligned. If single reads are to be aligned, omit fastq_2

memory_mode

The parameter to be parsed to --genomeLoad; either NoSharedMemory or LoadAndKeep are used.

STARgenome_output

The output path of the created on-the-fly genome

extraFASTA

(default "") One or more FASTA files containing spike-in genome sequences (e.g. ERCC, Sequins), as required.

Details

Pre-requisites

STAR_buildRef() and STAR_buildGenome() require prepared genome and gene annotation reference retrieved using getResources, which is run internally by buildRef

STAR_loadGenomeGTF() requires the above, and additionally a STAR genome created using STAR_buildGenome()

STAR_alignExperiment(), STAR_alignReads(), and STAR_mappability(): requires a STAR genome, which can be built using STAR_buildRef() or STAR_buildGenome() followed by STAR_loadGenomeGTF()

Function Description

For STAR_buildRef: this function will create a STAR genome reference using the same genome FASTA and gene annotation GTF used to create the SpliceWiz reference. Optionally, it will run STAR_mappability if also_generate_mappability is set to TRUE

For STAR_alignExperiment: aligns a set of FASTQ or paired FASTQ files using the given STAR genome using the STAR aligner. A data.frame specifying sample names and corresponding FASTQ files are required

For STAR_alignReads: aligns a single or pair of FASTQ files to the given STAR genome using the STAR aligner.

For STAR_buildGenome: Creates a STAR genome reference, using ONLY the FASTA file used to create the SpliceWiz reference. This allows users to create a single STAR reference for use with multiple transcriptome (GTF) references (on different occasions). Optionally, it will run STAR_mappability if also_generate_mappability is set to TRUE

For STAR_loadGenomeGTF: Creates an "on-the-fly" STAR genome, injecting GTF from the given SpliceWiz reference_path, setting sjdbOverhang setting, and (optionally) any spike-ins via the extraFASTA parameter. This allows users to create a single STAR reference for use with multiple transcriptome (GTF) references, with different sjdbOverhang settings, and/or spike-ins (on different occasions or for different projects).

For STAR_mappability: this function will first will run generateSyntheticReads, then use the given STAR genome to align the synthetic reads using STAR. The aligned BAM file will then be processed using calculateMappability to calculate the lowly-mappable genomic regions, producing the MappabilityExclusion.bed.gz output file.

Value

For STAR_version(): The STAR version

For STAR_buildRef(): None

For STAR_alignExperiment(): None

For STAR_alignReads(): None

For STAR_buildGenome(): None

For STAR_loadGenomeGTF(): The path of the on-the-fly STAR genome, typically in the subdirectory "_STARgenome" within the given STARgenome_output directory

For STAR_mappability(): None

Functions

  • STAR_version(): Checks whether STAR is installed, and its version

  • STAR_buildRef(): Creates a STAR genome reference, using both FASTA and GTF files used to create the SpliceWiz reference

  • STAR_alignExperiment(): Aligns multiple sets of FASTQ files, belonging to multiple samples

  • STAR_alignReads(): Aligns a single sample (with single or paired FASTQ or FASTA files)

  • STAR_buildGenome(): Creates a STAR genome reference, using ONLY the FASTA file used to create the SpliceWiz reference

  • STAR_loadGenomeGTF(): Creates an "on-the-fly" STAR genome, injecting GTF from the given SpliceWiz reference_path, setting sjdbOverhang setting, and (optionally) any spike-ins as extraFASTA

  • STAR_mappability(): Calculates lowly-mappable genomic regions using STAR

See Also

Build-Reference-methods findSamples Mappability-methods

The latest STAR documentation

Examples

# 0) Check that STAR is installed and compatible with SpliceWiz

STAR_version()
## Not run: 

# The below workflow illustrates
# 1) Getting the reference resource
# 2) Building the STAR Reference, including Mappability Exclusion calculation
# 3) Building the SpliceWiz Reference, using the Mappability Exclusion file
# 4) Aligning (a) one or (b) multiple raw sequencing samples.


# 1) Reference generation from Ensembl's FTP links

FTP <- "ftp://ftp.ensembl.org/pub/release-94/"

getResources(
    reference_path = "Reference_FTP",
    fasta = paste0(FTP, "fasta/homo_sapiens/dna/",
        "Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"),
    gtf = paste0(FTP, "gtf/homo_sapiens/",
        "Homo_sapiens.GRCh38.94.chr.gtf.gz")
)

# 2) Generates STAR genome within the SpliceWiz reference. Also generates
# mappability exclusion gzipped BED file inside the "Mappability/" sub-folder

STAR_buildRef(
    reference_path = "Reference_FTP",
    STAR_ref_path = file.path("Reference_FTP", "STAR"),
    n_threads = 8,
    also_generate_mappability = TRUE
)

# 2a) Generates STAR genome of the example SpliceWiz genome.
#     This demonstrates using custom STAR parameters, as the example 
#     SpliceWiz genome is ~100k in length, 
#     so --genomeSAindexNbases needs to be
#     adjusted to be min(14, log2(GenomeLength)/2 - 1)

getResources(
    reference_path = "Reference_chrZ",
    fasta = chrZ_genome(),
    gtf = chrZ_gtf()
)

STAR_buildRef(
    reference_path = "Reference_chrZ",
    STAR_ref_path = file.path("Reference_chrZ", "STAR"),
    n_threads = 8,
    additional_args = c("--genomeSAindexNbases", "7"),
    also_generate_mappability = TRUE
)

# 3) Build SpliceWiz reference using the newly-generated 
#    Mappability exclusions

#' NB: also specifies to use the hg38 nonPolyA resource

buildRef(reference_path = "Reference_FTP", genome_type = "hg38")

# 4a) Align a single sample using the STAR reference

STAR_alignReads(
    fastq_1 = "sample1_1.fastq", fastq_2 = "sample1_2.fastq",
    STAR_ref_path = file.path("Reference_FTP", "STAR"),
    BAM_output_path = "./bams/sample1",
    n_threads = 8
)

# 4b) Align multiple samples, using two-pass alignment

Experiment <- data.frame(
    sample = c("sample_A", "sample_B"),
    forward = file.path("raw_data", c("sample_A", "sample_B"),
        c("sample_A_1.fastq", "sample_B_1.fastq")),
    reverse = file.path("raw_data", c("sample_A", "sample_B"),
        c("sample_A_2.fastq", "sample_B_2.fastq"))
)

STAR_alignExperiment(
    Experiment = Experiment,
    STAR_ref_path = file.path("Reference_FTP", "STAR"),
    BAM_output_path = "./bams",
    n_threads = 8,
    two_pass = TRUE
)

# - Building a STAR genome (only) reference, and injecting GTF as a
#   subsequent step
#
#   This is useful for users who want to create a single STAR genome, for
#   experimentation with different GTF files.
#   It is important to note that the chromosome names of the genome (FASTA)
#   file and the GTF file needs to be identical. Thus, Ensembl and Gencode
#   GTF files should not be mixed (unless the chromosome GTF names have
#   been fixed)

# - also set sparsity = 2 to build human genome so that it will fit in
#   16 Gb RAM. NB: this step's RAM usage can be set using the
#   `--limitGenomeGenerateRAM` parameter

STAR_buildGenome(
    reference_path = "Reference_FTP",
    STAR_ref_path = file.path("Reference_FTP", "STAR_genomeOnly"),
    n_threads = 8, sparsity = 2,
    additional_args = c("--limitGenomeGenerateRAM", "16000000000")
)

# - Injecting a GTF into a genome-only STAR reference
#
#   This creates an on-the-fly STAR genome, using a GTF file 
#   (derived from a SpliceWiz reference) into a new location.
#   This allows a single STAR reference to use multiple GTFs
#   on different occasions.

STAR_new_ref <- STAR_loadGenomeGTF(
    reference_path = "Reference_FTP",
    STAR_ref_path = file.path("Reference_FTP", "STAR_genomeOnly"),
    STARgenome_output = file.path(tempdir(), "STAR"),
    n_threads = 4,
    sjdbOverhang = 100
)

# This new reference can then be used to align your experiment:

STAR_alignExperiment(
    Experiment = Experiment,
    STAR_ref_path = STAR_new_ref,
    BAM_output_path = "./bams",
    n_threads = 8,
    two_pass = TRUE
)

# Typically, one should `clean up` the on-the-fly STAR reference (as it is
#   large!). If it is in a temporary directory, it will be cleaned up
#   when the current R session ends; otherwise this needs to be done manually:

unlink(file.path(tempdir(), "STAR"), recursive = TRUE)


## End(Not run)

alexchwong/SpliceWiz documentation built on Oct. 15, 2024, 10:12 a.m.