Mappability-methods: Calculate low mappability genomic regions

Mappability-methodsR Documentation

Calculate low mappability genomic regions

Description

These functions empirically calculate low-mappability (Mappability Exclusion) regions using the given reference. A splice-aware alignment software capable of aligning reads to the genome is required. See details and examples below.

Usage

generateSyntheticReads(
  reference_path,
  read_len = 70,
  read_stride = 10,
  error_pos = 35,
  verbose = TRUE,
  alt_fasta_file
)

calculateMappability(
  reference_path,
  aligned_bam = file.path(reference_path, "Mappability", "Aligned.out.bam"),
  threshold = 4,
  n_threads = 1
)

Arguments

reference_path

The directory of the reference prepared by getResources

read_len

The nucleotide length of the synthetic reads

read_stride

The nucleotide distance between adjacent synthetic reads

error_pos

The position of the procedurally-generated nucleotide error from the start of each synthetic reads

verbose

Whether additional status messages are shown

alt_fasta_file

(Optional) The path to the user-supplied genome fasta file, if different to that found inside the resource subdirectory of the reference_path. If getResources has already been run, this parameter should be omitted.

aligned_bam

The BAM file of alignment of the synthetic reads generated by generateSyntheticReads(). Users should use a genome splice-aware aligner, preferably the same aligner used to align the samples in their experiment.

threshold

Genomic regions with this alignment read depth (or below) in the aligned synthetic read BAM are defined as low mappability regions.

n_threads

The number of threads used to calculate mappability exclusion regions from aligned bam file of synthetic reads.

Details

Creating a Mappability Exclusion BED file is a three-step process.

  • First, using generateSyntheticReads(), synthetic reads are systematically generated using the given genome contained within reference_path, prepared via getResources. Alternatively, use alt_fasta_file to set the genome sequence if this is different to that prepared by getResources or if getResources is not yet run.

  • Second, an aligner such as STAR (preferably the same aligner used for the subsequent RNA-seq experiment) is required to align these reads to the source genome. Poorly mapped regions of the genome will be reflected by regions of low coverage depth.

  • Finally, the BAM file containing the aligned reads is analysed using calculateMappability(), to identify low-mappability regions to compile the Mappability Exclusion BED file.

It is recommended to leave all parameters to their default settings. Regular users should only specify reference_path, aligned_bam and n_threads, as required.

NB: STAR_mappability runs all 3 steps required, using the STAR aligner. This only works in systems where STAR is installed.

NB2: buildFullRef builds the STAR reference, then calculates mappability. It then uses the calculated mappability regions to build the SpliceWiz reference.

NB3: In systems where STAR is not available, consider using HISAT2 or Rsubread. A working example using Rsubread is shown below.

Value

  • For generateSyntheticReads: writes Reads.fa to the Mappability subdirectory inside the given reference_path.

  • For calculateMappability: writes a gzipped BED file named MappabilityExclusion.bed.gz to the Mappability subdirectory inside reference_path. This BED file is automatically used by buildRef if its MappabilityRef parameter is not specified.

Functions

  • generateSyntheticReads(): Generates synthetic reads from a genome FASTA file, for mappability calculations.

  • calculateMappability(): Generate a BED file defining low mappability regions, using reads generated by generateSyntheticReads(), aligned to the genome.

See Also

Build-Reference-methods

Examples


# (1a) Creates genome resource files

ref_path <- file.path(tempdir(), "refWithMapExcl")

getResources(
    reference_path = ref_path,
    fasta = chrZ_genome(),
    gtf = chrZ_gtf()
)

# (1b) Systematically generate reads based on the example genome:

generateSyntheticReads(
    reference_path = ref_path
)
## Not run: 

# (2) Align the generated reads using Rsubread:

# (2a) Build the Rsubread genome index:

subreadIndexPath <- file.path(ref_path, "Rsubread")
if(!dir.exists(subreadIndexPath)) dir.create(subreadIndexPath)
Rsubread::buildindex(
    basename = file.path(subreadIndexPath, "reference_index"), 
    reference = chrZ_genome()
)

# (2b) Align the synthetic reads using Rsubread::subjunc()

Rsubread::subjunc(
    index = file.path(subreadIndexPath, "reference_index"), 
    readfile1 = file.path(ref_path, "Mappability", "Reads.fa"),
    output_file = file.path(ref_path, "Mappability", "AlignedReads.bam"),
    useAnnotation = TRUE,
    annot.ext = chrZ_gtf(),
    isGTF = TRUE
)

# (3) Analyse the aligned reads in the BAM file for low-mappability regions:

calculateMappability(
    reference_path = ref_path,
    aligned_bam = file.path(ref_path, "Mappability", "AlignedReads.bam")
)

# (4) Build the example reference using the calculated Mappability Exclusions

buildRef(ref_path)

# NB the default is to search for the BED file generated by
# `calculateMappability()` in the given reference_path

## End(Not run)

alexchwong/SpliceWiz documentation built on Oct. 15, 2024, 10:12 a.m.