homopolymerMatcher: Match homopolymer lengths
In MarioniLab/sarlacc: Pipeline for Oxford Nanopore RNA-Seq Data Analysis

Description Usage Arguments Details Value Author(s) See Also Examples

Match the true and observed homopolymer lengths based on pairwise alignments of read sequences to a reference.

1	homopolymerMatcher(alignments)

alignments

A Global PairwiseAlignmentsSingleSubject object, usually produced by pairwiseAlignment. The subject should be a constant reference sequence.

This function will identify all “true” homopolymers in the reference sequence (i.e., runs of the same base). For each true homopolymer, it will identify the corresponding read subsequence based on the pairwise alignment. The observed length of the true homopolymer is defined as the longest contiguous run of the same base (ignoring deletion characters) in the corresponding read subsequence.

This is most easily illustrated with a few examples below. For demonstration purposes, only the true homopolymer region and the corresponding read subsequence are shown in uppercase.

  # Observed length of 2.
  acgtAA--tgca # Read
  acgtAAAAtgca # Reference

  # Observed length of 3, despite the deletion character.
  acgtAA-Atgca # Read
  acgtAAAAtgca # Reference

  # Observed length of 2, as the T breaks the run of A's.
  acgtAATAtgca # Read
  acgtAAAAtgca # Reference

  # Observed length of 3, before the breaking T.
  acgtAAATAtgca # Read
  acgt-AAAAtgca # Reference

  # Observed length of 6, including the insertions before and after.
  acgtAAAAAAtgca # Read
  acgt-AAAA-tgca # Reference

  # Observed length of 4, as the observed run must overlap actual homopolymer bases.
  acgtAAAATAAAAAAAAAtgca # Read
  acgtAAAA----------tgca # Reference

An IRanges object where each entry represents a homopolymer run in the reference sequence. The metadata contains base, the base identity of the homopolymer; and observed, a RleList containing an integer run length encoding for each homopolymer. The integer Rle contains the distribution of the observed lengths of that homopolymer in the read sequence.

Aaron Lun, with contributions from Cheuk-Ting Law

pairwiseAlignment, homopolymerFinder

1
2
3

aln <- pairwiseAlignment(subject=DNAString(c("AAAAGGGGGCCCCTTTT")), 
    DNAStringSet(c("AAAAAGGGGGCCCCCCTTTTT", "AAAAGGGGGCCCCTTTTT")))
homopolymerMatcher(aln)