In percyfal/ripr: Repeat Induced Point mutations in R

About

This vignette describes an analysis of Repeat Induced Point (RIP) mutations in R using the ripr package. ripr contains functionality for parsing RepeatMasker output, calculating RIP scores, and plotting scores along chromosomes in Manhattan-like plots. Before we start with the example analysis, we describe how ripr represents RepeatMasker output.

R setup

library(knitr)
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE,
                      fig.width=8, fig.height=6, autodep=TRUE,
                      cache=FALSE, include=TRUE, eval=TRUE, tidy=FALSE,
                      dev=c('png'))
knitr::knit_hooks$set(inline = function(x) {
  prettyNum(x, big.mark=" ")
})

library(ripr)
library(ggplot2)
library(plyr)
library(dplyr)
library(cowplot)
library(Biostrings)
library(GenomeInfoDb)
library(GenomicRanges)

library(viridis)
library(RColorBrewer)
bw <- theme_bw(base_size=18) %+replace% theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
theme_set(bw)
color.pal.4 <- brewer.pal(name = "Paired", n = 4)

On object representation of RepeatMasker output

RepeatMasker screens sequences for repeats and low-complexity regions. ripr contains functions to parse two of RepeatMasker's output files, namely the annotation and alignment results. These files pair coordinates in the input sequence that is scanned for repeats (here, the genome sequence) with coordinates in the repeat sequence that is defined in the RepeatMasker library. The input sequence will henceforth be referred to as the query and the repeat sequence the subject.

ripr stores the results from RepeatMasker output as an AlignmentPairs object, which is a subclass of the Bioconductor class S4Vectors::Pairs. A Pairs object aligns two vectors along slot names first and second, and the AlignmentPairs object adds extra slots related to RepeatMasker output. In the following sections we parse RepeatMasker output and explain the additional slot names.

For more information about RepeatMasker and its output formats, see [@repeatmasker_docs].

Annotation output

The RepeatMasker annotation output is a simple tabular format where each line consists of a query-subject pair and additional statistics such as score and percentage divergence:

``` {r head-annotation-file, comment=NA, echo=FALSE} cat(system(paste("head -5 ", system.file("extdata", "crassa_lg2.fasta.out", package="ripr")), intern=TRUE), sep="\n") wzxhzdk:3 The `query` and `subject` slots can be accessed with the `query` and `sbjct` accessor functions: wzxhzdk:4 The query object consists of ranges on the genome where RepeatMasker has identified repeats: wzxhzdk:5 `AlignmentItem` extends `GenomicRanges::GRanges` by adding a slot `bases` that stores the *no. of bases in query sequence past the ending position of match* [see @repeatmasker_docs, section How to read the results]. There is also an additional slot `sequence` that can store the actual sequence of the matching region. The subject is a `RepeatAlignmentItem` object, where the only difference compared to `AlignmentItem` is that it has an extra slot `repeat_class` that stores information about the repeat class, as defined by RepeatMasker: wzxhzdk:6 ### Alignment output The alignment output contains information similar to that in the annotation output, but as the names implies also includes alignments:

``` {r head-alignment-file, comment=NA, echo=FALSE} alnfile <- system.file("extdata", "crassa_lg2.fasta.cat.gz", package="ripr") p1 <- system(paste("zcat ", alnfile, " | head -10"), intern=TRUE) p2 <- system(paste("zcat ", alnfile, " | head -34 | tail -8"), intern=TRUE) cat(c(p1, " ...", " ", p2), sep="\n") wzxhzdk:7 As when reading annotation files, the result is an `AlignmentPairs` object. Looking at the query, we now see that the `sequence` slot is populated with the aligned sequence, including gaps: wzxhzdk:8 It is important to note however, that the annotation and alignment files are **not** equivalent, as the RepeatMasker documentation clearly states [@repeatmasker_docs, section Discrepancies between alignments and annotation]: Most discrepancies between alignments and annotation result from adjustments made to produce more legible annotation. This annotation also tends to be closer to the biological reality than the raw cross_match output. For example, adjustments often are necessary when a repeat is fragmented through deletions, insertions, or an inversion. Many subfamilies of repeats closely resemble each other, and when a repeat is fragmented these fragments can be assigned different subfamily names in the raw output. The program often can decide if fragments are derived from the same integrated transposable element and which subfamily name is appropriate (subsequently given to all fragments). This can result in discrepancies in the repeat name and matching positions in the consensus sequence (subfamily consensus sequences differ in length). The analyses described in this vignette are all based on the annotation files, and no further mention will be made of the alignment output. # The input data The data consists of chromosome 2 sequence files and repeatmasker results for three *Neurospora* species: | Species | Chromosome name | Sequence file | Annotation | Alignment | |-----------------|------------------|------------------|----------------------|-------------------------| | *N.crassa* | Supercontig_12.2 | crassa_lg2.fasta | crassa_lg2.fasta.out | crassa_lg2.fasta.cat.gz | | *N.sitophila* | 5940pb_lg2 | 5940pb_lg2.fasta | 5940pb_lg2.fasta.out | 5940pb_lg2.fasta.cat.gz | | *N.tetrasperma* | 9046pb_lg2 | 9046pb_lg2.fasta | 9046pb_lg2.fasta.out | 9046pb_lg2.fasta.cat.gz | We define genome and annotation files for loading: wzxhzdk:9 Genome files are loaded with `Biostrings::readDNAStringSet` wzxhzdk:10 and annotations files with `readRepeatMaskerAnnotation`. We also add the genome name as a metadata attribute for later access. wzxhzdk:11 The loading of files was wrapped in a call to `AlignmentPairsList` which is a list container for `AlignmentPairs` objects, modelled after `GenomicRanges::GRangesList`. ## Adjustment of repeat classes Before proceeding, we look at the repeat classes for all the results. wzxhzdk:12 We simplify the names and rename levels with `plyr::mapvalues`. wzxhzdk:13 # RIP index calculation on TE elements The `ripr` package defines three different RIP metrics: - **RIP product index:** TA/AT ratio [@margolin_methylated_1998; @@selker_methylated_2003] - not RIP < 0.8 - RIP > 1.1 - **RIP substrate index:** (CA + TG) / (AC + GT) ratio [@margolin_methylated_1998; @@selker_methylated_2003] - not RIP > 1.1 - RIP < 0.9 - **RIP composite index:** RIP product index - RIP substrate index [@lewis_relics_2009] - RIP: positive value In order to calculate the RIP scores, we need the genome subsequences that are defined by the query coordinates in each AlignmentPairs object. The `calculateRIP` function takes as input an `AlignmentPairs` object, and a genome represented by a `Biostrings::DNAStringSet` object. Since we loaded the genomes before, we can proceed to calculate the RIP scores: wzxhzdk:14 ## RIP index as a function of repeat width To begin with we investigate the effect of repeat width on the RIP composite index score. Typically, signatures of RIP (score > 0) are visible first at 400bp (REF: rephrase). We plot the RIP composite index versus query width and note that ripped sequences indeed are longer. wzxhzdk:15 ## Comparing RIP index to a null scores Even though there seem to be a large number of repeats that are ripped, we would like to compare the observed scores to that which we would see in a random background. The function `makeNullRIPScores` generates mock `AlignmentItem` objects on a shuffled reference sequence: wzxhzdk:16 We can then use the `autoplot` function to plot the observed RIP scores together with the null distributions, either as scatter plots wzxhzdk:17 or densities wzxhzdk:18 or both. wzxhzdk:19 # Window-based analyses In addition to genome-wide analyses, `ripr` provides functions for analyzing RIP scores, repeat content, and GC content in windows. wzxhzdk:20 wzxhzdk:21 The results can be plotted with `autoplot` functions, either each `GRanges` separately wzxhzdk:22 wzxhzdk:23 # References

percyfal/ripr documentation built on May 7, 2020, 1:10 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com