In dwinter/repeatR: Parse and analyse RepeatMasker output

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

repeatR

Read and analyse RepeatMasker output in R.

Very early in development!

install

library(devtools)
install_github("dwinter/repeatR")

Basic usage

The package comes with a small example dataset containing the repeats from one scaffold in the kākāpō assembly. We can read this file into memory using read_rm:

library(repeatR)
# create a file path relative to the installed package, this step is not
# necessary for normal usage
rm_file <- system.file("extdata", "kakapo.out", package="repeatR")
kakapo <- read_rm(rm_file)
kakapo

As you can see, the function reads the data and returns a data.frame with the alignment information from RepeatMasker. We can now take a quick look at the composition of the repeat alignments on this scaffold:

library(ggplot2)
ggplot(kakapo, aes(tclass)) + 
    geom_bar() + 
    coord_flip()  + 
    theme_bw(base_size=14)

It is important to note, however, that the alignment between a reference genome and a given repeat element might be broken up over multiple rows in RepeatMasker output. This occurs when elements are nested within each other (a pattern that is very common for some elements in some species). repeatR provides the function summarise_rm_ID to produce a new table with one row per unique element in the genome.

kakapo_aggregated <- summarise_rm_ID(kakapo)
head(kakapo_aggregated)
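To make the idea of collapsing fragments concrete, here is a minimal base R sketch on a toy table. It is not the implementation of summarise_rm_ID; the column names ID, qstart and qend and all values are assumptions made up for illustration, and only tclass and qlen are names that appear elsewhere in this README.

```r
# Toy table: element 1 is split over two rows because element 2 is nested in it.
# Column names and values are illustrative assumptions, not the read_rm() schema.
fragments <- data.frame(
  ID     = c(1, 2, 1, 3),
  tclass = c("LINE/CR1", "LTR/ERVL", "LINE/CR1", "DNA/hAT"),
  qstart = c(100, 400, 650, 2000),
  qend   = c(399, 649, 900, 2300),
  stringsAsFactors = FALSE
)

# Aligned length of each fragment (coordinates are inclusive)
fragments$qlen <- fragments$qend - fragments$qstart + 1

# Collapse to one row per element: sum fragment lengths within each ID
per_element <- aggregate(qlen ~ ID + tclass, data = fragments, FUN = sum)
per_element <- per_element[order(per_element$ID), ]
per_element
```

The two fragments of element 1 end up as a single row, which is the shape of table that the per-class summaries in this README rely on.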

With this data, we can start to analyse the total amount of the scaffold covered by elements of different classes:

ggplot(kakapo_aggregated, aes(qlen, tclass)) +
    geom_col() +
    theme_bw(base_size=14) +
    scale_x_continuous(labels=Mb_lab)
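The same per-class totals can also be computed as numbers rather than a plot. The sketch below uses a toy data.frame with tclass and qlen columns; the values and the 10 kb scaffold length are invented for illustration and do not come from the kākāpō data.

```r
# Toy per-element table; all values are made up for illustration
elements <- data.frame(
  tclass = c("LINE/CR1", "LINE/CR1", "LTR/ERVL", "Simple_repeat"),
  qlen   = c(551, 1200, 250, 42),
  stringsAsFactors = FALSE
)

# Total aligned bases per class
bp_by_class <- tapply(elements$qlen, elements$tclass, sum)
sort(bp_by_class, decreasing = TRUE)

# Percentage of a hypothetical 10 kb scaffold covered by each class
scaffold_len <- 10000
round(100 * bp_by_class / scaffold_len, 2)
```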

Quite often, you will want to remove some of the sequences that are included in the output file, for instance simple repeats and low-complexity regions. The function filter_by_tclass will remove these sequences, along with functional RNAs and ARTEFACT sequences.

kakapo_just_TEs <- filter_by_tclass(kakapo_aggregated)
table(kakapo_just_TEs$tclass)
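For readers who want to see what this kind of filter amounts to, here is a rough base R equivalent on toy data. The exclusion list below is an assumption based on common RepeatMasker class labels; the list that filter_by_tclass actually uses may differ.

```r
# Toy per-element table; values are made up for illustration
elements <- data.frame(
  tclass = c("LINE/CR1", "Simple_repeat", "Low_complexity",
             "LTR/ERVL", "tRNA", "ARTEFACT"),
  qlen   = c(551, 42, 30, 250, 75, 120),
  stringsAsFactors = FALSE
)

# Hypothetical exclusion list (an assumption, not taken from the package)
drop_classes <- c("Simple_repeat", "Low_complexity",
                  "rRNA", "tRNA", "snRNA", "scRNA", "ARTEFACT")

# Compare on the top-level class, i.e. the text before any "/"
top_level <- sub("/.*$", "", elements$tclass)
just_TEs  <- elements[!(top_level %in% drop_classes), ]

table(just_TEs$tclass)
```

Matching on the top-level class means that a label with a subfamily suffix is treated the same way as the bare class name.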

We can also look at the distribution of the p_sub statistic (the proportion of bases that differ from the consensus element). The function make_TE_pallete provides a pre-defined palette for the tclass column.

ggplot(kakapo_just_TEs, aes(p_sub, fill=tclass)) +
    geom_histogram(colour="black") +
    scale_fill_manual(values=make_TE_pallete(kakapo_aggregated)) +
    theme_bw(base_size=14)