knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

TRAVIS <- !identical(tolower(Sys.getenv("TRAVIS")), "true")
knitr::opts_chunk$set(purl = TRAVIS)
knitr::include_graphics("logo_tagMeppr.pdf")

Hi there! Welcome!

TagMap is a very useful method for transposon mapping [@Stern037762], enabling researchers to map the insertion sites with ease and generate long sequencing reads. However, there is little to none automatisation and downstream analysis software available for these reads. TagMeppr is an easy to use, memory efficient fastq-to-figure package written in R.

Usage

The basic usage of tagMapper revolves around three clear steps:

  1. index: a tagMapper-index is made once for a specific genome and protocol (e.g. hg19 and PigyBac).
  2. align: a tagMapperSample-object is made and aligned to the index
  3. analyse: determine and plot highly likely integraton-sites
library(tagMeppr)

tagMepprIndex

Remember: you only have to make a tagMepprIndex once per genome and protocol! This means that you can use it for however many samples for that protocol and organism (e.g. PiggyBac in human cells) you may have.

library("BSgenome.Hsapiens.UCSC.hg19")

reference_hg19_PB = makeIndex(indexPath = '../premadeReferences/', 
                              bsgenome = BSgenome.Hsapiens.UCSC.hg19, 
                              ITR = 'PiggyBac', 
                              targetInsertionSite = 'TTAA', verbose = T)

Loading a tagMepprIndex is easy with loadIndex().

library("BSgenome.Hsapiens.UCSC.hg19",quietly = T)
reference_hg19_PB = loadIndex('BSgenome.Hsapiens.UCSC.hg19_PiggyBac_tagMeppRindex.fa.gz')

You can use print() on this tagMepprIndex, which gives you a lot of information:

print(reference_hg19_PB)

tagMepprSample

The core object of tagMeppr is the tagMepprSample. Everything, from the paths of the fastq's to the final results, are stored in this object. To kick off the analysis, such an object should be made for every sample by using newTagMeppr(). In this function, the paths of the four respective fastq's, the name of the sample and the transposon are asked.

C91 = newTagMeppr(F1 = 'clone91_FWD_R1.fq.gz',
                 F2 = 'clone91_FWD_R2.fq.gz',
                 R1 = 'clone91_REV_R1.fq.gz',
                 R2 = 'clone91_REV_R2.fq.gz',
                 name = "clone9 (minimised)",
                 protocol = 'PiggyBac')

When you use the print() method, a summary will be given about the sample and its progression trough the tagMeppr-pipeline.

print(C91)

primer-checks

After this, it is a good idea to check whether your primer-design is one that tagMeppr expects: reverse- and forward-primers on different ITRs. Also, it checks if the reverse is found near the start of the transposon. This is to be able to accurately denote the orientation of the transposon. Normally, we expect the reverse-primer to be found at the start of the ITR-sequence. When running checkPrimer(), it will automatically update the tagMepprSample-object with the rev5_fwd3 tag. If it appears to be to other way around, the sample- object will have a rev5_fwd3=T flag.

It is highly recommended to do this!

fwdPrimer = "CGTCAATTTTACGCAGACTATC"
revPrimer = "GTACGTCACAATATGATTATCTTTCTAG"

checkPrimer(fwdPrimer = fwdPrimer, 
            revPrimer = revPrimer, 
            exp = C91, 
            ITR = 'PiggyBac')
print(C91)

Align

To align the tagMeppr-sample to the index, use the align()-function. This will call bwa mem and run it with optimal parameters set. If bwa or samtools are not installed, the function will can an error. Afterwards, the chimeric reads will be filtered and placed in a GRanges object for easy usage later on. Running align() will automatically update the tagMeppr-sample to include the parsed reads.

align(exp = C91, 
      ref = reference_hg19_PB, 
      cores = 30)

The object will be updated, which keeps your evironment tidy. The summary will now also show the number of informative reads (i.e. a chimeric read to both an ITR and the reference) and the temporary folder of the .BAM-files.

print(C91)

PCR-duplicates

By default, align() will remove suspected PCR-duplicates. This is not possible with normal tools like samtools due to the chimeric nature of the reads. If one doesn't want this deduplication to happen, set dedup=F in align().

C91_noDedup = C91
align(exp = C91_noDedup, 
      ref = reference_hg19_PB,
      dedup = F,
      cores = 30)

The output of the print()-command will now show that deduplication didn't happen.

print(C91_noDedup)

Analysis

Find insertions

The findInsertions() function will first find all reads that overlap a TIS, which in the case of PiggyBac will be "TTAA". Next it will calculate whether there is a bias towards one side of the TIS using a binominal test. The bias, denoted as $D$, $-1$ when all reads are upstream and $+1$ when all reads are downstream of the TIS.This is done independently for the forward and reverse reads:

$$ \begin{aligned} p_{fwd/rev} = \binom{reads_{D<0}}{reads} \end{aligned} $$

Next, we filter out TISs which have the bias on the same side of the TIS:

$$ \begin{aligned} sgn(D_{fwd}) \neq sgn(D_{rev}) \end{aligned} $$

To calculate a "TIS-specific" p-value, we use Edgington's sum-p method, which is very conservative in our usage. This ensures that, when $p_{combined} < \alpha$, both the fwd and the rev reads are indeed biased.

$$ \begin{aligned} p_{combined} = \dfrac{(\sum_{i=1}^{2} p_i)^{2}}{2!} \end{aligned} $$

Afterwards, a holm-correction is done to limit the Family-Wise Error Rate (FWER).

findInsertions(exp = C91, ref = reference_hg19_PB, padding = 2)
findInsertions(exp = C91_noDedup, ref = reference_hg19_PB, padding = 2)
print(C91)
foundInsertions = results( C91 )

head(foundInsertions)

barplot(table(foundInsertions$strand), horiz = F,
        col = tagMepprCol(),
        main = 'found orientations')

Plot

Single insertions

plotSite(C91,site = 3)

Returning to the sample witout deduplication, one can now see what would happen:

plotSite(C91_noDedup,site = 3)

You can also set a limit on the maximum number of reads shown:

plotSite(C91_noDedup,site = 3, maxReads = 100)

All insertions

plotInsertions(exp = C91)

show orientation:

plotInsertions(exp = C91, showOrientation = T)
Plot muliple samples
C91_a = C91
C91_b = C91

C91_a$name = 'clone9 subset 1'
C91_b$name = 'clone9 subset 2'

C91_a$results = sample(C91_a$results, size = floor(length(C91_a$results)/2))
C91_b$results = sample(C91_b$results, size = floor(length(C91_b$results)/2))

tagMepprSampleList = list(C91_a, C91_b)
plotInsertions(tagMepprSampleList)
plotInsertions(tagMepprSampleList, showOrientation = T)
plotInsertions(tagMepprSampleList, sideBySide = T)

significance

cnt1 = seq(1:20)
cnt2 = seq(1:20)
CNTS = data.frame(P_fwd = rep(cnt1, 20), P_rev = rep(cnt2, each = 20))
CNTS$p = apply(CNTS, 1, function(x){
  p = metap::sump(c(binom.test(x[1],x[1])$p.value, binom.test(x[2],x[2])$p.value))$p
})
CNTS.bk = CNTS
CNTS$p = -log10(CNTS$p)
CNTS = CNTS[CNTS$P_rev < CNTS$P_fwd, ]
sigs = CNTS.bk[CNTS.bk$p < 0.05, 1:2]
sigs$SIG = "*"
sigs$p = 0

library(ggplot2)
ggplot(CNTS, aes(x= P_rev, y = P_fwd, fill = p)) +
  geom_raster() +
  scale_fill_distiller(palette = 'PuRd', direction = 1)+
  coord_fixed(expand = F) +
  theme_classic()+
  labs(fill = '-log10(P)', title = 'expected significances', 
       subtitle = 'assuming |D| == 1', x = 'fwd counts', y = 'rev counts')+
  geom_text(data = sigs[sigs$P_fwd <= sigs$P_rev,], mapping = aes( label = SIG))+
  annotate('segment', x = 4.5, xend = 20.5, y = 2.5, yend = 2.5)+
  annotate('segment', y = 4.5, yend = 20.5, x = 2.5, xend = 2.5)+
  annotate('segment', x = 2.5, xend = 3.5, y = 4.5, yend = 4.5)+
  annotate('segment', y = 2.5, yend = 3.5, x = 4.5, xend = 4.5)+
  annotate('segment', x = 3.5, xend = 4.5, y = 3.5, yend = 3.5) +
  annotate('segment', y = 3.5, yend = 4.5, x = 3.5, xend = 3.5)

Biography



robinweide/tagmeppr documentation built on Feb. 26, 2020, 10:41 p.m.