Introduction

strandCheckR is a package used a window that slides across a bam file and gets the count/coverage of reads mapping to +/- strand in each window. The returned Data Frame gives information across the whole genome through each sliding window. That helps to quantify the amount of reads coming from double strand genomic DNA. It can be applied to detect DNA contamination within a strand specifc RNA-seq by assessing whether each window contains reads mainly from one strand, as would be expected from RNA molecules, or from both strands as would be expected from contaminating DNA. Hence, it is considered as an additional quality checking for strand specific RNA data. The package can also be used to detect regions with specific behavior (e.g. highest number of reads) through the whole genome and not only feature regions.

The window read count function is designed flexibly so that user can filter low mapping quality reads, set read proportion to overlap a window, set window length & step etc. It was also implemented in an efficient way to manange big bam file. For a typical human RNA-seq bam file, it takes about 3 minutes to scan and get strand information using a standard laptop 2,3 GHz i5 16 GB.

A function to filter out reads coming from double strand DNA is also provided. For each sliding window, it considers the proportion of +/- stranded reads comparing to a given threshold, to decide if a window contains reads derived from single strand RNA or double stranded DNA. A read that does not belong to any window coming from RNA will be removed. A read that belongs to some windows coming from RNA will be kept with a probability caluclated based on the proportion of +/- stranded of those windows.

Get windows information

The function getStrandFromBamFile is used to get the number of +/- stranded reads from all sliding windows throughout a list of bam files. The bam files should be sorted and have their index files located at the same path as well.

library(strandCheckR)
files <- system.file(
    "extdata",c("s1.sorted.bam","s2.sorted.bam"),package = "strandCheckR"
    )
win <- getStrandFromBamFile(files, sequences = "10")
# shorten the file name
win$File <- basename(as.character(win$File))
win

The returned DataFrame has 10 columns that correspond to the read type (R1 or R2 or single end read), sequence names, starting & ending bases of the windows (by default the window length is 1000 and the window step is 100), number of positive & negative reads that overlap each window (NbPositive, NbNegative), number of positive & negative bases that overlap each window (CovPositive, CovNegative), the maximum coverage (MaxCoverage) and the file name (File). Windows that do not overlap with any read are not reported.

The windows with highest read counts, for example, can be obtained as follows.

head(win[order((win$NbPos+win$NbNeg),decreasing=TRUE),])

Here is an example for paired end bam file.

fileP <- system.file("extdata","paired.bam",package = "strandCheckR")
winP <- getStrandFromBamFile(fileP, sequences = "10")
winP$File <- basename(as.character(winP$File)) #shorten file name
winP

Intersect with an annotation GRanges object

If you have an annotation data, you can integrate it with the sliding windows obtained from the previous step using the function intersectWithFeature. The annotation must be a GRanges object, with or without mcols. Make sure that the sequence names in windows and annotation data are consistent. By default, you will have an additional column in the returned windows which indicates wheather a window overlap with any feature in the annotation object. You can also get details of the overlapped features in the mcols of the annotation object by specifying it in the mcolsAnnnot parameter.

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
annot <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene)
#add chr before the sequence names to be consistent with the annot data
win$Seq <- paste0("chr",win$Seq) 
win <- intersectWithFeature(
    windows = win, annotation = annot, overlapCol = "OverlapTranscript"
    )
win

If you have a gtf or gff file, you can use the function gffReadGR of the ballgown package to read it as a GRanges object.

Plot histogram and windows information

With these windows, you can have some plots via plotHist and plotWin functions which can be saved to an appropriate location.

To plot the histogram of positive proportion of the sliding windows, we use the function plotHist. This function will first calculates the frequency of positive proportion over all windows, and also group/normalize them based on given column names. Then it uses the geom_bar function of ggplot2 to plot those frequencies. The plot gives you an idea on how much double-strand DNA is contained in your sample. In perfectly clear ss-RNA-seq, the positive proportion of every window should be either around 0\% or 100\%. The more amount of windows have this proportion around 50\%, the more the sample was contaminated by DNA.

plotHist(
    windows = win, groupBy = c("File","OverlapTranscript"), 
    normalizeBy = "File", scales = "free_y"
    )

In this example, file s2.sorted.bam seems to be contaminated with double strand DNA, while file s1.sorted.bam is cleaner. By default, the windows are splitted in to 4 coverage groups <10, 10-100, 100-1000, and >1000. It can be set via option split.

For paired-end data, we can group the Data Frame by read type:

plotHist(
    windows = winP, groupBy = "Type", normalizeBy = "Type", scales = "free_y"
    )

Heatmap can be used instead of classic barplot for histogram when specifying heatmap=TRUE. This would be usefull to visualize mutliple files in the same plot.

plotHist(
    windows = win, groupBy = c("File","OverlapTranscript"), 
    normalizeBy = "File", heatmap = TRUE
    )

plotWin creates a plot on the number of reads vs positive proportion of each window. There are also 4 lines correspond to 4 representative thresholds (0.6, 0.7, 0.8, 0.9). Threshold is a parameter that is used when filtering a bam file using filterDNA. Given a threshold, a positive (resp. negative) window is kept if and only if it is above (resp. below) the corresponding threshold line on this plot. This can be used to give guidance as to the best threshold to choose when filtering windows.

plotWin(win, groupBy = "File")

Filter bam files

The functions filterDNA removes potential double strand DNA from a bam file using a given threshold.

win2 <- filterDNA(
    file = files[2], sequences = "10", destination = "s2.filter.bam", 
    threshold = 0.7, getWin = TRUE
    )

Other parameters can be specified for more flexible filtering: define the ranges that you want to always keep, the minimum number of reads under which you want to ignore, the pvalue threshold for keeping a windows etc. Look at the manual page of the function for more options and explanations.

Plotting the windows before and after filering:

win2$File <- basename(as.character(win2$File))
win2$File <- factor(win2$File,levels = c("s2.sorted.bam","s2.filter.bam"))
library(ggplot2)
plotHist(win2,groupBy = "File",normalizeBy = "File",scales = "free_y") +
    ggtitle("Histogram of positive proportions over all sliding windows before 
            and after filtering reads coming from double strand DNA")


UofABioinformaticsHub/rnaCleanR documentation built on Aug. 11, 2021, 11:51 p.m.