R/getPlotSetArray.R
In seqplots: An interactive tool for visualizing NGS signals and sequence motif densities along genomic features using average plots and heatmaps

Documented in getPlotSetArray

#' Process genomic signal
#'
#' Function to process genomic signal from tracks and/or motif data, 
#' calculate statistics. This function should be used as the entry point to the
#' SeqPlots pipeline and followed by plotting function(s).
#'
#' @param tracks Character vector or list of BigWig track paths. For motif 
#'  density plots \code{\link{MotifSetup}} class containing motifs setup 
#'  and/or BigWig track paths (see details)
#' @param features Character vector or list containing feature file paths (BED 
#'  or GFF).
#' @param refgenome The UCSC code of reference genome, e.g. 'hg19' for 
#'  Homo sapiens (see details)
#' @param bin Binning window size in base pairs, defaults to 1L
#' @param rm0 Remove zeros from mean/error estimate calculation, 0 in track 
#' file will be treated as missing data, defaults to FALSE
#' @param xmin Upstream calculation distance in base pairs, defaults to 200L
#' @param xmax Downstream calculation distance in base pairs, defaults to 2000L
#' @param xanchored Anchored, feature body pseudo length in base pairs.
#'  The features will be extended or shrunk using linear approximation. 
#'  Used only if /code{type="af"}, defaults to 1000L
#' @param type The type of the calculation, "pf" for point features (default),
#'  "mf" for midpoint features, "ef" for endpoint features and "af" for 
#'  anchored features, see details
#' @param add_heatmap Add the heatmap data to output, must be on to plot
#'  heatmap form output \code{\link{PlotSetArray}} class, defauls to TRUE
#' @param ignore_strand If TRUE the directionality is ignored, that is all 
#'  features' strands, regardless of annotation in GFF/BED file, are treated 
#'  as undetermined ("*"), defaults to FALSE
#' @param verbose Print various messages and warnings, defaults to FALSE
#' @param stat If set to "median" the median is used as summarizing statistic 
#'   for linear plots instead of mean 
#' @param lvl1m function to handle lvl 1 messages, useful when invoked 
#'  in Shiny GUI environment, defaults to \code{\link[base]{message}}
#' @param lvl2m function to handle lvl 2 messages, useful when invoked 
#'  in Shiny GUI environment, defaults to \code{\link[base]{message}}
#'  
#' @return The \code{\link{PlotSetArray}} object.
#' 
#' @details
#' This function takes genomic coordinates in BED or GFF format, and extracts
#' the signal from track files (BigWig) and/or calculates motif density in 
#' these regions. Then it computes the statistics required for average 
#' and heatmap plots. Returns the \code{\link{PlotSetArray}} class, 
#' which can be further subsisted, and used for plotting.
#' 
#' \subsection{Modes of operation}{
#' The function operate in three modes, determined by \code{type} parameter:
#' \itemize{
#'     \item \strong{Point Features} - anchor plot on the start of a feature. 
#'     By default, plot will be directional if strand information is present 
#'     (i.e, use start position and plot on positive strand for + strand 
#'     features and use end position and plot on negative strand for minus 
#'     strand features). If strand information is not present in the feature 
#'     file (or if the "ignore strand" option is chosen), plot will use start 
#'     position of feature and be plotted on the positive strand 
#'     (see explanations). 
#'     User chooses length of upstream and downstream sequence to plot.
#'     \item \strong{Midpoint Features} - similar to point feature, but plot 
#'     is centred on the midpoint of the feature.
#'     \item \strong{Endpoint Features} - similar to point feature, but plot 
#'     is centred on the end point (most downstream) of the feature.
#'     \item \strong{Anchored Features} - features are anchored at start 
#'     and stop positions and given pseudo-length chosen by the user. 
#'     Additionally, the user chooses the length of sequence upstream of 
#'     the start and downstream of the end to plot.
#' }
#' }
#' 
#' \subsection{Binning the track}{
#' \code{bin} numeric parameter determines the resolution of data acquisition.
#' The default value 10 means that 10bp intervals within the plotting range
#' will be summarized by calculating the mean. Higher values increases the
#' speed of calculation and produces smoother plots, but decreases resolution.
#' }
#' 
#' \subsection{DNA motifs}{
#' The \code{\link{MotifSetup}} class allows to calculate and plot the density
#' of any user-defined motif around the chosen genomic feature using the 
#' reference sequence package. Motif plots can be mixed with track files' 
#' signal plots. The \code{\link{MotifSetup}} can be initialized in following 
#' way:
#' 
#'  
#' \code{ms <- MotifSetup()} \cr
#' \code{ms$addMotif("TATA", window=200L, heatmap=TRUE, revcomp=TRUE, 
#'     name=pattern)} \cr
#' \code{ms$addMotif("GAGA", window=100L)$addBigWig("path/to/file.bw")} \cr
#' 
#' The \code{addMotiff} methods accepts following parameters:
#' \describe{
#'     \item{\code{motif}}{ The DNA motif sequence. }
#'     \item{\code{window}}{ Sliding window size in base pairs [bp] - the size
#'     of the sliding window for motif calculation. The value (number of 
#'     matching motifs within the window) is reported in the middle of the 
#'     window, e.g. if window is set to 200bp, DNA motif is "GC" and there are
#'     8 CpGs in first 200 bp of the chromosome the value 8 will be 
#'     reported at 100th bp.}
#'     \item{\code{name}}{ Display name - The name of the motif that will be 
#'     shown in key and heatmap labels. Leave blank to use DNA motif value.}
#'     \item{\code{heatmap}}{ Plot heatmap or error estimates - this checkbox 
#'     determines if heatmap matrix and error estimates should be calculated. 
#'     If unchecked much faster algorithm will be used for motif density 
#'     calculation, but only the average plot without the error estimates 
#'     will be available.}
#'     \item{\code{revcomp}}{ Match reverse complement as well - select if 
#'     reverse complement motif should be reported as well. For example the 
#'     TATA motif will report both TATA and ATAT with this option selected.}
#' }
#' }
#' 
#' 
#' \subsection{Reference genomes}{
#' The \code{refgenome} parameter determines the reference genome to be used 
#  for calculation. Reference genome package is needed to establish baseline
#' chromosome naming convention (e.g. chrX vs. X) and chromosome lengths. 
#' Also for motif plots the genomic sequence is used to calculate motif
#' density tracks. To check which genomic packages are installed in current R 
#' session use \code{\link[BSgenome]{installed.genomes}} function.
#' \code{\link[BSgenome]{available.genomes}} gives the list of all reference 
#' genome packages currently supplied by BioConductor. Please refer to 
#' \code{\link[BSgenome]{BSgenome}} package documentation for installing and 
#' forging new genome packages.
#' }
#' 
#' @family plotting functions
#' @export
#' 
#' 
#' @examples
#'   
#' # Get the paths of example files                      
#' bed1 <- system.file("extdata", 
#'     "Transcripts_ce10_chrI_100Kb.bed", package="seqplots")
#' bed2 <- system.file("extdata", 
#'     "GSM1208361_chrI_100Kb_PeakCalls.bed", package="seqplots")
#' bw1 <- system.file("extdata", 
#'     "GSM1208360_chrI_100Kb_q5_sample.bw", package="seqplots")
#' 
#' #If required install C. elegans genomic package from Bioconductor
#' if(!"BSgenome.Celegans.UCSC.ce10" %in% BSgenome::installed.genomes()) {
#'     if(.Platform$OS.type != "windows" || .Machine$sizeof.pointer != 4) {
#'         if (!requireNamespace("BiocManager", quietly=TRUE))
    #'         install.packages("BiocManager")
#'         BiocManager::install("BSgenome.Celegans.UCSC.ce10")
#'     }
#' }
#' 
#' #Get getPlotSetArray for track and feature files
#' #Does not work on Windows i386 (32 bit)
#' if(.Platform$OS.type != "windows" || .Machine$sizeof.pointer != 4) {
#'     plotset1 <- getPlotSetArray(bw1, c(bed1, bed2), 'ce10')
#' } else {
#'     load(system.file("extdata", "precalc_plotset.Rdata", package="seqplots"))
#' }
#' plot(plotset1) #Average plot
#' plot(plotset1[1,], what='h') #Heatmap
#' 
#' #Get getPlotSetArray for motifs, track and feature files
#' ms <- MotifSetup()
#' ms <- MotifSetup()
#' ms$addMotif('GAGA')
#' ms$addMotif('TATA')
#' ms$addBigWig(bw1)
#' if(.Platform$OS.type != "windows" || .Machine$sizeof.pointer != 4) {
#'     plotset2 <- getPlotSetArray(ms, c(bed1, bed2), 'ce10')
#' }
#' plot(plotset2) #Average plot
#' plot(plotset2[1,], what='h') #Heatmap
#'  
getPlotSetArray <- function(
    tracks, features, refgenome, bin=10L, rm0=FALSE, ignore_strand=FALSE, 
    xmin=2000L, xmax=2000L, xanchored=1000L, type='pf', add_heatmap=TRUE, 
    verbose=FALSE, stat='mean', lvl1m=message, lvl2m=message) {
    
    old_opt <- options('warn'); on.exit(options(old_opt));
    
    if( !verbose ) options('warn'=-1)
    if( class(tracks) == "MotifSetup" ) tracks <- tracks$data
    
    if( class(tracks) == "BigWigFile" ) tracks <- list(tracks)
    if( class(tracks) == "BigWigFileList" ) tracks <- as.list(tracks)
    
    if( class(tracks) == "BamFile" ) tracks <- list(tracks)
    if( class(tracks) == "BamFileList" ) tracks <- as.list(tracks)
    
    if( class(features) == "GRanges" ) features <- list(features)
    if( is(features, "GRangesList") ) features <- as.list(features)
    
    n <- 1; k <- 1; fe_names <- character();
    
    TSS <- list(length(features))
    ANNO <- list(length(features))
    
    extarct_matrix <- function(track, gr, size, ignore_strand) {
        sum <- suppressWarnings(.Call(
            'BWGFile_summary', path.expand(path(track)),
            as.character(seqnames(gr)), ranges(gr), 
            S4Vectors::recycleIntegerArg(size, "size", length(gr)), 
            "mean", as.numeric(NA_real_), PACKAGE='rtracklayer'
        ))
        M <- do.call( rbind, sum )
        if (!ignore_strand) 
            M[as.character(strand(gr))=='-',] <- M[
                as.character(strand(gr))=='-', ncol(M):1]
        return(M)
    }
    
    
    for (j in features) {
        
        #Set up genome package
        if( class(refgenome) == "SQLiteConnection" ) {
            genome_ind <- dbGetQuery(
                refgenome, paste0("SELECT genome FROM files WHERE name = '", 
                                  basename(j), "'")
            )
            if(genome_ind == 'custom') genome_ind <- NA
        } else {
            genome_ind <- refgenome  
        }
        
        if(!is.na(genome_ind)) {
            GENOME <- getREF(genome_ind)
            
            if(class(try(seqlevelsStyle(GENOME))) == "character") {
                remap_chr <- FALSE
            } else {
                remap_chr <- TRUE
            }
        } else {
            GENOME <- NA
            remap_chr <- FALSE
        }
     
        
        #Get features to plot
        message(class(j))
        if(class(j) == "character") {
            tss <- try( 
                anno_out <- sel <- rtracklayer::import(normalizePath(j))
            , silent = FALSE );
            if (class(sel) == "try-error") {
                file_con <- file( normalizePath(j) )
                anno_out <- sel <- rtracklayer::import(file_con)
                close(file_con)
            }
        } else if(class(j) == "GRanges") {
            anno_out <- sel <- j
        }
        
        proc <- list(); proc_names <- character();
        for(i in 1:length(tracks) ) {

            fe_name <- if(class(j) == "character") {
                basename(j) 
            } else if(length(names(features)[n])) {
                names(features)[n] 
            } else {
                paste0('feature', '_', n)
            }
            
            tr_name <- if(class(tracks[[i]]) == 'BigWigFile' | class(tracks[[i]]) == 'BamFile') 
                if( is.null(names(tracks[i])) ) basename(path(tracks[[i]])) else names(tracks[i])
            else 
                if( is.null(names(tracks[i])) ) basename(tracks[[i]][[1]]) else names(tracks[i])
                        
            lvl1m(paste(
                'Processing:', fe_name, '@', tr_name, 
                '[', k, '/',length(features)*length(tracks), ']'
            ))
            
            if (ignore_strand)                strand(sel) <- '*'
            if( type == 'mf' ) sel <- resize(sel, 1, fix='center')
            if( type == 'ef' ) sel <- resize(sel, 1, fix='end')
            
            if( class(tracks[[i]]) == 'character' ) {
                if(grepl('bam$', tracks[[i]])) {
                    tracks[[i]] <- Rsamtools::BamFile( normalizePath(tracks[[i]]) )
                    bf <- tracks[[i]]
                    if(remap_chr) { seqlevelsStyle(sel) <- seqlevelsStyle(bf)[1] }
                } else {
                    track <- BigWigFile( normalizePath(tracks[[i]]) )
                    if(remap_chr) { seqlevelsStyle(sel) <- seqlevelsStyle(track)[1] }
                }
            } else if ( class(tracks[[i]]) == 'list' ) {  
                pattern <- tracks[[i]]$pattern
                seq_win <- tracks[[i]]$window
                pnam    <- tracks[[i]]$name
                revcomp <- tracks[[i]]$revcomp
                if(remap_chr) { seqlevelsStyle(sel) <- seqlevelsStyle(GENOME) }
                
            } else if ( class(tracks[[i]]) == 'BigWigFile' ) {
                track <- tracks[[i]]
                if(remap_chr) { seqlevelsStyle(sel) <- seqlevelsStyle(track)[1] }
            } else if ( class(tracks[[i]]) == 'BamFile' ) {
                bf <- tracks[[i]]
                if(remap_chr) { seqlevelsStyle(sel) <- seqlevelsStyle(bf)[1] }
            }
            
            if ( (type == 'pf') | (type == 'mf') | (type == 'ef') ) {
                
                gr <- suppressWarnings( GenomicRanges::promoters(sel, xmin, xmax) )
                gr <- trim(gr)
                all_ind  <- seq(-xmin, xmax, by=as.numeric(bin) )
                
                if( class(tracks[[i]]) == 'character' | class(tracks[[i]]) == 'BigWigFile') {
                    M <- extarct_matrix(
                        track, gr, length(all_ind), ignore_strand)
                    
                } else if ( class(tracks[[i]]) == 'list' ) {
                    if(suppressWarnings(is.na(GENOME))) stop('Motif plots are not possible for undefined genomes.')
                    if(verbose) lvl2m("Processing genome...")
                    
                    suppressWarnings( seqlengths(gr) <- seqlengths(GENOME)[seqlevels(gr)] )
                    gr <- trim(gr)
                    
                    if(verbose) 
                        lvl2m("Searching for motif...")
                    
                    
                    M <- suppressWarnings( getSF(
                        GENOME, trim(gr), pattern, seq_win, !add_heatmap, 
                        revcomp=revcomp,  nbins=length(all_ind)
                    ))
                    
                    
                    
                    
                } else if ( class(tracks[[i]]) == 'BamFile' ) {
                    
                    seqinfo(sel) <- seqinfo(bf)[seqlevels(sel)]
                    
                    what <- c("mapq")
                    flag <- scanBamFlag(isUnmappedQuery = FALSE)
                    w <- reduce(GenomicRanges::promoters(sel, xmin*1.2, xmax*1.2), ignore.strand = TRUE)
                    #w <- reduce( c(flank(w, width = 0.5*(xmax-xmin), both = TRUE), w))
                    
                    param <- ScanBamParam(
                        which=w, 
                        flag = flag, simpleCigar = FALSE, what = what
                    )
                    ga <- readGAlignments(bf, param=param)
                    
                    #system.time(ga <- readGAlignments(
                    #    BamViews('test.bam', bamRanges=reduce(gr, ignore.strand=TRUE))
                    #)[[1]])
                    
                    #grng <- trim(resize(GRanges(ga), 200L))
                    grng <- GRanges(ga)
                    covr <- coverage(grng)
                    
                    tmp <- tempfile()
                    export.bw(covr, tmp)
                    bwf <- BigWigFile(tmp)
                    M <- extarct_matrix(
                        bwf, gr, length(all_ind), ignore_strand)
                    
                    #M1 <- do.call(rbind, as.list(NumericList(covr[trim(gr)])))
                    
                    #ii <- seq(1, ncol(M1), by=as.numeric(bin) )
                    #zero <- (which(all_ind==0)*bin)-bin/2
                    
                    #M <- do.call(cbind, Map(function(s, e) {
                    #    rowMeans(M1 [,s:e])
                    #}, ii, ii + (as.numeric(bin)-1)))
                    #M <- cbind(M[,which(all_ind < 0)], M1[zero], M[,which(all_ind > 0)-1])
                
                }
                
            } else if (type == 'af') {
                left_ind <- seq(-xmin, -1, by=bin)
                mid_ind <- seq(0, xanchored, by=bin)   
                right_ind <- seq(xanchored+bin, xanchored+xmax, by=bin) 
                all_ind <- c(left_ind, mid_ind, right_ind)
                
                if( class(tracks[[i]]) == 'character' | class(tracks[[i]]) == 'BigWigFile') {
                    M.left <- extarct_matrix(
                        track, trim(flank(sel, xmin, start=TRUE)), length(left_ind), 
                        ignore_strand) #left
                    M.middle <- extarct_matrix(
                        track, trim(sel), length(mid_ind), ignore_strand) #middle    
                    M.right <- extarct_matrix(
                        track, trim(flank(sel, xmax, start=FALSE)), 
                        length(right_ind), ignore_strand)  #right
                    M <- cbind(M.left, M.middle, M.right)   
                    
                } else if ( class(tracks[[i]]) == 'list' ) {
                    if(suppressWarnings(is.na(GENOME))) stop('Motif plots are not possible for undefined genomes.')
                    #MOTIF - left 
                    if(verbose) lvl2m(paste0(
                        "MOTIF: Processing upsterem ", pattern, " motif..."
                    ))
                    gr <- trim(flank(sel, xmin, start=TRUE))
                    suppressWarnings( seqlengths(gr) <- seqlengths(GENOME)[seqlevels(gr)] )
                    M.left <- suppressWarnings( getSF(
                        GENOME, trim(gr), pattern, seq_win, !add_heatmap, 
                        revcomp=revcomp, nbins=length(left_ind)
                    ) )
                    
                    #MOTIF - middle
                    if(verbose) lvl2m(paste0(
                        "MOTIF: Processing middle ", pattern, " motif..."
                    ))
                    gr <- sel
                    suppressWarnings( seqlengths(gr) <- seqlengths(GENOME)[seqlevels(gr)] )
                    M.middle <- suppressWarnings(getSF(
                        GENOME, trim(gr), pattern, seq_win, !add_heatmap, 
                        revcomp=revcomp, nbins=length(mid_ind)
                    ))
                    
                    #MOTIF - right
                    if(verbose) lvl2m(paste0(
                        "MOTIF: Processing downsteream ", pattern, " motif..."
                    ))
                    gr <- flank(sel, xmin, start=FALSE)
                    suppressWarnings( seqlengths(gr) <-  seqlengths(GENOME)[seqlevels(gr)] )
                    M.right <- suppressWarnings(getSF(
                        GENOME, trim(gr), pattern, seq_win, !add_heatmap,
                        revcomp=revcomp, nbins=length(right_ind)
                    ))
                    M <- cbind(M.left, M.middle, M.right)
                    
                }
            }
            
            if(verbose) lvl2m("Calculeating means/stderr/95%CIs...")
            if (rm0) M[M==0] <- NA
            if( stat == 'median' ) {
                means <- apply(M, 2, median, na.rm=TRUE)
                stderror <- apply(M, 2, function (n) {
                    mad(n, na.rm=TRUE) / sqrt( sum(!is.na(n)) )
                })
                conint  <- apply(M, 2, function (n) {
                    quantile(n, .975, na.rm = TRUE)*mad(n, na.rm = TRUE)/sqrt(sum(!is.na(n)))
                })
            } else {
                means <- colMeans(M, na.rm=TRUE)
                stderror <- apply(M, 2, function (n) {
                    sd(n, na.rm=TRUE) / sqrt( sum(!is.na(n)) )
                })
                conint  <- apply(M, 2, function (n) {
                    qt(0.975,sum(!is.na(n)))*sd(n,na.rm=TRUE)/sqrt(sum(!is.na(n)))
                })
            }
            
            stderror[is.na(stderror)] <- 0
            conint[is.na(conint)] <- 0
            
            if(verbose) lvl2m("Exporting results...")
            proc[[i]] <- list(
                means=means, stderror=stderror, conint=conint, all_ind=all_ind,
                e=if (type == 'af') xanchored else NULL,
                desc=paste(sub("\\.(bw|BW)$", "", tr_name), 
                            sub("\\.(gff|GFF|bed|BED)$", "", fe_name), sep="\n@"),
                heatmap=if (add_heatmap) M else NULL,
                anno=anno_out
            )
            proc_names <- c(proc_names, tr_name)
            k <- k+1
            
        }
        names(proc) <- sub("\\.(bw|BW)$", "", proc_names)
        fe_names <- c(fe_names, sub("\\.(gff|GFF|bed|BED)$", "", fe_name))
        
        TSS[[n]] <- proc
        ANNO[[n]] <- sel
        n <- n+1
    }
    names(TSS) <- fe_names
    return( PlotSetArray(data = TSS, annotations = ANNO) )
}
Any scripts or data that you put into this service are public.
seqplots documentation built on Nov. 8, 2020, 4:52 p.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
seqplots
An interactive tool for visualizing NGS signals and sequence motif densities along genomic features using average plots and heatmaps

R/getPlotSetArray.R
In seqplots: An interactive tool for visualizing NGS signals and sequence motif densities along genomic features using average plots and heatmaps

Defines functions getPlotSetArray

Documented in getPlotSetArray

Try the seqplots package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

seqplots An interactive tool for visualizing NGS signals and sequence motif densities along genomic features using average plots and heatmaps

R/getPlotSetArray.R In seqplots: An interactive tool for visualizing NGS signals and sequence motif densities along genomic features using average plots and heatmaps

Defines functions getPlotSetArray

Documented in getPlotSetArray

Try the seqplots package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

seqplots
An interactive tool for visualizing NGS signals and sequence motif densities along genomic features using average plots and heatmaps

R/getPlotSetArray.R
In seqplots: An interactive tool for visualizing NGS signals and sequence motif densities along genomic features using average plots and heatmaps