R/bounds.R

Defines functions rangeSampleMeans boundingIndicesByChr boundingIndices bounds2Rle

Documented in boundingIndices boundingIndicesByChr bounds2Rle rangeSampleMeans

##' Convert bounding indices into a Rle
##'
##' Given a matrix of first/last indices, like from boundingIndicesByChr, and values for
##' each range, convert to a Rle.  This function takes the expected length of the Rle, n,
##' so that any portion of the full length not covered by a first/last range will be a
##' run with the value NA.  This is typical in the case where data is segmented with CBS
##' and some of the data to be segmented is NA.
##' @export bounds2Rle
##' @param bounds matrix, two columns, with first and last index, like from boundingIndicesByChr
##' @param values ANY, some value to be associated with each range, like segmented copy number.
##' @param n integer, the expected length of the Rle, i.e. the number of features in the
##' genome/target ranges processed by boundingIndicesByChr.
##' @return Rle
##' @family 'segmented data'
bounds2Rle <- function(bounds, values, n) {
    if (length(values) != nrow(bounds)) {
        stop("must have one value for each bound")
    }
    if (n < length(values)) {
        stop("n must be >= length(values)")
    }
    run.length = integer((2 * nrow(bounds)) + 1)
    run.length[1] = bounds[1] - 1
    run.value = rep(NA_real_, (2 * nrow(bounds)) + 1)
    data.indices = seq.int(from = 2, by = 2, length.out = nrow(bounds))
    widths = (bounds[, 2] - bounds[, 1]) + 1
    run.length[data.indices] = widths
    run.value[data.indices] = values
    run.length[data.indices + 1] = diff(c(bounds[, 2], n)) - c(widths[-1], 0)
    
    if (sum(run.length) != n) {
        stop("Rle is the wrong length. Look for double counted features in your bounds table.")
    }
    return(Rle(run.value, run.length))
}

##' Find indices of features bounding a set of chromosome ranges/genes
##'
##' This function is similar to findOverlaps but it guarantees at least two features will be
##' covered. This is useful in the case of finding features corresponding to a set of genes.
##' Some genes will fall entirely between two features and thus would not return any ranges
##' with findOverlaps. Specifically, this function will find the indices of the features
##' (first and last) bounding the ends of a range/gene (start and stop) such that
##' first <= start < stop <= last. Equality is necessary so that multiple conversions between
##' indices and genomic positions will not expand with each conversion. Ranges/genes that are
##' outside the range of feature positions will be given the indices of the corresponding
##' first or last index rather than 0 or n + 1 so that genes can always be connected to some data.
##'
##' This function uses some tricks from findIntervals, where is for k queries and n features it
##' is O(k * log(n)) generally and ~O(k) for sorted queries. Therefore will be dramatically
##' faster for sets of query genes that are sorted by start position within each chromosome.
##' The index of the stop position for each gene is found using the left bound from the start
##' of the gene reducing the search space for the stop position somewhat. boundingIndices does not
##' check for NAs or unsorted data in the subject positions. These assumptions are safe for
##' position info coming from a GenoSet or GRanges.
##'
##' @param starts integer vector of first base position of each query range
##' @param stops integer vector of last base position of each query range
##' @param positions Base positions in which to search
##' @param all.indices logical, return a list containing full sequence of indices for each query
##' @return integer matrix of 2 columms for start and stop index of range in data or a list of full sequences of indices for each query (see all.indices argument)
##' @family 'range summaries'
##' @export boundingIndices
##' @examples
##'   starts = seq(10,100,10)
##'   boundingIndices( starts=starts, stops=starts+5, positions = 1:100 )
boundingIndices <- function(starts, stops, positions, all.indices = FALSE) {
    if (length(starts) != length(stops)) {
        stop("starts and stops must be the same length.")
    }
    bounds = .Call("binary_bound", as.integer(starts), as.integer(stops), as.integer(positions))
    
    if (all.indices == TRUE) {
        # Return all covered and bounding indices
        return(apply(bounds, 1, function(x) {
            seq(from = x[1], to = x[2])
        }))
    } else {
        # Just return left and right indices
        return(bounds)
    }
    
}

##' Find indices of features bounding a set of chromosome ranges/genes, across chromosomes
##'
##' Finds subject ranges corresponding to a set of genes (query ranges), taking chromosome
##' into account. Specifically, this function will find the indices of the features
##' (first and last) bounding the ends of a range/gene (start and stop) such that
##' first <= start < stop <= last. Equality is necessary so that multiple conversions between
##' indices and genomic positions will not expand with each conversion. Ranges/genes that are
##' outside the range of feature positions will be given the indices of the corresponding
##' first or last index on that chromosome, rather than 0 or n + 1 so that genes can always be
##' connected to some data. Checking the left and right bound for equality will tell you when
##' a query is off the end of a chromosome.
##'
##' This function uses some tricks from findIntervals, where is for k queries and n features it
##' is O(k * log(n)) generally and ~O(k) for sorted queries. Therefore will be dramatically
##' faster for sets of query genes that are sorted by start position within each chromosome.
##' The index of the stop position for each gene is found using the left bound from the start
##' of the gene reducing the search space for the stop position somewhat.
##'
##' This function differs from boundingIndices in that 1. it uses both start and end positions for
##' the subject, and 2. query and subject start and end positions are processed in blocks corresponding
##' to chromosomes.
##'
##' Both query and subject must be in at least weak genome order (sorted by start within chromosome blocks).
##'
##' @param query GRanges or something coercible to GRanges
##' @param subject GenomicRanges
##' @return integer matrix with two columns corresponding to indices on left and right bound of queries in subject
##' @export boundingIndicesByChr
##' @family 'range summaries'
boundingIndicesByChr <- function(query, subject) {
    if (!is(query, "GRanges")) {
        tryCatch({
            query = as(query, "GRanges")
        }, error = function(e) {
            stop("Could not convert query into GRanges.\n")
        })
    }
    
    # Subject must have features ordered by start within chromosome. Query need not
    # really, but it's faster.  Just checking query genome order to assure data are
    # in blocks by chromosome in a GRanges. Chromosome order doesn't matter.
    if (!isGenomeOrder(subject, strict = FALSE)) {
        stop("subject must be in genome order.\n")
    }
    if (!isGenomeOrder(query, strict = FALSE)) {
        stop("query must be in genome order.\n")
    }
    query.chr.indices = chrIndices(query)
    subject.chr.indices = chrIndices(subject)
    if (!all(rownames(query.chr.indices) %in% rownames(subject.chr.indices))) {
        stop("Some query chromosomes not represented in subject. Try query = keepSeqlevels( query, value=chrNames(subject) )")
    }
    subject.chr.indices = subject.chr.indices[rownames(query.chr.indices), , drop = FALSE]
    nquery = as.integer(sum(query.chr.indices[, 2] - query.chr.indices[, 3]))  # !!!
    query.start = start(query)
    query.end = end(query)
    query.names = names(query)
    if (is.null(query.names)) {
        query.names = as.character(seq.int(from = 1, to = nquery))
    }
    subject.start = start(subject)
    subject.end = end(subject)
    return(.Call("binary_bound_by_chr", nquery, query.chr.indices, query.start, query.end, 
        query.names, subject.chr.indices, subject.start, subject.end))
}

##' Average features in ranges per sample
##'
##' This function takes per-feature genomic data and returns averages for each of a set of genomic ranges.
##' The most obvious application is determining the copy number of a set of genes. The features
##' corresponding to each gene are determined with boundingIndices such that all features with the bounds
##' of a gene (overlaps). The features on either side of the gene unless those positions
##' exactly match the first or last base covered by the gene.  Therefore, genes falling between two features
##' will at least cover two features. Range bounding is performed by the boundingIndices function.
##'
##' @param query GRanges object representing genomic regions (genes) to be averaged.
##' @param subject A GenoSet object or derivative
##' @param assay.element character, name of element in assayData to use to extract data
##' @param na.rm scalar logical, ignore NAs?
##' @return numeric matrix of features in each range averaged by sample
##' @family 'range summaries'
##' @export rangeSampleMeans
##' @examples
##'   data(genoset)
##'   my.genes = GRanges( ranges=IRanges(start=c(35e6,128e6),end=c(37e6,129e6),names=c('HER2','CMYC')), seqnames=c('chr17','chr8') )
##'   rangeSampleMeans( my.genes, genoset.ds, 'lrr' )
rangeSampleMeans <- function(query, subject, assay.element, na.rm = FALSE) {
    ## Find feature bounds of each query in subject genoset, get feature data average
    ## for each sample
    all.indices = boundingIndicesByChr(query, subject)
    data.matrix = assay(subject, assay.element)
    range.means = rangeMeans(data.matrix, all.indices, na.rm = na.rm)
    return(range.means)
}

Try the genoset package in your browser

Any scripts or data that you put into this service are public.

genoset documentation built on Nov. 8, 2020, 6:07 p.m.