R/segsample.R

#' @title Calculate the median of the sampled copy number values from bins 
#' associated to selected segments.
#' 
#' @description This function calculates the median of the sampled copy number
#' values from bins associated to selected segments. The median of the sampled
#' copy number values can be calculated multiple times for the same segment 
#' (sampling with replacement, bootstrap).
#' There are two ways to select the number of times the median is calculated 
#' for a segment. The first way is to use a minimum number of bins per 
#' segment; the integer division of the number of bins in the segment by the 
#' specified minimum number of bins gives the number of times the median is
#' calculated. The second way is to pass an integer that directly specifies 
#' the number of times the median is calculated for each segment. 
#' 
#' @param mysegs a \code{data.frame} containing information about the 
#' segments in 5 columns: \itemize{
#' \item{\code{StartProbe}}{ a \code{numeric} that tabulates the (integer) 
#'     start position of each segment in internal units such as probe numbers.}
#' \item{\code{EndProbe}}{ a \code{numeric} that tabulates the (integer) 
#'     end position of each segment in internal units such as probe numbers.}
#' \item{\code{chrom}}{ a \code{numeric} representing the chromosome.}
#' \item{\code{segmedian}}{ a \code{numeric} representing the median for the 
#'     group of bins associated to one segment}
#' \item{\code{segmad}}{ a \code{numeric} representing the median absolute 
#'     deviation for the group of bins associated to one segment}
#' }
#' 
#' @param ratcol a \code{vector} containing the copy number values (usually in
#' log2) for each bin associated to a segment present in \code{mysegs}. The
#' length of the \code{vector} should correspond to the total number of bins
#' present in \code{mysegs}. 
#' 
#' @param startcol a \code{character} string specifying the name of the column 
#' in \code{mysegs} that tabulates the (integer) start position of each segment 
#' in internal units such as probe numbers for data of CGH microarray origin.
#' Default: "StartProbe".
#' 
#' @param endcol a \code{character} string specifying the name of the column 
#' in \code{mysegs} that tabulates the (integer) end position of each segment 
#' in internal units such as probe numbers for data of CGH microarray origin.
#' Default: "EndProbe".
#' 
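#' @param blocksize an \code{integer} specifying the block size, in bins, used
#' to set the number of times each segment is sampled: the integer division
#' of the number of bins in a segment by \code{blocksize}. Segments with fewer
#' bins than \code{blocksize} are therefore not sampled.
#' Exactly one of \code{blocksize} or \code{times} must be specified by the
#' user. Default: \code{0}.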
#' 
#' @param times an \code{integer} specifying the number of times
#' each segment must be sampled. 
#' Exactly one of \code{blocksize} or \code{times} must be specified by the
#' user. Default: \code{0}.
#' 
#' @return a \code{data.frame} containing the information about the selected
#' segments and the median of the sampled copy number values with replacement 
#' from the associated bins. It contains 3 columns:
#' \itemize{
#' \item{\code{StartProbe}}{ a \code{numeric} that tabulates the (integer) 
#'     start position of each segment in internal units such as probe numbers.}
#' \item{\code{EndProbe}}{ a \code{numeric} that tabulates the (integer) 
#'     end position of each segment in internal units such as probe numbers.}
#' \item{\code{NoName}}{ a \code{numeric} representing the median value of the
#' sampled bins.}
#' }
#' 
#' @examples
#' 
#' ## Create a data.frame with 3 segments on chromosome 1
#' segData <- data.frame(StartProbe=c(1, 9, 13), EndProbe=c(8, 12, 15),
#'     chrom=c(1,1,1), segmedian=c(0.06662475, 0.06719237, 0.07111544),
#'     segmad=c(0.06213208, 0.04722233, 0.07633202))
#'     
#' ## Copy number ratio (in log2) for each bin 
#' ## Multiple bins are associated with one segment
#' ratcol <- c(0.062073840, 0.10913919,  0.143459489,  0.033994620, 
#'     -0.072243732, 0.082252725,  0.151908930,  0.101589490,  0.08554752, 
#'     -0.011155011, -0.122291649, 0.063634112,  0.110149474,  0.043328961,  
#'     0.1632174529)
#'     
#' ## Use an integer division to determine the number of times each
#' ## segment is sampled
#' CNprep:::segsample(mysegs=segData, ratcol=ratcol, blocksize=4)
#' 
#' ## Each segment is sampled the same number of times
#' CNprep:::segsample(mysegs=segData, ratcol=ratcol, times=2)
#' 
#' @author Alexander Krasnitz, Guoli Sun
#' @keywords internal
segsample <- function(mysegs, ratcol, startcol="StartProbe", 
                        endcol="EndProbe", blocksize=0, times=0)
{
    ## At least one parameter (blocksize or times) must be set
    if(blocksize == 0 && times == 0) {
        stop("One of blocksize or times must be set")
    }
    
    ## Only one parameter (blocksize or times) can be set
    if(blocksize != 0 && times != 0) {
        stop("Only one of blocksize or times can be set")
    }
    
    ## The number of bootstraps done on each segment depends on the
    ## segment length if the blocksize parameter is used.
    ## Otherwise, the number of bootstraps is the same for all segments
    segtable <- mysegs[,c(startcol, endcol), drop=FALSE]
    ## Comment Pascal: at least one result should be different from zero
    if (blocksize != 0) {
        segtable <- segtable[rep(seq_len(nrow(segtable)),
                times=(segtable[,endcol] - segtable[, startcol] + 1) %/% 
                        blocksize),]
    }
    if (times != 0) {
        segtable <- segtable[rep(seq_len(nrow(segtable)), each=times),]
    }
    
    ## Calculate the median of the sampled bins for each segment
    ## Each segment may be sampled more than once depending on the
    ## blocksize and times parameters
    return(cbind(segtable, apply(segtable, 1, smedian.sample, v=ratcol)))
}
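
## ---------------------------------------------------------------------
## Illustrative sketch, not part of the package: it shows how the row
## replication in segsample() turns segment lengths into bootstrap
## counts. The names bootMedianSketch and replicateSegmentsSketch are
## hypothetical, and bootMedianSketch only assumes what smedian.sample
## does (resample the bins of a segment with replacement, then take the
## median); the actual helper may differ.
## ---------------------------------------------------------------------
bootMedianSketch <- function(pos, v) {
    ## Bins covered by the segment, without missing values
    w <- v[pos[1]:pos[2]]
    w <- w[!is.na(w)]
    ## Resample by index so a one-bin segment is handled correctly
    median(w[sample.int(length(w), replace=TRUE)])
}

replicateSegmentsSketch <- function(segs, blocksize=0, times=0) {
    idx <- seq_len(nrow(segs))
    if (blocksize != 0) {
        ## One replicate per complete block of bins in the segment
        n <- (segs$EndProbe - segs$StartProbe + 1) %/% blocksize
        return(segs[rep(idx, times=n), , drop=FALSE])
    }
    ## Otherwise, the same number of replicates for every segment
    segs[rep(idx, each=times), , drop=FALSE]
}

## Segments of 8, 4 and 3 bins with blocksize=4 give 2, 1 and 0 replicates,
## so the last segment is dropped; times=2 keeps every segment twice
sketchSegs <- data.frame(StartProbe=c(1, 9, 13), EndProbe=c(8, 12, 15))
sketchByBlock <- replicateSegmentsSketch(sketchSegs, blocksize=4)
sketchByTimes <- replicateSegmentsSketch(sketchSegs, times=2)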
KrasnitzLab/CNprep documentation built on May 28, 2022, 8:32 p.m.