Nothing
#' @include Multicore.R
#' @name scoreShift
#'
#' @title Calculate promoter shifting score
#'
#' @description Calculates the shifting score for all consensus clusters (promoters) between
#' two specified (groups of) CAGE datasets. Shifting score is a measure of differential
#' usage of TSSs within consensus cluster between two samples, which indicates the degree of
#' physical separation of TSSs used in these samples within given consensus cluster. In
#' addition to shifting score, a statistical significance (P-value and FDR) of differential
#' TSS usage is calculated for each consensus cluster using Kolmogorov-Smirnov test.
#'
#' @param object A \code{\link{CAGEr}} object.
#'
#' @param groupX,groupY Character vector of the one or more CAGE dataset labels in the first
#' (\code{groupX}) and in the second group (\code{groupY}). Shifting score for each consensus
#' cluster will be calculated by comparing CAGE signal in the samples from \code{groupX} against
#' the signal in the samples from \code{groupY}. If there is more than one CAGE dataset in the
#' group, the datasets within that group will be merged together before comparison with the
#' other group. See Details.
#'
#' @param testKS Logical, should Kolomogorov-Smirnov test for statistical significance of
#' differential TSS usage be performed, and P-values and FDR returned. See Details.
#'
#' @param useTpmKS Logical, should normalized (tpm) values (\code{TRUE}) or raw tag counts
#' (\code{FALSE}) be used to derive sample sizes for Kolomogorov-Smirnov test. Used only when
#' \code{testKS = TRUE}, otherwise ignored. See Details.
#'
#' @param useMulticore Logical, should multicore be used. \code{useMulticore = TRUE} is
#' supported only on Unix-like platforms.
#'
#' @param nrCores Number of cores to use when \code{useMulticore = TRUE}. Default value
#' \code{NULL} uses all detected cores.
#'
#' @details TSSs within one consensus cluster (promoter) can be used differently in different
#' samples (cell types, tissues, developmental stages), with respect to their position and
#' frequency of usage detected by CAGE. This function calculates shifting scores of all
#' consensus clusters between two specified (groups of) CAGE samples to detect promoters that
#' are used differently in these two samples. Shifting score is a measure of differential
#' TSS usage defined as:
#'
#' \code{score = max(F1 - F2) / max(F1)}
#'
#' where F1 is a cumulative sum of CAGE signal along consensus cluster in the group of samples
#' with lower total signal in that consensus cluster, and F2 in the opposite group. Since
#' cumulative sum can be calculated in both forward (5' -> 3') and reverse (3' -> 5')
#' direction, shifting score is calculated for both cases and the bigger value is selected as
#' final shifting score. Value of the shifting score is in the range \code{[-Inf, 1]}, where
#' value of \code{1} means complete physical separation of TSSs used in the two samples for
#' given consensus cluster. In general, any non-negative value of the shifting score can be
#' interpreted as the proportion of transcription initiation in the sample with lower expression
#' that is happening "outside" (either upstream or downstream) of the region used for
#' transcription initiation in the other sample. Negative values indicate no physical
#' separation, \emph{i.e.} the region used for transcription initiation in the sample with
#' lower expression is completely contained within the region used for transcription
#' initiation in the other sample.
#'
#' In addition to shifting score which indicates only physical separation (upstream or
#' downstream shift of TSSs), a more general assessment of differential TSS usage can be
#' obtained by performing a two-sample Kolmogorov-Smirnov test on cumulative sums of CAGE
#' signal along the consensus cluster. In that case, cumulative sums in both samples are
#' scaled to range `[0,1]` and are considered to be empirical cumulative distribution functions
#' (ECDF) reflecting sampling of TSS positions during transcription initiation.
#' Kolmogorov-Smirnov test is performed to assess whether the two underlying probability
#' distributions differ. To obtain P-value (\emph{i.e.} the level at which the
#' null-hypothesis can be rejected), sample sizes that generated the ECDFs are required, in
#' addition to actual K-S statistics calculated from ECDFs. These are derived either from
#' raw tag counts, \emph{i.e.} exact number of times each TSS in the cluster was sampled
#' during sequencing (when \code{useTpmKS = FALSE}), or from normalized tpm values (when
#' \code{useTpmKS = TRUE}). P-values obtained from K-S tests are further adjusted for
#' multiple testing using Benjamini & Hochberg (BH) method and for each P-value a
#' corresponding false-discovery rate (FDR) is also reported.
#'
#' Since calculation of shifting scores and Kolmogorov-Smirnov test require cumulative sums
#' along consensus clusters, they have to be calculated beforehand by calling
#' \code{\link{cumulativeCTSSdistribution}} function.
#'
#' The slots \code{shiftingGroupX}, \code{shiftingGroupY} and
#' \code{consensusClustersShiftingScores} of the provided \code{\link{CAGEset}} object will
#' be occupied by the information on the groups of CAGE datasets that have been compared and
#' shifting scores of all consensus clusters. Consensus clusters (promoters) with shifting
#' score and/or FDR above specified threshold can be extracted by calling
#' \code{\link{getShiftingPromoters}} function.
#'
#' @author Vanja Haberle
#'
#' @seealso \code{\link{cumulativeCTSSdistribution}}
#' @family CAGEr promoter shift functions
#'
#' @examples
#' scoreShift( exampleCAGEset
#' , groupX = c("sample1", "sample2")
#' , groupY = "sample3"
#' , testKS = TRUE, useTpmKS = FALSE)
#' head(getShiftingPromoters(exampleCAGEset))
#'
#' @importFrom stats p.adjust
#' @importFrom utils head
#' @export
setGeneric( "scoreShift"
, function( object, groupX, groupY
, testKS = TRUE, useTpmKS = TRUE, useMulticore = F, nrCores = NULL)
standardGeneric("scoreShift"))
#' @rdname scoreShift
setMethod( "scoreShift", "CAGEset"
, function (object, groupX, groupY, testKS, useTpmKS, useMulticore, nrCores) {
objName <- deparse(substitute(object))
sample.labels <- sampleLabels(object)
if(!(all(groupX %in% sample.labels) & all(groupY %in% sample.labels))){
stop("'groupX' and 'groupY' must be vectors of sample labels! Check 'sampleLabels(", objName, ")' for labels of your CAGE datasets!")
}
message("\nCalculating shifting score...")
a <- object@CTSScumulativesConsensusClusters
# Problem: originally, CTSScumulativesConsensusClusters were in 0-based
# coordinates, meaning that the first Rle value was always 0.
# Then I (Charles) rewrote many functions for the introduction of the
# CAGEexp class and switched to 1-based coordinates (a bit inadvertently
# at the beginning). Most CAGEr functions now expect 1-based coordinates,
# but it is not yet the case of scoreShift(), and I am running out of time.
# Hence, the ugly hack that follows.
# Are we 0-based ?
firstIsZero <- function(x) decode(head(x,1) == 0)
allFirstIsZero <- function(x) all(sapply(x, firstIsZero))
zeroBased <- function(x) all(sapply(x, allFirstIsZero))
# If not, ...
makeZeroBased <- function(l) {
lapply(l, function(s) {
lapply(s, function(x) {
append(0,x)
})
})
}
if(!zeroBased(a)) a <- makeZeroBased(a)
# -- end of ugly hack ---
a <- a[names(a) %in% c(groupX, groupY)]
b <- object@consensusClusters
cumsum.list <- bplapply(a, function(x) {
n <- names(x)
y <- subset(b, !(b$consensus.cluster %in% as.integer(names(x))))
if (nrow(y)>0) {
nulls <- lapply(as.list(c(1:nrow(y))), function(t) {
Rle(rep(0, y[t, "end"] - y[t, "start"] + 1))
})
x <- append(x, nulls)
}
names(x) <- c(n, as.character(y$consensus.cluster))
return(x)
}, BPPARAM = CAGEr_Multicore(useMulticore, nrCores))
cumsum.list.r <- bplapply( cumsum.list, .reverse.cumsum
, useMulticore = useMulticore, nrCores = nrCores
, BPPARAM = CAGEr_Multicore(useMulticore, nrCores))
cumsum.matrices.list.f <- bplapply(as.list(names(cumsum.list[[1]])), function(x) {
sapply(names(cumsum.list), function(y) {as.numeric(cumsum.list[[y]][[x]])})
}, BPPARAM = CAGEr_Multicore(useMulticore, nrCores))
cumsum.matrices.list.r <- bplapply(as.list(names(cumsum.list.r[[1]])), function(x) {
m <- matrix(ncol = length(names(cumsum.list.r)), sapply(names(cumsum.list.r), function(y) {
as.numeric(cumsum.list.r[[y]][[x]])
}))
colnames(m) <- names(cumsum.list.r)
return(m)
}, BPPARAM = CAGEr_Multicore(useMulticore, nrCores))
.poolGroups <- function(x) {
if (!is.matrix(x))
x <- t(x) # Clusters of length 1 produced vectors instead of matrices...
cbind( groupX = rowSums(x[, groupX, drop = FALSE])
, groupY = rowSums(x[, groupY, drop = FALSE]))
}
cumsum.matrices.groups.f <- bplapply( cumsum.matrices.list.f, .poolGroups
, BPPARAM = CAGEr_Multicore(useMulticore, nrCores))
cumsum.matrices.groups.r <- bplapply( cumsum.matrices.list.r, .poolGroups
, BPPARAM = CAGEr_Multicore(useMulticore, nrCores))
names(cumsum.matrices.groups.f) <- names(cumsum.list[[1]])
names(cumsum.matrices.groups.r) <- names(cumsum.list[[1]])
dominant.ctss.pos <- bplapply(as.list(names(cumsum.matrices.groups.f)), function(x) {
sapply(c("groupX", "groupY"), function(y) {.get.dominant.ctss(cumsum.matrices.groups.f[[x]][,y], isCumulative = T)})
}, BPPARAM = CAGEr_Multicore(useMulticore, nrCores))
dominant.ctss.pos <- data.frame(consensus.cluster = as.integer(names(cumsum.matrices.groups.f)), do.call(rbind, dominant.ctss.pos))
dominant.ctss.pos <- dominant.ctss.pos[order(dominant.ctss.pos$consensus.cluster),]
colnames(dominant.ctss.pos) <- c("consensus.cluster", "groupX.pos", "groupY.pos")
clusters.info <- merge(b[,c(1:5)], dominant.ctss.pos, by.x = "consensus.cluster", by.y = "consensus.cluster")
clusters.info$groupX.pos <- clusters.info$groupX.pos + clusters.info$start
clusters.info$groupY.pos <- clusters.info$groupY.pos + clusters.info$start
n <- names(cumsum.matrices.groups.f)
cumsum.matrices.groups.f <- lapply(cumsum.matrices.groups.f, function(x) {x[-1,,drop=F]})
scores.f <- .score.promoter.shifting(cumsum.matrices.groups.f, useMulticore = useMulticore, nrCores = nrCores)
scores.r <- .score.promoter.shifting(cumsum.matrices.groups.r, useMulticore = useMulticore, nrCores = nrCores)
scores <- pmax(scores.f, scores.r)
names(scores) <- n
groupX.tpm <- unlist(lapply(cumsum.matrices.groups.f, function(x) {max(x[,"groupX"])}))
groupY.tpm <- unlist(lapply(cumsum.matrices.groups.f, function(x) {max(x[,"groupY"])}))
scores.df <- data.frame(consensus.cluster = as.integer(names(scores)), shifting.score = scores, groupX.tpm = groupX.tpm, groupY.tpm = groupY.tpm)
clusters.info <- merge(clusters.info, scores.df, by.x = "consensus.cluster", by.y = "consensus.cluster")
clusters.info <- clusters.info[,c("consensus.cluster", "shifting.score", "groupX.pos", "groupY.pos", "groupX.tpm", "groupY.tpm")]
if(testKS){
message("Testing for significance using Kolmogorov-Smirnov test...\n")
if(useTpmKS){
n <- (clusters.info$groupX.tpm * clusters.info$groupY.tpm)/(clusters.info$groupX.tpm + clusters.info$groupY.tpm)
names(n) <- clusters.info$consensus.cluster
}else{
template.tagcount <- rep(0L, length(consensusClustersGR(object)))
names(template.tagcount) <- consensusClustersGR(object)$consensus.cluster
tag.count.list <- list()
for(s in c(groupX, groupY)) {
ctss <- CTSStagCountGR(object, s)
ctss <- ctss[ctss$filteredCTSSidx]
ctss.clusters <- consensusClustersGR(object, s)
tag.count <- .getTotalTagCount(ctss, ctss.clusters)
tag.count.new <- template.tagcount
tag.count.new[ctss.clusters$consensus.cluster] <- as.integer(tag.count) # for some reason at this step the vector is converted to list when run within this function (does not happen when run normally outside the function)!!!
tag.count.list[[s]] <- unlist(tag.count.new)
invisible(gc())
}
tag.count.m <- do.call(cbind, tag.count.list)
tag.count.m.new <- cbind(groupX = rowSums(tag.count.m[,groupX,drop=F]), groupY = rowSums(tag.count.m[,groupY,drop=F]))
n <- (tag.count.m.new[,"groupX"] * tag.count.m.new[,"groupY"])/(tag.count.m.new[,"groupX"] + tag.count.m.new[,"groupY"])
names(n) <- rownames(tag.count.m.new)
}
ks.stat <- unlist(bplapply(cumsum.matrices.groups.f, function(x) {ks.s <- .ksStat(x)}, BPPARAM = CAGEr_Multicore(useMulticore, nrCores)))
p.vals <- .ksPvalue(d = ks.stat, n = n[names(cumsum.matrices.groups.f)])
fdr <- p.adjust(p.vals, method = "BH")
p.vals <- data.frame(consensus.cluster = as.integer(names(cumsum.matrices.groups.f)), pvalue.KS = p.vals, fdr.KS = fdr)
clusters.info <- merge(clusters.info, p.vals, by.x = "consensus.cluster", by.y = "consensus.cluster")
}
object@consensusClustersShiftingScores <- clusters.info
object@shiftingGroupX <- groupX
object@shiftingGroupY <- groupY
assign(objName, object, envir = parent.frame())
invisible(1)
})
#' Select consensus clusters with shifting score above threshold
#'
#' Extracts consensus clusters with shifting score and/or FDR (adjusted P-value from
#' Kolmogorov-Smirnov test) above specified threshold. Returns their genomic coordinates,
#' total CAGE signal and the position of dominant TSS in the two compared groups of CAGE
#' samples, along with the value of the shifting score, P-value and FDR. Scores and
#' P-values/FDR have to be calculated beforehand by calling \code{\link{scoreShift}} function.
#'
#' @param object A \code{\link{CAGEset}} object.
#'
#' @param tpmThreshold Consensus clusters with total CAGE signal \code{>= tpmThreshold}
#' in each of the compared groups will be returned.
#'
#' @param scoreThreshold Consensus clusters with shifting score \code{>= scoreThreshold}
#' will be returned. The default value \code{-Inf} returns all consensus clusters
#' (for which score could be calculated, \emph{i.e.} the ones that have at least
#' one tag in each of the comapred samples).
#'
#' @param fdrThreshold Consensus clusters with adjusted P-value (FDR) from
#' Kolmogorov-Smirnov test \code{>= fdrThreshold} will be returned. The default
#' value \code{1} returns all consensus clusters (for which K-S test could be
#' performed, \emph{i.e.} the ones that have at least one tag in each of the
#' compared samples).
#'
#' @return Returns a \code{data.frame} of shifting promoters with genomic coordinates and
#' positions of dominant TSS and CAGE signal in the two compared (groups of) samples, along
#' with shifting score and adjusted P-value (FDR).
#'
#' @author Vanja Haberle
#'
#' @family CAGEr promoter shift functions
#'
#' @examples
#' head(getShiftingPromoters( exampleCAGEset, tpmThreshold = 100
#' , scoreThreshold = 0.4, fdrThreshold = 0.01))
#'
#' @export
setGeneric( "getShiftingPromoters"
, function( object, tpmThreshold = 0, scoreThreshold = -Inf, fdrThreshold = 1)
standardGeneric("getShiftingPromoters"))
#' @rdname getShiftingPromoters
setMethod( "getShiftingPromoters", "CAGEset"
, function (object, tpmThreshold, scoreThreshold, fdrThreshold) {
shifting.scores <- object@consensusClustersShiftingScores
clusters <- object@consensusClusters
sig.shifting <- shifting.scores[ ( shifting.scores$groupX.tpm >= tpmThreshold &
shifting.scores$groupY.tpm >= tpmThreshold ) &
shifting.scores$shifting.score >= scoreThreshold
, ]
if("fdr.KS" %in% colnames(shifting.scores)){
sig.shifting <- sig.shifting[sig.shifting$fdr.KS <= fdrThreshold,]
}
sig.shifting <- merge(clusters[,c(1:5)], sig.shifting, by.x = "consensus.cluster", by.y = "consensus.cluster", all.x = F, all.y = T)
return(sig.shifting)
})
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.