R/depeche.R

Defines functions depeche

Documented in depeche

#' Perform optimization and penalized K-means clustering
#'
#'
#' This is the central function of the package. As input, only a dataset is
#' required. It starts by performing optimizations and then performs clustering
#' based on the values identified in the optimization step.
#' @importFrom grDevices col2rgb colorRampPalette densCols dev.off palette pdf
#' png
#' @importFrom graphics axis contour hist image legend mtext par plot plot.new
#' text box
#' @importFrom stats median p.adjust predict quantile rnorm sd var wilcox.test
#' runif
#' @importFrom utils write.csv tail
#' @importFrom methods is
#' @importFrom ggplot2 ggplot aes geom_line ggtitle xlab ylab ylim ggsave
#' @importFrom dplyr sample_n
#' @importFrom parallel detectCores makeCluster stopCluster
#' @importFrom doSNOW registerDoSNOW
#' @importFrom foreach foreach %dopar%
#' @param inDataFrame A dataframe or matrix with the data that will be used to
#' create the clustering. Cytometry data should be transformed using
#' biexponential, arcsinh transformation or similar, and day-to-day
#' normalizations should to be performed for all data if not all data has been
#' acquired on the same run. Scaling, etc, is on the other hand performed
#' within the function.
#' @param samplingSubset If the dataset is made up of an unequal number of cells
#' from multiple individuals, it might be wise to pre-define a subset of the 
#' rows, which includes equal or near-equal numbers of cells from each
#' individual, to avoid a few outliers to dominate the analysis. This can be 
#' done here. Should be a vector of row numbers in the inDataFrame.
#' @param penalties This argument decides whether a single penalty will be used
#' for clustering, or if multiple penalties will be evaluated to identify the
#' optimal one. A single value, a vector of values, or possibly a list of two
#' vectors, if dual clustering is performed can be given here. The suggested
#' default values are empirically defined and might not be optimal for a
#' specific dataset, but the algorithm will warn if the most optimal values are
#' on the borders of the range. Note that when the penalty is 0, there is no
#' penalization, which means that the algorithm runs standard K-means
#' clustering.
#' @param sampleSize This controls what fraction of the dataset that will be
#' used to run the penalty optimization. 'default' results in the full file in
#' files up to 10000 events. In cases where the sampleSize argument is larger
#' than 10000, default leads to the generation of a random subset to the same
#' size also for the selectionSampleSize. A user specified number is also
#' accepted.
#' @param selectionSampleSize The size of the dataset used to find the optimal
#' solution out of the many generated by the penalty optimization at each sample
#' size. 'default' results in the full file in files up to 10000 events. In
#' cases where the sampleSize argument is larger than 10000, default leads to
#' the generation of a random subset to the same size also for the
#' selectionSampleSize. A user specified number is also accepted.
#' @param k Number of initial cluster centers. The higher the number, the
#' greater the precision of the clustering, but the computing time also
#' increases linearly with the number of starting points. Default is 30. If
#' penalties=0, k-means clustering with k clusters will be performed.
#' @param minARIImprovement This is the stop criterion for the penalty
#' optimization algorithm: the more iterations that are run, the smaller will
#' the improvement of the corrected Rand index be, and this sets the threshold
#' when the inner iterations stop. Defaults to 0.01.
#' @param maxIter The maximal number of iterations that are performed in the
#' penalty optimization.
#' @param log2Off If the automatic detection for high kurtosis, and followingly,
#' the log2 transformation, should be turned off.
#' @param optimARI Above this level of ARI, all solutions are considered equally
#' valid, and the median solution is selected among them.
#' @param center If centering should be performed. Alternatives are 'default',
#' 'mean', 'peak', FALSE and a vector of numbers with the same length as the 
#' number of columns in the inDataFrame. 'peak' results in centering around the 
#' highest peak in the data, which is useful in most cytometry situations. 
#' 'mean' results in  mean centering. 'default' gives different results 
#' depending on the data: datasets with 100+ variables are mean centered, 
#' and otherwise, peak centering is used. If a numeric vector is provided, it
#' is used to center the values to the numbers. This is preferable to 
#' pre-centering the data and using the FALSE command, as it will lead to better
#' internal visualization procedures, etc. FALSE results in no centering, mainly
#' for testing purposes.
#' @param scale If scaling should be performed. If TRUE, the dataset will be 
#' divided by the combined standard deviation of the whole dataset. If a number 
#' is provided, the dataset is divided by this number. This scaling
#' procedure makes the default penalties fit most datasets with some precision.
#' @param nCores If multiCore is TRUE, then this sets the number of parallel
#' processes. The default is currently 87.5 percent with a cap on 10 cores, as
#' no speed increase is generally seen above 10 cores for normal computers.
#' @param plotDir Where should the diagnostic plots be printed?
#' @param createOutput For testing purposes. Defaults to TRUE. If FALSE, no
#' plots are generated.
#' @return A nested list:
#' \describe{
#'    \item{clusterVector}{A vector with the same length as number of rows in
#'    the inDataFrame, where the cluster identity of each observation is
#'    noted.}
#'    \item{clusterCenters}{A matrix containing information
#'    about where the centers are in all the variables that contributed to
#'    creating the cluster with the given penalty term. An exact zero here
#'    indicates that the variable in question was sparsed out for that cluster.
#'    If a variable did not contribute to the separation of any cluster, it will
#'    not be present here.}
#'    \item{essenceElementList}{A per-cluster list of the items that were 
#'    used to separate that cluster from the rest, i.e. the items that survived
#'    the penalty.}
#'    \item{penaltyOptList}{A list of two dataframes:
#'    \describe{
#'              \item{penaltyOpt.df}{A one row dataframe with the settings for
#'              the optimal penalty.}
#'              \item{meanOptimDf}{A dataframe with the information about the
#'              results with all tested penalty values.}
#'            }
#'     }
#'     \item{logCenterScale}{The values used to center and scale the 
#'     data and information on if the data was log transformed. This 
#'     information is used internally in dAllocate.}
#' }
#' 
#' @examples
#' # Load some data
#' data(testData)
#'
#' # Here a run with the standard settings
#' \dontrun{
#' testDataDepecheResult <- depeche(testData[, 2:15])
#'
#' # Look at the result
#' str(testDataDepecheResult)
#' }
#'
#' @export depeche
depeche <- function(inDataFrame, samplingSubset = seq_len(nrow(inDataFrame)), 
                    penalties = 2^seq(0, 5, by = 0.5),
                    sampleSize = "default", selectionSampleSize = "default",
                    k = 30, minARIImprovement = 0.01, optimARI = 0.95,
                    maxIter = 100, log2Off = FALSE, center = "default", 
                    scale = TRUE, nCores = "default", plotDir = ".",
                    createOutput = TRUE) {
    if (is.matrix(inDataFrame)) {
        inDataFrame <- as.data.frame.matrix(inDataFrame)
    }
    if (plotDir != ".") {
        dir.create(plotDir)
    }
    
    logCenterSdResult <- depecheLogCenterSd(inDataFrame, log2Off, 
                                               center, scale)
    inDataFrameScaled <- logCenterSdResult[[1]]
    logCenterSd <- logCenterSdResult[[2]]
    
    # Here, if the dataset is big, a subset of it is used to sample from. 
    # Here we have also introduced
    # the samplingSubset that makes it possible to balance a dataset so that each
    # individual gets an equal number of cells as input. 
    if (length(samplingSubset) > 1e+06) {
        warning("This dataset is very large, and it is predicted to run for ",
                "a considerable time, without clear benefits in terms of the ",
                "result. A preferred option is to generate the model with a ", 
                "sampling subset and then allocate the remaining data points ", 
                "using dAllocate.")
    }
    inDataFrameUsed <- inDataFrameScaled[samplingSubset,]

    # Here, the sampleSize is set in cases it
    # is 'default'.
    if (sampleSize == "default") {
        if (nrow(inDataFrameUsed) <= 10000) {
            sampleSize <- nrow(inDataFrameUsed)
        } else {
            sampleSize <- 10000
        }
    }
    
    if (nCores == "default") {
        nCores <- floor(detectCores() * 0.875)
        if (nCores > 10) {
            nCores <- 10
        }
    }
    
    penaltyOptResult <- depechePenaltyOpt(inDataFrameUsed,
                                     k = k, maxIter = maxIter,
                                     sampleSize = sampleSize,
                                     penalties = penalties,
                                     createOutput = createOutput,
                                     minARIImprovement = minARIImprovement,
                                     optimARI = optimARI, nCores = nCores,
                                     plotDir = plotDir
    )

    # Now over to creating the final solution
    # Here, the selectionDataSet is created
    if (selectionSampleSize == "default") {
        selectionSampleSize <- sampleSize
    }
    if (nrow(inDataFrameUsed) <= selectionSampleSize) {
        selectionDataSet <- inDataFrameUsed
    } else {
        selectionDataSet <-
            inDataFrameUsed[sample(seq_len(nrow(inDataFrameUsed)),
                                   selectionSampleSize,
                                   replace = TRUE
            ), ]
    }
    
    # If the dataset is small, a new set of
    # seven clusterings are performed (on all
    # the data or on a subsample, depending
    # on the sample size), and the maximum
    # likelihood solution is returned as the
    # result
    if (nrow(inDataFrameUsed) < 10000) {
        penalty <- penaltyOptResult$bestPenalty
        depecheAllDataResult <-
            depecheAllData(inDataFrameUsed,
                           penalty = penalty, k = k,
                           nCores = nCores
            )
        clusterVector <- depecheAllDataResult[[1]]
        reducedClusterCenters <- depecheAllDataResult[[2]]
    } else {
        # Now, the best run amongst all the runs
        # with the largest sample size is
        # defined, by identifying the solution
        # that gives the highest mean f-measure
        # for all the others. 
        reducedClusterCenters <- 
            depecheClusterCenterSelection(allSolutions = penaltyOptResult[[3]],
                                          selectionDataSet = selectionDataSet, 
                                          k = k, nCores = nCores)
        
        # And here, the optimal solution is created with the full dataset
        clusterVector <- 
            dAllocate(inDataFrameScaled, depModel = 
                                   list("clusterCenters" = 
                                            reducedClusterCenters, 
                                        "logCenterSd" = FALSE))

    }
    
    # Now, the clusters are ordered according to their size, and re-numbered
    clusterSizeNums <- as.numeric(names(sort(table(clusterVector),
                                             decreasing = TRUE)))
    nClust <- length(clusterSizeNums)
    newNums <- seq(1, nClust)
    clusterVectorNewNums <- clusterVector
    sortedClustSizeNums <- sort(clusterSizeNums)
    row.names(reducedClusterCenters) <- sortedClustSizeNums
    for (i in seq_len(nClust)) {
        localNum <- clusterSizeNums[i]
        clusterVectorNewNums[clusterVector == localNum] <- newNums[i]
        row.names(reducedClusterCenters)[sortedClustSizeNums == localNum] <-
            newNums[i]
    }

    # And here, the rows of the reducedClusterCenters are reordered
    reducedClusterCenters <- reducedClusterCenters[
        order(as.numeric(row.names(reducedClusterCenters))),]
    
    # Here, the optPenalty information is
    # retrieved from the optimal sample size
    # run.
    optPenalty <- list("bestPenalty" = penaltyOptResult[[1]], 
                       "allPenalties" = penaltyOptResult[[2]])
    
    #Here, we generate a matrix shoing the ground truth of where the cluster
    #centres lie in the data, before all the normalisation procedures.
    groundClusterCenters <- 
        depecheGroundCenters(reducedClusterCenters, logCenterSd)
    
    # Here, a list of the essential elements for each cluster is generated.
    essenceElementList <- 
        lapply(seq_len(nrow(reducedClusterCenters)), 
               function(x) {
                   return(colnames(reducedClusterCenters)
                          [which(reducedClusterCenters[x, ] != 0)])
                   })
    
    #Now, in applicable cases, we produce a heatmap of the cluster centres
    depecheHeatMap(inDataFrameUsed, reducedClusterCenters, 
                   plotDir, logCenterSd, createOutput)
    
    
    # And now, the result that should be
    # returned is compiled
    depecheResult <- list(
        "clusterVector" = clusterVectorNewNums, 
        "clusterCenters" = groundClusterCenters,
        "essenceElementList" = essenceElementList, 
        "penaltyOptList" = optPenalty, 
        "logCenterSd" = logCenterSd
    )

    if (logCenterSd[[1]]) {
        names(depecheResult)[2] <- "log2ClusterCenters"
    }
    
    depecheResult
}
Theorell/DepecheR documentation built on July 27, 2023, 8:13 p.m.