# R/depeche.R


#' Perform optimization and penalized K-means clustering
#'
#'
#' This is the central function of the package. As input, only a dataset is
#' required. It starts by performing a penalty optimization and then performs
#' clustering based on the values identified in that optimization step.
#' @importFrom moments kurtosis
#' @importFrom grDevices col2rgb colorRampPalette densCols dev.off palette pdf
#' png
#' @importFrom graphics axis contour hist image legend mtext par plot plot.new
#' text
#' @importFrom stats median p.adjust predict quantile rnorm sd var wilcox.test
#' runif
#' @importFrom utils write.csv tail
#' @importFrom methods is
#' @param inDataFrame A dataframe or matrix with the data that will be used to
#' create the clustering. Cytometry data should be transformed using
#' biexponential, arcsinh transformation or similar, and day-to-day
#' normalizations should be performed for all data if not all data has been
#' acquired on the same run. Scaling, on the other hand, is performed within
#' the function.
#' @param samplingSubset If the dataset is made up of unequal numbers of cells
#' from multiple individuals, it might be wise to pre-define a subset of the
#' rows that includes equal or near-equal numbers of cells from each
#' individual, to avoid having a few individuals dominate the analysis. This
#' can be done here. Should be a vector of row numbers in the inDataFrame; see
#' the commented sketch at the end of the examples.
#' @param penalties This argument decides whether a single penalty will be used
#' for clustering, or if multiple penalties will be evaluated to identify the
#' optimal one. A single value, a vector of values, or, if dual clustering is
#' performed, a list of two vectors can be given here (see the examples). The
#' suggested default values are empirically defined and might not be optimal
#' for a specific dataset, but the algorithm will warn if the optimal values
#' are on the borders of the range. Note that when the penalty is 0, there is
#' no penalization, which means that the algorithm runs standard K-means
#' clustering.
#' @param sampleSize This controls how many events from the dataset will be
#' used to run the penalty optimization. 'default' uses the full dataset for
#' datasets of up to 10000 events, and a random subset of 10000 events for
#' larger datasets. A user-specified number is also accepted.
#' @param selectionSampleSize The size of the dataset used to find the optimal
#' solution out of the many generated by the penalty optimization at each
#' sample size. 'default' uses the full dataset for datasets of up to 10000
#' events. If the sampleSize argument is larger than 10000, the default also
#' generates a random subset of that size for the selection step. A
#' user-specified number is also accepted.
#' @param dualDepecheSetup Optionally, a dataframe with two columns: the first
#' specifying which step (1 or 2) the variable should be included in, the
#' second specifying the column name for the variable in question. It is used
#' when a two-step clustering should be performed, e.g. phenotypic clustering
#' followed by clustering on functional variables.
#' @param k Number of initial cluster centers. The higher the number, the
#' greater the precision of the clustering, but the computing time also
#' increases linearly with the number of starting points. Default is 30. If
#' penalties=0, k-means clustering with k clusters will be performed.
#' @param minARIImprovement This is the stop criterion for the penalty
#' optimization algorithm: the more iterations that are run, the smaller the
#' improvement of the adjusted Rand index (ARI) becomes, and this sets the
#' threshold at which the inner iterations stop. Defaults to 0.01.
#' @param maxIter The maximal number of iterations that are performed in the
#' penalty optimization.
#' @param log2Off If TRUE, the automatic detection of high kurtosis, and the
#' log2 transformation that follows it, is turned off.
#' @param optimARI Above this level of ARI, all solutions are considered equally
#' valid, and the median solution is selected among them.
#' @param center If centering should be performed. Alternatives are 'default',
#' 'mean', 'peak' and FALSE. 'peak' results in centering around the highest peak
#' in the data, which is useful in most cytometry situations. 'mean' results in
#' mean centering. 'default' gives different results depending on the data:
#' datasets with 100+ variables are mean centered, and otherwise, peak centering
#' is used. FALSE results in no centering, mainly for testing purposes.
#' @param nCores The number of parallel processes used for the penalty
#' optimization. The default is currently 87.5 percent of the available cores,
#' capped at 10, as no speed increase is generally seen above 10 cores on
#' normal computers.
#' @param createOutput For testing purposes. Defaults to TRUE. If FALSE, no
#' plots are generated.
#' @return A nested list with varying components depending on the setup above:
#' \describe{
#'    \item{clusterVector}{A vector with the same length as number of rows in
#'    the inDataFrame, where the cluster identity of each observation is
#'    noted.}
#'    \item{clusterCenters}{A matrix containing the cluster center positions
#'    in all the variables that contributed to creating the clusters with the
#'    given penalty term. An exact zero here indicates that the variable in
#'    question was sparsed out for that cluster. If a variable did not
#'    contribute to the separation of any cluster, it will not be present
#'    here.}
#'    \item{essenceElementList}{A per-cluster list of the items that were 
#'    used to separate that cluster from the rest, i.e. the items that survived
#'    the penalty.}
#'    \item{penaltyOptList}{A list of two dataframes:
#'    \describe{
#'              \item{penaltyOpt.df}{A one row dataframe with the settings for
#'              the optimal penalty.}
#'              \item{meanOptimDf}{A dataframe with the information about the
#'              results with all tested penalty values.}
#'            }
#'     }
#' } If a dual setup is used, the result will be a nested list, where the first
#' sublist contains the information above for the primary clustering, and the
#' following list components contain the combined results of all the secondary
#' clusterings (see the examples for how the result components can be
#' accessed).
#' @examples
#' # Load some data
#' data(testData)
#'
#' # First, just run with the standard settings
#' \dontrun{
#' testDataDepecheResult <- depeche(testData[, 2:15])
#'
#' # Look at the result
#' str(testDataDepecheResult)
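#'
#' # The documented components can be retrieved from the result list, e.g.
#' # the cluster vector and the cluster center matrix:
#' table(testDataDepecheResult$clusterVector)
#' testDataDepecheResult$clusterCenters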
#'
#' # Now, a dual depeche setup is used
#' testDataDepecheResultDual <- depeche(testData[, 2:15],
#'     dualDepecheSetup = data.frame(
#'         rep(1:2, each = 7),
#'         colnames(testData[, 2:15])
#'     ), penalties = c(64, 128), sampleSize = 500,
#'     selectionSampleSize = 500, maxIter = 20
#' )
#'
#' # Look at the result
#' str(testDataDepecheResultDual)
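#'
#' # In a dual setup, the penalties argument can also be given as a list of
#' # two vectors, one per clustering level. A commented sketch, with penalty
#' # values chosen only for illustration:
#' # depeche(testData[, 2:15],
#' #     dualDepecheSetup = data.frame(
#' #         rep(1:2, each = 7),
#' #         colnames(testData[, 2:15])
#' #     ), penalties = list(c(64, 128), c(8, 16)), sampleSize = 500,
#' #     selectionSampleSize = 500, maxIter = 20
#' # )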
#' }
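#'
#' # A sketch of how a samplingSubset could be constructed, assuming a
#' # hypothetical vector 'individualID' (not part of testData) that states
#' # which individual each row of the data comes from:
#' # equalRows <- unlist(lapply(
#' #     split(seq_len(nrow(testData)), individualID),
#' #     sample, size = 1000
#' # ))
#' # depeche(testData[, 2:15], samplingSubset = equalRows)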
#'
#' @export depeche
depeche <- function(inDataFrame, samplingSubset = seq_len(nrow(inDataFrame)), 
                    dualDepecheSetup,
                    penalties = 2^seq(0, 5, by = 0.5),
                    sampleSize = "default", selectionSampleSize = "default",
                    k = 30, minARIImprovement = 0.01, optimARI = 0.95,
                    maxIter = 100, log2Off = FALSE, center = "default",
                    nCores = "default", createOutput = TRUE) {
    if (is.matrix(inDataFrame)) {
        inDataFrame <- as.data.frame.matrix(inDataFrame)
    }

    logCenterSd <- list(FALSE, FALSE, FALSE)
    # Here it is checked if the data has very
    # extreme tails, and if so, the data is
    # log2 transformed
    if (log2Off == FALSE && kurtosis(as.vector(as.matrix(inDataFrame))) >
        100) {
        kurtosisValue1 <- kurtosis(as.vector(as.matrix(inDataFrame)))
        # Here, the log transformation is
        # performed. In cases where the lowest
        # value is 0, everything is simple. In
        # other cases, a slightly more
        # complicated formula is needed
        if (min(inDataFrame) >= 0) {
            inDataFrame <- log2(inDataFrame + 1)
            logCenterSd[[1]] <- TRUE
        } else {
            # First, the data needs to be reasonably
            # log transformed to avoid too extreme
            # values, but still without losing
            # resolution.
            inDataMatrixLog <- log2(apply(
                inDataFrame, 2,
                function(x) x - min(x)
            ) + 1)
            # Then, the extreme negative values will
            # be replaced by 0, as they give rise to
            # artefacts.
            inDataMatrixLog[which(is.nan(inDataMatrixLog))] <- 0
            inDataFrame <- as.data.frame(inDataMatrixLog)
            logCenterSd[[1]] <- TRUE
        }

        kurtosisValue2 <- kurtosis(as.vector(as.matrix(inDataFrame)))
        message(
            "The data was found to be heavily tailed (kurtosis ",
            kurtosisValue1, "). Therefore, it was log2-transformed, leading ",
            "to a new kurtosis value of ", kurtosisValue2, "."
        )
    }

    # Centering and overall scaling is
    # performed. The center == FALSE case is
    # checked first, so that it takes effect
    # regardless of the number of columns.

    if (center == FALSE) {
        message("No centering performed")
        inDataFramePreScaled <- inDataFrame
    } else if (ncol(inDataFrame) < 100) {
        if (center == "mean") {
            message("Mean centering is applied although the data has less than",
                  " 100 columns")
            inDataFrameScaleList <- dScale(inDataFrame,
                scale = FALSE,
                center = "mean", returnCenter = TRUE
            )
            inDataFramePreScaled <- inDataFrameScaleList[[1]]
            logCenterSd[[2]] <- inDataFrameScaleList[[2]]
        } else if (center == "default" || center == "peak") {
            message("As the dataset has less than 100 columns, ",
                    "peak centering is applied.")
            inDataFrameScaleList <- dScale(inDataFrame,
                scale = FALSE,
                center = "peak", returnCenter = TRUE
            )
            inDataFramePreScaled <- inDataFrameScaleList[[1]]
            logCenterSd[[2]] <- inDataFrameScaleList[[2]]
        }
    } else if (center == "peak") {
        message("Peak centering is applied although the data has more than ",
                "100 columns")
        inDataFrameScaleList <- dScale(inDataFrame,
            scale = FALSE,
            center = "peak", returnCenter = TRUE
        )
        inDataFramePreScaled <- inDataFrameScaleList[[1]]
        logCenterSd[[2]] <- inDataFrameScaleList[[2]]
    } else if (center == "default" || center == "mean") {
        message("As the dataset has more than 100 columns, mean centering is ",
              "applied.")
        inDataFrameScaleList <- dScale(inDataFrame,
            scale = FALSE,
            center = "mean", returnCenter = TRUE
        )
        inDataFramePreScaled <- inDataFrameScaleList[[1]]
        logCenterSd[[2]] <- inDataFrameScaleList[[2]]
    }

    # Here, all the data is divided by the
    # standard deviation of the full dataset
    sdInDataFramePreScaled <- sd(as.matrix(inDataFramePreScaled))
    inDataFrameScaled <-
        as.data.frame(inDataFramePreScaled / sdInDataFramePreScaled)
    logCenterSd[[3]] <- sdInDataFramePreScaled

    # Here, the algorithm forks, depending on if a dual depeche setup has been
    # chosen or not
    if (missing(dualDepecheSetup)) {
        depecheResult <- depecheCoFunction(
            inDataFrameScaled, samplingSubset = samplingSubset,
            plotDir = ".", penalties = penalties,
            sampleSize = sampleSize, selectionSampleSize = selectionSampleSize,
            k = k, minARIImprovement = minARIImprovement,
            optimARI = optimARI, maxIter = maxIter,
            nCores = nCores, createOutput = createOutput,
            logCenterSd = logCenterSd
        )
        return(depecheResult)
    } else {
        inDataColumns <- as.character(dualDepecheSetup[, 2])
        inDataFrameFirst <-
            inDataFrameScaled[inDataColumns[which(dualDepecheSetup[, 1] == 1)]]
        if (is.list(penalties)) {
            penaltyList <- penalties
        } else {
            penaltyList <- list(penalties, penalties)
        }
        dirName1 <- "Level_one_depeche"
        depecheResultFirst <- depecheCoFunction(
            inDataFrameFirst,
            plotDir = dirName1, penalties = penaltyList[[1]],
            sampleSize = sampleSize, selectionSampleSize = selectionSampleSize,
            k = k, minARIImprovement = minARIImprovement, optimARI = optimARI,
            maxIter = maxIter, nCores = nCores, createOutput = createOutput,
            logCenterSd = logCenterSd
        )

        message(
            "Done with level one clustering where ",
            length(unique(depecheResultFirst$clusterVector)),
            " clusters were created. Now initiating level two."
        )

        # After this first step, clustering is
        # performed within each of the clusters
        # produced by the depecheResultFirst
        inDataFrameSecond <-
            inDataFrameScaled[inDataColumns[which(dualDepecheSetup[, 1] == 2)]]

        allClusterN <- unique(depecheResultFirst$clusterVector)
        inDataFrameSecondList <- lapply(
            seq_along(allClusterN),
            function(i) {
                return(
                    inDataFrameSecond[
                        which(depecheResultFirst$clusterVector == i), ]
                )
            }
        )

        # Now create the list of cluster numbers
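        # (The numbering encodes the parent
        # cluster: clusters created within
        # first-level cluster i are numbered
        # from 100 * i + 1 and upwards, e.g.
        # 201, 202, ... for first-level
        # cluster 2.)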
        firstClusterNumberList <- lapply(seq_along(allClusterN), function(i) {
            (100 * i) + 1
        })

        # Here, a list of directory names is
        # created, so that the results are
        # sorted in a correct manner
        directoryNames <-
            file.path(
                dirName1,
                paste0(
                    "Cluster_", seq_along(allClusterN),
                    "_level_two_depeche"
                )
            )

        # Here, the secondary clusters are
        # generated for each subframe created by
        # the primary clusters
        depecheResultSecondList <- mapply(depecheCoFunction,
            inDataFrameSecondList, firstClusterNumberList,
            directoryNames,
            MoreArgs = list(
                penalties = penaltyList[[2]],
                sampleSize = sampleSize,
                selectionSampleSize = selectionSampleSize,
                k = k, minARIImprovement = minARIImprovement,
                optimARI = optimARI, maxIter = maxIter, nCores = nCores,
                createOutput = createOutput, logCenterSd = logCenterSd
            ),
            SIMPLIFY = FALSE
        )

        complexClusterVector <- inDataFrameScaled[, 1]
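        # The line above only creates a
        # placeholder vector of the correct
        # length; every position that belongs
        # to a first-level cluster is
        # overwritten in the loop below.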
        clusterCentersList <- list()
        colnamesList <- list()
        for (i in seq_along(depecheResultSecondList)) {
            # Here, all the clustering data is
            # recompiled to one long cluster vector
            complexClusterVector[which(depecheResultFirst$clusterVector ==
                i)] <- depecheResultSecondList[[i]][[1]]
            # And the cluster centers are also compiled
            clusterCentersList[[i]] <- depecheResultSecondList[[i]][[2]]
            # And a list of all unique colnames is created
            colnamesList[[i]] <- colnames(clusterCentersList[[i]])
        }
        uniqueColnamesVector <- sort(unique(unlist(colnamesList)))

        # Now, if a variable is missing in a
        # certain cluster center matrix, it is
        # added with zeros. Also the variables
        # are sorted.
        for (i in seq_along(clusterCentersList)) {
            if (length(clusterCentersList[[i]]) != length(
                uniqueColnamesVector
            )) {
                missingColnames <-
                    uniqueColnamesVector[!uniqueColnamesVector %in%
                        colnames(clusterCentersList[[i]])]
                zeroDataFrame <-
                    as.data.frame(matrix(0,
                        nrow = nrow(clusterCentersList[[i]]),
                        ncol = length(missingColnames)
                    ))
                colnames(zeroDataFrame) <- missingColnames
                clusterCentersList[[i]] <- cbind(
                    clusterCentersList[[i]],
                    zeroDataFrame
                )
            }

            clusterCentersList[[i]] <-
                clusterCentersList[[i]][, order(
                    colnames(clusterCentersList[[i]])
                )]
        }

        secondLevelClusterCenters <- do.call("rbind", clusterCentersList)

        # And after all these centers have been
        # compiled, the first set of cluster
        # centers are also included. The first
        # character of each second-level row
        # name (e.g. the '2' in '203') identifies
        # the parent first-level cluster and is
        # used for the matching below.
        firstClusterCenters <- depecheResultFirst$clusterCenters
        firstOnSecondClusterCentersList <- list()
        clusterClusters <-
            substr(as.character(row.names(secondLevelClusterCenters)), 1, 1)
        for (i in seq_along(clusterClusters)) {
            firstOnSecondClusterCentersList[[i]] <-
                firstClusterCenters[which(row.names(firstClusterCenters) ==
                    clusterClusters[i]), ]
        }

        firstOnSecondClusterCenters <- do.call(
            "rbind",
            firstOnSecondClusterCentersList
        )
        colnames(firstOnSecondClusterCenters) <- colnames(firstClusterCenters)
        row.names(firstOnSecondClusterCenters) <-
            row.names(secondLevelClusterCenters)

        # And finally, these new columns are
        # added to the complexClusterCenters
        complexClusterCenters <- data.frame(
            firstOnSecondClusterCenters,
            secondLevelClusterCenters
        )

        # And now, all the penalty optimization
        # and possible sample size optimizations
        # are saved

        penaltyOptListList <- lapply(depecheResultSecondList, "[[", 3)

        depecheResult <- list(
            levelOneCLusterResult = depecheResultFirst,
            levelTwoClusterVector = complexClusterVector,
            levelTwoClusterCenters = complexClusterCenters,
            levelTwoPenaltyOptList = penaltyOptListList
        )

        if (length(sampleSize) > 1) {
            funval <- depecheResultSecondList[[1]][[4]]
            sampleSizeOptList <- do.call(
                "list",
                vapply(depecheResultSecondList, FUN.VALUE = funval, "[[", 4)
            )

            nextClustResultPosition <- length(depecheResult) + 1
            depecheResult[[nextClustResultPosition]] <-
                as.data.frame.matrix(sampleSizeOptList)
            names(depecheResult)[[length(depecheResult)]] <-
                "levelTwoSampleSizeOptList"
        }

        return(depecheResult)
    }
}
