R/dSplsda.R
In DepecheR: Determination of essential phenotypic elements of clusters in high-dimensional entities

Documented in dSplsda

#' Sparse partial least squares discriminant analysis with paired and unpaired
#' data
#'
#'
#' This function is used to compare groups of individuals from whom comparable
#' cytometry or other complex data has been generated. It is superior to just
#' running a Wilcoxon analysis in that it does not consider each cluster
#' individually, but instead uses a sparse partial least squares discriminant
#' analysis to first identify which vector thourgh the multidimensional data
#' cloud, created by the cluster-donor matrix, that optimally separates the
#' groups, and as it is a sparse algorithm, applies a penalty to exclude the
#' clusters that are orthogonal, or almost orthogonal to the discriminant
#' vector, i.e. that do not contribute to separating the groups. This is in
#' large a wrapper for the \code{\link[mixOmics]{splsda}} function from
#' the mixOmics package.
#' @importFrom matrixStats rowMedians
#' @importFrom mixOmics splsda tune.splsda
#' @importFrom ggplot2 ggplot aes geom_density scale_fill_manual
#' scale_x_continuous theme element_blank element_rect ggsave
#' @param xYData A dataframe or matrix with two columns. Each row contains
#' information about the x and y positition in the field for that observation.
#' @param idsVector Vector with the same length as xYData containing information
#' about the id of each observation.
#' @param groupVector Vector with the same length as xYData containing
#' information about the group identity of each observation.
#' @param clusterVector Vector with the same length as xYData containing
#' information about the cluster identity of each observation.
#' @param displayVector Optionally, if the dataset is very large
#' (>100 000 observations) and hence the SNE calculation becomes impossible to
#' perform for the full dataset, this vector can be included. It should contain
#' the set of rows from the data used for statistics, that has been used to
#' generate the xYData.
#' @param testSampleRows Optionally, if a train-test setup is wanted, the rows
#' specified in this vector are used to divide the dataset into a training set,
#' used to generate the analysis, and a test set, where the outcome is predicted
#' based on the outcome of the training set. All rows that are not labeled as
#' test rows are assumed to be train rows.
#' @param paired Defaults to FALSE, i.e. no assumption of pairing is made and
#' Wilcoxon rank sum-test is performed. If true, the software will by default
#' pair the first id in the first group with the firs id in the second group
#' and so forth, so make sure the order is correct!
#' @param plotName The main name for the graph and the analysis.
#' @param densContour If density contours should be created for the plot(s) or
#' not. Defaults to TRUE. a
#' @param groupName1 The name for the first group
#' @param groupName2 The name for the second group
#' @param thresholdMisclassRate This threshold corresponds to the usefulness of
#' the model in separating the groups: a misclassification rate of the default
#' 0.05 means that 5 percent of the individuals are on the wrong side of the
#' theoretical robust middle line between the groups along the sPLS-DA axis,
#' defined as the middle point between the 3:rd quartile of the lower group and
#' the 1:st quartile of the higher group.
#' @param title If there should be a title displayed on the plotting field. As
#' the plotting field is saved as a png, this title cannot be removed as an
#' object afterwards, as it is saved as coloured pixels. To simplify usage for
#' publication, the default is FALSE, as the files are still named, eventhough
#' no title appears on the plot.
#' @param plotDir If different from the current directory. If specified and
#' non-existent, the function creates it. If "." is specified, the plots will be
#' saved at the current directory.
#' @param bandColor The color of the contour bands. Defaults to black.
#' @param dotSize Simply the size of the dots. The default makes the dots
#' smaller the more observations that are included.
#' @param createOutput For testing purposes. Defaults to TRUE. If FALSE, no
#' output is generated.
#' @seealso \code{\link[mixOmics]{splsda}}, \code{\link{dColorPlot}},
#' \code{\link{dDensityPlot}}, \code{\link{dResidualPlot}}
#' @return This function returns the full result of the sPLS-DA. It also returns
#' a SNE based plot showing which events that belong to a cluster dominated by
#' the first or the second group defined by the sparse partial least squares
#' loadings of the clusters.
#' @examples
#'
#' # Load some data
#' data(testData)
#' \dontrun{
#' # Load or create the dimensions that you want to plot the result over. 
#' # uwot::umap recommended due to speed, but tSNE or other method would
#' # work as fine. 
#' data(testDataSNE)
#'
#' # Run the clustering function. For more rapid example execution,
#' # a depeche clustering of the data is inluded
#' # testDataDepeche <- depeche(testData[,2:15])
#' data(testDataDepeche)
#'
#'
#' # Run the function. This time without pairing.
#' sPLSDAObject <- dSplsda(
#'     xYData = testDataSNE$Y, idsVector = testData$ids,
#'     groupVector = testData$label, 
#'     clusterVector = testDataDepeche$clusterVector
#' )
#'
#' # Here, pairing is used. NB!! This artificial example is only present to
#' # show how to use the function. In reality, pairing should only be used in
#' # situations where true paired data is present! The only reason this works
#' # although this is non-paired data is that the number of donors is identical.
#' # As it is, the algorithm internally converts the idsVector so that the first
#' # individual in group1 is associated with the first individual in group2.
#' # This can lead to erratic problems, so make sure that either a valid id
#' # vector, with the same id occuring two times for each individual is
#' # provided, or that the individuals occur in the exact same order in both
#' # groups.
#'
#' sPLSDAObject <- dSplsda(
#'     xYData = testDataSNE$Y, idsVector = testData$ids,
#'     groupVector = testData$label, clusterVector =
#'         testDataDepeche$clusterVector,
#'     paired = TRUE, plotName = "sPLSDAPlot_paired", 
#'     groupName1 = "Stimulation 1",
#'     groupName2 = "Stimulation 2"
#' )
#'
#' # Here is an example of how the display vector can be used.
#' subsetVector <- sample(1:nrow(testData), size = 10000)
#'
#' # Now, the SNE for this displayVector could be created
#' # testDataSubset <- testData[subsetVector, 2:15]
#' # testDataSNESubset <- Rtsne(testDataDisplay, pca=FALSE)$Y
#' # But we will just subset the testDataSNE immediately
#' testDataSNESubset <- testDataSNE$Y[subsetVector, ]
#'
#' # And now, this new SNE can be used for display, although all
#' # the data is used for the sPLS-DA calculations
#' sPLSDAObject <- dSplsda(
#'     xYData = testDataSNESubset, idsVector = testData$ids,
#'     groupVector = testData$label, clusterVector =
#'         testDataDepeche$clusterVector,
#'     displayVector = subsetVector
#' )
#'
#' # Finally, an example of a train-test set situation, where a random half the
#' # dataset is used for training and the second half is used for testing. It
#' # is naturally more biologically interesting to use two independent datasets
#' # for training and testing in the real world.
#' sPLSDAObject <- dSplsda(
#'     xYData = testDataSNE$Y, idsVector = testData$ids,
#'     groupVector = testData$label, clusterVector =
#'         testDataDepeche$clusterVector, testSampleRows = subsetVector
#' )
#' }
#' @export dSplsda
dSplsda <- function(xYData, idsVector, groupVector, clusterVector,
                    displayVector, testSampleRows, paired = FALSE,
                    densContour = TRUE, plotName = "default",
                    groupName1 = unique(groupVector)[1],
                    groupName2 = unique(groupVector)[2],
                    thresholdMisclassRate = 0.05, title = FALSE, plotDir = ".",
                    bandColor = "black", dotSize = 500 / sqrt(nrow(xYData)),
                    createOutput = TRUE) {
    if (plotDir != ".") {
        dir.create(plotDir)
    }


    if (length(unique(groupVector)) != 2) {
        stop("More or less than two groups are present. This is currently not 
             supported.")
    }

    if (length(unique(idsVector)) < 8) {
        warning("NB! The number of unique ids is smaller than 8, so statistical 
                comparison is not suitable. Use dResidualPlot instead to view 
                differences.")
    }

    if (plotName == "default") {
        plotName <- paste0(groupName1, "_vs_", groupName2)
    }

    if (paired) {
        # As the algorithm does not like repeated
        # ids, if the id vector contains the same
        # values for the first group and the
        # second, a new id vector is introduced
        # here
        if (identical(
            unique(idsVector[groupVector == unique(groupVector)[1]]),
            unique(idsVector[groupVector == unique(groupVector)[2]])
        )) {
            pairingVector <- idsVector
            idsVector <- paste0(
                idsVector,
                groupVector
            )
        } else if (length(unique(idsVector[groupVector ==
            unique(groupVector)[1]])) ==
            length(unique(idsVector[groupVector ==
                unique(groupVector)[2]]))) {
            pairingVector <- c(
                idsVector[groupVector == unique(groupVector)[1]],
                idsVector[groupVector == unique(groupVector)[1]]
            )
        } else {
            stop("Pairing cannot be performed, as the first and second" , 
                 "datasets contain different number of individual Ids")
        }
    }

    if (is.matrix(xYData)) {
        xYData <- as.data.frame(xYData)
    }

    if (missing(testSampleRows) == FALSE) {
        clusterVectorTrain <- clusterVector[-testSampleRows]
        idsVectorTrain <- idsVector[-testSampleRows]
        groupVectorTrain <- groupVector[-testSampleRows]
        clusterVectorTest <- clusterVector[testSampleRows]
        idsVectorTest <- idsVector[testSampleRows]
        groupVectorTest <- groupVector[testSampleRows]

        if (paired) {
            pairingVectorTrain <- pairingVector[-testSampleRows]
            pairingVectorTest <- pairingVector[testSampleRows]
        }
    } else {
        clusterVectorTrain <- clusterVector
        idsVectorTrain <- idsVector
        groupVectorTrain <- groupVector
        if (paired) {
            pairingVectorTrain <- pairingVector
        }
    }

    if (paired == FALSE) {
        dSplsdaInData <- dSplsdaPreCalculations(clusterVectorTrain,
            idsVectorTrain,
            groupVectorTrain,
            groupName1 = groupName1,
            groupName2 = groupName2
        )
    } else {
        dSplsdaInData <- dSplsdaPreCalculations(clusterVectorTrain,
            idsVectorTrain,
            groupVectorTrain,
            groupName1 = groupName1,
            groupName2 = groupName2,
            pairingVector
            = pairingVectorTrain
        )
    }

    # Here, the number of clusters that
    # should be kept in the sPLS-DA is chosen
    nVarSPLSDA <- tune.splsda(
        X = t(dSplsdaInData[[1]]), Y = dSplsdaInData[[2]],
        ncomp = 1, logratio = "none",
        test.keepX = dSplsdaInData[[3]],
        validation = "loo", dist = "mahalanobis.dist",
        multilevel = dSplsdaInData[[4]]
    )

    # And here the sPLS-DA is performed.
    # Scaling is performed internally in the
    # algorithm.
    sPLSDAObject <- splsda(
        X = t(dSplsdaInData[[1]]), Y = dSplsdaInData[[2]],
        ncomp = 1, keepX = nVarSPLSDA$choice.keepX,
        multilevel = dSplsdaInData[[4]]
    )

    # Retrieve the x variates for plotting
    sPLSDA_vector <- data.frame(sPLSDAObject$variates$X)
    Group <- dSplsdaInData[[2]]
    densityHist <- cbind(sPLSDA_vector, Group)

    colnames(densityHist) <- c("sPLSDA_vector", "Group")

    # Density plots with semi-transparent
    # fill
    ggplot(densityHist, aes(x = sPLSDA_vector, fill = Group)) +
        geom_density(adjust = 0.2, alpha = 0.4) +
        scale_fill_manual(values = c("red", "blue")) +
        scale_x_continuous(
            limits =
                c(min(densityHist$sPLSDA_vector) -
                    abs(max(densityHist$sPLSDA_vector) -
                        min(densityHist$sPLSDA_vector)) *
                        0.3, max(densityHist$sPLSDA_vector)
                + abs(max(densityHist$sPLSDA_vector) -
                        min(densityHist$sPLSDA_vector))
                    * 0.3)
        ) +
        theme(
            line = element_blank(),
            panel.background = element_rect(fill = "white")
        )

    if (createOutput) {
        ggsave(file.path(plotDir, paste0(plotName, "_sPLSDA_vector.pdf")),
            dpi = 300
        )
    }

    # Retrieve the sparse loadings
    sPLSDALoadings <- sPLSDAObject$loadings$X

    # Here, the maximum values for the
    # plotting are defined. If not added by
    # the user, they are related to the
    # overlap between the populations.
    # Furthermore, the data is re-scaled in
    # here, to be more visually
    # understandable.
    group1SplsDa <-
        densityHist$sPLSDA_vector[which(densityHist$Group == groupName1)]
    group2SplsDa <-
        densityHist$sPLSDA_vector[which(densityHist$Group == groupName2)]

    # Now, the border between the populations
    # is defined

    if (min(group1SplsDa) > min(group2SplsDa)) {
        lowGroup <- group2SplsDa
        highGroup <- group1SplsDa
    } else {
        lowGroup <- group1SplsDa
        highGroup <- group2SplsDa
    }
    groupBorder <- (quantile(lowGroup, probs = 0.75) +
        quantile(highGroup, probs = 0.25)) / 2
    lowErrors <- length(which(highGroup < groupBorder))
    highErrors <- length(which(lowGroup > groupBorder))
    misclassRate <- (lowErrors + highErrors) / sum(
        length(highGroup),
        length(lowGroup)
    )
    if (max(lowGroup) < min(highGroup)) {
        message("The separation of the datasets was perfect, with no overlap ", 
              "between the groups")
        lowestPlottedOverlap <- 0
        absSPLSDALoadings <- abs(sPLSDALoadings)
    } else if (misclassRate <= thresholdMisclassRate) {
        message(
            "The separation of the datasets was very clear, with the ",
            "misclassification rate being ", round(100 * misclassRate),
            " percent"
        )
        lowestPlottedOverlap <- misclassRate
        absSPLSDALoadings <- abs(sPLSDALoadings)
    } else {
        message(
            "The separation of the datasets was not clear, with the ",
            "misclassification rate being ", round(100 * misclassRate),
            " percent"
        )
        scalingValue <- thresholdMisclassRate / misclassRate
        absSPLSDALoadings <- abs(sPLSDALoadings * scalingValue)
    }


    # Now, the sign is changed of the
    # splsda-loadings, to make the function
    # generate a similar visual output to the
    # dWilcox function.  Now the median for
    # each group and cluster is calculated
    group1Data <- dSplsdaInData[[1]][, which(
        as.character(dSplsdaInData[[2]]) == groupName1)]
    group2Data <- dSplsdaInData[[1]][, which(
        as.character(dSplsdaInData[[2]]) == groupName2)]

    median1 <- rowMedians(group1Data)
    median2 <- rowMedians(group2Data)
    correctSPLSDALoadings <- absSPLSDALoadings
    for (i in seq_len(nrow(group1Data))) {
        if (median1[i] >= median2[i]) {
            correctSPLSDALoadings[i] <- absSPLSDALoadings[i]
        } else {
            correctSPLSDALoadings[i] <- -absSPLSDALoadings[i]
        }
    }

    # Here, a vector with the same length as
    # the cluster vector is generated, but
    # where the cluster info has been
    # substituted with the statistic.  If a
    # displayVector has been included, it is
    # used here, to subset the clusterVector
    if (missing(displayVector) == FALSE) {
        statisticVector <- clusterVector[displayVector]
        clusterVectorUsed <- clusterVector[displayVector]
    } else {
        statisticVector <- clusterVector
        clusterVectorUsed <- clusterVector
    }

    for (i in seq_len(nrow(correctSPLSDALoadings))) {
        statisticVector[clusterVectorUsed ==
            rownames(correctSPLSDALoadings)[i]] <-
            correctSPLSDALoadings[i]
    }

    # Here the data that will be used for
    # plotting is scaled.
    colnames(xYData) <- c("V1", "V2")

    brks <- seq(-1, 1, length.out = 10)

    # assign each value to a bin
    grps <- cut(statisticVector, breaks = brks, include.lowest = TRUE)
    colors <- colorRampPalette(c("#FF0000", "white", "#0000FF"))(9)
    xYData$col <- colors[grps]

    dPlotCoFunction(
        colorVariable = xYData$col, plotName =
            paste0(plotName, "_sPLSDA_result"),
        xYData = xYData, title = title,
        densContour = densContour, bandColor = bandColor,
        dotSize = dotSize, plotDir = plotDir,
        createOutput = createOutput
    )

    # Create a color legend with text

    yname <- "Misclass-corrected sPLS-DA loadings"

    topText <- paste0(groupName1, " is more abundant")
    bottomText <- paste0(groupName2, " is more abundant")

    if (createOutput) {
        pdf(file.path(plotDir, paste0(plotName, "_sPLS-DA_scale.pdf")))
        par(fig = c(0.35, 0.65, 0, 1), xpd = NA)
        z <- matrix(seq_len(9), nrow = 1)
        x <- 1
        y <- seq(-1, 1, len = 9)
        image(x, y, z, col = colors, axes = FALSE, xlab = "", ylab = yname)
        axis(2)
        text(1, 1 * 1.2, labels = topText, cex = 1.1)
        text(1, -1 * 1.2, labels = bottomText, cex = 1.1)
        box()
        dev.off()
    }

    # Return data from the sPLS-DA that was
    # needed for the generation of the graphs

    if (createOutput) {
        write.csv(
            sPLSDALoadings,
            file.path(plotDir, paste0(plotName, "_sPLSDA_loadings.csv"))
        )
        write.csv(
            sPLSDA_vector,
            file.path(plotDir, paste0(plotName, "_sPLSDA_vector.csv"))
        )
    }

    # Now, prediction is performed, if the
    # setup is train-test.
    if (missing(testSampleRows) == FALSE) {
        if (paired == FALSE) {
            dSplsdaInDataTest <-
                dSplsdaPreCalculations(clusterVectorTest, idsVectorTest,
                    groupVectorTest,
                    groupName1 = groupName1,
                    groupName2 = groupName2
                )
        } else {
            dSplsdaInDataTest <- dSplsdaPreCalculations(clusterVectorTest,
                idsVectorTest,
                groupVectorTest,
                groupName1 = groupName1,
                groupName2 = groupName2,
                pairingVector =
                    pairingVectorTest
            )
        }
        # And here the preciction is performed.
        sPLSDAPredictObject <- predict(
            object = sPLSDAObject,
            newdata = t(dSplsdaInDataTest[[1]]),
            multilevel = dSplsdaInDataTest[[4]]
        )

        # Retrieve the x variates for plotting
        sPLSDAX <- data.frame(sPLSDAPredictObject$variates)
        densityHist <- cbind(sPLSDAX, dSplsdaInDataTest[[2]])

        colnames(densityHist) <- c("sPLSDA_vector", "Group")

        # Density plots with semi-transparent
        # fill
        ggplot(densityHist, aes(x = sPLSDA_vector, fill = Group)) +
            geom_density(adjust = 0.2, alpha = 0.4) +
            scale_fill_manual(values = c("red", "blue")) +
            scale_x_continuous(
                limits
                = c(min(densityHist$sPLSDA_vector) -
                    abs(max(densityHist$sPLSDA_vector) -
                        min(densityHist$sPLSDA_vector))
                    * 0.3, max(densityHist$sPLSDA_vector)
                + abs(max(densityHist$sPLSDA_vector) -
                        min(densityHist$sPLSDA_vector)) *
                        0.3)
            ) +
            theme(
                line = element_blank(),
                panel.background
                = element_rect(
                    fill
                    = "white"
                )
            )
        if (createOutput) {
            ggsave(file.path(
                plotDir,
                paste0(plotName, "_sPLSDA_vector_predicted.pdf")
            ),
            dpi = 300
            )
        }
    }

    if (missing(testSampleRows)) {
        return(sPLSDAObject)
    } else {
        return(list(sPLSDAObject, sPLSDAPredictObject))
    }
}
Any scripts or data that you put into this service are public.
DepecheR documentation built on Nov. 8, 2020, 5:44 p.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
DepecheR
Determination of essential phenotypic elements of clusters in high-dimensional entities

R/dSplsda.R
In DepecheR: Determination of essential phenotypic elements of clusters in high-dimensional entities

Defines functions dSplsda

Documented in dSplsda

Try the DepecheR package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

DepecheR Determination of essential phenotypic elements of clusters in high-dimensional entities

R/dSplsda.R In DepecheR: Determination of essential phenotypic elements of clusters in high-dimensional entities

Defines functions dSplsda

Documented in dSplsda

Try the DepecheR package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

DepecheR
Determination of essential phenotypic elements of clusters in high-dimensional entities

R/dSplsda.R
In DepecheR: Determination of essential phenotypic elements of clusters in high-dimensional entities