R/compDis.R

Defines functions compDis

Documented in compDis

#' Calculate compound dissimilarities
#'
#' Function to quantify dissimilarities between phytochemical compounds.
#'
#' This function calculates matrices with pairwise dissimilarities between
#' the chemical compounds in \code{compoundData}, to quantify how
#' different the molecules are to each other. It does so in three
#' different ways, based on the biosynthetic classification or
#' molecular structure of the molecules:
#' 1. Using the classification from the *NPClassifier* tool,
#' \code{type = "NPClassifier"}. *NPClassifier* (Kim et al. 2021) is a
#' deep-learning tool that automatically classifies natural products
#' (i.e. phytochemical compounds) into a hierarchical classification of
#' three levels: pathway, superclass and class. This classification largely
#' corresponds to the biosynthetic groups/pathways the compounds
#' are produced in. Classifications are downloaded from
#' \url{https://npclassifier.ucsd.edu/}. *NPClassifier* does not always
#' manage to classify every compound into all three hierarchical levels. In
#' such cases, it might be beneficial to first run \code{\link{NPCTable}},
#' manually edit the resulting data frame with probable classifications if
#' possible (with help from the Supporting Information in Kim et al. 2021),
#' and then supply this classification to the \code{compDis} function
#' with the \code{npcTable} argument. This will ensure that compound
#' dissimilarities are computed optimally.
#' 2. Using PubChem Fingerprints, \code{type = "PubChemFingerprint"}.
#' This is a binary substructure fingerprint with 881 binary
#' variables describing the chemical structure of a compound.
#' With this method, compounds are therefore compared
#' based on how structurally dissimilar the molecules are.
#' See \url{https://pubchem.ncbi.nlm.nih.gov/docs/data-specification}
#' for more information. (There are many other types of fingerprints,
#' and ways of calculating compound dissimilarities based on them, see
#' e.g. packages \code{fingerprint} and \code{rcdk}). Fingerprint data for
#' molecules is downloaded from PubChem. In association with this,
#' there might be a Warning message about closing unused connections,
#' which is not important.
#' 3. fMCS, flexible Maximum Common Substructure,
#' \code{type = "fMCS"}. This is a pairwise graph matching concept.
#' The fMCS of two compounds is the largest substructure that occurs in both
#' compounds allowing for atom and/or bond mismatches (Wang et al 2013).
#' As with the fingerprints, compounds are compared based on how
#' structurally dissimilar the molecules are. While potentially a very
#' accurate similarity measure, fMCS is much more computationally demanding
#' than the other methods, and will take a significant amount of time for
#' larger data sets. Data on molecules is downloaded from PubChem.
#' In association with this, there might be a Warning message about closing
#' unused connections, which is not important.
#'
#' Dissimilarities using NPClassifier and PubChem Fingerprints
#' are generated by calculating Jaccard (Tanimoto) dissimilarities from a
#' 0/1 table with compounds as rows and group (NPClassifier) or binary
#' fingerprint variable (PubChem Fingerprints) as columns. fMCS generates
#' dissimilarity values by calculating Jaccard dissimilarities based on the
#' number of atoms in the maximum common substructure, allowing for one
#' atom and one bond mismatch. Dissimilarities are outputted as
#' dissimilarity matrices.
#'
#' If dissimilarities are calculated with more than one method,
#' the function will output additional dissimilarity matrices.
#' This always includes a matrix with the mean dissimilarity values of the
#' selected methods. If \code{"NPClassifier"} is included in \code{type},
#' a matrix of "mix" values is also calculated. The values in this matrix
#' are the dissimilarities from NPClassifier when these are > 0.
#' For pairs of compounds where dissimilarities from NPClassifier
#' equals 0 (i.e. when the compounds belong to the same pathway, superclass
#' and class), values are equal to half of the (mean) value(s) of the
#' structural dissimilarity/-ies from PubChem Fingerprints and/or fMCS.
#' With this method, compound dissimilarities are primarily based on
#' NPClassifier, but instead of compounds with identical classification having
#' 0 dissimilarity, these have a dissimilarity based on PubChem Fingerprints
#' and/or fMCS, scaled to always be less (< 0.5) than compounds being in the
#' same pathway and superclass, but different class.
#'
#' If there are unknown compounds, which do not have a
#' corresponding SMILES or InChIKey, this can be handled in three
#' different ways. First, these can be completely removed from the list
#' of compounds and the sample data set, and hence excluded from all analyses.
#' Second, if \code{unknownCompoundsMean = FALSE}, unknown compounds will
#' be given a dissimilarity value of 1 to all other compounds. Third, if
#' \code{unknownCompoundsMean = TRUE}, unknown compounds will be given
#' a dissimilarity value to all other compounds which equals the mean
#' dissimilarity value between all known compounds. See \code{\link{chemodiv}}
#' for alternative methods that can be used when most or all compounds
#' are unknown.
#'
#' @param compoundData Data frame with the chemical compounds of interest,
#' usually the compounds found in the sample dataset.
#' Should have a column named "compound" with common names of
#' the compounds, a column named "smiles" with SMILES IDs of the compounds,
#' and a column named "inchikey" with the InChIKey IDs for the compounds.
#' @param type Type of data compound dissimilarity calculations will be
#' based on: \code{NPClassifier}, \code{PubChemFingerprint} or \code{fMCS}.
#' If more than one is chosen, a matrix with mean values of the other
#' matrices will also calculated.
#' @param npcTable A data frame already generated by \code{\link{NPCTable}}
#' can optionally be supplied, if compound dissimilarities are to be
#' calculated using \code{type = "NPClassifier"}.
#' @param unknownCompoundsMean If unknown compounds, i.e. ones without SMILES
#' or InChIKey, should be given mean dissimilarity values. If not, these
#' will have dissimilarity 1 to all other compounds.
#'
#' @return List with compound dissimilarity matrices. A list is always
#' outputted, even if only one matrix is calculated. Downstream functions,
#' including \code{\link{calcDiv}}, \code{\link{calcBetaDiv}},
#' \code{\link{calcDivProf}}, \code{\link{sampDis}}, \code{\link{molNet}}
#' and \code{\link{chemoDivPlot}} require only the matrix as
#' input (e.g. as \code{fullList$specificMatrix}) rather than the whole list.
#'
#' @export
#'
#' @references
#' Kim HW, Wang M, Leber CA, Nothias L-F, Reher R, Kang KB,
#' van der Hooft JJJ, Dorrestein PC, Gerwick WH, Cottrell GW. 2021.
#' NPClassifier: A Deep Neural Network-Based Structural Classification
#' Tool for Natural Products. Journal of Natural Products 84: 2795-2807.
#'
#' Wang Y, Backman TWH, Horan K, Girke T. 2013.
#' fmcsR: mismatch tolerant maximum common substructure searching in R.
#' Bioinformatics 29: 2792-2794.
#'
#' @examples
#' data(minimalCompData)
#' data(minimalNPCTable)
#' compDis(minimalCompData, type = "NPClassifier",
#' npcTable = minimalNPCTable) # Dissimilarity based on NPClassifier
#'
#' \dontrun{compDis(minimalCompData)} # Dissimilarity based on Fingerprints
#'
#' data(alpinaCompData)
#' data(alpinaNPCTable)
#' compDis(compoundData = alpinaCompData, type = "NPClassifier",
#' npcTable = alpinaNPCTable) # Dissimilarity based on NPClassifier
compDis <- function(compoundData,
                    type = "PubChemFingerprint",
                    npcTable = NULL,
                    unknownCompoundsMean = FALSE) {

  if(!(any(c("NPClassifier", "PubChemFingerprint", "fMCS") %in% type))) {
    stop("Provide at least one type of compound dissimilarity to calculate:
         NPClassifier, PubChemFingerprint or fMCS.")
  }

  colnames(compoundData) <- tolower(colnames(compoundData))

  compoundDisMatList <- list()

  if ("NPClassifier" %in% type) { # Dissimilarities from NPClassifier

    message("Calculating compound dissimilarity matrix using NPClassifier...")

    if (is.null(npcTable)) { # If no table in input

      httr::set_config(httr::config(http_version = 0))
      if(!curl::has_internet()) {
        message("No internet connection available to download compound data.")
        return(invisible(NULL))
      } else if (httr::GET("https://npclassifier.ucsd.edu/")$status_code != 200) {
        message("NPClassifier, https://npclassifier.ucsd.edu/, is unavailable.")
        return(invisible(NULL))
      }

      npcTable <- compoundData
      npcTable[npcTable == ""] <- NA # Replace any empty cells with NA

      npcTable[c("pathway", "pathway2", "pathway3",
               "superclass", "superclass2", "superclass3",
               "class", "class2", "class3")] <- NA

      # Get NPC for each SMILE
      for (i in 1:nrow(npcTable)) {

        # No NA
        if (!is.na(npcTable$smiles[i])) {

          # (May produce strange error if run separately)
          npcClass1 <- httr::GET("https://npclassifier.ucsd.edu/classify",
                                 query = list(smiles = npcTable$smiles[i]))

          npcClass2 <- httr::content(npcClass1, as = "text")

          # If request fails, npcClass2 does not begin with "{", check for that
          if (substring(npcClass2, 1, 1) == "{") {

            npcClass3 <- jsonlite::fromJSON(npcClass2)

            # Check if there are classifications, then put data in data frame
            if (is.character(npcClass3$pathway_results)) {

              npcTable$pathway[i] <- npcClass3$pathway_results[1]
              npcTable$pathway2[i] <- npcClass3$pathway_results[2]
              npcTable$pathway3[i] <- npcClass3$pathway_results[3]

            } else {
              message(paste("NPClassifier has no classification for compound", i))
            }

            if (is.character(npcClass3$superclass_results)) {
              npcTable$superclass[i] <- npcClass3$superclass_results[1]
              npcTable$superclass2[i] <- npcClass3$superclass_results[2]
              npcTable$superclass3[i] <- npcClass3$superclass_results[3]
            }

            if (is.character(npcClass3$class_results)) {
              npcTable$class[i] <- npcClass3$class_results[1]
              npcTable$class2[i] <- npcClass3$class_results[2]
              npcTable$class3[i] <- npcClass3$class_results[3]
            }
            # If the output from NPClassifier API is not as expected
          } else {
            message(paste0("NPClassifier gave error output for compound ", i,
                           ". ", "Is the SMILES correct?"))
          }
        }
      }
    }

    # Message in case of NAs
    if (any(is.na(npcTable$pathway)) ||
        any(is.na(npcTable$superclass)) ||
        any(is.na(npcTable$class))) {
      message("NPClassifier calculations:
              There are compounds completely or partially not classified by
              NPClassifier due to missing SMILES and/or NPClassifier not
              producing full classifications for all compounds.
              If possible, consider adding classifications manually for these
              before running compDis() with type =  NPClassifier for optimal
              dissimilarity calculations. See ?compDis for details.")
    }

    # For dissimilarity calculations, add pseudo-variables to replace NA
    # for compounds where superclass and/or class but not pathway is NA
    for (i in 1:nrow(npcTable)) {
      if (is.na(npcTable$superclass[i]) && !is.na(npcTable$pathway[i])) {
        npcTable$superclass[i] <- paste("superclass",
                                        npcTable$compound[i],
                                        sep = "_")
      }
      if (is.na(npcTable$class[i]) && !is.na(npcTable$pathway[i])) {
        npcTable$class[i] <- paste("class", npcTable$compound[i], sep = "_")
      }
    }

    # Long to wide
    npcTableW <- data.frame(compound = npcTable$compound)

    for (i in 1:nrow(npcTable)) { # For each compound
      for (j in which(colnames(npcTable) == "pathway"):ncol(npcTable)) {

        if (npcTable[i, j] %in% colnames(npcTableW)) {
          # If the name already exist as column in npcTableW, add 1

          colNumber <- match(npcTable[i, j], colnames(npcTableW))
          npcTableW[i, colNumber] <- 1

        } else if (!is.na(npcTable[i,j])) {
          # If name does not exist as column in npcTableW, add column and 1

          npcTableW[,npcTable[i,j]] <- NA
          npcTableW[i,npcTable[i,j]] <- 1
        }
      }
    }

    # Replace Greek letters in names
    correctColNames <- gsub("\u03b1", "alpha", colnames(npcTableW))
    correctColNames <- gsub("\u03b2", "beta", correctColNames)
    correctColNames <- gsub("\u03b3", "gamma", correctColNames)
    correctColNames <- gsub("\u03b4", "delta", correctColNames)
    correctColNames <- gsub("\u03b5", "epsilon", correctColNames)
    correctColNames <- gsub("\u03b6", "zeta", correctColNames)
    correctColNames <- gsub("\u03b7", "eta", correctColNames)
    correctColNames <- gsub("\u03b8", "theta", correctColNames)
    correctColNames <- gsub("\u03b9", "iota", correctColNames)
    correctColNames <- gsub("\u03ba", "kappa", correctColNames)
    correctColNames <- gsub("\u03bb", "lambda", correctColNames)
    correctColNames <- gsub("\u03bc", "mu", correctColNames)
    correctColNames <- gsub("\u03bd", "nu", correctColNames)
    correctColNames <- gsub("\u03be", "xi", correctColNames)
    correctColNames <- gsub("\u03bf", "omicron", correctColNames)
    correctColNames <- gsub("\u03c0", "pi", correctColNames)
    correctColNames <- gsub("\u03c1", "rho", correctColNames)
    correctColNames <- gsub("\u03c3", "sigma", correctColNames)
    correctColNames <- gsub("\u03c4", "tau", correctColNames)
    correctColNames <- gsub("\u03c5", "upsilon", correctColNames)
    correctColNames <- gsub("\u03c6", "phi", correctColNames)
    correctColNames <- gsub("\u03c7", "chi", correctColNames)
    correctColNames <- gsub("\u03c8", "psi", correctColNames)
    correctColNames <- gsub("\u03c9", "omega", correctColNames)

    correctColNames <- gsub("\u0391", "Alpha", colnames(npcTableW))
    correctColNames <- gsub("\u0392", "Beta", correctColNames)
    correctColNames <- gsub("\u0393", "Gamma", correctColNames)
    correctColNames <- gsub("\u0394", "Delta", correctColNames)
    correctColNames <- gsub("\u0395", "Epsilon", correctColNames)
    correctColNames <- gsub("\u0396", "Zeta", correctColNames)
    correctColNames <- gsub("\u0397", "Eta", correctColNames)
    correctColNames <- gsub("\u0398", "Theta", correctColNames)
    correctColNames <- gsub("\u0399", "Iota", correctColNames)
    correctColNames <- gsub("\u039a", "Kappa", correctColNames)
    correctColNames <- gsub("\u039b", "Lambda", correctColNames)
    correctColNames <- gsub("\u039c", "Mu", correctColNames)
    correctColNames <- gsub("\u039d", "Nu", correctColNames)
    correctColNames <- gsub("\u039e", "Xi", correctColNames)
    correctColNames <- gsub("\u039f", "Omicron", correctColNames)
    correctColNames <- gsub("\u03a0", "Pi", correctColNames)
    correctColNames <- gsub("\u03a1", "Rho", correctColNames)
    correctColNames <- gsub("\u03a3", "Sigma", correctColNames)
    correctColNames <- gsub("\u03a4", "Tau", correctColNames)
    correctColNames <- gsub("\u03a5", "Upsilon", correctColNames)
    correctColNames <- gsub("\u03a6", "Phi", correctColNames)
    correctColNames <- gsub("\u03a7", "Chi", correctColNames)
    correctColNames <- gsub("\u03a8", "Psi", correctColNames)
    correctColNames <- gsub("\u03a9", "Omega", correctColNames)

    colnames(npcTableW) <- correctColNames

    # Replacing NA with 0
    npcTableW[is.na(npcTableW)] <- 0

    # Make dissimilarity matrix
    npcDisMat <- as.matrix(vegan::vegdist(npcTableW[, 2:ncol(npcTableW)],
                                          method = "jaccard"))

    colnames(npcDisMat) <- npcTableW$compound
    rownames(npcDisMat) <- npcTableW$compound

    # Setting dissimilarity of pairs of unknown compounds from NaN to 1
    npcDisMat[is.nan(npcDisMat)] <- 1

    # If unknown compounds should have mean dissimilarity
    if (unknownCompoundsMean && any(is.na(npcTable$pathway))) {

      # Calculate and add mean compound dissimilarity of known compounds
      npcKnown <- npcDisMat[-which(is.na(npcTable$pathway)),
                            -which(is.na(npcTable$pathway))]

      meanDis <- mean(npcKnown[lower.tri(npcKnown)], na.rm = TRUE)

      npcDisMat[which(is.na(npcTable$pathway)),] <- meanDis
      npcDisMat[,which(is.na(npcTable$pathway))] <- meanDis

      diag(npcDisMat) <- 0
    }
    compoundDisMatList[["npcDisMat"]] <- npcDisMat
  }

  if ("PubChemFingerprint" %in% type) { # Dissimilarities from Fingerprints

    httr::set_config(httr::config(http_version = 0))
    if(!curl::has_internet()) {
      message("No internet connection available to download compound data.")
      return(invisible(NULL))
    }

    message("Calculating compound dissimilarity matrix using Fingerprints...")

    if(any(is.na(compoundData$inchikey))){
      message("Fingerprint calculations: There are compounds with missing inchikey.")
    }

    # Get CID from inchikey
    compoundCID <- webchem::get_cid(query = compoundData$inchikey,
                                    from = "inchikey",
                                    match = "first")

    if (all(is.na(compoundCID$cid))) {
      stop("No data could be downloaded with the supplied InChIKeys,
           to calculate compound dissimilarities using PubChemFingerprint.")
    }

    # If there are inchikey NAs (pubchemCidToSDF does not handle this)
    if(any(is.na(compoundCID$cid))) {

      # Get fingerprints and show connections to close ChemmineR connection
      cidNoNA <- compoundCID$cid[!is.na(compoundCID$cid)]
      compoundSDF <- ChemmineR::pubchemCidToSDF(as.numeric(cidNoNA))
      suppressWarnings(showConnections())
      compoundbit <- ChemmineR::fp2bit(compoundSDF)
      compoundFinger <- as.data.frame(compoundbit@fpma)
      colnames(compoundFinger) <- c(paste0("bit", seq(1:881)))

      # Make data frame with 0 values for the compounds with NA inchikey,
      # to ensure dissimilarity calculations are done correctly below
      unknownFinger <- as.data.frame(matrix(data = 0,
                                            nrow = sum(is.na(compoundCID$cid)),
                                            ncol = ncol(compoundFinger)))
      colnames(unknownFinger) <- c(paste0("bit", seq(1:881)))


      # Get the rows for compounds with NA inchikey into correct place
      unknownRow <- which(is.na(compoundCID$cid))

      # Put each NA row into it's correct place using rbind()
      for (i in 1:length(unknownRow)) {
        if(unknownRow[i] == 1) { # If first compound is NA
          compoundFinger <- rbind(unknownFinger[i,], compoundFinger)
        } else if (unknownRow[i] > nrow(compoundFinger)) { # If last are NA
          compoundFinger <- rbind(compoundFinger, unknownFinger[i,])
        } else {
          compoundFinger <- rbind(compoundFinger[1:(unknownRow[i]-1),],
                                  unknownFinger[i,],
                                  compoundFinger[unknownRow[i]:
                                                   nrow(compoundFinger),])
        }
      }
    } else { # If no inchikey NAs
      compoundSDF <- ChemmineR::pubchemCidToSDF(as.numeric(compoundCID$cid))
      suppressWarnings(showConnections())
      compoundbit <- ChemmineR::fp2bit(compoundSDF)
      compoundFinger <- as.data.frame(compoundbit@fpma)
      colnames(compoundFinger) <- c(paste0("bit", seq(1:881)))
    }

    fingerDis <- vegan::vegdist(compoundFinger, method = "jaccard")
    fingerDisMat <- as.matrix(fingerDis)
    colnames(fingerDisMat) <- compoundData$compound
    rownames(fingerDisMat) <- compoundData$compound

    fingerDisMat[is.nan(fingerDisMat)] <- 1

    # If unknown compounds should have mean dissimilarity
    if (unknownCompoundsMean && any(is.na(compoundCID$cid))) {

      fingerKnown <- fingerDisMat[-unknownRow, -unknownRow]

      meanDis <- mean(fingerKnown[lower.tri(fingerKnown)])

      fingerDisMat[unknownRow,] <- meanDis
      fingerDisMat[,unknownRow] <- meanDis

      diag(fingerDisMat) <- 0
    }
    compoundDisMatList[["fingerDisMat"]] <- fingerDisMat
  }

  if ("fMCS" %in% type) { # Dissimilarities from fMCS

    httr::set_config(httr::config(http_version = 0))
    if(!curl::has_internet()) {
      message("No internet connection available to download compound data.")
      return(invisible(NULL))
    }

    message("Calculating compound dissimilarity matrix using fMCS...
            For larger datasets, this may take a very long time.")

    # Get CID from inchikey
    compoundCID <- webchem::get_cid(query = compoundData$inchikey,
                                    from = "inchikey",
                                    match = "first")

    if (all(is.na(compoundCID$cid))) {
      stop("No data could be downloaded with the supplied InChIKeys,
           to calculate compound dissimilarities using fMCS.")
    }

    if(any(is.na(compoundData$inchikey))){
      message("fMCS calculations: There are compounds with missing inchikey.")
    }

    if (any(is.na(compoundCID$cid))) { # If there are inchikey NAs

      compoundSDF <- ChemmineR::pubchemCidToSDF(as.numeric(compoundCID$cid[!is.na(compoundCID$cid)]))
      suppressWarnings(showConnections())

      unknownRow <- which(is.na(compoundCID$cid))

      # Doing fMCS with 1 atom and 1 bond mismatch, note that this similarity
      # (Tanimoto = Jaccard dissimilarity) and not dissimilarity, and note
      # that output is matrix
      fmcsSimMat <- sapply(ChemmineR::cid(compoundSDF),
                           function(x) fmcsR::fmcsBatch(compoundSDF[x],
                                                 compoundSDF,
                                                 au = 1,
                                                 bu = 1)[,"Tanimoto_Coefficient"])

      meanSim <- mean(fmcsSimMat[lower.tri(fmcsSimMat)])

      # Putting rows/column with unknown compounds into correct place
      # in matrix, and add mean value or 0 depending on unknownCompoundsMean
      if (unknownCompoundsMean) { # Adding mean similarity

        for (i in 1:length(unknownRow)) {

          if(unknownRow[i] == 1) { # If first compound is NA
            fmcsSimMat <- rbind(meanSim, fmcsSimMat)
            fmcsSimMat <- cbind(meanSim, fmcsSimMat)
          } else if (unknownRow[i] > nrow(fmcsSimMat)) { # If last is NA
            fmcsSimMat <- rbind(fmcsSimMat, meanSim)
            fmcsSimMat <- cbind(fmcsSimMat, meanSim)
          } else {
            fmcsSimMat <- rbind(fmcsSimMat[1:(unknownRow[i]-1),],
                                meanSim,
                                fmcsSimMat[unknownRow[i]:nrow(fmcsSimMat),])
            fmcsSimMat <- cbind(fmcsSimMat[,1:(unknownRow[i]-1)],
                                meanSim,
                                fmcsSimMat[,unknownRow[i]:ncol(fmcsSimMat)])
          }
        }
      } else { # Adding 0 similarity

        for (i in 1:length(unknownRow)) {

          if(unknownRow[i] == 1) { # If first compound is NA
            fmcsSimMat <- rbind(0, fmcsSimMat)
            fmcsSimMat <- cbind(0, fmcsSimMat)
          } else if (unknownRow[i] > nrow(fmcsSimMat)) { # If last is NA
            fmcsSimMat <- rbind(fmcsSimMat, 0)
            fmcsSimMat <- cbind(fmcsSimMat, 0)
          } else {
            fmcsSimMat <- rbind(fmcsSimMat[1:(unknownRow[i]-1),],
                                0,
                                fmcsSimMat[unknownRow[i]:nrow(fmcsSimMat),])
            fmcsSimMat <- cbind(fmcsSimMat[,1:(unknownRow[i]-1)],
                                0,
                                fmcsSimMat[,unknownRow[i]:ncol(fmcsSimMat)])
          }
        }
      }

      diag(fmcsSimMat) <- 1

    } else { # If no inchikey NAs

      compoundSDF <- ChemmineR::pubchemCidToSDF(as.numeric(compoundCID$cid))
      suppressWarnings(showConnections())

      fmcsSimMat <- sapply(ChemmineR::cid(compoundSDF),
                           function(x) fmcsR::fmcsBatch(compoundSDF[x],
                                                 compoundSDF,
                                                 au = 1,
                                                 bu = 1)[,"Tanimoto_Coefficient"])
    }

    # From similarity to dissimilarity
    fmcsDisMat <- 1 - fmcsSimMat
    colnames(fmcsDisMat) <- compoundData$compound
    rownames(fmcsDisMat) <- compoundData$compound

    compoundDisMatList[["fmcsDisMat"]] <- fmcsDisMat
  }

  # If >1 type, mean and and potentially mix are calculated
  if (length(type) > 1) {

    # Matrix with mean values
    compoundDisMatList[["meanDisMat"]] <- Reduce("+", compoundDisMatList) /
      length(compoundDisMatList)

    # If type includes NPClassifier also a matrix with mix values is calculated
    if ("NPClassifier" %in% type) {

      compoundDisMatList[["mixDisMat"]] <- compoundDisMatList[["npcDisMat"]]

      if (length(type) == 3) { # If all 3 types

        fingerfmcsmean <- (compoundDisMatList$fingerDisMat +
                             compoundDisMatList$fmcsDisMat) / 2

        compoundDisMatList$mixDisMat[compoundDisMatList$mixDisMat == 0] <-
          fingerfmcsmean[compoundDisMatList$mixDisMat == 0] / 2

      } else if ("PubChemFingerprint" %in% type) { # If NPC and Fingerprint

        compoundDisMatList$mixDisMat[compoundDisMatList$mixDisMat == 0] <-
          compoundDisMatList$fingerDisMat[compoundDisMatList$mixDisMat == 0] / 2

      } else if ("fMCS"  %in% type) { # If NPC and fMCS

        compoundDisMatList$mixDisMat[compoundDisMatList$mixDisMat == 0] <-
          compoundDisMatList$fmcsDisMat[compoundDisMatList$mixDisMat == 0] / 2

      }
    }

  }
  message("Done!")
  return(compoundDisMatList)
}

Try the chemodiv package in your browser

Any scripts or data that you put into this service are public.

chemodiv documentation built on Aug. 18, 2023, 1:08 a.m.