R/ConservedWaters.R
In vanddraabe: Identification and Statistical Analysis of Conserved Waters Near Proteins

## ConservedWaters.R
##
## dec-28-2015 (exe) created
## may-05-2016 (exe) ConservedWaters: copy PDB files to used and rejected
##                   directories
## may-05-2016 (exe) ConservedWaters: rewrote PDB quality evaluations
## jul-30-2016 (exe) ConservedWaters: remove modeled heavy atoms
## sep-16-2016 (exe) split original file to create ConservedWaters.R
## oct-07-2016 (exe) added chains of interest check for valid chain IDs
## oct-07-2016 (exe) the following functions replace blocks of code
##                   - getRCSBdata function
##                   - DetermineChainsOfInterest function
##                   - RetainChainsOfInterest function
##                   - RemoveHydrogenAtoms function
##                   - RemoveModeledAtoms function
##                   - RetainWatersWithinX function
## oct-07-2016 (exe) creates ALL and PASSED conserved waters PyMOL script files
## jan-23-2017 (exe) update user provided parameter checks
## feb-07-2017 (exe) updated documentation
## apr-18-2017 (exe) added check for same PDBids to indicate MDS trajectory
##                   being use for structures
## apr-25-2017 (exe) added ConservedWaters.MDS()
## jul-25-2017 (exe) updated documentation
## jul-28-2017 (exe) updated NormalizedBvalue and Mobility calls;
##                   changed Bvalue parameter to Bvalues to match code
## jul-31-2017 (exe) updated the ConservedWaters(), ConservedWaters.MDS() and
##                   ConservedWaterStats() documentation
## aug-02-2017 (exe) added prefix to filenames.used variable
## aug-09-2017 (exe) added @importFrom stats ...
##
## Please direct all questions to Emilio Xavier Esposito, PhD
## exeResearch LLC, East Lansing, Michigan 48823 USA
## http://www.exeResearch.com
## emilio AT exeResearch DOT com
## emilio DOT esposito AT gmail DOT com
##
## Copyright (c) 2017, Emilio Xavier Esposito
##
## Permission is hereby granted, free of charge, to any person obtaining
## a copy of this software and associated documentation files (the
## "Software"), to deal in the Software without restriction, including
## without limitation the rights to use, copy, modify, merge, publish,
## distribute, sublicense, and/or sell copies of the Software, and to
## permit persons to whom the Software is furnished to do so, subject to
## the following conditions:
##
## The above copyright notice and this permission notice shall be
## included in all copies or substantial portions of the Software.
##
## THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
## EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
## MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
## NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
## LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
## OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
## WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
##
## Based on http://opensource.org/licenses/MIT
##




## conserved waters docs -------------------------------------------------------
#' @title Conserved Crystallographic Waters
#' @description Identifies conserved crystallographic waters from a collection
#'   of PDBs.
#' @details Only atoms within (less than or equal to) 5.10 Angstroms of the
#'   protein structures are included.
#'
#' @param prefix Directory of aligned structures; string.
#' @param cluster Oxygen atoms within 2.4 Angstroms or less of each other are
#'   considered a cluster; numeric. Default value is 2.4 Angstroms.
#' @param mobility A normalization method to identify the amount of variance an
#'   atom has within a structure; numeric. Calculated mobility values equal to
#'   or greater than the provided value will be removed from analysis. Default
#'   value is 2.0. See [Mobility()] for more information.
#' @param nBvalue The number of standard deviations from the mean for the water
#'   oxygens' B-values within the structure of interest; numeric. Calculated
#'   normalized B-values equal to or greater than the provided value will be removed
#'   from analysis. Default value is 1.0. See [NormalizedBvalue()] for
#'   more information.
#' @param chain The chain to examine. The user can define "first" and the first
#'   chain alphabetically will be selected; this is the default. Defining "all"
#'   will result in all chains being explored. Alternatively the user can define
#'   individual the chains to include in the analysis; for example, `c("A",
#'   "B", "C")`. When defining chains, the chain designation _**must
#'   be characters**_.
#' @param prot.h2o.dist.min The minimum distance (in Angstroms) between the
#'   protein and waters to be considered for the conserved water clusters. Water
#'   oxygen atoms greater than this distance are removed from the analysis.
#'   Default value is 5.10 Angstroms.
#' @param cluster.method Method of clustering the waters; default is "complete".
#'   Any other method accepted by the [stats::hclust()] or
#'   [fastcluster::hclust()] functions are appropriate. The original method used
#'   by Sanschagrin and Kuhn is the complete linkage clustering method and is
#'   the default. Other options include "ward.D" (equivilant to the only Ward
#'   option in `R` versions 3.0.3 and earlier), "ward.D2" (implements Ward's
#'   1963 criteria; see Murtagh and Legendre 2014), or "single" (related to the
#'   minimal spanning tree method and adopts a "friend of friends" clustering
#'   method). Please see [fastcluster::hclust()] for additional and complete
#'   information regarding clustering explanations.
#' @param PDBinfo The PDB information for all structures within the analysis.
#'   This information is obtained using the [getRCSBdata()] function.
#' @param filename The filename prefix for the returned results. Default is
#'   "ProteinSystem"
#'
#' @return
#'   This function returns:
#'   * **h2o.cluster.all**: Clusters constructed from all waters present in
#'     the aligned PDB structures.
#'   * **h2o.cluster.passed**: Clusters constructed from waters that passed
#'     the [Mobility()] and [NormalizedBvalue()] evaluations.
#'   * **h2o.cluster.summary**: Summary of water clusters
#'   * **Excel workbook**: containing the Cluster Statistics, Cluster Summaries
#'     for **all** and **passed** waters, Occurrence Summaries for **all** and
#'     **passed** waters, and the Initial Water Data data as individual tabs
#'   * **call**: The user provided parameters for the function
#'
#' @export
#'
#' @import bio3d
#' @importFrom stats aggregate dist
#'
#' @author Emilio Xavier Esposito \email{emilio@@exeResearch.com}
#'
#' @references
#'   Paul C Sanschagrin and Leslie A Kuhn. Cluster analysis of
#'   consensus water sites in thrombin and trypsin shows conservation between
#'   serine proteases and contributions to ligand specificity. _Protein
#'   Science_, 1998, **7** (_10_), pp 2054-2064.
#'   [DOI: 10.1002/pro.5560071002](http://doi.org/10.1002/pro.5560071002)
#'   [PMID: 9792092](http://www.ncbi.nlm.nih.gov/pubmed/9792092)
#'   [WatCH webpage](http://www.kuhnlab.bmb.msu.edu/software/watch/index.html)
#'
#'   Hitesh Patel, Bjorn A. Gruning, Stefan Gunther, and Irmgard Merfort.
#'   PyWATER: a PyMOL plug-in to find conserved water molecules in proteins by
#'   clustering. _Bioinformatics_, 2014, **30** (_20_), pp 2978-2980.
#'   [DOI: 10.1093/bioinformatics/btu424](http://doi.org/10.1093/bioinformatics/btu424)
#'   [PMID: 24990608](http://www.ncbi.nlm.nih.gov/pubmed/24990608)
#'   [PyWATER on GitHub](https://github.com/hiteshpatel379/PyWATER/blob/master/README.rst)
#'
ConservedWaters <- function(prefix="", # "thrombin_fitlsq_1.0ang/",
                            cluster=2.4, mobility=2.0, nBvalue=1.0,
                            chain="first",
                            prot.h2o.dist.min=5.10,
                            cluster.method="complete",
                            PDBinfo,
                            filename="ProteinSystem") {


  ## THE SETUP -----------------------------------------------------------------
  ## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ##----- the provided call
  the.call <- match.call()

  ##----- create the date and time portion of the filenames
  current.time <- Sys.time()
  now.date.time <- FileTimeStamp(current.time)

  ##----- rename user defined variables
  cutoff.nBvalue <- nBvalue
  cutoff.mobility <- mobility
  cutoff.cluster <- cluster + 0.001
  cutoff.prot.h2o.dist <- prot.h2o.dist.min + 0.001
  chains.explore <- toupper(chain)


  ## CHECK THE USER PROVIDED VALUES --------------------------------------------
  ## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ##----- nBvalue, mobility, cluster, and prot.h2o.dist.min
  if ( !is.numeric(cutoff.nBvalue) ||
       is.nan(cutoff.nBvalue) ||
       (cutoff.nBvalue <= 0.0) ) { cutoff.nBvalue <- NULL }

  if ( !is.numeric(cutoff.mobility) ||
       is.nan(cutoff.mobility) ||
       (cutoff.mobility <= 0.0) ) { cutoff.mobility <- NULL }

  if ( !is.numeric(cutoff.cluster) ||
       is.nan(cutoff.cluster) ||
       (cutoff.cluster <= 0.0) ) {
    cutoff.cluster <- 2.401
    mess <- paste("Cluster cutoff must be positive and greater than 0.0. Being",
                  "set to the default of 2.4", sep=" ")
    message(mess)
  }

  if ( !is.numeric(cutoff.prot.h2o.dist) ||
       is.nan(cutoff.prot.h2o.dist) ||
       (cutoff.prot.h2o.dist <= 0.0) ) { cutoff.prot.h2o.dist <- NA }

  ##----- check the user provided chain(s) to explore
  chains.valid <- c("FIRST", "ALL", LETTERS)
  chains.explore.vs.valid <- match(chains.explore, chains.valid)
  chains.explore.vs.valid.NA <- anyNA(chains.explore.vs.valid)
  if ( chains.explore.vs.valid.NA == TRUE ) {
    chains.invalid.idx <- match(NA, chains.explore.vs.valid)
    chains.invalid <- chains.explore[chains.invalid.idx]
    chains.stop <- paste("The provided chains of: ", toString(chains.explore),
                         " contains the invalid chain identifier: ",
                         toString(chains.invalid), ". Please only use the ",
                         "following chain identifiers: ",
                         toString(chains.valid), sep = "")
    stop(chains.stop)
  }

  ##----- check the clustering method
  cluster.method <- check.cluster.method(cluster.method)

  ## READ IN THE PDB STRUCTURES ------------------------------------------------
  ## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ##----- get list of PDB files within prefix
  pdb.location <- ReturnPDBfullPath(prefix)

  ##----- determine PDB IDs
  pdb.ids <- ExtractPDBids(pdb.location)
  ##--- determine if the pdb.ids are all the same
  num.pdb.structures <- length(pdb.ids)
  num.pdb.ids.unique <- length(unique(pdb.ids))
  if ( num.pdb.structures != num.pdb.ids.unique ) {
    mess <- paste("There are", num.pdb.structures, "PDB structures yet there",
                  "is only", num.pdb.ids.unique, "unique PDB structure names.",
                  "Thus, the base filename is used to identify the individual",
                  "structures.",sep = " ")
    message(mess)
    pdb.ids <- basename(pdb.location)

    ##--- check for flsq in name
    num.flsq <- sum(grepl(pattern = "flsq", x = pdb.ids))
    if ( num.flsq == num.pdb.structures ) {
      pdb.ids <- gsub(pattern = ".pdb_flsq.pdb", replacement = "", x = pdb.ids)
    } else {
      pdb.ids <- gsub(pattern = ".pdb", replacement = "", x = pdb.ids)
    }
  }


  ##----- determine if enough structures are provided
  if ( num.pdb.structures <= 3 ) {
    stop("Water cluster analysis requires more than 3 PDB (structure) file.")
  }

  ##----- which chains to explore?
  chains.oi <- DetermineChainsOfInterest(chains.explore)

  ##----- read in aligned PDBs and create waters data.frame for clustering
  message("----- Reading in the aligned structures _____")
  h2o.df <- NULL
  modeled.heavy.atoms.pdb.ids <- no.h2o.pdb.ids <- got.h2o.pdb.ids <- NULL
  num.h2o.per.pdb <- rep(NA, num.pdb.structures)
  for (curr.pdb in 1:num.pdb.structures) {

    ##--- read in PDB
    curr.pdb.id <- pdb.ids[curr.pdb]
    read.mess <- paste("Reading structure ", curr.pdb.id, "...", sep = "")
    message(read.mess)
    pdb <- bio3d::read.pdb(file = pdb.location[curr.pdb])
    atoms.oi <- pdb$atom

    ##----- determine the chain(s) of interest
    atoms.chains.oi <- RetainChainsOfInterest(atoms.oi,
                                              chains.explore,
                                              chains.oi)

    ##----- remove hydrogen atoms (yeah, they shouldn't be there.
    ##      but sometimes...)
    atoms.chains.oi <- RemoveHydrogenAtoms(atoms.chains.oi)

    ##----- remove modeled heavy atoms
    ##--- identify the protein, het, and waters atoms
    prot.het.h2o.idc.pre.length <-
      unlist(lapply(ProtHetWatIndices(data = atoms.chains.oi), length))
    atoms.num.pre <- nrow(atoms.chains.oi)
    ##--- remove modeled atoms
    atoms.chains.oi <- RemoveModeledAtoms(atoms.chains.oi)
    ##--- compare post and initial structures
    atoms.num <- nrow(atoms.chains.oi)
    if ( atoms.num < atoms.num.pre ) {
      prot.het.h2o.idc.post.length <-
        unlist(lapply(ProtHetWatIndices(data = atoms.chains.oi), length))

      modeled.heavy.atoms.pdb.ids <- append(modeled.heavy.atoms.pdb.ids,
                                            curr.pdb.id)

      ##-- number of modeled atom(s) for each type of atom
      prot.het.h2o.lengths <- prot.het.h2o.idc.pre.length -
        prot.het.h2o.idc.post.length
      ##-- make the message
      modeled.mess <- paste("There are",
                            prot.het.h2o.lengths[1],
                            "modeled protein atoms;",
                            prot.het.h2o.lengths[2],
                            "modeled hetro (non-water) atoms;",
                            prot.het.h2o.lengths[3],
                            "modeled water atoms.\n", sep = " ")
      message(modeled.mess)
    }

    ##--- identify the protein, het, and waters atoms (again...)
    prot.het.h2o.idc <- ProtHetWatIndices(data = atoms.chains.oi)
    prot.idc <- prot.het.h2o.idc$prot.idc
    het.idc <- prot.het.h2o.idc$het.idc
    h2o.idc <- prot.het.h2o.idc$h2o.idc

    ##--- calculate the pairwise distances
    atoms.dist <- as.matrix(dist(atoms.chains.oi[, c("x","y","z")],
                                 method = "euclidean",
                                 diag = TRUE, upper = TRUE))
    diag(atoms.dist) <- NA

    ##--- retain waters within user defined distance (Angstroms) of the protein
    if ( (is.finite(cutoff.prot.h2o.dist)) & (cutoff.prot.h2o.dist > 0.0) ) {
      ##--- the water indices for waters within user defined
      ##    Angstroms of the protein
      h2o.idc <- RetainWatersWithinX(atoms.dist,
                                     prot.het.h2o.idc,
                                     cutoff.prot.h2o.dist)
    }

    ##--- update the structure and distance matrix
    atom.idc <- sort(c(prot.idc, het.idc, h2o.idc))
    atoms.dist <- atoms.dist[atom.idc, atom.idc]
    atoms.chains.oi <- atoms.chains.oi[atom.idc, ]

    ##--- update the indices
    ##--- identify the protein, het, and waters atoms (again...)
    prot.het.h2o.idc <- ProtHetWatIndices(data = atoms.chains.oi)
    prot.idc <- prot.het.h2o.idc$prot.idc
    het.idc <- prot.het.h2o.idc$het.idc
    h2o.idc <- prot.het.h2o.idc$h2o.idc
    num.atoms <- nrow(atoms.chains.oi)

    ##--- number of atoms in each group
    # num.prot.atoms <- length(prot.idc)
    # num.het.atoms <- length(het.idc)
    num.h2o.per.pdb[curr.pdb] <- curr.num.h2o <- length(h2o.idc)

    ##--- check if the file has waters
    h2o.message <- paste(" - Structure", curr.pdb, "of", num.pdb.structures,
                         "-->>", curr.pdb.id, "has", curr.num.h2o, "waters.",
                         sep = " ")
    message(h2o.message)
    if ( curr.num.h2o > 0 ) {
      got.h2o.pdb.ids <- append(got.h2o.pdb.ids, curr.pdb.id)
    } else {
      no.h2o.pdb.ids <- append(no.h2o.pdb.ids, curr.pdb.id)
      next()
    }

    ##--- get residue and atom names
    names.residues <- atoms.chains.oi$resid
    ##- atom names for the het atoms are the element symbol
    names.atoms <- rep(NA, num.atoms)
    names.atoms[prot.idc] <- atoms.chains.oi$elety[prot.idc]
    names.atoms[het.idc]  <- atoms.chains.oi$elesy[het.idc]
    names.atoms[h2o.idc]  <- atoms.chains.oi$elety[h2o.idc]
    names.res.atoms <- paste(names.residues, names.atoms, sep = " ")

    ##--- calculate mobility and normalized b-value for each group
    ##-- protein atoms
    prot.mobility <- Mobility(Bvalues = atoms.chains.oi$b[prot.idc],
                              occupancy = atoms.chains.oi$o[prot.idc])
    prot.nBvalue <- NormalizedBvalue(Bvalues = atoms.chains.oi$b[prot.idc])
    ##-- het atoms
    het.mobility <- Mobility(Bvalues = atoms.chains.oi$b[het.idc],
                             occupancy = atoms.chains.oi$o[het.idc])
    het.nBvalue <- NormalizedBvalue(Bvalues = atoms.chains.oi$b[het.idc])
    ##-- water atoms
    h2o.mobility <- Mobility(Bvalues = atoms.chains.oi$b[h2o.idc],
                             occupancy = atoms.chains.oi$o[h2o.idc])
    h2o.nBvalue <- NormalizedBvalue(Bvalues = atoms.chains.oi$b[h2o.idc])

    ##--- add helper columns and the mobility and normalized b-value values
    ##    to the structure
    pdb.id <- pdb.ids[curr.pdb]
    ##- find current PDBid row in PDBinfo data.frame
    PDBinfo.curr.pdb.idx <- which(tolower(PDBinfo$structureId) %in% pdb.id)
    atoms.chains.oi <- cbind(atoms.chains.oi,
                            PDBid = pdb.id,
                            resolution = PDBinfo[PDBinfo.curr.pdb.idx, "resolution"],
                            rObserved = PDBinfo[PDBinfo.curr.pdb.idx, "rObserved"],
                            rFree = PDBinfo[PDBinfo.curr.pdb.idx, "rFree"],
                            mobility = NA,
                            nBvalue = NA,
                            stringsAsFactors = FALSE)
    ##-- mobility
    atoms.chains.oi$mobility[prot.idc] <- prot.mobility
    atoms.chains.oi$mobility[het.idc] <- het.mobility
    atoms.chains.oi$mobility[h2o.idc] <- h2o.mobility
    ##-- norm B-value
    atoms.chains.oi$nBvalue[prot.idc] <- prot.nBvalue
    atoms.chains.oi$nBvalue[het.idc] <- het.nBvalue
    atoms.chains.oi$nBvalue[h2o.idc] <- h2o.nBvalue

    ##--- perform bound water environment analysis
    bwe.colNames <- c("adn", "ahp.sum", "ahp.mu", "ahp.sd",
                      "hbonds",
                      "o.sum", "o.mu", "o.sd",
                      "b.exp.sum", "b.exp.mu", "b.exp.sd",
                      "mobility.sum", "mobility.mu", "mobility.sd",
                      "nBvalue.sum", "nBvalue.mu", "nBvalue.sd")
    ##-- protein atoms
    prot.bwe.results <- apply(atoms.dist[h2o.idc, ], 1,
                              FUN = BoundWaterEnvironment,
                              set.oi.idc = prot.idc,
                              # h2o.idc = h2o.idc,
                              names.atoms = names.atoms,
                              names.res.atoms = names.res.atoms,
                              structure = atoms.chains.oi,
                              radius = 3.6)
    df.prot.bwe.results <- data.frame(matrix(unlist(prot.bwe.results),
                                             nrow = length(h2o.idc),
                                             byrow = TRUE),
                                      stringsAsFactors = FALSE)
    colnames(df.prot.bwe.results) <- paste0("prot.", bwe.colNames)
    ##-- water atoms
    h2o.bwe.results <- apply(atoms.dist[h2o.idc, ], 1,
                              FUN = BoundWaterEnvironment,
                              set.oi.idc = h2o.idc,
                              # h2o.idc = h2o.idc,
                              names.atoms = names.atoms,
                              names.res.atoms = names.res.atoms,
                              structure = atoms.chains.oi,
                              radius = 3.6)
    df.h2o.bwe.results <- data.frame(matrix(unlist(h2o.bwe.results),
                                            nrow = length(h2o.idc),
                                            byrow = TRUE),
                                      stringsAsFactors = FALSE)
    colnames(df.h2o.bwe.results) <- paste0("h2o.", bwe.colNames)
    ##-- het atoms
    het.bwe.results <- apply(atoms.dist[h2o.idc, ], 1,
                             FUN = BoundWaterEnvironment,
                             set.oi.idc = het.idc,
                             # h2o.idc = h2o.idc,
                             names.atoms = names.atoms,
                             names.res.atoms = names.res.atoms,
                             structure = atoms.chains.oi,
                             radius = 3.6)
    df.het.bwe.results <- data.frame(matrix(unlist(het.bwe.results),
                                            nrow = length(h2o.idc),
                                            byrow = TRUE),
                                     stringsAsFactors = FALSE)
    colnames(df.het.bwe.results) <- paste0("het.", bwe.colNames)
    ##-- all atoms
    all.bwe.results <- apply(atoms.dist[h2o.idc, ], 1,
                             FUN = BoundWaterEnvironment,
                             set.oi.idc = c(prot.idc, h2o.idc, het.idc),
                             # h2o.idc = h2o.idc,
                             names.atoms = names.atoms,
                             names.res.atoms = names.res.atoms,
                             structure = atoms.chains.oi,
                             radius = 3.6)
    df.all.bwe.results <- data.frame(matrix(unlist(all.bwe.results),
                                            nrow = length(h2o.idc),
                                            byrow = TRUE),
                                     stringsAsFactors = FALSE)
    colnames(df.all.bwe.results) <- paste0("all.", bwe.colNames)
    ##-- put all the BoundWaterEnvironment results together
    df.bwe.results <- cbind(df.prot.bwe.results,
                            df.h2o.bwe.results,
                            df.het.bwe.results,
                            df.all.bwe.results,
                            stringsAsFactors = FALSE)
    ##-- add the BWE results to the PDB information
    bwe.placeholder <- data.frame(matrix(data = NA,
                                         nrow = num.atoms,
                                         ncol = ncol(df.bwe.results)),
                                         stringsAsFactors = FALSE)
    colnames(bwe.placeholder) <- colnames(df.bwe.results)
    atoms.chains.oi <- cbind(atoms.chains.oi, bwe.placeholder)
    atoms.chains.oi[h2o.idc, colnames(df.bwe.results)] <- df.bwe.results

    ##--- construct data.frame of ONLY water residues
    h2o.res <- atoms.chains.oi[h2o.idc, ]

    ##--- construct and assign unique rownames for each water residue
    h2o.rownames <- paste(pdb.id,
                          h2o.res$resid,
                          h2o.res$chain,
                          h2o.res$resno,
                          sep = "_")
    rownames(h2o.res) <- h2o.rownames

    ##--- append extracted water residues to 'water residue data.frame'
    h2o.df <- rbind(h2o.df, h2o.res)
  }


  ## PDB STRUCTURES SUMMARY ----------------------------------------------------
  ## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ##----- summary of structures imported
  num.h2o.per.pdb.imported <- num.h2o.per.pdb[num.h2o.per.pdb > 0]
  num.h2o <- sum(num.h2o.per.pdb.imported)
  pdb.id.imported <- unique(h2o.df$PDBid)
  num.pdbs.got.h2o <- length(got.h2o.pdb.ids)
  num.pdbs.no.h2o <- length(no.h2o.pdb.ids)
  num.pdbs.modeled.heavy <- length(modeled.heavy.atoms.pdb.ids)
  ##--- construct messages
  message.import <- paste(num.pdb.structures,
                          "structures were read for water cluster",
                          "analysis containing", num.h2o,
                          "water molecules.", sep = " ")
  message.got.h2o <- paste("  -", num.pdbs.got.h2o,
                           "Strucutures have water molecules.",
                           sep = " ")
  message.no.h2o <- paste("  -", num.pdbs.no.h2o,
                          "Strucuture(s) do NOT have water molecules.",
                          sep = " ")
  ##--- print the messages
  message("\n----- Summary about imported structures _____")
  message(message.import)
  message(message.got.h2o)
  message(message.no.h2o)
  if ( num.pdbs.no.h2o > 0 ) {
    message.no.h2o.pdb.ids <- paste(no.h2o.pdb.ids, collapse = ", ")
    message.no.h2o.2 <- paste("This structure(s) does NOT have waters: ",
                              message.no.h2o.pdb.ids, sep = "")
    message(message.no.h2o.2)
  }
  if ( num.pdbs.got.h2o <= 1 ) {
    stop("Water cluster analysis requires more than 1 PDB (structure)
         file to have water molecules.")
  }

  # if ( num.pdbs.modeled.heavy > 0 ) {
  #   message.modeled.heavy <- paste(modeled.heavy.atoms.pdb.ids, collapse=", ")
  #   message.modeled.2 <- paste("This structure(s) has heavy atoms with an occupancy value of 0: ",
  #                              message.modeled.heavy, sep="")
  #   message(message.modeled.2)
  # }

  ##----- filter waters using mobility and normalized b-value
  mobility.keep.tf <- nBvalue.keep.tf <- rep(TRUE, num.h2o)
  ##--- mobility
  if ( is.null(cutoff.mobility) == FALSE ) {
    mobility.keep.tf <- h2o.df$mobility < cutoff.mobility
    h2o.keep.tf <- mobility.keep.tf
  }
  ##--- normalized b-value
  if ( is.null(cutoff.nBvalue) == FALSE ) {
    nBvalue.keep.tf <- h2o.df$nBvalue < cutoff.nBvalue
    h2o.keep.tf <- nBvalue.keep.tf
  }
  ##--- based on the user-provided cutoffs, determine the waters for
  ##    conservation analysis
  if ( (is.null(cutoff.mobility) == FALSE) &
       (is.null(cutoff.nBvalue) == FALSE) ) {
    h2o.keep.tf <- (mobility.keep.tf + nBvalue.keep.tf) == 2
  }

  ##----- add mobility, normalized b-value, and keep/passed-cutoff info to
  ##      data.frame
  h2o.df <- cbind(h2o.df,
                  mobility.keep = mobility.keep.tf,
                  nBvalue.keep = nBvalue.keep.tf,
                  passed.cutoffs = h2o.keep.tf,
                  stringsAsFactors = FALSE)

  ##----- keep PDB structures where 50% of the waters PASS the mobility
  ##      and normalized b-value checks
  message(paste("\n----- Checking for high quality structures using B-value",
          "Normalization and Mobility _____", sep = " ") )
  num.h2o.per.pdb.passed <- as.vector(unlist(aggregate(h2o.df$passed.cutoffs,
                                                       list(h2o.df$PDBid),
                                                       sum)[2]))
  pct.h2o.per.pdb.eval <- num.h2o.per.pdb.passed / num.h2o.per.pdb.imported
  ##--- identify structures that PASS the 50% rule
  pdb.keep.tf <- pct.h2o.per.pdb.eval >= 0.50
  pdb.keep <- pdb.id.imported[pdb.keep.tf]
  ##--- keep structures that PASS the 50% rule
  h2o.df$passed.cutoffs[!(h2o.df$PDBid %in% pdb.keep)] <- FALSE
  h2o.df.passed <- h2o.df[h2o.df$passed.cutoffs == TRUE, ]
  ##--- report the structures that did NOT pass the 50% rule
  if ( length(pdb.keep) != length(pdb.id.imported) ) {
  message.50pct <- paste("The following structures were removed from analysis",
                         "because 50% or greater of their waters the did NOT",
                         "pass the mobility or normalized b-value evaluations.",
                         sep = " ")
  message(message.50pct)
  message( paste(pdb.id.imported[!pdb.keep.tf], collapse = ", ") )
  } else {
    message.50pct <- paste("Water molecules from all imported structures will",
                           "be used in the water conservation analysis.",
                           sep = " ")
    message(message.50pct)
  }


  ## CLUSTER WATER MOLECULES ---------------------------------------------------
  ## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ##----- clustering
  message("\n----- Clustering ALL waters from the provided structures _____")
  h2o.cluster.all <- ClusterWaters(data = h2o.df,
                                   cutoff.cluster,
                                   cluster.method = cluster.method)
  message(paste("----- Clustering waters that PASSED the B-value Normalization",
                "and Mobility _____"), sep = " ")
  h2o.cluster.passed <- ClusterWaters(data = h2o.df.passed,
                                      cutoff.cluster,
                                      cluster.method = cluster.method)

  ##----- merge the ALL and PASSED h2o.clusters.raw data.frames
  h2o.clusters.raw.all <- h2o.cluster.all$h2o.clusters.raw
  h2o.clusters.raw.all <- cbind(h2o.clusters.raw.all,
                                unique.name = rownames(h2o.clusters.raw.all),
                                stringsAsFactors = FALSE)
  h2o.clusters.raw.passed <- h2o.cluster.passed$h2o.clusters.raw
  h2o.clusters.raw.passed <- cbind(h2o.clusters.raw.passed,
                                   unique.name = rownames(h2o.clusters.raw.passed),
                                   stringsAsFactors = FALSE)
  h2o.raw.all.passed <- merge(x = h2o.clusters.raw.all,
                              y = h2o.clusters.raw.passed,
                              by.x = "unique.name",
                              by.y = "unique.name",
                              all.x = TRUE,
                              sort = FALSE)
  rownames(h2o.raw.all.passed) <- h2o.raw.all.passed$unique.name

  h2o.raw.all.passed <- h2o.raw.all.passed[order(rownames(h2o.raw.all.passed)), ]
  h2o.clusters.raw.all <- h2o.clusters.raw.all[order(rownames(h2o.clusters.raw.all)), ]

  h2o.raw.all.passed <- cbind(h2o.clusters.raw.all[, colnames(h2o.clusters.raw.all) != "unique.name"],
                              cluster.passed = h2o.raw.all.passed$cluster.y,
                              pct.conserved.passed = h2o.raw.all.passed$pct.conserved.y,
                              stringsAsFactors = FALSE)


  h2o.raw.all.passed <- h2o.raw.all.passed[order(h2o.raw.all.passed$cluster), ]





  ## CONSTRUCT CLUSTERED WATERS SUMMARY ----------------------------------------
  ## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ##----- construct the structure and water summary for the clusters
  message("----- Constructing the summary table _____\n")
  h2o.cluster.all.stats <- ConservedWaterStats(h2o.cluster = h2o.cluster.all,
                                               num.h2o.inital = num.h2o,
                                               num.pdbs.got.h2o = num.pdbs.got.h2o)
  h2o.cluster.passed.stats <- ConservedWaterStats(h2o.cluster = h2o.cluster.passed,
                                                  num.h2o.inital = num.h2o,
                                                  num.pdbs.got.h2o = sum(pdb.keep.tf))
  h2o.cluster.stats <- cbind(h2o.cluster.all.stats,
                             h2o.cluster.passed.stats,
                             stringsAsFactors = FALSE)
  message("----- Summary table written _____\n")


  ## WRITE CLUSTERED WATERS PDB FILES ------------------------------------------
  ## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ##----- write out PDB file with waters (o=% of structures with this water,
  ##      b=mean of b-values)
  ##--- all the provided waters
  message("----- Writing the conserved waters to PDB files _____")
  h2o.all.filename <- paste(filename,
                            "_ConservedWaters_ALL_",
                            now.date.time, ".pdb", sep = "")
  write.conservedWaters.pdb(file = h2o.all.filename,
                            h2o.clusters.summary = h2o.cluster.all$h2o.clusters.summary)

  ##--- the passed waters
  h2o.passed.filename <- paste(filename,
                               "_ConservedWaters_PASSED_",
                               now.date.time, ".pdb", sep = "")
  write.conservedWaters.pdb(file = h2o.passed.filename,
                            h2o.clusters.summary = h2o.cluster.passed$h2o.clusters.summary)

  message("----- PDB files written _____\n")


  ## WRITE RESULTS TO EXCEL WORKBOOK -------------------------------------------
  ## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ##----- create worksheet names
  CW.ClusterStatistics <- paste("ClusterStats", now.date.time, sep="_")
  CW.all.h2o.summ      <- paste("all_ClustSumm", now.date.time, sep="_")
  CW.all.occ.summ      <- paste("all_OccurSumm", now.date.time, sep="_")
  CW.pass.h2o.summ     <- paste("pass_ClustSumm", now.date.time, sep="_")
  CW.pass.occ.summ     <- paste("pass_OccurSumm", now.date.time, sep="_")
  CW.init.h2o.data     <- paste("InitWaterData", now.date.time, sep="_")

  ##----- construct workbook name
  filename.xlsx <- paste(filename, "_DATA_RESULTS.xlsx", sep="")
  ##--- open existing excel workbook if available
  if ( file.exists(filename.xlsx) == TRUE ) {
    results.wb <- openxlsx::loadWorkbook(file=filename.xlsx)
  } else {  ##--- construct the workbook
    results.wb <- openxlsx::createWorkbook()
  }

  ##----- write the results to an excel workbook
  message("----- Writing results to Excel workbook _____")

  ##--- water cluster summary statistics
  results.wb <- oxClusterStatsSheet(wb.name = results.wb,
                                    sheet.name = CW.ClusterStatistics,
                                    df = h2o.cluster.stats)

  ##--- all clustered waters
  ##- cluster summary
  results.wb <- oxClusterSummarySheet(wb.name = results.wb,
                                      sheet.name = CW.all.h2o.summ,
                                      df = h2o.cluster.all$h2o.clusters.summary)

  ##- water occurrence summary
  results.wb <- oxWaterOccurrenceSheet(wb.name = results.wb,
                                       sheet.name = CW.all.occ.summ,
                                       df = h2o.cluster.all$h2o.occurrence)

  ##-- passed clustered waters
  ##- cluster summary
  results.wb <- oxClusterSummarySheet(wb.name = results.wb,
                                      sheet.name = CW.pass.h2o.summ,
                                      df = h2o.cluster.passed$h2o.clusters.summary)

  ##- water occurrence summary
  results.wb <- oxWaterOccurrenceSheet(wb.name = results.wb,
                                       sheet.name = CW.pass.occ.summ,
                                       df = h2o.cluster.passed$h2o.occurrence)

  ##-- the initial coordinates
  results.wb <- oxInitWaterDataSheet(wb.name = results.wb,
                                     sheet.name = CW.init.h2o.data,
                                     df = h2o.raw.all.passed)

  ##--- write the workbook
  openxlsx::saveWorkbook(results.wb, filename.xlsx, overwrite=TRUE)

  message("----- Results written to Excel workbook _____\n")


  ## RESULTS TO USER -----------------------------------------------------------
  ## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ##----- filenames and date-time information
  filenames.used <- list(prefix=prefix,
                         conserved.all=h2o.all.filename,
                         conserved.passed=h2o.passed.filename,
                         xlsx=filename.xlsx,
                         when=now.date.time)

  ##----- return the results
  message("----- Done! _____")
  list(h2o.cluster.stats = h2o.cluster.stats,
       h2o.cluster.all = h2o.cluster.all,
       h2o.cluster.passed = h2o.cluster.passed,
       h2o.clusters.summary = h2o.raw.all.passed,  ## merged cluster number and conservation percentage
       filenames.used = filenames.used,
       MDS = FALSE,
       call = the.call
  )
}




## conserved MDS waters docs ---------------------------------------------------
#' @title Conserved Molecular Dynamics Simulation Waters
#' @description Identifies conserved molecular dynamics simulation (MDS) waters
#'   from a collection of PDBs.
#' @details Only atoms within (less than or equal to) 5.10 Angstroms of the
#'   protein structures are included.
#'
#' @param prefix Directory of aligned structures; string.
#' @param cluster Oxygen atoms within 2.4 Angstroms or less of each other are
#'   considered a cluster; numeric. Default value is 2.4 Angstroms.
#' @param chain The chain to examine. The user can define "first" and the first
#'   chain alphabetically will be selected; this is the default. Defining "all"
#'   will result in all chains being explored. Alternatively the user can define
#'   individual the chains to include in the analysis; for example, `c("A",
#'   "B", "C")`. When defining chains, the chain designation _**must
#'   be characters**_.
#' @param prot.h2o.dist.min The minimum distance (in Angstroms) between the
#'   protein and waters to be considered for the conserved water clusters. Water
#'   oxygen atoms greater than this distance are removed from the analysis.
#'   Default value is 5.10 Angstroms.
#' @param cluster.method Method of clustering the waters; default is "complete".
#'   Any other method accepted by the [stats::hclust()] or
#'   [fastcluster::hclust()] functions are appropriate. The original method used
#'   by Sanschagrin and Kuhn is the complete linkage clustering method and is
#'   the default. Other options include "ward.D" (equivilant to the only Ward
#'   option in `R` versions 3.0.3 and earlier), "ward.D2" (implements Ward's
#'   1963 criteria; see Murtagh and Legendre 2014), or "single" (related to the
#'   minimal spanning tree method and adopts a "friend of friends" clustering
#'   method). Please see [fastcluster::hclust()] for additional and complete
#'   information regarding clustering explanations.
#' @param filename The filename prefix for the returned results. Default is
#'   "ProteinSystem"
#'
#' @return
#'   This function returns:
#'   * **h2o.cluster.all**: Clusters constructed from all waters present in
#'     the aligned PDB structures.
#'   * **h2o.cluster.passed**: Clusters constructed from waters that passed
#'     the [Mobility()] and [NormalizedBvalue()] evaluations.
#'   * **h2o.cluster.summary**: Summary of water clusters
#'   * **Excel workbook**: containing the Cluster Statistics, Cluster Summaries
#'     for **all** and **passed** waters, Occurrence Summaries for **all** and
#'     **passed** waters, and the Initial Water Data data as individual tabs
#'   * **call**: The user provided parameters for the function
#'
#' @export
#'
#' @import bio3d
#' @importFrom stats dist
#'
#' @author Emilio Xavier Esposito \email{emilio@@exeResearch.com}
#'
#' @references
#'   Paul C Sanschagrin and Leslie A Kuhn. Cluster analysis of
#'   consensus water sites in thrombin and trypsin shows conservation between
#'   serine proteases and contributions to ligand specificity. _Protein
#'   Science_, 1998, **7** (_10_), pp 2054-2064.
#'   [DOI: 10.1002/pro.5560071002](http://doi.org/10.1002/pro.5560071002)
#'   [PMID: 9792092](http://www.ncbi.nlm.nih.gov/pubmed/9792092)
#'   [WatCH webpage](http://www.kuhnlab.bmb.msu.edu/software/watch/index.html)
#'
#'   Hitesh Patel, Bjorn A. Gruning, Stefan Gunther, and Irmgard Merfort.
#'   PyWATER: a PyMOL plug-in to find conserved water molecules in proteins by
#'   clustering. _Bioinformatics_, 2014, **30** (_20_), pp 2978-2980.
#'   [DOI: 10.1093/bioinformatics/btu424](http://doi.org/10.1093/bioinformatics/btu424)
#'   [PMID: 24990608](http://www.ncbi.nlm.nih.gov/pubmed/24990608)
#'   [PyWATER on GitHub](https://github.com/hiteshpatel379/PyWATER/blob/master/README.rst)
#'
ConservedWaters.MDS <- function(prefix="", # "thrombin_fitlsq_1.0ang/",
                                cluster=2.4,
                                chain="all",
                                prot.h2o.dist.min=5.10,
                                cluster.method="complete",
                                filename="ProteinSystem") {


  ## THE SETUP -----------------------------------------------------------------
  ## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ##_ the provided call -----
  the.call <- match.call()

  ##_ create the date and time portion of the filenames -----
  current.time <- Sys.time()
  now.date.time <- FileTimeStamp(current.time)

  ##_ rename user defined variables -----
  cutoff.cluster <- cluster + 0.001
  cutoff.prot.h2o.dist <- prot.h2o.dist.min + 0.001
  chains.explore <- toupper(chain)


  ## CHECK THE USER PROVIDED VALUES --------------------------------------------
  ## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ##_ cluster and prot.h2o.dist.min -----
  if ( !is.numeric(cutoff.cluster) ||
       is.nan(cutoff.cluster) ||
       (cutoff.cluster <= 0.0) ) {
    cutoff.cluster <- 2.401
    mess <- paste("Cluster cutoff must be positive and greater than 0.0. Being",
                  "set to the default of 2.4", sep=" ")
    message(mess)
  }

  if ( !is.numeric(cutoff.prot.h2o.dist) ||
       is.nan(cutoff.prot.h2o.dist) ||
       (cutoff.prot.h2o.dist <= 0.0) ) { cutoff.prot.h2o.dist <- NA }

  ##_ check the user provided chain(s) to explore -----
  chains.valid <- c("FIRST", "ALL", LETTERS)
  chains.explore.vs.valid <- match(chains.explore, chains.valid)
  chains.explore.vs.valid.NA <- anyNA(chains.explore.vs.valid)
  if ( chains.explore.vs.valid.NA == TRUE ) {
    chains.invalid.idx <- match(NA, chains.explore.vs.valid)
    chains.invalid <- chains.explore[chains.invalid.idx]
    chains.stop <- paste("The provided chains of: ", toString(chains.explore),
                         " contains the invalid chain identifier: ",
                         toString(chains.invalid), ". Please only use the ",
                         "following chain identifiers: ",
                         toString(chains.valid), sep = "")
    stop(chains.stop)
  }

  ##_ check the clustering method -----
  cluster.method <- tolower(cluster.method)
  cluster.methods.avail <- c("ward.d", "ward.d2", "single", "complete")
  if ( !any(cluster.method == cluster.methods.avail) ) {
    cluster.stop <- paste("Please select one of the following clustering",
                          "methods: \"complete\" (the default), \"ward.D\",",
                          "\"ward.D2\", or \"single\".", sep = " ")
    stop(cluster.stop)
  }
  if ( cluster.method == "ward.d" ) cluster.method <- "ward.D"
  if ( cluster.method == "ward.d2" ) cluster.method <- "ward.D2"


  ## READ IN THE PDB STRUCTURES ------------------------------------------------
  ## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ##_ get list of PDB files within prefix -----
  pdb.location <- ReturnPDBfullPath(prefix)

  ##_ determine PDB IDs -----
  # pdb.ids <- basename.MDS.pdb(pdb.location)
  pdb.ids <- ExtractPDBids(pdb.location)
  ##__ determine if the pdb.ids are all the same -----
  num.pdb.structures <- length(pdb.ids)
  num.pdb.ids.unique <- length(unique(pdb.ids))
  if ( num.pdb.structures != num.pdb.ids.unique ) {
    mess <- paste("There are", num.pdb.structures, "PDB structures yet there",
                  "is only", num.pdb.ids.unique, "unique PDB structure names.",
                  "Thus, the base filename is used to identify the individual",
                  "structures.",sep = " ")
    message(mess)
    pdb.ids <- basename(pdb.location)

    ##__ check for flsq in name -----
    num.flsq <- sum(grepl(pattern = "flsq", x = pdb.ids))
    if ( num.flsq == num.pdb.structures ) {
      pdb.ids <- gsub(pattern = ".pdb_flsq.pdb", replacement = "", x = pdb.ids)
    } else {
      pdb.ids <- gsub(pattern = ".pdb", replacement = "", x = pdb.ids)
    }
  }


  ##_ determine if enough structures are provided -----
  if ( num.pdb.structures <= 3 ) {
    stop("Water cluster analysis requires more than 3 PDB (structure) file.")
  }

  ##_ which chains to explore? -----
  chains.oi <- DetermineChainsOfInterest(chains.explore)

  ##_ read in aligned PDBs and create waters data.frame for clustering -----
  message("----- Reading in the aligned structures _____")
  h2o.df <- NULL
  modeled.heavy.atoms.pdb.ids <- no.h2o.pdb.ids <- got.h2o.pdb.ids <- NULL
  num.h2o.per.pdb <- rep(NA, num.pdb.structures)
  for (curr.pdb in 1:num.pdb.structures) {

    ##__ read in PDB -----
    curr.pdb.id <- pdb.ids[curr.pdb]
    read.mess <- paste("Reading structure ", curr.pdb.id, "...", sep = "")
    message(read.mess)
    pdb <- bio3d::read.pdb(file = pdb.location[curr.pdb])
    atoms.oi <- pdb$atom

    ##__ determine the chain(s) of interest -----
    atoms.chains.oi <- RetainChainsOfInterest(atoms.oi,
                                              chains.explore,
                                              chains.oi)

    ##__ remove hydrogen atoms -----
    ## (yeah, they shouldn't be there. but sometimes...)
    atoms.chains.oi <- RemoveHydrogenAtoms(atoms.chains.oi)

    ##__ standardize residue names -----
    atoms.chains.oi$resid <- aaStandardizeNames(atoms.chains.oi$resid)

    ##__ rename ATOM to HETATM for waters -----
    h2o.residues.tf <- atoms.chains.oi$resid %in% names.waters
    atoms.chains.oi$type[h2o.residues.tf] <- "HETATM"

    ##__ rename ATOM to HETATM for possible ligands -----
    standard.AAs.tf <- atoms.chains.oi$resid %in% names.residues
    ligand.AAs.tf <- !h2o.residues.tf & !standard.AAs.tf
    atoms.chains.oi$type[ligand.AAs.tf] <- "HETATM"

    ##___ identify the protein, het, and waters atoms
    prot.het.h2o.idc.pre.length <-
      unlist(lapply(ProtHetWatIndices(data = atoms.chains.oi), length))
    atoms.num.pre <- nrow(atoms.chains.oi)

    ##___ identify the protein, het, and waters atoms (again...) -----
    prot.het.h2o.idc <- ProtHetWatIndices(data = atoms.chains.oi)
    prot.idc <- prot.het.h2o.idc$prot.idc
    het.idc <- prot.het.h2o.idc$het.idc
    h2o.idc <- prot.het.h2o.idc$h2o.idc

    ##___ calculate the pairwise distances -----
    atoms.dist <- as.matrix(dist(atoms.chains.oi[, c("x","y","z")],
                                 method = "euclidean",
                                 diag = TRUE, upper = TRUE))
    diag(atoms.dist) <- NA

    ##___ retain waters within user defined distance (Angstroms) of the protein -----
    if ( (is.finite(cutoff.prot.h2o.dist)) & (cutoff.prot.h2o.dist > 0.0) ) {
      ##--- the water indices for waters within user defined
      ##    Angstroms of the protein
      h2o.idc <- RetainWatersWithinX(atoms.dist,
                                     prot.het.h2o.idc,
                                     cutoff.prot.h2o.dist)
    }

    ##___ update the structure and distance matrix -----
    atom.idc <- sort(c(prot.idc, het.idc, h2o.idc))
    atoms.dist <- atoms.dist[atom.idc, atom.idc]
    atoms.chains.oi <- atoms.chains.oi[atom.idc, ]

    ##___ update the indices -----
    ##____ identify the protein, het, and waters atoms (again...) -----
    prot.het.h2o.idc <- ProtHetWatIndices(data = atoms.chains.oi)
    prot.idc <- prot.het.h2o.idc$prot.idc
    het.idc <- prot.het.h2o.idc$het.idc
    h2o.idc <- prot.het.h2o.idc$h2o.idc
    num.atoms <- nrow(atoms.chains.oi)

    ##___ number of atoms in each group -----
    # num.prot.atoms <- length(prot.idc)
    # num.het.atoms <- length(het.idc)
    num.h2o.per.pdb[curr.pdb] <- curr.num.h2o <- length(h2o.idc)

    ##___ check if the file has waters -----
    h2o.message <- paste(" - Structure", curr.pdb, "of", num.pdb.structures,
                         "-->>", curr.pdb.id, "has", curr.num.h2o, "waters.",
                         sep = " ")
    message(h2o.message)
    if ( curr.num.h2o > 0 ) {
      got.h2o.pdb.ids <- append(got.h2o.pdb.ids, curr.pdb.id)
    } else {
      no.h2o.pdb.ids <- append(no.h2o.pdb.ids, curr.pdb.id)
      next()
    }

    ##___ get residue and atom names -----
    names.residues <- atoms.chains.oi$resid
    ##- atom names for the het atoms are the element symbol
    names.atoms <- rep(NA, num.atoms)
    names.atoms[prot.idc] <- atoms.chains.oi$elety[prot.idc]
    names.atoms[het.idc]  <- atoms.chains.oi$elesy[het.idc]
    names.atoms[h2o.idc]  <- atoms.chains.oi$elety[h2o.idc]
    names.res.atoms <- paste(names.residues, names.atoms, sep = " ")

    ##___ add helper columns and the mobility and normalized b-value values -----
    ##    to the structure
    pdb.id <- pdb.ids[curr.pdb]
    ##- find current PDBid row in PDBinfo data.frame
    # PDBinfo.curr.pdb.idx <- which(tolower(PDBinfo$structureId) %in% pdb.id)
    atoms.chains.oi <- cbind(atoms.chains.oi,
                             PDBid = pdb.id,
                             # resolution = PDBinfo[PDBinfo.curr.pdb.idx, "resolution"],
                             # rObserved = PDBinfo[PDBinfo.curr.pdb.idx, "rObserved"],
                             # rFree = PDBinfo[PDBinfo.curr.pdb.idx, "rFree"],
                             # mobility = NA,
                             # nBvalue = NA,
                             stringsAsFactors = FALSE)
    # ##-- mobility
    # atoms.chains.oi$mobility[prot.idc] <- prot.mobility
    # atoms.chains.oi$mobility[het.idc] <- het.mobility
    # atoms.chains.oi$mobility[h2o.idc] <- h2o.mobility
    # ##-- norm B-value
    # atoms.chains.oi$nBvalue[prot.idc] <- prot.nBvalue
    # atoms.chains.oi$nBvalue[het.idc] <- het.nBvalue
    # atoms.chains.oi$nBvalue[h2o.idc] <- h2o.nBvalue



    ##___ perform bound water environment analysis -----
    bwe.colNames <- c("adn", "ahp.sum", "ahp.mu", "ahp.sd",
                      "hbonds")

    ##____ protein atoms -----
    prot.bwe.results <- apply(atoms.dist[h2o.idc, ], 1,
                              FUN = BoundWaterEnvironment.interact,
                              set.oi.idc = prot.idc,
                              names.atoms = names.atoms,
                              names.res.atoms = names.res.atoms,
                              radius = 3.6)
    df.prot.bwe.results <- data.frame(matrix(unlist(prot.bwe.results),
                                             nrow = length(h2o.idc), byrow = T),
                                      stringsAsFactors = FALSE)
    colnames(df.prot.bwe.results) <- paste0("prot.", bwe.colNames)
    ##____ water atoms -----
    h2o.bwe.results <- apply(atoms.dist[h2o.idc, ], 1,
                             FUN = BoundWaterEnvironment.interact,
                             set.oi.idc = h2o.idc,
                             names.atoms = names.atoms,
                             names.res.atoms = names.res.atoms,
                             radius = 3.6)
    df.h2o.bwe.results <- data.frame(matrix(unlist(h2o.bwe.results),
                                            nrow = length(h2o.idc), byrow = T),
                                     stringsAsFactors = FALSE)
    colnames(df.h2o.bwe.results) <- paste0("h2o.", bwe.colNames)
    ##____ het atoms -----
    het.bwe.results <- apply(atoms.dist[h2o.idc, ], 1,
                             FUN = BoundWaterEnvironment.interact,
                             set.oi.idc = het.idc,
                             names.atoms = names.atoms,
                             names.res.atoms = names.res.atoms,
                             radius = 3.6)
    df.het.bwe.results <- data.frame(matrix(unlist(het.bwe.results),
                                            nrow = length(h2o.idc), byrow = T),
                                     stringsAsFactors = FALSE)
    colnames(df.het.bwe.results) <- paste0("het.", bwe.colNames)
    ##____ all atoms -----
    all.bwe.results <- apply(atoms.dist[h2o.idc, ], 1,
                             FUN = BoundWaterEnvironment.interact,
                             set.oi.idc = c(prot.idc, h2o.idc, het.idc),
                             names.atoms = names.atoms,
                             names.res.atoms = names.res.atoms,
                             radius = 3.6)
    df.all.bwe.results <- data.frame(matrix(unlist(all.bwe.results),
                                            nrow = length(h2o.idc), byrow = T),
                                     stringsAsFactors = FALSE)
    colnames(df.all.bwe.results) <- paste0("all.", bwe.colNames)
    ##____ put all the BoundWaterEnvironment results together -----
    df.bwe.results <- cbind(df.prot.bwe.results,
                            df.h2o.bwe.results,
                            df.het.bwe.results,
                            df.all.bwe.results,
                            stringsAsFactors = FALSE)
    ##____ add the BWE results to the PDB information -----
    bwe.placeholder <- data.frame(matrix(data = NA,
                                         nrow = num.atoms,
                                         ncol = ncol(df.bwe.results)),
                                  stringsAsFactors = FALSE)
    colnames(bwe.placeholder) <- colnames(df.bwe.results)
    atoms.chains.oi <- cbind(atoms.chains.oi, bwe.placeholder)
    atoms.chains.oi[h2o.idc, colnames(df.bwe.results)] <- df.bwe.results

    ##___ construct data.frame of ONLY water residues -----
    h2o.res <- atoms.chains.oi[h2o.idc, ]

    ##___ construct and assign unique rownames for each water residue -----
    h2o.rownames <- paste(pdb.id,
                          h2o.res$resid,
                          h2o.res$chain,
                          h2o.res$resno,
                          sep = "_")
    rownames(h2o.res) <- h2o.rownames

    ##___ append extracted water residues to 'water residue data.frame' -----
    h2o.df <- rbind(h2o.df, h2o.res)
  }


  ## PDB STRUCTURES SUMMARY ----------------------------------------------------
  ## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ##_ summary of structures imported -----
  num.h2o.per.pdb.imported <- num.h2o.per.pdb[num.h2o.per.pdb > 0]
  num.h2o <- sum(num.h2o.per.pdb.imported)
  pdb.id.imported <- unique(h2o.df$PDBid)
  num.pdbs.got.h2o <- length(got.h2o.pdb.ids)
  num.pdbs.no.h2o <- length(no.h2o.pdb.ids)
  num.pdbs.modeled.heavy <- length(modeled.heavy.atoms.pdb.ids)
  ##__ construct messages -----
  message.import <- paste(num.pdb.structures,
                          "structures were read for water cluster",
                          "analysis containing", num.h2o,
                          "water molecules.", sep = " ")
  message.got.h2o <- paste("  -", num.pdbs.got.h2o,
                           "Strucutures have water molecules.",
                           sep = " ")
  message.no.h2o <- paste("  -", num.pdbs.no.h2o,
                          "Strucuture(s) do NOT have water molecules.",
                          sep = " ")
  ##__ print the messages -----
  message("\n----- Summary about imported structures _____")
  message(message.import)
  message(message.got.h2o)
  message(message.no.h2o)
  if ( num.pdbs.no.h2o > 0 ) {
    message.no.h2o.pdb.ids <- paste(no.h2o.pdb.ids, collapse = ", ")
    message.no.h2o.2 <- paste("This structure(s) does NOT have waters: ",
                              message.no.h2o.pdb.ids, sep = "")
    message(message.no.h2o.2)
  }
  if ( num.pdbs.got.h2o <= 1 ) {
    stop("Water cluster analysis requires more than 1 PDB (structure)
              file to have water molecules.")
  }


  ## CLUSTER WATER MOLECULES ---------------------------------------------------
  ## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ##_ clustering -----
  message("\n----- Clustering MDS waters from the provided structures _____")
  h2o.cluster.all <- ClusterWaters.MDS(data = h2o.df,
                                   cutoff.cluster,
                                   cluster.method = cluster.method)

  ##_ merge the ALL and PASSED h2o.clusters.raw data.frames -----
  h2o.clusters.raw.all <- h2o.cluster.all$h2o.clusters.raw
  h2o.clusters.raw.all <- cbind(h2o.clusters.raw.all,
                                unique.name = rownames(h2o.clusters.raw.all),
                                stringsAsFactors = FALSE)
  h2o.clusters.raw.all <- h2o.clusters.raw.all[order(rownames(h2o.clusters.raw.all)), ]


  ## CONSTRUCT CLUSTERED WATERS SUMMARY ----------------------------------------
  ## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ##_ construct the structure and water summary for the clusters -----
  message("----- Constructing the summary table _____\n")
  h2o.cluster.all.stats <- ConservedWaterStats(h2o.cluster = h2o.cluster.all,
                                               num.h2o.inital = num.h2o,
                                               num.pdbs.got.h2o = num.pdbs.got.h2o)
  h2o.cluster.stats <- cbind(h2o.cluster.all.stats,
                             # h2o.cluster.passed.stats,
                             stringsAsFactors = FALSE)
  message("----- Summary table complete _____\n")


  ## WRITE CLUSTERED WATERS PDB FILES ------------------------------------------
  ## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ##_ write out PDB file with waters (o=% of structures with this water, -----
  ##      b=mean of b-values)
  ##--- all the provided waters
  message("----- Writing the conserved MDS waters to PDB file _____")
  h2o.all.filename <- paste(filename,
                            "_ConservedWaters_ALL_",
                            now.date.time, ".pdb", sep = "")
  write.conservedWaters.pdb(file = h2o.all.filename,
                            h2o.clusters.summary = h2o.cluster.all$h2o.clusters.summary)

  message("----- PDB file written _____\n")


  ## WRITE RESULTS TO EXCEL WORKBOOK -------------------------------------------
  ## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ##_ create worksheet names -----
  CW.ClusterStatistics <- paste("ClusterStats", now.date.time, sep="_")
  CW.all.h2o.summ      <- paste("MDS_ClustSumm", now.date.time, sep="_")
  CW.all.occ.summ      <- paste("MDS_OccurSumm", now.date.time, sep="_")
  CW.init.h2o.data     <- paste("InitWaterData", now.date.time, sep="_")

  ##_ construct workbook name -----
  filename.xlsx <- paste(filename, "_DATA_RESULTS.xlsx", sep="")
  ##__ open existing excel workbook if available -----
  if ( file.exists(filename.xlsx) == TRUE ) {
    results.wb <- openxlsx::loadWorkbook(file=filename.xlsx)
  } else {  ##--- construct the workbook
    results.wb <- openxlsx::createWorkbook()
  }

  ##_ write the results to an excel workbook -----
  message("----- Writing results to Excel workbook _____")

  ##__ water cluster summary statistics -----
  results.wb <- oxClusterStatsSheet(wb.name = results.wb,
                                    sheet.name = CW.ClusterStatistics,
                                    df = h2o.cluster.stats)

  ##__ all clustered waters -----
  ##- cluster summary
  results.wb <- oxClusterSummarySheet(wb.name = results.wb,
                                      sheet.name = CW.all.h2o.summ,
                                      df = h2o.cluster.all$h2o.clusters.summary)

  ##- water occurrence summary
  results.wb <- oxWaterOccurrenceSheet(wb.name = results.wb,
                                       sheet.name = CW.all.occ.summ,
                                       df = h2o.cluster.all$h2o.occurrence)

  ##__ write the workbook -----
  openxlsx::saveWorkbook(results.wb, filename.xlsx, overwrite=TRUE)

  message("----- Results written to Excel workbook _____\n")


  ## RESULTS TO USER -----------------------------------------------------------
  ## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ##_ filenames and date-time information -----
  filenames.used <- list(prefix=prefix,
                         conserved.all=h2o.all.filename,
                         xlsx=filename.xlsx,
                         when=now.date.time)


  ##_ return the results -----
  message("----- Done! _____")
  list(h2o.cluster.stats = h2o.cluster.stats,
       h2o.cluster.all = h2o.cluster.all,
       # h2o.clusters.summary = h2o.raw.all.passed,  ## merged cluster number and conservation percentage
       filenames.used = filenames.used,
       MDS = TRUE,
       call = the.call
  )
}




## conserved water statistics docs ---------------------------------------------
#' @title Conserved Water Statistics
#' @description Calculates the Conserved Water Statistics for
#'   [ConservedWaters()]
#' @details Calculates the statistics for each conserved water analysis
#'   performed by [ConservedWaters()]. This summary information is useful for
#'   timings information and is written to the Excel workbook.
#'
#' @param h2o.cluster Conserved water cluster
#' @param num.h2o.inital Number of initial waters
#' @param num.pdbs.got.h2o Number of PDB structures with waters
#'
#' @return
#'   A table with the following information is returned:
#'   - Number of structures
#'   - Number of initial waters
#'   - Number of waters used in the calculations
#'   - Number of water clusters
#'   - Average water conservation
#'   - Number of conserved waters with
#'     + < 50\% conservation
#'     + 50 - 69\% conservation
#'     + 70 - 79\% conservation
#'     + 80 - 89\% conservation
#'     + 90 - 99\% conservation
#'     + 100\% conservation
#'   - Number of pairwise distances evaluated
#'   - Amount of memory used by the pairwise diatance matrix
#'   - Pairwise distance calculation time
#'   - Cluster centroid distance calculation time
#'
#' @family "vanddraabe utilities"
#'
#' @author Emilio Xavier Esposito \email{emilio@@exeResearch.com}
#'
ConservedWaterStats <- function(h2o.cluster,
                                num.h2o.inital,
                                num.pdbs.got.h2o) {

  ##_ extract needed information from h2o cluster -----
  num.h2o.used <- as.integer(sum(h2o.cluster$h2o.clusters.summary$num.waters))
  num.h2o.clusters <- nrow(h2o.cluster$h2o.clusters.summary)
  conservation.mu <- mean(h2o.cluster$h2o.clusters.summary$num.waters)
  clustering.info <- h2o.cluster$clustering.info

  ##_ get the distance matrix size -----
  size.pairwise.dist <- unlist(strsplit(clustering.info$size.pairwise.dist, " "))
  size.pairwise.dist.num <- as.numeric(size.pairwise.dist[1])
  size.pairwise.dist.units <- size.pairwise.dist[2]

  ##_ calculate the percentages -----
  pct.conserved.passed <- h2o.cluster$h2o.clusters.summary$pct.conserved
  num.pct.lt0.5        <- sum(pct.conserved.passed  < 50)
  num.pct.gte0.5_lt0.7 <- sum(pct.conserved.passed >= 50 & pct.conserved.passed < 70)
  num.pct.gte0.7_lt0.8 <- sum(pct.conserved.passed >= 70 & pct.conserved.passed < 80)
  num.pct.gte0.8_lt0.9 <- sum(pct.conserved.passed >= 80 & pct.conserved.passed < 90)
  num.pct.gte0.9_lt1.0 <- sum(pct.conserved.passed >= 90 & pct.conserved.passed < 100)
  num.pct.gte.100 <- sum(pct.conserved.passed >= 100)

  ##_ combine the values -----
  h2o.cluster.summary.values <- c(num.pdbs.got.h2o,
                                  num.h2o.inital,
                                  num.h2o.used,
                                  num.h2o.clusters,
                                  conservation.mu,
                                  num.pct.lt0.5,
                                  num.pct.gte0.5_lt0.7,
                                  num.pct.gte0.7_lt0.8,
                                  num.pct.gte0.8_lt0.9,
                                  num.pct.gte0.9_lt1.0,
                                  num.pct.gte.100,
                                  as.numeric(clustering.info$num.pairwise.dist),
                                  size.pairwise.dist.num,
                                  as.numeric(clustering.info$time.pairwise.dist),
                                  as.numeric(clustering.info$time.cluster.dist) )

  ##_ combine the percentages -----
  h2o.cluster.summary.pct <- c(NA,
                               NA,
                               NA,
                               NA,
                               NA,
                               num.pct.lt0.5/num.h2o.clusters,
                               num.pct.gte0.5_lt0.7/num.h2o.clusters,
                               num.pct.gte0.7_lt0.8/num.h2o.clusters,
                               num.pct.gte0.8_lt0.9/num.h2o.clusters,
                               num.pct.gte0.9_lt1.0/num.h2o.clusters,
                               num.pct.gte.100/num.h2o.clusters,
                               NA,
                               NA,
                               NA,
                               NA)

  ##_ construct the data.frame of values and add rownames -----
  h2o.cluster.summary <- data.frame(values = h2o.cluster.summary.values,
                                    percentages = h2o.cluster.summary.pct * 100,
                                    stringsAsFactors = FALSE)
  rownames(h2o.cluster.summary) <- c("Number of structures",
                                     "Number of initial waters",
                                     "Number of used waters",
                                     "Number of water clusters",
                                     "Average conservation",
                                     "<50%",
                                     "50-69%",
                                     "70-79%",
                                     "80-89%",
                                     "90-99%",
                                     "100%",
                                     "Number of pairwise distances",
                                     paste("Memory size of pairwise distances (",
                                           size.pairwise.dist.units, ")", sep = ""),
                                     "Pairwise distance calc time (s)",
                                     "Cluster centroid distance calcs time (s)")

  ##_ return the summary statistics -----
  return(h2o.cluster.summary)

}
Any scripts or data that you put into this service are public.
vanddraabe documentation built on June 8, 2019, 1:03 a.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
vanddraabe
Identification and Statistical Analysis of Conserved Waters Near Proteins

R/ConservedWaters.R
In vanddraabe: Identification and Statistical Analysis of Conserved Waters Near Proteins

Try the vanddraabe package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

vanddraabe Identification and Statistical Analysis of Conserved Waters Near Proteins

R/ConservedWaters.R In vanddraabe: Identification and Statistical Analysis of Conserved Waters Near Proteins

Try the vanddraabe package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

vanddraabe
Identification and Statistical Analysis of Conserved Waters Near Proteins

R/ConservedWaters.R
In vanddraabe: Identification and Statistical Analysis of Conserved Waters Near Proteins