compDis | R Documentation |
Function to quantify dissimilarities between phytochemical compounds.
compDis(
compoundData,
type = "PubChemFingerprint",
npcTable = NULL,
unknownCompoundsMean = FALSE
)
compoundData |
Data frame with the chemical compounds of interest, usually the compounds found in the sample dataset. Should have a column named "compound" with common names of the compounds, a column named "smiles" with SMILES IDs of the compounds, and a column named "inchikey" with the InChIKey IDs for the compounds. |
type |
Type of data compound dissimilarity calculations will be based on: "NPClassifier", "PubChemFingerprint" or "fMCS". |
npcTable |
A data frame already generated by NPCTable, optionally edited manually, with the NPClassifier classifications of the compounds. |
unknownCompoundsMean |
Whether unknown compounds, i.e. ones without SMILES or InChIKey, should be given mean dissimilarity values. If not, these will have dissimilarity 1 to all other compounds. |
This function calculates matrices with pairwise dissimilarities between
the chemical compounds in compoundData, to quantify how
different the molecules are to each other. It does so in three
different ways, based on the biosynthetic classification or
molecular structure of the molecules:
Using the classification from the NPClassifier tool, type = "NPClassifier".
NPClassifier (Kim et al. 2021) is a
deep-learning tool that automatically classifies natural products
(i.e. phytochemical compounds) into a hierarchical classification of
three levels: pathway, superclass and class. This classification largely
corresponds to the biosynthetic groups/pathways the compounds
are produced in. Classifications are downloaded from
https://npclassifier.ucsd.edu/. NPClassifier does not always
manage to classify every compound into all three hierarchical levels. In
such cases, it might be beneficial to first run NPCTable,
manually edit the resulting data frame with probable classifications if
possible (with help from the Supporting Information in Kim et al. 2021),
and then supply this classification to the compDis function
with the npcTable argument. This will ensure that compound
dissimilarities are computed optimally.
Using PubChem Fingerprints, type = "PubChemFingerprint".
This is a binary substructure fingerprint with 881 binary
variables describing the chemical structure of a compound.
With this method, compounds are therefore compared
based on how structurally dissimilar the molecules are.
See https://pubchem.ncbi.nlm.nih.gov/docs/data-specification
for more information. (There are many other types of fingerprints,
and ways of calculating compound dissimilarities based on them, see
e.g. packages fingerprint and rcdk). Fingerprint data for
molecules is downloaded from PubChem. In association with this,
there might be a Warning message about closing unused connections,
which can be ignored.
Using fMCS, flexible Maximum Common Substructure, type = "fMCS".
This is a pairwise graph matching concept.
The fMCS of two compounds is the largest substructure that occurs in both
compounds, allowing for atom and/or bond mismatches (Wang et al. 2013).
As with the fingerprints, compounds are compared based on how
structurally dissimilar the molecules are. While potentially a very
accurate similarity measure, fMCS is much more computationally demanding
than the other methods, and will take a significant amount of time for
larger data sets. Data on molecules is downloaded from PubChem.
In association with this, there might be a Warning message about closing
unused connections, which can be ignored.
Dissimilarities using NPClassifier and PubChem Fingerprints are generated by calculating Jaccard (Tanimoto) dissimilarities from a 0/1 table with compounds as rows and groups (NPClassifier) or binary fingerprint variables (PubChem Fingerprints) as columns. fMCS generates dissimilarity values by calculating Jaccard dissimilarities based on the number of atoms in the maximum common substructure, allowing for one atom and one bond mismatch. Dissimilarities are returned as dissimilarity matrices.
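The Jaccard calculation described above can be sketched in base R. This is only an illustration of the idea, not the code compDis runs internally: the table binTable, the helper names jaccardDis and fmcsJaccardDis, and all numbers are invented for the example, and the fMCS variant assumes the usual Tanimoto form based on atom counts.

```r
# Hypothetical 0/1 table: rows are compounds, columns are NPClassifier groups
# or fingerprint bits (names and values invented for illustration).
binTable <- rbind(
  compA = c(1, 1, 1, 0, 0),
  compB = c(1, 1, 0, 1, 0),
  compC = c(0, 0, 0, 1, 1)
)

# Jaccard dissimilarity: 1 - |intersection| / |union| for each pair of rows.
jaccardDis <- function(x) {
  shared <- x %*% t(x)                                  # features in both
  total <- outer(rowSums(x), rowSums(x), "+") - shared  # features in either
  dis <- 1 - shared / total
  dis[total == 0] <- 0  # assumption: two all-zero rows count as identical
  diag(dis) <- 0
  dis
}

disMat <- jaccardDis(binTable)
disMat

# The same idea for fMCS, using atom counts of the two molecules and of their
# maximum common substructure (hypothetical numbers).
fmcsJaccardDis <- function(nAtomsA, nAtomsB, nAtomsMCS) {
  1 - nAtomsMCS / (nAtomsA + nAtomsB - nAtomsMCS)
}
fmcsJaccardDis(10, 12, 8)
```

In this toy table, compA and compB share two of four features in total, so their dissimilarity is 0.5, while compA and compC share none and get dissimilarity 1.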
If dissimilarities are calculated with more than one method,
the function will output additional dissimilarity matrices.
This always includes a matrix with the mean dissimilarity values of the
selected methods. If "NPClassifier" is included in type,
a matrix of "mix" values is also calculated. The values in this matrix
are the dissimilarities from NPClassifier when these are > 0.
For pairs of compounds where the dissimilarity from NPClassifier
equals 0 (i.e. when the compounds belong to the same pathway, superclass
and class), values are equal to half of the (mean) value(s) of the
structural dissimilarity/-ies from PubChem Fingerprints and/or fMCS.
With this method, compound dissimilarities are primarily based on
NPClassifier, but instead of compounds with identical classification having
0 dissimilarity, these have a dissimilarity based on PubChem Fingerprints
and/or fMCS, scaled to always be less (< 0.5) than that of compounds in the
same pathway and superclass, but different class.
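As a rough illustration of how the mean and "mix" matrices relate to the per-method matrices, the following base R sketch applies the rules described above to two small matrices. The object names (npcDis, fingerDis, meanDis, mixDis) and all values are hypothetical and chosen for the example; they are not taken from the package.

```r
# Hypothetical dissimilarity matrices for three compounds.
compNames <- c("compA", "compB", "compC")

# compA and compB have identical NPClassifier classification (dissimilarity 0).
npcDis <- matrix(c(0,   0,   0.5,
                   0,   0,   0.5,
                   0.5, 0.5, 0),
                 nrow = 3, dimnames = list(compNames, compNames))

fingerDis <- matrix(c(0,   0.4, 0.8,
                      0.4, 0,   0.6,
                      0.8, 0.6, 0),
                    nrow = 3, dimnames = list(compNames, compNames))

# Mean of the selected methods, element by element.
meanDis <- (npcDis + fingerDis) / 2

# "mix": keep the NPClassifier value where it is > 0, otherwise use half of
# the structural dissimilarity (here only fingerprints are available).
mixDis <- ifelse(npcDis > 0, npcDis, fingerDis / 2)

meanDis
mixDis
```

Here compA and compB would have dissimilarity 0 under NPClassifier alone, but receive 0.2 in the mix matrix (half of their fingerprint dissimilarity of 0.4).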
If there are unknown compounds, which do not have a
corresponding SMILES or InChIKey, this can be handled in three
different ways. First, these can be completely removed from the list
of compounds and the sample data set, and hence excluded from all analyses.
Second, if unknownCompoundsMean = FALSE, unknown compounds will
be given a dissimilarity value of 1 to all other compounds. Third, if
unknownCompoundsMean = TRUE, unknown compounds will be given
a dissimilarity value to all other compounds which equals the mean
dissimilarity value between all known compounds. See chemodiv
for alternative methods that can be used when most or all compounds
are unknown.
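The second and third options can be illustrated with a small base R sketch. The helper addUnknown and its arguments are hypothetical, and the sketch assumes that the mean is taken over the distinct pairs of known compounds; the package may compute it differently.

```r
# Hypothetical dissimilarity matrix for three known compounds.
knownNames <- c("compA", "compB", "compC")
knownDis <- matrix(c(0,   0.2, 0.6,
                     0.2, 0,   0.4,
                     0.6, 0.4, 0),
                   nrow = 3, dimnames = list(knownNames, knownNames))

# Append one unknown compound. With useMean = FALSE it gets dissimilarity 1 to
# all other compounds; with useMean = TRUE it gets the mean dissimilarity
# between known compounds (assumption: mean over distinct pairs).
addUnknown <- function(knownDis, unknownName, useMean = FALSE) {
  fillValue <- if (useMean) mean(knownDis[lower.tri(knownDis)]) else 1
  n <- nrow(knownDis)
  allNames <- c(rownames(knownDis), unknownName)
  fullDis <- matrix(fillValue, nrow = n + 1, ncol = n + 1,
                    dimnames = list(allNames, allNames))
  fullDis[seq_len(n), seq_len(n)] <- knownDis  # keep known values unchanged
  diag(fullDis) <- 0
  fullDis
}

disOne <- addUnknown(knownDis, "unknown1", useMean = FALSE)
disMean <- addUnknown(knownDis, "unknown1", useMean = TRUE)
disOne
disMean
```

With these numbers the mean of the three known pairwise values (0.2, 0.6, 0.4) is 0.4, so the unknown compound sits at 0.4 from every other compound when useMean = TRUE.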
List with compound dissimilarity matrices. A list is always
returned, even if only one matrix is calculated. Downstream functions,
including calcDiv, calcBetaDiv, calcDivProf, sampDis, molNet
and chemoDivPlot, require only the matrix as
input (e.g. as fullList$specificMatrix) rather than the whole list.
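Extracting a single matrix from such a list can be sketched as follows. The list fullList is built by hand here to keep the example self-contained, and the element name specificMatrix is the placeholder used above, not a name guaranteed to appear in real compDis output; names() shows which elements are actually present.

```r
# Hand-built stand-in for a list of dissimilarity matrices (hypothetical).
compNames <- c("compA", "compB")
fullList <- list(
  specificMatrix = matrix(c(0, 0.3,
                            0.3, 0),
                          nrow = 2, dimnames = list(compNames, compNames))
)

names(fullList)                      # inspect which matrices the list holds
oneMatrix <- fullList$specificMatrix # extract by name
sameMatrix <- fullList[[1]]          # or by position

is.matrix(oneMatrix)
```

It is the extracted matrix, not fullList itself, that would then be passed on to a downstream function.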
Kim HW, Wang M, Leber CA, Nothias L-F, Reher R, Kang KB, van der Hooft JJJ, Dorrestein PC, Gerwick WH, Cottrell GW. 2021. NPClassifier: A Deep Neural Network-Based Structural Classification Tool for Natural Products. Journal of Natural Products 84: 2795-2807.
Wang Y, Backman TWH, Horan K, Girke T. 2013. fmcsR: mismatch tolerant maximum common substructure searching in R. Bioinformatics 29: 2792-2794.
data(minimalCompData)
data(minimalNPCTable)
compDis(minimalCompData, type = "NPClassifier",
npcTable = minimalNPCTable) # Dissimilarity based on NPClassifier
## Not run: compDis(minimalCompData) # Dissimilarity based on Fingerprints
data(alpinaCompData)
data(alpinaNPCTable)
compDis(compoundData = alpinaCompData, type = "NPClassifier",
npcTable = alpinaNPCTable) # Dissimilarity based on NPClassifier