View source: R/calculate_PC1.R
The main function of the CLMeval package, calculate_PC1, allows
the user to evaluate a chemical language model by integrating five orthogonal
metrics of model performance. This is accomplished by principal component
analysis of a dataset where the major dimension of variance is model
performance (that is, models segregate along the first principal component
based on their ability to match the chemical space of the training set).
This function performs PCA on a reference matrix of chemical outcomes, then
uses the base R predict function to project a model of interest
onto the same principal components.
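The fit-then-project step described above can be sketched with base R alone. This is a minimal illustration, not the package's implementation: the reference matrix here is simulated, the object names (`reference`, `new_model`, `pc1_score`) are hypothetical, and centering and scaling before PCA are assumptions.

```r
# Hypothetical reference matrix: rows are models, columns are the five metrics.
# Values are simulated purely for illustration.
set.seed(1)
n_ref <- 50
quality <- runif(n_ref)  # latent "model quality" driving all five metrics
reference <- data.frame(
  pct_valid         = 0.5 + 0.5 * quality + rnorm(n_ref, sd = 0.02),
  FCD               = 10 * (1 - quality) + rnorm(n_ref, sd = 0.3),
  JSD_stereocenters = 0.5 * (1 - quality) + rnorm(n_ref, sd = 0.02),
  JSD_murcko        = 0.6 * (1 - quality) + rnorm(n_ref, sd = 0.02),
  JSD_NP            = 0.4 * (1 - quality) + rnorm(n_ref, sd = 0.02)
)

# Fit PCA on the reference models (centering and scaling are assumptions).
pca <- prcomp(reference, center = TRUE, scale. = TRUE)

# Metrics for a new model of interest, with matching column names.
new_model <- data.frame(
  pct_valid = 0.95, FCD = 1.2, JSD_stereocenters = 0.05,
  JSD_murcko = 0.10, JSD_NP = 0.04
)

# Project the new model onto the reference principal components.
projected <- predict(pca, newdata = new_model)
pc1_score <- unname(projected[1, "PC1"])
pc1_score
```

Note that the sign of a principal component is arbitrary, so in a sketch like this a higher PC1 value does not by itself mean a better model; the direction has to be read off the reference models.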
calculate_PC1(pct_valid, FCD, JSD_stereocenters, JSD_murcko, JSD_NP)
pct_valid | the proportion of valid molecules generated by the trained model
FCD | the Frechet ChemNet distance to the training set
JSD_stereocenters | the Jensen-Shannon distance between the number of stereocenters in molecules sampled from the trained model vs. the training set
JSD_murcko | the Jensen-Shannon distance between the frequency distribution of Murcko scaffolds within molecules sampled from the trained model vs. the training set
JSD_NP | the Jensen-Shannon distance between the natural product-likeness scores of molecules sampled from the trained model vs. the training set
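A call with all five arguments might look like the following sketch. The metric values are hypothetical and chosen only for illustration, and the call is guarded so that the snippet still runs when CLMeval is not installed.

```r
# Hypothetical metric values for one trained model (for illustration only).
metrics <- list(
  pct_valid         = 0.95,
  FCD               = 1.2,
  JSD_stereocenters = 0.05,
  JSD_murcko        = 0.10,
  JSD_NP            = 0.04
)

# Call calculate_PC1 only if CLMeval is installed, so the snippet runs either way.
if (requireNamespace("CLMeval", quietly = TRUE)) {
  pc1 <- do.call(CLMeval::calculate_PC1, metrics)
  print(pc1)
}
```

Passing the metrics as a named list keeps each value tied to its argument name, which guards against mixing up the three Jensen-Shannon distances.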
The function takes as input five metrics that reflect the quality of a chemical language model. These metrics were chosen because they were found to be robustly correlated with the number of molecules in the training set across a series of benchmarking analyses. The five metrics are as follows:
the proportion of valid molecules generated by the trained model
the Frechet ChemNet distance to the training set
the Jensen-Shannon distance between the number of stereocenters in molecules sampled from the trained model vs. the training set
the Jensen-Shannon distance between the frequency distribution of Murcko scaffolds within molecules sampled from the trained model vs. the training set
the Jensen-Shannon distance between the natural product-likeness scores of molecules sampled from the trained model vs. the training set
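Three of these metrics are Jensen-Shannon distances between a property distribution in the generated molecules and the same distribution in the training set. As a rough sketch of how such a distance can be computed for a discrete property, the code below compares stereocenter counts. The helper names (`kl_divergence`, `js_distance`), the example counts, and the choice of base-2 logarithms are assumptions made for illustration; the binning and logarithm base used in the original analysis are not stated here.

```r
# Kullback-Leibler divergence (base 2), skipping zero-probability terms in p.
kl_divergence <- function(p, q) {
  keep <- p > 0
  sum(p[keep] * log2(p[keep] / q[keep]))
}

# Jensen-Shannon distance: square root of the Jensen-Shannon divergence.
js_distance <- function(p, q) {
  p <- p / sum(p)          # normalize counts to probabilities
  q <- q / sum(q)
  m <- (p + q) / 2         # mixture distribution
  jsd <- (kl_divergence(p, m) + kl_divergence(q, m)) / 2
  sqrt(max(jsd, 0))        # guard against tiny negative rounding errors
}

# Hypothetical stereocenter counts per molecule (0 to 4) in each set.
count_levels <- 0:4
generated <- c(0, 0, 1, 1, 1, 2, 2, 3, 0, 1)
training  <- c(0, 1, 1, 2, 2, 2, 3, 3, 4, 1)
p <- as.numeric(table(factor(generated, levels = count_levels)))
q <- as.numeric(table(factor(training,  levels = count_levels)))

js_distance(p, q)
```

With base-2 logarithms the distance lies between 0 (identical distributions) and 1 (distributions with no overlap), so lower values indicate a closer match to the training set.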
The reference matrix used to perform PCA contains metrics for a total of 440
chemical language models. These were obtained by training recurrent neural
network-based models on SMILES strings from the ChEMBL, COCONUT, GDB, and
ZINC databases. The number of molecules sampled from each database varied
between 1,000 and 500,000, in eleven increments, and ten random samples of
each size were drawn from each database (4 databases x 11 sizes x 10 samples
= 440 models). For further details, see the reference documentation.
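The total of 440 is consistent with crossing four databases, eleven training set sizes, and ten random samples per size. The sketch below enumerates that grid under this reading; the column names are hypothetical, and the sizes are represented only by their step index because the eleven individual values are not listed here.

```r
# Enumerate the assumed benchmarking design: database x size step x replicate.
design <- expand.grid(
  database  = c("ChEMBL", "COCONUT", "GDB", "ZINC"),
  size_step = 1:11,   # index of the training set size (1,000 up to 500,000)
  replicate = 1:10    # independent random samples at each size
)

nrow(design)            # 440 models in total
table(design$database)  # 110 models per database
```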
For further details on the metrics, a complete description of the analysis is available at doi:10.26434/chemrxiv.13638347.v1.
a scalar value representing the model's PC1 score, derived from the integration of all five metrics