A matrix of model evaluation metrics that is used to integrate the values of
the same five metrics for a newly developed model using principal component
analysis. The reference matrix contains metrics for a total of 440 chemical
language models. These were obtained by training recurrent neural
network-based models on SMILES strings from the ChEMBL, COCONUT, GDB, and
ZINC databases. The number of molecules sampled from each database varied between 1,000
and 500,000, in eleven increments, and ten random samples of each size
were drawn from each database. The matrix contains a metadata
attribute that links each row in the matrix to the parameters of the training
set. The five metrics were chosen because they were robustly correlated with
the number of molecules in the training set across all four databases.
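The integration step described above can be sketched as follows. This is a minimal illustration, not the package's own implementation: the reference values are synthetic placeholders, the column order and the new model's metric values are hypothetical, and standardizing each metric before the analysis is an assumption (the source does not say how the metrics are scaled or how many components are kept). Only the first principal axis is computed here, by power iteration, to keep the sketch within the Python standard library.

```python
import math
import random


def column_stats(rows):
    """Per-column mean and sample standard deviation of a list of rows."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    sds = [math.sqrt(sum((x - m) ** 2 for x in col) / (n - 1))
           for col, m in zip(columns, means)]
    return means, sds


def standardize(row, means, sds):
    """Center and scale one row using the reference statistics."""
    return [(x - m) / s for x, m, s in zip(row, means, sds)]


def leading_axis(z_rows, iterations=200):
    """First principal axis of standardized rows, found by power iteration."""
    n, p = len(z_rows), len(z_rows[0])
    cov = [[sum(r[i] * r[j] for r in z_rows) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0] + [0.0] * (p - 1)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return v, eigenvalue


def project(z_row, axis):
    """Score of one standardized row along the given axis."""
    return sum(z * a for z, a in zip(z_row, axis))


# Synthetic stand-in for the 440 x 5 reference matrix (NOT the real data):
# five correlated columns driven by one latent variable plus noise.
rng = random.Random(0)
loadings = [1.0, -1.0, 0.8, -0.6, 0.9]
reference = []
for _ in range(440):
    t = rng.random()
    reference.append([a * t + rng.gauss(0.0, 0.1) for a in loadings])

means, sds = column_stats(reference)
z_reference = [standardize(row, means, sds) for row in reference]
axis, eigenvalue = leading_axis(z_reference)
reference_scores = [project(z, axis) for z in z_reference]

# Hypothetical metric values for a newly developed model, scaled with the
# reference statistics and projected onto the reference axis.
new_model = [0.5, -0.5, 0.4, -0.3, 0.45]
new_score = project(standardize(new_model, means, sds), axis)
print(f"new model score on the first axis: {new_score:.3f}")
```

The new model is scaled with the reference means and standard deviations rather than its own, so its score is directly comparable with the scores of the 440 reference models. The sign of a principal axis is arbitrary, so only relative positions along it are meaningful.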
The matrix has 440 rows and 5 columns, where each row corresponds to a model and each column corresponds to an evaluation metric.
These five metrics are as follows:
the proportion of valid molecules generated by the trained model
the Fréchet ChemNet distance to the training set
the Jensen-Shannon distance between the distributions of stereocenter counts in molecules sampled from the trained model vs. the training set
the Jensen-Shannon distance between the frequency distribution of Murcko scaffolds within molecules sampled from the trained model vs. the training set
the Jensen-Shannon distance between the natural product-likeness scores of molecules sampled from the trained model vs. the training set
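Three of the five metrics are Jensen-Shannon distances between a property distribution in sampled molecules and the same distribution in the training set. As a minimal sketch of that quantity for two discrete distributions: the function name and the example counts below are hypothetical, and the base-2 logarithm (which bounds the distance between 0 and 1) is an assumption, since the source does not state the base or how continuous scores are binned.

```python
import math


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence, using base-2 logs.

    p and q are non-negative weights over the same categories (for example,
    counts of molecules with 0, 1, 2, ... stereocenters); each is normalized
    to sum to 1 before comparison.
    """
    if len(p) != len(q):
        raise ValueError("p and q must cover the same categories")
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with zero mass contribute 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))


# Hypothetical stereocenter-count histograms (illustrative numbers only).
generated_counts = [420, 310, 180, 60, 30]
training_counts = [400, 300, 200, 70, 30]
print(jensen_shannon_distance(generated_counts, training_counts))
```

A value of 0 means the two distributions are identical, and larger values indicate that the generated molecules diverge more from the training set on that property.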