reference: Reference matrix of model evaluation metrics for PCA

Description

A matrix of model evaluation metrics, used as a reference when integrating the values of the same five metrics for a newly developed model by principal component analysis (PCA). The reference matrix contains metrics for a total of 440 chemical language models. These were obtained by training recurrent neural network-based models on SMILES strings from the ChEMBL, COCONUT, GDB, and ZINC databases. The number of molecules sampled from each database varied between 1,000 and 500,000, in eleven increments, and ten random samples of each size were drawn from each database (4 databases x 11 sizes x 10 samples = 440 models). The matrix carries a metadata attribute that links each row to the parameters of its training set. The five metrics were chosen because they were robustly correlated with the number of molecules in the training set across all four databases. They are listed under Details.
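The total of 440 models follows from the sampling design described above. A minimal sketch of that arithmetic, assuming a full factorial design; the column names `size_step` and `replicate` are placeholders for illustration, and the actual training set sizes are not reproduced here:

```r
# Database names come from the description; the step and replicate
# indices are placeholders, not the package's own metadata fields.
design <- expand.grid(
  database  = c("ChEMBL", "COCONUT", "GDB", "ZINC"),
  size_step = 1:11,  # eleven training set sizes between 1,000 and 500,000
  replicate = 1:10   # ten random samples of each size
)

# One row per trained model: 4 * 11 * 10
n_models <- nrow(design)
```

Here `n_models` should match the number of rows in the reference matrix.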

Usage

reference

Format

A matrix with 440 rows and 5 columns, where each row corresponds to a model and each column corresponds to an evaluation metric.

Details

  1. the proportion of valid molecules generated by the trained model

  2. the Frechet ChemNet distance to the training set

  3. the Jensen-Shannon distance between the number of stereocenters in molecules sampled from the trained model vs. the training set

  4. the Jensen-Shannon distance between the frequency distribution of Murcko scaffolds within molecules sampled from the trained model vs. the training set

  5. the Jensen-Shannon distance between the natural product-likeness scores of molecules sampled from the trained model vs. the training set
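As a rough illustration of how a new model's five metrics might be placed in the principal component space of the reference models, the sketch below uses base R's `prcomp` and `predict`. It is not necessarily how the package itself performs the integration. The matrix `ref` is a random stand-in for the packaged data, and the column names and the metric values for `new_model` are hypothetical.

```r
# Stand-in for the reference matrix: random values, NOT the real metrics.
# Column names are hypothetical labels for the five metrics listed above.
set.seed(1)
metric_names <- c("valid", "fcd", "js_stereocenters",
                  "js_murcko", "js_np_likeness")
ref <- matrix(runif(440 * 5), nrow = 440, ncol = 5,
              dimnames = list(NULL, metric_names))

# Fit PCA on the reference models, centring and scaling each metric.
pca <- prcomp(ref, center = TRUE, scale. = TRUE)

# Metrics for one hypothetical new model, with the same column names.
new_model <- matrix(c(0.95, 0.20, 0.05, 0.10, 0.08), nrow = 1,
                    dimnames = list(NULL, metric_names))

# Project the new model using the centring, scaling, and rotation
# learned from the reference models.
new_scores <- predict(pca, newdata = new_model)
```

`new_scores` is a one-row matrix with columns `PC1` to `PC5`; it can be compared against `pca$x`, which holds the scores of the 440 reference rows.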


skinnider/CLMeval documentation built on Dec. 23, 2021, 3:23 a.m.