CMF – R Package Implementing the Continuous Molecular Fields Approach

library(knitr)
library(conmolfields)
knitr::opts_chunk$set(eval = FALSE)

1. Inrtoduction

The package CMF contains a set of R functions that implement the Continuous Molecular Fields (CMF) approach to building 3D-QSAR models. The reference version of the R environment for statistical computing and graphics is 3.0.1.

2. Installation and Setup of the Required Software

The latest version of the R environment can be downloaded and installed from the homepage of the R Project for Statistical Computing http://www.r-project.org/. In order to perform visualization of molecules, molecular fields and fields of regression coefficients, it is necessary to install two additional packages: rgl and misc3d. They can be installed from the CRAN homepage http://cran.r-project.org/. In order to use quantum chemical molecular fields, it is necessary to install the MOPAC12 program http://openmopac.net/MOPAC2012.html. In order to run the software implementing the CMF approach to building 3D-QSAR models, it is necessary to keep all files, including all R scripts and all data files, in the same folder (directory), which should be specified as the current working directory.

4. The Basic Workflow for Building 3D-QSAR Models Using the CMF Approach

Two major modes of building and analyzing the performance of 3D-QSAR models are provided. The first one is based on a single splitting between a training set and an external (independent) test set. The model built on the training set is further applied to predict activity values of compounds contained in the test set. This mode requires the use of the following four input files:

The second mode is based on the procedure of external n-fold cross-validation. In this case the whole set of compounds is split into the training and test sets n times, so each compound appears in a test set exactly once. The 3D-QSAR model built on a training set is applied to the corresponding test set, and all prediction results are accumulated. This mode requires the use of the following two input files:

4.1. Aligning molecules

Molecules in a dataset can be aligned using two different approaches. If the set of compounds under study is congeneric, then it can be aligned by least-square fitting (algorithm arun) to a common template substructure, which should be contained in all molecules belonging to this set. Otherwise, alignment can be performed using the seal algorithm. If the first approach is chosen, it is first necessary to obtain the template substructure to be used for performing alignment. For example, it can be extracted from some molecule by specifying a list of serial numbers of atoms. This can be accomplished using the script cmf-do-make-template.R, in which the values of the following parameters can be specified:

The template substructure can further be used for performing molecular alignment. This can be carried out using the following script cmf-do-make-template:

# File name of molecular databased to be aligned
mdb_fname <- "ligands-train.mol2"

# Molecules from mdb_fname to be aligned
iimol <- c(1:72)

# File name of template
templ_fname <- "ligands-template.mol2"

# File name for aligned database
mdb_a_fname <- "ligands-aligned.mol2"

# Algorithm (arun/seal)
algorithm <- "arun"

mdb <- read_mol2(mdb_fname)
mdb_tmp <- read_mol2(templ_fname)
templ <- mdb_tmp[[1]]
if (algorithm == "arun") {
  mdb_a <- align_mdb_template(mdb, templ)
#  mdb_a <- align_mdb_template(mdb, templ, iimol)
} else if (algorithm == "seal") {
  mdb_a <- align_mdb_seal(mdb, templ)
} else {
  cat("Unknown algorithm\n")
}
write_mol2(mdb_a, mdb_a_fname)

in which the following parameters can be specified:

4.2. Computing kernels for the training set

The first step in building 3D-QSAR models using the CMF approach is to compute kernels for all pairs of compounds from the training set and to write them to a file. The kernels can be computed for five types of molecular fields:

All kernels are computed for the following fixed set of attenuation parameter: 0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 2.0, 3.0, 5.0, 10.0. To carry out this kernel computation procedure, one can run the computational R script cmf-do-comp-kernels-train:

# Training set file name
train_fname <- "ligands-train.mol2"

# The name of the file with kernels for training
kernels_train_fname <- "ligands-kernels-train.RData"

# Molecular fields
mfields <- c("q","vdw","logp","abra","abrb")

# Verbose computation of kernels (TRUE/FALSE, 1/0)
print_comp_kernels <- TRUE

comp_kernels_train(
  train_fname = train_fname, 
  kernels_train_fname = kernels_train_fname, 
  mfields = mfields, 
  print_comp_kernels = print_comp_kernels
)

The values of the following parameters may be specified by editing their default values in the script:

4.3. Building 3D-QSAR regression model using the training set

The second step is to build a 3D-QSAR model and write it to a file. This is performed by running the script cmf-do-krr-train:

# Activity file name
act_train_fname <- "activity-train.txt"

# Activity column number
act_colnum <- 1

# Separator
sep <- ","

# Kernels file name    
kernels_train_fname <- "ligands-kernels-train.RData"

# Model file name
model_fname <- "ligands-model.RData"

# Types of molecular fields
mfields <- c("q","vdw","logp","abra","abrb")

# Whether to perform grid search for attenuation factor alpha (TRUE/FALSE, 1/0).
alpha_grid_search <- TRUE

# Whether to perform grid search for Tikhonov regularization coefficient gamma (TRUE/FALSE, 1/0).
gamma_grid_search <- FALSE

# Whether to form tconic kernel combination 
conic_kernel_combination <- FALSE

# Whether to optimize h (TRUE/FALSE, 1/0)
optimize_h <- FALSE

# Whether to set b=0 (TRUE/FALSE, 1/0)
set_b_0 <- FALSE

# Print intermediate results in the course of optimiization (TRUE/FALSE, 1/0) 
print_interm_icv <- TRUE                   

# Produce intermediate scatter plots in the course of optimization  (TRUE/FALSE, 1/0)
plot_interm_icv <- TRUE                   

# Print final results on interna; cross-validation (TRUE/FALSE, 1/0)
print_final_icv <- TRUE

# Produce final scatter plot for internal cross-valudation (TRUE/FALSE, 1/0)                     
plot_final_icv <- TRUE                       

cmf_krr_train(
  act_train_fname = act_train_fname, 
  act_colnum = act_colnum, 
  sep = sep, 
  kernels_train_fname = kernels_train_fname, 
  model_fname = model_fname, 
  mfields = mfields, 
  alpha_grid_search = alpha_grid_search,
  gamma_grid_search = gamma_grid_search, 
  conic_kernel_combination = conic_kernel_combination, 
  optimize_h = optimize_h, 
  set_b_0 = set_b_0,
  print_interm_icv = print_interm_icv,
  plot_interm_icv = plot_interm_icv,
  print_final_icv = print_final_icv,
  plot_final_icv = plot_final_icv
)

which implements the kernel ridge regression (KRR) algorithm. The values of the following parameters can be specified by changing their default values in this script:

4.4. The Use of 3D-QSAR Model for Predicting Activities of Test Set Compounds

Predicting activities of compounds belonging to the external test set is carried out in two steps. In the first step, one should compute kernels comparing molecular fields of all compounds belonging to the test set with molecular fields of all compounds belonging to the training set. To do that, it is necessary to run script cmf-do-comp-kernels-pred :

# Training set file name
train_fname <- "ligands-train.mol2"

# Prediction set file name
pred_fname <- "ligands-pred.mol2"

# Computed kernels file name
kernels_pred_fname <- "ligands-kernels-pred.RData"

# Molecular fields
mfields <- c("q", "vdw", "logp", "abra", "abrb")

# Verbose computation of kernels (TRUE/FALSE, 1/0)
print_comp_kernels <- TRUE

comp_kernels_pred(
  train_fname = train_fname, 
  pred_fname = pred_fname, 
  kernels_pred_fname = kernels_pred_fname, 
  mfields = mfields, 
  print_comp_kernels = print_comp_kernels
)

The values of the following parameters may be modified in the script: - train_fname - the name of the mol2 file containing the structures from the training set (default: “ligands-train.mol2”); - pred_fname - the name of the mol2 file containing the structures from the test set (default: “ligands-pred.mol2”); - kernels_pred_fname - file name for storing computed kernels (default: “ligands-kernels-pred.mol2”); - mfields – the list of molecular fields for which kernels should be computed (default: c("q","vdw","logp","abra","abrb"));

The second step is to make predictions by running the script cmf-do-krr-pred :

# Prediction for KRR models with continuous molecular fields


# Model file name
model_fname <- "ligands-model.RData"

# Kernels for training file name
kernels_train_fname <- "ligands-kernels-train.RData"

# Kernels for prediction file name
kernels_pred_fname <- "ligands-kernels-pred.RData"

# Column name or activity (if specified)
act_colnum <- 1

# Separator
sep <- ","

# Activity for prediction set (if specified)
act_pred_fname <- "activity-pred.txt"

# File with predictions
pred_fname <- "ligands-pred.RData"

# Print statistical parameters for prediction (TRUE/FALSE, 1/0)
print_pred <- TRUE

# Produce scatter plot prediction on external database (TRUE/FALSE, 1/0)
plot_pred <- TRUE    

res <- cmf_krr_pred(
  model_fname = model_fname, 
  kernels_train_fname = kernels_train_fname,
  kernels_pred_fname = kernels_pred_fname, 
  act_colnum = act_colnum, 
  sep = sep, 
  act_pred_fname = act_pred_fname,
  pred_fname = pred_fname,
  print_pred = print_pred,
  plot_pred = plot_pred
)

The results of prediction is written to the file referenced by the variable pred_fname. The values of the following parameters may be modified in the script:

4.5. Building 3D-QSAR models using external n-fold cross-validation

Before building 3D-QSAR models using external n-fold cross-validation, it is necessary to compute kernels for the whole dataset. This can be done using the script cmf-do-comp-kernels-all:

# The whole set file name
all_fname <- "ligands-all.mol2"

# The name of the files containing kernels for all molecules
kernels_all_fname <- "ligands-kernels-all.RData"

# Molecular fields
mfields <- c("q","vdw","logp","abra","abrb")

# Verbose computation of kernels (TRUE/FALSE, 1/0)
print_comp_kernels <- TRUE

comp_kernels_all(
  all_fname = all_fname, 
  kernels_all_fname = kernels_all_fname, 
  mfields = mfields, 
  print_comp_kernels = print_comp_kernels
)

The following parameters in it can be specified:

# The number of folds in n-folds cross-validation
nfolds <- 5

# Activity file name
act_all_fname <- "activity-all.txt"

# Activity column number
act_colnum <- 1

# Separator
sep <- ","

# Kernels file name    
kernels_all_fname <- "ligands-kernels-all.RData"

# Types of molecular fields
mfields <- c("q","vdw","logp","abra","abrb")

# The name of the file with external cross-validation results
ecv_fname <- "ligands-ecv.RData"

# Whether to perform grid search for attenuation factor alpha (TRUE/FALSE, 1/0).
alpha_grid_search <- TRUE

# Whether to perform grid search for gamma (TRUE/FALSE, 1/0).
gamma_grid_search <- FALSE

# Whether to form tconic kernel combination 
# (with positive mixing coefficients) (TRUE/FALSE, 1/0)
# Conic combination preserves the positive definiteness of kernel matrices
# but leads to models with worse predictive performance assessed unsing cross-validation
conic_kernel_combination <- FALSE

# Whether to optimize h (TRUE/FALSE)
optimize_h <- FALSE

# Whether to set b=0 (TRUE/FALSE, 1/0)
set_b_0 <- FALSE

# Print statistical parameters of the external n-folds cross-validation procedure (TRUE/FALSE, 1/0)
print_ecv <- TRUE

# Produce scatter plot for external coss-validation (TRUE/FALSE, 1/0)
plot_ecv <- TRUE

# Print intermediate results in the course of optimiization (TRUE/FALSE, 1/0) 
print_interm_icv <- TRUE                   

# Produce intermediate scatter plots in the course of optimization  (TRUE/FALSE, 1/0)
plot_interm_icv <- TRUE                   

# Print final results on interna; cross-validation (TRUE/FALSE, 1/0)
print_final_icv <- TRUE

# Produce final scatter plot for internal cross-valudation (TRUE/FALSE, 1/0)                     
plot_final_icv <- TRUE                       

# Print statistical parameters for prediction (TRUE/FALSE, 1/0)
print_pred <- TRUE

# Produce scatter plot prediction on external database (TRUE/FALSE, 1/0)
plot_pred <- TRUE    

ecv <- cmf_krr_ecv(
  nfolds = nfolds, 
  act_all_fname = act_all_fname, 
  act_colnum = act_colnum, 
  sep = sep, 
  kernels_all_fname = kernels_all_fname, 
  mfields = mfields,
  ecv_fname = ecv_fname,  
  alpha_grid_search = alpha_grid_search,
  gamma_grid_search = gamma_grid_search, 
  conic_kernel_combination = conic_kernel_combination, 
  optimize_h = optimize_h, 
  set_b_0 = set_b_0,
  print_ecv = print_ecv,
  plot_ecv = plot_ecv,
  print_interm_icv = print_interm_icv,
  plot_interm_icv = plot_interm_icv,
  print_final_icv = print_final_icv,
  plot_final_icv = plot_final_icv,
  print_pred = print_pred,
  plot_pred = plot_pred
)

in which the following parameters can be changed:

# The number of folds for n-folds cross-validation
nfolds <- 5

# The number of reshuffles
nreshuffles <- 20

# Activity file name
act_all_fname <- "activity-train.txt"

# Activity column number
act_colnum <- 1

# Separator
sep <- ","

# Kernels file name    
kernels_all_fname <- "ligands-kernels-train.RData"

# Types of molecular fields
mfields <- c("q","vdw","logp","abra","abrb")

# File name for reshuffled cross-validation results
ecvr_fname <- "ligands-ecvr-train.RData"

# Whether to perform grid search for attenuation factor alpha (TRUE/FALSE, 1/0).
alpha_grid_search <- FALSE

# Whether to perform grid search for gamma (TRUE/FALSE, 1/0).
gamma_grid_search <- FALSE

# Whether to form tconic kernel combination 
# (with positive mixing coefficients) (TRUE/FALSE, 1/0)
# Conic combination preserves the positive definiteness of kernel matrices
# but leads to models with worse predictive performance assessed unsing cross-validation
conic_kernel_combination <- FALSE

# Whether to optimize h (TRUE/FALSE)
optimize_h <- TRUE

# Whether to set b=0 (TRUE/FALSE, 1/0)
set_b_0 <- FALSE

# Seed for random number generator (-1 - no seed)
seed <- 0

# Print statistical parameters of the reshuffled external n-folds cross-validation procedure (TRUE/FALSE, 1/0)
print_ecvr <- TRUE

# Produce scatter plot for reshuffled external coss-validation (TRUE/FALSE, 1/0)
plot_ecvr <- TRUE

# Print statistical parameters of the external n-folds cross-validation procedure (TRUE/FALSE, 1/0)
print_ecv <- TRUE

# Produce scatter plot for external coss-validation (TRUE/FALSE, 1/0)
plot_ecv <- TRUE

# Print intermediate results in the course of optimiization (TRUE/FALSE, 1/0) 
print_interm_icv <- FALSE                   

# Produce intermediate scatter plots in the course of optimization  (TRUE/FALSE, 1/0)
plot_interm_icv <- FALSE                  

# Print final results on interna; cross-validation (TRUE/FALSE, 1/0)
print_final_icv <- FALSE

# Produce final scatter plot for internal cross-valudation (TRUE/FALSE, 1/0)                     
plot_final_icv <- FALSE                    

# Print statistical parameters for prediction (TRUE/FALSE, 1/0)
print_pred <- FALSE

# Produce scatter plot prediction on external database (TRUE/FALSE, 1/0)
plot_pred <- FALSE  

ecvr <- cmf_krr_ecvr(
  nfolds = nfolds, 
  nreshuffles = nreshuffles, 
  act_all_fname = act_all_fname, 
  act_colnum = act_colnum, 
  sep = sep, 
  kernels_all_fname = kernels_all_fname, 
  mfields = mfields,
  ecvr_fname = ecvr_fname,
  alpha_grid_search = alpha_grid_search,
  gamma_grid_search = gamma_grid_search, 
  conic_kernel_combination = conic_kernel_combination, 
  optimize_h = optimize_h, 
  set_b_0 = set_b_0,
  seed = seed,
  print_ecvr = print_ecvr,
  plot_ecvr = plot_ecvr,
  print_ecv = print_ecv,
  plot_ecv = plot_ecv,
  print_interm_icv = print_interm_icv,
  plot_interm_icv = plot_interm_icv,
  print_final_icv = print_final_icv,
  plot_final_icv = plot_final_icv,
  print_pred = print_pred,
  plot_pred = plot_pred
)

in which the following parameters can be changed:

# File name for reshuffled external cross-validation
ecvr_fname <- "ligands-ecvr.RData"

# Kernels for training file name
kernels_train_fname <- "ligands-kernels-train.RData"

# Kernels for prediction file name
kernels_pred_fname <- "ligands-kernels-pred.RData"

# Column name or activity (if specified)
act_colnum <- 1

# Separator
sep <- ","

# Activity for prediction set (if specified)
act_pred_fname <- "activity-pred.txt"

# File with predictions
pred_fname <- "ligands-ecvr-pred.RData"

# Print statistical parameters for prediction (TRUE/FALSE, 1/0)
print_pred <- TRUE

# Produce scatter plot prediction on external database (TRUE/FALSE, 1/0)
plot_pred <- TRUE    

ecvr_pred <- cmf_krr_ecvr_pred(
  ecvr_fname = ecvr_fname, 
  kernels_train_fname = kernels_train_fname,
  kernels_pred_fname = kernels_pred_fname, 
  act_colnum = act_colnum, 
  sep = sep, 
  act_pred_fname = act_pred_fname,
  pred_fname = pred_fname,
  print_pred = print_pred,
  plot_pred = plot_pred
)

in which the following parameters can be changed:

# Visualisation of prediction histogram from multiple predictions m
# made by external cross-validation with reshuffling (ecvr)

# File with predictions
pred_fname <- "ligands-ecvr-pred.RData"

# Molecule to predict
imol <- 1

load(pred_fname)
hist(ecvr_pred$YPred[imol,], main="Prediction Histogram", xlab="Predicted Value", ylab="Frequency")

in which the following parameters can be changed:

4.6. Producing Scatter Plots

The script cmf-do-krr-train:

# KRR with continuous molecular fields and optimization of  gamma and h.
# alpha is optimized by grid search
# Any combination of molecular fields is supported

# Activity file name
act_train_fname <- "activity-train.txt"

# Activity column number
act_colnum <- 1

# Separator
sep <- ","

# Kernels file name    
kernels_train_fname <- "ligands-kernels-train.RData"

# Model file name
model_fname <- "ligands-model.RData"

# Types of molecular fields
mfields <- c("q","vdw","logp","abra","abrb")

# Whether to perform grid search for attenuation factor alpha (TRUE/FALSE, 1/0).
alpha_grid_search <- TRUE

# Whether to perform grid search for Tikhonov regularization coefficient gamma (TRUE/FALSE, 1/0).
gamma_grid_search <- FALSE

# Whether to form tconic kernel combination 
conic_kernel_combination <- FALSE

# Whether to optimize h (TRUE/FALSE, 1/0)
optimize_h <- FALSE

# Whether to set b=0 (TRUE/FALSE, 1/0)
set_b_0 <- FALSE

# Print intermediate results in the course of optimiization (TRUE/FALSE, 1/0) 
print_interm_icv <- TRUE                   

# Produce intermediate scatter plots in the course of optimization  (TRUE/FALSE, 1/0)
plot_interm_icv <- TRUE                   

# Print final results on interna; cross-validation (TRUE/FALSE, 1/0)
print_final_icv <- TRUE

# Produce final scatter plot for internal cross-valudation (TRUE/FALSE, 1/0)                     
plot_final_icv <- TRUE                       

cmf_krr_train(
  act_train_fname = act_train_fname, 
  act_colnum = act_colnum, 
  sep = sep, 
  kernels_train_fname = kernels_train_fname, 
  model_fname = model_fname, 
  mfields = mfields, 
  alpha_grid_search = alpha_grid_search,
  gamma_grid_search = gamma_grid_search, 
  conic_kernel_combination = conic_kernel_combination, 
  optimize_h = optimize_h, 
  set_b_0 = set_b_0,
  print_interm_icv = print_interm_icv,
  plot_interm_icv = plot_interm_icv,
  print_final_icv = print_final_icv,
  plot_final_icv = plot_final_icv
)

and cmf-do-krr-pred:

# Model file name
model_fname <- "ligands-model.RData"

# Kernels for training file name
kernels_train_fname <- "ligands-kernels-train.RData"

# Kernels for prediction file name
kernels_pred_fname <- "ligands-kernels-pred.RData"

# Column name or activity (if specified)
act_colnum <- 1

# Separator
sep <- ","

# Activity for prediction set (if specified)
act_pred_fname <- "activity-pred.txt"

# File with predictions
pred_fname <- "ligands-pred.RData"

# Print statistical parameters for prediction (TRUE/FALSE, 1/0)
print_pred <- TRUE

# Produce scatter plot prediction on external database (TRUE/FALSE, 1/0)
plot_pred <- TRUE    

res <- cmf_krr_pred(
  model_fname = model_fname, 
  kernels_train_fname = kernels_train_fname,
  kernels_pred_fname = kernels_pred_fname, 
  act_colnum = act_colnum, 
  sep = sep, 
  act_pred_fname = act_pred_fname,
  pred_fname = pred_fname,
  print_pred = print_pred,
  plot_pred = plot_pred
)

not only build 3D-QSAR models and use them to make predictions, but also produce the corresponding scatter plots.

Meanwhile, if it is necessary to reproduce these scatter plots, two computational R scripts can be used for this purpose:

The script cmf-do-plot-cv:

# Label for axis X
xlab <- "Predicted"

# Label for axis Y
ylab <- "Experimental"

# Margin for points in all directions
margin <- 0.5

load(fmodel)

cinf_plotxy(
  model$y_pred_cv, 
  model$y_exp, 
  xlab = xlab, 
  ylab = ylab, 
  margin = margin,
  main = main
)

produces the scatter plot Predicted/Expetimental for the results of internal cross-validation. The values of the following parameters may be modified in the script:

and

The script cmf-do-plot-pred produces scatter plot Predicted/Experimental for the results of predictions performed on external test set. The values of the following parameters may be modified in the script:

# Script for drawing scatter plot for predicting test set


# File with predictions
fpred <- "ligands-pred.RData"

# Main title
main <- "Scatter Plot for Prediction"

# Label for axis X
xlab <- "Predicted"

# Label for axis Y
ylab <- "Experimental"

# Margin for points in all directions
margin <- 0.5

load(fpred)
cinf_plotxy(
  pred$y_pred, 
  pred$y_exp, 
  xlab=xlab, 
  ylab=ylab, 
  margin=margin, 
  main=main
)

4.7. Drawing Iso-surfaces for the Fields of Regression Coefficients and their Overlap with Molecular Fields

The fields of regression coefficients can be visualized with the help of iso-surfaces (colored according to the sign of coefficients and the type of molecular field). The procedure for constructing such iso-surfaces includes two stages. In the first stage the values of regression coefficients are calculated on all nodes in an imaginary grid using the R computational script cmf-do-gen-grid:

# Generation of grid for continuous molecular co-fields

# Training set file name
train_fname <- "ligands-train.mol2"

# Computed kernels file name
kernels_fname <- "ligands-kernels-train.RData"

# Model file name
model_fname <- "ligands-model.RData"

# Grid with regression coefficients file name
grid_fname <- "ligands-grid-krr.RData" 

cmf_gen_grid(
  train_fname = train_fname, 
  kernels_fname = kernels_fname, 
  model_fname = model_fname, 
  grid_fname = grid_fname
)

All results are written to file. The values of the following parameters may be modified in the script:

# View molecule with co-field


# Molecular database file name
mdb_fname <- "ligands-train.mol2"

# Molecule to visualize
imol <- 1

# Field type 
ft <- "vdw"

# Grid with model coefficients file name
grid_fname <- "ligands-grid-krr.RData" 

# Isosurface level
rlevel <- 0.5

# Alpha (non-transperancy) level
alpha_g <- 0.7

cmf_view_mol_field_cofield(
  mdb_fname = mdb_fname, 
  imol = imol, 
  ft = ft, 
  grid_fname = grid_fname, 
  rlevel = rlevel, 
  alpha_g = alpha_g,
  draw_field = FALSE,
  draw_cofield = TRUE
)

The values of the following parameters may be modified in the script:

# View overlapping fields of regression coefficients and molecular fields

# Molecular database file name
mdb_fname <- "ligands-train.mol2"

# Molecule to visualize
imol <- 1

# Field type 
ft <- "vdw"

# Grid with regression coefficients file name
grid_fname <- "ligands-grid-krr.RData" 

# Take alpha from model? (TRUE/FALSE)
alpha_from_model <- TRUE

# Model file name
model_fname <- "ligands-model.RData"

# Alpha value (if not taken from model)
alpha <- 1.0

# Isosurface level
rlevel <- 0.5

# Alpha (non-transperancy) level
alpha_g <- 0.5

# Whether to draw overlap between field and co-field
draw_overlap <- TRUE

cmf_view_mol_field_cofield(
  mdb_fname = mdb_fname, 
  imol = imol, 
  ft = ft, 
  grid_fname = grid_fname, 
  alpha_from_model = alpha_from_model, 
  model_fname = model_fname, 
  alpha = alpha, 
  rlevel = rlevel, 
  alpha_g = alpha_g,
  draw_field = TRUE,
  draw_cofield = TRUE,
  draw_overlap = draw_overlap
)

One can change the default values of the following parameters in this script:

Molecular fields without fields of regression coefficients can be visualized using the below computational R script cmf-do-view-mol-field:

# View overlapping fields of regression coefficients and molecular fields

# Molecular database file name
mdb_fname <- "ligands-train.mol2"

# Molecule to visualize
imol <- 1

# Field type 
ft <- "vdw"

# Take alpha from model? (TRUE/FALSE)
alpha_from_model <- TRUE

# Model file name
model_fname <- "ligands-model.RData"

# Alpha value (if not taken from model)
alpha <- 1.0

# Isosurface level
rlevel <- 0.5

# Alpha (non-transperancy) level
alpha_g <- 0.5

cmf_view_mol_field_cofield(
  mdb_fname = mdb_fname, 
  imol = imol, 
  ft = ft, 
  grid_fname = grid_fname, 
  alpha_from_model = alpha_from_model, 
  model_fname = model_fname, 
  alpha = alpha, 
  rlevel = rlevel, 
  alpha_g = alpha_g,
  draw_field = TRUE,
  draw_cofield = FALSE
)

Default values of the following parameters can be changed in it:

4.8. Analyzing Contributions to Predicted Activity Values

Contributions of different types of molecular fields and different training examples to predicted activity (property) values of compounds belonging to the test set can be calculated using the script cmd-do- anal-contrib-pred. The results of the analysis can be read from the variable a. The default values of the following parameters can be changes:

# Prediction with analysis for the test set

# Model file name
model_fname <- "ligands-model.RData"

# Kernels file name
kernels_pred_fname <- "ligands-kernels-pred.RData"

# Column name or activity (if specified)
act_colnum <- 2

# Separator
sep <- ","

# Activity for prediction set (if specified)
act_pred_fname <- "activity-pred.txt"

anal <- cmf_pred_anal(
  model_fname = model_fname, 
  kernels_pred_fname = kernels_pred_fname, 
  act_colnum = act_colnum, 
  sep = sep, 
  act_pred_fname = act_pred_fname,
  is_train = FALSE
)

The results of the analysis can be read from the variable anal. The default values of the following parameters can be changes:

Contributions of individual atoms to predicted activity (property) values of compounds belonging to the test set can be calculated using the script cmf-do-anal-contrib-atoms-pred:

# Analysis of atomic contributions for the test set


# Training set file name
train_fname <- "ligands-train.mol2"

# Prediction set file name
pred_fname <- "ligands-pred.mol2"

# Molecule from file pred_fname to be analyzed
imol <- 1

# Model file name
model_fname <- "ligands-model.RData"

cmf_pred_anal_atoms(
  train_fname = train_fname, 
  pred_fname = pred_fname, 
  imol = imol, 
  model_fname = model_fname
)

The default values of the following parameters can be changes:

# Analysis of atomic contributions in molecules from the training set


# Training set file name
train_fname <- "ligands-train.mol2"

# Molecule from file train_fname to be analyzed
imol <- 1

# Model file name
model_fname <- "ligands-model.RData"

cmf_pred_anal_atoms(
  train_fname = train_fname, 
  pred_fname = train_fname, 
  imol = imol, 
  model_fname = model_fname
)

The default values of the following parameters can be changes:

4.9. Exploring Applicability Domain of 3D-QSAR Models

The package contains several tools for exploring applicability domains of 3D-QSAR model. If the standard deviation of prediction in external cross-validation with reshuffles is chosen as a distance to model parameter and mean prediction is used for computing prediction error, one can produce “coverage – mean error” and “distance to model - error” plots. The first one can be produced by running the script cmf-do-ecvr-plot-coverage-mean-error:

# Applicability domain from results of external cross-validation with reshuffles


# File name for results of external cross-validation with reshuffles
ecvr_fname <- "ligands-ecvr-train.RData"

# Coverage - mean error plot type ("in", "out", "diff")
# "in" - mean error is calculated inside applicability domain
# "out" - mean error is calculated outside applicability domain
# "diff" - difference of mean errors outside and inside applicability domain 
cmep_type <- "in"

# Whether to draw smoothing curve (TRUE/FALSE, 1/0)
smoothing <- TRUE  

cmf_ecvr_plot_coverage_mean_error(
  ecvr_fname = ecvr_fname, 
  cmep_type = cmep_type,
  smoothing = smoothing
)

, in which the following default values of parameters can be changed:

# Producing "distance to model - error" plot for applicability domain studies


# File name for resukts of external cross-validation with reshuffles
ecvr_fname <- "ligands-ecvr.RData"

# Whether to calculate mean error values for parts of the points (TRUE/FALSE, 1/0)
mean_parts <- TRUE  

# The number of parts for averaging
nparts <- 3   

# Color for drawing mean errors of parts
color_parts <- "red" 

cmf_ecvr_plot_d2m_error(
  ecvr_fname=ecvr_fname,
  mean_parts=mean_parts,
  nparts=nparts,
  color_parts=color_parts
)

, in which the following default values of parameters can be changed:



Try the conmolfields package in your browser

Any scripts or data that you put into this service are public.

conmolfields documentation built on May 2, 2019, 4:18 p.m.