QSPRpred-class: QSPRpred class
In iqspr: Inverse Molecular Design


Quantitative Structure-Properties Relationship (QSPR) model construction. This class contains all the required functions to train linear and non-linear models, to produce bootstrap datasets for variance estimation, and to provide prediction capabilities over a matrix or vector of studied properties.

`smis`	is a list of vectors of SMILES from which a regression model will be trained, or for which targeted properties will be predicted.
`prop`	is a list of vectors/matrices of available targeted physico-chemical properties for the training dataset.
`v_filterfunc`	defines the filtering function (NULL by default) to use in the computation of properties to filter.
`v_filtermin`	is a vector representing the expected minimal value for each filtered property.
`v_filtermax`	is a vector representing the expected maximal value for each filtered property.
`v_fnames`	is a vector, or a list of vectors, of fingerprints and/or physical descriptors types used as features for each regression model (see `get_descriptor` for an exhaustive list of available descriptors).
`v_scale`	sets (FALSE by default) the scaling of physical descriptors only (i.e. continuous features) - mean = 0, standard deviation = 1.
`v_func`	defines the analytic function (NULL by default), or a list of analytic functions, to use in the computation of a subsequent property, or properties respectively. A given function returns a new property computed analytically from a list of known properties in prop. This is particularly useful when data and regression models are available for some properties (e.g. A and B), but not for a targeted property of interest (e.g. A+B, A/B, etc.) for which constraints are defined via the set_target method.
`v_func_args`	is a vector, or a list of vectors, of integers that tags the used properties in prop for the computation of a subsequent property. For example, v_func=list(func1,func2), where func1 and func2 are a priori defined functions, and prop=list(V1,M23), where V1 is a numerical vector and M23 is a two columns matrix. In this case, v_func_args=list(c(1,3),c(2)), i.e. the function func1 uses the 1st and 3rd output properties located in prop, and func2 uses the 2nd only. Therefore, the defined empirical functions know where to find their inputs.
`kekulise`	enables (FALSE by default) electron checking and allows for parsing of incorrect SMILES (see `parse.smiles`).
`model`	is the name of a regression model to be used (see `get_Models` for an exhaustive list).
`params`	is a list of parameters to submit to a given regression model (see `get_Model_params` for examples).
`n_boot`	is the number of requested bootstrap datasets (10 in the model_training signature) in the training process. It is used to estimate the means and standard deviations of subsequent non-Bayesian predictions. A higher number of bootstrap datasets gives a more accurate estimate, at the cost of a longer computation time, so the user has to choose a suitable trade-off. To ease the bootstrap analysis, a parallelization capability is implemented (see `parallelize`).
`s_boot`	is the proportion of input data (0.85 by default), in the interval (0, 1], used to construct each bootstrap dataset.
`r_boot`	allows (FALSE by default) the sampling in a bootstrap analysis to be performed with replacement.
`parallelize`	enables (FALSE by default) parallel computation for a bootstrap analysis. When enabled, N-1 cores are used, with N the total number of cores available on the machine.
`v_propmin`	is a vector representing the expected minimal value for each targeted property.
`v_propmax`	is a vector representing the expected maximal value for each targeted property.
`temp`	is a vector/matrix of numerical values which sets the initial temperatures in the annealing process for the sequential Monte-Carlo sampler (see `vignette("tutorial", package = "iqspr")` for details).
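
The relationship between `v_func`, `v_func_args` and `prop` described above can be illustrated with a small base R sketch. It does not call iqspr: the data, the functions func1 and func2, and the way arguments are handed to them are all assumptions made for illustration, and the package may pass inputs differently.

```r
# Hypothetical training properties: V1 is property 1,
# M23 holds properties 2 and 3 as columns.
V1   <- c(1.0, 2.0, 3.0)
M23  <- cbind(c(10, 20, 30), c(0.5, 0.25, 0.125))
prop <- list(V1, M23)

# Hypothetical analytic functions; each receives a vector of selected properties.
func1 <- function(x) x[1] + x[2]   # e.g. A + B
func2 <- function(x) x[1]^2

v_func      <- list(func1, func2)
v_func_args <- list(c(1, 3), c(2)) # func1 uses properties 1 and 3, func2 uses property 2

# Bind all properties column-wise so the integer tags address columns.
Y <- do.call(cbind, prop)

# Apply each function row by row to its tagged columns.
derived <- sapply(seq_along(v_func), function(k) {
  apply(Y[, v_func_args[[k]], drop = FALSE], 1, v_func[[k]])
})
derived  # one column per derived property, one row per molecule
```

Each integer in `v_func_args` counts properties across the whole `prop` list, which is why the second column of M23 is addressed as property 3.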

propndim: is the number of properties received as input data.
propmin: is a vector representing the expected minimal value for each targeted property.
propmax: is a vector representing the expected maximal value for each targeted property.
filtermin: is a vector representing the expected minimal value for each filtered property.
filtermax: is a vector representing the expected maximal value for each filtered property.
filterfunc: is a function to compute the properties to filter.
X: is the nxd matrix, with d features for n input SMILES, returned by get_descriptor.
Y: is the nxp matrix of p properties for n input SMILES.
fnames: is a list of vectors of fingerprints and/or physical descriptors types used as features in each regression model by get_descriptor.
mdesc: is a scalar or vector of means used for physical descriptors scaling, returned by get_descriptor.
sddesc: is a scalar or vector of standard deviations used for physical descriptors scaling, returned by get_descriptor.
scale: tags the scaling statement (TRUE or FALSE) of the physical descriptors only (i.e. continuous features) - mean = 0, standard deviation = 1.
func: defines the analytic function to use in the computation of a subsequent property.
func_args: is a vector of integers that tags the used columns in the property array prop for the computation of a subsequent property.
trmodel: is the name of the used regression model for training and predictions.
trnboot: is the number of bootstrap datasets used for the training.
trndf: is the number of input SMILES, i.e. the number of degrees of freedom, available in the training of the regression process.
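
The role of the mdesc and sddesc fields can be sketched in base R as follows. This is an illustration of standard scaling (mean = 0, standard deviation = 1) under the assumption that the stored means and standard deviations are reused for new molecules; the descriptor names and values are hypothetical, and get_descriptor is not called.

```r
# Hypothetical continuous descriptors for three training molecules.
X <- cbind(mw = c(100, 150, 200), logp = c(1, 2, 3))

# Quantities playing the role of the mdesc and sddesc fields.
mdesc  <- colMeans(X)
sddesc <- apply(X, 2, sd)

# Centre and scale each column of the training descriptors.
X_scaled <- sweep(sweep(X, 2, mdesc, "-"), 2, sddesc, "/")

# A new molecule is scaled with the stored training statistics,
# not with statistics recomputed from the new data.
X_new        <- cbind(mw = 175, logp = 2.5)
X_new_scaled <- sweep(sweep(X_new, 2, mdesc, "-"), 2, sddesc, "/")
```

Keeping the training statistics is what makes predictions on new SMILES consistent with the model fitted on scaled features.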

get_features(): returns a list of nxd matrix X with d features for n input SMILES
get_props(): returns a list of nxp matrix Y of p properties for n input SMILES
init_env(smis = NULL, prop = matrix(0), v_filterfunc = NULL, v_filtermin = NULL, v_filtermax = NULL, v_fnames = NULL, v_scale = FALSE, v_func = NULL, v_func_args = NULL, kekulise = F): initializes the QSPR predictor; implicitly called via the QSPRpred$new() method
iqspr_predict(smis = NULL, temp = c(1, 1)): predicts properties for input SMILES from a given regression model and evaluates the probability to reach a targeted properties space
model_training(model = "linear_Bayes", params = NA, n_boot = 10, s_boot = 0.85, r_boot = F, parallelize = F): trains regression models with the given parameters, optionally using a bootstrap approach and CPU parallelization
qspr_predict(smis = NULL): predicts properties for input SMILES from a given regression model
set_target(v_propmin, v_propmax): sets the targeted properties space in vectors propmin and propmax
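
How n_boot, s_boot and r_boot combine to give prediction means and standard deviations can be sketched in base R. This is a generic resampling illustration, not the package's implementation: the simulated data and the use of lm as the regression model are assumptions.

```r
set.seed(1)

# Hypothetical one-feature training set.
n   <- 200
x   <- runif(n)
dat <- data.frame(x = x, y = 2 * x + rnorm(n, sd = 0.1))

n_boot <- 10     # number of bootstrap datasets
s_boot <- 0.85   # proportion of the data drawn for each dataset
r_boot <- FALSE  # sample without replacement

x_new <- data.frame(x = c(0.2, 0.8))

# Fit one model per resampled dataset and predict the new points.
preds <- sapply(seq_len(n_boot), function(b) {
  idx <- sample(n, size = floor(s_boot * n), replace = r_boot)
  fit <- lm(y ~ x, data = dat[idx, ])
  predict(fit, newdata = x_new)
})

# Summarise across the bootstrap models (rows = new points).
pred_mean <- rowMeans(preds)
pred_sd   <- apply(preds, 1, sd)
```

With n_boot = 1 the standard deviation is undefined, which is one reason a larger number of bootstrap datasets is needed for variance estimation.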

## Not run: 

# Load pre-existing data
data(qspr.data)
# Define input SMILES
smis <- paste(qspr.data[,1])
# Define associated properties
prop <- qspr.data[,c(2,5)]
# Define training set
trainidx <- sample(1:nrow(qspr.data), 5000)
# Initialize the prediction environment
# and compute fingerprints/descriptors associated to input SMILES
qsprpred_env <- QSPRpred()
qsprpred_env$init_env(smis=smis[trainidx], prop=as.matrix(prop[trainidx,]), v_fnames="graph")
# Train a regression model with associated parameters,
# using 10 bootstrap datasets and no CPU parallelization
qsprpred_env$model_training(model="elasticnet",params=list("alpha" = 0.5),n_boot=10,parallelize=FALSE)

# Predict properties for a test set
predictions <- qsprpred_env$qspr_predict(smis[-trainidx])
# Plot the results
par(mfrow=c(1,2))
plot(predictions[[1]][1,], prop[-trainidx,1], xlab="prediction", ylab="true")
segments(-100,-100,1000,1000,col=2,lwd=2)
plot(predictions[[1]][2,], prop[-trainidx,2], xlab="prediction", ylab="true")
segments(-100,-100,1000,1000,col=2,lwd=2)

# Set a targeted properties space
qsprpred_env$set_target(c(8,100),c(9,200))
# Predict properties for any input SMILES
# and their probability to be close to the targeted properties space
inv_pred <- qsprpred_env$iqspr_predict(smis = smis[-trainidx], temp=c(3,3))

# See vignette("tutorial", package = "iqspr") for further options and details.


## End(Not run)
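
The idea of a probability to reach the targeted properties space, as set by set_target and evaluated by iqspr_predict, can be illustrated with a minimal base R sketch. It assumes independent Gaussian predictive distributions for each property and ignores the temp argument; the predicted means and standard deviations are hypothetical, and the package's actual computation may differ.

```r
# Hypothetical predictive means and standard deviations for two properties.
pred_mean <- c(8.4, 150)
pred_sd   <- c(0.5, 40)

# Targeted properties space, as in set_target(c(8,100), c(9,200)).
v_propmin <- c(8, 100)
v_propmax <- c(9, 200)

# Probability mass of each Gaussian inside its target interval.
p_each <- pnorm(v_propmax, pred_mean, pred_sd) -
          pnorm(v_propmin, pred_mean, pred_sd)

# Joint probability under the independence assumption.
p_joint <- prod(p_each)
```

Narrow target intervals or large predictive standard deviations both reduce this probability, which is why the bootstrap variance estimate matters for non-Bayesian models.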