cvq2-package: Calculate the predictive squared correlation coefficient.


Description

This package compares observations with their predictions calculated by a model M. It calculates the predictive squared correlation coefficient, q^2, in comparison to the well-known conventional squared correlation coefficient, r^2.

Details

Package: cvq2
Type: Package
Version: 1.2.0
Date: 2013-10-10
Depends: methods, stats
License: GPL v3
LazyLoad: yes

This package needs either a description of parameters and observations (I) or a data set that already contains the observations and their related predictions (II). In case of (I), a linear model M is generated on the fly. Afterwards, its calibration performance can be compared with its prediction power. If the input data consist of observations and predictions only (II), the package can be used to compute either the calibration performance or the prediction power.

If model M is generated on the fly (I), the procedure is as follows: The input data set consists of parameters x_1, x_2, ..., x_n which describe an observation y. A linear regression (glm) of this data set yields M. Thus the conventional squared correlation coefficient, r^2, can be calculated:

r^2 = 1 - (SIGMA_i=1^N (y_i^fit - y_i)^2) / (SIGMA_i=1^N (y_i - y_mean)^2) ≡ 1 - RSS/SS

The numerator is the Residual Sum of Squares, RSS, the sum of squared differences between the fitted values y_i^fit predicted by M and the observations y_i. The denominator is the Sum of Squares, SS, and refers to the differences between the observations y_i and their mean y_mean.
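The calculation above can be sketched directly in plain R. This is only an illustrative sketch, not the package's implementation; the data are invented:

```r
# Toy data, invented for illustration
x   <- c(1.0, 2.0, 3.0, 4.0, 5.0)   # parameter x_1
obs <- c(1.2, 2.3, 3.1, 4.0, 5.2)   # observations y_i

M   <- glm(obs ~ x)                 # linear regression yields model M
fit <- fitted(M)                    # fitted values y_i^fit

RSS <- sum((fit - obs)^2)           # Residual Sum of Squares
SS  <- sum((obs - mean(obs))^2)     # Sum of Squares
r2  <- 1 - RSS / SS                 # conventional squared correlation coefficient
```

For an ordinary least-squares fit with intercept, this 1 - RSS/SS equals the squared correlation between observed and fitted values.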
To compare the calibration of M with its prediction power, M is applied to an external data set. It is called external because these data have not been used during the linear regression that generated M. The comparison of the predictions y_i^pred with the observations y_i yields the predictive squared correlation coefficient, q^2:

q^2 = 1 - (SIGMA_i=1^N (y_i^pred - y_i)^2) / (SIGMA_i=1^N (y_i - y_mean)^2) ≡ 1 - PRESS/SS

The PREdictive residual Sum of Squares, PRESS, is the sum of squared differences between the predictions y_i^pred and the observations y_i. The Sum of Squares, SS, refers to the differences between the observations y_i and their mean y_mean.
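Given an external set of observed and predicted values, q^2 follows the same pattern. Again a minimal sketch with invented values, not the package's own code:

```r
obs  <- c(2.1, 3.4, 4.2, 5.0)       # external observations y_i
pred <- c(2.0, 3.6, 4.1, 5.3)       # predictions y_i^pred made by M

PRESS <- sum((pred - obs)^2)        # PREdictive residual Sum of Squares
SS    <- sum((obs - mean(obs))^2)   # Sum of Squares
q2    <- 1 - PRESS / SS             # predictive squared correlation coefficient
```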

If no external data set is available, one can perform a cross-validation to evaluate the prediction performance. The cross-validation splits the model data set (N elements) into a training set (N-k elements) and a test set (k elements). Each training set yields an individual model M', which is used to predict the missing k value(s). Each model M' is slightly different from M. Thereby any observed value y_i is predicted exactly once, and the comparison between the observations and the predictions y_i^pred(N-k) yields q^2_cv:

q_cv^2 = 1 - (SIGMA_i=1^N (y_i^pred(N-k) - y_i)^2) / (SIGMA_i=1^N (y_i - y_mean^(N-k,i))^2)

The arithmetic mean used in this equation, y_mean^(N-k,i), is calculated individually for each test set from the observed values comprised in the corresponding training set.
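A minimal leave-one-out sketch (k = 1) of this procedure in plain R, using the training-set mean in the denominator as described above. The data are invented and the loop is only illustrative; the package handles this internally:

```r
d <- data.frame(x = c(1.0, 2.0, 3.0, 4.0, 5.0),
                y = c(1.2, 2.3, 3.1, 4.0, 5.2))

press <- 0
ss    <- 0
for (i in seq_len(nrow(d))) {
  Mi    <- glm(y ~ x, data = d[-i, ])        # model M' on the training set
  predi <- predict(Mi, newdata = d[i, ])     # predict the left-out value
  press <- press + (predi - d$y[i])^2
  ss    <- ss + (d$y[i] - mean(d$y[-i]))^2   # training-set mean y_mean^(N-k,i)
}
q2cv <- 1 - press / ss
```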

If k > 1, the compilation of training and test set may affect the calculation of the predictive squared correlation coefficient. To avoid such bias, one can repeat the calculation with various compilations of training and test set. Thus any observed value is predicted several times, according to the number of runs performed. Note that if the prediction performance is evaluated with cross-validation, the calculation of the predictive squared correlation coefficient, q^2, is more accurate than the calculation of the conventional squared correlation coefficient, r^2.

In addition to r^2 and q^2, the root mean square error, rmse, is calculated to measure the accuracy of model M:

rmse = sqrt( (SIGMA_i=1^N (y_i^pred - y_i)^2) / (N - ν) )

The rmse measures the differences between a model's predictions (y_i^pred) and the actual observations (y_i) and can be applied to both calibration and prediction power. It depends on the number of observations N and the method used to generate the model M. The plain rmse tends to overestimate the accuracy of M. Following Friedrich Bessel's suggestion [Upton and Cook 2008], this overestimation can be resolved by taking the degrees of freedom, ν, into account. Thus, in case of cross-validation, ν = 1 is recommended to calculate the rmse in relation to the prediction power. The degrees of freedom ν for the rmse regarding the prediction power can be set as a parameter for cvq2(), looq2() and q2(). In contrast, ν = 0 is fixed when calculating the rmse in relation to the model calibration.
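A small helper illustrating the role of ν, following the formula above. This is a sketch with invented values, not the package's internal function:

```r
# rmse with adjustable degrees of freedom nu
rmse <- function(obs, pred, nu = 0) {
  sqrt(sum((pred - obs)^2) / (length(obs) - nu))
}

obs  <- c(2.1, 3.4, 4.2, 5.0)
pred <- c(2.0, 3.6, 4.1, 5.3)

rmse(obs, pred, nu = 0)   # model calibration: nu = 0 is fixed
rmse(obs, pred, nu = 1)   # prediction power: nu = 1 recommended for cross-validation
```

With ν = 1 the denominator shrinks, so the corrected rmse is always slightly larger than the uncorrected one.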

If the input is a comparison of observed and predicted values only (II), r^2 and q^2, respectively, as well as their rmse are calculated immediately for these data. Neither a model M is generated nor a cross-validation applied.

Note

The package development started a few years ago in the Ecological Chemistry Department during my time at the Helmholtz Centre for Environmental Research in Leipzig. It is based on Schüürmann et al. 2008: External validation and prediction employing the predictive squared correlation coefficient - test set activity mean vs training set activity mean.

Author(s)

Torsten Thalheim <torstenthalheim@gmx.de>

References

  1. Cramer RD III. 1980. BC(DEF) Parameters. 2. An Empirical Structure-Based Scheme for the Prediction of Some Physical Properties. J. Am. Chem. Soc. 102: 1849-1859.

  2. Cramer RD III, Bunce JD, Patterson DE, Frank IE. 1988. Crossvalidation, Bootstrapping, and Partial Least Squares Compared with Multiple Linear Regression in Conventional QSAR Studies. Quant. Struct.-Act. Relat. 1988: 18-25.

  3. Organisation for Economic Co-operation and Development. 2007. Guidance document on the validation of (quantitative) structure-activity relationship [(Q)SAR] models. OECD Series on Testing and Assessment 69. OECD Document ENV/JM/MONO(2007)2, pp 55 (paragraph no. 198) and 65 (Table 5.7).

  4. Schüürmann G, Ebert R-U, Chen J, Wang B, Kühne R. 2008. External validation and prediction employing the predictive squared correlation coefficient - test set activity mean vs training set activity mean. J. Chem. Inf. Model. 48: 2140-2145.

  5. Tropsha A, Gramatica P, Gombar VK. 2003. The Importance of Being Earnest: Validation is the Absolute Essential for Successful Application and Interpretation of QSPR Models. QSAR Comb. Sci. 22: 69-77.

  6. Upton G, Cook I. 2008. Oxford Dictionary of Statistics. Oxford University Press. ISBN 978-0-19-954145-4, entry for "Variance (data)".

Examples

  library(cvq2)
  
  data(cvq2.sample.A)
  result <- cvq2( cvq2.sample.A, y ~ x1 + x2 )
  result
  
  data(cvq2.sample.B)
  result <- cvq2( cvq2.sample.B, y ~ x, nFold = 3 )
  result
  
  data(cvq2.sample.B)
  result <- cvq2( cvq2.sample.B, y ~ x, nFold = 3, nRun = 5 )
  result
  
  data(cvq2.sample.A)
  data(cvq2.sample.A_pred)
  result <- q2( cvq2.sample.A, cvq2.sample.A_pred, y ~ x1 + x2 )
  result
  
  data(cvq2.sample.C)
  result <- calibPow( cvq2.sample.C )
  result
  
  data(cvq2.sample.D)
  result <- predPow( cvq2.sample.D, obs_mean="observed_mean" )
  result

Example output

---- CALL ----
cvq2(modelData = cvq2.sample.A, formula = y ~ x1 + x2)

---- RESULTS ----

-- MODEL CALIBRATION (linear regression)
#Elements: 	4

mean (observed): 	3.0900
mean (predicted): 	3.0900
rmse (nu = 0): 		0.2441
r^2: 			0.9726

-- PREDICTION PERFORMANCE (cross validation)
#Runs: 				1
#Groups: 			4
#Elements Training Set: 	3
#Elements Test Set: 		1

mean (observed): 	3.0900
mean (predicted): 	3.1619
rmse (nu = 1): 		1.3286
q^2: 			0.6571

---- CALL ----
cvq2(modelData = cvq2.sample.B, formula = y ~ x, nFold = 3)

---- RESULTS ----

-- MODEL CALIBRATION (linear regression)
#Elements: 	6

mean (observed): 	5.4600
mean (predicted): 	5.4600
rmse (nu = 0): 		1.4989
r^2: 			0.8179

-- PREDICTION PERFORMANCE (cross validation)
#Runs: 				1
#Groups: 			3
#Elements Training Set: 	4
#Elements Test Set: 		2

mean (observed): 	5.4600
mean (predicted): 	5.0158
rmse (nu = 1): 		2.5830
q^2: 			0.6915

---- CALL ----
cvq2(modelData = cvq2.sample.B, formula = y ~ x, nFold = 3, nRun = 5)

---- RESULTS ----

-- MODEL CALIBRATION (linear regression)
#Elements: 	6

mean (observed): 	5.4600
mean (predicted): 	5.4600
rmse (nu = 0): 		1.4989
r^2: 			0.8179

-- PREDICTION PERFORMANCE (cross validation)
#Runs: 				5
#Groups: 			3
#Elements Training Set: 	4
#Elements Test Set: 		2

mean (observed): 	5.4600
mean (predicted): 	5.0263
rmse (nu = 1): 		2.5001
q^2: 			0.6639

---- CALL ----
q2(modelData = cvq2.sample.A, predictData = cvq2.sample.A_pred, 
    formula = y ~ x1 + x2)

---- RESULTS ----

-- MODEL CALIBRATION (linear regression)
#Elements: 	4

mean (observed): 	3.0900
mean (predicted): 	3.0900
rmse (nu = 0): 		0.2441
r^2: 			0.9726

-- PREDICTION PERFORMANCE (model and prediction set available)
#Elements Model Set: 		4
#Elements Prediction Set: 	4

mean (observed): 	10.9125
mean (predicted): 	10.8237
rmse (nu = 0): 		 0.2830
q^2: 			 0.9988

---- CALL ----
calibPow(data = cvq2.sample.C)

---- RESULTS ----

-- MODEL CALIBRATION (linear regression)
#Elements: 	4

mean (observed): 	11.1050
mean (predicted): 	10.9075
rmse (nu = 0): 		 0.2933
r^2: 			 0.9873

---- CALL ----
predPow(data = cvq2.sample.D, obs_mean = "observed_mean")

---- RESULTS ----


-- PREDICTION PERFORMANCE (model and prediction set available)
mean (observed): 	11.1050
mean (predicted): 	10.9075
rmse (nu = 0): 		 0.2933
q^2: 			 0.9928

cvq2 documentation built on May 2, 2019, 8:29 a.m.