# UtilOptimRegress: Optimization of predictions utility, cost or benefit for regression problems

In UBL: An Implementation of Re-Sampling Approaches to Utility-Based Learning for Both Classification and Regression Tasks

## Description

This function determines the optimal predictions given a utility, cost or benefit surface. This surface is obtained through a strategy specified by the user, together with its parameters. To determine the optimal predictions, an estimate of the conditional probability density function is obtained for each test case. If the surface provided is of type utility or benefit, a maximization process is carried out; if the user provides a cost surface, a minimization is performed.

## Usage

```r
UtilOptimRegress(form, train, test, type = "util", strat = "interpol",
                 strat.parms = list(method = "bilinear"), control.parms,
                 m.pts, minds, maxds, eps = 0.1)
```

## Arguments

- `form`: A formula describing the prediction problem.
- `train`: A data.frame with the training data.
- `test`: A data.frame with the test data.
- `type`: A character specifying the type of surface provided. Can be one of: "utility", "cost" or "benefit". Defaults to "utility".
- `strat`: A character determining the strategy for obtaining the surface of the problem. For now, only the interpolation strategy is available (the default).
- `strat.parms`: A named list containing the parameters necessary for the strategy previously specified. For the interpolation strategy (the default and only strategy available for now), the user must specify which method should be used for interpolating the points.
- `control.parms`: A named list with the control.parms defined through the function phi.control. These parameters establish the diagonal of the surface provided. If the type of surface defined is "cost", this parameter can be set to NULL, because in that case accurate predictions, i.e., points on the diagonal of the surface, are assumed to have zero cost. See examples.
- `m.pts`: A matrix with 3 columns, containing interpolation points that specify the utility, cost or benefit of the surface. The points should lie off the diagonal of the surface, i.e., the user should provide points where y != y.pred. The first column must contain the true value (y), the second column the corresponding prediction (y.pred), and the third column the utility, cost or benefit of that point (y, y.pred). The user should define as many points as possible; the minimum number of required points is two. More specifically, the user must always set the surface values of at least the points (minds, maxds) and (maxds, minds). See the minds and maxds description.
- `maxds`: The numeric upper bound of the target variable considered.
- `minds`: The numeric lower bound of the target variable considered.
- `eps`: Numeric value for the precision considered during the interpolation. Defaults to 0.1.

## Details

The optimization process carried out by this function uses a method for conditional density estimation proposed by Rau et al. (2015). Code for conditional density estimation (available on github https://github.com/MarkusMichaelRau/OrdinalClassification) was kindly contributed by M. M. Rau, with changes made by P. Branco. The optimization is achieved by generalizing the method proposed by Elkan (2001) for classification tasks. In regression, this process involves determining, for each test case, the maximum (for utility or benefit surfaces, or the minimum if we have a cost surface) of the integral of the product of the estimated conditional density function and the utility, benefit or cost surface. The optimal prediction for a case q is given by: y*(q) = argmax[z] ∫ pdf(y|q) · U(y,z) dy, where pdf(y|q) is the conditional density estimate for case q, and U(y,z) is the utility, benefit or cost surface evaluated at the true value y and the predicted value z.
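The argmax-integral above can be illustrated with a small numerical sketch. This is not UBL's implementation: the Gaussian conditional density and the toy utility surface below are assumptions made purely to show, for a single test case q, how the expected utility of each candidate prediction z is computed on a grid and then maximized.

```python
import numpy as np

# Sketch of y*(q) = argmax_z \int pdf(y|q) U(y, z) dy for one test case q,
# using a Riemann sum on a grid. pdf and U below are illustrative assumptions.

minds, maxds = 0.0, 50.0
y_grid = np.linspace(minds, maxds, 501)   # integration grid for the true value y
z_grid = np.linspace(minds, maxds, 501)   # candidate predictions z
dy = y_grid[1] - y_grid[0]

# toy conditional density estimate pdf(y|q): Gaussian centered at 30
pdf = np.exp(-0.5 * ((y_grid - 30.0) / 4.0) ** 2)
pdf /= pdf.sum() * dy                      # normalize so it integrates to 1

# toy utility surface U(y, z): highest on the diagonal, decaying off it
U = 1.0 - np.abs(y_grid[:, None] - z_grid[None, :]) / (maxds - minds)

# expected utility of each candidate z: Riemann sum over y
expected_util = (pdf[:, None] * U).sum(axis=0) * dy
y_star = z_grid[np.argmax(expected_util)]  # optimal prediction for case q
print(y_star)
```

For a cost surface the same quantity would be minimized (`np.argmin`) instead of maximized; UtilOptimRegress handles that switch through the `type` argument.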

## Value

The function returns a vector with the predictions for the test data optimized using the surface provided.

## Author(s)

Paula Branco [email protected], Rita Ribeiro [email protected] and Luis Torgo [email protected]

## References

Rau, M.M., Seitz, S., Brimioulle, F., Frank, E., Friedrich, O., Gruen, D. and Hoyle, B., 2015. Accurate photometric redshift probability density estimation-method comparison and application. Monthly Notices of the Royal Astronomical Society, 452(4), pp.3710-3725.

Elkan, C., 2001, August. The foundations of cost-sensitive learning. In International joint conference on artificial intelligence (Vol. 17, No. 1, pp. 973-978). LAWRENCE ERLBAUM ASSOCIATES LTD.

## See Also

phi.control, UtilOptimClassif, UtilInterpol
## Examples

```r
## Not run: 
library(UBL)
library(randomForest)  # needed for the randomForest() calls below

# Example using a utility surface:
data(Boston, package = "MASS")

tgt <- which(colnames(Boston) == "medv")
sp <- sample(1:nrow(Boston), as.integer(0.7*nrow(Boston)))
train <- Boston[sp,]
test <- Boston[-sp,]

control.parms <- phi.control(Boston[,tgt], method="extremes", extr.type="both")
# the boundaries of the domain considered
minds <- min(train[,tgt])
maxds <- max(train[,tgt])

# build m.pts to include at least (minds, maxds) and (maxds, minds) points
# m.pts must only contain points in [minds, maxds] range.
m.pts <- matrix(c(minds, maxds, -1, maxds, minds, -1),
                byrow=TRUE, ncol=3)

pred.res <- UtilOptimRegress(medv~., train, test, type = "util", strat = "interpol",
                             strat.parms = list(method = "bilinear"),
                             control.parms = control.parms,
                             m.pts = m.pts, minds = minds, maxds = maxds)

eval.util <- EvalRegressMetrics(test$medv, pred.res$optim, pred.res$utilRes,
                                thr = 0.8, control.parms = control.parms)

# train a normal model
model <- randomForest(medv~., train)
normal.preds <- predict(model, test)

# obtain the utility of the new points (trues, preds)
NormalUtil <- UtilInterpol(test$medv, normal.preds, type = "util",
                           control.parms = control.parms,
                           minds, maxds, m.pts, method = "bilinear")
# check the performance
eval.normal <- EvalRegressMetrics(test$medv, normal.preds, NormalUtil,
                                  thr = 0.8, control.parms = control.parms)

# check both results
eval.util
eval.normal

# check visually both predictions and the surface used
UtilInterpol(test$medv, normal.preds, type = "util", control.parms = control.parms,
             minds, maxds, m.pts, method = "bilinear", visual = TRUE)
points(test$medv, normal.preds, col = "green")
points(test$medv, pred.res$optim, col = "blue")

# another example now using points interpolation with splines
data(algae, package = "DMwR")
ds <- algae[complete.cases(algae[,1:12]), 1:12]
tgt <- which(colnames(ds) == "a1")
sp <- sample(1:nrow(ds), as.integer(0.7*nrow(ds)))
train <- ds[sp,]
test <- ds[-sp,]

control.parms <- phi.control(ds[,tgt], method = "extremes", extr.type = "both")

# the boundaries of the domain considered
minds <- min(train[,tgt])
maxds <- max(train[,tgt])

# build m.pts to include at least (minds, maxds) and (maxds, minds) points
m.pts <- matrix(c(minds, maxds, -1, maxds, minds, -1),
                byrow=TRUE, ncol=3)

pred.res <- UtilOptimRegress(a1~., train, test, type = "util", strat = "interpol",
                             strat.parms = list(method = "splines"),
                             control.parms = control.parms,
                             m.pts = m.pts, minds = minds, maxds = maxds)

# check the predictions
plot(test$a1, pred.res$optim)

# assess the performance
eval.util <- EvalRegressMetrics(test$a1, pred.res$optim, pred.res$utilRes,
                                thr = 0.8, control.parms = control.parms)
#
# train a normal model
model <- randomForest(a1~., train)
normal.preds <- predict(model, test)

# obtain the utility of the new points (trues, preds)
NormalUtil <- UtilInterpol(test$a1, normal.preds, type = "util",
                           control.parms = control.parms,
                           minds, maxds, m.pts, method = "splines")
# check the performance
eval.normal <- EvalRegressMetrics(test$a1, normal.preds, NormalUtil,
                                  thr = 0.8, control.parms = control.parms)

eval.util
eval.normal

# observe the utility surface with the normal preds
UtilInterpol(test$a1, normal.preds, type = "util", control.parms = control.parms,
             minds, maxds, m.pts, method = "splines", visual = TRUE)
# add the optim preds
points(test$a1, pred.res$optim, col = "green")

# Example using a cost surface:
data(Boston, package = "MASS")

tgt <- which(colnames(Boston) == "medv")
sp <- sample(1:nrow(Boston), as.integer(0.7*nrow(Boston)))
train <- Boston[sp,]
test <- Boston[-sp,]

# if using interpolation methods for a COST surface, control.parms can be set to NULL
# the boundaries of the domain considered
minds <- min(train[,tgt])
maxds <- max(train[,tgt])

# build m.pts to include at least (minds, maxds) and (maxds, minds) points
m.pts <- matrix(c(minds, maxds, 5, maxds, minds, 20),
                byrow=TRUE, ncol=3)

pred.res <- UtilOptimRegress(medv~., train, test, type = "cost", strat = "interpol",
                             strat.parms = list(method = "bilinear"),
                             control.parms = NULL,
                             m.pts = m.pts, minds = minds, maxds = maxds)

# check the predictions
plot(test$medv, pred.res$optim)
# assess the performance
eval.util <- EvalRegressMetrics(test$medv, pred.res$optim, pred.res$utilRes,
                                type = "cost", maxC = 20)
#
# train a normal model
model <- randomForest(medv~., train)
normal.preds <- predict(model, test)

# obtain the cost of the new points (trues, preds)
NormalUtil <- UtilInterpol(test$medv, normal.preds, type = "cost", control.parms = NULL,
                           minds, maxds, m.pts, method = "bilinear")
# check the performance
eval.normal <- EvalRegressMetrics(test$medv, normal.preds, NormalUtil,
                                  type = "cost", maxC = 20)
eval.normal
eval.util

# check visually the surface and the predictions
UtilInterpol(test$medv, normal.preds, type = "cost", control.parms = NULL,
             minds, maxds, m.pts, method = "bilinear", visual = TRUE)
points(test$medv, pred.res$optim, col = "blue")

## End(Not run)
```