Introduction to cmaRs"

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

This package is designed to construct Conic Multivariate Adaptive Regression Splines (CMARS) in R. CMARS model is a linear combination of basis functions that is taken from the forward part of the MARS algorithm. CMARS model parameters are obtained after solving Conic Quadratic Programming (CQP) which includes two cone constraints to handle accuracy and complexity of the model. Please check Weber, et al. (2011) CMARS: a new contribution to nonparametric regression with multivariate adaptive regression splines supported by continuous optimization. Inverse Problems in Science and Engineering, 2012, 20.3: 371-400. doi:10.1080/17415977.2011.62477, for more details.

Prediction and binary classification models can be constructed with cmaRs. Note that in order to construct CMARS models, both MOSEK software and Rmosek package needed. Please follow carefully the steps available in https://docs.mosek.com/latest/rmosek/install-interface.html for successful installation.

In order to construct CMARS models, a user can define the following arguments

library(cmaRs)

This package includes three data sets; preddata.std, classdata.std and table.b6. The first two data sets are taken from UCI: Machine Learning Repository (available at http://archive.ics.uci.edu/ml/), preprocessed and standardized. The first one is used for prediction and the other one is used for binary classification. Moreover, there is one more data set called as table.b6 which is directly taken from the "MPV" package (version 1.58) (Braun, and MacQueen, 2023).

data(preddata.std)
head(preddata.std)
data(classdata.std)
head(classdata.std)
data(table.b6)
head(table.b6)

The Prediction Modeling

As an example, the following prediction CMARS model for the data set trees can be constructed with the following snippets.

prediction.model <-  cmaRs(Volume ~ ., degree = 2, nk = 20, data = trees)

In order to study this model in detail, the summary function can be used. Note here that the CMARS model is determined with nk = 20 and degree = 2. Here, the final model contains six basis function in total with one interaction term.

summary(prediction.model)
#> Call:
#> cmaRs(formula = Volume ~ ., data = trees, degree = 2, nk = 20)

#> Volume = +29.1634                            
#> +4.9278 * pmax(0,Girth-14.2)                 
#> -3.2309 * pmax(0,14.2-Girth)                 
#> +0.7313 * pmax(0,Height-75)                  
#> -0.1684 * pmax(0,75-Height)                  
#> +0.1312 * pmax(0,Girth-8.3)*pmax(0,Height-75)
#> -1.2977 * pmax(0,Height-78)                  

#> R2 0.9793 r 0.9896 RSS 168.0029 

Some performance measures are also printed at the end of the output. For instance, the $R^2$, r and RSS values are given for the prediction models.

It is also possible to construct several graphs of a prediction model. Here, plot function helps for this.

fig1 <- plot(prediction.model)
knitr::include_graphics("Figure1.png")

Another function of this package is predict which calculates the fitted values of a CMARS model. An example is given below.

predict(prediction.model)
#> [1] 9.259021 9.386200 9.695543 16.703820 20.239753 20.164926 17.308760
#> [8] 18.824519 22.046053 19.470698 22.996184 21.255036 21.255036 20.075638
#> [15] 22.055412 24.794797 29.229259 31.136370 26.874258 26.018423 32.955379
#> [22] 34.096100 30.473296 37.528153 43.074254 52.021298 53.805352 54.756953
#> [29] 55.315354 55.315354 77.168803

The Classification Modeling

cmaRs can also construct binary classification models. An example is given below with its snippets.

library(earth)
library(cmaRs)
classification.model <- cmaRs(survived ~ age, nk = 35, 
                              classification = TRUE, data = etitanic)

Note that, the classification argument is set as TRUE here which indicates a binary classification model. Moreover, the degree argument is used as 1 which is its default value indicating a main effect model. This model is constructed by using only continuous variable in the data set, age. Similar to the previous example, the summary function presents the details of the model.

summary(classification.model)
#> Call:
#> cmaRs(formula = survived ~ age, data = etitanic, classification = TRUE,
#> nk = 35)
#>
#> survived = -4.9489
#> -0.3084 * pmax(0,age-18)
#> +0.3321 * pmax(0,18-age)
#> -0.427 * pmax(0,age-53)
#> +0.3564 * pmax(0,age-67)
#> -0.3291 * pmax(0,age-64)
#> +1.1664 * pmax(0,age-46)
#> +0.7742 * pmax(0,age-57)
#> -0.1451 * pmax(0,age-35)
#> -0.7469 * pmax(0,age-58)
#> +0.2288 * pmax(0,age-61)
#> +0.0725 * pmax(0,age-41)
#> -1.0824 * pmax(0,age-45)
#> -0.5147 * pmax(0,age-48)
#> +0.3658 * pmax(0,age-51)
#> +0.281 * pmax(0,age-44)
#> +0.0994 * pmax(0,age-34)
#> +0.6316 * pmax(0,age-2)
#> -0.32 * pmax(0,age-3)
#> 
#> AUC 0.6221 MCR 0.3681 PCC 0.6319 precision 0.6339 recall 0.895 specificity 0.2506

The model, here, includes terms with no interactions. Moreover, some performance measures for this binary classification are given in the last line. Some of them such as "Misclassification Rate" is calculated by taking the threshold as 0.5 as the default value.

Another value of the threshold can also be set by using the argument, threshold.class, available in the cmaRs function. The following example is given to examplify this.

classification.model.threshold <- cmaRs(survived ~ age, nk = 35, 
                              classification = TRUE, data = etitanic,
                              threshold.class = 0.1, Auto.linpreds = FALSE)

The following binary classification model follows the same construction steps but with different performance values.

summary(classification.model.threshold)
#> Call:
#> cmaRs(formula = survived ~ age, data = etitanic, classification = TRUE,
#> threshold.class = 0.1, nk = 35, Auto.linpreds = FALSE)
#>
#> survived = -4.9489
#> -0.3084 * pmax(0,age-18)
#> +0.3321 * pmax(0,18-age)
#> -0.427 * pmax(0,age-53)
#> +0.3564 * pmax(0,age-67)
#> -0.3291 * pmax(0,age-64)
#> +1.1664 * pmax(0,age-46)
#> +0.7742 * pmax(0,age-57)
#> -0.1451 * pmax(0,age-35)
#> -0.7469 * pmax(0,age-58)
#> +0.2288 * pmax(0,age-61)
#> +0.0725 * pmax(0,age-41)
#> -1.0824 * pmax(0,age-45)
#> -0.5147 * pmax(0,age-48)
#> +0.3658 * pmax(0,age-51)
#> +0.281 * pmax(0,age-44)
#> +0.0994 * pmax(0,age-34)
#> +0.6316 * pmax(0,age-2)
#> -0.32 * pmax(0,age-3)
#>
#> AUC 0.6221 MCR 0.5746 PCC 0.4254 precision 1 recall 0.0291 specificity 1

In addition to the performance values of the models, it is also possible to construct the ROC curve for the classification model using the plot function as follows.

fig2 <- plot(classification.model)
knitr::include_graphics("Figure2.png")

Components of cmaRs Object

One of the data sets, named as table.b6, inherited from the package MPV (Braun, 2018), with 28 observations and four predictor variables as x1, x2, x3 and x4 with the response variable; y.

data(table.b6)
head(table.b6)

The following snippets demonstrate how to fit and obtain a CMARS prediction model (and cmaRs object) including all predictor variables available in the data set.

model.ex1 <- cmaRs(y~., degree = 2, nk = 20, classification = FALSE,
Auto.linpreds = FALSE, data = table.b6)

summary(model.ex1)
# y= +0.003
# -0.003 * pmax(0,x4-0.0177)
# -0.0798 * pmax(0,0.0177-x4)
# 0 * pmax(0,x2-436.9)
# 0 * pmax(0,436.9-x2)
# -7e-04 * pmax(0,x1-0.0044)*pmax(0,436.9-x2)
# -2.6911 * pmax(0,x1-0.01)
# +0.5728 * pmax(0,x3-0.0186)
# +7e-04 * pmax(0,x1-0.0044)*pmax(0,x2-436.9)
# +2.6131 * pmax(0,x1-0.0106)
# -2e-04 * pmax(0,436.9-x2)*pmax(0,x4-0.0062)

From this point on, implementation details of cmaRs package are examplified. Ten BFs (i.e. nk = 10) built in the forward CMARS algorithm (see open source code; Step 1: Forward step) are listed below.

model.ex1$bf.cmars
# [1] "pmax(0,x4-0.0177)" "pmax(0,0.0177-x4)"
# [3] "pmax(0,x2-436.9)" "pmax(0,436.9-x2)"
# [5] "pmax(0,x1-0.0044)*pmax(0,436.9-x2)" "pmax(0,x1-0.01)"
# [7] "pmax(0,x3-0.0186)" "pmax(0,x1-0.0044)*pmax(0,x2-436.9)"
# [9] "pmax(0,x1-0.0106)" "pmax(0,436.9-x2)*pmax(0,x4-0.0062)"

Here, the BFs pmax(x4-0.0177), pmax(0.0177-x4), pmax(x2-436.9), pmax(436.9-x2), pmax(x1-0.01), pmax(x3-0.0186), and pmax(x1-0.0106) represent the main effects and the remaining BFs (the 5th BF, the 8th BF, the 10th BF) pmax(x1-0.0044)*pmax(436.9-x2), pmax(x1-0.0044)*pmax(x2-436.9), and pmax(436.9-x2)*pmax(x4-0.0062) show the interaction effects of two predictor variables.

The L matrix can be formed as an (11 × 11 in this model) dimensional matrix with the first column and first row having zero values, and the diagonal elements contain the $\sf{L_{m}}$ values for the corresponding BFs.

model.ex1$L
#     [,1]     [,2]      [,3]    [,4]     [,5]    [,6]        [,7]      [,8]
# [1,] 0  0.0000000 0.0000000 0.00000  0.00000  0.0000   0.0000000 0.0000000
# [2,] 0  0.3978693 0.0000000 0.00000  0.00000  0.0000   0.0000000 0.0000000
# [3,] 0  0.0000000 0.3339162 0.00000  0.00000  0.0000   0.0000000 0.0000000
# [4,] 0  0.0000000 0.0000000 10.15874 0.00000  0.0000   0.0000000 0.0000000
# [5,] 0  0.0000000 0.0000000 0.00000  19.45508 0.0000   0.0000000 0.0000000
# [6,] 0  0.0000000 0.0000000 0.00000  0.00000  134.2702 0.0000000 0.0000000
# [7,] 0  0.0000000 0.0000000 0.00000  0.00000  0.0000   0.3196873 0.0000000
# [8,] 0  0.0000000 0.0000000 0.00000  0.00000  0.0000   0.0000000 0.3170173
# [9,] 0  0.0000000 0.0000000 0.00000  0.00000  0.0000   0.0000000 0.0000000
# [10,] 0 0.0000000 0.0000000 0.00000  0.00000  0.0000   0.0000000 0.0000000
# [11,] 0 0.0000000 0.0000000 0.00000  0.00000  0.0000   0.0000000 0.0000000
#         [,9]        [,10]    [,11]
# [1,]  0.00000   0.0000000   0.0000
# [2,]  0.00000   0.0000000   0.0000
# [3,]  0.00000   0.0000000   0.0000
# [4,]  0.00000   0.0000000   0.0000
# [5,]  0.00000   0.0000000   0.0000
# [6,]  0.00000   0.0000000   0.0000
# [7,]  0.00000   0.0000000   0.0000
# [8,]  0.00000   0.0000000   0.0000
# [9,]  11.53763  0.0000000   0.0000
# [10,] 0.00000   0.3182766   0.0000
# [11,] 0.00000   0.0000000   78.2987

To illustrate some computational details, two BFs, the first (i.e. pmax(x4-0.0177)) and the tenth (i.e. pmax(436.9-x2)*pmax(x4-0.0062)), which represent main and interaction effects in the model, respectively, are selected. As stated beforehand, the cmaRs package automatically provides the derivatives matrix (DMS) and the variable matrix (VARMS) whose derivative exists in DMS matrix only for the last BF. In order to obtain these matrices for the other BFs, lets say, for the first one (i.e. pmax(x4-0.0177)), the code,

   bfs_forward_names <- bfs_forward_names[1:2]

should be inserted into the middle of the following two lines of R codes available in the cmaRs.fit function under the "#Step 1: Forward step" section as follows:

bfs_forward_names <- rownames(model_mars$dirs)
bfs_forward_names <- bfs_forward_names[1:2]
bfs_forward_names.orig <- bfs_forward_names

After constructing the model again, one can also obtain the derivatives for the first BF given as a symbolic matrix form (i.e. DMS) as follows.

model.ex1$DMS
# [,1] [,2] [,3] [,4]
# [1,] "0" "0" "0" "0"
# [2,] "0" "0" "0" "0"
# [3,] "0" "0" "0" "0"
# [4,] "0" "0" "0" "matrix(1, nrow <- (n+1), ncol <- 1)"

In addition, since this BF contains no interaction effect (i.e. just has one variable, x4), the first-order derivative of the function produces ones, and the second-order derivatives produces zeros with respect to the variable x4. The variable name whose derivative exists in the DMS matrix is obtained by the following matrix form, called VARMS.

model.ex1$VARMS
# [,1] [,2] [,3] [,4]
# [1,] "1" "1" "1" "1"
# [2,] "1" "1" "1" "1"
# [3,] "1" "1" "1" "1"
# [4,] "1" "1" "1" "xplus4"

On the other hand, the tenth BF contains an interaction effect between the predictors x2 and x4. The first-order derivatives of the tenth BF are (0.0062-xfirst4) with respect to the variable x2 and (436.9-xfirst2) with respect to the variable x4. The second-order derivative of the tenth BF is the

# matrix(-1, nrow <- (n+1), ncol <- 1)

with respect to both of the variables x2 and x4. Note here that in any case where the order of the derivative is greater than two, it vanishes and thus, the full derivative matrix form of the tenth BF turns out to be the following.

model.ex1$DMS
#      [,1] [,2]                 [,3]   [,4]
# [1,] "0"   "0"                  "0"   "0" 
# [2,] "0"   "0.0062 - xfirst4"   "0"   "matrix(-1, nrow <- (n+1), ncol <- 1)"
# [3,] "0"   "0"                  "0"   "0"
# [4,] "0"   "0"                  "0"   "436.9 - xfirst2"

The corresponding variable name matrix (i.e. VARMS matrix), simply provided by calling the model name as model.ex1, unless a change is made in the cmaRs.fit function.

model.ex1$VARMS
#     [,1] [,2]     [,3] [,4]
# [1,] "1" "1"      "1" "1"
# [2,] "1" "xplus2" "1" "1"
# [3,] "1" "1"      "1" "1"
# [4,] "1" "1"      "1" "xplus4"

The $\sf{L_{m}}$ (m = 1, 10) values corresponding to the first and the tenth BFs are $\sf{L_{1}}$ = 0.3978693 and $\sf{L_{10}}$ = 78.2987, respectively. After discritization is applied, which produces complete L matrix, the optimization part which solves the CQP takes place. For solving it, the package Rmosek (MOSEK-ApS, 2017) is utilized, and as a result, the model parameter estimates are obtained as follows.

model.ex1$coefficients
# [1] 3.038673e-03 -2.996649e-03 -7.977544e-02 -6.809132e-06 3.517929e-06
# [6] -6.991181e-04 -2.691055e+00 5.728424e-01 6.722162e-04 2.613055e+00
# [11] -1.701914e-04

Note here again that proper installation of MOSEK has a critical importance for successfully run of the package cmaRs, since Rmosek is used as the interface of it. Besides model parameter estimates, the cmaRs package also provides the fitted values belonging to the CMARS model built if needed.

> model.ex1$fitted.values
# [1] 0.0007575982 0.0003071999 0.0004176579 0.0005434535 0.0004821474 0.0004517915
# [7] 0.0004376478 0.0004160878 0.0013674034 0.0012119509 0.0012961460 0.0011833885
# [13] 0.0009870342 0.0012386077 0.0013057215 0.0011887701 0.0012214500 0.0017378222
# [19] 0.0017998247 0.0018951991 0.0026293088 0.0031847542 0.0033385298 0.0033184915
# [25] 0.0013479697 0.0024916551 0.0027357523 0.0026836362


Try the cmaRs package in your browser

Any scripts or data that you put into this service are public.

cmaRs documentation built on July 9, 2023, 5:17 p.m.