In krystian8207/xspliner: Assisted Model Building, using Surrogate Black-Box Models to Train Interpretable Spline Based Additive Models

This sections provides some additional information and features that xspliner provides.

Monotonic splines approximation

For qualitative variables only

In some cases you may want to transform variables with monotonic function. xspliner provides an option for monotonic spline approximation. You just need to specify monotonic parameter for the local, or global xs transition. It actually can have 4 values:

"not" the default one. Monotonicity is not required
"up" approximation is increasing function
"down" approximation is decreasing function
"auto" compare increasing and decreasing approximation and chooses better one (basing on $R^2$ statistic)

Let's see below example:

library(randomForest)
library(pdp)
library(xspliner)
data(boston)
set.seed(123)
boston_rf <- randomForest(cmedv ~ lstat + ptratio + age, data = boston)
model_xs <- xspline(
  cmedv ~
    xs(lstat, transition = list(k = 6), effect = list(type = "pdp", grid.resolution = 100)) +
    xs(ptratio, transition = list(k = 5), effect = list(type = "pdp", grid.resolution = 100)) +
    age,
  model = boston_rf,
  xs_opts = list(transition = list(monotonic = "auto"))
)

plot(model_xs, "ptratio", plot_deriv = TRUE)
plot(model_xs, "lstat", plot_deriv = TRUE)

Choose if approximation is better

For qualitative variables only

When the response function has linear form, approximating it with splines may make the result worse. xspline function offers automatic check if the spline approximation is better than linear one, and use it in the final model.

You may find two parameters responsible for that:

alter - The sub-parameter of transition. We already know how this parameter works for "always" and "never" values. There is also the third option, "auto". In this case xspline automatically chooses whether variable should be transformed with splines
compare_stat - function of lm class object. It defines statistic that should be used in decision between spline model and linear one. The function should have the attribute higher. When the attribute has "better" value then the model with higher statistic value is chosen.

You can see the feature in above example:

set.seed(123)
boston_rf <- randomForest(cmedv ~ lstat + ptratio + age, data = boston)
model_pdp_auto <- xspline(
  cmedv ~
    xs(lstat, transition = list(k = 6), effect = list(type = "pdp", grid.resolution = 60)) +
    xs(ptratio, transition = list(k = 4), effect = list(type = "pdp", grid.resolution = 40)) +
    age,
  model = boston_rf,
  xs_opts = list(transition = list(alter = "auto"))
)

# aic statistic is used by default

summary(model_pdp_auto)

Linear approximation was better for ptratio response function.

Specifying model family and link

When GLM model is estimated there is possibility to specify response family and link parameters. Family stores information about the distribution of response - standard one is gaussian, which assumes that the response comes from a normal distribution. For classification the binomial family is used.

Link parameters stores info about what function should be used to transform the response. The transformation is used in the final model fitting. The standard link is the identity (for gaussian distribution) - for binomial distribution logit is used.

See more at ??stats::family.glm.

xspline function allows you to decide which response should be used in the final model. Let's check the example below in which poisson distribution with log link is used.

library(xspliner)
library(randomForest)
x <- rnorm(100)
z <- rnorm(100)
y <- rpois(100, exp(1 + x + z))
data <- data.frame(x, y, z)
model_rf <- randomForest(y ~ x + z, data = data)
model_xs_1 <- xspline(model_rf)
model_xs_2 <- xspline(model_rf, family = poisson(), link = "log")

Let's compare two models by checking its AIC statistics:

model_xs_1$aic
model_xs_2$aic

As we can see the second model is better.

Transformed response

In some cases you may want to transform model response with you own function. Let's check the example below with random forest model:

set.seed(123)
x <- rnorm(100, 10)
z <- rnorm(100, 10)
y <- x * z * rnorm(100, 1, 0.1)
data <- data.frame(x, z, y)
model_rf <- randomForest(log(y) ~ x + z, data = data)

In this case log transformation for y, removes interaction of x and z. In xspliner same transformation is used by default:

model_xs <- xspline(model_rf)
summary(model_xs)
plot_model_comparison(model_xs, model = model_rf, data = data)

Multiplicative form

When interactions between predictors occurs black box models in fact deal much better that linear models. xspliner offers using formulas with variables interactions.

You can do it in two possible forms.

Lets start with creating data and building black box:

x <- rnorm(100)
z <- rnorm(100)
y <- x + x * z + z + rnorm(100, 0, 0.1)
data <- data.frame(x, y, z)
model_rf <- randomForest(y ~ x + z, data = data)

The first option is specifying formula with * sign, as in standard linear models.

model_xs <- xspline(y ~ x * z, model = model_rf)
summary(model_xs)
plot_model_comparison(model_xs, model = model_rf, data = data)

The second one is adding form parameter equal to "multiplicative" in case of passing just the model or dot formula.

model_xs <- xspline(model_rf, form = "multiplicative")
summary(model_xs)
plot_model_comparison(model_xs, model = model_rf, data = data)

model_xs <- xspline(y ~ ., model = model_rf, form = "multiplicative")
summary(model_xs)
plot_model_comparison(model_xs, model = model_rf, data = data)

Subset formula

Every example we saw before used to use the same variables in black box and xspliner model. In fact this is not obligatory. How can it be used? For example to build a simpler model based on truncated amount of predictors. Let's see below example:

library(randomForest)
library(xspliner)
data(airquality)
air <- na.omit(airquality)
model_rf <- randomForest(Ozone ~ ., data = air)
varImpPlot(model_rf)

As we can see Wind and Temp variables are of the highest importance. Let's build xspliner basing on just the Two variables.

model_xs <- xspline(Ozone ~ xs(Wind) + xs(Temp), model = model_rf)
summary(model_xs)
plot_model_comparison(model_xs, model = model_rf, data = air)

Or model including variables interaction:

model_xs <- xspline(Ozone ~ xs(Wind) * xs(Temp), model = model_rf)
summary(model_xs)
plot_model_comparison(model_xs, model = model_rf, data = air)

krystian8207/xspliner documentation built on Oct. 6, 2019, 11:02 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

krystian8207/xspliner
Assisted Model Building, using Surrogate Black-Box Models to Train Interpretable Spline Based Additive Models

In krystian8207/xspliner: Assisted Model Building, using Surrogate Black-Box Models to Train Interpretable Spline Based Additive Models

Monotonic splines approximation

Choose if approximation is better

Specifying model family and link

Transformed response

Multiplicative form

Subset formula

R Package Documentation

Browse R Packages

We want your feedback!

krystian8207/xspliner Assisted Model Building, using Surrogate Black-Box Models to Train Interpretable Spline Based Additive Models

In krystian8207/xspliner: Assisted Model Building, using Surrogate Black-Box Models to Train Interpretable Spline Based Additive Models

Monotonic splines approximation

Choose if approximation is better

Specifying model family and link

Transformed response

Multiplicative form

Subset formula

R Package Documentation

Browse R Packages

We want your feedback!

krystian8207/xspliner
Assisted Model Building, using Surrogate Black-Box Models to Train Interpretable Spline Based Additive Models