This sections provides some additional information and features that xspliner provides.

Monotonic splines approximation

For qualitative variables only

In some cases you may want to transform variables with monotonic function. xspliner provides an option for monotonic spline approximation. You just need to specify monotonic parameter for the local, or global xs transition. It actually can have 4 values:

Let's see below example:

library(randomForest)
library(pdp)
library(xspliner)
data(boston)
set.seed(123)
boston_rf <- randomForest(cmedv ~ lstat + ptratio + age, data = boston)
model_xs <- xspline(
  cmedv ~
    xs(lstat, transition = list(k = 6), effect = list(type = "pdp", grid.resolution = 100)) +
    xs(ptratio, transition = list(k = 5), effect = list(type = "pdp", grid.resolution = 100)) +
    age,
  model = boston_rf,
  xs_opts = list(transition = list(monotonic = "auto"))
)

plot(model_xs, "ptratio", plot_deriv = TRUE)
plot(model_xs, "lstat", plot_deriv = TRUE)

Choose if approximation is better

For qualitative variables only

When the response function has linear form, approximating it with splines may make the result worse. xspline function offers automatic check if the spline approximation is better than linear one, and use it in the final model.

You may find two parameters responsible for that:

You can see the feature in above example:

set.seed(123)
boston_rf <- randomForest(cmedv ~ lstat + ptratio + age, data = boston)
model_pdp_auto <- xspline(
  cmedv ~
    xs(lstat, transition = list(k = 6), effect = list(type = "pdp", grid.resolution = 60)) +
    xs(ptratio, transition = list(k = 4), effect = list(type = "pdp", grid.resolution = 40)) +
    age,
  model = boston_rf,
  xs_opts = list(transition = list(alter = "auto"))
)

# aic statistic is used by default

summary(model_pdp_auto)

Linear approximation was better for ptratio response function.

Specifying model family and link

When GLM model is estimated there is possibility to specify response family and link parameters. Family stores information about the distribution of response - standard one is gaussian, which assumes that the response comes from a normal distribution. For classification the binomial family is used.

Link parameters stores info about what function should be used to transform the response. The transformation is used in the final model fitting. The standard link is the identity (for gaussian distribution) - for binomial distribution logit is used.

See more at ??stats::family.glm.

xspline function allows you to decide which response should be used in the final model. Let's check the example below in which poisson distribution with log link is used.

library(xspliner)
library(randomForest)
x <- rnorm(100)
z <- rnorm(100)
y <- rpois(100, exp(1 + x + z))
data <- data.frame(x, y, z)
model_rf <- randomForest(y ~ x + z, data = data)
model_xs_1 <- xspline(model_rf)
model_xs_2 <- xspline(model_rf, family = poisson(), link = "log")

Let's compare two models by checking its AIC statistics:

model_xs_1$aic
model_xs_2$aic

As we can see the second model is better.

Transformed response

In some cases you may want to transform model response with you own function. Let's check the example below with random forest model:

set.seed(123)
x <- rnorm(100, 10)
z <- rnorm(100, 10)
y <- x * z * rnorm(100, 1, 0.1)
data <- data.frame(x, z, y)
model_rf <- randomForest(log(y) ~ x + z, data = data)

In this case log transformation for y, removes interaction of x and z. In xspliner same transformation is used by default:

model_xs <- xspline(model_rf)
summary(model_xs)
plot_model_comparison(model_xs, model = model_rf, data = data)

Multiplicative form

When interactions between predictors occurs black box models in fact deal much better that linear models. xspliner offers using formulas with variables interactions.

You can do it in two possible forms.

Lets start with creating data and building black box:

x <- rnorm(100)
z <- rnorm(100)
y <- x + x * z + z + rnorm(100, 0, 0.1)
data <- data.frame(x, y, z)
model_rf <- randomForest(y ~ x + z, data = data)

The first option is specifying formula with * sign, as in standard linear models.

model_xs <- xspline(y ~ x * z, model = model_rf)
summary(model_xs)
plot_model_comparison(model_xs, model = model_rf, data = data)

The second one is adding form parameter equal to "multiplicative" in case of passing just the model or dot formula.

model_xs <- xspline(model_rf, form = "multiplicative")
summary(model_xs)
plot_model_comparison(model_xs, model = model_rf, data = data)
model_xs <- xspline(y ~ ., model = model_rf, form = "multiplicative")
summary(model_xs)
plot_model_comparison(model_xs, model = model_rf, data = data)

Subset formula

Every example we saw before used to use the same variables in black box and xspliner model. In fact this is not obligatory. How can it be used? For example to build a simpler model based on truncated amount of predictors. Let's see below example:

library(randomForest)
library(xspliner)
data(airquality)
air <- na.omit(airquality)
model_rf <- randomForest(Ozone ~ ., data = air)
varImpPlot(model_rf)

As we can see Wind and Temp variables are of the highest importance. Let's build xspliner basing on just the Two variables.

model_xs <- xspline(Ozone ~ xs(Wind) + xs(Temp), model = model_rf)
summary(model_xs)
plot_model_comparison(model_xs, model = model_rf, data = air)

Or model including variables interaction:

model_xs <- xspline(Ozone ~ xs(Wind) * xs(Temp), model = model_rf)
summary(model_xs)
plot_model_comparison(model_xs, model = model_rf, data = air)


krystian8207/xspliner documentation built on Oct. 6, 2019, 11:02 p.m.