This sections provides some additional information and features that xspliner provides.
For qualitative variables only
In some cases you may want to transform variables with monotonic function. xspliner provides an option for monotonic spline approximation. You just need to specify monotonic
parameter for the local, or global xs
transition. It actually can have 4 values:
Let's see below example:
library(randomForest) library(pdp) library(xspliner) data(boston) set.seed(123) boston_rf <- randomForest(cmedv ~ lstat + ptratio + age, data = boston) model_xs <- xspline( cmedv ~ xs(lstat, transition = list(k = 6), effect = list(type = "pdp", grid.resolution = 100)) + xs(ptratio, transition = list(k = 5), effect = list(type = "pdp", grid.resolution = 100)) + age, model = boston_rf, xs_opts = list(transition = list(monotonic = "auto")) ) plot(model_xs, "ptratio", plot_deriv = TRUE) plot(model_xs, "lstat", plot_deriv = TRUE)
For qualitative variables only
When the response function has linear form, approximating it with splines may make the result worse.
xspline
function offers automatic check if the spline approximation is better than linear one, and use it in the final model.
You may find two parameters responsible for that:
alter
- The sub-parameter of transition
. We already know how this parameter works for "always" and "never" values.
There is also the third option, "auto". In this case xspline automatically chooses whether variable should be transformed with splinescompare_stat
- function of lm
class object. It defines statistic that should be used in decision between spline model and linear one. The function should have the attribute higher
. When the attribute has "better"
value then the model with higher statistic value is chosen.You can see the feature in above example:
set.seed(123) boston_rf <- randomForest(cmedv ~ lstat + ptratio + age, data = boston) model_pdp_auto <- xspline( cmedv ~ xs(lstat, transition = list(k = 6), effect = list(type = "pdp", grid.resolution = 60)) + xs(ptratio, transition = list(k = 4), effect = list(type = "pdp", grid.resolution = 40)) + age, model = boston_rf, xs_opts = list(transition = list(alter = "auto")) ) # aic statistic is used by default summary(model_pdp_auto)
Linear approximation was better for ptratio
response function.
When GLM model is estimated there is possibility to specify response family and link parameters. Family stores information about the distribution of response - standard one is gaussian, which assumes that the response comes from a normal distribution. For classification the binomial family is used.
Link parameters stores info about what function should be used to transform the response. The transformation is used in the final model fitting. The standard link is the identity (for gaussian distribution) - for binomial distribution logit is used.
See more at ??stats::family.glm
.
xspline
function allows you to decide which response should be used in the final model.
Let's check the example below in which poisson distribution with log link is used.
library(xspliner) library(randomForest) x <- rnorm(100) z <- rnorm(100) y <- rpois(100, exp(1 + x + z)) data <- data.frame(x, y, z) model_rf <- randomForest(y ~ x + z, data = data) model_xs_1 <- xspline(model_rf) model_xs_2 <- xspline(model_rf, family = poisson(), link = "log")
Let's compare two models by checking its AIC statistics:
model_xs_1$aic model_xs_2$aic
As we can see the second model is better.
In some cases you may want to transform model response with you own function. Let's check the example below with random forest model:
set.seed(123) x <- rnorm(100, 10) z <- rnorm(100, 10) y <- x * z * rnorm(100, 1, 0.1) data <- data.frame(x, z, y) model_rf <- randomForest(log(y) ~ x + z, data = data)
In this case log transformation for y, removes interaction of x and z. In xspliner same transformation is used by default:
model_xs <- xspline(model_rf) summary(model_xs) plot_model_comparison(model_xs, model = model_rf, data = data)
When interactions between predictors occurs black box models in fact deal much better that linear models. xspliner offers using formulas with variables interactions.
You can do it in two possible forms.
Lets start with creating data and building black box:
x <- rnorm(100) z <- rnorm(100) y <- x + x * z + z + rnorm(100, 0, 0.1) data <- data.frame(x, y, z) model_rf <- randomForest(y ~ x + z, data = data)
The first option is specifying formula with *
sign, as in standard linear models.
model_xs <- xspline(y ~ x * z, model = model_rf) summary(model_xs) plot_model_comparison(model_xs, model = model_rf, data = data)
The second one is adding form parameter equal to "multiplicative" in case of passing just the model or dot formula.
model_xs <- xspline(model_rf, form = "multiplicative") summary(model_xs) plot_model_comparison(model_xs, model = model_rf, data = data)
model_xs <- xspline(y ~ ., model = model_rf, form = "multiplicative") summary(model_xs) plot_model_comparison(model_xs, model = model_rf, data = data)
Every example we saw before used to use the same variables in black box and xspliner model. In fact this is not obligatory. How can it be used? For example to build a simpler model based on truncated amount of predictors. Let's see below example:
library(randomForest) library(xspliner) data(airquality) air <- na.omit(airquality) model_rf <- randomForest(Ozone ~ ., data = air) varImpPlot(model_rf)
As we can see Wind and Temp variables are of the highest importance. Let's build xspliner basing on just the Two variables.
model_xs <- xspline(Ozone ~ xs(Wind) + xs(Temp), model = model_rf) summary(model_xs) plot_model_comparison(model_xs, model = model_rf, data = air)
Or model including variables interaction:
model_xs <- xspline(Ozone ~ xs(Wind) * xs(Temp), model = model_rf) summary(model_xs) plot_model_comparison(model_xs, model = model_rf, data = air)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.