knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 7
)
htmltools::img(src = knitr::image_uri(file.path("/Users/alaninglis/Desktop/vividLogo.png")), 
               alt = 'logo', 
               style = 'position:absolute; top:0; right:0; 
               height:276px; width:240px')
library(vivid)

Introduction

Visualizations can be an important tool for the use of analysis or for the exploration of data. The impact of a carefully chosen visualization can be significant for a researcher, as using a meaningful visualization can give emphasis to the relationships that variables have in a model, and thus serves to help the researcher gain a deeper understanding of the behavior of a model. One such area that can benefit greatly from the use of informative visualizations is that of variable importance and variable interactions. When one creates a machine learning model, a significant challenge that is faced is to visualize the importance of the variables used to train the model and the interactions between these variables and to determine which variables are important and which variables can be cut from the model. Traditional methods of displaying variable importance and variable interactions suffer from a lack of cohesion and imagination. Both variable importance and variable interaction are clearly important tools in feature selection, yet traditional visualization techniques deal with them separately

The vivid (variable importance and variable interaction displays) package was designed to help a user to easily distinguish which variables in a model are important and which variables interact with each other in a sensible and interpretable way. vivid contains a suite of plots that enable importance and interaction to be evaluated in a more efficient manor than would traditionally be possible. These include a heat-map style plot that displays 2-way interactions and individual variable importance. Also, a network style plot where the size of a node represents variable importance (the bigger the node, the more important the variable) and the edge weight represents the 2-way interaction strength. Also included is a partial dependence pairs-style plot that displays 2D partial dependence plots, individual ice curves and a scatter-plot, in the one display. The interaction is calculated using Friedman’s H-Statistic^[Friedman, H. and Popescu, B.E. Predictive learning via rule ensemble. The Annals of Applied Statistics. 2008. 916-954]

To begin, we load the required packages. vivid to create the vusualisations and mlr3 and mlr3learners will be used to create a ranger regression model.

library(vivid)
library(mlr3)  # To create a model
library(mlr3learners)

Data used in this vignette:

The data used in the following examples is simulated from the Friedman benchmark problem 1^[Friedman, Jerome H. (1991) Multivariate adaptive regression splines. The Annals of Statistics 19 (1), pages 1-67.] using the genFriedman() function. This benchmark problem is commonly used for testing purposes. The output is created according to the equation:

$$y = 10 sin(π x1 x2) + 20 (x3 - 0.5)^2 + 10 x4 + 5 x5 + e$$

For the following examples we set the number of features to equal 9 and the number of samples is set to 250 and fit an mlr3 ranger random forest model with $y$ as the response. As the features $x_1$ to $x_5$ are the only variables in the model, therefore $x_6$ to $x_{9}$ are noise variables. As can be seen by the above equation, the only interaction is between $x_1$ and $x_2$

Create the data:

set.seed(1701)
myData <- genFriedman(noFeatures = 9, noSamples = 250, sigma = 1, bins = NULL, seed = NULL)

Create an mlr3 ranger random forest model:

set.seed(1701)
fr_task  <- TaskRegr$new(id = "Friedman", backend = myData, target = "y")
set.seed(1701)
fr_lrn <- lrn("regr.ranger", importance = "permutation")
set.seed(1701)
fr_mod <- fr_lrn$train(fr_task)

To begin, we use the vividMatrix function to create a symmetrical matrix filled with pair-wise interaction strengths and variable importance on the diagonal. The vividMatrix uses Friedman's unnormalized H-Statistic to calculate the pair-wise interaction strength and uses embedded feature selection methods to determine the variable importance. If the supplied learner does not support an embedded variable importance measure, an agnostic approach will be applied automatically to generate the importance values. This function takes in any task, learner and model created from the mlr3 package and results in a matrix of class vivid which can be supplied to the plotting functions.

Create a matrix to be supplied to the plotting functions.

set.seed(1701)
myMatrix <- vividMatrix(task = fr_task, model = fr_mod, gridSize = 30,                                normalize = FALSE, seed = NULL,
                         sqrt = FALSE, reorder = TRUE)

The unnormalized version of the H-statistic was chosen to have a more direct comparison of interaction effects across pairs of variables and the results of H are onthe scale of the response.

Taking a quick look at the created matrix, we can see that it shows the pair-wise interaction strengths for each of the variables, with variable importance on the diagonal.

head(myMatrix, 3)

Visualizing the results

Heat-Map style plot

The first visualization option supplied by vivid creates a heat-map style plot displaying variable importance on the diagonal and variable interaction on the off-diagonal

Example

To call the plot we use the generic plot() function with type = "heatMap" as follows:

plot(myMatrix, type = "heatMap", plotly = F,
          intLow = "floralwhite", intHigh = "dodgerblue4",
          impLow = "white", impHigh = "firebrick1", 
          minImp = NULL, maxImp = NULL, minInt = 0, maxInt = NULL)

Fig 1.0: Heat-map style plot displaying 2-way interaction strength in blue and individual variable importance on the diagonal in red.

In Figure 1.0, variable importance is placed on the diagonal and is displayed using a gradient of white to red, representing the low to high values of importance. In a similar manner, variable interaction is displayed using a gradient of white to dark blue, representing the low to high values of interaction strength. Plotting both the importance and the interactions from the vivid package can save time and minimize complications for a researcher. From the above plot we can see that x_1:x_2 has the strongest interaction and x_4 is the most important variable for predicting y. Traditional visualization techniques are commonly drawn with the order of variables appearing as how they are listed in the data. It has been shown^[ E.Friendly, F. Kwan. Effect ordering for data displays. Computational Statistics Data Analysis, 43:509–539, 2003.] that reordering the data can help improve visualizations by making them more easily interpreted. The plot in Figure 1.0 also utilizes the DendSer^[Catherine B. Hurley and Denise Earle.DendSer: Dendrogram seriation: ordering for visualisation, 2013. R package version 1.0.1] package to push all the relevant values to the top left of the plot. This is done to enable models to be plotted neatly and to allow the user to quickly see which variables are influential in a model. This is particularly useful for large datasets.

Network style plot

Another option supplied by vivid is to visualize the variable importance and interactions as a network style plot. This has the advantage of allowing the user to quickly identify which variables have a strong interaction in a model. The importance of the variable is represented by both the size of the node (with larger nodes meaning they have greater importance) and the colour of the node. The importance values are displayed by using a gradient of white to red, representing the low to high values of importance. The two-way interaction strengths between variables are represented by the connecting lines (or edges). Both the size and colour of the edge are used to highlight interaction strength. Thicker lines between variables indicate a greater interaction strength. The interaction strength values are displayed by using a gradient of white to dark blue, representing the low to high values of interaction strength.

Example

To call the plot we use the generic plot() function with type = "network" as follows:

plot(myMatrix, type = "network", thresholdValue = 0, label = FALSE,
            minInt = 0, maxInt = NULL, minImp = 0, maxImp=NULL,
            labelNudge = 0.05, layout = "circle")

Fig 2.0: Network style plot displaying 2-way interaction strength between each of the variables and individual variable importance

From the above plot we can see that x_1:x_2 weight has the strongest interaction and X-4 is the most important variable for predicting y.

The network plot offers multiple possibilities when it comes to displaying the network style plot through use of the layout argument. The default layout is a circle but any of the layouts included in the igraph/sna package are accepted.

The user can control which interaction values to display by using the thresholdValue argument. In the following example, thresholdValue = 0.1 means that only the the edges with weights (i.e., the interactions) above 0.1 are displayed:

plot(myMatrix, type = "network", thresholdValue = 0.1)

Fig 2.1: Network style plot displaying thresholded 2-way interaction strengths between each of the variables and individual variable importance. In this plot only the interactions greater than 0.1 are shown.

The preceding plots also include options to allow the user to scale the importance/interaction values, through the minInt, maxInt, minImp, maxImp arguments. This is particularly useful if one is comparing different fits using the visualizations, as it keeps the values on the same scale.

Partial dependence pairs style plot, PDPP

This function creates a pairs plot style matrix plot of the 2D partial dependence of each of the variables in the upper diagonal, the individual pdp and ice curves on the diagonal and a scatter-plot of the data on the lower diagonal. The partial dependence plot shows the marginal effect one or two features have on the predicted outcome of a machine learning model^[Friedman, Jerome H. “Greedy function approximation: A gradient boosting machine.” Annals of statistics (2001): 1189-1232.]. A partial dependence plot is used to show whether the relationship between the response variable and a feature is linear or more complex. The PD pairs plot can also be passed the vivid matrix by way of a function argument. This will reorder the PD pairs plot to match the serieated order of the vivid matrix.

Example

To call the pdp pairs plot we use:

set.seed(1701)
ggpdpPairs(task = fr_task, model = fr_mod, gridsize = 10, mat = myMatrix)

Fig 3.0: A pairs style matrix plot displaying the partial dependence between each of the variables in the upper diagonal, the individual pdp and ice curves on the diagonal and a scatter-plot on the lower diagonal

From the above plot, we see the 2D pdp plots on the upper diagonal. The individual pdp and ice curves on the diagonal (with the aggregate pdp curve in black) and scatter-plots on the lower diagonal. From the plot we can see an interaction between $x_1$ and $x_2$ and it seems that $x_4$ has some dependence on all the variables.

Zenplot partial dependence plot, ZPDP

Here we create a zigzag expanded navigation plot (zenplot) of the partial dependence values. This results in an alternating sequence of two-dimensional plots laid out in a zigzag structure.

To create the ZPDP, we use:

set.seed(1701)
pdpZenplot(task = fr_task, model = fr_mod, gridsize = 10)

Fig 4.0: A plot of the partial dependence in a zenplot style layout

If we want to construct a path of indices to order the variables in terms of importance, we can use calcZpath as follows:

set.seed(1701)
zpath <- calcZpath(myMatrix, cutoff = 0.1)
pdpZenplot(task = fr_task, model = fr_mod, gridsize = 10)

Fig 4.1: A plot of the partial dependence in a zenplot style layout following a zenpath

In Fig 4.1, we have only selected the variables with values above 0.1 using the cutoff argument. calcZpath constructs a path for connecting and displaying pairs of variables. It orders the plots in terms of importance starting from the top left and descending.

In addition to displaying both the variable importance and interactions together, the vivid package also allows to display only the variable importance or the interaction strength via the following functions.

Displaying all 2-way interactions

The type = "allInteractions" plot displays the 2-way interactions on the y-axis and the interaction strength on the x-axis. This plot also allows the user to switch between a lollipop style plot (which is default) and a barplot, by use of the plotType argument.

Example To call the plot we use the generic plot() function with type = "allInteractions":

plot(myMatrix, type = "allInteractions", top = 0)

Fig 4.0: A plot displaying all 2-way interaction in a model.

From the above plot we can see that x_1:x_2 has the strongest interaction.

Display the overall interaction strength

The interactionPlot() function creates a plot, displaying the overall interaction strength for each variable in a model. The plot displays the variables on the y-axis and the overall interaction strength on the x-axis. This function also allows the user to switch between a lollipop style plot (which is default) and a barplot, by use of the type argument.

Example

To call the interaction plot we use:

set.seed(1701)
interactionPlot(model =  fr_mod, data = myData, type = "barplot")

Fig 5.0: A plot displaying the overall interaction strength for each variable in a model.

From the above plot we can see that the variable with the strongest overall interaction strength is x_2

Display the overall variable importance

The type = "importance" call creates a plot, displaying the variable importance for each variable in a model. The plot displays variables on the y-axis and the variable importance on the x-axis. This function also allows the user to switch between a lollipop style plot (which is default) and a barplot, by use of the plotType argument.

Example

To call the importance plot we use:

plot(myMatrix, type = "importance", plotType = "barplot")

Fig 6.0: A plot displaying the variable importance for each variable in a model.

From the above plot, we can see that the most important variable, when predicting y is x_4.

Generate data from the Friedman benchmark problem 1

The genFriedman() function simulates data from the Friedman benchmark problem 1^[Friedman, Jerome H. (1991) Multivariate adaptive regression splines. The Annals of Statistics 19 (1), pages 1-67.]. This is mainly used for testing purposes. The output is created according to the formula:

$$y = 10 sin(π x1 x2) + 20 (x3 - 0.5)^2 + 10 x4 + 5 x5 + e$$

By default the number of features is set to 10 and the number of samples is set to 100. sigma denotes the standard deviation of the noise. bins denotes the number of bins to split response variable into. Setting a value greater than 1 turns this into a classification problem where bins determines the number of classes. If the seed argument is not NULL, then the random seed will be set as using the function each time will produce different results.

Example

To generate the Friedman data we use:

myData <- genFriedman(noFeatures = 10, noSamples = 100, sigma = 1, bins = NULL, seed = NULL)
head(myData,3)


AlanInglis/vividOld documentation built on March 4, 2021, 12:44 a.m.