Streamline Routine Modeling Work in R: streamlineR

For an interactive presentation version, please check here


This package is designed to streamline routine modeling work, especially for scoring. It provides handy functions to bin numerical variables, replace numerical values with the Weight of Evidence (WOE), rank variables by Information Value (IV), plot success/failure rates, check model performance based on AUC, and so on. The package also provides useful functions to convert model outputs (e.g., coefficients) into graphs and tables that are easier for a non-technical audience to understand.

The following example illustrates how to use the streamlineR package to prepare data, build models, and visualize results.

Package Setup

This analysis relies on several other packages. If these packages are not yet available on your computer, you can install them with the following commands.

Install Dependent Packages

# If the default mirror is blocked, choose another mirror to install R packages:
# chooseCRANmirror()
install.packages(c('dplyr', 'car', 'caret', 'e1071', 'knitr', 'reshape2',
  'corrplot','rpart', 'scales', 'shiny', 'survival', 'gridExtra', 'devtools', 
  'pec', 'MASS', 'pROC', 'manipulate'), dependencies = TRUE)

Load Dependent Packages

After installing these packages, load them into R so that their functions are available.

# Load packages
sapply(c('dplyr', 'car', 'caret', 'e1071', 'knitr', 'reshape2', 'corrplot','rpart', 
  'scales', 'survival', 'gridExtra', 'devtools', 'pec', 'MASS', 'pROC', 
  'manipulate'), 
  require, character.only = TRUE)

Install streamlineR from Github

The streamlineR package is under development on GitHub. It can be installed with the install_github function from the devtools package. If install_github does not work, you can download the package from here, and then install it locally in RStudio (Tools -> Install Packages -> Install from -> Package Archive File (.zip; .tar.gz)).

# If the install_github does not work, you can download the package from github,
# and then install the package locally: 
# https://api.github.com/repos/JianhuaHuang/streamlineR/zipball/master 
devtools::install_github('JianhuaHuang/streamlineR')
library(streamlineR)

Data Preparation

In this example, I analyzed the primary biliary cirrhosis (PBC) data set from the survival package. The details of this data set are available here, or you can run ?survival::pbc to see the data description within R.

Load Data

The data set can be loaded into R directly from the package. Because the original sample size is a little small, I increased it by resampling 10,000 records with replacement.

dt <- survival::pbc %>%
  transmute(age = round(age), gender = sex, platelet, stage = as.character(stage), 
    time, status = as.numeric(status %in% c(1, 2))) %>%
  filter(!is.na(stage))

set.seed(1111)  # reproducible
dt <- dt[sample(nrow(dt), 10000, replace = T), ]
row.names(dt) <- NULL
dim(dt)
str(dt)
head(dt)

Split Data into Training and Test Datasets

Before doing any analysis, I held out some data as a test data set. The createDataPartition function (from the caret package) is used to split the data into training and test data sets. The training data set is used to develop the model, and the test data set is used to check the model performance.

set.seed(1111)
ind.train <- createDataPartition(dt$status, p = .7, list = FALSE)
dt.train <- dt[ind.train, ]
dt.test <- dt[-ind.train, ]
row.names(dt.train) <- NULL
row.names(dt.test) <- NULL
dim(dt.train)
dim(dt.test)

# preserve the original values
dt.train.bk <- dt.train
dt.test.bk <- dt.test

Bin Training Data Based on Regression Coefficients: bin.knn

Before building the model with training data, we may need to convert some numerical variables to categorical variables by binning the numerical values into different groups, so that we can model the non-linear relationship between the independent and dependent variables. This package provides two methods to find the cut points for binning: bin.knn and bin.rpart.

The bin.knn method finds the cut points based on the regression coefficients, using the KNN algorithm. Generally, it works through the following steps:

  1. Divide the independent variable x into small buckets (e.g., 20 buckets)
  2. Build a univariate model using x and y
  3. Get the regression coefficient for each bucket
  4. Use the KNN algorithm to bin the buckets into bigger groups (e.g., 5 groups), based on their order and regression coefficients.

The bin.knn method can deal with generalized linear models (GLM) and survival models (class coxph in R). The bin.knn function takes four arguments: formula, data, n.group, and min.bucket. The min.bucket argument is the minimum ratio of records in each bucket. It should be small enough to ensure that there are enough buckets for binning, and big enough to guarantee that the estimated regression coefficients are statistically reliable for all buckets.

The following call bins the 20 buckets (including NA) into 5 groups, represented by different colors. We can extract the cut points based on the colors of the groups.

bin.knn(status ~ platelet, data = dt.train, n.group = 5, min.bucket = 0.05)

Although the bin.knn function takes into account both the order and the coefficient of each bucket, it may bin some buckets together even when the coefficients of neighboring buckets are very different. In this case, we may need to change the number of groups and/or buckets by adjusting the n.group and/or min.bucket arguments. This can be done interactively using the manipulate function as follows:

manipulate(bin.knn(status ~ platelet, data = dt.train, n.group, min.bucket),
  n.group = slider(1, 10, step = 1, initial = 5, label = 'Number of Groups'),
  min.bucket = slider(0.01, .2, step = 0.01, initial = 0.05,
    label = 'Minimum Population'))

By changing the n.group and min.bucket repeatedly, we may be able to find the appropriate cut points.

Bin Training Data Based on rpart: bin.rpart

Although the bin.knn method is intuitive and provides a visualization, it is difficult to find the optimal cut points manually. Another binning method, bin.rpart, can find the optimal cut points automatically, and can be applied to more models.

Decision Tree Algorithm (Recursive Partitioning): rpart

The bin.rpart function relies on the output from rpart (recursive partitioning), a well-known algorithm for building decision trees. The rpart function produces the optimal splits of numerical data, and generates tree-structured nodes. Based on these splits and nodes, we can extract the cut points and bin the numerical values into different groups.

rpart(formula = status ~ age, data = dt.train, 
  control = rpart.control(minbucket = .01 * nrow(dt.train)))

Binning for Logistic Model

The usage of bin.rpart is very similar to rpart, except that the control argument of rpart is named rcontrol in bin.rpart. The following code generates the cut points and bins for age and platelet. The cut points (cut.points) and the bins are returned in a list.

lg.bin.age <- bin.rpart(formula = status ~ age, data = dt.train, 
  rcontrol = rpart.control(minbucket = .01 * nrow(dt.train)))

str(lg.bin.age)

lg.bin.platelet <- bin.rpart(formula = status ~ platelet, data = dt.train, 
  rcontrol = rpart.control(minbucket = .01 * nrow(dt.train)))

Binning for Survival Model

Compared to other packages (such as smbinning and woe) that provide binning only for logistic models, bin.rpart can provide the optimal binning for all models that can be passed to the rpart function. For example, the bin.rpart function can generate the optimal cut points of age in a survival model, if we change the dependent variable in the formula to a survival object (Surv(time, status)).

surv.bin.age <- bin.rpart(formula = Surv(time, status) ~ age, data = dt.train,
  rcontrol = rpart.control(minbucket = .05 * nrow(dt.train)))  ## cp = 0.01

Search for Appropriate Number of Cut Points

By default, cp (the complexity parameter that controls rpart; see ?rpart.control for details) is set to 0.01, which is a little conservative for the survival model. Thus, with the default cp value, the number of cut points is usually small for the survival model. We can reduce the cp value (e.g., to 0.001) to get more cut points, and may be able to reach an appropriate number of cut points by changing the cp argument repeatedly.
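
For example, a smaller cp can be passed through rcontrol (a minimal sketch; the object name surv.bin.age.cp is arbitrary):

# Lowering cp below the 0.01 default usually yields more cut points
surv.bin.age.cp <- bin.rpart(formula = Surv(time, status) ~ age, data = dt.train,
  rcontrol = rpart.control(minbucket = .05 * nrow(dt.train), cp = 0.001))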

Instead of changing the cp argument manually, the n.group argument (the number of acceptable binning groups, which can be a single number or a range) can help find the appropriate number of cut points automatically. For example, if we set the acceptable n.group to 3:7, the bin.rpart function will try different cp values until the number of binning groups is within 3 to 7 (i.e., the number of cut points is within 2 to 6).

surv.bin.age2 <- bin.rpart(formula = Surv(time, status) ~ age, data = dt.train,
  rcontrol = rpart.control(minbucket = .05 * nrow(dt.train)), n.group = 3:7)

Replace Numerical Variables with Bins

After binning, we can replace the original numerical values with the corresponding bins saved in the outputs from bin.rpart.

# We don't need the time column anymore; delete it from both dt.train and dt.test
dt.train <- dplyr::select(dt.train, -time)  
dt.test <- dplyr::select(dt.test, -time)
head(dt.train)
dt.train$age <- lg.bin.age$bins
dt.train$platelet <- lg.bin.platelet$bins
head(dt.train)

After replacing the numerical values with bins, these variables are converted to categorical variables automatically.
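
A quick structure check confirms this (a minimal check; the exact class of the bins depends on what bin.rpart returns):

# The binned columns are no longer numeric
sapply(dt.train[, c('age', 'platelet')], class)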

Level Statistics (Frequency, Rate, WOE, and IV): level.stat

For all of the categorical variables, it is useful to calculate some statistics (e.g., population frequency, good/bad rates, and WOE) for each level of each variable before building the models. The level.stat function is designed for this purpose. To use level.stat, the dependent variable (y) must be binary, and you should specify which values are flagged as 0 and 1 (via flag.0 and flag.1) in the level.stat output.
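
For reference, WOE is commonly defined as log((percentage of 1s in a level) / (percentage of 0s in a level)). A by-hand sketch for the stage variable follows (illustrative only; level.stat's exact convention may differ):

# Hand-computed WOE per level of stage, using dplyr
dt.train %>%
  group_by(stage) %>%
  summarise(n.1 = sum(status == 1), n.0 = sum(status == 0)) %>%
  mutate(WOE = log((n.1 / sum(n.1)) / (n.0 / sum(n.0))))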

col.x <- c('age', 'gender', 'platelet', 'stage')
stat.train <- level.stat(dt.train, x = col.x, y = 'status', flag.0 = 0, flag.1 = 1)
head(stat.train)

Visualize Level Statistics: ggstat

The ggstat function accompanies level.stat by visualizing its output. It employs the ggplot2 package to plot the statistics for different groups and variables.

Plot with Default Arguments

ggstat(data = stat.train, var = "Variable.IV", x = "Group", y = "Rate.1", 
    y.label = "Perc.1", y.label.col = "red", y.title = NULL, 
    bar.col = "cornflowerblue", width = "Rate.group", width.label = "Perc.group", 
    width.label.col = "black", ncol = NULL, theme = "classic", 
    background = "white")

The ggstat function takes multiple arguments, all of which have default values except the first one (data). To plot Rate.1 for different groups and variables, you only need to specify the data, which is usually the output from the level.stat function (other data can be used as well, but then the default arguments need to be changed).
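
Since the call above spells out the defaults, the same plot can be produced with just:

ggstat(stat.train)  # relies entirely on the default arguments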

Constant Bar Width

The default arguments can be changed to make the plot look different. For example, if we don't want the bar width to represent the population ratio of each group, we can set width to a fixed numerical value, so that the same width is used for every bar.

ggstat(stat.train, width = .2)

Plot WOE

We can also plot other statistics (e.g., WOE) by changing y to y = 'WOE' and y.label to y.label = 'WOE.round'. The number of columns in the plot can be set with the ncol argument.

ggstat(stat.train, y = 'WOE', y.label = 'WOE.round', width = .2, 
  width.label = NULL, ncol = 4)

Change Colors for Bar, Labels, and Background

To vary the appearance of the plots, you can also change the colors of the bars, the y and bar-width labels, and the background. Many colors can be used in the plot; the handy display.col function shows the available colors:

display.col()  # reference to http://sape.inf.usi.ch/quick-reference/ggplot2/colour 

You can pick your favorite colors for different elements in the ggstat function:

ggstat(stat.train, width = .2, y.label.col = 'white', bar.col = 'black',
  width.label.col = 'blue', background = 'green')

Replace Bins with WOE: replace.woe

For logistic models, it is sometimes useful to convert the categorical variables into numeric WOE values, since this guarantees linearity between the dependent and independent variables. The replace argument controls whether the original values should be replaced (replace = TRUE) or not (replace = FALSE). If replace = TRUE, the original values are replaced by the corresponding WOE values directly. If replace = FALSE, the WOE values are added as new columns with _woe appended to the original column names.

replace.woe(data = dt.train, stat = stat.train, replace = FALSE) %>%
  head

dt.train <- replace.woe(data = dt.train, stat = stat.train, 
  replace = TRUE)
head(dt.train)

Correlation between Independent Variables: ggcorr

Plot with the base system: corrplot

Another advantage of converting categorical values to WOE is that we can calculate the correlation between all independent variables, since they are all numeric. To do this, we first calculate the correlation matrix of the independent variables, and then visualize it. The idea of ggcorr originates from the corrplot function in the corrplot package. The corrplot function is based on the base R plot system, while the ggcorr function is based on the ggplot system. With the default settings, you can generate the corrplot as follows:

cor.mat <- cor(dt.train[, col.x])
corrplot(cor.mat)

Plot with the ggplot system: ggcorr

The ggcorr function is a simplified version of corrplot that includes only the few arguments most users need. Compared to corrplot, the big dots on the diagonal are removed, because we don't want to highlight those values. Since the plot is based on the ggplot system, the figure can be saved directly with the ggsave function. By default, ggcorr generates a figure that looks like this:

ggcorr(cor.mat)
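
Because the result is a ggplot object, it can be saved directly; for example (the file name and size here are arbitrary):

ggplot2::ggsave('correlation.png', width = 6, height = 6)  # saves the most recent plot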

You can change a few options in the function to personalize the plot. Using the following randomly generated data as an example, you can get different plots by changing a few arguments:

set.seed(1111)
data.random <- matrix(runif(100), 10)
colnames(data.random) <- paste('Variable', 1:10)
cor.random <- cor(data.random)

ggcorr(cor.random)  # default output
ggcorr(cor.random, var.position = 'diagonal', add.legend = FALSE)
ggcorr(cor.random, lower = TRUE)
ggcorr(cor.random, lower = TRUE, var.position = 'diagonal', high = 'blue', 
  low = 'green')

Logistic Model

With the WOE values, we are ready to build the logistic regression model with the training data set.

Full Model

lg <- glm(status ~ ., dt.train, family=binomial(link='logit'))
summary(lg)

Stepwise Variable Selection

Next, we can select the significant variables with the stepAIC function from the MASS package. It is worth mentioning that there is no argument that controls the p-to-enter directly, since the variable selection is based on AIC. Instead, we can adjust the k value to control the threshold for a variable to enter the model: setting k = qchisq(0.05, 1, lower.tail = FALSE) (about 3.84) makes the AIC penalty equivalent to a likelihood-ratio test with a p-to-enter of 0.05. For discussion about the usage of the k argument, please refer to this thread.

lg.aic <- stepAIC(lg, k =  qchisq(0.05, 1, lower.tail=F))   # p to enter: 0.05
summary(lg.aic)

data.frame(vif(lg.aic))  # check the multicollinearity between predictors

Prepare Test Data: bin.custom & replace.woe

After building the model, we need to check its performance using the test data, which we held out at the very beginning of the analysis. Before the test data can be used in the model, we need to convert the original values to bins, and then to WOE.

Bin Test Data: bin.custom

To bin the test data, we use the bin.custom function. This function requires two arguments: the data to be binned, and the cut points. Since we already saved the optimal cut points found on the training data, we can use them to cut the test data into the same bins directly.

head(dt.test)
dt.test$age <-  bin.custom(dt.test$age, cut.p = lg.bin.age$cut.points)
dt.test$platelet <- bin.custom(dt.test$platelet, cut.p = lg.bin.platelet$cut.points)
head(dt.test)

Replace Binned Test Data with WOE: replace.woe

After converting the original numeric values into bins, we can replace the bins with the corresponding WOE values estimated from the training data.

dt.test <- replace.woe(dt.test, stat = stat.train, replace = TRUE)
head(dt.test)

Model Performance: perf.auc & perf.decile

Now, with the test data expressed as WOE values, we can check the model performance based on the AUC (Area Under the Curve) or decile rates.

Check Performance Based on AUC: perf.auc

To check the performance based on AUC, the perf.auc function requires the model, the training data, and the test data. The function then generates the AUC and the ROC curve for both the training and test data sets.

perf.auc(model = lg.aic, dt.train, dt.test)

Check Performance Based on Decile Rate: perf.decile

Though the AUC makes sense to technical people who know statistics well, it may not be meaningful to a non-technical audience. To present the model performance in a way that is easier to understand, we can use the perf.decile function. This function takes the actual status (actual) and the predicted probability (pred) as inputs, and generates the performance figure in the following steps:

  1. Rank the pred probability, and divide the records into 10 different groups (deciles)
  2. Calculate the actual and predicted rates in each decile
  3. Plot the actual and predicted rates, together with the diagonal reference line.

If a model performs well, the predicted rates should match the actual rates closely, which means the decile points should lie around the reference line, and the good deciles should be far away from the bad deciles.

pred.test <- predict(lg.aic, newdata = dt.test, type = 'response')
perf.decile(actual = dt.test$status, pred = pred.test, add.legend = TRUE)

Convert Coefficients to Rate: coef2rate

After checking the model performance, it is also useful to present the regression model outputs in an audience-friendly style. Since the model is built with WOE, the regression coefficients are a bit difficult to explain, because they don't relate to the original data directly. The coef2rate function is designed to convert these coefficients back to the good/bad rates for each group and variable, so that a non-technical audience can understand them easily. The coef2rate function works in two different ways, depending on whether force.change is FALSE or TRUE:

pred.stat <- coef2rate(model = lg.aic, data = dt.test, 
  stat = stat.train, force.change = TRUE)
head(pred.stat)

After calculating Pred.Rate.1, we can plot it again with the ggstat function by setting y = 'Pred.Rate.1'.

ggstat(pred.stat, y = 'Pred.Rate.1',  y.label = 'Pred.Perc.1')
