For an interactive presentation version, please check here
# dir <- 'F:/Projects/Rpackage/streamlineR' dir <- 'C:/Users/Jianhua/Dropbox/work_doc/Rpackage/streamlineR' knitr::opts_chunk$set(echo = TRUE, root.dir = dir, warning=FALSE, message=FALSE)
This package is designed to streamline the routine modeling work, especially for scoring. It provides some handy functions to bin numerical variables, replace numerical variables with Weight of Evidence (WOE), ranking variables by Information Values (IV), plotting the successful/failure rates, check model performance based on AUC, and so on.This package also provides the useful function to convert the model output (e.g., coefficients) to graph/tables that are easier to understand for non-technical audience.
The following example illustrates how to use the streamlineR
package to prepare data, build models, and visualize results.
This analysis relies on other packages. If these packages are not available yet in your computer, you need to install them with the following commands.
# If the default mirror is blocked, choose another mirror to install R packages, # chooseCRANmirror() install.packages(c('dplyr', 'car', 'caret', 'e1071', 'knitr', 'reshape2', 'corrplot','rpart', 'scales', 'shiny', 'survival', 'gridExtra', 'devtools', 'pec', 'MASS', 'pROC', 'manipulate'), dependencies = TRUE)
After installing these packages, you need to load them into R, so that you can use the functions in those packages.
# Load pacakges sapply(c('dplyr', 'car', 'caret', 'e1071', 'knitr', 'reshape2', 'corrplot','rpart', 'scales', 'survival', 'gridExtra', 'devtools', 'pec', 'MASS', 'pROC', 'manipulate'), require, character.only = TRUE)
streamlineR
from GithubThe streamlineR
package is under development on the Github platform. The package can be installed using the install_github
function from the devtools
package. If the install_github
function does not work, you can download the package from here, and then install the package locally in Rstudio (Tools -> Install Packages -> Install from -> Package Archive File (.zip; .tar.gz))
# If the install_github does not work, you can download the package from github, # and then install the package locally: # https://api.github.com/repos/JianhuaHuang/streamlineR/zipball/master devtools::install_github('JianhuaHuang/streamlineR') library(streamlineR)
In this example, I analyzed the primary biliary cirrhosis (PBC) data set from the survival package. The details of this data set is available here, or you can run ?survival::pbc
to find the data description within R.
The data set can be loaded into R directly by calling the data from the package. Because the sample size is a little small, I increased the sample size by resampling the data 10000 times.
dt <- survival::pbc %>% transmute(age = round(age), gender = sex, platelet, stage = as.character(stage), time, status = as.numeric(status %in% c(1, 2))) %>% filter(!is.na(stage)) set.seed(1111) # reproducible dt <- dt[sample(nrow(dt), 10000, replace = T), ] row.names(dt) <- NULL dim(dt) str(dt) head(dt)
Before doing any analysis, I held out some data as test data set. The createDataPartition
function (from caret
package) is used to split the data into training and test data sets. The training data set is used to develop the model, and the test data set is used to test the model performance.
set.seed(1111) ind.train <- createDataPartition(dt$status, p = .7, list = FALSE) dt.train <- dt[ind.train, ] dt.test <- dt[-ind.train, ] row.names(dt.train) <- NULL row.names(dt.test) <- NULL dim(dt.train) dim(dt.test) # preserve the original values dt.train.bk <- dt.train dt.test.bk <- dt.test
bin.knn
Before building the model with training data, we may need to convert some numerical variables to categorical variables by binning the numerical values into different groups, so that we can model the non-linear relationship between the independent and dependent variables. This package provides two methods to find the cut points for binning: bin.knn
and bin.rpart
.
The bin.knn
method finds the cut points based on the regression coefficients using the KNN algorithm. Generally, the bin.knn
finds the cut points through the following steps:
The bin.knn
method can now deal with Generalized Linear Models (GLM) and Survival Model (class of coxph in R). The bin.knn
function takes four arguments: formula, data, n.group, and min.bucket. The min.bucket is the minimum ratio of records in each bucket. The min.bucket should be small enough to make that there are enough buckets for binning, and big enough to guarantee that the estimated regression coefficients are statistically reliable for all buckets.
The following function bins the 20 buckets (including NA) into 5 groups represented by different colors. We can extract the cut points based on the colors for different groups.
bin.knn(status ~ platelet, data = dt.train, n.group = 5, min.bucket = 0.05)
Although the bin.knn
function take into account both the order and coefficient of each bucket, it may bin some buckets together if the coefficients for some neighboring buckets are very different. In this case, we may need to change the number of groups and/or buckets by adjusting the n.group
and/or min.bucket
argument. This can be done interactively using the manipulate function as follows:
manipulate(bin.knn(status ~ platelet, data = dt.train, n.group, min.bucket), n.group = slider(1, 10, step = 1, initial = 5, label = 'Number of Groups'), min.bucket = slider(0.01, .2, step = 0.01, initial = 0.05, label = 'Minimum Population'))
By changing the n.group
and min.bucket
repeatedly, we may be able to find the appropriate cut points.
bin.rpart
Although the bin.knn
method is intuitive and provides the visualization, it is difficult to find the optimal cut points manually. Another binning method bin.rpart
, can help us to find the optimal cut points automatically, and can be applied to more models.
rpart
The bin.rpart
relies on the output from rpart
(recursive partitioning), a famous algorithm used to build the decision tree. The rpart
function can produce the optimal splits of numerical data, and generates a tree-structure nodes. Based on these splits and nodes, we can extract the cut points and bin the numerical values into different groups.
rpart(formula = status ~ age, data = dt.train, control = rpart.control(minbucket = .01 * nrow(dt.train)))
The usage of bin.rpart
is very similar rpart
, except that the control argument in rpart
is named as rcontrol in bin.rpart
. The following code generates the cut points and bins for age and platelet. The cut points (cut.points) and bins are saved in a list as the function outputs.
lg.bin.age <- bin.rpart(formula = status ~ age, data = dt.train, rcontrol = rpart.control(minbucket = .01 * nrow(dt.train))) str(lg.bin.age) lg.bin.platelet <- bin.rpart(formula = status ~ platelet, data = dt.train, rcontrol = rpart.control(minbucket = .01 * nrow(dt.train)))
Compared to other packages (such as smbinning
and woe
) that only provides binning for logistic model, bin.rpart
can provide the optimal binning for all models that can be passed to the rpart
function. For example, the bin.rpart
function can generate the optimal cut points of age in a survival model, if we change the dependent variable to a survival object (Surv(time, status)
) in the formula.
surv.bin.age <- bin.rpart(formula = Surv(time, status) ~ age, data = dt.train, rcontrol = rpart.control(minbucket = .05 * nrow(dt.train))) ## cp = 0.01
By default, the cp (complexity parameter used to control rpart
. The detail of the cp argument can be checked with ?rpart.control
) is set as 0.01, which is a little conservative for the survival model. Thus the number of cut points is usually small for the survival model, if we use the default cp value. We can reduce the cp value (e.g., 0.001) to get more cut points. We may be able to achieve an appropriate number of cut points by changing the cp argument repeatedly.
In stead of changing the cp argument manually, the n.group (number of acceptable binning groups, can be a single number or a range) argument can help to find the appropriate number of cut points automatically. For example, if we set the acceptable n.group as 3:7, the bin.rpart
function will try different cp, until the number of binning groups is within 3 to 7 (or the number of cut points within 2:6).
surv.bin.age2 <- bin.rpart(formula = Surv(time, status) ~ age, data = dt.train, rcontrol = rpart.control(minbucket = .05 * nrow(dt.train)), n.group = 3:7)
After binning, we can replace the original numerical values with the corresponding bins saved in the outputs from bin.rpart
.
# We don't need the time column anmore, delete it in both dt.train and dt.test dt.train <- dplyr::select(dt.train, -time) dt.test <- dplyr::select(dt.test, -time) head(dt.train) dt.train$age <- lg.bin.age$bins dt.train$platelet <- lg.bin.platelet$bins head(dt.train)
After replacing the numerical variables with bins, the numerical variables are converted to categorical variables automatically.
level.stat
For all of the categorical variables, it is useful to calculate some statistics (e.g., population frequency, good/bad rates, and WOE) for different levels and variables, before building the models. The level.stat
function is designed for this purpose. In order to use the level.stat
function, the dependent variable (y) should be binary, and you should specify which value is flagged as 0/1 in the level.stat
output.
col.x <- c('age', 'gender', 'platelet', 'stage') stat.train <- level.stat(dt.train, x = col.x, y = 'status', flag.0 = 0, flag.1 = 1) head(stat.train)
ggstat
In accompany with the level.stat
function, is the visualization of its output using the ggstat
function. The ggstat
function employs the ggplot2
package to plot the statistics for different groups and variables.
ggstat(data = stat.train, var = "Variable.IV", x = "Group", y = "Rate.1", y.label = "Perc.1", y.label.col = "red", y.title = NULL, bar.col = "cornflowerblue", width = "Rate.group", width.label = "Perc.group", width.label.col = "black", ncol = NULL, theme = "classic", background = "white")
The ggstat
function takes multiple arguments. All of the arguments have default values, except the first one (data). You only need to specify the data, which is usually the output from the level.stat
function (It can also be other data, but the default arguments need to be changed.), to plot the Rate.1 for different groups and variables.
The default arguments can be changed to other values to make the plot looks different. For example, if we don't want to use width of the bar to represent the ratio of population in each group, we can set width = XX
(where XX is a numerical value), so that the same width is used for each bar.
ggstat(stat.train, width = .2)
We can also plot other statistics (e.g., "WOE") by changing the y value to y = 'WOE'
, and and y.label to y.label = 'WOE.round'
. We can also set the number of columns for the plot by changing the ncol argument.
ggstat(stat.train, y = 'WOE', y.label = 'WOE.round', width = .2, width.label = NULL, ncol = 4)
In order to make the plots look diverse, you can also change the colors for the bar, the labels for y and bar width, and the background. A lot of colors can be used in the plot. We can use the handy display.col()
function to show the possible colors:
display.col() # reference to http://sape.inf.usi.ch/quick-reference/ggplot2/colour
You can pick your favorite colors for different elements in the ggstat
function:
ggstat(stat.train, width = .2, y.label.col = 'white', bar.col = 'black', width.label.col = 'blue', background = 'green')
replace.woe
For logistic model, sometimes it is useful to convert original categorical variables into numeric WOE value, since it can guarantee the linearity between the dependent and independent variables. The replace argument can be used to control whether the original values should be replaced (replace = TRUE
) or not (`replace = FALSE'). If the replace argument is set to TRUE, the original values will be replaced by the corresponding WOE values directly. If the replace argument is set to FALSE, the WOE value will be added as a new column with _woe appended to the original column name.
replace.woe(data = dt.train, stat = stat.train, replace = FALSE) %>% head dt.train <- replace.woe(data = dt.train, stat = stat.train, replace = TRUE) head(dt.train)
ggcorr
corrplot
Another advantage of converting categorical values to WOE is that we can calculate the correlation between all independent variables, since they are all numeric values.In order to do this, we need to calculate the correlation matrix between the independent variables, and then visualize them using ggcorr
. The idea of ggcorr
originates from the corrplot
function in the corrplot
package. The corrplot
function is based on the base R plot system, while the ggcorr
function is based on the ggplot system. With the default setting, you can generate the corrplot
as follows:
cor.mat <- cor(dt.train[, col.x]) corrplot(cor.mat)
ggcorr
The ggcorr
function is a simplified version of the corrplot
, which only include a few arguments that may be useful to most users. Compared to the corrplot
function, the diagonal big dots are removed, because we don't want to highlight those values. Since the plot is based on the ggplot system, the figure can be saved with the ggsave
function directly. By default, the ggcorr
generate a figure looks like this:
ggcorr(cor.mat)
You can change a few options in the function to personalize the plot:
Using the following randomly-generated data as an example, you can get different plots by changing a few arguments:
set.seed(1111) data.random <- matrix(runif(100), 10) colnames(data.random) <- paste('Variable', 1:10) cor.random <- cor(data.random) ggcorr(cor.random) # default output ggcorr(cor.random, var.position = 'diagonal', add.legend = FALSE) ggcorr(cor.random, lower = TRUE) ggcorr(cor.random, lower = TRUE, var.position = 'diagonal', high = 'blue', low = 'green')
With the WOE values, we are ready to build the logistic regression model with the training data set.
lg <- glm(status ~ ., dt.train, family=binomial(link='logit')) summary(lg)
Then, we can select the significant variables with the stepAIC
function from the MASS
package. It is worth to mention that there is no argument that controls the p-to-enter directly in the function, since the variable selection is based on AIC. Instead, we can adjust the k value to control the threshold for a variable to enter into the model. For discussion about the usage of k argument, please refer to this thread
lg.aic <- stepAIC(lg, k = qchisq(0.05, 1, lower.tail=F)) # p to enter: 0.05 summary(lg.aic) data.frame(vif(lg.aic)) # check the multicollinearity between predictors
bin.custom & replace.woe
After building the model, we need to check the model performance using the test data, which we hold out at the very beginning of the analysis. In order to use the test data to check the model performance, we need to convert the original values to bins, and then to WOE, so that it can be used in the model.
bin.custom
In order to bin the test data, we need to use the bin.custom
function. This function requires two arguments: the data needed to be binned, and its cut points. Since we already saved the optimal cut points for the training data, we can use those cut points to cut the test data into different bins directly.
head(dt.test) dt.test$age <- bin.custom(dt.test$age, cut.p = lg.bin.age$cut.points) dt.test$platelet <- bin.custom(dt.test$platelet, cut.p = lg.bin.platelet$cut.points) head(dt.test)
replace.woe
After converting the original numeric values into different bins, then we can replace the bins with the corresponding WOE values estimated based on the training data.
dt.test <- replace.woe(dt.test, stat = stat.train, replace = TRUE) head(dt.test)
perf.auc & perf.decile
Now, with the test data in shape of WOE values, we can check the model performance based on AUC (Area Under Curve) or decile rates.
perf.auc
In order to check the performance of AUC, the perf.auc
function requires the model, the training data, and the test data. Then the function will generate the AUC and ROC curve for both the training and test data sets.
perf.auc(model = lg.aic, dt.train, dt.test)
perf.decile
Thought the AUC makes sense to some technical people who know statistics well, it may not be sensible to the non-technical audience. In order to introduce the model performance in a way that is easier to understand, we may need the perf.decile
function. This function takes the actual status (actual), and the predicted probability (pred) as inputs, and generate the performance figure in the following steps:
If a model performs well, the predicted rates should match with the actual values well, which means the decile points should be around the reference line, and the good deciles should be far away from the bad deciles.
pred.test <- predict(lg.aic, newdata = dt.test, type = 'response') perf.decile(actual = dt.test$status, pred = pred.test, add.legend = TRUE)
coef2rate
After checking the model performance, it is also useful to produce the regression model outputs in a audience-friendly style. Since the model is built with WOE, the explanation of the regression coefficients is a bit difficult, since it doesn't relate to the original data directly. The coef2rate
function is designed to convert these coefficients back to the good/bad rates for each group and variables, so that the non-technical audience can understand it easily. The coef2rate
function works in two different ways dependent on whether force.change is FALSE or TRUE:
pred.stat <- coef2rate(model = lg.aic, data = dt.test, stat = stat.train, force.change = TRUE) head(pred.stat)
After calculating the Pred.Rate.1, we can again plot it using the ggstat function by setting y = 'Pred.Rate.1'
ggstat(pred.stat, y = 'Pred.Rate.1', y.label = 'Pred.Perc.1')
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.