stepRPC: Stepwise logistic regression based on risk profile concept
In PDtoolkit: Collection of Tools for PD Rating Model Development and Validation

stepRPC

R Documentation

Description

stepRPC is a customized stepwise regression with p-value and trend checks that additionally takes into account the order of the supplied risk factors per group when selecting a candidate for the final regression model. The trend check compares the observed trend between the target and the analyzed risk factor with the trend of the estimated coefficients within the logistic regression. Note that the procedure checks the column names of the supplied db data frame, so some renaming (replacement of special characters) may occur. For details, please check the help example.
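
The selection logic described above can be illustrated with a minimal base R sketch. It is not the PDtoolkit implementation: `step.sketch`, the simulated data and the sign-based trend check are assumptions made for illustration, and the risk factors are plain numeric columns standing in for WoE-coded ones. Candidates are tried group by group in the supplied order, and a candidate is kept only if all coefficients stay significant and its coefficient has the same sign as in a univariate fit.

```r
# hypothetical helper: group-ordered forward selection with p-value and
# sign checks (a simplified sketch, not the stepRPC algorithm itself)
step.sketch <- function(db, target, risk.profile, p.value = 0.05) {
  selected <- character(0)
  for (g in sort(unique(risk.profile$group))) {
    # candidates are tested in the order in which they are supplied
    for (rf in risk.profile$rf[risk.profile$group == g]) {
      fit <- glm(reformulate(c(selected, rf), response = target),
                 family = binomial, data = db)
      est <- summary(fit)$coefficients
      # p-value check: every risk factor in the model must be significant
      p.ok <- all(est[-1, "Pr(>|z|)"] < p.value)
      # trend check (assumed form): the sign of the coefficient agrees
      # with the sign obtained from a univariate fit
      uni <- coef(glm(reformulate(rf, response = target),
                      family = binomial, data = db))[rf]
      t.ok <- sign(est[rf, "Estimate"]) == sign(uni)
      if (p.ok && t.ok) selected <- c(selected, rf)
    }
  }
  selected
}

# simulated data: x1 and x2 carry signal, x3 is noise
set.seed(3)
n  <- 2000
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-2 + 0.8 * x1 + 0.5 * x2))
db <- data.frame(y, x1, x2, x3)

# risk profile: x1 and x3 belong to group 1, x2 to group 2
rp  <- data.frame(rf = c("x1", "x3", "x2"), group = c(1, 1, 2))
sel <- step.sketch(db = db, target = "y", risk.profile = rp)
sel
```

Risk factors accepted in group 1 remain in the model while group 2 is tested, which mirrors how selected risk factors are carried over from one group to the next.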

Usage

stepRPC(
  start.model,
  risk.profile,
  p.value = 0.05,
  coding = "WoE",
  coding.start.model = TRUE,
  check.start.model = TRUE,
  db,
  offset.vals = NULL
)

Arguments

`start.model`	Formula class that represents the starting model. It can include some risk factors, but it can also be defined with only an intercept (`y ~ 1`, where `y` is the target variable).
`risk.profile`	Data frame that defines the risk profile. It has to contain the following columns: `rf` and `group`. Column `group` defines the order in which groups are tested as candidates for the regression model. Risk factors selected in each group are kept as starting variables when the next group is tested. Column `rf` contains all candidate risk factors supplied for testing.
`p.value`	Significance level for the p-values of the estimated coefficients. For `WoE` coding this value is directly compared to the p-value of the estimated coefficients, while for `dummy` coding a multiple Wald test is employed and its p-value is compared with the selected threshold (`p.value`).
`coding`	Type of risk factor coding within the model. Available options are: `"WoE"` and `"dummy"`. If `"WoE"` is selected, the modalities of the risk factors are replaced by WoE values, while for the `"dummy"` option dummies (0/1) are created for `n-1` modalities, where `n` is the total number of modalities of the analyzed risk factor.
`coding.start.model`	Logical (`TRUE` or `FALSE`), whether the risk factors from the starting model should be WoE coded. It has an impact only for the WoE coding option.
`check.start.model`	Logical (`TRUE` or `FALSE`), whether the risk factors from the starting model should be checked for p-value and trend in the stepwise process.
`db`	Modeling data with risk factors and target variable. All risk factors (apart from the risk factors from the starting model) should be categorized and of character type.
`offset.vals`	This can be used to specify an a priori known component to be included in the linear predictor during fitting. This should be `NULL` or a numeric vector of length equal to the number of cases. Default is `NULL`.
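
To make the two coding options and the multiple Wald test more concrete, the following base R sketch builds both codings for one toy risk factor. The data are simulated, the WoE sign convention (log of the share of goods over the share of bads) is an assumption that may differ from the one used in the package, and the Wald statistic is computed by hand rather than with PDtoolkit internals.

```r
# hypothetical toy data: binary target y and one categorical risk factor rf
set.seed(2)
rf <- sample(c("A", "B", "C"), 2000, replace = TRUE)
y  <- rbinom(2000, 1, c(A = 0.05, B = 0.10, C = 0.25)[rf])

# WoE coding: a single numeric column, each modality replaced by its WoE
# (assumed convention: log(share of goods / share of bads))
good   <- tapply(1 - y, rf, sum)
bad    <- tapply(y, rf, sum)
woe    <- log((good / sum(good)) / (bad / sum(bad)))
rf.woe <- unname(woe[rf])

# dummy coding: n - 1 indicator (0/1) columns, first modality as reference
rf.dummy <- model.matrix(~ rf)[, -1]
head(rf.dummy)

# multiple (joint) Wald test for all dummies of the risk factor:
# W = b' V^{-1} b, compared with a chi-square with length(b) degrees of freedom
fit    <- glm(y ~ rf, family = binomial)
b      <- coef(fit)[-1]
V      <- vcov(fit)[-1, -1]
wald   <- as.numeric(t(b) %*% solve(V) %*% b)
p.wald <- pchisq(wald, df = length(b), lower.tail = FALSE)
p.wald
```

With WoE coding a risk factor contributes one coefficient and therefore one p-value, whereas with dummy coding it contributes `n-1` coefficients, which is why a joint test is the natural quantity to compare with `p.value`.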

Value

The command stepRPC returns a list of four objects.
The first object (model) is the final model, an object of class inheriting from "glm".
The second object (steps) is the data frame with risk factors selected at each iteration.
The third object (warnings) is the data frame with warnings, if any are observed. The warnings refer to the following checks: whether a risk factor has more than 10 modalities, whether any of the bins (groups) has less than 5% of observations, and whether there are problems with the WoE calculations.
The fourth object (dev.db) is the model development database.
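
The first two checks behind the warnings can be approximated in a few lines of base R. This is only a sketch of what such checks might look like for a single hypothetical risk factor; the thresholds come from the text above, while the variable names and the toy data are assumptions.

```r
# hypothetical categorical risk factor with one sparse modality
rf <- c(rep("A", 60), rep("B", 37), rep("C", 3))

# share of observations per modality
share <- prop.table(table(rf))

# check 1: more than 10 modalities
too.many.mod <- length(share) > 10

# check 2: any bin (group) with less than 5% of observations
small.bins <- names(share)[as.vector(share) < 0.05]

too.many.mod  # FALSE, there are only three modalities
small.bins    # "C", which holds 3% of the observations
```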

Examples

suppressMessages(library(PDtoolkit))
data(loans)
# identify numeric risk factors
num.rf <- sapply(loans, is.numeric)
num.rf <- names(num.rf)[!names(num.rf) %in% "Creditability" & num.rf]
# discretize numeric risk factors using ndr.bin from the monobin package
loans[, num.rf] <- sapply(num.rf, function(x) 
  ndr.bin(x = loans[, x], y = loans[, "Creditability"])[[2]])
str(loans)
# create risk factor priority groups
rf.all <- names(loans)[-1]
set.seed(591)
rf.pg <- data.frame(rf = rf.all, group = sample(1:3, length(rf.all), replace = TRUE))
head(rf.pg)
# bring in the AUC of each risk factor in order to sort them within groups
bva <- bivariate(db = loans, target = "Creditability")[[1]]
rf.auc <- unique(bva[, c("rf", "auc")])
rf.pg <- merge(rf.pg, rf.auc, by = "rf", all.x = TRUE)
# prioritize risk factors: sort by group, then by AUC within each group
rf.pg <- rf.pg[order(rf.pg$group, rf.pg$auc), ]
rf.pg
res <- stepRPC(start.model = Creditability ~ 1,
               risk.profile = rf.pg,
               p.value = 0.05,
               coding = "WoE",
               db = loans)
summary(res$model)$coefficients
res$steps
head(res$dev.db)

PDtoolkit documentation built on Sept. 20, 2023, 9:06 a.m.