file <- "data/german.rds"
stopifnot(file.exists(file))
data <- readRDS(file)
head(data)
In this exercise we will use the 'South German Credit' data set. It contains a classification of the credit risk of 1000 individuals into 'good' and 'bad' together with 20 additional attributes.
Simply download the file german.rds by clicking on it, or download it from the corresponding homepage
http://archive.ics.uci.edu/ml/datasets/South+German+Credit
which also provides more detailed information on the data set.
We can import/read this file using data <- readRDS(...). The file contains the following information:
status: status of the debtor's checking account with the bank (factor).
duration: credit duration in months (integer).
credit_history: history of compliance with previous or concurrent credit contracts (factor).
purpose: purpose for which the credit is needed (factor).
amount: credit amount in DM (integer).
savings: debtor's savings (factor).
employment_duration: duration of debtor's employment with current employer (factor; discretized quantitative).
installment_rate: credit installments as a percentage of debtor's disposable income (ordered factor; discretized quantitative).
personal_status_sex: combined information on sex and marital status (factor; sex cannot be recovered from the variable, because male singles and female non-singles are coded with the same code (2); female widows cannot be easily classified, because the code table does not list them in any of the female categories).
other_debtors: Is there another debtor or a guarantor for the credit? (factor).
present_residence: length of time (in years) the debtor lives in the present residence (ordered factor; discretized quantitative).
property: the debtor's most valuable property, i.e. the highest possible code is used. Code 2 is used, if codes 3 or 4 are not applicable and there is a car or any other relevant property that does not fall under variable savings (factor).
age: age in years (integer).
other_installment_plans: installment plans from providers other than the credit-giving bank (factor).
housing: type of housing the debtor lives in (factor).
number_credits: number of credits including the current one the debtor has (or had) at this bank (ordered factor, discretized quantitative).
job: quality of debtor's job (ordinal).
people_liable: number of persons who financially depend on the debtor (i.e., are entitled to maintenance) (factor, discretized quantitative).
telephone: Is there a telephone landline registered on the debtor's name? (factor; remember that the data are from the 1970s).
foreign_worker: Is the debtor a foreign worker? (factor).
credit_risk: Has the credit contract been complied with (good) or not (bad)? (factor).

We would like to find out how the credit risk of a person depends on the provided additional attributes of the person and the considered credit itself.
Therefore, our response in this case is the binary variable credit_risk; as covariates we have 20 additional variables (17 categorical, 3 numeric).
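The stated split into categorical and numeric covariates can be checked by tabulating the column types. The following is a minimal sketch in base R; the data frame toy and its values are made up for illustration so that the code runs without the file. With the real data, one would presumably replace toy by data.

```r
# Hypothetical toy data frame standing in for `data` (values are invented)
toy <- data.frame(
  credit_risk = factor(c("good", "bad", "good")),
  status      = factor(c("no checking account", "... < 0 DM", "no checking account")),
  duration    = c(12L, 24L, 6L),
  amount      = c(5000L, 1200L, 800L)
)

# Drop the response and keep only the covariates
covariates <- toy[, setdiff(names(toy), "credit_risk"), drop = FALSE]

# Factors (ordered or not) count as categorical, integer/double as numeric
is_num <- vapply(covariates, is.numeric, logical(1))
type_counts <- c(categorical = sum(!is_num), numeric = sum(is_num))
type_counts
```

Ordered factors such as installment_rate are not numeric in R, so this check counts them as categorical.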
Apply the forest-building function cforest to the data in "german.rds" to build a forest model. Then use ranger to build a forest model using the same parameters and compare it to the cforest model, e.g., based on predictions, the number of misclassifications on the testing data or variable importance.

# data <- readRDS("data/german.rds")

# Model formula: credit_risk as response, all 20 covariates as predictors
f <- credit_risk ~ status + duration + credit_history + purpose + amount +
  savings + employment_duration + installment_rate + personal_status_sex +
  other_debtors + present_residence + property + age +
  other_installment_plans + housing + number_credits + job + people_liable +
  telephone + foreign_worker

library("partykit")

# Conditional inference forest with 50 trees, fitted on the full data set
cf <- cforest(formula = f, data = data, ntree = 50)

# A new client whose credit risk is to be predicted
newclient <- data.frame(
  status = "no checking account",
  duration = 12,
  credit_history = "no credits taken/all credits paid back duly",
  purpose = "repairs",
  amount = 5000,
  savings = "unknown/no savings account",
  employment_duration = "4 <= ... < 7 yrs",
  installment_rate = "< 20",
  personal_status_sex = "male : married/widowed",
  other_debtors = "none",
  present_residence = "1 <= ... < 4 yrs",
  property = "real estate",
  age = 40,
  other_installment_plans = "none",
  housing = "own",
  number_credits = "1",
  job = "skilled employee/official",
  people_liable = "0 to 2",
  telephone = "no",
  foreign_worker = "no"
)

# The same client, but with a different credit purpose
newclient2 <- newclient
newclient2$purpose <- "furniture/equipment"

predict(cf, newdata = newclient)
predict(cf, newdata = newclient2)
# Split the data into roughly 2/3 training and 1/3 testing observations
set.seed(4)
trainid <- sample(1:NROW(data), size = 667, replace = FALSE)
train <- data[trainid, ]
test <- data[-trainid, ]
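library("ranger")
library("caret")

# Out-of-bag confusion matrices for forests with 50 and 500 trees
rf <- ranger(formula = f, data = train, num.trees = 50)
rf$confusion.matrix
rf <- ranger(formula = f, data = train, num.trees = 500)
rf$confusion.matrix

# Use 50 trees again to match the cforest settings
rf <- ranger(formula = f, data = train, num.trees = 50)

# cf was fitted on the full data set, which includes the testing observations;
# refit on the training data only so that both models are compared fairly
cf_train <- cforest(formula = f, data = train, ntree = 50)

# Confusion matrices on the testing data
pred_cf <- predict(cf_train, newdata = test)
confusionMatrix(pred_cf, test$credit_risk)
pred_rf <- predict(rf, data = test)$predictions
confusionMatrix(pred_rf, test$credit_risk)

# Variable importance for both models
varimp(cf_train)
rf <- ranger(f, data = train, num.trees = 50, importance = "impurity")
importance(rf)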
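The number of misclassifications on the testing data can also be computed directly from the predicted and observed classes, without caret. The following is a minimal base R sketch; the vectors observed and predicted are invented toy values, standing in for test$credit_risk and the predictions of one of the forests.

```r
# Hypothetical toy classes (invented; not results from the credit data)
observed  <- factor(c("good", "bad", "good", "good", "bad"),
                    levels = c("bad", "good"))
predicted <- factor(c("good", "good", "good", "bad", "bad"),
                    levels = c("bad", "good"))

# Cross-tabulate predicted against observed classes
conf_tab <- table(predicted = predicted, observed = observed)

# Count and share of observations where prediction and truth disagree
n_misclassified <- sum(predicted != observed)
misclassification_rate <- mean(predicted != observed)

conf_tab
n_misclassified
misclassification_rate
```

Computing the same two numbers for both models on the same testing data gives a simple basis for comparing them.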