Lending Club and their probability of default. Lending Club acts as an intermediary connecting borrowers and investors, helping people to access or grant credit via peer-to-peer lending.

The focal point of this problem set is to present the application of machine learning in an accessible and reproducible manner. We will therefore not dwell on complex statistical formulae or on well-established credit default research.

Most exercises of this problem set build on each other to some extent. As the tasks become trickier and more challenging, or may require previously gained knowledge, it is highly recommended to solve the problem set in its predetermined order. Prior knowledge of R is not mandatory, yet experience with R or comparable statistical software, as well as a genuine interest in econometrics, is an advantage. Since most code is provided, numerous challenging quizzes have been added to make up for the limited amount of code you have to write yourself.

Exercise contents

1. Agenda and overview
Introducing the necessary vocabulary as well as the pros and cons of peer-to-peer lending.

2. Data science with R

2.1 Visual analysis
Data manipulation and first dive into our data.

2.2 Text mining
Deriving knowledge by investigating the loan purpose.

3. Statistical learning

3.1 Definition and area of application
What is machine learning and what can be accomplished with it?

3.2 Model accuracy
How to assess the best fit?

4. Predicting credit default rates
Transitioning from regression to classification.

4.1 Logistic regression
Introducing logistic models to predict credit default.

4.2 Resampling and tuning
Maximizing prediction power without overfitting.

4.3 Decision tree
Introducing a tree-based approach to predict credit default.

4.4 Bagging and boosting
Further optimization of prediction accuracy.

5. Conclusion
Comparing results, discussing alternatives and further possibilities.

6. References and credits

7. Appendix
Conduct your own research with the data.

Exercise 1 -- Agenda and overview

This problem set focuses on the application of machine learning to predict credit default. We will mostly utilize data sets provided by Lending Club, a peer-to-peer lending platform. The thesis covers multiple ways of analyzing the given data. The standard approach resembles a step-by-step tutorial with plenty of background information. The goal is to provide a smooth entry into machine learning, including practical experience in analyzing data.

We will start with an exploratory investigation to get acquainted with the Lending Club data. As data manipulation is an essential part of data science, this thesis provides a short and straightforward introduction to data manipulation with R. The middle section focuses on laying out the foundation for more advanced statistical methods to predict credit default in subsequent chapters. To conclude, the final exercise recaps our previous results and provides a short outlook regarding alternative methods and steps not covered in this work.

Lending Club

Lending Club is an intermediary providing a peer-to-peer lending service. Contrary to the traditional banking system, Lending Club and other peer-to-peer platforms like Prosper and Upstart are open to individuals with very different credit backgrounds. The procedure required to take out a loan, however, remains comparable. Much like a bank, Lending Club offers borrowers with a good credit score lower interest rates, while those with a bad credit score are exposed to higher interest rates.

Lending Club takes a 1% service charge from lenders on every borrower payment. If a loan is not fully paid back or defaults, Lending Club charges a collection fee of 18% of the recovered amount; the remainder is paid out to the affected lenders. The fees for borrowers depend on a loan grade assigned by Lending Club. There are currently 35 loan grades ranging from A1 to G5: A1 corresponds to an interest rate of 5.32%, while G5 corresponds to an interest rate of 31%. Additionally, any overdue receivables incur further charges (Lendingclub.com, 2017).

Lending Club only accepts lenders if they meet certain requirements. For instance, a lender must be at least 18 years old and comply with the so-called financial suitability conditions. All current requirements are disclosed on Lending Club's website. Borrowers, on the other hand, must file a loan application, which is capped at 10% of their equity. In sum, both parties must share private information regarding their financial status. Additionally, borrowers must provide data about their credit score and may take out a loan ranging from USD 1,000 to USD 40,000 (Lendingclub.com, 2017).

Peer-to-peer lending

Peer-to-peer lending is a convenient way for individuals to lend money to and borrow money from each other without relying on the services of established financial intermediaries such as banks. Nevertheless, to handle the risks of these transactions, the two parties still require a service provider, e.g., Lending Club or Prosper. These providers facilitate the actual transaction between individuals and act as gatekeepers. For a further discussion of peer-to-peer or social lending see Wang, Greiner and Aronson (2009), who discuss the different approaches to peer-to-peer marketplaces in greater detail.

As stated in Shaw (2017), peer-to-peer lending promises average returns of 4.4% a year, which is considerably higher than the average interest of a savings account at 2.5%. Likewise, borrowers enjoy similar financial advantages compared to rather expensive bank loans, according to Rind (2017).

Brignall (2017) and Verstein (2011, pp.457-466) state that lenders in a peer-to-peer market have more options to collect information about potential borrowers. This high transparency not only reduces costs but also encourages lenders to invest in individuals whom they expect to repay the loan both in full and on time. In fact, even individuals with a bad credit score have a chance to take out a loan if they can persuade others to grant them money. That is a huge plus, as borrowers with a complex credit background would typically fall through the cracks. Moreover, the loan amount is granted almost immediately after a successful application.

However, there are drawbacks, too. As the exposure to defaults is much higher, lenders remain mostly uninsured. This might be linked to the fact that delinquencies have increased substantially since the introduction of peer-to-peer lending.

The latter caused regulators to act by issuing guidelines and rules for service providers (Laurent, 2016). For a more detailed view of regulatory measures of social lending, Verstein (2011) is recommended for further reading.

The next exercise comprises the first tasks. Please proceed with Go to next exercise or manually change tabs.

Exercise 2 -- Data science with R

This chapter provides an introduction to data science in R. The initial exercise presents functions to manipulate our data set, allowing us to derive first insights as well as more complex visual plots. In more detail, we will learn how to subset and aggregate credit scores, loan amounts and loan volume in the first part and visualize those results as a function of time, separated by US state, in the second part.

Exercise 2.2 focuses on text mining and is an early opportunity to showcase the broad applicability of machine learning. Text mining is a method to extract patterns or relationships from large volumes of text. By analyzing text fragments, we can spot trends or segment our data. In our case specifically, we examine the variable purpose, which reveals the reasons behind taking out a loan with Lending Club.

Exercise 2.1 -- Visual analysis

Before working with the data, there is one mandatory step which will recur throughout this problem set: we must load the data first. Regarding the data import, the infobox below illustrates the function readRDS(), which we will utilize to import our data set data.rds.

info("readRDS()") # Run this line (Strg-Enter) to show info

Task: Import the data set data.rds utilizing the function readRDS(). Assign the output of the function to a variable called data. To check whether you did everything right, proceed with check.

Remark: For further advice, press the hint button, which provides additional information about the given task or parts of the solution. If you are still unable to solve the task with the provided information, opt for the solution tab, which pastes the required code into the command line. Please be aware that you may have entered an alternative solution which, under normal circumstances, would be perfectly suitable, as it yields the same result. RTutor, however, may occasionally not recognize such a solution as correct, since it expects a predefined answer and in most cases does not allow for alterations.

# Adapt the code below
# ??? = readRDS("???.rds")

Task: Let's have a quick look at our data set. The function head() returns the first six observations of our data.frame data. Please pass data as the argument.

# Adapt the code below
# ???(???)

We can see that the data set consists of 17 variables. To get an overview of all variables, please opt for the data tab next to edit. This option leads to the Data Explorer tab. Depending on the size of the current data set, it might take a while until the first results show up. We can then search the data set with the search function or select the Description tab, which provides a brief explanation of every variable.

Data

In this problem set, we occasionally rely on small user-created data sets to introduce certain features or techniques. To research credit default in a real-world example, the preparation.R script utilizes the Lending Club data to generate the two data sets data.rds and data_b.rds. Furthermore, as the predictions are computationally expensive, most predictions are solely presented. Similar to the data sets, the prediction.R script employs data_b.rds to produce the presented models. Both scripts, as well as the data and the problem set itself, may be accessed at github.com/rotterp/RTutorMachineLearningAndCreditDefault/.

The Lending Club data is freely available at Lendingclub.com. It has been modified in the following steps:

  1. All individual data files known at the time of the preparation of this thesis have been merged into a single data file.
  2. To minimize runtime, columns with either a high share of missing values, little significance to our analysis or present data have been dropped. Additionally, existing columns may have been altered, e.g., the date format was adjusted.
  3. The data set has been divided into two different data sets (data.rds; data_b.rds): one containing every loan for our initial analysis, comprising both the visualization and the text mining part, and a second data set with terminated loans only, to predict credit default in the later stages of this problem set.

All those alterations were solely made for simplification and are traceable with the preparation.R script.

Please keep in mind that the Lending Club data sets reflect the status of each loan at the time the data was disclosed. This means it is almost impossible to replicate identical results, even though supposedly the same data is utilized. Let's illustrate this with an easy example. Lending Club offers loan terms of 36 and 60 months. Assume a loan was taken out on 1 January 2012 with a loan term of 60 months. It is possible that the borrower had problems paying off his installments and the loan status changed from current to late. If the data was updated at this point, the loan status corresponds to the most recent one, late. However, there is still a possibility of this loan becoming current again. As the information concerning the development of the loan status is not part of the data set, and because the data sets are continuously updated, a loan status may change until the loan is eventually paid off or charged off.

Loan grades

This section investigates the loan grades assigned by Lending Club, which are an essential factor in determining credit risk. Simultaneously, this section serves as an entry point into data manipulation with R. Even though most parts of our data set data have been adjusted to fit our needs, there will be opportunities to aggregate the available data further. To tackle this task, we make use of two powerful and well-known R libraries called dplyr and tidyr.

Task: The given code chunk selects every loan issued between 2007 and 2011, which corresponds to the first period, and drops all columns except sub_grade. The so-called pipe operator %>% chains both statements, which is a convenient way to keep our code lean and efficient. Finally, the function table() returns the distribution of loan grades between 2007 and 2011.
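
Since the chunk itself is not displayed here, a minimal sketch of what it could look like is given below. The filter on a period column and its label '2007 - 2011' are assumptions about how the first period is encoded in data.rds.

library(dplyr)

data %>%
  # Keep only loans from the first period (label assumed)
  filter(period == "2007 - 2011") %>%
  # Drop all columns except the loan grade
  select(sub_grade) %>%
  # Tabulate the distribution of loan grades
  table()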


We can see that the loan grades are positively skewed, which implies that most borrowers who took out a loan between 2007 and 2011 had a good credit score. The next task continues our loan grade research for the whole time span.

Task: This code chunk counts the observations for each grade and status_group (a grouped version of the possible loan statuses). The function substr(x, start, stop) retains only the part of a string indexed by start and stop. As sub_grade takes values ranging from 'A1', 'A2', 'A3' to 'G5', substr(sub_grade, 1, 1) transforms those strings to 'A', 'A', 'A', ..., 'G'. This allows summarizing sub_grade in a handful of broader grade classes. Afterwards, the results are grouped by their grade, as the function group_by() creates groups based on one or more variables of our data set and allows us to perform aggregations for each group separately. In a final step, the function mutate() adds a new column with the relative frequency of loans within each grade.
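
The chunk is not displayed here either; a possible version, consistent with the description above and with the plot code below (which expects a data.frame dat with the columns grade, status_group and frac), might read:

library(dplyr)

data %>%
  # Collapse the 35 sub grades into their letter classes
  mutate(grade = substr(sub_grade, 1, 1)) %>%
  # Count the loans for each combination of grade and status group
  count(grade, status_group) %>%
  # Relative frequency of each status group within a grade
  group_by(grade) %>%
  mutate(frac = n / sum(n)) -> dat

dat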


Notice that 'Current & Issued' in conjunction with 'Paid' seems to account for the lion's share of each grade. The next task visualizes our results.

Task: This code chunk creates a bar plot which depicts the relative frequency of status groups per loan grade. We utilized our data.frame data and dplyr to aggregate the results in the prior task; therefore, we can now visualize them with the help of the ggplot2 and plotly libraries in R.

library(ggplot2)
library(plotly)

# Plot the results
ggplot(dat, aes(x = grade,
                y = frac,
                fill = status_group)) +
  # Omit position = position_dodge() as a stacked graph
  # looks nicer
  geom_bar(stat="identity") + 
  # Colors for the different status groups
  scale_fill_manual(values = c("#FFC100", 
                               "#ED5973",
                               "#39d9db",
                               "#980043")) +
  # Adjust the legend and ordinate
  theme(axis.text.y = element_text(size = 8, hjust = 1),
        legend.title = element_blank()) +
  # Labels
  labs(x = "Loan grade", y = "Relative frequency of status groups") -> g
  ggplotly(g)


Figure 2.1 Defaults seem to increase with a higher loan grade assigned by Lending Club. This is in line with our expectations, as a higher loan grade implies a worse credit score and hence a higher interest rate compared to lower grades. Moreover, regarding the different status groups, 'Current & Issued' together with 'Paid' consistently appear to account for more than half of the loans. This result, however, must be treated with care, as only 'Default' and 'Paid' are terminal status groups. Consequently, a borrower who is currently behind schedule with his payments could still either default or resume his payments.

FICO score

Besides Lending Club's internal assessment of default risk in the form of loan grades, the credit score is another important characteristic of borrowers. Lending Club's credit scoring model incorporates the FICO score, an acronym for the Fair Isaac Corporation, the company that created it. While there are requirements for borrowers, it is difficult to pin down an official minimum FICO score. According to a blog entry on their official site, however, Lending Club requires borrowers to have a minimum FICO score of 640 (Lendingclub.com, 2017). Lending Club further states an average FICO score of 699 as of June 30, 2017, which can be confirmed with our data set (Lendingclub.com, 2017).

In total, a valid FICO score ranges between 300 (poor credit) and 850 (excellent credit). Hence, a minimum FICO score of 640 might seem rather high. Experian (2017), however, states that a FICO score of less than 670 is below average and corresponds to a future delinquency rate of 28%. According to Dickler (2017), the nationwide average FICO score in the United States reached a value of 700 in 2017. This may be considered a sign of good creditworthiness concerning Lending Club's borrowers, as the probability of default falls to 8% for a FICO score of 700 (Experian, 2017).

Moreover, Lending Club's credit scoring model considers the following quantities (Lendingclub.com, 2017):

  - On-time payment percentage
  - Hard inquiries
  - Credit card usage
  - Number of owned accounts
  - Length of the credit history
  - Tax liens and bankruptcy entries

For the remainder of this section, we explore the average FICO score over different periods to see how the FICO score developed over time. Furthermore, two additional functions will be presented: summarize() and spread().

Task: The following code calculates the average FICO score avg_fico categorized by period. As Lending Club has published its data unevenly, the first period comprises loans issued between 2007 and 2011, while the last period corresponds to the second quarter of 2017 only. Since there is no assignment, the result is returned directly.
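
The chunk is not shown here; a minimal sketch, assuming the average is computed from fico_range_high (the variable used in the next task), could read:

library(dplyr)

data %>%
  # Group the loans by reporting period
  group_by(period) %>%
  # Average FICO score per period
  summarize(avg_fico = mean(fico_range_high))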


The average FICO score in the first period is the highest; however, the variation across periods is relatively small. Thus, we will explore the FICO scores as a combination of period and the respective US state in the next task.

Task: This task extends and continues our FICO score analysis. This time, we want to calculate the average FICO score not only as a function of time but also across different US states. We have already seen that the FICO score varies little over time, except for the first period. Consequently, we extend our previous code snippet and add state as an additional argument to the group_by() statement. Please return a table with the average FICO score depending on period and state, based on the variable fico_range_high.

library(dplyr)

# Uncomment and adapt the code below
# ??? %>%
#   group_by(???, ???) %>%
#   summarize(avg_fico = ???)

We can see that the additional breakdown changes the picture quite a bit: for the depicted sample of data, the FICO score seems to vary across states within a single period. The last task of this exercise will visualize this result in the form of a choropleth map, which will allow us to investigate whether the FICO score varies over time for different states.

Loan book

Task: We want to calculate two new variables called loan_book and Total. The variable loan_book should contain the accumulated loan_amnt for each combination of issue_d and status_group. The variable Total should sum up the loan_book for each month issue_d. We can express the loan book in USD million with a simple division by 1e+06. Please insert the following variables: loan_book, issue_d, status_group and loan_amnt.

The following table provides a short description about the respective variables:

issue_d: The month in which the loan was financed.
loan_status: The status of the loan.
status_group: The grouped version of loan_status. There are four possible status groups: 'Current & Issued', 'Late', 'Paid' and 'Default'. 'Late' comprises every loan which is in its grace period or late in payment.
loan_amnt: The amount of the loan the borrower applied for.


library(dplyr)

# Uncomment and adapt the code below
# data %>%
#   # Group our observations by date and status_group
#   group_by(???, ???) %>%
#   # Calculate the loan_book in Millions
#   summarize(loan_book = sum(???)/1e+06) %>%
#   # Add another column called Total as the sum of all loan books
#   # and assign the result to a variable called temp
#   mutate(Total = sum(???)) -> temp
#   # Return the results
#   temp

We can see that our resulting table is in a so-called long format, which does not describe our data parsimoniously. Let's explain this in more detail: there exist up to four rows for each month issue_d, as status_group comprises the levels 'Current & Issued', 'Late', 'Paid' and 'Default'. In the case of June 2007 ('2007-06-01'), the loan books are limited to the two terminal statuses, 'Default' and 'Paid'. Further note that the variable Total is the sum of the loan books per month.

A wide table, on the other hand, looks different. In a wide format, each row represents a month, so the column status_group is no longer necessary. Instead, there are five different columns related to our loan_book: the column Total remains as-is, while the other four columns represent the loan book for each status_group, e.g., loan_book_cndi, loan_book_late, loan_book_paid and loan_book_default.
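
As a toy illustration of the difference (the numbers below are made up and not part of data.rds), tidyr::spread() turns one row per month and status group into one row per month:

library(tidyr)

# Long format: one row per combination of month and status group
long <- data.frame(
  issue_d      = c("2007-06-01", "2007-06-01", "2007-07-01"),
  status_group = c("Default", "Paid", "Paid"),
  loan_book    = c(0.01, 0.05, 0.08)
)

# Wide format: one row per month, one column per status group
# (missing combinations become NA)
spread(long, key = status_group, value = loan_book)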

We have seen so far that we can quickly aggregate data to gain new insights. In more detail, we found out that FICO scores vary across states rather than over time. Furthermore, we calculated the monthly loan_book for every status_group.

Task: The following code chunk transforms our data set temp from a long to a wide table format. The function convert() is user-defined and prepares our existing data.frame to support a time series conversion, which the function dygraph() requires.

! addonquizUnderstanding dplyr and tidyr A

# This function converts our data set to support
# time series compatibility.
# By utilizing convert() we can rely on pipes (%>%) only
convert = function(.data, .what = ""){
  if(.what == "dygraph"){
    # Coerce to data.frame()
    .data <- as.data.frame(.data)
    # Assign the date as row name
    row.names(.data) <- .data$issue_d
    # Drop the previous date column
    .data$issue_d <- NULL
  }
  return(.data)
}

# Adjust our data set data
library(tidyr)

temp %>%
  # Spread our loan_book variable into four additional
  # columns, one per status group
  spread(key = status_group, value = loan_book) %>%
  # Now that every date is unique, assign issue_d as row.name,
  # coerce to data.frame, drop the issue_d column and assign 
  # the result to a variable called dat 
  convert("dygraph") -> dat
  # Return our fully transformed data set in a sensible order
  select(dat, `Current & Issued`, Late, Paid, Default, Total)

In contrast to prior results, all rows are now unique. We dropped the column status_group in favor of four new columns which hold the same information as the previous rows. To be more exact, we aggregated the loan amount per month for every status_group. As a result, we gained information about the loan_book divided by the different status groups per month.

! addonquizUnderstanding dplyr and tidyr B


For the remainder of this exercise, we will visualize our previous results in an interactive time series graph and a choropleth map with the help of the dygraphs and leaflet package.

Task: Utilizing our data.frame dat from the previous task, we create a bar plot which depicts the development of the loan book over time. Feel free to navigate specific time periods via mouse selection. To reset the zoom, please double-click.

library(dygraphs)

dygraph(dat, main = "Loan book [Million]") %>%
  dySeries("Total", label = "Total", color = "#0ac9ff") %>%
  dySeries("Current & Issued", label = "Current and issued loans", 
           color = "#FFC100") %>% 
  dySeries("Paid", label = "Paid loans", color = "#39d9db") %>% 
  dySeries("Default", label = "Defaulted Loans", color = "#980043") %>%
  dySeries("Late", label = "Late loans", color = "#ED5973") %>%
  dyLegend(width = 600) %>%
  dyHighlight(highlightCircleSize = 7, 
              highlightSeriesBackgroundAlpha = 0.3) %>%
  dyOptions(fillGraph = FALSE, 
            fillAlpha = 0.35,
            includeZero = TRUE, 
            gridLineColor = "lightgrey") %>%
  dyBarChart() %>%
  dyCSS("dygraph.css") 


Figure 2.2 It is easy to see that the loan book grew immensely over the last years and reached its peak in February 2016. Lending Club's decision to report quarterly seems appropriate, as the volume of the last quarter alone accounts for more than the whole time span between 2007 and 2013. Recently, almost all loans are either 'Current' or 'Issued'. In economic terms, this might be a risky development, as an upswing in 'Default' loans could easily tilt the average default rate across all loans for the worse.

Task: Because an observation of FICO scores per month made little sense, we omitted the FICO scores in the previous task. In this final task, however, we visualize either the FICO scores, the loan volume, the default rate or the number of loans issued, subject to the respective US states as well as the different periods. Remember that each period corresponds to a Lending Club data set available at https://www.lendingclub.com/.

Please specify either "fico", "loans", "default" or "loanbook" as additional parameter. The leaflet package renders a choropleth map which allows for an interactive display. Feel free to explore the different settings by specifying another parameter. To do so click edit, adjust the second argument in the choropleth() function and proceed with check.

# If you are interested in the function choropleth() please 
# have a look at complements.R script hosted on github
# https://github.com/rotterp/RTutorMachineLearningAndCreditDefault/

# Adapt the code below:
# Plug either "loans", "fico", "default" or "loanbook" 
# in choropleth(data, "???")

library(leaflet)
library(geojson)
library(geojsonio)

choropleth(data, "loanbook")


Depending on the parameter specified in choropleth(), the choropleth map and hence its interpretation changes. Let's look at an example: in the second quarter of 2017, in the state of Washington, the average FICO score was 701 with a total of 2,139 issued loans, which together account for a loan volume of USD 32.2 million. Moreover, 2.6% of the loans have already been paid back, and no loan has defaulted yet.

It is easy to see that the FICO score is relatively constant, hovering around 700. Almost every outlier (every color except yellow, if "fico" was specified as the second argument) seems to rest on a very small number of loans with borrowers having either a very high or a very low FICO score. This supports our previous finding: a lender at Lending Club can expect an average FICO score of around 700, which should indicate a chance of becoming delinquent in the future of approximately 8% (Experian, 2017). In contrast, our choropleth reveals that default rates were higher than suggested by Experian, especially in the 2007 to 2011 period. Beyond that, the default rates of the more recent periods, beginning in 2016, must be treated with care: since most loans are either 'Current' or 'Issued', the default rate is inaccurate whenever the sum of defaulted and paid loans does not add up to 100%.

This task concludes this exercise. We not only visualized our results, but we also discussed a variety of functions for data manipulation, which will be key components for the later stages of this problem set. For further information on dplyr and tidyr, Wickham (2015;2017) are recommended for further reading. Both books offer a comparable variety of fundamentals and frameworks to create lean and efficient code in R. To visualize your data with the help of plotly, dygraphs and leaflet, please visit htmlwidgets.org.

Exercise 2.2 -- Text mining

This exercise aims to provide both an introductory overview of text mining and a second approach to research our data. We will therefore rely on simple methods to derive economic patterns. More specifically, we will create a new subset of our data set data, which we will then utilize to determine borrowers' reasons for taking out a loan with Lending Club.

Many parts follow Silge and Robinson (2017), who provide an excellent entry into the world of text mining with R. They utilize tidytext in combination with widyr. Both packages support the previously introduced dplyr functions, which is a huge bonus, as we can stick to our current data structure.

Task: Let's first create our data set tm. There are three variables in data relevant for text mining: title, purpose and desc. The variables title and purpose are very similar, as most borrowers already stated their intentions in the title or chose one of the fixed categories offered in the loan application. The third variable desc offers borrowers an opportunity to describe their loans' characteristics further by adding comments.

data = readRDS("data.rds")

# Create the data set tm and return it
tm = select(data, title, purpose, desc)
tm

Neither title nor purpose seems to be mandatory for taking out a loan. Moreover, the description potentially holds precious information. Exploiting it, however, would not only involve numerous preparations regarding the data structure but also quite some economic thinking. Consider that one of our buzzwords is 'business'. Assume further that our findings support a strong correlation between the occurrence of the buzzword 'business' in the description and the probability of default. This would imply that a high number of borrowers using the word 'business' in their description are not creditworthy. Naturally, the question arises whether this connection is indeed causal. We could, for example, assume that most of these borrowers are business owners. That would mean that a lot of 'bad' business owners tend to finance their operations with peer-to-peer loans. Nonetheless, this is a strong assumption, and there could be other explanations. Maybe the term 'business' does not correspond to the word 'company' and refers to 'issues' instead, e.g., 'my daughter's creditworthiness was none of my business, yet I decided to refinance her credit anyway.' As these and other scenarios would require further investigation, we omit desc from our text mining approach.

There are many similarities between title and purpose. While some people might be inclined to state an equivalent title and purpose, others might choose a striking title, such as 'Bill Consolidation - Debt free by 2015!!!', in the hope of increasing their chances of taking out a loan. Another problem is upper- and lower-case characters or spelling mistakes and variations, e.g., 'personel' and 'personal'. There are quick approaches to account for these circumstances, for example casting everything to lower case (sketched below) or utilizing a reference dictionary to compare and summarize findings. Consequently, we keep title as a variable. Finally, the column purpose has a very convenient feature, as a borrower must choose from 14 fixed categories. As a result, we know precisely which values to expect, which renders this variable ideal for our initial mining approach.
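
A sketch of such a lower-case normalization, which is not required for the analysis below, could look like this:

library(dplyr)

tm %>%
  # Cast all loan titles to lower case so that, e.g.,
  # 'Debt Consolidation' and 'debt consolidation' coincide
  mutate(title = tolower(title))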

Task: The following code chunk returns all 14 categories and their relative frequency.

tm %>%
  # Count the loans for each purpose
  count(purpose) %>%
  # Calculate the relative frequency
  mutate(frac = round(n / sum(n),2)) %>%
  # Order the table decreasingly by their fraction
  arrange(-frac) %>%
  # Factorize purpose with the according levels
  mutate(purpose = factor(purpose, levels = purpose)) %>%
  # Drop categories whose rounded share is zero and assign
  # the result to a variable called dat
  filter(frac != 0) -> dat

# Plot the results
ggplot(dat, aes(x = purpose,
                y = frac)) +
  geom_bar(stat="identity", position = position_dodge(),
           fill = c("#FFC100", "#463296", "#FA8C2D", "#460564", "#0072B2", 
                    "#6E1E7D", "#CC79A7", "#ED5973", "#39d9db", "#980043")) + 
  theme(legend.title = element_blank()) +
  # Tilt the abscissa labels
  theme(axis.text.x=element_text(angle = 90, hjust = .5)) +
  # Labels
  labs(x = "Loan purpose", y = "Relative frequency") -> g
  ggplotly(g)


Figure 2.3 The terms 'debt_consolidation' and 'credit_card' appear most frequently, as together they account for roughly 80% of all loans. While 'debt_consolidation' is rather easy to understand and broadly applicable, other categories may not be. For example, what does the term 'credit_card' imply? We can try to answer this question by incorporating more data into our analysis. Perhaps including the loan title will allow us to gather further information.

Task: The following code chunk creates a word cloud based on the loan titles. We utilize the familiar dplyr pipe chain and proceed line by line: we start with our data set tm. The function unnest_tokens(words, title) separates a loan title into its individual words. For example, our previous title 'Bill Consolidation - Debt free by 2015!!!' results in 6 words: 'bill', 'consolidation', 'debt', 'free', 'by' and '2015'. As not all of these words are relevant for our analysis, the call anti_join(stop_words) utilizes a predefined dictionary to exclude filler words, in our example 'by'. Other obvious candidates are commonly used words such as 'as', 'a', 'actually' and so on. The function count(words) calculates the frequency of the remaining terms: if, for example, 'consolidation' is mentioned in 500,000 loan titles, count(words) returns this value. Finally, the function wordcloud() employs the previously aggregated information to visualize our findings.

library(wordcloud)
library(tidytext) # provides unnest_tokens() and stop_words

tm %>%
  # Split the loan title in its components and
  # assign the resulting terms to a variable
  # called words 
  unnest_tokens(words, title) %>%
  # Remove filler words such as 'by', 'a', ...
  anti_join(stop_words, c("words"="word")) %>%
  # Count the absolute frequency
  count(words) %>%
  # Visualize the results
  with(wordcloud(words, n,
                 max.words=175, random.order=FALSE, 
                 rot.per=0.35, scale = c(3,.5),
                 colors = c("#e05971", "#dc415d", 
                            "#d82949", "#d41235")))

Apart from the four terms suggested by the two dominant purposes, a fifth term comes up: 'refinancing'. This means we have five pivotal terms in total: 'credit', 'card', 'refinancing', 'debt' and 'consolidation'.

! addonquizUnderstanding word clouds


As we saw, the majority of borrowers state either 'debt' or 'consolidation' as their reason for taking out a loan. Moreover, 'credit', 'card' and presumably 'refinancing' are stated in roughly a fifth of the cases. We already know from the frequency chart that 'debt' and 'consolidation' belong together, as 'debt_consolidation' accounts for the lion's share of the purposes stated for taking out a loan. It is likely that the other three terms represent 'credit card refinancing' as well. If this were the case, these two phrases alone would account for about 80% of the reasons stated by borrowers. Nevertheless, this only holds true if 'debt consolidation' and 'credit card refinancing' are independent of each other, because a person could take out a loan for both debt consolidation in general and credit card refinancing at the same time.

One of the more common approaches to check for the association between words is to convert our current data structure into a so-called term-document matrix. This would, however, require additional packages, for example the text mining package RWeka. Unfortunately, RWeka requires a running Java installation. Hence, we proceed with an alternative that keeps our present data structure. The widyr package provides a function called pairwise_cor(), which calculates the correlation of words with respect to a feature column. As pairwise_cor() is computationally very expensive, we draw a sample of 25,000 observations from our whole data set tm.

Task: The following code chunk creates a sample data set tm_part based on 25,000 observations. In R, random numbers are generated using a seed, which by default depends on the current system time. Consequently, the function with.random.seed() from the RTutor package is applied to ensure the same data set is drawn each time the code chunk is executed. We then proceed with our reduced data sample and divide it into small sections of five loans each. Afterwards, we utilize unnest_tokens(words, title) and filter out filler words. Finally, we assign the resulting data set to a variable called tm_part and return the first six observations.

library(widyr)

# Draw a sample of 25,000 loans
with.random.seed(sample_n(tm, 25000), seed = 5690) %>%
  # Group consecutive loans into small sections
  # (five rows each) to compare to each other
  mutate(part = row_number() %/% 5) %>%
  # Drop the first, incomplete section
  filter(part > 0) %>%
  # Split up the title into its components
  unnest_tokens(words, title) %>%
  # Remove filler words
  filter(!words %in% stop_words$word) -> tm_part

head(tm_part)

As we can see, the overall structure remains the same except for an extra column called part, which divides our data set tm_part into small sections. This column represents the feature column and is utilized to calculate the correlation between words across these sections.

Task: The following code chunk utilizes our reduced data set tm_part to calculate the correlation between words with respect to the sections defined by part. To be more specific, we start by grouping our words and filtering out those which occur less than 25 times. Afterwards, we call the function pairwise_cor() to determine the correlation between the remaining terms, assign the result to a variable called tm_cor and return the first ten observations.

tm_part %>%
  group_by(words) %>%
  # Remove occasionally appearing words
  filter(n() >= 25) %>%
  # Calculate the pairwise correlation between words
  pairwise_cor(words, part, sort = TRUE) -> tm_cor

# Return the first ten observations
head(tm_cor, 10)

Our result reveals that the correlation between 'debt' and 'consolidation', 'credit' and 'card', as well as 'card' and 'refinancing' is very high. To finish this exercise, please investigate tm_cor and try to solve the following quiz.

Task: To solve the quiz below, feel free to utilize this code chunk. The hint tab provides additional help.

# Enter your code here


! addonquizInterpreting results and careful reading


As expected, the correlation between the three terms 'credit', 'card' and 'refinancing' is indeed very high, while, for example, the correlation between 'debt' and 'refinancing' or between 'consolidation' and 'card' is negative. This supports our previous hypothesis: the title of nearly 80% of the awarded loans serves a purpose of either 'debt consolidation' or 'credit card refinancing'. One must wonder: are the chances of taking out a loan higher when utilizing these buzzwords, or is it simply because Lending Club advertises both credit card payoff and debt consolidation directly on its website?

The idea behind these two financial products is comparable, as both involve taking out a new credit to pay off old liabilities. While the new loan is a compound liability and therefore a larger debt to pay off, its terms are usually assumed to be preferable, since we expect a borrower to take out a new loan only if the terms are better than before. Nevertheless, not everything that glitters is gold, as Clements (2017) points out. Even though there is no prepayment penalty in place, most lenders claim a non-refundable origination fee in advance.

In general, text mining is very powerful when a plethora of textual content must be analyzed. It is a fast way not only to derive knowledge but also to bring order to data. In our case, each observation is linked to a single loan. However, most situations are not this straightforward. We could, for example, assume a variety of different documents related to loans, e.g., various application forms or additional records regarding an applicant's income source. In this case, machine learning algorithms in conjunction with text mining could be used to cluster documents according to specific characteristics. Consequently, the following chapters shift our emphasis to introducing statistical learning and its possibilities in detail.

Exercise 3 -- Statistical learning

This chapter provides an introduction to statistical learning. The first subchapter presents a variety of terms and quantities, which we will refer to in exercise 4. The second subchapter focuses on model accuracy and sheds light on the bias-variance trade-off.

The common theme of exercises 3 and 4 is inspired by James et al. (2017) as well as Dalpiaz (2017) in conjunction with Hastie, Friedman and Tibshirani (2017). Beyond that, a plethora of alternative and further literature has been incorporated.

Exercise 3.1 -- Definition and area of application

Definition

In general, machine learning is a component of statistical learning. James et al. (2017, p.1) state that "Statistical learning refers to a vast set of tools for understanding data. These tools can be classified as supervised or unsupervised." The former allows observing a response, while the latter does not. Recall our previous text mining approach: we had a limited set of three columns, title, purpose and desc, so we knew the origin of the data before our analysis. However, such information is not always at hand. Imagine a scenario with an abundance of different documents, e.g., invoices, inquiries and orders. Further assume that all documents are part of a single data structure and we cannot distinguish between them. In this case, an unsupervised algorithm can be used to cluster the documents: even though the algorithm does not know which document belongs to which class, it can conclude that document (x) is very different from an invoice or an order, which means that (x) is likely an inquiry. Following this line of thought, the algorithm tries to segment the available data into groups with similar features.
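
As a minimal sketch of this idea (with made-up numeric features instead of real documents), k-means groups observations without ever seeing a label:

# Two made-up features describing each 'document'
set.seed(5690)
docs <- data.frame(n_words   = c(rnorm(20, 100, 10), rnorm(20, 400, 30)),
                   n_numbers = c(rnorm(20, 30, 5),   rnorm(20, 5, 2)))

# Segment the documents into two groups with similar features
kmeans(docs, centers = 2)$cluster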

For a supervised algorithm, the supplied observations must be labeled in advance. This means that both the response and the explanatory variables are known. Applied to the previous example, the information about which document is an order and which is an invoice is available. Thus, the algorithm fits a model based on the supplied data and extrapolates. A well-known example of a supervised algorithm is linear regression.

Additionally, a hybrid approach exists. The so-called semi-supervised algorithm is a supervised algorithm with user input. Such an algorithm might, for example, ask whether a single observation that is difficult to classify is indeed an invoice. That information must be supplied by an external user and can then be employed by the algorithm to enhance its results. An interesting application of semi-supervised algorithms is discussed in Bair and Tibshirani (2004), who present a semi-supervised learning algorithm to predict a patient's survival from the genetic profiles of tumors. Moreover, they elaborate on the drawbacks and limitations of supervised learning methods.

Area of application

There are almost no limits to the fields, quantitative or qualitative, in which learning algorithms are applicable. Nonetheless, we can narrow their use down to two core goals: prediction and inference. Prediction emphasizes forecasting future outcomes based on a set of present observations. Inference, by contrast, is mainly concerned with explaining the relationship between our variables and maximizing their interpretability. As we are focusing on the prediction of credit default, the focal point of this thesis is prediction; thus, interpretability is not our primary concern, and we may even skip a thorough discussion of the relationships between our variables.

Prediction

We can illustrate the prediction steps in mathematical terms similar to James et al. (2017).

Assume an output variable (Y) and a matrix of input variables (X). We want to use (X) to predict (Y). We denote our prediction with a caret, meaning (\hat{Y}) is the prediction of (Y) based on our set of predictors (X = X_1, X_2, \dots, X_k). Each (X_k) corresponds to a different column or variable and is a vector of its realizations (x_1, x_2, \dots, x_T), where (T) denotes the number of available observations. Always keep in mind that we only hold a sample of (X) at our disposal. To put it differently, if we want to predict default rates, we lack knowledge of the actual effects at work. Consequently, our prediction of (Y) will never be perfect. That is why (\hat{Y}) is a function (\hat{f}(X)), and the objective is to fit a function which delivers a sound prediction.

If we assume a linear relationship between our output variable (Y) and our independent variable (X), we can denote our prediction (\hat{Y}) as follows:

[ \begin{eqnarray} Y &=& f(X) + \varepsilon \\ \hat{Y} &=& \hat{f}(X) = \hat{\beta}_0 + \hat{\beta}_1 X \end{eqnarray} ]

The term (\varepsilon) is referred to as the error term or residual. The residuals (\varepsilon), however, are not part of our prediction. For a linear function (f), we assume the error term to be independent and identically Gaussian distributed with zero mean.

As the underlying function is of central interest, ideally several candidate functions (f) are fitted to the present observations to achieve a high prediction accuracy. Thus, we will fit different models to maximize prediction accuracy in exercise 4.

Moreover, it is crucial to note that the metric used to compare models is prediction accuracy. While this statement sounds logical, it is difficult to grasp, because the true relationship between our variables, and hence the function (f), is not an actual point of concern. We are interested in which model - all trained with the available data - predicts the probability of default best. James et al. (2017, p.17) refer to this behaviour as a "black box". The next subchapter presents the mean squared error (MSE) as one possible measure, while chapter 4 introduces the area under the curve (AUC).

Estimating a function f

It is essential to understand that, while most machine learning algorithms tend to deliver good prediction accuracy, one must always weigh costs against performance. For example, a complex linear model with all variables and their interactions takes much longer to fit than a linear regression with a single predictor, while the gain in prediction accuracy might be modest (see the sketch below). Hence, prediction accuracy and performance are two important properties to keep in mind.
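
The cost difference is easy to try out with R's formula interface; the toy data below is arbitrary and only serves to compare the two fits:

# Toy data: one response and 19 predictors
set.seed(5690)
toy <- as.data.frame(matrix(rnorm(2000 * 20), ncol = 20))
names(toy) <- c("y", paste0("x", 1:19))

# Single-predictor regression
system.time(lm(y ~ x1, data = toy))
# All predictors plus their pairwise interactions
system.time(lm(y ~ .^2, data = toy))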

Besides, there is no such thing as 'the model', as the following example illustrates:

# Generate a random data set utilizing genr("random")
genr = function(.what = ""){
  #-- a part of this function has been omitted and is presented later --#
  if(.what == "random"){
    .data <- data.frame(x = 1:40*10 + rnorm(40,sd=10),
                        y = (1:40)^2 + rnorm(40,sd=150))
  }
  return(.data)
}

# Plot the observations
ggplot(with.random.seed(genr("random"), seed = 5690)) + 
  geom_point(aes(x=x, y=y), size = 3, color = "#34d1a5") +
  geom_smooth(aes(x=x, y=y), method=lm, 
              formula = y ~ poly(x,1), se=FALSE, color = "#d10e89") + 
  geom_smooth(aes(x=x, y=y), method=lm, 
              formula = y ~ poly(x,2), se=FALSE, color = "#FFC100") +
  labs(x = "X", y = "Y")  -> g
  ggplotly(g, tooltip = c("y","x"))


Figure 3.1 The function genr("random") generates the numbers 1 to 40, multiplied by 10, plus a random Gaussian disturbance with zero mean and a standard deviation of 10 as x-coordinates. The associated y-coordinates are the numbers 1 to 40 squared, with a larger random mark-up. Afterwards, a linear as well as a quadratic model is fitted to the observations. We can see that, even though the data generating process is quadratic, the linear model seems to provide a good approximation, too.

James et al. (2017, pp.21-24) describe two general approaches to estimating a function. The first is a parametric model, which is the more common approach; it relies on a fixed set of parameters to define the shape of our function, e.g., a linear relationship between default rate and loan amount. The second approach, the non-parametric method, does not make such strict assumptions. Instead, the non-parametric method tries to fit the observations as precisely as possible.

The following code chunk plots a spline as well as a regression line. It is easy to see that the non-parametric spline fits the data much more closely than the parametric linear regression, which has advantages and disadvantages, as we will see in the following subchapter.

Regarding figure 3.2, feel free to adjust the number of interpolations by changing the parameter value of the variable n.

# Generate a random data set utilizing genr("spline")
genr = function(.what = ""){

  if(.what == "random"){
    .data <- data.frame(x = 1:40*10 + rnorm(n = 40, sd = 10),
                        y = (1:40)^2 + rnorm(n = 40, sd = 150))
  }

  if(.what == "spline"){
    # x-coordinates are the numbers one to ten
    # y-coordinates are ten values randomly sampled from the
    # forty x-coordinates generated by genr("random")
    .data <- data.frame(x=1:10, y=sample(genr("random")$x,10))
  }
  return(.data)
}  

# Assign the resulting data set to a variable called dat
dat <- genr("spline")
# Define the level of interpolations n
n <- 100

# Plot the resulting data set
ggplot(dat, aes(x,y)) +
  geom_point(size = 4, color = "#34d1a5") +
  geom_smooth(aes(x=x, y=y), method=lm, formula = y ~ poly(x,1), 
              se=FALSE, color = "#d10e89") + 
  geom_line(data=data.frame(spline(dat, n = n)), color = "#3DB7B8") +
  labs(x = "Input variable X", y = "Y") -> g
  ggplotly(g, tooltip = c("y","x"))


Figure 3.2 While the abscissa corresponds to the numbers 1 to 10, the y-coordinates are based on the forty observations underlying Figure 3.1. To be more exact, a random sample of ten numbers is drawn from the x-coordinates of Figure 3.1; thus, the resulting plot of figure 3.2 will differ every time the code is executed. You can verify this by clicking edit and check again to create a new interactive figure. The straight red line corresponds to the linear fit, while the green line represents the non-parametric spline. Both models differ with regard to two key characteristics, variance and bias, which will be examined in the next exercise.

info("spline()") # Run this line (Strg-Enter) to show info

While semi-parametric methods are certainly interesting, we must omit a further discussion due to the scope of this thesis. Nevertheless, Ruppert, Wand and Carroll (2010, pp.161-237) are recommended for further reading.

Regression and classification

According to James et al. (2017), quantitative problems are often referred to as regression problems, as opposed to qualitative or categorical problems, which are associated with classification. Even though this separation makes sense, they further note that the borders become blurred. A good example is linear versus logistic regression: while the former is quantitative, the latter is a well-known method to analyze a binary problem.

In the case of predicting credit default, the distinction between regression and classification is clear. We are facing a binary problem with two credit states: 'Default' and 'Paid'. Thus, we will focus on logistic rather than linear regression models. We could also reason that we are predicting probabilities of default; consequently, our output may only take values between 0 and 1. A classic linear regression does not meet this criterion, as we can see in the following figure:

# Generate the data set
# Our x-coordinates are 31 observations  in total ranging 
# from -5 to 5 in 0.33 increments:
# -5, -4.67, -4.34, ..., 4.57, 4.90
x = seq(-5, 5, 0.33)
# Our y-coordinates correspond to 'Paid' and 'Default' loans and 
# are decoded as a sequence of zeros and ones 
# They have been manually adjusted to split our observations 
# with three outliers
y = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 
      0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
# Merge both coordinates into a single data.frame called dat
dat = data.frame(x = x, y = y)

# Plot the observations
ggplot(dat) +
  geom_point(aes(x,y=y), color = "#ff0080", size = 3) + 
  # Fit a linear model to our observations
  geom_smooth(aes(x, y), method=lm, formula = 
                y ~ poly(x,1), se=FALSE, size = 2, color = "#2200FF") +
  # Calculate values based on the logistic function
  stat_function(fun = function(x){exp(x)/(1+exp(x))}, geom="line",
                size = 2, color="#12C7A6") +
  # Add two horizontal lines to illustrate the boundary concerning
  # the probability
  geom_hline(yintercept = 0:1, linetype="dotted") +
  labs(x = "Independent variable X", y = "Probability of Default") -> g
  ggplotly(g, tooltip = NULL)


Figure 3.3 This figure is based on 31 manually generated observations ranging from -5 to 5 on the abscissa. Each of these observations is classified as 'Default' or 'Paid' on the ordinate. The green curve is generated using the logistic model, which will be introduced in exercise 4.1. It illustrates the classic logistic s-shaped regression curve. The straight blue line, on the other hand, is a linear fit of our red observations.

We can see that a standard linear regression is not fit for the job, as its predicted probability falls below our lower boundary of 0 and exceeds the upper boundary of 1. The logistic regression curve, on the other hand, stays within this range and remains a possible solution to this problem. Thus, we will employ the logistic model in exercise 4, in accordance with James et al. (2017).

For the remainder of exercise 3, we will discuss the accuracy of our predictions. To be more exact, we will first come back to the linear model to discuss a model's prediction accuracy, before subchapter 3.2 illustrates the bias-variance trade-off.

It is, however, a key takeaway that both logistic and linear regression have their place and must be applied according to their strengths. Linear regression is undoubtedly the simplest and by far the most commonly known model in statistics. Therefore, we will rely on it to illustrate the bias-variance trade-off, which is prevalent in all models, including classification models. The logistic model, in contrast, is the natural choice for predicting binary states, such as credit default in exercise 4.

Exercise 3.2 -- Model accuracy

Model accuracy

As already discussed, there is no single best model. Thus, with a variety of different models at our disposal, assessing model accuracy is a vital part of machine learning.

While the visual approaches from the previous subchapter may occasionally suffice to determine whether a model fits the data, a more reliable method is preferable. One of the most well-known quantities is the mean squared error, a quantitative measure to evaluate the quality of the fit:

[MSE = \frac{1}{T} \sum_{t=1}^T \big(y_t - \hat{f}(x_t)\big)^2]

Again, (T) corresponds to the number of observations, while (x_t) denotes a single observation. This measure will be small if the difference between the observed value (y_t) and the predicted value (\hat{f}(x_t)) is small, and large otherwise. Even though the mean squared error is one of the most commonly known measures and broadly used in econometrics, it must be treated with care.

We aim to train a model to predict default rates. Training is a frequently used term in machine learning: we first divide the whole data set into two samples. One of them is referred to as the training data set and comprises the observations used to teach our model how to estimate a function (f). The other is referred to as the test data set and is used to assess the out-of-sample prediction quality. This procedure has an intriguing implication: since the model is trained on a different excerpt of the data, a low mean squared error on the training set does not automatically entail a good performance on the test data. Consequently, as the test data simulates future observations, one should always use the test data to derive statements about the fit. Regarding the comparison of model accuracy, the model with the lower mean squared error on the test data is usually considered best (James et al., 2017).
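
To make the train/test idea concrete, here is a minimal sketch on simulated toy data (not the Lending Club data); the 70/30 split ratio and the data generating process are arbitrary choices:

# Simulate a small toy data set
set.seed(5690)
toy <- data.frame(x = runif(200, 0, 30))
toy$y <- 2 * toy$x + 5 * toy$x^2 + rnorm(200, sd = 300)

# Split the data into a training and a test sample
train_idx <- sample(seq_len(nrow(toy)), size = 0.7 * nrow(toy))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit a simple linear model on the training data only
fit <- lm(y ~ x, data = train)

# Compare in-sample and out-of-sample mean squared error
mse <- function(actual, predicted) mean((actual - predicted)^2)
mse(train$y, predict(fit, train))  # training MSE
mse(test$y,  predict(fit, test))   # test MSE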

Application

We can define the average squared prediction error of the test data as follows:

[MSE = E\Big[\big(f(x_{test}) - \hat{f}(x_{test})\big)^2\Big]]

In this formula, (x_{test}) corresponds to the observations from the test data set. Utilizing this formula as a starting point, James et al. (2017) decompose the mean squared error into the variance of our prediction (Var(\hat{f}(x))), the squared bias of our prediction ([Bias(\hat{f}(x))]^2) and the variance of our residuals (Var(\varepsilon)):

[\begin{eqnarray} E\Big[\big(y_{test} - \hat{f}(x_{test})\big)^2\Big] &=& Var\big(\hat{f}(x_{test})\big) + \Big[E\big[\hat{f}(x_{test})\big] - f(x_{test})\Big]^2 + Var\big(\varepsilon\big) \\ &=& Var\big(\hat{f}(x_{test})\big) + \Big[Bias\big(\hat{f}(x_{test})\big)\Big]^2 + Var\big(\varepsilon\big) \end{eqnarray} ]

With the previous equation in mind, we can conduct our own small simulation study to verify the decomposition. Our approach will be similar to the one of Dalpiaz (2017, ch. 8).

Task: We start by simulating our sample data set. The following code defines a function (f(x) = 2x + 5x^2 - 0.3x^3), which equals the true regression function. We further define a vector of independent variables (X) and a zero-mean error term (\varepsilon). To fully establish our data generating process, we specify our vector of observations (Y), too.

Analogous to exercise 2.2, the code utilizes a fixed seed. If we rerun the code chunk, the results will therefore be identical.

# Define a random seed
seed <- 5690

# Define our true regression function f
f = function(x){
  2*x + 5*x^2 - 0.3*x^3
}

# Generate the vector of independent variables x
X = with.random.seed(runif(n = 25, min = 0, max = 30), seed)

# Generate the error term
eps = with.random.seed(rnorm(n = 25, mean = 0, sd = 500), seed)

# Generate the vector of the response variable y
Y = f(X) + eps

# Assign both x and y to a data.frame
dat = data.frame(X = X, Y = Y)

Task: As the shape of the function is difficult to imagine, we plot our observations Y and our true regression function f against X.

# Plot our observations and our true regression function f
ggplot(dat) +
  geom_point(aes(X,Y) ,size = 4, color = "#ff0080") +
  stat_function(fun=f, 
    geom="line", size = 2, color="#12C7A6") +
  labs(x = "X", y = "Y") +
  theme(legend.position = "none") -> g 
  ggplotly(g)


Figure 3.4 We can see that there is quite some variation in our observations. The green line corresponds to the true regression function f. The red points represent the observations of the response variable Y. Please keep in mind that we would not be able to draw this comparison in a real-world scenario, as we would not know the underlying function f.

Our next step is to regress our response variable Y on our predictors X utilizing the ordinary least squares approach (OLS). Please see the info box below for instructions.

info("Linear regression and other useful features in R") # Run this line (Strg-Enter) to show info

Task: Perform a linear regression with our response variable Y and explanatory variables X stored in dat. Assign the resulting model to f_hat.linear.

# Enter your code below
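
In case you get stuck, one possible solution might look like this (the variable name f_hat.linear is prescribed by the task):

# One possible solution (sketch)
f_hat.linear = lm(Y ~ X, data = dat)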

Task: We can return summary statistics with the help of the summary() function.

summary(f_hat.linear)

The p-values reveal that both the intercept and the coefficient of X are significant at the 10 percent level. Unfortunately, the coefficient of determination R-squared is very low, which indicates that our independent variable does not explain the variation in our dependent variable Y very well. Recall that our true regression function f is a polynomial, and thus a straight line might not provide a decent fit.

Task: The following plot depicts our linear model f_hat.linear, the true regression function f and the observations utilized to train the model.

ggplot(dat) +
  geom_point(aes(X,Y) ,size = 4, color = "#ff0080") +
  stat_function(fun=f, 
    geom="line", size = 2, color="#12C7A6") +
  geom_smooth(aes(X, Y), method=lm, se=FALSE, color = "#FFC100") +
  labs(x = "X", y = "Y") + 
  theme(legend.position = "none")  -> g 
  ggplotly(g)


Figure 3.5 The resulting fit is off: the estimated linear relationship between our response variable Y and our predictor X is an excessively simplified version of the truth.

Task: With our trained model f_hat.linear in place, we can try to predict other values. We generate a single test observation (x_{test} = 31) and assign it to a variable called test. Please adapt the code chunk below to predict (\hat{f}(31)) utilizing f_hat.linear and assign the resulting prediction to a variable called pred_linear. Afterwards calculate the difference between the true value ({f}(31)) and our predicted value (\hat{f}(31)).

# Uncomment and adapt the code below

# Assign a test observation to a variable called test
# test = data.frame(X = ???)

# Predict the response utilizing the linear model
# ??? = predict(???, ???)

# Return the difference between true value and the prediction
# ???
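
A sketch of one possible solution (the names test and pred_linear are prescribed by the task):

# Assign a test observation to a variable called test
test = data.frame(X = 31)

# Predict the response utilizing the linear model
pred_linear = predict(f_hat.linear, test)

# Return the difference between true value and the prediction
f(31) - pred_linear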

Our result is quite striking, as our prediction is roughly 2250 units off. This may seem like a lot, yet a single prediction is not enough to adequately assess our model's quality. We will get a more precise picture if we repeat the same prediction multiple times. Therefore, we implement a function which generates a new data set each time it is called. We can then utilize those data sets to predict (\hat{f}(31)) with many different models. This approach allows for variation and is less restrictive than a single estimation.

Task: Please adapt the function draw_random_y() below. The following description provides supplementary information to solve the task:

  • Generate a total of (n = 25) new observations in each sample.
  • Distribute the vector of independent variables (X) uniformly between 0 and 30.
  • Generate a vector of residuals with mean 0 and a standard deviation of 500.

# Uncomment and adapt the code below
# n = ???

# draw_random_y = function(){
#   X = runif(n = n, min = ???, max = ???)
#   eps = rnorm(n = n, mean = ???, sd = ???)
#   Y = ???
#
#   dat = data.frame(X = X, Y = Y)
# }
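
A sketch of one possible solution, filling in the template above:

# One possible solution (sketch)
n = 25

draw_random_y = function(){
  X = runif(n = n, min = 0, max = 30)
  eps = rnorm(n = n, mean = 0, sd = 500)
  Y = f(X) + eps

  dat = data.frame(X = X, Y = Y)
}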

Task: With the recently implemented function draw_random_y(), we can proceed to perform a simulation study: the following code creates a data.frame called pred with 250 placeholder predictions. More precisely, the function simulation_study() calls draw_random_y() 250 times and applies the same steps as previously to each data set. The result is the data.frame pred, which summarizes the 250 independent predictions for the test observation (x_{test} = 31).

# Create an empty data.frame with the two columns model and prediction
# Set model to "linear" and initialize each prediction with a value 
# of 0
pred <- data.frame(model = rep("linear", 250), prediction = 0)

# Define a new function called simulation_study() which calls 
# draw_random_y() and predicts a value for test for
# each data set
simulation_study = function(){
    .dat = draw_random_y()
    predict(lm(Y ~ X, data = .dat), test)
}

# Apply the function for each model separately inside
# with.random.seed() to ensure consistency
# Reminder: All data sets are distinct
with.random.seed(seed = 5690,
  pred %>%
      rowwise() %>%
      mutate(prediction = simulation_study()) -> pred
)

As we can see, the predicted values differ quite a bit across the 250 fitted linear models, which is due to the different 'random' data sets employed to determine our estimates.

Task: With pred, we can proceed to calculate and decompose the mean squared error into squared bias and variance. Please estimate these quantities in accordance with the subsequent formulae described in Dalpiaz (2017, ch. 8.3): [ \widehat{\textrm{MSE}} = E\Big[\big(f(x_{test})-\hat{f}(x_{test})\big)^2\Big]]
[ \widehat{\textrm{Bias}} = E\Big[\hat{f}(x_{test})\Big]-f(x_{test})]
[ \widehat{\textrm{Var}} = E\Bigg[\Big(\hat{f}(x_{test}) - E\big[\hat{f}(x_{test})\big]\Big)^2\Bigg]]

Reminder: Since we compare our predictions to the noiseless true value (f(x_{test})), the variance of our residuals (Var(\varepsilon)) does not enter this estimated decomposition.

# Please enter your code below

# Estimate the mean squared error

# Estimate the variance

# Estimate the bias

# Calculate the bias^2

# Calculate the sum of bias^2 and variance
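
A sketch of one possible solution (using pred from the previous chunk and the test observation in test):

# Estimate the mean squared error
mse = mean((f(test$X) - pred$prediction)^2)

# Estimate the variance
variance = mean((pred$prediction - mean(pred$prediction))^2)

# Estimate the bias
bias = mean(pred$prediction) - f(test$X)

# Calculate the bias^2
bias^2

# Calculate the sum of bias^2 and variance
bias^2 + variance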

The decomposition holds true. The sum of squared bias and variance equals our mean squared error. Nevertheless, our mean squared error seems high. At this point, it might be intriguing to know how our linear model performs in comparison to other models. Hence, we will create a new data set pred with predictions for three additional models: polynomials with a degree of 0, 2 and 7. That results in four different polynomials: a constant model (degree 0), our previous linear model (degree 1), and a quadratic and a septic model (degrees 2 and 7, respectively).

Task: This time, the code chunk creates a new data set pred comprising four different polynomial models with 250 predictions each.

# Create an empty data.frame with two columns: model and prediction
# Other than previously, we create 250 observations per model,
# starting with the linear model to ensure consistency across tasks
pred = data.frame(model = c(rep("linear", 250), 
                            rep(c("constant", "quadratic", "septic"), 250)), 
                  prediction = 0)

# Extend our previous simulation_study function
simulation_study_extended = function(model = NULL){

  .dat = draw_random_y()

  # case_when follows a simple if this then that (IFTTT) logic:
  # if the parameter model corresponds to "linear", the linear model
  # is fitted and provides the prediction
  case_when(model == "linear" ~ predict(lm(Y~X,data=.dat),test),
            model == "constant" ~ predict(lm(Y~1,data=.dat),test),
            model == "quadratic" ~ predict(lm(Y~poly(X,2),data=.dat),test),
            model == "septic" ~ predict(lm(Y~poly(X,7),data=.dat),test))
}

# Apply the function for each row separately inside 
# with.random.seed() 
with.random.seed(seed = 5690,
  pred %>%
      rowwise() %>%
      mutate(prediction = simulation_study_extended(model)) -> pred
)

The previous code chunk extends pred by 750 additional predictions for the other models. The next interactive figure depicts the different models based on the initial data set dat. We will briefly discuss these models before shedding light on their bias-variance trade-off.

library(ggplot2)
library(plotly)

# Create the four plots p1 to p4
p1 <- ggplot(dat) + 
      # Add the observations in dat
      geom_point(aes(X,Y), size = 2, color = "#ff0080") +
      # Add the true regression function f
      stat_function(fun=f, geom="line", size = 2, color="#12C7A6") +
      # Add the linear model
      # Note: geom_smooth() expects the formula in terms of the aesthetics
      # x and y (lower case)
      geom_smooth(aes(X, Y), method=lm, formula = y ~ 1, 
                  se=FALSE, color = "#bf00ff") +
      # Adjust the axis labels
      labs(x = "X", y = "Y") +
      # Remove the legend
      theme(legend.position = "none")
p2 <- ggplot(dat) + 
      geom_point(aes(X,Y), size = 2, color = "#ff0080") +
      stat_function(fun=f, geom="line", size = 2, color="#12C7A6") +
      geom_smooth(aes(X, Y), method=lm, se=FALSE, color = "#FFC100") +
      labs(x = "X", y = "Y") + 
      theme(legend.position = "none") 
p3 <- ggplot(dat) +
      geom_point(aes(X,Y), size = 2, color = "#ff0080") +
      stat_function(fun=f, geom="line", size = 2, color="#12C7A6") +
      geom_smooth(aes(x=X,y=Y), method=lm, formula = y ~ poly(x,2), 
                  se=FALSE, color = "#cc3300") +
      labs(x = "X", y = "Y") +
      theme(legend.position = "none") 
p4 <- ggplot(dat) +
      geom_point(aes(X,Y), size = 2, color = "#ff0080") +
      stat_function(fun=f, geom="line", size = 2, color="#12C7A6") +
      geom_smooth(aes(x=X,y=Y), method=lm, formula = y ~ poly(x,7), 
                  se=FALSE, color = "#483699") +
      labs(x = "X", y = "Y") +
      theme(legend.position = "none")

# Combine the four plots
subplot(p1, p2, p3, p4, nrows = 2, 
        heights = c(0.5, 0.5), titleX = TRUE, titleY = TRUE) -> .p


Figure 3.6 The purple horizontal line corresponds to the constant model. It oversimplifies the true regression function in green and does not explain the observations very well either. As we already know, the yellow straight line does not fit the true regression line either. However, it fits the observations better than the horizontal line, which is expected behavior: the more flexible a model, the closer the fit. The quadratic model in red seems to fit the training data adequately without oversimplifying. Lastly, the dark blue septic model appears to fit the observations too closely.

It should become apparent from the recent plot that none of our models fits our true regression function (f) perfectly. However, we must keep in mind that our sample size of 25 observations is small and has been compiled solely for illustration. In general, we can expect a model to fit better with more observations. Nonetheless, it is not only about the fit: if the model mimics the observations too closely, the resulting model may perform poorly on future predictions, too. This issue is captured by the so-called bias-variance trade-off, which we will discuss now.

Bias-variance trade-off

Recall the formula of the mean squared error:

[MSE = Var\big(\hat{f}(x_{test})\big) + \Big[E\big[\hat{f}(x_{test})\big] - f(x_{test})\Big]^2 + Var\big(\varepsilon\big)]

So far, we have calculated the mean squared error, the bias, and the variance. We also verified that the sum of the squared bias and variance equals the mean squared error. In a real-world situation, the variance of the error term (Var(\varepsilon)) equals noise and cannot be reduced. It serves as a lower bound for the mean squared error. Regarding our simulated example, where predictions are compared to the noiseless true value (f(x_{test})), the irreducible error does not enter the estimated decomposition. Consequently, we can neglect it here. Let's now discuss the remaining two quantities. Our bias is subject to our model, which means we can lessen our mean squared error with a lower bias. Moreover, we can minimize our mean squared error with a lower variance of our prediction. Thus, the question arises how both quantities interact.

Variance refers to how much (\hat{f}) changes when a different sample of training data is applied. Usually, a model with a close fit to the data exhibits a high variance, as changing a single observation has a big influence on the model. A less flexible approach like a linear model does not change significantly if a single observation varies. By contrast, bias is an error introduced by wrong assumptions. If we regress (y) on (x) utilizing ordinary least squares while the true relationship is not linear, our model will exhibit a high bias. James et al. (2017) refer to this problem as an over-simplification of real-world situations.

Conventionally, the more flexible the model, the lower the bias and the higher the variance, which we have already observed in figure 3.6.

! addonquizVariance and bias


Now that we have established the basics, we can continue to explore bias and variance separately before coming back to the bias-variance trade-off. It is safe to assume that the bias of our septic model is much lower than the bias of the constant model. That is because the constant model does not even attempt to follow the distribution of our observations, while the septic model mimics them very closely. However, the question arises whether it is possible to visually assess the difference in variance between models. The following figure depicts 25 versions of the constant and septic model each. For better visibility, the individual observations have been omitted and only the resulting models are depicted.

# Create the observations
X = runif(n=5000, min=0, max=30)

# Draw a vector of 5000 normal distributed residuals with zero-mean 
# and a standard deviation of 500
eps = rnorm(n = 5000, mean = 0, sd = 500)

# Create a vector of observations Y based on our function f(x) 
# and our residuals
Y = f(X) + eps

# The rep() statement is tricky, but easy to understand:
# rep(1:25, each = 50) will create 50 ones followed by 50 deuces
# and so forth
# rep(c("con","sep"), each = 25, times = 10) will create the string
# "con" 25 times followed by "sep" 25 times, which results in a
# vector of length 50. This vector is again repeated 10 times
# rep(1:25, times = 100) will simply create the numbers 1 to 25 
# followed by the numbers 1 to 25 for a total of 100 times
.data = data.frame(instance = rep(1:25, each = 50),
                   model = rep(c("constant", "septic"), 
                           each = 25, times = 100),
                   obs = rep(1:25, times = 100),
                   X = X,
                   Y = Y)


# Create the plots
p5 <- ggplot(.data) +
      stat_function(fun=f, geom="line", size = 2, color="#12C7A6") +
      geom_smooth(aes(x=X, y=Y, group = instance), method=lm, 
                  formula = y ~ 1, se=FALSE, color = "#bf00ff")
p6 <- ggplot(.data) +
      stat_function(fun=f, geom="line", size = 2, color="#12C7A6") +
      geom_smooth(aes(x=X, y=Y, group = instance), method=lm, 
                  formula = y ~ poly(x,7), se=FALSE, color = "#483699")

# Combine the plots
subplot(p5, p6, nrows = 1, titleX = TRUE, titleY = TRUE) 


Figure 3.7 This figure visualizes the variance of the constant and septic model respectively. While the purple horizontal lines (constant model) vary little across the different data sets, our septic model in blue varies a lot. This is due to the septic model mimicking the observations very closely: if the data set changes, the whole model may change considerably, too. The constant model, on the other hand, has a low variance, as it merely estimates the expected value.

! start_note "Info: Over- and underfitting"

The behavior described in figure 3.7 is referred to as over- and underfitting. Overfitting our data corresponds to a low training and a high test mean squared error. The model fits the observations too precisely. Thus, the resulting model is sensitive to any changes in the observations, which results in a high variance and, consequently, reduced prediction accuracy. To avoid overfitting, one must choose a less flexible model.

Underfitting, on the contrary, means our estimated model does not fit the observations very well. A common mistake is to assume a linear relationship, while in fact there is none. If underfitting occurs, a high bias results as our model is too simplistic.

The way to tackle this problem is to neither over- nor underfit the data. Therefore, one must estimate a model which is flexible enough, yet still provides a proper fit to the observations. Bias and variance will always coexist, and one must find a balance between both, instead of minimizing one and thereby maximizing the other.

! end_note
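
To make the over- and underfitting pattern tangible, the following small sketch (reusing f(), dat and draw_random_y() from the previous tasks) compares training and test mean squared errors for polynomials of increasing degree. The training error keeps falling with the degree, while the test error typically rises again for very flexible fits:

# Draw a fresh data set to act as pseudo-test data
test_dat = draw_random_y()

# Training and test MSE for polynomial degrees 1 to 7
sapply(1:7, function(d){
  fit = lm(Y ~ poly(X, d), data = dat)
  c(degree = d,
    train_MSE = mean((dat$Y - predict(fit))^2),
    test_MSE = mean((test_dat$Y - predict(fit, test_dat))^2))
})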


Before we end this exercise with a summary table regarding the mean squared error across models, please try to solve the following quiz:

! addonquizBias-variance trade-off

Task: To end this exercise, we return a summary table with the mean squared error, as well as the variance and the squared bias plus the sum of the two latter quantities for every model. As the code is already provided, please proceed with check.

pred %>%
  # Remove the rowwise() statement
  ungroup() %>%
  # Regroup based on model
  group_by(model) %>%
  # Coerce to integer to avoid decimals
  summarize(MSE = as.integer(mean((prediction - f(test$X))^2)),
            VAR = as.integer(mean((prediction - mean(prediction))^2)),
            BIAS2 = as.integer((mean(prediction) - f(test$X))^2),
            VAR_BIAS2 = as.integer((mean(prediction) - f(test$X))^2+
                              mean((prediction-mean(prediction))^2))
  ) %>%
  # Drop the model column as we will add a proper caption with stargazer
  select(-model) -> predm

# Return the results
# t() transposes our data.frame 
stargazer(t(predm),
          title = "Bias-variance trade-off",
          covariate.labels = c("", "Constant", "Linear", 
                               "Quadratic", "Septic"),
          align = TRUE,
          type = "html",
          style = "aer",
          colnames = TRUE)


We can see that the more flexible our model becomes (rising polynomial degrees), the lower the bias and the higher the variance. We can also confirm that our quadratic model has the lowest mean squared error. This is due to the fact that, across the 250 data sets, the predictions of the quadratic model are on average the closest to the true value (f(31)). We can also see that the decomposition holds true for all models: in each case, the sum of squared bias and variance is equivalent to the mean squared error.

This exercise should have clarified the connection between bias and variance, which is a crucial component of econometrics and especially of machine learning and prediction accuracy. The problem set will present different approaches to tackle the bias-variance trade-off in the following exercises, as we predict the probability of default based on our real-world Lending Club data set.

Exercise 4 -- Predicting credit default rates

Classification

Similar to the previous exercise, which discussed bias and variance with the help of linear regression, both quantities apply to classification problems, too. In our case, we are interested in whether a loan is going to default or not. That means that, as opposed to the regression setting of exercise 3.2 with a quantitative response, we are concerned with a binary state: 'Default' or 'Paid'.

From now on, assume (Y) is our loan status. James et al. (2017) define the accuracy of our estimate (\hat{f}) via the so-called error rate. The error rate is the sum of all failures (wrongly classified observations) divided by the total number of observations (T). A failure is flagged by the indicator function (I()), which returns 1 if (y_t \neq \hat{y}_t) holds and 0 otherwise. To put it in mathematical terms, we can write our error rate as follows:

[\textrm{error rate} = \frac{1}{T} \sum_{t=1}^T I(y_t \neq \hat{y_t})]
[ I(y_t \neq \hat{y}_t) = \begin{cases} 1; & y_t \neq \hat{y}_t \\ 0; & y_t = \hat{y}_t \end{cases} ]

Since we divide our data set into training and test data, the important measure is a low test error rate:


[\textrm{error rate}_{test} = \frac{1}{T_{test}} \sum_{t_{test}=1}^{T_{test}} I(y_{t_{test}} \neq \hat{y}_{t_{test}})]
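
As a small illustration (with made-up classes, not our Lending Club data), the error rate is simply the share of misclassified observations:

# Illustrative example with made-up classes
y_true = c("Default", "Paid", "Paid", "Default", "Paid")
y_hat = c("Default", "Paid", "Default", "Paid", "Paid")

# Share of misclassified observations
mean(y_true != y_hat)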

Bayes classifier

This leads us to the Bayes classifier, which is one of the possible approaches to classify our observations. In accordance with James et al. (2017), the test error rate is minimized by a classifier which assigns every observation to the class it most likely belongs to. To put it differently, one should assign a test observation to the class (j) for which the following probability is highest:

[ Pr(Y = j |X = x_{test} ) = \frac{Pr(X = x_{test} \cap Y = j )}{Pr(X = x_{test} )} ]

It is important to note that for calculating the Bayes classifier one must know the conditional probability for each value of our variable (X). If this information is available, the Bayes classifier assigns each observation to the class with the highest conditional probability. Thus, averaging the result yields the Bayes error rate:

[1 - E\Bigg[\max_{j} Pr(Y = j | X)\Bigg]]

At this point, in accordance with Fielding and Bell (1996), it is essential to add that the classic Bayes classifier predicts a default if (Pr(Y = 1 |X = x_{test} )) is above the threshold of 0.5. This threshold, however, is completely up to the modeler's choice, and it might be sensible to adjust it. Imagine a scenario in which we are in a position to grant loans. Furthermore, suppose we achieved a high prediction accuracy, as we predicted a lot of true positives and almost no false positives. At first glance, the high prediction accuracy sounds excellent. However, perhaps a lower prediction accuracy would be even better. That is because we have not examined the associated costs yet. Perhaps each of the false positives accounts for an exceptionally high amount of interest payments. Hence, if we adjust the threshold such that fewer false positives occur, our profit increases. We will discuss this aspect and its possible applications in exercise 5.
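
The following minimal sketch (with made-up probabilities and classes, not our actual predictions) illustrates how the chosen threshold changes the resulting classification:

# Made-up predicted default probabilities and true classes
p_hat = c(0.10, 0.35, 0.55, 0.80, 0.95)
y_true = c("Paid", "Paid", "Default", "Default", "Default")

# Classify with a given threshold and tabulate the outcome
classify = function(threshold){
  table(predicted = ifelse(p_hat > threshold, "Default", "Paid"),
        true = y_true)
}

classify(0.5)
classify(0.3) # a lower threshold flags more loans as 'Default'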

In line with James et al. (2017), the Bayes classifier serves as the gold standard to compare other methods to. Unfortunately, the distribution of our response variable (Y) conditional on the predictors (X) is unknown in real-world scenarios. Therefore, it may be infeasible to resort to the Bayes classifier. Possible alternatives include logistic regression and the (k)-nearest neighbor (KNN) classifier. The latter algorithm takes the (k) closest neighbors of each (x_t) into consideration to determine its class via majority vote.

Area under the curve

Finally, we need a measure to render our different models comparable. As already hinted, we are going to rely on the so-called area under the curve (AUC). To understand the AUC, we must first discuss the possible scenarios that can occur when classifying an observation. Please see the overview below:

  • True Positive (TP): the class was predicted as positive, which corresponds to the true class.
  • False Positive (FP): the class was wrongly predicted as positive, yet the true class is negative.
  • True Negative (TN): the class was predicted as negative, which corresponds to the true class.
  • False Negative (FN): the class was predicted as negative, yet the true class is positive.


With these definitions, we can calculate the true positive rate (TPR) and false positive rate (FPR) as follows:

[TPR = \frac{TP}{P} = \frac{TP}{TP + FN}\qquad FPR = \frac{FP}{N} = \frac{FP}{FP + TN}]

The AUC is the area under the curve, while the curve itself is called the ROC. ROC is an acronym for receiver operating characteristic and corresponds to a curve plotting the true positive rate against the false positive rate. The following interactive figure creates an arbitrary ROC curve with the sole purpose to exemplify. Additionally, we will derive the AUC manually for the logistic regression in the next subchapter.

# The following code statements simulate a ROC curve
# To support the slope of the final curve (false positive rate on the abscissa, 
# true positive rate as ordinate), values have been chosen to increase fast
roc <- data.frame(false_positive_rate = c(0, 0.1, 0.2, 0.4, 0.8, 1.0), 
                  true_positive_rate = c(0, 0.65, 0.85, 0.99, 1, 1))

# Applying a spline to our data set is a fancy trick, which helps 
# to create more curvature
sp <-  spline(roc, n = 20)
# Limit the y-coordinates to a maximum of 1
sp$y[10:20] <- 1
# Reassign our interpolated values to our data.frame roc
roc <- data.frame(false_positive_rate = sp$x, true_positive_rate = sp$y)

# Plot the results
ggplot(roc, aes(x = false_positive_rate, y = true_positive_rate)) + 
  # ROC
  geom_line(size = 2, color = "#9205F0") +
  # AUC
  geom_area(fill = "#9205F0", alpha = 0.3) +
  # Limit our coordinate system
  coord_cartesian(xlim=c(0,1), ylim=c(0,1)) +
  labs(x = "False Positive Rate", y = "True Positive Rate") -> g
  ggplotly(g, tooltip = NULL)


Figure 4.1 The utilized data set is user-generated to depict a 'good' ROC curve. We first defined six observations and utilized a spline for interpolation. Afterwards, we manipulated the spline and plotted the resulting observations. In general, one can say that the higher the AUC, the better the ratio between true positives and false positives.

An even better ROC curve than figure 4.1 would rise almost perpendicularly and hug the top left corner, as an increasing steepness of the curve results in a higher AUC. By definition, a classifier with an AUC of 0.5 is not reliably better than chance. Thus, the higher this value, the better the model. Nonetheless, one must keep in mind that both the true positive rate and the false positive rate depend on the threshold utilized by the classifier. Therefore, it makes sense to compare different classifiers with different thresholds, as suggested in Fielding and Bell (1996). They further present a variety of other accuracy measures and suggest utilizing sensitivity measures if false positive errors are more costly than false negative errors.

The following exercises kick off the second part of this problem set. We will discuss all required steps to build a model predicting credit default rates. Due to its general nature, the whole procedure may be quickly transferred and applied to a different set of data. Moreover, other supplementary algorithms can be implemented to optimize results even further. Nonetheless, to keep computation times to a minimum, we will resort to solely presenting the final models. To be more exact, we will follow this brief introduction to exercise 4 with a discussion about logistic regression, the development of two logit models to predict default rates and a manual derivation of the AUC in subchapter 4.1. Furthermore, subchapter 4.2 will be concerned with increasing model predictability, as a logistic model is tuned in an attempt to optimize prediction accuracy. Finally, the last two subchapters introduce classification trees, as well as subsampling methods such as bagging and boosting.

Exercise 4.1 -- Logistic regression

For the remainder of the problem set, the approach will be as follows: we will introduce the model and shortly discuss its properties, before conducting our analysis and reviewing our results. Hence, this subchapter presents two logistic regression models to predict credit default.

Logistic model

Please recall the definition of the probability of default, which we denote in accordance with James et al. (2017, p. 131) as follows:

[ \begin{eqnarray} p(X) &=& Pr(Y = 1|X) = Pr(default = 1|X) \\ &=& \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k \end{eqnarray} ]

As before (k) corresponds to the number of predictors, while (Y) represents the binary response and (X) constitutes the matrix of predictors. Continuing the idea of logistic regression, one can fit a so-called logistic function (f(X\beta) = \frac{e^{X\beta}}{1+e^{X\beta}}), which limits our predicted probability to values between 0 and 1. Besides, apart from the logit model, other logistic regressions can be performed. Fitzmaurice et al. (2011) offer a broad spectrum of applications in conjunction with Hosmer et al. (2013). Both books discuss alternatives and extensions to the logit model, as well as other logistic models.

Odds

With (p(X)) denoting the probability of default, we can proceed to define odds. Odds are more convenient than probabilities, as they are easy to interpret and widely used in economics. E.g., if the odds of credit default are 1:10, then, on average, 1 out of 11 people defaults. We can deduce odds based on the following formula:

[\begin{eqnarray} \frac{Pr(default = 1|X)}{Pr(default = 0|X)} = \frac{p(X)}{1-p(X)} &=& \frac{\frac{exp(X \beta )}{1 + exp(X \beta )}}{1- \frac{exp(X \beta )}{1 + exp(X \beta )}} \\ &=& \frac{\frac{exp(X \beta )}{1 + exp(X \beta )}}{\frac{1}{1 + exp(X \beta )}} \\ &=& exp(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k) \end{eqnarray} ]

Applying the logarithm on both sides yields the so-called log-odds or logit:

[\log\Bigg(\frac{p(X)}{1-p(X)}\Bigg) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k]

Another characteristic of the logit model is the fact that the log-odds are linear in (X). Consequently, ceteris paribus, a one unit change in (X_1) changes the log-odds on average by (\beta_1).
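
A small numeric example (with assumed values) illustrates both statements:

# A default probability of 1/11 corresponds to odds of 1:10
p = 1/11
p / (1 - p)

# If, for instance, beta_1 = 0.1, a one unit increase in X_1 raises the
# log-odds by 0.1 and therefore multiplies the odds by exp(0.1)
exp(0.1)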

Fitting the model

Analogous to the ordinary least squares (OLS) approach for the linear regression model, a procedure is required to fit the logit model. Regarding logistic regressions, maximum likelihood (ML) estimation is the standard procedure, as described in Czepiel (n.d.) and Greene (2002). ML chooses the coefficients such that the predicted probability is close to 1 for loans that defaulted and close to 0 for loans that were paid. We can illustrate this mathematically:

[ \max\limits_{\beta_0, \beta_1, \dots, \beta_k}\Bigg(\ell\Big(\beta_0, \beta_1, \dots, \beta_k\Big)\Bigg) = \max\limits_{\beta_0, \beta_1, \dots, \beta_k}\Bigg({\displaystyle \prod_{t=1}^{T} \Big[p(x_{t})\Big]^{y_t} \Big[1 - p(x_{t})\Big]^{1-y_t}}\Bigg) ]

The function (\ell) denotes the likelihood, while (y_t) is our binary response. The objective is to maximize the likelihood over our coefficients as a function of every loan (t).
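
To illustrate the idea, the following sketch (with simulated data, not the Lending Club set) maximizes the logistic log-likelihood directly with optim() and compares the result to the glm() fit:

# Simulate a small data set with known coefficients 0.5 and 2
set.seed(1)
x = rnorm(200)
y = rbinom(200, size = 1, prob = 1 / (1 + exp(-(0.5 + 2 * x))))

# Negative log-likelihood of the logit model
neg_loglik = function(beta){
  p = 1 / (1 + exp(-(beta[1] + beta[2] * x)))
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

# Maximize the likelihood (i.e. minimize the negative log-likelihood)
optim(par = c(0, 0), fn = neg_loglik)$par

# Reference: the maximum likelihood fit via glm()
coef(glm(y ~ x, family = binomial(link = logit)))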

Application

With the theoretical foundation laid out, it is time to apply our knowledge to the Lending Club data set data_b. As before, we must load our data first.

Task: Use the function readRDS() to import the data set data_b.rds and assign the imported data set to a variable called data.

data = readRDS("data_b.rds")

# Return a summary of data
glimpse(data)

This data set is related to our previous data set data.rds of exercise 2. Contrary to data.rds, data_b.rds comprises only terminated loans, so that only two possible values of our response column default remain: 'Default' and 'Paid'. Moreover, all columns of the data set data_b.rds have been transformed into dummy variables, as some of the packages used later are limited to binary dummies. All modifications to the original data sets are retraceable with the preparation.R script hosted on GitHub. It must be mentioned that data_b.rds excludes missing values, as not all presented implementations of the machine learning algorithms can handle missing data. An alternative approach would be to impute them. Van Buuren and Groothuis-Oudshoorn (2011) provide a good read about imputation. They outline the technical opportunities in a clear and comprehensible way concerning the mice package, which is a powerful library to tackle missing values in your data.

Task: Unlike in exercise 3.2, we divide our data set data with the help of a function from the caret package called createDataPartition(). This function has one key advantage: it keeps the distribution of our response variable default (approximately) identical to that of our original data set data. That means, if a total of 100 observations are part of a data set with 25 defaults and 75 paid loans, the ratio is 1:3. If we divide our data set into 80 training and 20 test observations, createDataPartition() will strive to keep this ratio.

The parameter p determines the relative share of observations for the new sample, while the parameter list determines the output format. As we utilize a vector to subset our previous data set data, we set list to FALSE.

library(caret)

# Generate train_index
train_index = with.random.seed(createDataPartition(data$default, 
                              p = 0.8, list = FALSE), seed = 5690)

# Divide our data set into test (20%) and training observations (80%)
train <- data[train_index,]
test <- data[-train_index,]

With our training and test data sets train and test in place, we can conduct our first logit regression. Emekter et al. (2015) analysed the Lending Club loan data from May 2007 to June 2012. Their findings indicate the loan grade, the debt-to-income ratio, the FICO score and the revolving credit line utilization as important predictors. Therefore, we employ the loan grade dummies, dti, mean_fico_range and revol_util as predictors in our first regression.

We can perform a logistic regression in R applying the function glm(). Please have a look at the info box below for an introduction. The syntax is related to the previous linear regression in exercise 3.2.

info("glm() and stargazer()") # Run this line (Strg-Enter) to show info

Task: The following code chunk performs a logistic regression related to Emekter et al. (2015). Predictors are the loan grade dummies, as well as the variables dti, mean_fico_range and revol_util.

! addonquizGuess-timation

library(stargazer)

# Logistic regression
# Inspired by Emekter et al. (2015)
log <- glm(default ~ grade_B + grade_C + grade_D + grade_E + grade_F + grade_G + 
                     dti + mean_fico_range + revol_util, 
                     # Perform a logistic regression
                     family = binomial(link=logit), 
                     # Provide the training data set
                     data = train)

# Return the results
stargazer(log,
          # A fitting caption for our respective model(s)
          title="Logit Regression - inspired by Emekter et al. (2015)",
          # Adjust variable names to be more comprehensible
          covariate.labels = c("Loan Grade[B]", "Loan Grade[C]",
                               "Loan Grade[D]", "Loan Grade[E]",
                               "Loan Grade[F]", "Loan Grade[G]",             
                               "Debt-To-Income Ratio", "FICO Score", 
                               "Revolving Credit Utilisation"),
          # Simplify the formula
          dep.var.labels = "default ~ grade + dti + fico mean range + revol util",
          omit.stat = c("aic", "ll"),
          align = TRUE,
          digits = 3,
          digits.extra = 3,
          # Adjust output to html and format the table according to 
          # the American Economic Review standards
          type = "html",
          style = "aer"
    )

All coefficients are significant at the 1 percent level. The sign of the debt-to-income ratio is positive, which is anticipated, as it suggests that borrowers with a high debt-to-income ratio are associated with a high default risk. To no surprise, the FICO score has a negative sign, which is expected too, as a high FICO score should equal a low probability of default. However, the negative sign of the revolving credit line utilization seems unusual, as Emekter et al. (2015) claim that an ideal borrower has a revolving credit line utilization of zero, meaning the borrower does not draw on the revolving credit line at all.

Prediction

As our model log is still simple, we can predict values manually. Please use the following code chunk to compute the probability of default based on the previous coefficients of log. The following formula, as stated in James et al. (2017, p. 134) may help with the task:

[ \begin{eqnarray} \hat{p}(X) &=& \frac{exp\Big(\hat{\beta_0} + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_k X_k\Big)}{1 + exp\Big(\hat{\beta_0} + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_k X_k\Big)} \end{eqnarray} ]

! addonquizProbability of default



Remark: Keep in mind, that binary dummies have a value of either 0 or 1.

# Enter your code here

Task: The following code chunk predicts the probability of default for each loan in test. As before in exercise 3.2, we can utilize the function predict() and pass the fitted model log, as well as the predictors in test as arguments.

# Predict the outcome utilizing our test data 
prediction_log <- predict(log, select(test, -default), type = "response")

# Return the first six probabilities
head(prediction_log)

The result is a named numeric vector: the name indicates the row number of the respective loan, while the value corresponds to the estimated probability of default. As predict() does not order the probabilities, we must sort them before calculating the AUC.

The next few tasks provide insights towards plotting the ROC curve and calculating the AUC. There are however multiple ways to calculate or estimate the AUC, as described in Horton (2016).

Task: The following code chunk orders the predictions in prediction_log, extracts the probabilities and their index and adjusts the observations in test accordingly. Furthermore, both the true positive rate and the false positive rate are determined utilizing the following formulae:

[TPR = \frac{TP}{P} = \frac{TP}{TP + FN}\qquad FPR = \frac{FP}{N} = \frac{FP}{FP + TN}]

Besides, the output is assigned to a data.frame called dat and the first six observations are returned.

# Order the probabilities decreasingly and maintain the index
prediction_log_ordered <- sort(prediction_log, 
                               decreasing = TRUE, 
                               index.return = TRUE)

# Extract the probabilities and their index separately
probabilities <- as.numeric(prediction_log_ordered$x)
indeces <- prediction_log_ordered$ix

# Adjust the true responses accordingly
true_value <- test$default[indeces]

# Calculate the true and false positive rate:
# True positives divided by the total number of positives (TP + FN)
tpr = cumsum(true_value == "Default") / sum(true_value == "Default") 
# False positives divided by the sum of false positive and true negative
fpr = cumsum(true_value == "Paid") / sum(true_value == "Paid")

# Assign the results to dat
dat <- data.frame(false_positive_rate = fpr, true_positive_rate = tpr)
head(dat)

We can see that both the true positive rate and the false positive rate increase slowly. The following task uses dat to plot the ROC curve.

Task: This code chunk creates two plots side-by-side. The right graph is based on the same data and is merely a zoomed-in version of the left one. It exemplifies that the ROC curve is genuinely a step function rather than a continuous one, in contrast to before, when we smoothed our ROC curve with the help of a spline and geom_line() instead of geom_step(). This time, we rely on the unsmoothed version to compute the ROC curve as accurately as possible.

# Create a plot for the ROC curve
ggplot(dat, aes(x = false_positive_rate, y = true_positive_rate)) + 
  # ROC (unsmoothed)
  geom_step(size = 2, color = "#9205F0") +
  # Add the angle bisector as reference
  # Since geom_abline breaks the coordinate system, 
  # utilize geom_path to connect the coordinates
  # (0,0) and (1,1)
  geom_path(data = data.frame(x = c(0,1),
                              y = c(0,1)),
            aes(x = x, y = y),
            color = "#39d9db", size = 1, lty = "dotted") +
  # Limit the coordinate system
  coord_cartesian(xlim = c(0,1), ylim = c(0,1)) +
  labs(x = "False Positive Rate", y = "True Positive Rate") -> g

# Plot the original ROC curve plus the zoomed-in version 
# side-by-side
subplot(g, layout(ggplotly(g),
                  # Specify arbitrary values 
                  xaxis = list(range = c(0.1, 0.1005)),
                  yaxis = list(range = c(0.278, 0.279))), 
                  nrows = 1, titleX = TRUE, titleY = TRUE)


Figure 4.2 This figure depicts the actual ROC curve. The next task calculates the corresponding AUC.

Task: This code chunk calculates the AUC. We can calculate the AUC by summing up the area of the rectangles (width x height).

# Calculate the AUC
# Width: (fpr[2:length(fpr)] - fpr[1:(length(fpr) - 1)])
# Height: tpr[2:length(tpr)]
auc = sum((fpr[2:length(fpr)] - fpr[1:(length(fpr) - 1)]) 
          * tpr[2:length(tpr)])

# Return the AUC
cat("Area under the curve:", auc)

Task: This code chunk calculates the AUC based on the pROC library, which we will use now.

Remark: Due to performance reasons, this is the first task with two code chunks. The first code chunk is an excerpt of the prediction.R script, which is hosted on GitHub. That code chunk cannot be executed here but highlights the actual code required to complete the task. The second code chunk imports the prepared results and performs the less hardware-hungry tasks.

# Compute the ROC curve via pROC
roc_log <- roc(response = test$default, predictor = as.numeric(prediction_log))
# Import the results via load() for convenience
# A .rdata file can store multiple objects, as otherwise, we would have to load each
# object with readRDS() separately
load("log_ps.rdata")

# Return the AUC via pROC
roc_log$auc

We can see that the different approaches yield identical values for the AUC. For simplicity, we will opt for the pROC implementation from now on and omit a further visual representation of the ROC curve. Instead, we will focus solely on the AUC as a measure of prediction accuracy. In our case, the AUC of roughly 0.7 may be improved by employing every variable available in the training data train, as the logistic regression inspired by Emekter et al. (2015) utilized only a small subset of the available predictors. To end this exercise, we develop a full rank model comprising every variable in train as a predictor.

Task: The following code chunk fits a full rank model to our training data train and assigns the resulting model to a variable called log_full. Note that the formula is straightforward: if one intends to use every variable in train, he or she may simply provide a '.' as a placeholder. To conclude the task, we compare both models regarding the AUC.

# Full rank model
log_full <- glm(default ~ ., 
                family = binomial(link=logit), 
                data = train)

# Predict the outcome utilizing our test data
prediction_log_full <- predict(log_full, select(test, -default), type="response")

# Compute the ROC curve via pROC
roc_log_full <- roc(response = test$default, predictor = as.numeric(prediction_log_full))
# Return the values for the AUC
# Logit Regression - inspired by Emekter et al. (2015)
roc_log$auc
# Full rank model
roc_log_full$auc

As we can see, the full rank model is the winner concerning the AUC. In general, we expect a higher prediction accuracy with more available information, which is the case in the full rank model. Regarding the Lending Club data, other studies than Emekter et al. (2015) incorporated more data into their analysis, too, e.g. Li et al. (2016), Serrano-Cinca et al. (2015) and Carmichael (2014), who developed multiple approaches, including multinomial regressions that account for the prepayment of loans and Cox regressions for survival analysis. This begs the question whether we can improve our results as well. Thus, we keep the logistic model for now and focus on finding the best covariates for our model. The latter procedure is referred to as variable or feature selection and is an important aspect of econometrics. Moreover, Fahrmeir et al. (2013) is recommended for further reading concerning variable selection.

For classification purposes, James et al. (2017, pp.210-213) describe different criteria for assessing the model fit: undoubtedly, the best approach would be to develop a model for each possible variable combination. That would, however, result in (2^{\rho}) models for (\rho) variables. Consequently, for ten variables alone, we would have to conduct 1024 regressions. To tackle this problem, James et al. (2017) as well as Burnham and Anderson (2002) describe more sophisticated approaches. A so-called step-wise forward or backward selection, for example, calculates just (1+\rho(\rho+1)/2) models. This equals 56 models for (\rho = 10) variables. This approach represents a happy medium and applies the idea of slowly increasing or decreasing the number of applied variables without examining every possible combination. To refine the latter methodology, one can resort to hybrid approaches. A hybrid model combines a stepwise forward and backward selection in a single procedure. Hybrid approaches might add multiple variables at once while others are discarded if they do not improve the model. Moreover, Bishop (2016, ch. 11) explains a variety of different sampling methods to adjust the data volume.
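
As a quick illustration, the following sketch computes the number of candidate models for both strategies; the commented step() call indicates how a backward selection based on the AIC could be started from the full rank model of exercise 4.1 (it is not executed here, as it would be computationally expensive on the full Lending Club data):

# Number of candidate models for rho variables
candidate_models = function(rho){
  c(all_combinations = 2^rho, stepwise = 1 + rho * (rho + 1) / 2)
}
candidate_models(10)

# For reference only (computationally expensive):
# log_step = step(log_full, direction = "backward", trace = FALSE)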

Regarding this problem set, we will learn more about ridge and LASSO regularization, as well as cross-validation to avoid overfitting our data, in the next chapter. Both are conventional approaches to improve our present results.

Exercise 4.2 -- Resampling and tuning

In this exercise, we shift the emphasis to resampling and feature selection: resampling summarizes procedures which draw multiple subsets from our training data and fit a model on each sample. The variety of resulting models allows comparing similar models, which reveals variability in the data. Feature selection, on the other hand, deals with choosing the right variables to develop a model. We will start this exercise by discussing cross-validation and the bootstrap, which are, according to James et al. (2017), the most common resampling methods. The second part sheds light on shrinkage methods like ridge regression and the least absolute shrinkage and selection operator (LASSO). Finally, the third part proceeds with a tuned logistic regression model to predict credit default. Our aim for this subchapter is unaltered: we are searching for a model which maximizes the AUC without overfitting our training data.

Cross-validation and the bootstrap

In general, James et al. (2017) describe cross-validation as an approach to assess a model, while the bootstrap serves as a tool for model selection.

Cross-validation

A cross-validation is a procedure which divides our training data into (k) folds: imagine a scenario with five folds and 100 observations, which results in 20 observations per fold. We pick our first fold as pseudo-test set. Consequently, the other four folds remain to train the model. The next step is to swap our pseudo-test set: we choose the second fold as pseudo-test set, which leaves the first, third, fourth and fifth fold for training the model. If we apply the same procedure three more times, we can average the test error. Cross-validation allows for employing almost all available training data to fit the model without resorting solely to the actual test data set to assess the model. This procedure is computationally more expensive than just randomly splitting our data into training and test data, but offers two advantages:

  • Our test error rate is highly dependent on our sample split. Consequently, adding variability when estimating a model is a better way to pick up fluctuations.
  • Plus, a general rule of thumb is: the more observations, the better the result. The k-fold cross-validation approach avoids overestimating the test error rate, as we can calculate a pseudo-test error rate in advance.

Regarding the classification setting, we can define the (k)-fold cross-validation estimate as follows:

[CV^{Error}_{(k)} = \frac{1}{k}\sum_{\tau = 1}^k Error_{\tau} = \frac{1}{k}\sum_{\tau = 1}^k I(y_{\tau} \neq \hat{y}_{\tau})]

A particular case of the (k)-fold cross-validation is the leave-one-out cross-validation. In this setting, (k) equals the number of available observations. This approach requires a considerable amount of computational power and, with the bias-variance trade-off in mind, is not efficient at all. Therefore, the question arises how we decide on (k). If we choose a value of (k) which is close to the number of observations (T), our result will be (k) fits. Unfortunately, these fits are highly correlated, due to the fact that each of the models was trained on almost the same data, and averaging highly correlated fits results in a high variance of the cross-validation estimate. Following this line of thought, the smaller (k), the less correlated our models. That is an apparent benefit regarding variance, as James et al. (2017, p.183) discuss that the variance of the mean of correlated quantities declines with a lower degree of correlation. In fact, they mention a (k) of 5 or 10 as two empirically proven values, whose test error rates suffer neither from high bias nor from high variance.
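
A minimal sketch of a k-fold cross-validation with k = 5 (on simulated data, not the Lending Club set) may look as follows:

# Simulate a small data set
set.seed(1)
sim = data.frame(x = runif(100, min = 0, max = 10))
sim$y = 2 * sim$x + rnorm(100)

# Assign each observation to one of k = 5 folds
k = 5
folds = sample(rep(1:k, length.out = nrow(sim)))

# Fit the model on k-1 folds and compute the error on the left-out fold
cv_errors = sapply(1:k, function(fold){
  fit = lm(y ~ x, data = sim[folds != fold, ])
  mean((sim$y[folds == fold] - predict(fit, sim[folds == fold, ]))^2)
})

# Cross-validated estimate of the test error
mean(cv_errors)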

The bootstrap

The bootstrap is a tool to assess the uncertainty associated with different estimators or statistical learning methods (James et al., 2017). To be more precise, bootstrapping creates new samples of data to estimate the variability of our estimators within our available data. That is achieved by drawing multiple observations with replacement from our original data set. Each time, the selected observations constitute a new sample, yielding multiple different data sets.

Assume a coefficient gamma (\gamma), which is part of an arbitrary statistical model. Gamma could be the amount of time we spend at work, or the proportion of our assets we want to sell. We further assume we used bootstrapping to create 100 'new' data sets. If we denote our estimate of gamma as (\hat{\gamma}), we can calculate the mean and standard deviation as follows:

[\overline{\gamma} = \frac{1}{100} \sum_{\upsilon = 1}^{100} \hat{\gamma_{\upsilon}}\qquad\qquad\qquad \sigma(\hat{\gamma}) = \sqrt{\frac{1}{100 - 1}\sum_{\upsilon = 1}^{100}\big(\hat{\gamma}_{\upsilon} - \overline{\gamma}\big)^2}]

Utilizing both formulae, we can calculate the average difference of our estimate (\hat{\gamma}) from (\gamma): assume we created 100 "new" data set samples and computed a mean of (\overline{\gamma} = 0.77) with a standard deviation of (\sigma(\hat{\gamma}) = 0.05). This would mean, that on average, we expect (\hat{\gamma}) to deviate from (\gamma) by 0.05 according to James et al. (2017, pp. 187-188).
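
The following sketch (with simulated data) mimics this procedure, using the sample mean as our estimator of gamma:

# Simulate an original sample
set.seed(1)
original_sample = rnorm(50, mean = 0.77, sd = 0.35)

# Draw 100 bootstrap samples (with replacement) and compute the estimate
# of gamma for each of them
gamma_hat = replicate(100, mean(sample(original_sample, replace = TRUE)))

# Bootstrap mean and standard deviation of our estimate
mean(gamma_hat)
sd(gamma_hat)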

Shrinkage methods and logistic regression

Two well-known shrinkage methods are the ridge regression and the LASSO. As stated in Hastie, Friedman and Tibshirani (2017, pp.61-73), shrinkage methods have three advantages: they are less computationally expensive, they may yield a lower prediction error, and the interpretability of the model may increase due to a limited number of variables.

Ridge regression

Ridge regression penalizes regression coefficients based on their size. This approach is effective if the underlying model suffers from high variance. In this case, ridge regression can decrease the variance at the cost of a somewhat higher bias. Hence, ridge regression prevents a model from overfitting the training data set by implementing a shrinkage parameter (\lambda \geq 0). In general, a shrinkage method shrinks coefficients subject to (\lambda): the higher (\lambda), the smaller the coefficients. This means the coefficients have less influence (one could also say they become weaker). The resulting model fits the training data set worse than it would without shrinking the parameters. As these approaches distort the estimates in favor of future prediction accuracy, employing a shrinkage method may backfire if predictors are shrunk too much.

To present the idea mathematically, we resort to a simple logistic model analogous to Le Cessie and Van Houwelingen (1992), Hastie, et. al (2017, pp.57-61), Tibshirani (2013) and Guerzhoy (2016). Recall our previous definition of the probability with (k) predictors:

[ \begin{eqnarray} p(X) &=& \frac{exp\Big(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k\Big)}{1 + exp\Big(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k\Big)} \\ &=& \frac{exp\big(X\beta \big)}{1 + exp\big(X\beta \big)} \end{eqnarray} ]

The function above depends solely on (\beta). If we apply the log-likelihood function and some manipulations in accordance with Duffy and Santner (1989, p.963), the result looks as follows:

[\ell(\beta) = \sum_{\forall k}\Big[Y_k \log p(X_k) + (1-Y_k) \log\big(1-p(X_k)\big)\Big]]

[\ell^{\lambda}(\beta) = \ell(\beta) - \lambda\|\beta\|^2 = \ell(\beta) - \lambda \sum_{\forall k}\beta_k^2]

The usual approach maximizes the log-likelihood of the estimates, which begs the question how one approaches the maximization problem, as the penalized log-likelihood function involves the tuning parameter (\lambda). The answer is simple: we need to find a reasonable value of (\lambda) to plug into our function. That is similar to a problem we already managed to tackle. Thus, the following quiz should be easy to solve:

! addonquizEstimating the optimal tuning parameter lambda


Nonetheless, keeping the continuous technical progress in mind, we can leave the task of choosing an optimal tuning parameter (\lambda) to R. In a nutshell, the ridge regression avoids a scenario in which a sizeable positive coefficient would cancel out a similarly large negative coefficient, by penalizing their magnitude. It further avoids overfitting the training observations and is especially useful for models suffering from high variance.

The LASSO

Like the ridge, the least absolute shrinkage and selection operator (LASSO) shrinks our coefficients. The LASSO is especially useful for feature selection, as it shrinks some coefficients exactly to 0, which discards them from the model. The basic principle remains the same as previously; only the penalty is different. Due to the complexity of the calculus, we omit the mathematical part of applying the LASSO to classification models. An interested reader may have a look at Friedman et al. (2008, pp.8-15) for the application to binary logistic regression as well as a variety of other models. Alternatively, see Krishnapuram et al. (2005) for a detailed derivation in the multinomial logit case.
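
As a small preview (on simulated data, not yet our Lending Club set), glmnet() fits a ridge regression for alpha = 0 and the LASSO for alpha = 1, while cv.glmnet() picks lambda via cross-validation:

library(glmnet)

# Simulate a small classification problem
set.seed(1)
x = matrix(rnorm(200 * 5), ncol = 5)
y = rbinom(200, size = 1, prob = 1 / (1 + exp(-x[, 1])))

# Ridge (alpha = 0) and LASSO (alpha = 1) with cross-validated lambda
ridge = cv.glmnet(x, y, family = "binomial", alpha = 0)
lasso = cv.glmnet(x, y, family = "binomial", alpha = 1)

# The LASSO shrinks some coefficients exactly to zero
coef(lasso, s = "lambda.min")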

The LASSO is still under continuous research. Meier, Van De Geer and Buehlmann (2008), for example, worked on an extension of the LASSO for variable selection. Using predefined groups of variables in a logistic regression setting, they show that the resulting estimator can remain consistent even when the number of predictors exceeds the number of observations.

Application

With the theoretical foundation in place, we can apply these methods in our analysis. We will utilize the caret library with its train() function to implement a 10-fold cross-validation. We will use this cross-validation to estimate two tuning parameters, alpha (\alpha) and lambda (\lambda). We will then supply both tuning parameters to a function called glmnet(), which is part of a package of the same name. This function applies a hybrid between ridge and LASSO to train our model. Before we continue, let's check whether we have memorized everything so far:

! addonquizDouble-checking fundamentals


With our newly acquired expertise, we are ready to tune our logistic regression model.

Task: This code chunk imports our data set data_b.rds and splits the observations into our respective data sets. Note that this time we utilize three different data sets: a tuning data set referred to as tune, which comprises 10% of our observations, as well as the training and test data sets, which account for 70% and the remaining 20% of observations respectively. The variable names remain familiar, as we refer to the training data set as train and the test data set as test.

library(caret)

# Import the data set data_b.rds
data = readRDS("data_b.rds")

# Generate the train_index
train_index <- with.random.seed(createDataPartition(data$default, 
                              p = 0.8, list = FALSE), seed = 5690)

# Divide our data set into test (20%) and remaining observations (80%)
test <- data[-train_index,]
left <- data[train_index,]

# Generate the tune_index
tune_index <- with.random.seed(createDataPartition(left$default, 
                              p = 0.125, list = FALSE), seed = 5690)

# Split our remaining observations into tune (10%) and training data (70%)
# of total observations
tune <- left[tune_index,]
train <- left[-tune_index,]

We omit a discussion of the data sets, as their content remains unaltered; only the number of observations in each data set changed. Before we proceed to estimate our model, have a look at the info box below, which introduces both functions mentioned in the opening section: train() and glmnet().

info("Info: train() and glmnet()") # Run this line (Strg-Enter) to show info

Task: This code chunk showcases the code to tune logistic regression with the support of the LASSO and the ridge regression. As tuning is, in general, computationally expensive or may even be infeasible due to hardware limitations, the code has been provided separately.

If you are solving this problem set online instead of on your personal computer, please import log_tuned.rdata as suggested. You may still opt to run the code yourself; however, it is advised to resort to the prediction.R script. This script computes all models presented in the problem set as well as supplementary models. Please keep in mind that the required hardware should have at least 32 gigabytes of available random-access memory (RAM) and eight threads. If the device does not meet these requirements, a computation of the models may be unfeasible or take multiple days. Nonetheless, in this case, one may always reduce the computational load by restricting the data set or the grid of tuning parameters.

library(caret)
library(glmnet)
library(pROC)

# Utilize our tuning data to determine the tuning parameters alpha and lambda
log_tuned_param <- with.random.seed(
                   train(# Drop the default column from tune, as default is our
                         # binary response
                         x = select(tune, -default),
                         y = tune$default,
                         method = "glmnet",
                         tuneGrid = expand.grid(alpha = seq(0, 1, .05),
                                                lambda = seq(0, 1, .05)),
                         metric = "ROC",
                         trControl = trainControl(method = "cv",
                                                  summaryFunction = twoClassSummary,
                                                  classProbs = TRUE,
                                                  number = 10)), seed = 5690)

# Estimate our model utilizing our training data set
log_tuned <- with.random.seed(
             glmnet(x = as.matrix(select(train, -default)),
                    y = train$default,
                    # Pass the estimated alpha and lambda stored in
                    # log_tuned_param$bestTune
                    alpha = log_tuned_param$bestTune$alpha,
                    lambda = log_tuned_param$bestTune$lambda,
                    family = "binomial",
                    standardize = FALSE), seed = 5690)

# Predict the outcome utilizing our test data set
prediction_log_tuned <- predict(log_tuned, newx = as.matrix(select(test, -default)),
                                type="response")

# Compute the ROC curve
roc_log_tuned <- roc(response = test$default, predictor = as.numeric(prediction_log_tuned))
# Import the results via load()
load("log_tuned.rdata")

# Return the best combination of alpha and lambda
log_tuned_param$bestTune

# Return the AUC 
roc_log_tuned$auc

Regarding our Lending Club data set data_b.rds, neither ridge regression nor the LASSO could improve our results, as the estimated lambda equals 0. Consequently, the coefficients of our logistic model have not been shrunk, and our AUC is similar to the full rank model of exercise 4.1.

As we could not tune our model to improve our results, another avenue for improvement is to search for a different model. Hence, the next exercise introduces decision trees as an alternative to logistic regression.

Exercise 4.3 -- Decision tree

As we could not improve our prediction accuracy by tuning logistic regression, it might be worthwhile to try something different. In our case, we will fit a decision tree.

Definition

Decision trees can be applied to both classification and regression problems. Due to the credit default setting, we omit the discussion of regression trees and opt for an applied approach rather than discussing mathematical details. For an in-depth breakdown of decision trees, please see Breiman (1984), which covers the construction of trees for both regression and classification problems in greater detail.

As before, let (k) be the number of available predictors (X_k) and (x_t) a realization of (X_k). A decision tree, or in our case a classification tree, chooses one of the predictors and divides the data set according to a condition based on this predictor. That could, for example, mean that a first set of observations meets the condition (x_t = 0 \ , \ x_t\in X_k), while the remaining observations meet the condition (x_t = 1). A classification tree applies this step over and over to other predictors (X_k), creating different regions (R_\ell\ \textrm{,} \ \ell \in \mathbb{N}). Each region (R_\ell) is distinct and does not overlap with another region. We refer to the conditions on our predictors (X_k) as internal nodes; if at some point no further split is pursued, we call the node a terminal node or leaf. Upon reaching a leaf, a classification tree predicts that every observation in this region belongs to the majority class. Put differently, any test observation is classified as the class that is predominant among the training observations in this region. If, for example, a leaf holds 20 training observations, of which 15 are 'Default', every test observation located within this exact region is classified as 'Default'.

Binary splitting

That begs two questions: how is the decision to split the training data set into regions (R_\ell) made, and what are possible truncation conditions? Most commonly, a decision tree resorts to a method called recursive binary splitting. This method is a greedy top-down approach: initially, all observations belong to a single region (R_0) (the whole data set). Next, we split our data set into two new regions (R_1) and (R_2). At this point, we can ignore (R_0), as the two new regions divide its observations amongst themselves. We then repeat the splitting step by step until the whole training set is divided into regions (R_\ell). We call recursive binary splitting greedy because each decision is made without regard to future splits: the algorithm confines itself to the current state only. It neither re-evaluates prior splits nor tries to anticipate future splits, which could eventually lead to a higher prediction power (through 'better' splitting). James et al. (2017) present three criteria to select the condition which splits a region:

The classification error rate (E) corresponds to the fraction of training observations in a region which are not assigned to the majority class. Here, (\widehat{\textrm{p}}_{\ell j}) corresponds to the proportion of training observations in the region (R_\ell) which belong to class (j).

[E = 1 - \max_{j}\Big(\widehat{\textrm{p}}_{\ell j}\Big)]

However, as stated in James et al. (2017, p.311), the classification error rate is better suited for pruning (removing regions from an existing tree) than as a decision criterion for growing the tree. While pruning will be discussed in more detail later, James et al. (2017) introduce two additional measures, the Gini index and entropy.

The Gini index measures the total variance across the (J) classes, with (\widehat{\textrm{p}}_{\ell j}) again equaling the proportion of training observations in the region (R_\ell) belonging to class (j). The Gini index can be defined as follows:

[G = \sum_{j=1}^J \widehat{\textrm{p}}_{\ell j}(1 - \widehat{\textrm{p}}_{\ell j})]

The Gini index is small if the class proportions are each either close to (1) or close to (0). This characteristic is referred to as node purity, as a small Gini index ensures that the vast majority of observations in a region (R_\ell) belongs to the same class.

The alternative coefficient is called entropy, which can be defined as follows:

[D = - \sum_{j=1}^J \widehat{\textrm{p}}_{\ell j} log\big(\widehat{\textrm{p}}_{\ell j}\big)]

Like the Gini index, entropy is a measure of node purity. If (\widehat{\textrm{p}}_{\ell j}) is close to (0) or (1), the resulting entropy will be close to (0), too.
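To make the three criteria tangible, the following minimal sketch computes them for a single region with two classes; the class proportions are assumed for illustration and do not stem from our data.

# Assumed class proportions within one region R_l (e.g. Default vs. Paid)
p_hat <- c(Default = 0.3, Paid = 0.7)

# Classification error rate: share of observations outside the majority class
E <- 1 - max(p_hat)

# Gini index: small if the region is pure
G <- sum(p_hat * (1 - p_hat))

# Entropy: likewise close to 0 for pure regions
D <- -sum(p_hat * log(p_hat))

c(error = E, gini = G, entropy = D)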

Truncation condition

Pertaining to the truncation conditions, one can think of two reasonable scenarios. On the one hand, we could establish a threshold before building the tree, which must then be met each time the algorithm decides whether to split or to keep the current regions. We could implement such a condition by demanding a Gini index of less than (0.1) from both new regions whenever the current region is split. The problem with this approach is that we can think of scenarios involving rather high Gini values initially and very low Gini values later. In that case we would never achieve distinct boundaries between classes, and the prediction accuracy would be rather low, because the strict truncation condition keeps our tree too small. On the other hand, the common approach is therefore to set a tolerant truncation condition, allow the tree to grow deep initially, and prune it afterwards. James et al. (2017) describe weakest link pruning (WLP) as one of the more efficient methods to prune a decision tree. The WLP idea is simple: our tree can be viewed as a multitude of subtrees. We denote a subtree as (\Gamma) and introduce a tuning parameter (\nu), which weighs the trade-off between complexity and the fit to the training observations. If we set (\nu = 0), the resulting decision tree tends to overfit, as the tree is not pruned at all and equals the original tree (\Gamma_0) (high variance). On the contrary, a high (\nu) prunes our original tree extensively, resulting in a small tree. That could yield a favorable variance, yet a high bias. James et al. (2017) point out that the whole methodology is closely related to the LASSO described in exercise 4.2: an optimal value of our tuning, respectively pruning, parameter must be chosen, since the penalty deliberately distorts the fit between predictors and response in favor of a simpler tree.
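Expressed as a formula, and following the cost-complexity criterion in James et al. (2017) adapted to the classification error (the indicator notation below is ours), weakest link pruning selects, for a given (\nu), the subtree (\Gamma \subseteq \Gamma_0) with regions (R_1, \dots, R_{|\Gamma|}) that minimizes:

[ \sum_{\ell=1}^{|\Gamma|} \sum_{x_i \in R_\ell} 1\big(y_i \neq \hat{y}_{R_\ell}\big) \ + \ \nu\,|\Gamma| ]

Here (|\Gamma|) denotes the number of leaves of the subtree and (\hat{y}_{R_\ell}) the majority class in region (R_\ell). For (\nu = 0), the unpruned tree (\Gamma_0) minimizes the criterion; the larger (\nu), the more additional leaves are penalized and the smaller the resulting tree.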

Application

Task: This code chunk imports our data set data_b.rds and divides the observations into the tune, train and test sets. Moreover, we import precomputed results, so we can present the models step by step.

library(caret)

# Import the data set data_b.rds
data = readRDS("data_b.rds")

# Generate the train_index
train_index <- with.random.seed(createDataPartition(data$default,
                              p = 0.8, list = FALSE), seed = 5690)

# Divide our data set into test (20%) and remaining observations (80%)
test <- data[-train_index,]
left <- data[train_index,]

# Generate the tune_index
tune_index <- with.random.seed(createDataPartition(left$default,
                              p = 0.125, list = FALSE), seed = 5690)

# Split our remaining observations into tune (10%) and training data (70%)
# of total observations
tune <- left[tune_index,]
train <- left[-tune_index,]
# Import the results via load()
load("tree.rdata")

We will grow a classification tree utilizing the rpart package. For more information, please consult the info box below:

info("Info: rpart()") # Run this line (Strg-Enter) to show info

Task: With the purpose of simplification in mind, we limit ourselves to a small number of predictors for the first tree. The following code chunks grow a tree and plot the model utilizing the rattle package with its fancyRpartPlot() function. The predictors match the variables selected by Emekter et al. (2015) similar to the logit model of exercise 4.1.

library(rpart)

# Grow the tree utilizing rpart()
tree <- with.random.seed(
        rpart(default ~ grade_B + grade_C + grade_D + grade_E + 
                        grade_F + grade_G + dti + mean_fico_range + 
                        revol_util,
              # Without tuning the tree, the data set left
              # corresponds to the training data set
              data = left, 
              # Grow a classification tree
              method = "class",
              control = rpart.control(minsplit = 20, 
                                      minbucket = 100, 
                                      cp = 0.0001)), 
        seed = 5690)
library(RColorBrewer)
library(rattle)

# Plot the classification tree
fancyRpartPlot(tree, 
               # Remove title and subtext
               main = "", sub = "",
               # Provide a color palette
               palette = "YlOrRd"
               )

We can see that our resulting classification tree has (10) internal and (11) terminal nodes or leaves. We can also decode the numbers shown. The top node, indicated by the (1), is referred to as the root. It shows that, regarding our response variable default, 80% of all loans were paid, and the 100% indicates that the root node contains the entire population. Underneath the root node, the first split is conducted: we divide our observations based on the loan grade. To be more precise, the algorithm determined a threshold of 0.5 for the loan grade 'E', which at first glance makes little sense, since grade_E is a binary dummy that can only take the values 0 or 1. Hence we interpret the threshold as follows: if the loan grade is not 'E' (< 0.5 and thereby 0), we travel down the left branch; if the loan grade is 'E' (> 0.5 and thereby 1), we travel down the right branch. We choose the left branch. This means, as mentioned above, that we keep all loans belonging to the other loan grades ('A', 'B', 'C', 'D', 'F' and 'G'). The next condition is whether the loan was assigned to grade 'D'. If the loan belongs to grade 'D', we travel down the right branch to arrive at node (5). This node is a terminal node and hence a so-called leaf. It accounts for 16 percent of our observations, of which 70 percent were paid. Following the classic majority vote, the classification tree predicts any observation of the test set which belongs to grade 'D' as 'Paid'. The remaining tree may be interpreted analogously.

To summarize, both loan grade 'D' and loan grade 'C' seem to be strong predictors in our classification tree. Regarding loan grade 'C', a 22% chance of default remains; hence, the classification tree predicts each loan rated 'C' as 'Paid'. Moreover, any loan rated 'A' or 'B' (see node (32)) is expected to be paid in the future. Before calculating the AUC, we will attempt to prune the tree.

Task: We can prune the classification tree by setting a higher complexity parameter cp. The following code chunk extracts the possible complexity parameters with their corresponding number of splits from tree and returns the results.

# Extract the complexity parameters and the number of splits from tree
cp <- data.frame(splits = tree$cptable[,2],
                 complexity = tree$cptable[,1])

# Return the results
cp

We can see that four possible complexity parameters cp exist. Recall that the previous classification tree tree has a complexity parameter of 0.0005 with ten splits. By forcing a higher complexity parameter, the resulting tree becomes less deep, as fewer splits are conducted. A complexity parameter above 0.0069841 maintains the root node only. Since we intend to visualize our results, viable complexity parameters are situated between 0.0005 and 0.006984102.

Task: This code chunk prunes tree. A complexity parameter of 0.00067 is specified in the function, which retains (4) splits. Besides, the resulting tree is plotted.

# Prune the previous classification tree
tree_pruned <- prune(tree , cp = 0.00067)

# Plot the pruned tree
fancyRpartPlot(tree_pruned, 
               # Remove title and subtext
               main = "", sub = "",
               # Provide a color palette
               palette = "YlOrRd"
               )

The previous model tree has been pruned. The remaining subtree tree_pruned retains only (4) internal nodes and (5) leaves. We can see that setting a higher complexity parameter results in a more streamlined design. Remember, all loans with grades 'A' and 'B' were formerly predicted as paid; now all loans assigned to loan grades 'A', 'B', 'C' and 'G' are classified as paid. Besides, node (8) alone accounts for 73 percent of total observations. Please consider these new findings when solving the quiz below.

! addonquizComparing classification trees


Remember, as opposed to utilizing the tuning data tune to determine the complexity parameter, we explicitly specified a value of 0.0001. The following task estimates the complexity parameter with the help of the train() function. Afterwards, we fit a tuned tree to our training observations and compare the AUC between the models.

Task: This code chunk estimates the complexity parameter cp and fits a new classification tree employing all variables as predictors. We further return the AUC of all three models.

# Utilize our tune data set to determine the complexity parameter
tree_param <- with.random.seed(
              train(x = select(tune, -default),
                    y = tune$default,
                    method = "rpart",
                    tuneGrid = expand.grid(cp = seq(0.0001, 0.0024, 0.0001)),
                    metric = "ROC",
                    trControl = trainControl(method = "cv",
                                             summaryFunction = twoClassSummary,
                                             classProbs = TRUE,
                                             number = 10)), seed = 5690)

# Apply the training data set
tree_tuned <- with.random.seed(
              rpart(default ~ .,
                    data = train, 
                    control = rpart.control(minsplit = 20, 
                                            minbucket = 100, 
                                            cp = tree_param$bestTune$cp),
                    method = "class"), seed = 5690)

# Predict the outcome utilizing our test data set
prediction_tree <- predict(tree, type = "prob", newdata = select(test, -default))
prediction_tree_pruned <- predict(tree_pruned, type = "prob", newdata = select(test, -default))
prediction_tree_tuned <- predict(tree_tuned, type = "prob", newdata = select(test, -default))

# Compute the ROC curve
roc_tree <- roc(test$default, prediction_tree[,1])
roc_tree_pruned <- roc(test$default, prediction_tree_pruned[,1])
roc_tree_tuned <- roc(test$default, prediction_tree_tuned[,1])
# Return the AUC
roc_tree$auc
roc_tree_pruned$auc
roc_tree_tuned$auc

We can see that the difference in AUC is striking. The original tree tree with grade, dti, mean_fico_range and revol_util as predictors achieves the highest prediction accuracy concerning the AUC. The pruned version of this classification tree, tree_pruned, would only perform better than chance if we inverted its predictions. Finally, the last classification tree, tree_tuned, which was tuned and trained using all available variables, achieves a lower AUC than the variables proposed by Emekter et al. (2015) alone. That is another reason to treat tuning with caution, as the simpler solution yielded a higher prediction accuracy.

To end this exercise, we shortly discuss the difference between decision trees and the more commonly known regression models presented in this problem set.

Comparing decision trees to our previous models

A tree-based approach does not attempt to fit a single function, which is the standard procedure of regression models. In our case, both the linear regression from exercise 3.2 and the logistic regressions from chapter 4 attempted to fit a function to the available training data. Decision trees instead divide the training observations into different regions (R_\ell). That can be beneficial if the data set does not have a linear structure and linear regression thus does not provide a good fit. For example, assume a relationship between the response variable (Y) and the predictors (X_k) which is non-linear and fairly complex; in this case, a tree-based method should provide a sound solution. In our case of predicting credit default, with a plethora of different variables at our disposal, a tree-based approach should perform well compared to conventional models. Our results, however, show that Lending Club's loan grades alone account for the majority of observations; the relationship between our response variable and our predictors is therefore remarkably simple, which explains our solid results. James et al. (2017) mention that a decision tree has two major drawbacks: a single tree often lacks prediction accuracy, and a deeply grown tree suffers from high variance. While the first detriment stems partly from the second one, the latter renders a tree susceptible to minor changes in the training data. Consequently, the prediction accuracy suffers, too, as the test observations differ from the training observations.

Since, in our case, the selected variables seem to be a main driver behind poor prediction accuracy, the next exercise introduces approaches that tackle this problem by growing multiple trees and averaging their results. This should increase the prediction accuracy compared to a single tree.

Exercise 4.4 -- Bagging and boosting

Bagging and boosting aim to achieve a lower variance, resulting in a lower test error and more precise predictions. They further share a common label, as both approaches are referred to as ensemble methods: an ensemble method combines the results of multiple models to achieve a better overall result.

Bagging

Bagging is short for bootstrap aggregation, which should sound familiar, as we already discussed the bootstrap in exercise 4.2. Let's briefly recap: the bootstrap creates new samples of data to estimate the variability of our estimators within our available data. That is achieved by repeatedly drawing observations with replacement from our original data set; each time, the selected observations constitute a new sample, yielding multiple different data sets. James et al. (2017) describe bagging as a 'general-purpose approach' to reduce the variance of statistical learning methods and motivate the procedure with the following explanation. Assume a total of (n) independent observations (Z_1, ..., Z_n) with a variance (\sigma^2). Then the variance of the mean (Var(\bar{Z})) of these observations is just (\sigma^2/n). Averaging the predictions of many models, each fit on a bootstrapped sample, reduces the variance in an analogous fashion; hence, bagging may significantly reduce the variance.

Regarding decision trees, bagging is exceptionally efficient. As discussed earlier, a deeply grown tree may exhibit high variance and low bias. Utilizing bagging, we can skip pruning our decision trees, since the deeper a tree grows, the lower its bias becomes. Consequently, we can grow multiple trees with very low bias and high variance. Keeping the bias-variance trade-off in mind, averaging over these decision trees significantly reduces the variance of the final prediction, while the bias stays low. The sketch below illustrates the procedure.
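The following minimal sketch performs bagging by hand on a small simulated data set with plain rpart() trees; it is merely an illustration of the mechanics, not the model we fit below: each tree is grown on a bootstrap sample, and the predicted class probabilities are averaged.

library(rpart)

# Simulate a small binary classification problem (illustrative only)
set.seed(1)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
sim <- data.frame(x1, x2,
                  y = factor(ifelse(x1 * x2 + rnorm(n, sd = 0.5) > 0,
                                    "Default", "Paid")))

B <- 25                                    # number of bagged trees
prob_default <- matrix(NA, nrow = n, ncol = B)

for (b in 1:B) {
  # Draw a bootstrap sample (with replacement) from the training data
  idx <- sample(n, replace = TRUE)
  # Grow a deep tree on the bootstrap sample: low bias, high variance
  tree_b <- rpart(y ~ x1 + x2, data = sim[idx, ], method = "class",
                  control = rpart.control(cp = 0, minbucket = 5))
  # Store the predicted probability of 'Default' for every observation
  prob_default[, b] <- predict(tree_b, newdata = sim, type = "prob")[, "Default"]
}

# Bagged prediction: average the B probabilities, then take the majority class
bagged_class <- ifelse(rowMeans(prob_default) > 0.5, "Default", "Paid")
mean(bagged_class == sim$y)                # share of correctly classified observations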

Out-of-bag error estimation

James et al. (2017, pp.318-319) describe another useful feature called out-of-bag error estimation. Let's consider an example. Assume a total of (T = 10) observations and (B = 2) bagged trees. In the first tree (B_1), observations 1, 3 and 10 (\big(t \in \{1,3,10\}\big)) were out-of-bag, as they were not drawn into the training data used to build this tree. In the second tree (B_2), observations 3, 6 and 7 were omitted from training the model. Hence, we can predict the response of the 3rd observation utilizing both bagged trees. In general, each bagged tree utilizes on average about two-thirds of the observations, which leaves roughly a third for testing. To put it in other words, we can predict a response for every observation by employing only those bagged trees for which it was out-of-bag.
Following James et al. (2017), this approach yields around (B/3) predictions for every observation (t). They further claim that the classification error computed from these out-of-bag predictions serves as a valid approximation of the test error.

Random forest

James et al. (2017, pp.319-321) state that a random forest is preferable to the classic bagging approach, as bagging produces highly correlated predictions. This is because only the samples are drawn with replacement, while the set of predictors considered at each split remains unrestricted. Hence, a strong predictor will almost always constitute the root node, resulting in very similar trees. A random forest tackles this problem by randomizing the predictors available at each split, so a different subset of predictors is considered every time. They further elaborate that averaging correlated quantities is not as effective as averaging uncorrelated ones and suggest a predictor sample of (q = \sqrt{k}), where (k) corresponds to the number of predictors. Despite their recommendation, one may still try different values of (q) to determine the best fit via train(), as we do below.
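The following minimal sketch contrasts bagging and a random forest via the mtry argument of randomForest() on simulated data (illustrative only, not our Lending Club set): mtry equal to the number of predictors (k) reproduces bagging, while (q = \sqrt{k}) yields a random forest. The built-in out-of-bag error from the previous subsection serves as the comparison.

library(randomForest)

# Simulate a small binary classification problem with k = 10 predictors
set.seed(1)
n <- 500; k <- 10
x <- as.data.frame(matrix(rnorm(n * k), ncol = k))
y <- factor(ifelse(x$V1 - x$V2 + rnorm(n) > 0, "Default", "Paid"))

# Bagging: every split may choose among all k predictors
bag <- randomForest(x = x, y = y, ntree = 200, mtry = k)

# Random forest: every split samples only sqrt(k) predictors
rf  <- randomForest(x = x, y = y, ntree = 200, mtry = floor(sqrt(k)))

# Out-of-bag error estimates after 200 trees
bag$err.rate[200, "OOB"]
rf$err.rate[200, "OOB"]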

Application

Task: As earlier, this code chunk imports our data set data_b.rds and splits the observations into tune, train and test set.

library(caret)

# Import the data set data_b.rds
data = readRDS("data_b.rds")

# Generate the train_index
train_index <- with.random.seed(createDataPartition(data$default,
                              p = 0.8, list = FALSE), seed = 5690)

# Divide our data set into test (20%) and remaining observations (80%)
test <- data[-train_index,]
left <- data[train_index,]

# Generate the tune_index
tune_index <- with.random.seed(createDataPartition(left$default,
                              p = 0.125, list = FALSE), seed = 5690)

# Split our remaining observations into tune (10%) and training data (70%)
# of total observations
tune <- left[tune_index,]
train <- left[-tune_index,]

Task: These code chunks tune a random forest with the help of the randomForest library. Moreover, we import our results. This way, we can present the models step-by-step.

library(randomForest)

# Utilize our tuning data to determine our tuning parameter mtry
# (number of selected predictors) 
randomForest_param <- with.random.seed(
                      train(x = select(tune, -default),
                            y = tune$default,
                            method = "rf",
                            tuneGrid = expand.grid(mtry = seq(2, 20, 2)), 
                            metric = "Accuracy",
                            trControl = trainControl(method = "oob",
                                                     classProbs = TRUE)), 
                                                     seed = 5690)

# Grow the random forest
randomForest <- randomForest(y = train$default,
                             x = select(train, -default),
                             # Utilizing OOB requires a lot of available RAM,
                             # hence the small number of trees
                             ntree = 750,
                             # Pass the estimated number of sampled predictors
                             mtry = randomForest_param$bestTune$mtry,
                             importance = TRUE)

# Predict the outcome utilizing our test data set
prediction_randomForest <- predict(randomForest, type = "prob", 
                                   newdata = select(test, -default))

# Compute the ROC curve
roc_randomForest <- roc(test$default, prediction_randomForest[,1])

# Extract the Mean Decrease in Gini Index
mdg_randomForest <- randomForest::importance(randomForest)
# Import the results via load()
load("rf_ps.rdata")

# Return the AUC
roc_randomForest$auc

Our results show an improvement compared to the single classification tree from exercise 4.3: the random forest with 750 trees achieves a higher prediction accuracy than the unpruned tree inspired by Emekter et al. (2015). To be more exact, the difference in AUC is five percentage points. Nonetheless, the time required to grow the forest is roughly 1440 times higher and amounts to almost 8 hours. Compared to the logistic models, the random forest performs slightly better concerning the AUC, while the computation time trade-off persists.

Another intriguing question is which variables served best as predictors across all trees. The following task creates a variable importance plot ordered by the mean decrease in the Gini index. Recall that the Gini index is a measure of node purity: the more a variable contributes to decreasing the Gini index, the better, since purer nodes lead to higher prediction accuracy.

Task: This code chunk transforms the matrix mdg_randomForest, which stores all variables and their respective mean decrease in Gini index to a data.frame. Moreover, a variable importance plot depicts our results.

# Extract the predictors and their importance from mdg_randomForest
mdg <- data.frame(variable = row.names(mdg_randomForest),
                  importance = mdg_randomForest[,4])
mdg %>%
  # Select the top 15 predictors 
  top_n(15) %>%
  # Sort them decreasingly by their mean decrease in gini
  arrange(-importance) %>%
  # Factorize with the according levels
  mutate(variable = factor(variable, levels = variable)) -> mdg

# Plot the results
ggplot(mdg, aes(x = variable,
                y = importance)) +
  geom_bar(stat="identity", position = position_dodge(),
           fill = c("#FFC100", "#463296", "#FA8C2D", "#460564", "#0072B2", 
                    "#6E1E7D", "#CC79A7", "#ED5973", "#39d9db", "#980043",
                    "#FFC100", "#BE1E4B", "#009E73", "#E6A901", "#039646")) + 
  # Tilt the ordinate labels and adjust the abscissa
  theme(axis.text.x = element_text(angle = 90, hjust = .5),
        axis.text.y = element_text(size = 8, hjust = 1)) +
  # Adjust the abscissa tick labels
  scale_y_continuous(labels = c(0, 1, 2, 3, 4, 5)) +
  # Labels
  labs(x = "Variables", y = "Mean Decrease in Gini [Thousand]") -> g
  ggplotly(g)


Figure 4.3 Regarding the variables utilized to grow the forest, int_rate is the key player. This should not be surprising, as the interest rate is closely tied to the loan grade assigned by Lending Club. Moreover, in accordance with the logistic model from Emekter et al. (2015), the debt-to-income ratio dti and the revolving credit line utilization revol_util are important predictors, too. Interestingly, the FICO score does not belong to the top 15 predictors; instead, the installment and the annual income annual_inc are part of the model. Besides, the model mostly contains credit limits regarding the revolving balance or bankcards, as well as some time-related variables concerning the difference between the oldest account and the issue date of the loan.

Boosting

Boosting continues the idea of ensemble learning by attempting to merge multiple weak predictors into one reliable predictor. It can be considered a refinement of the ensemble idea behind bagging, since it utilizes the information gained from previous trees to improve future results. To be more precise, the 'best possible' model is grown by shifting weight towards observations which were falsely predicted in the past.

Even though there is a variety of different boosting algorithms, this problem set focuses on the gradient boosting approach developed by Friedman (2001; 2002). His work was inspired by Freund and Schapire (1997) as well as Kearns and Valiant (1989), who introduced the idea of an optimization algorithm with a cost function. The latter established the concept of weak learning, which begged the question whether a combination of several weak predictors can yield the same or even better results than a single strong predictor. Regarding decision trees, gradient boosting grows tree after tree iteratively in order to develop a better model. Each tree is trained to predict the errors of the prior trees, resulting in a final ensemble with a higher prediction accuracy. Put differently, weak predictors are combined in an ensemble, with each new one trained on the remaining error (falsely classified predictions) of the current model; afterwards, the information of the new tree is added to the previous ones. Hence, the regions in which the algorithm performed worst in the past become more significant at each iteration, which reduces the overall error and results in a higher prediction accuracy.
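The idea of training each new tree on the remaining error of the current ensemble can be sketched with a stripped-down, squared-error version of gradient boosting (simulated data, shallow rpart() trees; the actual xgboost algorithm uses a different loss and adds regularization):

library(rpart)

# Simulate a simple regression problem (illustrative only)
set.seed(1)
n   <- 500
x   <- runif(n, -3, 3)
y   <- sin(x) + rnorm(n, sd = 0.3)
sim <- data.frame(x, y)

eta  <- 0.1                    # learning rate: contribution of each tree
B    <- 100                    # number of boosting iterations
pred <- rep(mean(y), n)        # start from a constant prediction

for (b in 1:B) {
  # Remaining error of the current ensemble
  resid_b <- y - pred
  # Fit a shallow tree to the residuals
  stump <- rpart(resid_b ~ x, data = data.frame(x, resid_b),
                 control = rpart.control(maxdepth = 2))
  # Add a damped version of the new tree to the ensemble
  pred <- pred + eta * predict(stump, newdata = sim)
}

mean((y - pred)^2)             # training error shrinks as trees are added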

We will implement our gradient boosting approach with the help of the xgboost library in R, which improves upon the gradient boosting algorithm introduced in Friedman (2001). In more detail, xgboost() counteracts overfitting and optimizes the run time, for example through a truncation condition which ends the algorithm if the result cannot be improved any further. Chen (2014) in conjunction with Chen and Guestrin (2016) provide an in-depth discussion and the mathematical details.

Task: The following code chunk implements a gradient boosting model utilizing xgboost(), which, according to DMLC (2017), is a way to create state-of-the-art machine learning solutions. The distributed machine learning community (DMLC) further lists multiple first-prize winners in machine learning competitions who utilized the xgboost implementation of gradient boosting.

As previously, our model is tuned on our tune data set. We estimate the maximum tree depth, which also bounds the depth of variable interactions, and the learning rate eta. The learning rate decides how much each tree contributes to the current model; a low learning rate should be paired with a higher number of trees, since each tree then contributes only a tiny part and the final model develops very slowly. Besides, the splitting criterion gamma and the subsampling ratios must be specified. Restricting the subsampling ratio is optional and speeds up the computation, as not all of the available training data is applied. To limit the runtime, both the number of trees and the minimum child weight, which prevents tiny trees, have been fixed to 1000 and 1 respectively.

library(xgboost)

# Utilize the tuning data to determine the tuning parameters 
gbm_param <- with.random.seed(
             train(y = tune$default,
                   x = as.matrix(select(tune, -default)),
                   method = "xgbTree",
                   trControl = trainControl(method = "cv", 
                                            number = 10, 
                                            # Save losses across models
                                            returnResamp = "all",
                                            classProbs = TRUE),
                   tuneGrid = expand.grid(# Number of trees to grow
                     nrounds = 1000,
                     # Learning rate
                     eta = c(0.01, 0.001, 0.0001),
                     # Maximum tree depth
                     max_depth = c(5, 10, 20),
                     # Minimum loss reduction (the larger gamma,
                     # the less restricted)
                     gamma = c(0.33, 0.66, 1),
                     # Sub sample ratio of columns in each tree
                     colsample_bytree = c(0.5, 1),
                     # Avoids very small nodes
                     min_child_weight = 1,
                     # Ratio of data utilized to train the model
                     # A lower ratio may prevent overfitting, we 
                     # utilize the whole tuning data set
                     subsample = 1)
                     ), seed = 5690)


# Estimate our model utilizing our training data set
gbm <- with.random.seed(
       xgboost(data = as.matrix(select(train, -default)),
               # Coerce the response variable to numeric
               label = ifelse(train$default == "Default", 1, 0),
               params = list(# binary classification
                 objective = "binary:logistic",
                 eta = gbm_param$bestTune$eta,
                 gamma = gbm_param$bestTune$gamma,
                 max_depth = gbm_param$bestTune$max_depth,
                 min_child_weight = gbm_param$bestTune$min_child_weight,
                 subsample = gbm_param$bestTune$subsample,
                 colsample_bytree = gbm_param$bestTune$colsample_bytree,
                 # Metric
                 eval_metric = "auc"),
               nrounds = gbm_param$bestTune$nrounds,
               # Return progress
               verbose = TRUE,
               # Truncation condition, if no further improvement after 250 trees
               early_stopping_rounds = 250), seed = 5690)

# Predict the outcome utilizing our test data set
prediction_gbm <- predict(gbm,
                          # Transform the data.frame to a sparse matrix
                          xgb.DMatrix(as.matrix(select(test, -default)), 
                          # Coerce the response variable to numeric
                          label = ifelse(test$default == "Default", 1, 0)))

# Compute the ROC curve
roc_gbm <- roc(test$default, prediction_gbm)

# Compute the importance matrix
importance_mat <- xgb.importance(colnames(train), model = gbm)
# Import the results via load()
load("xgbm.rdata")

# Return the AUC
roc_gbm$auc 

The gradient boosting machine is by far the most time-consuming algorithm presented in this problem set. The required time is, however, subject to the supplied tuning parameters and hence must be interpreted with care. Given that a logistic regression is estimated within a few minutes, it is fair to have high expectations towards the results. As anticipated, the gradient boosting machine achieves the highest prediction accuracy: compared to logistic regression, the AUC is 2-5 percentage points higher. Still, as suggested earlier, one must weigh costs and benefits to determine in advance whether computation time or prediction accuracy has priority.

Similar to the randomForest package, xgboost takes variable importance into account. Tong (2016) describes that the gain of a predictor is calculated based on its contribution at each node across all trees. He reports this procedure as useful if the number of predictors is too large to deduce sensible patterns within the training data otherwise. We can utilize the computed quantity gain to compare the selected variables of the random forest and the gradient boosting model, and thereby assess which variables were the best predictors across both models.

Task: This code chunk utilizes the importance matrix importance_mat of xgboost to plot the relative share of the top 15 variables, which contributed the most to all trees.

library(Ckmeans.1d.dp)

# Extract the predictors and their importance (gain) from importance_mat
imp <- data.frame(variable = importance_mat$Feature,
                  gain = importance_mat$Gain)

imp %>%
  # Select the top 15 predictors 
  top_n(15) %>%
  # Order them decreasingly by their importance
  arrange(-gain) %>%
  # Factorize the variable column with the according levels
  mutate(variable = factor(variable, levels = variable)) -> imp

# Plot the results
ggplot(imp, aes(x = variable,
                y = gain)) +
  geom_bar(stat="identity", position = position_dodge(),
           fill = c("#FFC100", "#463296", "#FA8C2D", "#460564", "#0072B2", 
                    "#6E1E7D", "#CC79A7", "#ED5973", "#39d9db", "#980043",
                    "#FFC100", "#BE1E4B", "#009E73", "#E6A901", "#039646")) + 
  # Tilt the ordinate labels
  theme(axis.text.x=element_text(angle = 90, hjust = .5)) +
  # Labels
  labs(x = "Variables", y = "Gain / Importance [%]") -> g
  ggplotly(g)


Figure 4.4 We can see that, similar to the random forest, int_rate is the most important predictor. In addition, the loan term term, which can either be 36 or 60 months, seems to be a key determinant regarding credit default. Apart from that, most variables are congruent with the random forest, including the debt-to-income ratio dti and the total credit limit tot_hi_cred_limit. It is striking that the average FICO score mean_fico_range, which was a strong predictor in our past models, is here again among the top 5. Besides, the year_2015 variable is a notable predictor, which could stem from the fact that default rates were high during this time (see figure 2.2).

The final exercise summarizes our findings and provides a short outlook.

Exercise 5 -- Conclusion

This thesis aimed to introduce and present statistics and the application of machine learning to credit default in an accessible and reproducible manner. We began in exercise 2 by exploring the Lending Club data set at hand. We gained insights into the distribution of the credit scores of borrowers as well as the loan volume and the total number of loans issued per state. The critical takeaway from exercise 2.1 was how quickly and cleanly knowledge can be derived from the data. Subchapter 2.2 continued this idea by introducing text mining as an alternative approach. We discovered that approximately 80% of loans were taken out to consolidate old liabilities and to refinance credit cards.

Exercise 3 laid the foundation for chapter 4, as the first subchapter introduced the necessary vocabulary in combination with some mathematical prerequisites. The second subchapter shed light on the bias-variance trade-off, which is a critical determinant of prediction accuracy. To illustrate it, polynomial regressions were applied to a simulated data set, demonstrating the significance of both bias and variance when fitting a model.

Finally, exercise 4 utilized logistic regression as well as tree-based approaches to predict credit default. Our findings suggest that even though the prediction accuracy increased through the application of bagging and boosting, both the time and the computational power required grew dramatically. In this case, standard logistic regression approaches retain their right to exist, as they may even outperform more sophisticated methods.

The following table presents a summary of our algorithms:

Model | AUC | Run time | Rating: prediction accuracy / run time / interpretability
Logit model (Emekter) | 0.70 | 40 s | +++++++++++
Logit model (Full rank) | 0.73 | 100 s | ++++++++++++
Logit model (Tuned) | 0.73 | 25 min | ++++++++++
Classification Tree (Emekter) | 0.68 | 20 s | +++++++++++++
Classification Tree (Emekter, Pruned) | 0.38 | 20 s | +++++++++++
Classification Tree (Tuned) | 0.64 | 9 min | +++++++++++
Random Forest | 0.74 | 8 h | +++++++++
Regularized Gradient Boosting | 0.75 | 15 h | +++++++++

The table comprises all algorithms showcased in the problem set. Additionally, the prediction.R script provides a plethora of alternatives, including the use of different packages to implement the models, e.g. utilizing ranger or caret in combination with rpart to grow a random forest. Moreover, prediction.R presents optional machine learning algorithms such as Adaptive Boosting (AdaBoost). This boosting algorithm shifts more emphasis to wrongly predicted observations by assigning weights at each iteration; hence, the sample distribution changes, as the weights of falsely predicted observations are increased, while those of correctly predicted observations are decreased. A compact formulation of this weight update follows below.
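To sketch the weight shift compactly (our notation, following the classical AdaBoost scheme of Freund and Schapire 1997): after iteration (m), with weak learner (G_m) and weighted error rate (err_m), each observation weight is updated as

[ w_i \leftarrow w_i \cdot exp\Big(\alpha_m \, 1\big(y_i \neq G_m(x_i)\big)\Big) \qquad \textrm{with} \qquad \alpha_m = log\Big(\frac{1-err_m}{err_m}\Big) ]

so misclassified observations carry larger weights in the next iteration, while all weights are subsequently renormalized to sum to one.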

Outlook

As discussed earlier, predictions should be challenged further. Since this problem set solely focuses on a single measure of prediction accuracy, other supplementary measures could be computed. This involves an adaption of the probability threshold or cost-based approaches as described in Smits, Dellepiane and Schowengerdt (2010) to improve results. Apart from that, Domingos (1999) introduces a general approach called MetaCost. MetaCost relabels classes subject to their optimal classification costs; that means the predictions are modified by applying the error costs after generating a probabilistic classifier. Additionally, Kruchten (2016) describes an approach of augmenting the AUC by attaching costs to the different classifications: e.g., a true positive yields a certain net benefit, while a false negative may involve costs, as discussed earlier. By linking costs and their probabilities, one can maximize the expected utility rather than merely minimizing the probability of default. This perspective represents not only the economic point of view but also economic efficiency and hence a valid measure of profitability.
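To make the cost-based idea concrete, the following minimal sketch uses entirely assumed payoffs (illustrative figures, not Lending Club costs): given predicted default probabilities, a loan is only funded if its expected payoff is positive.

# Assumed payoff structure per funded loan (illustrative figures only)
benefit_paid <- 150     # net gain if the loan is paid back
loss_default <- -600    # net loss if the loan defaults

# Predicted default probabilities, e.g. taken from one of the models above
p_default <- c(0.05, 0.12, 0.25, 0.40)

# Expected payoff of funding each loan
expected_payoff <- (1 - p_default) * benefit_paid + p_default * loss_default

# Fund a loan only if its expected payoff is positive
# (implied probability threshold: 150 / 750 = 0.2 instead of 0.5)
data.frame(p_default, expected_payoff, fund = expected_payoff > 0)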

Continuing with the economic standpoint, the problem set only discusses the question whether a borrower defaults or not; the time of default remains entirely unregarded. Lenders, however, are not only exposed to the risk of default, as other risks exist, such as the risk of prepayment. Becketti (1989) as well as Homburger and Phillips (1989) claim that prepayment renders investments less attractive for two reasons. First, movements in interest rates may incline borrowers to take out a new loan to consolidate the old one, so the old lenders lose their interest income. Second, the cash flows become unpredictable, which is a huge drawback in times of economic turmoil. As Lending Club does not impose prepayment penalties on their borrowers, prepayment would be an interesting research opportunity. Moreover, Bluhm et al. (2002) and Lenz (2016) decompose credit risk not only into the probability of default but also the recovery rate and the exposure at default. While the first quantity does not affect our analysis, as only terminated loans have been analyzed, the second one does: an early default implies a higher exposure at default, which remains unregarded as the time of default is not incorporated in our analysis. In addition, Milne and Parboteeah (2016) discuss liquidity risks as well as ways to avoid fraud in peer-to-peer lending. All in all, several questions regarding credit default remain open and are thus subject to further investigation.

Besides further research, as hinted at before, the next logical step is to rank loans by their profitability. Apart from establishing profitability via a utility function, other frameworks can be applied. We could, for example, ascertain the costs involved and follow up with an investment strategy. This, however, often requires the cost structure, which is not publicly available, and is thus best suited for a corporate environment. Such a strategy could serve as a criterion to select loans subject to their profitability instead of a low probability of default. An example of this approach is Singh, Gopal and Li (2008), who based their investment strategy on a decision tree in conjunction with the return on investment (ROI) of loans issued by Prosper.

Awards

The following task returns all your acquired achievements throughout solving the problem set.

Task: To show all your achievements, proceed with edit and then check. Regarding the whole problem set, a total of (9) awards could be obtained.

awards(as.html = TRUE)

If you enjoyed solving this problem set, please consider visiting RTutor/problemsets for additional problem sets covering other econometric problems.

Exercise 6 -- References and credits

R and packages in R

  • Chamberlain, S. and Ooms, J. (2017): geojson. "Classes for 'GeoJSON'", R package version 0.1.2 https://cran.r-project.org/web/packages/geojson/index.html
  • Chamberlain, S. and Teucher, A. (2017): geojsonio. "Convert Data from and to 'GeoJSON' or 'TopoJSON'", R package version 0.3.8 https://cran.r-project.org/web/packages/geojsonio/index.html
  • Chen, T., He, T., Benesty, M., Khotilovich, V. and Tang, Y. (2017): xgboost. "Extreme Gradient Boosting", R package version 0.6-4 https://cran.r-project.org/web/packages/xgboost/index.html
  • Cheng, J., et al. (2017): leaflet. "Create Interactive Web Maps with the JavaScript 'Leaflet' Library", R package version 1.1.0 https://cran.r-project.org/web/packages/leaflet/index.html
  • De Queiroz, G., Keyes, O., Robinson, D. and Silge, J. (2017): tidytext. "Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools", R package version 0.1.4 https://cran.r-project.org/web/packages/tidytext/index.html
  • Fellows, I. (2014): wordcloud. "Word Clouds", R package version 2.5 https://cran.r-project.org/web/packages/wordcloud/index.html
  • Friedman, J., Hastie, T., Simon, N., Qian, J. and Tibshirani, R. (2017): glmnet. "Lasso and Elastic-Net Regularized Generalized Linear Models", R package version 2.0-10 https://cran.r-project.org/web/packages/glmnet/index.html
  • Hlavac, M. (2015): stargazer. "Well-Formatted Regression and Summary Statistics Tables", R package version 5.2 http://CRAN.R-project.org/package=stargazer
  • Kranz, S. (2015): RTutor. "Creating R problem sets with automatic assessment of student's solutions", R package version 2015.12.16 https://github.com/skranz/RTutor
  • Kuhn, M., et al. (2017): caret. "Classification and Regression Training", R package version 6.0-76 https://cran.r-project.org/web/packages/caret/index.html
  • Neuwirth, E. (2014): RColorBrewer. "ColorBrewer Palettes", R package version, 1.1-2 https://cran.r-project.org/web/packages/RColorBrewer/index.html
  • R Development Core Team (2017): R. "A language and environment for statistical computing, R Foundation for Statistical Computing", Vienna, Austria. http://www.r-project.org
  • R Development Core Team and contributors worldwide (2017): stats. "R statistical functions", R package version 3.5.2
  • Ridgeway, G., et al. (2017): gbm. "Generalized Boosted Regression Models", R package version 2.1.3 https://cran.r-project.org/web/packages/gbm/index.html
  • Robinson, D. (2017): widyr. "Widen, Process, then Re-Tidy Data", R package version 0.1.0 https://cran.r-project.org/web/packages/widyr/index.html
  • Robin, X., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J., Mueller, M. and Siegert, S. (2017): pROC. "Display and Analyze ROC Curves", R package version 1.10.0 https://cran.r-project.org/web/packages/pROC/index.html
  • RStudio (2017): htmltools. "Tools for HTML", R package version 0.3.6 https://cran.r-project.org/web/packages/htmltools/index.html
  • Sievert, C., Parmer, C., Hocking, T., Chamberlain, S., Ram, K., Corvellec, M., Despouy, P., and Plotly Technologies Inc. (2017): plotly. "Create Interactive Web Graphics via 'plotly.js'", R package version 4.7.1 https://cran.r-project.org/web/packages/plotly/index.html
  • Song, J. and Wang, H. (2017): Ckmeans.1d.dp."Optimal and Fast Univariate Clustering", R package version 4.2.1 https://cran.r-project.org/web/packages/Ckmeans.1d.dp/index.html
  • Therneau, T., Atkinson, B. and Ripley, B. (2017): rpart. "Recursive Partitioning and Regression Trees", R package version 4.1-11 https://cran.r-project.org/web/packages/rpart/index.html
  • Vanderkam, D., et al. (2017): dygraphs. "Interface to 'Dygraphs' Interactive Time Series Charting Library", R package version 1.1.1.4 https://cran.r-project.org/web/packages/dygraphs/index.html
  • Wickham, H. and Francois, R. (2017): dplyr. "A Grammar of Data Manipulation", R package version 0.7.1 http://cran.r-project.org/web/packages/dplyr/index.html
  • Wickham, H., Chang, W. and RStudio (2016): ggplot2. "An Implementation of the Grammar of Graphics", R package version 2.2.1 http://cran.r-project.org/web/packages/ggplot2/index.html
  • Wickham, H., Henry, L. and RStudio (2017): tidyr "Easily Tidy Data with 'spread()' and 'gather()' Functions", R package version 0.7.1 https://cran.r-project.org/web/packages/tidyr/index.html
  • Williams, G., et al. (2017): rattle. "Graphical User Interface for Data Science in R", R package version 5.1.0 https://cran.r-project.org/web/packages/rattle/index.html
  • Wright, M. (2017): ranger. "A Fast Implementation of Random Forests", R package version 0.8.0 https://cran.r-project.org/web/packages/ranger/index.html

Bibliography

  • Bair, E. and Tibshirani, R. (2004). Semi-Supervised Methods to Predict Patient Survival from Gene Expression Data. PLoS Biol. 2(4): e108.
  • Becketti, S. (1989). The prepayment risk of mortgage-backed securities. Economic Review. Feb: pp.43-57.
  • Bishop, C. (2006). Pattern Recognition and Machine Learning. New York: Springer
  • Bluhm, C., Overbeck, L. and Wagner, C. (2002). An Introduction to Credit Risk Modeling. Chapman & Hall/CRC Financial Mathematics Series. CRC Press.
  • Breiman, L. (1984). Classification and regression trees. Belmont, CA: Wadsworth International Group.
  • Brignall, M. (2017). Peer-to-peer lending promised 6%, but I've been left red-faced and in the red. The Guardian, [online]. Available at: https://theguardian.com/ [Accessed 08 Aug. 2017].
  • Burnham, K. and Anderson, D. (2002). Model selection and multimodel inference: a practical information-theoretic approach. Second edition. New York: Springer.
  • Carmichael, D. (2014). Modeling Default for Peer-to-Peer Loans. Social Science Research Network Working Paper Series.
  • Chen, T. (2014). Introduction to Boosted Trees. [online] Available at: https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf [Accessed 28 Nov. 2017].
  • Chen, T. and Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. ArXiv e-prints. Available from http://www.kdd.org/kdd2016/papers/files/rfp0697-chenAemb.pdf.
  • Clements, N. (2015). The Risks of Refinancing Student Loans, Credit Cards and Mortgages. Forbes, [online]. Available at: https://www.forbes.com/ [Accessed 20 Aug. 2017].
  • Cohen, I., Cozman, F., Sebe, N., Cirelo, M. and Huang, T. (2004). Semisupervised learning of classifiers: theory, algorithms, and their application to human-computer interaction. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 26, Iss. 12: pp.1553-1566.
  • Czepiel, S. (n.d.). Maximum likelihood estimation of logistic regression models: theory and implementation. Available from https://czep.net/stat/mlelr.pdf.
  • Dalpiaz, D. (2017). R for Statistical Learning. https://daviddalpiaz.github.io/r4sl/
  • Dickler, J. (2017). Average FICO score hits an all-time high. CNBC. [online] Available at: https://www.cnbc.com/ [Accessed 19 Dec. 2017]
  • Distributed (Deep) Machine Learning Community. (2017). Awesome XGBoost. [online]. Available at: https://github.com/dmlc/xgboost/tree/master/demo#machine-learning-challenge-winning-solutions/ [Accessed 28 Nov. 2017]
  • Domingos, P. (1999). MetaCost: A general method for making classifiers cost-sensitive. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 155-164.
  • Duffy, E. and Santner, T. (1989). On the small sample properties of norm-restricted maximum likelihood estimators for logistic regression models. Communications in Statistics-Theory and Methods, Vol. 18, Iss. 3, pp.959-980.
  • Emekter, R., Tu, Y., Jirasakuldech, B., and Lu, M. (2015). Evaluating credit risk and loan performance in online peer-to-peer (p2p) lending. Applied Economics, Vol. 47(1), pp.54-70.
  • Experian.com. (2017). Infographic: What are the different Credit Scoring Ranges. [online] Available at: http://experian.com [Accessed 13 Aug. 2017]
  • Fahrmeir, L., T. Kneib, S. Lang, and B. Marx (2013). Regression: Models, Methods and Applications. Berlin Heidelberg: Springer.
  • Fielding, A. and Bell, J. (1996). A review of methods for the assessment of prediction errors in conservation presence/absence models. Environmental Consideration. Vol. 24, Iss. 1: pp.38-49
  • Fitzmaurice, G., Laird, N. and Ware, J. (2011). Applied longitudinal analysis. Hoboken, Wiley. http://public.eblib.com/choice/publicfullrecord.aspx?p=1051443.
  • Freund, Y. and Schapire, R. (1997) A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences. 55(1): pp.119-139.
  • Friedman, J. (2001). Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29(5): pp.1189-1232.
  • Friedman, J. (2002). Stochastic Gradient Boosting. Computational Statistics and Data Analysis 38(4): pp.367-378.
  • Friedman, J., Hastie, T. and Tibshirani, R. (2008). Regularized paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1).
  • Greene, W. (2002). Econometric analysis. Upper Saddle River, NJ: Pearson.
  • Guerzhoy, M. (2016). Ridge Logistic Regression for Preventing Overfitting. STA303/STA1002: Methods of Data Analysis II, Summer 2016 [online] Available at: http://www.utstat.toronto.edu/~guerzhoy/303/lec/lec8/ridge.pdf [Accessed 15 Nov. 2017].
  • Hastie, T., Friedman, J. and Tibshirani, R. (2017). The elements of statistical learning. New York: Springer.
  • Homburger, C. and Phillips, M. (1989). What You See Is Not Always What You Get: The Enforceability of Loan Prepayment Penalties. The John Marshall Law Review, Vol. 23, Iss. 1(2), 1989.
  • Horton, B. (2016). Calculating AUC: the area under a ROC Curve. Revolutions, [blog] 22 November. Available at: http://blog.revolutionanalytics.com/2016/11/calculating-auc.html [Accessed 01 Dec. 2017].
  • Hosmer, D., Lemeshow, S. and Sturdivant, R. (2013). Applied Logistic Regression. Hoboken, Wiley. pp.434-438.
  • James, G., Witten, D., Hastie, T. and Tibshirani, R. (2017). An introduction to statistical learning. New York: Springer.
  • Kearns, M. and Valiant, L. (1989). Cryptographic limitations on learning Boolean formulae and finite automata. Proceedings of the 21st Annual ACM Symposium on Theory of Computing, pp.433-444.
  • Krishnapuram, B., Carin, L., Figueiredo, M. A. and Hartemink, A. J. (2005). Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, Iss. 6, pp.957-968.
  • Kruchten, N. (2016). Machine Learning Meets Economics. Machine Learning Database, [blog] 27 January. Available at: http://blog.mldb.ai/blog/posts/2016/01/ml-meets-economics/ [Accessed 05 Dec. 2017].
  • Laurent, L. (2016). Peer to Fear. Bloomberg, [online]. Available at: https://bloomberg.com/ [Accessed 08 Aug. 2017].
  • Le Cessie, S. and Van Houwelingen, J. (1992). Ridge Estimators in Logistic Regression. Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 41, Iss. 1: pp.191-201.
  • Lendingclub.com. (2017). Peer to Peer Lending & Alternative Investing. [online] Available at: http://lendingclub.com [Accessed 19 July 2017].
  • Lenz, R. (2016). Peer-to-Peer Lending: Opportunities and Risks. European Journal of Risk Regulation 7(4). pp.688-700.
  • Li, Z., Yao, X., Wen, Q., and Yang, W. (2016). Prepayment and default of consumer loans in online lending. Social Science Research Network. Available at SSRN: https://ssrn.com/abstract=2740858
  • Loh, P., and Wainwright, M. (2011). High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. Advances in Neural Information Processing Systems, Vol. 24. pp. 2726-2734.
  • Meier, L., Van De Geer, S. and Buehlmann, P. (2008). The group lasso in logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), Vol. 70, Iss. 1: pp.53-71.
  • Milne, A. and Parboteeah, P. (2016). The Business Models and Economics of Peer-to-peer Lending. ECRI Research Report, Vol. 17.
  • Rind, V. (2017). Pros and Cons of Peer-to-Peer Lending. GOBankingRates, [online]. Available at: https://gobankingrates.com/ [Accessed 07 Aug. 2017].
  • Ruppert, D., Wand, M. and Carroll, R. (2010). Semiparametric regression. Cambridge: Cambridge University Press.
  • Serrano-Cinca, C., Gutierrez-Nieto, B. and Lopez-Palacios, L. (2015). Determinants of default in P2P lending. PloS one, 10(10), e0139427.
  • Shaw, E. (2015). After 10 years, has it paid to be a peer-to-peer lender? The Guardian, [online]. Available at: https://theguardian.com/ [Accessed 25 Dec. 2017].
  • Silge, J. and Robinson, D. (2017). Text Mining with R: A Tidy Approach. O'Reilly Media.
  • Singh, H., Gopal, R. and Li, X. (2008). Risk and Return of Investments in Online Peer-to-Peer Lending. University of Texas. pp.1-6.
  • Smits, P., Dellepiane, S. and Schowengerdt, R. (2010). Quality assessment of image classification algorithms for land-cover mapping: A review and a proposal for a cost-based approach. [online] Available at: http://www.tandfonline.com/doi/abs/10.1080/014311699212560 [Accessed 5 Dec. 2017].
  • Tibshirani, R. (2013). Modern regression 1: Ridge regression. Data Mining: 36-462/36-662 [online] Available at: http://www.stat.cmu.edu/~ryantibs/datamining/lectures/16-modr1.pdf [Accessed 18 Nov. 2017].
  • Tong, H. (2016). An Introduction to XGBoost R package. Distributed (Deep) Machine Learning Community, [online]. Available at: http://dmlc.ml/rstats/2016/03/10/xgboost.html [Accessed 02 Dec. 2017].
  • Van Buuren, S. and Groothuis-Oudshoorn, K. (2011). Mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, Vol. 45, Iss. 3.
  • Verstein, A. (2011). The Misregulation of Person-to-Person Lending. UC Davis Law Review, Vol. 45, No. 2, 2011.
  • Wang, H., Greiner, M. and Aronson, J.E. (2009). People-to-People Lending: The Emerging E-Commerce Transformation of a Financial Market. Value Creation in E-Business Management, pp.182-195.
  • Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. New York: Springer.
  • Wickham, H. (2015). Advanced R. Boca Raton, FL: CRC Press.
  • Wickham, H. and Grolemund, G. (2017). R for data science. Sebastopol, CA: O'Reilly Media.

Credits

First and foremost, I would like to thank my mentoring professor Sebastian Kranz, who continuously provided me with new ideas and input. As a student, I missed only a single lecture from your portfolio, and I am confident that I would not be as interested in economics were it not for the enthusiasm you never fail to express in your lectures. Thank you.

Furthermore, I would like to thank my twin brother Dominik, both for his emotional support throughout my studies at the University of Ulm and for proofreading my thesis. Being a part of your life has always been a pleasure and always will be. Thank you for your advice, your ear, your ideas and your energy.

Another word of thanks goes to Martin Kies, who went the extra mile with his profound and insightful feedback. Thank you for sharing your expertise and your time. Finally, I would like to thank my family and all my reviewers. Thank you for your time and effort; I do not take your support for granted.

Miscellaneous

License
Author: Patrick Rotter

The content of this work is licensed under a Creative Commons Attribution 4.0 International license.

Contact
For further questions or supplementary material, please do not hesitate to contact me at rotterpatrick@gmail.com.

Exercise 7 -- Appendix

Bonus: further dive into the data

If you feel that something has not been adequately addressed or discussed in one of the previous exercises, feel free to use this entirely optional task. You may run any code you like; the only limitations are the installed packages and the data itself. Please also keep in mind that computationally expensive tasks may take a short while to process.

Task: This code chunk loads the available data sets, categorized by their respective appearance in the problem set. Feel free to uncomment individual lines of code if you have a specific data set in mind. You may also uncomment everything at once.

# [Info] If you want to rerun models, please resort to the
#        prediction.R script, which contains implementations
#        of all algorithms with proper logging.
#        The prediction.R script can be downloaded from GitHub:
#        https://github.com/rotterp/RTutorMachineLearningAndCreditDefault/

### Lending Club data 
# [Info] The Lending Club data has been modified; please see
#        the preparation.R script for details. The preparation.R 
#        script can be downloaded from GitHub.

# Chapter 2.1 
# data = readRDS("data.rds")

# Chapter 2.2
# tm = select(data, title, purpose, desc)

# Chapter 4
# [Info] If the model does not involve tuning, utilize left instead
#        of train, as train only comprises 70% of the total observations.

# data = readRDS("data_b.rds")
# train_index <- with.random.seed(createDataPartition(data$default, 
#                                 p = 0.8, list = FALSE), seed = 5690)
# test <- data[-train_index,]
# left <- data[train_index,]
# tune_index <- with.random.seed(createDataPartition(left$default, 
#                                p = 0.125, list = FALSE), seed = 5690)
# tune <- left[tune_index,]
# train <- left[-tune_index,]
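
# [Info] Optional sanity check, not part of the original scripts:
#        after uncommenting the split above, these lines confirm that
#        train, tune and test comprise roughly 70%, 10% and 20% of
#        the observations, respectively.
# sizes <- c(train = nrow(train), tune = nrow(tune), test = nrow(test))
# round(sizes / nrow(data), 2)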


### User-generated data

# Figure 3.1 
# random = genr("random")

# Figure 3.2
# spline = genr("spline")

# Figure 3.3
# dat = data.frame(x = seq(-5, 5, 0.33), 
#                  y = c(rep(0,7),1,0,1,rep(0,3),1,rep(0,4),rep(1,13)))

# Figure 3.4
# f = function(x){2*x + 5*x^2 - 0.3*x^3}
# x = with.random.seed(runif(n = 25, min = 0, max = 30), seed)
# eps = with.random.seed(rnorm(n = 25, mean = 0, sd = 500), seed)
# y = f(x) + eps
# dat = data.frame(x = x, y = y)

# Figure 3.5
# f = function(x){2*x + 5*x^2 - 0.3*x^3}
# x = runif(n=5000, min=0, max=30)
# eps = rnorm(n = 5000, mean = 0, sd = 500)
# y = f(x) + eps
# .data = data.frame(instance = rep(1:25, each = 50),
#                    model = rep(c("constant", "septic"), 
#                            each = 25, times = 100),
#                    obs = rep(1:25, times = 100),
#                    x = x,
#                    y = y)

# Figure 4.1
# roc <- data.frame(false_positive_rate = c(0, 0.1, 0.2, 0.4, 0.8, 1.0), 
#                   true_positive_rate = c(0, 0.65, 0.85, 0.99, 1, 1))
# sp <-  spline(roc, n = 20)
# sp$y[10:20] <- 1
# roc <- data.frame(false_positive_rate = sp$x, 
#                   true_positive_rate = sp$y)
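
If you would like to visualize any of the user-generated data sets above, a minimal ggplot2 sketch might look as follows. It assumes you have uncommented and run the Figure 4.1 chunk directly above, so that roc exists; this is merely an illustration, not the original figure code of the problem set.

library(ggplot2)

# Smoothed ROC curve from the roc data frame above, with the
# 45-degree line of a purely random classifier added for reference
ggplot(roc, aes(x = false_positive_rate, y = true_positive_rate)) +
  geom_line() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(x = "False positive rate", y = "True positive rate")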

Task: This code chunk is solely for your own research. Feel free to edit it as often as you like. Do not worry about overwriting any of the variables from the data sets above, as this task is optional and no progress is saved.
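
As an example of what such research could look like, the following lines sketch an untuned logistic regression baseline. They assume that data_b.rds and the train/test split from the chunk above have been loaded and that the data contains only model-ready predictors; the model is purely illustrative and is not one of the tuned models from Exercise 4.

# Untuned logistic regression on all available predictors
baseline <- glm(default ~ ., data = train, family = binomial)

# Predicted default probabilities on the hold-out test set
prob <- predict(baseline, newdata = test, type = "response")

# Confusion matrix at a 50% probability threshold
table(predicted = prob > 0.5, actual = test$default)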



