In kuhnrl30/LendingClubData: Dataset of Historical Loans Issued by Lending Club

knitr::opts_chunk$set(cache=FALSE, fig.height=3, fig.width=7, comment=NULL, eval=FALSE, tidy=FALSE, width=80, message=FALSE, warning=FALSE)

Introduction

Lending Club (LC) publishes data on all loans issued on its platform since 2007. The data is updated quarterly, roughly six weeks after each quarter closes. The LendingClubData package contains this historical data as a dataset and will be updated as new data is published. The code below was used to obtain the public data from Lending Club.

library(LendingClubData)
library(dplyr)

# File names of the public loan data, in chronological order
Links <- c(
  "LoanStats3a.csv",
  "LoanStats3b.csv",
  "LoanStats3c.csv",
  "LoanStats3d.csv",
  "LoanStats_2016Q1.csv",
  "LoanStats_2016Q2.csv",
  "LoanStats_2016Q3.csv",
  "LoanStats_2016Q4.csv",
  "LoanStats_2017Q1.csv",
  "LoanStats_2017Q2.csv",
  "LoanStats_2017Q3.csv"
)

The member data includes additional variables, but Lending Club has put those files behind a signed URL. I haven't yet figured out whether it's possible to get past the signature barrier, much less written the code to do it. To assemble the member data, I downloaded the files manually into a folder called downloads.

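# List the manually downloaded zip archives and extract them in place
zips <- dir("../downloads", pattern = "\\.zip$", full.names = TRUE)
invisible(lapply(zips, unzip, exdir = "../downloads"))

# Collect the extracted CSV files
datfiles <- dir("../downloads", pattern = "\\.csv$", full.names = TRUE)

# Clean each file with Prep_Loan_Data() from this package
df_list <- lapply(datfiles, Prep_Loan_Data)

# Stack the data frames, filling columns absent from a file with NA
dat <- plyr::rbind.fill(df_list)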
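The stacking step uses plyr::rbind.fill(), which binds data frames by row and fills any column missing from one of them with NA. That matters if the quarterly files do not all share the same variables, which is an assumption here rather than something stated above. The following is a minimal base-R sketch of the same idea on toy data. The helper name fill_bind() and the toy columns are hypothetical, and this is not how plyr implements it.

```r
# Hypothetical helper: row-bind data frames whose columns differ,
# padding absent columns with NA (the idea behind plyr::rbind.fill).
fill_bind <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  aligned <- lapply(dfs, function(df) {
    # Add every column this data frame lacks, filled with NA
    for (col in setdiff(all_cols, names(df))) {
      df[[col]] <- NA
    }
    # Put the columns in a common order so rbind() lines them up
    df[all_cols]
  })
  do.call(rbind, aligned)
}

# Toy stand-ins for two files; the second has an extra variable
older <- data.frame(id = 1:2, loan_amnt = c(1000, 2000),
                    stringsAsFactors = FALSE)
newer <- data.frame(id = 3L, loan_amnt = 1500, new_var = "x",
                    stringsAsFactors = FALSE)

combined <- fill_bind(list(older, newer))
```

In this sketch, combined has three rows and three columns, and new_var is NA for the rows that came from the older data frame.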

Some text fields contain non-ASCII characters, which are converted to ASCII.

# Transliterate accented Latin characters to their closest ASCII equivalents
dat$emp_title <- stringi::stri_trans_general(dat$emp_title, "Latin-ASCII")
dat$desc <- stringi::stri_trans_general(dat$desc, "Latin-ASCII")

The dataset is too big to be hosted on GitHub as a single file, even compressed as an .rda file. To get around this limitation, I have split the dataset into separate data frames by year and created the IssuedLoans() function, which combines them again. This also makes it possible to retrieve a subset of years instead of pulling the entire historical record.

# Tag each loan with its issue year, then split into one data frame per year
dat$issue_year <- format(dat$issue_d, "%Y")
dat_by_year <- split(dat, dat$issue_year)
names(dat_by_year) <- paste0("LendingClubData_", names(dat_by_year))

# Save each year's data frame to its own .rda file under its own object name
for (var_name in names(dat_by_year)) {
  assign(var_name, dat_by_year[[var_name]])
  save(list = var_name, file = paste0("../data/", var_name, ".rda"))
}

# Remove the per-year objects and the intermediate data from the workspace
rm(list = names(dat_by_year))
rm(dat_by_year, dat, df_list)
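To illustrate how per-year files like these could be recombined for a chosen set of years, here is a small round-trip sketch on toy data. The helper combine_years(), its arguments, and the toy data frame are hypothetical; the sketch only mirrors the idea described for IssuedLoans() and is not the package's actual implementation.

```r
# Hypothetical helper: load the requested per-year .rda files and stack them.
combine_years <- function(years, data_dir) {
  pieces <- lapply(years, function(yr) {
    var_name <- paste0("LendingClubData_", yr)
    env <- new.env()
    # load() restores the object under the name it was saved with
    load(file.path(data_dir, paste0(var_name, ".rda")), envir = env)
    get(var_name, envir = env)
  })
  do.call(rbind, pieces)
}

# Toy data standing in for the loan records
toy <- data.frame(
  id = 1:4,
  issue_d = as.Date(c("2015-01-01", "2015-06-01", "2016-03-01", "2017-09-01"))
)
toy_by_year <- split(toy, format(toy$issue_d, "%Y"))
names(toy_by_year) <- paste0("LendingClubData_", names(toy_by_year))

# Save each year to a temporary directory, one .rda file per year
data_dir <- tempfile("lc_data_")
dir.create(data_dir)
for (var_name in names(toy_by_year)) {
  assign(var_name, toy_by_year[[var_name]])
  save(list = var_name, file = file.path(data_dir, paste0(var_name, ".rda")))
}

# Retrieve only two of the three years
subset_dat <- combine_years(c(2015, 2017), data_dir)
```

Loading each file into its own environment keeps the per-year objects out of the caller's workspace, so nothing needs to be cleaned up with rm() afterwards.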