knitr::opts_chunk$set(cache=FALSE, fig.height=3, fig.width=7, comment=NULL, eval=T, tidy=T, message=F, warning=F, width=80)

Introduction

This vignette covers the process of training the models and predicting against the test dataset. It is a continuation of the Exploratory Analysis already performed. The Exploratory Analysis vignette covers loading the raw data, converting it to tabular format, and exploring it to develop hypotheses. To see the Exploratory Analysis, read the vignette.

vignette("ExploratoryAnalysis")

Setup the Environment

library(RandomActsofPizza)
library(dplyr)
library(caret)
library(rpart.plot)
data(train)
data(test)

Naive Model

The naive model sets the baseline against which all other models are compared. It is simply the percentage of rows in the training dataset that were successful. If that percentage were greater than 50%, we'd predict that all requests are successful; if less than 50%, we'd predict that none are. In this case, the success rate is only 24.5%, so we'd predict that every request for pizza fails and would be correct 75.5% of the time.

train %>%
    summarise(N=length(received_pizza),
              Success=sum(received_pizza)) %>%
    mutate(Percent=paste(round(Success/N,3)*100,"%",sep=""))
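
To make the baseline concrete, here is a minimal sketch of the naive model's accuracy itself, assuming received_pizza is still stored as 0/1 (or TRUE/FALSE) at this point: predict failure for every request and check how often that is correct.

# Naive baseline: predict "no pizza" for every request
train %>%
    summarise(NaiveAccuracy = mean(received_pizza == 0))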

Train the Models

Set up parallel processing

library(doParallel)
cl <- makeCluster(3)
registerDoParallel(cl)

Logistic Regression Model

The target in this challenge is binary: the requester either receives a pizza (1) or does not (0). We will use logistic regression because it produces the probability that the target value is 1.
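
As a reminder of what the model returns, here is a minimal sketch of the logistic link: a linear predictor (the log-odds) is mapped onto a probability between 0 and 1. The intercept and coefficient below are made up purely for illustration.

# The logistic link maps log-odds onto a probability in (0, 1)
log_odds <- -1.2 + 0.8 * 1   # hypothetical intercept and single predictor term
plogis(log_odds)             # same as 1/(1 + exp(-log_odds)), about 0.40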

train <- train %>% mutate(received_pizza = factor(received_pizza, labels=c("N","Y")))

glm_ctrl <- trainControl(method="repeatedcv",
                         number=10,
                         repeats=10,
                         classProbs=TRUE,
                         summaryFunction = twoClassSummary,
                         allowParallel = TRUE)

LogMdl <- train(y=train$received_pizza,
                x=subset(train, select=-received_pizza),
                method="glm",
                metric="ROC",
                trControl=glm_ctrl,
                family="binomial")

stopCluster(cl)

We see from the model summary below that a few variables have a significant impact on the results. It is interesting that none of the days of the week had a significant impact; I would have expected some difference between weekdays and weekends. Also interesting is that the word 'Thank' is highly significant, suggesting that a request that says thank you may have a higher success rate.


summary(LogMdl)
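
To pull out just the significant terms discussed above, here is a minimal sketch that filters the coefficient table by p-value, using the fitted glm that caret stores in LogMdl$finalModel:

# Coefficient table from the underlying glm fit
coefs <- coef(summary(LogMdl$finalModel))
# Keep only terms with p-values below 0.05
coefs[coefs[, "Pr(>|z|)"] < 0.05, ]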

Classification Model

Classification and Regression Tree (CART) models repeatedly split the dataset on the variable that minimizes the impurity within each resulting group. In the plot below, you see that the right fork contains 287 successful requests and no failed ones.

cl <- makeCluster(3)
registerDoParallel(cl)

Cart_ctrl<- trainControl(method="cv",
                        number=10,
                        classProbs=TRUE,
                        summaryFunction = twoClassSummary,
                        allowParallel = TRUE)

CartMdl<- train(y=train$received_pizza,
                x=subset(train, select=-received_pizza),
                metric="ROC",
                method="rpart",
                trControl= Cart_ctrl,
                tuneGrid=data.frame(cp=.05))



stopCluster(cl)


prp(CartMdl$finalModel,
    main= "RAOP Classification Tree",
    extra=1,
    box.col=c("pink","palegreen")[CartMdl$finalModel$frame$yval],
    leaf.round=2)
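
Beyond the plot, the fitted tree can also be inspected directly; a minimal sketch using the rpart object that caret stores in CartMdl$finalModel (printcp comes from the rpart package, which rpart.plot loads):

# Print the splits and node counts of the fitted tree
CartMdl$finalModel

# Complexity parameter table: how the error changes with each split
printcp(CartMdl$finalModel)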

Score the Models

With the models trained, we'll use confusion matrices to understand how accurate the models are. Both models have similar accuracy scores: the logistic regression and CART models each showed about 83% accuracy. Both are significant improvements over the naive model's 75% accuracy.

We'll take the 60/40 weighted average of the predicted probabilities from these two models. This should give a better result on the test dataset.

LogScore <- predict(LogMdl, newdata=train, type="prob")
LogClass <- factor(ifelse(LogScore[,2] > .5, "Y", "N"), levels=c("N","Y"))
confusionMatrix(LogClass, train$received_pizza, positive="Y")

CartScore <- predict(CartMdl, newdata=train, type="prob")
CartClass <- factor(ifelse(CartScore[,2] > .5, "Y", "N"), levels=c("N","Y"))
confusionMatrix(CartClass, train$received_pizza, positive="Y")

MergedScore <- MergeModels(cbind(LogScore[,2], CartScore[,2]), c(.6, .4))
MergedClass <- factor(ifelse(MergedScore > .5, "Y", "N"), levels=c("N","Y"))
confusionMatrix(MergedClass, train$received_pizza, positive="Y")
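
MergeModels comes from this package; assuming it simply takes the weighted average of the probability columns it is given, the 60/40 blend above is equivalent to this minimal sketch:

# Weighted average of the two models' predicted probabilities:
# 60% logistic regression, 40% CART
weights <- c(.6, .4)
blended <- as.vector(cbind(LogScore[,2], CartScore[,2]) %*% weights) / sum(weights)
head(blended)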

Make Predictions Using the Test Data

Finally, we'll apply the models to the test set and create the final dataframe for submission.

LogPred  <- predict(LogMdl,newdata=test, type="prob")
CARTPred <- predict(CartMdl, newdata=test, type="prob")
Merged<-MergeModels(cbind(LogPred[,2],CARTPred[,2]), c(.6,.4))


Submit<- data.frame(request_id=test$request_id,
                    received_pizza=Merged)
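
To produce a file that can actually be uploaded, a minimal sketch that writes the submission data frame to CSV (the filename is just a placeholder):

# Write the submission data frame to disk
write.csv(Submit, "submission.csv", row.names=FALSE)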

