knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

\section*{Introduction} It is often hard to determine which model is best suited to a given data set. We propose an ensemble of modeling functions for the classification of data: the user supplies a formula and data to a single function, and four different modeling approaches are then fit to that input.

To help with the difficulty of tuning a support vector machine, we use the package ``EZtune'', which simplifies the process by finding the needed hyper-parameters for us (Lundell, 2017).
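As a minimal sketch of how EZtune is used (assuming a data frame of explanatory variables x and a binary response y; the call mirrors the one used inside the model_timing() function shown later), the tuned SVM could be obtained with:

library(EZtune)

# tune the SVM hyper-parameters and keep the fitted model (sketch only)
tuned <- eztune(x, y, method = "svm")
svm_model <- tuned$model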

\section*{Data} The data set we will use is called ``Water Quality - Drinking water potability'' (Kadiwal, 2021). It is open source and available on the online platform \url{Kaggle.com}. It was uploaded by Aditya Kadiwal, a Senior Data Engineer from Pune, Maharashtra, India. The data set has 10 columns: nine measured water-quality metrics and a binary potability indicator, recorded for 3276 different water bodies.

The data are synthetically generated and were last updated on April 25, 2021.
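As a quick sketch of inspecting the data (assuming the data set ships with the sumcat package, as it is loaded in the Analysis section below):

data("water_potability", package = "sumcat")
str(water_potability)                # structure of the water-quality columns
table(water_potability$Potability)   # counts of potable (1) vs. non-potable (0) samples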

\section*{Functions} The following functions are available to the user: model_cat() for fitting the four models, rank_methods() for comparing their accuracies, model_timing() for benchmarking their fit times, and summary() and plot() methods for the fitted model object.

\section*{Analysis} We begin our analysis by loading our package sumcat into the R session.

library(sumcat)

We can then load the water_potability data set, on which we will run our analysis, along with the water_test data set.

data("water_potability")
data("water_test")

After loading these data sets into our R session, we set a seed so that the results are reproducible and the user can follow along with this example. We then build the models for our analysis using the model_cat() function. Since we are interested in a starting point for this categorical analysis, our formula statement for the function uses our response variable, Potability, with all of our explanatory variables.

set.seed(121521)
model <- model_cat(Potability ~ ., water_potability, water_test)

We can view our fitted models by calling the package's summary() method, which displays these results.

summary(model)

We can then see how the accuracy compares among the four models: logistic regression, random forest, support vector machines, and linear discriminant analysis. This is done by calling the rank_methods() function, which returns a data frame of the models and their accuracies.

rank_methods(model)

\newpage To visualize how the residuals are distributed, we provide a plot() method for the model object that displays the residuals.

plot(model)

To compare the computational cost of the four approaches, the model_timing() function below times each model fit with the microbenchmark package.

model_timing <- function(formula, data, test_data){
  # extract the response and explanatory variables from the formula
  model_frame <- model.frame(formula, data = data)
  y <- model_frame[, 1]
  x <- model_frame[, 2:ncol(model_frame)]

  # observed responses from the test set, coded as 0/1
  test_model_frame <- model.frame(formula, data = test_data)
  fitted.predict <- as.integer(test_model_frame[, 1]) - 1

  # time each modeling approach; note that 1000 repetitions of the
  # SVM tuning step can take a long time
  microbenchmark::microbenchmark(log = {
    log_model <- glm(formula, data = data, family = "binomial")
    log_pred <- predict(log_model, newdata = test_data, type = "response")
    log_pred <- ifelse(log_pred < .5, 0, 1)
    log_model$Prediction <- log_pred
    log_model$Fitted <- fitted.predict
  }, rf = {
    rf_model <- randomForest::randomForest(formula, data = data)
    rf_pred <- as.integer(predict(rf_model, newdata = test_data)) - 1
    rf_model$Prediction <- rf_pred
    rf_model$Fitted <- fitted.predict
  }, svm = {
    svm_model <- EZtune::eztune(x, y, method = "svm")$model
    svm_pred <- as.integer(predict(svm_model, newdata = test_data)) - 1
    svm_model$Prediction <- svm_pred
    svm_model$Fitted <- fitted.predict
  }, lda = {
    lda_model <- MASS::lda(formula, data = data)
    lda_pred <- as.integer(predict(lda_model, newdata = test_data)$class) - 1
    lda_model$Prediction <- lda_pred
    lda_model$Fitted <- fitted.predict
  }, times = 1000)
}
mbm <- summary(model_timing(Potability ~ ., water_potability, water_test))
mbm$expr <- c("log_model", "rf_model", "svm_model", "lda_model" )
mbm

\section*{Conclusion} In conclusion, we see from our analysis that, at least for the water_potability data set, the response was best predicted by random forest, with support vector machines close behind, as shown in Figure 1. When we look at the timing of each model, we see that they all fit at about the same average speed, with random forest coming in second while having the greatest predictive accuracy. We would suggest fine tuning the random forest model, since it predicted best without costing much computationally relative to the other models.
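As a hedged sketch of what that fine tuning might look like (this is not part of the sumcat package; it uses randomForest::tuneRF to search over the mtry parameter, and assumes the packaged water_potability data have no missing values, as in the analysis above):

library(randomForest)

# split the training data into predictors and a factor response (assumed 0/1 coding)
x <- water_potability[, setdiff(names(water_potability), "Potability")]
y <- as.factor(water_potability$Potability)

# search for a good mtry, doubling it at each step and keeping changes that
# improve the out-of-bag error by at least 1%; doBest = TRUE returns the refit forest
tuned_rf <- tuneRF(x, y, ntreeTry = 500, stepFactor = 2,
                   improve = 0.01, doBest = TRUE)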

\section*{References} Kadiwal, A. (2021, April). Water quality: Drinking water potability. Retrieved from \url{https://www.kaggle.com/adityakadiwal/water-potability} (Accessed: 11-12-21) \vspace{.1in}

Lundell, J. F. (2017). There has to be an easier way: A simple alternative for parameter tuning of supervised learning methods. JSM Proceedings, Statistical Computing Section. Alexandria, VA: American Statistical Association, 3028–3036.


