knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
\section*{Introduction} It is often hard to determine which model is best suited for a given data set. We propose an ensemble of modeling functions for the classification of data. The user supplies a formula and data to a single function, and from this input we fit four different modeling approaches: logistic regression, random forest, support vector machines (SVM), and linear discriminant analysis (LDA).
Because tuning an SVM can be difficult, we use the package \texttt{EZtune}, which simplifies the process by finding the needed hyper-parameters automatically (Lundell, 2017).
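As a brief illustration of the \texttt{EZtune} interface, the following minimal sketch tunes an SVM; the built-in \texttt{mtcars} data and the chosen predictors are only stand-ins for any binary classification problem.
library(EZtune)
# eztune() searches for good SVM hyper-parameters automatically;
# y must be a binary response (here, mtcars$am is coded 0/1)
tuned <- eztune(x = mtcars[, c("mpg", "wt", "hp")], y = mtcars$am, method = "svm")
tuned$model  # the fitted, tuned SVM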
\section*{Data} The data set we use is called ``Water Quality - Drinking water potability'' (Kadiwal, 2021). It is open source and available on the online platform \url{Kaggle.com}. It was uploaded by Aditya Kadiwal, a Senior Data Engineer from Pune, Maharashtra, India. The data set consists of 10 measured metrics used to assess the water quality of 3276 different water bodies.
The data is synthetically generated and was last updated on April 25, 2021.
\section*{Functions} The following functions are available to the user:
\begin{itemize}
\item \texttt{model\_cat}, which creates a list object containing the four models mentioned earlier. It also computes the predictions and attaches them to each individual model so they can be called on later, saving computation time.
\item \texttt{plot}, which plots the distributions of the error terms, standardized to the same scale so that all of the models can be compared side by side.
\item \texttt{summary}, which takes what the native summary function returns for each model type and generalizes the output so that the summaries are consistent with one another.
\item \texttt{rank\_methods}, which calculates the accuracy of each model and then displays the modeling methods to the user in decreasing order of accuracy.
\end{itemize}
\section*{Analysis}
We begin our analysis by loading our package \texttt{sumcat} into the R session.
library(sumcat)
We can then load the \texttt{water\_potability} data set, on which we will run our analysis, along with the \texttt{water\_test} data set.
data("water_potability") data("water_test")
After loading these data sets into the R session, we set a seed for reproducibility so that the user can follow along with this example. We then build the models for our analysis using the \texttt{model\_cat()} function. Since we are interested in a starting point for this categorical analysis, the \texttt{formula} statement for the function will be our response variable \texttt{Potability} regressed on all of the explanatory variables.
set.seed(121521)
model <- model_cat(Potability ~ ., water_potability, water_test)
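Because \texttt{model\_cat()} returns a list holding the four fitted models, its components can be inspected directly; the exact component names are whatever \texttt{sumcat} assigns.
names(model)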
We can view the fitted models by using the package's \texttt{summary} method to display these results.
summary(model)
We can then see how the accuracy compares among the four models: logistic regression, random forest, support vector machines, and linear discriminant analysis. This is done by calling the \texttt{rank\_methods} function, which returns a data frame of the models and their accuracies.
rank_methods(model)
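Conceptually, the ranking is just the proportion of test observations each model classifies correctly. A minimal sketch of that computation, assuming each stored model carries the \texttt{Prediction} and \texttt{Fitted} elements that \texttt{model\_cat()} attaches (as seen in the timing code below):
# accuracy for a single fitted model m from the list:
# the share of test cases where the predicted class matches the observed class
accuracy <- function(m) mean(m$Prediction == m$Fitted)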
\newpage To visualize how the residuals are distributed, we override the \texttt{plot} function so that it shows the residuals.
plot(model)
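The idea behind the plot can be sketched as follows (a rough illustration, not the packaged implementation; \texttt{m} stands for any one of the four fitted models in the list):
r <- m$Fitted - m$Prediction    # 0 = correct, -1 or 1 = misclassified
r_std <- (r - mean(r)) / sd(r)  # standardized so the models share a scale
hist(r_std, main = "Standardized residuals")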
To compare the computational cost of the four approaches, we benchmark each fitting procedure with \texttt{microbenchmark}.
model_timing <- function(formula, data, test_data) {
  # split the training data into response and predictors for eztune()
  model_frame <- model.frame(formula, data = data)
  y <- model_frame[, 1]
  x <- model_frame[, 2:ncol(model_frame)]
  # observed test responses, coded 0/1
  test_model_frame <- model.frame(formula, data = test_data)
  fitted.predict <- as.integer(test_model_frame[, 1]) - 1
  microbenchmark::microbenchmark(
    log_model = {
      log_model <- glm(formula, data = data, family = "binomial")
      log_pred <- predict(log_model, newdata = test_data, type = "response")
      log_pred <- ifelse(log_pred < 0.5, 0, 1)
      log_model$Prediction <- log_pred
      log_model$Fitted <- fitted.predict
    },
    rf_model = {
      rf_model <- randomForest::randomForest(formula, data = data)
      rf_pred <- as.integer(predict(rf_model, newdata = test_data)) - 1
      rf_model$Prediction <- rf_pred
      rf_model$Fitted <- fitted.predict
    },
    svm_model = {
      svm_model <- EZtune::eztune(x, y, method = "svm")$model
      svm_pred <- as.integer(predict(svm_model, newdata = test_data)) - 1
      svm_model$Prediction <- svm_pred
      svm_model$Fitted <- fitted.predict
    },
    lda_model = {
      lda_model <- MASS::lda(formula, data = data)
      lda_pred <- as.integer(predict(lda_model, newdata = test_data)$class) - 1
      lda_model$Prediction <- lda_pred
      lda_model$Fitted <- fitted.predict
    },
    times = 1000
  )
}
mbm <- summary(model_timing(Potability ~ ., water_potability, water_test))
# give the rows display-friendly labels
mbm$expr <- c("log_model", "rf_model", "svm_model", "lda_model")
mbm
\section*{Conclusion}
In conclusion, our analysis shows that, at least for the \texttt{water\_potability} data set, the response was best predicted by random forest, with support vector machines close behind, as shown in Figure 1. Looking at the timing of each model, we see that all fit at about the same average speed, with random forest coming in second fastest while having the greatest predictive accuracy. We would suggest fine-tuning the random forest model, as it predicted best while costing relatively little computationally compared to the other models.
\section*{References}
Kadiwal, A. (2021, April). Water quality: Drinking water potability. Retrieved from \url{https://www.kaggle.com/adityakadiwal/water-potability} (Accessed: 11-12-21).
\vspace{.1in}
Lundell, J. F. (2017). There has to be an easier way: A simple alternative for parameter tuning of supervised learning methods. \textit{JSM Proceedings, Statistical Computing Section}. Alexandria, VA: American Statistical Association, 3028--3036.