knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

\section*{Introduction} It is often hard to determine which model is best suited to a given data set. We propose an ensemble of modeling functions for the classification of data: the user supplies a formula and data to a single function, and four different modeling approaches are then fit to that input.

To help with the difficulty of tuning a support vector machine, we use the package ``EZtune'', which simplifies the process by finding the needed hyper-parameters for us (Lundell, 2017).
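As a minimal sketch of how EZtune is used (assuming a data frame of explanatory variables x and a binary response y; the call mirrors the one used inside the model_timing() function shown later), the tuned SVM could be obtained with:

library(EZtune)

# tune the SVM hyper-parameters and keep the fitted model (sketch only)
tuned <- eztune(x, y, method = "svm")
svm_model <- tuned$model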

\section*{Data} The data set we will use is called ``Water Quality - Drinking water potability'' (Kadiwal, 2021). It is open source and available on the online platform \url{Kaggle.com}. It was uploaded by Aditya Kadiwal, a Senior Data Engineer from Pune, Maharashtra, India. The data set has 10 columns: nine measured water-quality metrics and a binary potability indicator, recorded for 3276 different water bodies.

The data are synthetically generated and were last updated on April 25, 2021.
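As a quick sketch of inspecting the data (assuming the data set ships with the sumcat package, as it is loaded in the Analysis section below):

data("water_potability", package = "sumcat")
str(water_potability)                # structure of the water-quality columns
table(water_potability$Potability)   # counts of potable (1) vs. non-potable (0) samples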

\section*{Functions} The following functions are available to the user: model_cat() for fitting the four models, rank_methods() for comparing their accuracies, model_timing() for benchmarking their fit times, and summary() and plot() methods for the fitted model object.

\section*{Analysis} We begin our analysis by loading our package sumcat into the R session.

library(sumcat)

We can then load the water_potability data set, on which we will run our analysis, along with the water_test data set.

data("water_potability")
data("water_test")

After loading these data sets into our R session, we set a seed so that the results are reproducible and the user can follow along with this example. We then build the models for our analysis using the model_cat() function. Since we are interested in a starting point for this categorical analysis, our formula statement for the function uses our response variable, Potability, with all of our explanatory variables.

set.seed(121521)
model <- model_cat(Potability ~ ., water_potability, water_test)

We can view our fitted models by calling the package's summary() method, which displays these results.

summary(model)

We can then see how the accuracy compares among the four models: logistic regression, random forest, support vector machines, and linear discriminant analysis. This is done by calling the rank_methods() function, which returns a data frame of the models and their accuracies.

rank_methods(model)

\newpage To visualize how the residuals are distributed, we provide a plot() method for the model object that displays the residuals.

plot(model)

To compare the computational cost of the four approaches, the model_timing() function below times each model fit with the microbenchmark package.

model_timing <- function(formula, data, test_data){
  # extract the response and explanatory variables from the formula
  model_frame <- model.frame(formula, data = data)
  y <- model_frame[, 1]
  x <- model_frame[, 2:ncol(model_frame)]

  # observed responses from the test set, coded as 0/1
  test_model_frame <- model.frame(formula, data = test_data)
  fitted.predict <- as.integer(test_model_frame[, 1]) - 1

  # time each modeling approach; note that 1000 repetitions of the
  # SVM tuning step can take a long time
  microbenchmark::microbenchmark(log = {
    log_model <- glm(formula, data = data, family = "binomial")
    log_pred <- predict(log_model, newdata = test_data, type = "response")
    log_pred <- ifelse(log_pred < .5, 0, 1)
    log_model$Prediction <- log_pred
    log_model$Fitted <- fitted.predict
  }, rf = {
    rf_model <- randomForest::randomForest(formula, data = data)
    rf_pred <- as.integer(predict(rf_model, newdata = test_data)) - 1
    rf_model$Prediction <- rf_pred
    rf_model$Fitted <- fitted.predict
  }, svm = {
    svm_model <- EZtune::eztune(x, y, method = "svm")$model
    svm_pred <- as.integer(predict(svm_model, newdata = test_data)) - 1
    svm_model$Prediction <- svm_pred
    svm_model$Fitted <- fitted.predict
  }, lda = {
    lda_model <- MASS::lda(formula, data = data)
    lda_pred <- as.integer(predict(lda_model, newdata = test_data)$class) - 1
    lda_model$Prediction <- lda_pred
    lda_model$Fitted <- fitted.predict
  }, times = 1000)
}
mbm <- summary(model_timing(Potability ~ ., water_potability, water_test))
mbm$expr <- c("log_model", "rf_model", "svm_model", "lda_model" )
mbm

\section*{Conclusion} In conclusion, we see from our analysis that, at least for the water_potability data set, the response was best predicted by random forest, with support vector machines close behind, as shown in Figure 1. When we look at the timing of each model, we see that they all fit at about the same average speed, with random forest coming in second while having the greatest predictive accuracy. We would suggest fine tuning the random forest model, since it predicted best without costing much computationally relative to the other models.
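As a hedged sketch of what that fine tuning might look like (this is not part of the sumcat package; it uses randomForest::tuneRF to search over the mtry parameter, and assumes the packaged water_potability data have no missing values, as in the analysis above):

library(randomForest)

# split the training data into predictors and a factor response (assumed 0/1 coding)
x <- water_potability[, setdiff(names(water_potability), "Potability")]
y <- as.factor(water_potability$Potability)

# search for a good mtry, doubling it at each step and keeping changes that
# improve the out-of-bag error by at least 1%; doBest = TRUE returns the refit forest
tuned_rf <- tuneRF(x, y, ntreeTry = 500, stepFactor = 2,
                   improve = 0.01, doBest = TRUE)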

\section*{References} Kadiwal, A. (2021, April). Water quality: Drinking water potability. Retrieved from \url{https://www.kaggle.com/adityakadiwal/water-potability} (Accessed: 11-12-21) \vspace{.1in}

Lundell, J. F. (2017). There has to be an easier way: A simple alternative for parameter tuning of supervised learning methods. JSM Proceedings, Statistical Computing Section. Alexandria, VA: American Statistical Association, 3028–3036.


