library(rmarkdown)
library(SmartEDA)
library(DriveML)
library(mlr)
library(knitr)
library(ggplot2)
library(tidyr)
This document introduces the DriveML package and shows how it helps you build machine learning binary classification models quickly and with little effort.
DriveML is a collection of functions such as autoDataprep, autoMAR and autoMLmodel. It automates some of the more complicated machine learning steps: exploratory data analysis, data pre-processing, feature engineering, model training, model validation, model tuning and model selection.
This package automates the following steps on any input dataset for machine learning classification problems:
Data cleaning
Feature engineering
Binary classification - Model training and validation
Model Explanation
Model report
Additionally, the companion package SmartEDA supports exploratory data analysis and can generate an automated EDA report in HTML format to help understand the distributions of the data. Note that some tasks depend on other R packages such as mlr, caret, data.table and ggplot2.
To summarize, the DriveML package produces a complete machine learning classification model from a few function calls instead of lengthy R code.
Algorithm: Missing at random features
The DriveML R package has three unique functions:
Data Pre-processing and Data Preparation
autoDataprep: function to generate novel features based on the functional understanding of the dataset
Building Machine Learning Models
autoMLmodel: function to develop baseline machine learning models using regression and tree based classification techniques
Generating Model Report
autoMLReport: function to print the machine learning model outcome in HTML format

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.
Data Source https://archive.ics.uci.edu/ml/datasets/Heart+Disease
Install the package "DriveML" to get the example data set.
library("DriveML")
library("SmartEDA")
## Load sample dataset from the DriveML package
data(heart)
More detailed attribute information is available on the DriveML help page.
Exploratory data analysis is done with the SmartEDA package.
Understanding the dimensions of the dataset, variable names, overall missing summary and data type of each variable
# Overview of the data - Type = 1
ExpData(data=heart,type=1)
# Structure of the data - Type = 2
ExpData(data=heart,type=2)
ovw_tabl <- ExpData(data=heart,type=1)
ovw_tab2 <- ExpData(data=heart,type=2)
kable(ovw_tabl, "html")
kable(ovw_tab2, "html")
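For intuition about what this kind of overview reports, here is a minimal base R sketch. It is not SmartEDA's implementation; the data frame `toy` and the helper `overview` are hypothetical names used only for illustration.

```r
# Hypothetical toy data standing in for a real dataset
toy <- data.frame(
  age  = c(63, 37, NA, 56),
  sex  = factor(c("M", "F", "F", "M")),
  chol = c(233, 250, 204, NA)
)

# Hypothetical helper: dimensions, per-variable class and missing counts
overview <- function(df) {
  list(
    n_rows    = nrow(df),
    n_cols    = ncol(df),
    var_class = vapply(df, function(x) class(x)[1], character(1)),
    n_missing = colSums(is.na(df))
  )
}

ov <- overview(toy)
ov$n_missing
```

ExpData presents comparable information as a formatted table, so the sketch is only meant to show which quantities are being summarised.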
snc = ExpNumStat(heart,by="GA",gp="target_var",Qnt=seq(0,1,0.1),MesofShape=2,Outlier=TRUE,round=2)
rownames(snc)<-NULL
ExpNumStat(heart,by="GA",gp="target_var",Qnt=seq(0,1,0.1),MesofShape=2,Outlier=TRUE,round=2)
paged_table(snc)
Box plots for all numerical variables vs categorical dependent variable - Bivariate comparison only with classes
Boxplot for all the numerical attributes by each class of the target variable
plot4 <- ExpNumViz(heart,target="target_var",type=1,nlim=3,fname=NULL,Page=c(2,2),sample=8)
plot4[[1]]
et100 <- ExpCTable(heart,Target="target_var",margin=1,clim=10,nlim=3,round=2,bin=NULL,per=F)
rownames(et100)<-NULL
Cross tabulation with target_var variable
Custom tables between all categorical independent variables and the target variable
ExpCTable(heart,Target="target_var",margin=1,clim=10,nlim=3,round=2,bin=NULL,per=F)
kable(et100,"html")
Stacked bar plot with vertical or horizontal bars for all categorical variables
plot5 <- ExpCatViz(heart,target = "target_var", fname = NULL, clim=5,col=c("slateblue4","slateblue1"),margin=2,Page = c(2,1),sample=2)
plot5[[1]]
ana1 <- ExpOutliers(heart, varlist = c("oldpeak","trestbps","chol"), method = "boxplot", treatment = "mean", capping = c(0.1, 0.9))
outlier_summ <- ana1[[1]]
ExpOutliers(heart, varlist = c("oldpeak","trestbps","chol"), method = "boxplot", treatment = "mean", capping = c(0.1, 0.9))
kable(outlier_summ,"html")
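The boxplot method with mean treatment can be illustrated on a single vector. The sketch below uses the usual 1.5 * IQR boxplot rule and replaces flagged values with the mean of the remaining values; both choices are assumptions about what the options mean, not a copy of SmartEDA's internals, and the vector `x` is made up.

```r
# Hypothetical numeric vector with one extreme value
x <- c(10, 12, 11, 13, 12, 11, 100)

# Boxplot (Tukey) rule: flag values beyond 1.5 * IQR from the quartiles
q     <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
is_out <- x < lower | x > upper

# Mean treatment (assumed): replace flagged values with the mean
# of the values that were not flagged
x_treated <- x
x_treated[is_out] <- mean(x[!is_out])
x_treated
```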
autoDataprep
Data preparation using DriveML autoDataprep function with default options
dateprep <- autoDataprep(data = heart,
                         target = 'target_var',
                         missimpute = 'default',
                         auto_mar = FALSE,
                         mar_object = NULL,
                         dummyvar = TRUE,
                         char_var_limit = 15,
                         aucv = 0.002,
                         corr = 0.98,
                         outlier_flag = TRUE,
                         uid = NULL,
                         onlykeep = NULL,
                         drop = NULL)
train_data <- dateprep$master_data
Different missing-value imputation methods can be specified per variable class using the mlr::impute helper functions
myimpute <- list(classes=list(factor = imputeMode(),
                              integer = imputeMean(),
                              numeric = imputeMedian(),
                              character = imputeMode()))
dateprep <- autoDataprep(data = heart,
                         target = 'target_var',
                         missimpute = myimpute,
                         auto_mar = FALSE,
                         mar_object = NULL,
                         dummyvar = TRUE,
                         char_var_limit = 15,
                         aucv = 0.002,
                         corr = 0.98,
                         outlier_flag = TRUE,
                         uid = NULL,
                         onlykeep = NULL,
                         drop = NULL)
train_data <- dateprep$master_data
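To show what class-based imputation amounts to, here is a small base R sketch that fills numeric columns with the median and factor or character columns with the most frequent value. It is an illustration only and does not use mlr; `toy`, `stat_mode` and `impute_simple` are hypothetical names.

```r
# Hypothetical data with missing values in a numeric and a factor column
toy <- data.frame(
  chol = c(200, NA, 240, 220),
  cp   = factor(c("a", "b", NA, "b"))
)

# Most frequent value of a factor or character vector (NAs are ignored)
stat_mode <- function(x) names(which.max(table(x)))

# Impute numerics with the median and factors/characters with the mode
impute_simple <- function(df) {
  for (nm in names(df)) {
    miss <- is.na(df[[nm]])
    if (!any(miss)) next
    if (is.numeric(df[[nm]])) {
      df[[nm]][miss] <- median(df[[nm]], na.rm = TRUE)
    } else {
      df[[nm]][miss] <- stat_mode(df[[nm]])
    }
  }
  df
}

imputed <- impute_simple(toy)
imputed
```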
Adding Missing at Random features using autoMAR function
marobj <- autoMAR(heart, aucv = 0.9, strataname = NULL, stratasize = NULL, mar_method = "glm")
dateprep <- autoDataprep(data = heart,
                         target = 'target_var',
                         missimpute = myimpute,
                         auto_mar = TRUE,
                         mar_object = marobj,
                         dummyvar = TRUE,
                         char_var_limit = 15,
                         aucv = 0.002,
                         corr = 0.98,
                         outlier_flag = TRUE,
                         uid = NULL,
                         onlykeep = NULL,
                         drop = NULL)
train_data <- dateprep$master_data
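The idea behind a missing-at-random feature can be sketched without the package: flag where a variable is missing, then check how well the other variables predict that flag. The sketch below assumes that aucv acts as an AUC cut-off for keeping the flag, and it scores a single predictor with a rank-based AUC instead of fitting a glm. These are simplifying assumptions, not a description of autoMAR's internals, and all names and values are made up.

```r
# Hypothetical data: chol tends to be missing for the oldest patients
toy <- data.frame(
  age  = c(30, 35, 40, 45, 60, 65, 70, 75),
  chol = c(180, 190, 200, 210, NA, 220, NA, NA)
)

# Step 1: binary flag marking where the variable is missing
toy$chol_missing <- as.integer(is.na(toy$chol))

# Step 2: rank-based AUC (Mann-Whitney statistic) of a score against a 0/1 label
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# How well does age alone separate missing from observed chol?
mar_auc <- auc(toy$age, toy$chol_missing)

# Step 3 (assumed rule): keep the flag as a feature only when
# missingness is predictable beyond the cut-off
keep_flag <- mar_auc > 0.9
```

When the AUC is close to 0.5, missingness carries little information about the other variables, and the flag would add nothing as a feature.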
autoMLmodel
Automated training, tuning and validation of machine learning models. This function includes the following binary classification techniques:
+ Logistic regression - logreg
+ Regularised regression - glmnet
+ Extreme gradient boosting - xgboost
+ Random forest - randomForest
+ Random forest - ranger
+ Decision tree - rpart
mymodel <- autoMLmodel(train = heart,
                       test = NULL,
                       target = 'target_var',
                       testSplit = 0.2,
                       tuneIters = 100,
                       tuneType = "random",
                       models = "all",
                       varImp = 10,
                       liftGroup = 50,
                       maxObs = 4000,
                       uid = NULL,
                       htmlreport = FALSE,
                       seed = 1991)
mymodel <- heart.model
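As a rough picture of what a 20% test split with a fixed seed and a baseline logistic regression look like, here is a self-contained base R sketch on simulated data. It does not reproduce autoMLmodel's tuning or its other learners, and the data and variable names are hypothetical.

```r
set.seed(1991)  # fixed seed so the split and simulated data are reproducible

# Hypothetical data: one predictor and a binary target
n   <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(1.5 * toy$x))

# Hold out 20% of the rows as a test set
test_idx <- sample(seq_len(n), size = round(0.2 * n))
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# Baseline logistic regression fitted on the training rows only
fit  <- glm(y ~ x, data = train, family = binomial())
prob <- predict(fit, newdata = test, type = "response")

# Test-set accuracy at a 0.5 probability threshold
accuracy <- mean((prob > 0.5) == test$y)
```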
Model performance
performance <- mymodel$results
kable(performance, "html")
Random forest model: Receiver Operating Characteristic (ROC) and variable importance
Training dataset ROC
TrainROC <- mymodel$trainedModels$randomForest$modelPlots$TrainROC
TrainROC
Test dataset ROC
TestROC <- mymodel$trainedModels$randomForest$modelPlots$TestROC
TestROC
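For readers who want to see how the points of an ROC curve arise, the following base R sketch computes true and false positive rates at each threshold and the area under the curve by the trapezoid rule. The probabilities and labels are made up, and the sketch is independent of how DriveML builds its plots.

```r
# Hypothetical predicted probabilities and true 0/1 labels
prob  <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)
label <- c(1,   1,   0,   1,   0,   0)

# One ROC point per threshold: true and false positive rates
thresholds <- sort(unique(prob), decreasing = TRUE)
roc <- data.frame(
  threshold = thresholds,
  tpr = vapply(thresholds, function(th) mean(prob[label == 1] >= th), numeric(1)),
  fpr = vapply(thresholds, function(th) mean(prob[label == 0] >= th), numeric(1))
)

# Area under the curve by the trapezoid rule, starting from the point (0, 0)
fpr <- c(0, roc$fpr)
tpr <- c(0, roc$tpr)
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
```

Here 8 of the 9 positive-negative pairs are ranked correctly, which matches the computed area of 8/9.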
Variable importance
VarImp <- mymodel$trainedModels$randomForest$modelPlots$VarImp
VarImp
Threshold
Threshold <- mymodel$trainedModels$randomForest$modelPlots$Threshold
Threshold