knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(ada.classification)
set.seed(123)
kmeansPlus(diabetes, 2, exclude = "Outcome", plot = FALSE)
kmeansPlus(diabetes, 2, exclude = "Outcome", full.output = FALSE, x = "Glucose", y = "BMI")
p1 <- kmeansPlus(diabetes, 2, exclude = "Outcome", full.output = FALSE, x = "Glucose", y = "BMI")
p2 <- kmeansPlus(diabetes, 2, exclude = "Outcome", full.output = FALSE, x = "Glucose", y = "BloodPressure")
p3 <- kmeansPlus(diabetes, 2, exclude = "Outcome", full.output = FALSE, x = "Insulin", y = "BMI")
p4 <- kmeansPlus(diabetes, 2, exclude = "Outcome", full.output = FALSE, x = "Insulin", y = "BloodPressure")
gridExtra::grid.arrange(p1, p2, p3, p4, nrow = 2)
kmeansPlus(diabetes, 3, exclude = "Outcome", x = "Glucose", y = "BMI")
p1 <- kmeansPlus(diabetes, 3, exclude = "Outcome", full.output = FALSE, x = "Glucose", y = "BMI")
p2 <- kmeansPlus(diabetes, 3, exclude = "Outcome", full.output = FALSE, x = "Glucose", y = "BloodPressure")
p3 <- kmeansPlus(diabetes, 3, exclude = "Outcome", full.output = FALSE, x = "Insulin", y = "BMI")
p4 <- kmeansPlus(diabetes, 3, exclude = "Outcome", full.output = FALSE, x = "Insulin", y = "BloodPressure")
gridExtra::grid.arrange(p1, p2, p3, p4, nrow = 2)
# Viewing the last few columns of the original dataset
diabetes[, 4:ncol(diabetes)]

# Viewing the last few columns after running a simple implementation of kmeansPlus
result <- kmeansPlus(diabetes, 2, exclude = "Outcome", full.output = FALSE, plot = FALSE)
result[, 4:ncol(result)]
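The comparison above suggests that kmeansPlus appends the cluster assignment as a new final column of the returned data frame. Assuming that is the case, the number of observations placed in each cluster can be tabulated without referring to the new column by name; this check is illustrative and not part of the package output.

# Count the observations assigned to each cluster (assumes the assignment is the last column)
table(result[[ncol(result)]])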
This part of the package makes use of the randomForest function from the package by the same name. Specific information regarding this function can be found at the following link: https://www.rdocumentation.org/packages/randomForest/versions/4.6-14/topics/randomForest

The rF_testing function included in this package aids the user in a simpler use of the randomForest function. It trains a randomForest model on a defined subset of a dataset and produces a confusion matrix that compares the predicted classifications from the model with the original classifications as defined in the indicated dataset. In other words, the result is a table that shows the user where the model made correct and incorrect classification decisions on the data. The function then applies the resulting model to the remaining subset of the data (the test/validation set) to produce another table that tells the user how well the classification worked on these additional observations.

The classification variable must be stored as a factor for randomForest to treat the problem as classification. This can be done by wrapping the variable in the as.factor function and overwriting the original version of the variable.

d <- diabetes
d$Outcome <- as.factor(d$Outcome)
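A quick check with base R confirms that the conversion worked and that Outcome is now a two-level factor.

is.factor(d$Outcome)  # TRUE after the conversion
levels(d$Outcome)     # the two classes (0 = negative, 1 = positive)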
The data are now ready to be used in the rF_testing function. To run this function, the user must provide: the dataset (dat), the proportion of observations to use as the training set (s), the variable that holds the classifications (x), and the model formula (model), written as it would be supplied to randomForest. In the call below, the function will produce a model that predicts Outcome from all of the other variables (this is what the use of the period in place of the predictor variable names means).

set.seed(123)
rF_diabetes <- rF_testing(dat = d, s = 0.75, x = Outcome, model = Outcome ~ .)
rF_diabetes
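Before examining the output in detail, the sketch below shows roughly the kind of split/train/predict workflow that a call like this automates, using randomForest directly. The object names (train_idx, trn, tst) are illustrative, and the exact internals of rF_testing may differ.

library(randomForest)

set.seed(123)
n         <- nrow(d)
train_idx <- sample(n, size = floor(0.75 * n))  # same 75% training proportion as s = 0.75
trn       <- d[train_idx, ]
tst       <- d[-train_idx, ]

# Fit a random forest on the training subset
fit <- randomForest(Outcome ~ ., data = trn)

# Confusion matrix for the training data (includes the class errors)
fit$confusion

# Confusion matrix for the test/validation data
table(predicted = predict(fit, newdata = tst), actual = tst$Outcome)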
The output contains two confusion matrices: the first comes from the training set used to build the randomForest model, and the second uses the model built from the training set to predict the classifications of the test/validation set. Additionally, the first confusion matrix includes the classification errors, which describe the proportion of observations that were classified into the wrong category. In other words, referring to the output above, the proportion of incorrectly classified negative observations was 0.148 and the proportion of incorrectly classified positive observations was 0.434. The better the model is at predicting the classifications of the observations, the closer to zero these proportions will be. The model can potentially be improved in a few different ways, including altering the variables used as predictors (a demonstration of this appears below) or changing the proportion of data that is used as the training set.

Other performance measures can also be calculated from the output of the rF_testing function, as demonstrated below using the confusion matrix for the training dataset.

# Accuracy (correct classifications / total observations)
(rF_diabetes[[1]][1,1] + rF_diabetes[[1]][2,2]) / (rF_diabetes[[1]][1,1] + rF_diabetes[[1]][1,2] + rF_diabetes[[1]][2,1] + rF_diabetes[[1]][2,2])

# Specificity (true negative rate)
rF_diabetes[[1]][1,1] / (rF_diabetes[[1]][1,1] + rF_diabetes[[1]][1,2])

# Sensitivity (true positive rate)
rF_diabetes[[1]][2,2] / (rF_diabetes[[1]][2,1] + rF_diabetes[[1]][2,2])
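The class errors printed in the first confusion matrix can be recovered from the same table. The indexing below follows the calculations above, assuming the first row of rF_diabetes[[1]] corresponds to the negative class and the second row to the positive class.

# Class error for the negative class (proportion of negatives classified as positive)
rF_diabetes[[1]][1,2] / (rF_diabetes[[1]][1,1] + rF_diabetes[[1]][1,2])

# Class error for the positive class (proportion of positives classified as negative)
rF_diabetes[[1]][2,1] / (rF_diabetes[[1]][2,1] + rF_diabetes[[1]][2,2])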
The rF_testing function can then be run again to obtain the output described above for a different model. The following example uses parameters identical to the ones in the example above but with a reduced model that includes only 'Insulin', 'Glucose', and 'BMI' to predict 'Outcome'.

set.seed(123)
rF_diabetes1 <- rF_testing(dat = d, s = 0.75, x = Outcome, model = Outcome ~ Insulin + Glucose + BMI)
rF_diabetes1

# Accuracy of the reduced model (training confusion matrix)
(rF_diabetes1[[1]][1,1] + rF_diabetes1[[1]][2,2]) / (rF_diabetes1[[1]][1,1] + rF_diabetes1[[1]][1,2] + rF_diabetes1[[1]][2,1] + rF_diabetes1[[1]][2,2])
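To compare the full and reduced models without repeating the indexing, the accuracy calculation can be wrapped in a small helper function. This helper is purely illustrative and is not part of the package.

# Illustrative helper: overall accuracy from a 2 x 2 confusion matrix of counts
accuracy <- function(cm) (cm[1, 1] + cm[2, 2]) / (cm[1, 1] + cm[1, 2] + cm[2, 1] + cm[2, 2])

accuracy(rF_diabetes[[1]])   # full model (all predictors), training confusion matrix
accuracy(rF_diabetes1[[1]])  # reduced model (Insulin, Glucose, BMI), training confusion matrix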