```{r}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
The `project3part1package` package contains various tools for statistical inference and prediction. It can be used for one-sample t-tests, fitting linear models, and running k-fold cross-validation with both the k-nearest-neighbors and random forest algorithms.
To install the package, simply run the following code:
```{r, eval = FALSE}
devtools::install_github("thomson3uw/project3part1package")
```
Following that step, we then load the package with the standard call to `library()`.

```{r}
# load the package
library(project3part1package)
# load the other packages used in this vignette
library(ggplot2)
library(kableExtra)
```
For this demonstration, we will be using the `lifeExp` data from the `my_gapminder` data set included in this package. As such, we load the data as shown below:

```{r}
# load the Gapminder data set
data(my_gapminder)
my_data <- my_gapminder$lifeExp
```
We then test the following hypotheses using the `my_t.test` function.

Test 1

\begin{align}
H_0: \mu &= 60,\\
H_a: \mu &\neq 60.
\end{align}
```{r}
hypothesis1 <- my_t.test(x = my_data, alternative = "two.sided", mu = 60)
kable(data.frame(hypothesis1))
```
In the table above, we see that the p-value of our test is above our cutoff of $\alpha = 0.05$, so we fail to reject the null hypothesis and cannot conclude that the true mean life expectancy is not equal to 60.
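To see where this p-value comes from, the sketch below recomputes the two-sided test by hand with base R. It assumes `my_t.test` implements the standard one-sample t-test with `length(x) - 1` degrees of freedom, which may differ from the package's internals.

```{r, eval = FALSE}
# one-sample t statistic: (sample mean - mu) / standard error of the mean
n <- length(my_data)
t_stat <- (mean(my_data) - 60) / (sd(my_data) / sqrt(n))
# two-sided p-value: both tails of the t distribution with n - 1 df
2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)
```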
Test 2

\begin{align}
H_0: \mu &= 60,\\
H_a: \mu &< 60.
\end{align}
```{r}
hypothesis2 <- my_t.test(x = my_data, alternative = "less", mu = 60)
kable(data.frame(hypothesis2))
```
In the above table, we see that the p-value of our test is below our cutoff of $\alpha = 0.05$, so we reject the null hypothesis and conclude that the true mean life expectancy is less than 60.
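The only change for this one-sided test is which tail of the t distribution we use. Reusing `t_stat` and `n` from the sketch above:

```{r, eval = FALSE}
# one-sided ("less") p-value: lower tail of the same t statistic
pt(t_stat, df = n - 1, lower.tail = TRUE)
```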
Test 3
\begin{align}
H_0: \mu &= 60,\\
H_a: \mu &> 60.
\end{align}
```{r}
hypothesis3 <- my_t.test(x = my_data, alternative = "greater", mu = 60)
kable(data.frame(hypothesis3))
```
In the above table, we see that the p-value from our test is significantly larger than $\alpha = 0.05$, so we fail to reject the null hypothesis and cannot conclude that the true mean life expectancy is greater than 60. Considering that the sample mean of `lifeExp` is less than 60, and as such provides no evidence of the true mean being greater than 60, this conclusion seems reasonable.
To demonstrate the use of the `my_lm` function, we will be using `gdpPercap` and `continent` from the `my_gapminder` data set as explanatory variables and `lifeExp` as the response in a linear regression model. The code below fits such a model, and the following table shows the estimated value of each of its coefficients.
```{r}
my_fit <- my_lm(lifeExp ~ gdpPercap + continent, data = my_gapminder)
signif(my_fit, 4)
gdpPercap_coeff <- signif(my_fit[2, 1], 3)
```
In the output above, we see that `gdpPercap` has a coefficient of `r gdpPercap_coeff`, which suggests that, as long as `continent` remains constant, an increase of 1 in a country's GDP per capita corresponds to an increase of `r gdpPercap_coeff` years in its average life expectancy. This suggests that people in countries with high GDP per capita tend to live longer.
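For reference, these coefficient estimates can be reproduced directly from the least-squares normal equations; the sketch below assumes `my_lm` computes the usual ordinary-least-squares estimates, which may differ from the package's internals.

```{r, eval = FALSE}
# solve the normal equations (X^T X) beta = X^T y for the OLS estimates
X <- model.matrix(lifeExp ~ gdpPercap + continent, my_gapminder)
y <- my_gapminder$lifeExp
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
beta_hat
```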
However, it is possible that this positive coefficient on `gdpPercap` is not actually significant, so we would like to test whether said coefficient is truly non-zero. We use the following hypotheses:

\begin{align}
H_0: \text{gdpPercap} &= 0,\\
H_a: \text{gdpPercap} &\neq 0.
\end{align}
In the table above, we see that the p-value corresponding to `gdpPercap` is significantly below $\alpha = 0.05$, so at that significance level we conclude that the coefficient on `gdpPercap` is not equal to zero.
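As a sanity check, that p-value can be recomputed from the coefficient table itself. The sketch below assumes `my_fit` stores the estimates in its first column and the standard errors in its second, with `nrow(my_gapminder) - nrow(my_fit)` residual degrees of freedom; the layout is an assumption, not confirmed by the package documentation.

```{r, eval = FALSE}
# t statistic for H_0: gdpPercap = 0 is the estimate over its standard error
t_stat <- my_fit[2, 1] / my_fit[2, 2]
df <- nrow(my_gapminder) - nrow(my_fit)
# two-sided p-value
2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
```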
We now plot the Actual v. Fitted values from our data and model.
```{r}
X <- model.matrix(lifeExp ~ gdpPercap + continent, my_gapminder)
beta <- my_fit[, 1]
fitted_data <- X %*% beta
actual_data <- my_gapminder$lifeExp
actual_v_fitted <- data.frame("Actual" = actual_data,
                              "Fitted" = fitted_data,
                              "Continent" = as.factor(my_gapminder$continent))
actual_v_fitted_plot <- ggplot(actual_v_fitted,
                               aes(x = Actual, y = Fitted, fill = Continent)) +
  geom_point(shape = 21, size = 2, color = "black", alpha = 0.9) +
  theme_bw(base_size = 15) +
  labs(title = "Actual v. Fitted Life Expectancy") +
  theme(plot.title = element_text(hjust = 0.5))
actual_v_fitted_plot
```
In the Actual v. Fitted plot above, we see that in our linear fit, the continent of a country seems to play a larger role in the predicted average life expectancy of that country than it does in the true relationship between those two variables. This is especially apparent amongst the African and Asian data points, where we predict that most of those countries have a life expectancy of roughly 50 or 60 respectively, yet when we look at the actual life expectancies of those countries, we see that there are many countries in those regions with significantly lower and higher average life expectancies. As for the other explanatory variable, high GDP per capita does appear to correspond with a higher average life expectancy. It may be worthwhile to transform the variable `gdpPercap` so that it measures how far a country is above or below the average, so that a low GDP per capita can also affect our predictions rather than simply being ignored in favor of continent. Our fitted model's predictions appear to be somewhat accurate for countries in Europe, Oceania, and the Americas.
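As a rough illustration of that idea, the sketch below centers `gdpPercap` around its mean before refitting. The centered column is a hypothetical addition for this example, not part of `my_gapminder`.

```{r, eval = FALSE}
# center GDP per capita so the variable measures deviation from the average
centered_gapminder <- my_gapminder
centered_gapminder$gdpPercapCentered <-
  centered_gapminder$gdpPercap - mean(centered_gapminder$gdpPercap)
centered_fit <- my_lm(lifeExp ~ gdpPercapCentered + continent,
                      data = centered_gapminder)
signif(centered_fit, 4)
```

Note that centering a predictor only shifts the intercept of a linear model; the slope on GDP per capita itself is unchanged.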
For this tutorial, we first must set up the penguins data set that we will use to demonstrate the `my_knn_cv` function.
```{r}
# load the penguins data
data(my_penguins)
# remove NAs from penguins
penguins_no_NA <- na.omit(my_penguins)
# separate the covariates and the true classification
my_data <- data.frame("bill_length" = penguins_no_NA$bill_length_mm,
                      "bill_depth" = penguins_no_NA$bill_depth_mm,
                      "flipper_length" = penguins_no_NA$flipper_length_mm,
                      "body_mass" = penguins_no_NA$body_mass_g)
my_cl <- penguins_no_NA$species
```
In the code below, we demonstrate the use of `my_knn_cv` by calculating the cross-validation error and training error for 10 different values of the k-nearest-neighbors parameter.
```{r}
train_miss <- c()
cv_miss <- c()
# find the training and cv misclassification rates for k_nn from 1 through 10
for (k_nn in 1:10) {
  my_knn_cv_out <- my_knn_cv(data = my_data, cl = my_cl, k_nn = k_nn, k_cv = 5)
  train_miss[k_nn] <- mean(my_knn_cv_out[[1]] != my_cl)
  cv_miss[k_nn] <- my_knn_cv_out[[2]]
}
# create a table with the results
my_table <- data.frame(k_nn = 1:10,
                       cv_err = round(cv_miss, 4),
                       train_err = round(train_miss, 4))
kable_styling(kable(my_table))
```
Looking at the table above, we see that the training error is zero when we set `k_nn` to 1. If we are aiming to minimize training error, this is clearly the optimal choice, as training error tends to increase alongside `k_nn`. However, selecting the value of `k_nn` for our model based purely off of training error could very well lead to us using a model that overfits our data. Instead, we should use the k-fold cross-validation misclassification rate. This error is calculated by first splitting the original data up into k folds, five in this case, and looping through each of those folds. For each fold, we set that fold aside as the test set, use the rest of the data as the training set, and run and evaluate the knn algorithm on that split (a sketch of this procedure is given below). For the final misclassification rate, we then simply average the misclassification rates calculated for each fold. This method provides us with a measure of how well the algorithm performs on data that it has not seen, and as such is more representative of how well our algorithm will perform on unclassified data not included in the original data set. Based on the cross-validation misclassification rate, we would still choose to set `k_nn` to 1 in our model, as it minimizes said error; however, that misclassification rate should now be closer to what we would expect for new data (unlike the zero training misclassification rate).
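The sketch below outlines this procedure with `class::knn()`. It is a minimal illustration of 5-fold cross-validation for 1-nearest-neighbor, not necessarily how `my_knn_cv` is implemented internally.

```{r, eval = FALSE}
library(class)

# randomly assign each observation to one of 5 folds
k_cv <- 5
fold <- sample(rep(1:k_cv, length = nrow(my_data)))
fold_miss <- numeric(k_cv)
for (i in 1:k_cv) {
  # fold i is the test set; everything else is the training set
  train <- my_data[fold != i, ]
  test <- my_data[fold == i, ]
  pred <- knn(train = train, test = test, cl = my_cl[fold != i], k = 1)
  fold_miss[i] <- mean(pred != my_cl[fold == i])
}
# the CV misclassification rate is the average across folds
mean(fold_miss)
```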
For this demonstration of the `my_rf_cv` function, we run the function 30 times for each value of `k` in `c(2, 5, 10)`, as shown below.
```{r}
# create the matrix to store all the simulated MSEs
my_cv_MSEs <- matrix(NA, nrow = 30, ncol = 3)
counter <- 1
for (k in c(2, 5, 10)) {
  k_MSE <- c()
  for (i in 1:30) {
    k_MSE[i] <- my_rf_cv(k)
  }
  # store the 30 simulated values for this k in its own column of the matrix
  my_cv_MSEs[, counter] <- k_MSE
  counter <- counter + 1
}
# the first few rows of the matrix of generated average MSEs
head(my_cv_MSEs)
```
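For context, a single call to a function like `my_rf_cv(k)` might look roughly like the sketch below: k-fold cross-validation of a random forest's mean squared error on the penguins data. The response (`body_mass_g`), the covariates, and the `ntree` value are assumptions for illustration, not necessarily what the package uses.

```{r, eval = FALSE}
library(randomForest)

k <- 5
fold <- sample(rep(1:k, length = nrow(penguins_no_NA)))
fold_MSE <- numeric(k)
for (i in 1:k) {
  train <- penguins_no_NA[fold != i, ]
  test <- penguins_no_NA[fold == i, ]
  # fit a random forest on the training folds
  model <- randomForest(body_mass_g ~ bill_length_mm + bill_depth_mm +
                          flipper_length_mm, data = train, ntree = 100)
  # mean squared prediction error on the held-out fold
  fold_MSE[i] <- mean((test$body_mass_g - predict(model, test))^2)
}
# average the fold MSEs to get the CV estimate
mean(fold_MSE)
```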
In the code below, we generate a boxplot of the 30 average MSEs for each value of `k` from the section above, followed by a combined plot of all three.
```{r}
k2_cv_MSEs <- data.frame("MSE" = my_cv_MSEs[, 1])
k2_box_plot <- ggplot(data = k2_cv_MSEs, aes(x = "", y = MSE)) +
  geom_boxplot(fill = "lightblue") +
  theme_classic(base_size = 15) +
  labs(title = "A boxplot of the average MSE from k-fold cross validation \n of random forests run 30 times using k = 2",
       x = "k = 2", y = "Average MSE") +
  theme(plot.title = element_text(hjust = 0.5))
k2_box_plot
```
```{r}
k5_cv_MSEs <- data.frame("MSE" = my_cv_MSEs[, 2])
k5_box_plot <- ggplot(data = k5_cv_MSEs, aes(x = "", y = MSE)) +
  geom_boxplot(fill = "red") +
  theme_classic(base_size = 15) +
  labs(title = "A boxplot of the average MSE from k-fold cross validation \n of random forests run 30 times using k = 5",
       x = "k = 5", y = "Average MSE") +
  theme(plot.title = element_text(hjust = 0.5))
k5_box_plot
```
```{r}
k10_cv_MSEs <- data.frame("MSE" = my_cv_MSEs[, 3])
k10_box_plot <- ggplot(data = k10_cv_MSEs, aes(x = "", y = MSE)) +
  geom_boxplot(fill = "lightgreen") +
  theme_classic(base_size = 15) +
  labs(title = "A boxplot of the average MSE from k-fold cross validation \n of random forests run 30 times using k = 10",
       x = "k = 10", y = "Average MSE") +
  theme(plot.title = element_text(hjust = 0.5))
k10_box_plot
```
```{r}
combined_cv_MSEs <- data.frame("Values" = c(my_cv_MSEs[, 1], my_cv_MSEs[, 2], my_cv_MSEs[, 3]),
                               "k" = c(rep(2, 30), rep(5, 30), rep(10, 30)))
combined_box_plot <- ggplot(data = combined_cv_MSEs,
                            aes(x = as.factor(k), y = Values, fill = as.factor(k))) +
  geom_boxplot() +
  theme_classic(base_size = 15) +
  labs(title = "A boxplot of the average MSE from k-fold cross validation \n of random forests run 30 times using k = 2, 5, 10",
       x = "k", y = "Average MSE", fill = "k") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual(values = c("lightblue", "red", "lightgreen"))
combined_box_plot
```
We then also compile a table of summary statistics for each value of `k`.
```{r}
# create a table with each value of k and the corresponding summary statistics
my_MSE_table <- matrix(NA, nrow = 3, ncol = 3)
# k
my_MSE_table[1, 1] <- 2
my_MSE_table[2, 1] <- 5
my_MSE_table[3, 1] <- 10
# means
my_MSE_table[1, 2] <- mean(my_cv_MSEs[, 1])
my_MSE_table[2, 2] <- mean(my_cv_MSEs[, 2])
my_MSE_table[3, 2] <- mean(my_cv_MSEs[, 3])
# standard deviations
my_MSE_table[1, 3] <- sd(my_cv_MSEs[, 1])
my_MSE_table[2, 3] <- sd(my_cv_MSEs[, 2])
my_MSE_table[3, 3] <- sd(my_cv_MSEs[, 3])
my_MSE_table <- data.frame(my_MSE_table)
colnames(my_MSE_table) <- c("k", "mean CV estimate", "standard deviation of CV estimate")
# display the table
kable_styling(kable(my_MSE_table))
```
Looking over the plots and table above, we see that the median and average CV estimates drop slightly from `k = 2` to `k = 5`, yet remain fairly constant between `k = 5` and `k = 10`. It is possible that part of this difference simply comes from variance, as the standard deviation of the CV estimates for `k = 2` is noticeably higher than for the other two values of `k`. However, it might also be the case that, by restricting the training set in each fold to half of its original size, the `k = 2` cross-validation produces random forest models that perform consistently worse than models trained on a larger portion of the data set. That logic would also apply to the difference in means between `k = 5` and `k = 10`, though said difference is small enough to not be overly concerning.
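To make those training-set sizes concrete, each model in k-fold cross-validation trains on a fraction $(k - 1)/k$ of the data:

\begin{align}
\frac{k-1}{k} = \frac{1}{2} \text{ for } k = 2, \qquad \frac{4}{5} \text{ for } k = 5, \qquad \frac{9}{10} \text{ for } k = 10,
\end{align}

so the gap in available training data is much larger between `k = 2` and `k = 5` than between `k = 5` and `k = 10`.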
As for the standard deviations for the three values of `k`, that difference is likely explained by the number of MSEs that get averaged to produce each iteration of the CV estimate. For `k = 2` we are averaging the results of two separate models, while for `k = 10` we are averaging the results of ten, and we would expect the average of ten models to vary less than the average of two. Additionally, there is more overlap in the training data for `k = 10` across folds, which might contribute to its models being more similar on average than those for `k = 2` or even `k = 5`.
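As a rough heuristic for the spread, if each fold's MSE had variance $\sigma^2$ and the folds were independent (they are not, so this is only an approximation), the variance of their average would be

\begin{align}
\mathrm{Var}\left(\frac{1}{k}\sum_{i=1}^{k} \mathrm{MSE}_i\right) &= \frac{\sigma^2}{k},
\end{align}

so moving from `k = 2` to `k = 10` would shrink the standard deviation of the CV estimate by roughly a factor of $\sqrt{5} \approx 2.2$.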