knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

The project3part1package package contains various tools for statistical inference and prediction. It can be used for one-sample t-tests, fitting linear models, and running k-fold cross-validation with both the k-nearest neighbors and random forest algorithms.

To install the package, simply run the following code:

devtools::install_github("thomson3uw/project3part1package")

Following that step, we then load the package with the standard call to library().

# load the package
library(project3part1package)
# load the other packages used in this vignette
library(ggplot2)
library(kableExtra)

Student's T-test

For this demonstration, we will be using the lifeExp variable from the my_gapminder data set included in this package. We load the data as shown below:

# load the Gapminder data set
data(my_gapminder)
my_data <- my_gapminder$lifeExp

We then test the following hypotheses using the my_t.test function.

Test 1

\begin{align} H_0: \mu &= 60, \\ H_a: \mu &\neq 60. \end{align}

hypothesis1 <- my_t.test(x = my_data, alternative = "two.sided", mu = 60)
kable(data.frame(hypothesis1))

In the table above, we see that the p-value of our test is above our cutoff of \alpha = 0.05, so we fail to reject the null hypothesis; we cannot conclude that the true mean life expectancy differs from 60.

Test 2

\begin{align} H_0: \mu &= 60, \\ H_a: \mu &< 60. \end{align}

hypothesis2 <- my_t.test(x = my_data, alternative = "less", mu = 60)
kable(data.frame(hypothesis2))

In the above table we see that the p-value of our test is below our cutoff of \alpha = 0.05 so we reject the null hypothesis and conclude that the true mean life expectancy is below 60.

Test 3

\begin{align} H_0: \mu &= 60, \\ H_a: \mu &> 60. \end{align}

hypothesis3 <- my_t.test(x = my_data, alternative = "greater", mu = 60)
kable(data.frame(hypothesis3))

In the above table, we see that the p-value from our test is much larger than \alpha = 0.05, so we fail to reject the null hypothesis and cannot conclude that the true mean life expectancy is greater than 60. This conclusion seems reasonable, since the sample mean of lifeExp is itself below 60 and therefore provides no evidence that the true mean is greater than 60.
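As an optional sanity check (not part of the package itself), we could compare these results against the p-values from base R's stats::t.test, which should be similar for the same data and hypotheses.

# optional check against base R's t.test using the same data and hypotheses
t.test(my_data, mu = 60, alternative = "two.sided")$p.value
t.test(my_data, mu = 60, alternative = "less")$p.value
t.test(my_data, mu = 60, alternative = "greater")$p.value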

Fitting Linear Models

To demonstrate the use of the function my_lm, we will be using gdpPercap and continent from the my_gapminder data set as explanatory variables and lifeExp as the response in a linear regression model. The code below creates such a model, and the following table shows the estimated values of each of its coefficients.

my_fit <- my_lm(lifeExp ~ gdpPercap + continent, data = my_gapminder)
signif(my_fit, 4)

gdpPercap_coeff <- signif(my_fit[2, 1], 3)

In the output above, we see that the coefficient on gdpPercap is `r gdpPercap_coeff`, which suggests that, holding continent constant, an increase of 1 in GDP per capita for a given country corresponds with an increase of `r gdpPercap_coeff` years in its average life expectancy. This suggests that people in countries with higher GDP per capita tend to live longer.
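To put that slope on a more interpretable scale, the short snippet below simply rescales the stored coefficient, giving the expected change in life expectancy associated with a 1,000-unit increase in gdpPercap while holding continent fixed; this is just arithmetic on the value computed above.

# expected change in life expectancy for a 1,000-unit increase in gdpPercap,
# holding continent constant (a simple rescaling of the fitted slope)
1000 * gdpPercap_coeff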

However, it is possible that this positive coefficient on gdpPercap is not actually statistically significant, so we would like to test whether the coefficient is truly non-zero. We use the following hypotheses:

\begin{align} H_0: \beta_{gdpPercap} &= 0, \\ H_a: \beta_{gdpPercap} &\neq 0. \end{align}

In the table above, we see that the p-value corresponding to gdpPercap is well below \alpha = 0.05, so at that significance level we reject the null hypothesis and conclude that the coefficient on gdpPercap is not equal to zero.
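If my_fit follows the usual summary-table layout with the p-values stored in its final column (an assumption about the output of my_lm, not something shown above), that specific value could be pulled out directly, as sketched below.

# pull out the p-value for gdpPercap, assuming it sits in row 2 and that the
# final column of my_fit holds the p-values (Pr(>|t|))
my_fit[2, ncol(my_fit)]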

We now plot the Actual v. Fitted values from our data and model.

X <- model.matrix(lifeExp ~ gdpPercap + continent, my_gapminder)
beta <- my_fit[, 1]

fitted_data <- X %*% beta
actual_data <- my_gapminder$lifeExp

actual_v_fitted <- data.frame("Actual" = actual_data, 
                              "Fitted" = fitted_data,
                              "Continent" = as.factor(my_gapminder$continent))

actual_v_fitted_plot <- ggplot(actual_v_fitted, aes(x = Actual, y = Fitted, fill = Continent)) +
  geom_point(shape = 21, 
             size = 2,
             color = "black",
             alpha = 0.9) +
  theme_bw(base_size = 15) +
  labs(title = "Actual v. Fitted Life Expectancy") +
  theme(plot.title = element_text(hjust = 0.5))
actual_v_fitted_plot

In the Actual v. Fitted plot above, we see that in our linear fit, the continent of a country plays a larger role in the predicted average life expectancy than appears in the true relationship between those variables. This is especially apparent among the African and Asian data points: we predict that most of those countries have a life expectancy near 50 or 60, respectively, yet the actual life expectancies in those regions include many countries that are substantially lower or higher. As for the other explanatory variable, a high GDP per capita does appear to correspond with a higher average life expectancy. It may be worthwhile to transform gdpPercap so that it measures how far a country sits above or below the average, so that a low GDP per capita can also influence our results rather than being largely ignored in favor of continent; a sketch of this idea is shown below. Our fitted model's predictions appear somewhat accurate for countries in Europe, Oceania, and the Americas.
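As a rough illustration of that suggestion (not part of the original analysis), the sketch below centers gdpPercap around its mean before refitting with my_lm; the variable gdpPercap_centered is a hypothetical addition, and we assume my_lm accepts an arbitrary formula in the same way as lm().

# sketch: center GDP per capita so the coefficient describes deviation from the
# overall mean rather than the raw value (gdpPercap_centered is hypothetical)
my_gapminder$gdpPercap_centered <- my_gapminder$gdpPercap - mean(my_gapminder$gdpPercap)
my_fit_centered <- my_lm(lifeExp ~ gdpPercap_centered + continent, data = my_gapminder)
signif(my_fit_centered, 4)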

K-Nearest Neighbors Cross Validation

For this tutorial, we first set up the penguins data set that we will use to demonstrate the my_knn_cv function.

# load the penguins data
data(my_penguins)

# remove NAs from penguins
penguins_no_NA <- na.omit(my_penguins)
# separate the covariates and the true classification
my_data <- data.frame("bill_length" = penguins_no_NA$bill_length_mm,
                      "bill_depth" = penguins_no_NA$bill_depth_mm,
                      "flipper_length" = penguins_no_NA$flipper_length_mm,
                      "body_mass" = penguins_no_NA$body_mass_g)
my_cl <- penguins_no_NA$species

In the code below, we demonstrate the use of my_knn_cv by calculating the cross-validation error and training error for 10 different values of the number of neighbors, k_nn.

train_miss <- c()
cv_miss <- c()

# find the training and cv misclassification rates for k_nn from 1 through 10
for (k_nn in 1:10) {
  my_knn_cv_out <- my_knn_cv(data = my_data, cl = my_cl, k_nn = k_nn, k_cv = 5)
  train_miss[k_nn] <- mean(my_knn_cv_out[[1]] != my_cl)
  cv_miss[k_nn] <- my_knn_cv_out[[2]]
}

# create a table with the results
my_table <- data.frame(k_nn = 1:10,
                       cv_err = round(cv_miss, 4),
                       train_err = round(train_miss, 4))
kable_styling(kable(my_table))

Looking at the table above, we see that the training error is zero when we set k_nn to 1. If we are aiming to minimize training error, this is clearly the optimal choice, as training error tends to increase alongside k_nn. However, selecting the value of k_nn purely on training error could very well lead us to a model that overfits the data. Instead, we should use the k-fold cross-validation misclassification rate. This error is calculated by first splitting the original data into k folds (five in this case) and looping through each of those folds. For each fold, we use that fold as the test set and the remaining data as the training set, run the k-nearest neighbors algorithm on that split, and record its misclassification rate on the held-out fold. The final misclassification rate is simply the average of the rates across all folds; a sketch of this procedure is shown below. This method provides a measure of how well the algorithm performs on data it has not seen, and as such is more representative of how well it will perform on new, unclassified observations. Based on the cross-validation misclassification rate, we would still choose k_nn = 1 for our model, since it minimizes that error; however, that misclassification rate should now be closer to what we would expect on new data (unlike the training misclassification rate of zero).
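The following sketch illustrates the fold-splitting procedure described above. It is purely illustrative rather than a reproduction of my_knn_cv's internals, and it assumes the underlying classifier is class::knn, here with k_nn = 1.

# illustrative sketch of 5-fold cross-validation for 1-nearest-neighbor
# (not the actual internals of my_knn_cv)
set.seed(123)
k_cv <- 5
# randomly assign each observation to one of the k_cv folds
fold <- sample(rep(1:k_cv, length.out = nrow(my_data)))
fold_miss <- numeric(k_cv)
for (i in 1:k_cv) {
  # the i-th fold is the test set; everything else is the training set
  pred <- class::knn(train = my_data[fold != i, ],
                     test = my_data[fold == i, ],
                     cl = my_cl[fold != i],
                     k = 1)
  fold_miss[i] <- mean(pred != my_cl[fold == i])
}
# the CV misclassification rate is the average across folds
mean(fold_miss)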

Random Forests Cross Validation

For this demonstration of the my_rf_cv function, we run the function 30 times for each value of k in c(2, 5, 10), as shown below.

# create the matrix to store all the simulated MSEs
my_cv_MSEs <- matrix(NA, nrow = 30, ncol = 3)
counter <- 1
for (k in c(2, 5, 10)) {
  k_MSE <- c()
  for (i in 1:30) {
    k_MSE[i] <- my_rf_cv(k)
  }
  # store the 30 simulated values for this k in its own column of the overall matrix
  my_cv_MSEs[, counter] <- k_MSE
  counter <- counter + 1
}
# the matrix of the generated average MSEs (columns correspond to k = 2, 5, 10)
head(my_cv_MSEs)

In the code below, we generate a boxplot of the 30 average MSEs for each value of k, followed by a combined plot showing all three side by side.

k2_cv_MSEs <- data.frame("MSE" = my_cv_MSEs[, 1])
k2_box_plot <- ggplot(data = k2_cv_MSEs, aes(x = "", y = MSE)) +
  geom_boxplot(fill = "lightblue") +
  theme_classic(base_size = 15) +
  labs(title = "A boxplot of the average MSE from k-fold cross validation \n of random forests run 30 times using k = 2",
       x = "k = 2", y = "Average MSE") +
  theme(plot.title = element_text(hjust = 0.5))

k2_box_plot
k5_cv_MSEs <- data.frame("MSE" = my_cv_MSEs[, 2])
k5_box_plot <- ggplot(data = k5_cv_MSEs, aes(x = "", y = MSE)) +
  geom_boxplot(fill = "red") +
  theme_classic(base_size = 15) +
  labs(title = "A boxplot of the average MSE from k-fold cross validation \n of random forests run 30 times using k = 5",
       x = "k = 5", y = "Average MSE") +
  theme(plot.title = element_text(hjust = 0.5))

k5_box_plot
k10_cv_MSEs <- data.frame("MSE" = my_cv_MSEs[, 3])
k10_box_plot <- ggplot(data = k10_cv_MSEs, aes(x = "", y = MSE)) +
  geom_boxplot(fill = "lightgreen") +
  theme_classic(base_size = 15) +
  labs(title = "A boxplot of the average MSE from k-fold cross validation \n of random forests run 30 times using k = 10",
       x = "k = 10", y = "Average MSE") +
  theme(plot.title = element_text(hjust = 0.5))

k10_box_plot
combined_cv_MSEs <- data.frame("Values" = c(my_cv_MSEs[, 1], my_cv_MSEs[, 2], my_cv_MSEs[, 3]),
                               "k" = c(rep(2, 30), rep(5, 30), rep(10, 30)))
combined_box_plot <- ggplot(data = combined_cv_MSEs, aes(x = as.factor(k), y = Values, fill = as.factor(k))) +
  geom_boxplot() +
  theme_classic(base_size = 15) +
  labs(title = "A boxplot of the average MSE from k-fold cross validation \n of random forests run 30 times using k = 2, 5, 10",
       x = "k", y = "Average MSE", fill = "k") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual(values = c("lightblue", "red", "lightgreen"))

combined_box_plot

We then also compile a table of summary statistics for each value of k.

# create a table with each value of k and the corresponding summary statistics
my_MSE_table <- matrix(NA, nrow = 3, ncol = 3)

# k
my_MSE_table[1, 1] <- 2
my_MSE_table[2, 1] <- 5
my_MSE_table[3, 1] <- 10

# means
my_MSE_table[1, 2] <- mean(my_cv_MSEs[, 1])
my_MSE_table[2, 2] <- mean(my_cv_MSEs[, 2])
my_MSE_table[3, 2] <- mean(my_cv_MSEs[, 3])

# standard deviation
my_MSE_table[1, 3] <- sd(my_cv_MSEs[, 1])
my_MSE_table[2, 3] <- sd(my_cv_MSEs[, 2])
my_MSE_table[3, 3] <- sd(my_cv_MSEs[, 3])

my_MSE_table <- data.frame(my_MSE_table)
colnames(my_MSE_table) <- c("k", "mean CV estimate", "standard deviation of CV estimate")

# display the table
kable_styling(kable(my_MSE_table))

Looking over the plots and table above, we see that the median and mean CV estimates drop slightly from k = 2 to k = 5, yet remain fairly constant between k = 5 and k = 10. It is possible that part of this difference simply comes from variance, as the standard deviation of the CV estimates for k = 2 is noticeably higher than for the other two values of k. However, it might also be that restricting the training set to half of the data in each fold (as happens with k = 2) causes those random forest models to perform consistently worse than models trained on a larger portion of the data set. That logic would also apply to the difference in means between k = 5 and k = 10, though that difference is small enough not to be overly concerning. A rough comparison of the training-set sizes involved is sketched below.
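As a rough point of comparison (assuming my_rf_cv trains on the same cleaned penguins data prepared earlier, which is an assumption about its internals), the snippet below shows the approximate number of training rows available to each model for the three values of k.

# approximate training rows per fold: with k folds, each model trains on (k - 1) / k of the data
# (assumes my_rf_cv uses the cleaned penguins data from above)
for (k in c(2, 5, 10)) {
  cat("k =", k, ":", round(nrow(penguins_no_NA) * (k - 1) / k), "training rows per fold\n")
}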

As for the standard deviations for the three values of k, that difference is likely explained by the number of MSEs that get averaged to produce each CV estimate. For k = 2 we are averaging the results of two separate models, while for k = 10 we are averaging the results of ten, and we would expect the average of ten models to vary less than the average of two. Additionally, there is more overlap between the training sets of different folds when k = 10, which might contribute to its models being more similar to one another on average than those for k = 2 or even k = 5.


