knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(STAT302PACKAGE)

Introduction

The STAT302PACKAGE contains data and four main functions (in addition to some minor helpers): my_t.test, my_lm, my_knn_cv, and my_rf_cv.

To install and use this package, run the following commands in the console:

# install and load package
devtools::install_github("dzeng8/STAT302PACKAGE")
library(STAT302PACKAGE)

my_t.test Tutorial

We will demonstrate the my_t.test function with lifeExp data from the my_gapminder dataset.

The input data are measurements of people's life expectancy at birth, in years.
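
A quick look at the variable confirms its scale (assuming the my_gapminder data loads with the package, as above):

# summarize life expectancy at birth, in years
summary(my_gapminder$lifeExp)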

Two-sided Test

We run a t-test with the null hypothesis that the mean life expectancy is 60 years and the alternative hypothesis that the mean life expectancy is not 60 years.

# run two-sided t-test with null hypothesis of mean as 60
results <- my_t.test(my_gapminder$lifeExp, "two.sided", 60)
results

Using a p-value cut-off of $\alpha = 0.05$, the p-value of `r results[[4]]` implies that we cannot reject the null hypothesis. In other words, there is not enough evidence to reject the null hypothesis that the mean life expectancy is 60 years in favor of the alternative hypothesis that the mean life expectancy is not 60 years.
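
As a sanity check, the same test can be reproduced with base R's t.test(), which accepts the same alternative and mu arguments:

# equivalent two-sided test using base R's t.test, for comparison
t.test(my_gapminder$lifeExp, alternative = "two.sided", mu = 60)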

One-sided Test, Less

We run a t-test with the null hypothesis that the mean life expectancy is 60 years and the alternative hypothesis that the mean life expectancy is less than 60 years.

# run one-sided t-test with null hypothesis of mean as 60, alternative less than 60
results <- my_t.test(my_gapminder$lifeExp, "less", 60)
results

Using a p-value cut-off of $\alpha = 0.05$, the p-value of `r results[[4]]` implies that we can reject the null hypothesis. In other words, we have sufficient evidence to reject the null hypothesis that the mean life expectancy is 60 years in favor of the alternative hypothesis that the mean life expectancy is less than 60 years.

One-sided Test, Greater

We run a t-test with the null hypothesis that the mean life expectancy is 60 years and the alternative hypothesis that the mean life expectancy is greater than 60 years.

# run one-sided t-test with null hypothesis of mean as 60, alternative greater than 60
results <- my_t.test(my_gapminder$lifeExp, "greater", 60)
results

Using a p-value cut-off of $\alpha = 0.05$, the p-value of `r results[[4]]` implies that we cannot reject the null hypothesis. In other words, we do not have sufficient evidence to reject the null hypothesis that the mean life expectancy is 60 years in favor of the alternative hypothesis that the mean life expectancy is greater than 60 years.

my_lm Tutorial

We demonstrate the my_lm function by fitting a linear model to the my_gapminder dataset. The model predicts lifeExp using gdpPercap and continent as the explanatory variables.

# fit the linear model lifeExp ~ gdpPercap + continent on the my_gapminder dataset
results <- my_lm(lifeExp ~ gdpPercap + continent, data = my_gapminder)
results

The coefficient on the gdpPercap variable is `r results[2, 1]`, meaning that life expectancy is expected to increase by `r results[2, 1]` years per one-unit increase in gdpPercap, holding continent fixed.

We can see that the hypothesis test associated with the gdpPercap covariate is $H_0 : \beta_{gdpPercap} = 0$ and $H_a : \beta_{gdpPercap} \neq 0$. In other words, the null hypothesis is that the coefficient of gdpPercap is equal to 0 while the alternative hypothesis is that the coefficient is not equal to 0.

Using a p-value cut-off of $\alpha = 0.05$, the p-value of `r results[2, 4]` allows us to reject the null hypothesis. In other words, the gdpPercap predictor is statistically significant.
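
For reference, here is a sketch of how this two-sided p-value arises from the coefficient's t-statistic. It assumes the my_lm output stores the test statistic in its third column; the actual column layout may differ:

# hypothetical re-computation of the gdpPercap p-value from its t-statistic
t_val <- results[2, 3]
# residual degrees of freedom: n minus the number of coefficients
df_resid <- nrow(my_gapminder) - nrow(results)
2 * pt(abs(t_val), df = df_resid, lower.tail = FALSE)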

We plot the actual vs fitted values of this linear model.

# create data frame of actual vs fitted values, also adding continent column
X <- model.matrix(lifeExp ~ gdpPercap + continent, data = my_gapminder)
beta <- results[, 1]
fitted_values <- X %*% beta
actual_vs_fitted_df <- data.frame(fitted = fitted_values,
                                  actual = my_gapminder$lifeExp,
                                  continent = my_gapminder$continent)
# plot actual vs fitted values
ggplot2::ggplot(actual_vs_fitted_df, 
                ggplot2::aes(x = fitted, 
                             y = actual,
                             col = continent)) +
  ggplot2::labs(title = "Actual vs Fitted Values", 
                x = "Fitted Values",
                y = "Actual values") + 
  ggplot2::geom_point() + 
  ggplot2::geom_abline() + 
  ggplot2::theme_bw()

With the reference line y = x, points on the line are cases where the model predicts the actual value exactly. We can see that the model predicts life expectancy poorly for Africa, the Americas, and Asia, but reasonably well for Europe and Oceania. We conclude that the model fits the data well for Europe and Oceania, but not for Africa, the Americas, and Asia.
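
One way to quantify this visual impression is to compute the root mean squared error of the fitted values within each continent, reusing the data frame built above:

# per-continent RMSE of the fitted values (larger values indicate a worse fit)
resid_vals <- actual_vs_fitted_df$actual - actual_vs_fitted_df$fitted
tapply(resid_vals, actual_vs_fitted_df$continent, function(e) sqrt(mean(e^2)))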

my_knn_cv Tutorial

Here we demonstrate the my_knn_cv function using the my_penguins dataset.

We will predict the output class species from the covariates bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g using 5-fold cross-validation. We train 10 models with the number of nearest neighbors ranging from 1 to 10, and display the training misclassification rate and the CV misclassification rate in a table.

# load my_penguins data and extract relevant variables
penguin_data <- my_penguins[c("bill_length_mm",
                              "bill_depth_mm",
                              "flipper_length_mm",
                              "body_mass_g",
                              "species")]
penguin_data <- as.data.frame(penguin_data)
# remove rows containing NA values
data_noNA <- na.omit(penguin_data)
# split data into predictors and outcome
train <- data_noNA[, 1:4]
cl <- data_noNA[, 5]

# demonstrate function with 5-fold cv with k-nearest neighbors ranging from
# 1 to 10
training_misclass_rate <- rep(NA, 10)
cv_misclass_rate <- rep(NA, 10)
for (i in 1:10) {
  results <- my_knn_cv(train, cl, i, 5)
  # compute training misclassification rate from the returned predictions
  predicted <- results[[1]]
  training_misclass_rate[i] <- sum(predicted != cl) / length(cl)
  # store the cv misclassification rate returned by my_knn_cv
  cv_misclass_rate[i] <- results[[2]]
}

# create table of training misclassification rate and cv misclassification rate
# for k-nearest neighbors ranging from 1 to 10
results_table <- cbind(training_misclass_rate, cv_misclass_rate)
colnames(results_table) <- c("Training Misclassification Rate", 
                             "CV Misclassification Rate")
rownames(results_table) <- c("1-nn", "2-nn", "3-nn", "4-nn", "5-nn",
                               "6-nn", "7-nn", "8-nn", "9-nn", "10-nn")
results_table

Based on the training misclassification rate, we would choose the 1-nn model because it has the lowest rate; the 1-nn model also has the lowest CV misclassification rate.

In practice, we would choose the model with the lowest CV misclassification rate. Cross-validation trains a model on a subset of the data and tests it on the remaining, held-out part that the trained model has not seen. This process repeats $k$ times (in k-fold cross-validation), each time holding out a different subset, which lets us evaluate how well the model generalizes to unseen data. We would not choose a model based on the training misclassification rate, because evaluating a model on the same data used to train it gives an overly optimistic (and unrealistic) assessment. Therefore, based on the CV misclassification rate, we would choose the 1-nn model in practice, as sketched below.
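
For intuition, here is a minimal sketch of the fold logic a function like my_knn_cv presumably implements, assuming random fold assignment and knn() from the class package; the actual internals of my_knn_cv may differ:

# illustrative 5-fold CV misclassification rate for 1-nn
k_folds <- 5
fold <- sample(rep(1:k_folds, length.out = nrow(train)))
fold_errs <- numeric(k_folds)
for (j in 1:k_folds) {
  # train on all folds except fold j, then predict the held-out fold
  pred <- class::knn(train = train[fold != j, ],
                     test = train[fold == j, ],
                     cl = cl[fold != j],
                     k = 1)
  fold_errs[j] <- mean(pred != cl[fold == j])
}
mean(fold_errs)  # CV misclassification rate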

my_rf_cv Tutorial

This function uses the my_penguins data internally to train a random forest model that predicts body_mass_g from the covariates bill_length_mm, bill_depth_mm, and flipper_length_mm. Because it uses k-fold cross-validation, we run the function with 2-fold, 5-fold, and 10-fold cross-validation, generating 30 simulations of the model for each value of $k$.

# store results in a data frame
mse_df <- data.frame()
# run model on 2, 5, and 10 folds
for (folds in c(2, 5, 10)) {
  # generate model 30 times and store mse in a vector
  mse_vector <- rep(NA, 30)
  for (i in 1:30) {
    avg_mse <- my_rf_cv(folds)
    mse_vector[i] <- avg_mse
  }
  # store results in aggregate data frame
  new_df <- data.frame(mse = mse_vector, 
                       fold = rep(as.character(folds), 30))
  mse_df <- rbind(mse_df, new_df)
}

We generate boxplots to visualize the distribution of the average mean squared error for each choice of the number of folds.

# create boxplot, 2-fold cv
ggplot2::ggplot(data = mse_df[1:30,], ggplot2::aes(x = fold, y = mse)) +
  ggplot2::labs(title = "Average Mean Squared Error, 30 Simulations", 
                x = "Number of Folds",
                y = "Mean Squared Error") +
  ggplot2::geom_boxplot(fill = "lightblue") +
  ggplot2::theme_bw() 

# create boxplot, 5-fold cv
ggplot2::ggplot(data = mse_df[31:60,], ggplot2::aes(x = fold, y = mse)) +
  ggplot2::labs(title = "Average Mean Squared Error, 30 Simulations", 
                x = "Number of Folds",
                y = "Mean Squared Error") +
  ggplot2::geom_boxplot(fill = "lightblue") +
  ggplot2::theme_bw() 

# create boxplot, 10-fold cv
ggplot2::ggplot(data = mse_df[61:90,], ggplot2::aes(x = fold, y = mse)) +
  ggplot2::labs(title = "Average Mean Squared Error, 30 Simulations", 
                x = "Number of Folds",
                y = "Mean Squared Error") +
  ggplot2::geom_boxplot(fill = "lightblue") +
  ggplot2::theme_bw() 

We can see that as the number of folds increases, the range of values for the average mean squared error decreases.
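
We can check this directly by computing the width of the range of the 30 simulated values for each fold setting:

# spread of the simulated MSEs per number of folds
tapply(mse_df$mse, mse_df$fold, function(x) diff(range(x)))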

We also create a table displaying the mean CV estimate and the standard deviation of the CV estimates for each number of folds used.

# create table of mean and standard deviation of MSE across each fold
avg_col <- rbind(mean(mse_df[1:30,1]), mean(mse_df[31:60,1]), mean(mse_df[61:90,1]))
sd_col <- rbind(sd(mse_df[1:30,1]), sd(mse_df[31:60,1]), sd(mse_df[61:90,1]))
results_table <- cbind(avg_col, sd_col)
rownames(results_table) <- c("2-fold CV", "5-fold CV", "10-fold CV")
colnames(results_table) <- c("Mean of MSE", "Standard Deviation of MSE")
results_table

We can see that as the number of folds increases, both the mean and the standard deviation of the MSE decrease. This is likely because increasing the number of folds increases the size of each training set and decreases the size of each test set: with $n$ observations, 2-fold CV trains each model on $n/2$ rows, while 10-fold CV trains on $9n/10$. Models trained on more data tend to make more accurate predictions, and the resulting CV estimates are more stable.


