knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

1. Introduction

This package was created as a class project for STAT 302 at the University of Washington. It contains four functions: my_t_test, my_lm, my_knn_cv, and my_rf_cv. The package is a collaboration between Celeste Zeng and Simona Liao.

Install the package using:

devtools::install_github("SimonaLiao/Stat302Package", build_vignettes = TRUE, build_opts = c())
library(Stat302Package)
# Use this to view the vignette in the Demo HTML help
help(package = "Stat302Package", help_type = "html")
# Use this to view the vignette as an isolated HTML file
utils::browseVignettes(package = "Stat302Package")

To begin, we load our example data set my_gapminder.

library(Stat302Package)
library(ggplot2)
library(magrittr)
library(tidyverse)
data("my_gapminder")

2. Tutorial for my_t_test

All the tests demonstrated below use lifeExp data from my_gapminder.

Case I: alternative = "two.sided"

\begin{align}
H_0: \mu &= 60,\\
H_a: \mu &\neq 60.
\end{align}

my_t_test(my_gapminder$lifeExp, mu = 60, alternative = "two.sided")

The test returns a p-value greater than 0.05, so the result is not statistically significant: we do not have sufficient evidence to reject the null hypothesis H_0.
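
As a quick sanity check (assuming my_t_test mirrors the standard one-sample t-test), we can compare its p-value with the one from stats::t.test.

# Comparison point: the base R one-sample t-test for the same hypothesis
t.test(my_gapminder$lifeExp, mu = 60, alternative = "two.sided")$p.value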

Case II: alternative = "less"

\begin{align}
H_0: \mu &= 60,\\
H_a: \mu &< 60.
\end{align}

my_t_test(my_gapminder$lifeExp, mu = 60, alternative = "less")

The test returns a p-value less than 0.05, so the result is statistically significant: we have sufficient evidence to reject the null hypothesis H_0.

Case III: alternative = "greater"

\begin{align}
H_0: \mu &= 60,\\
H_a: \mu &> 60.
\end{align}

my_t_test(my_gapminder$lifeExp, mu = 60, alternative = "greater")

The test returns a p-value greater than 0.05, so the result is not statistically significant: we do not have sufficient evidence to reject the null hypothesis H_0.

3. Tutorial for my_lm

# Fit a linear model of lifeExp on gdpPercap and continent
test <- my_lm(my_fml = lifeExp ~ gdpPercap + continent, my_data = my_gapminder)
# Extract the gdpPercap coefficient estimate and its p-value (row 2 of the table)
estimate <- test[2, 1]
p_val <- test[2, 4]
test

From the table above, the expected difference in the response lifeExp between two observations that differ by one unit in gdpPercap, with all other covariates identical, is `r estimate`. Compared with the coefficients of the continent indicators, a one-unit difference in gdpPercap has far less influence on lifeExp than a difference in continent.
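
Because gdpPercap is measured in dollars while the continent terms are 0/1 indicators, one rough way to put the slope on a comparable scale is to multiply it by the spread of gdpPercap (a quick sketch using the estimate extracted above).

# Expected change in lifeExp for a one-standard-deviation change in gdpPercap,
# which makes the slope roughly comparable to the 0/1 continent coefficients
estimate * sd(my_gapminder$gdpPercap)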

Next, we run a two-sided hypothesis test on the gdpPercap coefficient. The null hypothesis is that gdpPercap has no linear association with lifeExp, i.e. its coefficient is zero.

\begin{align}
H_0: \beta &= 0,\\
H_a: \beta &\neq 0.
\end{align}

The p-value for this hypothesis test is `r p_val`, which is smaller than 0.05. The result is statistically significant, so we have sufficient evidence to reject the null hypothesis: gdpPercap is a significant predictor of lifeExp.
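
As a hedged check, the same p-value can be recomputed by hand from the coefficient table, assuming its columns follow the summary(lm) convention (Estimate, Std. Error, t value, Pr(>|t|)).

# Recompute the two-sided p-value for the gdpPercap coefficient (row 2)
t_stat <- test[2, 1] / test[2, 2]
df <- nrow(my_gapminder) - nrow(test)
2 * pt(abs(t_stat), df = df, lower.tail = FALSE)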

Now, we are going to plot the Actual vs. Fitted values using ggplot2.

# Compute fitted values from the estimated coefficients
my_coef <- test[, 1]
my_matrix <- model.matrix(lifeExp ~ gdpPercap + continent, data = my_gapminder)
y_hat <- my_matrix %*% as.matrix(my_coef)
my_df <- data.frame("actual" = my_gapminder$lifeExp,
                    "fitted" = y_hat,
                    "continent" = my_gapminder$continent)
# Plot actual vs. fitted values, colored by continent
ggplot(my_df, aes(x = actual, y = fitted, color = continent)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, col = "black") +
  theme_bw(base_size = 15) +
  labs(x = "Actual values", y = "Fitted values", title = "Actual vs. Fitted") +
  theme(plot.title = element_text(hjust = 0.5))

The plot of actual versus fitted values shows that the model predicts life expectancy poorly for many African countries, but does a good job for European countries.
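
One way to quantify this impression (a quick sketch built on the my_df data frame from above) is to compare the average absolute prediction error by continent.

# Mean absolute difference between actual and fitted values, by continent
my_df %>%
  dplyr::group_by(continent) %>%
  dplyr::summarize(mean_abs_error = mean(abs(actual - fitted)))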

4. Tutorial for my_knn_cv

Using the my_gapminder data set, we predict the output class continent using covariates gdpPercap and lifeExp, with 5-fold cross-validation (k_cv = 5).

# For each value of k_nn, record the training misclassification rate
# and the CV misclassification rate
knn_vector <- c(1:10)
tu_knn <- matrix(NA, nrow = 10, ncol = 3)
for (i in 1:10) {
  tu_knn[i, 1] <- knn_vector[i]
  # run my_knn_cv once per k_nn and reuse the result
  knn_result <- my_knn_cv(train = my_gapminder[, 3:4],
                          cl = my_gapminder$continent,
                          k_nn = knn_vector[i], k_cv = 5)
  # CV misclassification rate
  tu_knn[i, 3] <- knn_result$cv_err
  # training misclassification rate
  prediction <- knn_result$class
  tu_knn[i, 2] <- sum(prediction != my_gapminder$continent) / length(my_gapminder$continent)
}
tu_knn <- data.frame("k_nn" = tu_knn[, 1],
                     "training_misclassification_rate" = tu_knn[, 2],
                     "CV_misclassification_rate" = tu_knn[, 3])
tu_knn

Based on the training misclassification rates alone, I would choose k_nn = 1; based on the CV misclassification rates, I would choose k_nn = 10. In practice, I use cross-validation to pick the tuning parameter and therefore choose k_nn = 10, because CV splits the data into k folds so that as much data as possible is used to train the model while still keeping out-of-sample data for testing.
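
The same choice can be read off the table programmatically.

# Pick the k_nn with the smallest CV misclassification rate
tu_knn$k_nn[which.min(tu_knn$CV_misclassification_rate)]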

CV is a great tool to choose values for tuning parameters across a number of algorithms, not just k-nearest neighbors.
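
To make that idea concrete, here is a minimal, generic sketch of k-fold cross-validation for a classification error rate. The helper k_fold_cv_error is hypothetical and only for illustration; it is not the package's my_knn_cv.

# Generic k-fold CV sketch (hypothetical helper, not my_knn_cv)
k_fold_cv_error <- function(x, y, k_cv, fit_and_predict) {
  # randomly assign each observation to one of k_cv folds
  fold <- sample(rep(1:k_cv, length.out = nrow(x)))
  errs <- numeric(k_cv)
  for (f in 1:k_cv) {
    in_test <- fold == f
    # train on the other folds, predict the held-out fold
    pred <- fit_and_predict(x[!in_test, , drop = FALSE], y[!in_test],
                            x[in_test, , drop = FALSE])
    errs[f] <- mean(pred != y[in_test])
  }
  mean(errs)
}

# Example: the same tuning idea with class::knn and k_nn = 5
k_fold_cv_error(my_gapminder[, 3:4], my_gapminder$continent, k_cv = 5,
                function(x_train, y_train, x_test) class::knn(x_train, x_test, y_train, k = 5))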

5. Tutorial for my_rf_cv

We predict lifeExp using the covariate gdpPercap with my_rf_cv, running 30 simulations for each number of folds k in (2, 5, 10).

# 30 simulations of my_rf_cv for each value of k: 90 rows of (k, MSE)
k_vector <- c(2, 5, 10)
tu_rf <- matrix(NA, nrow = 90, ncol = 2)
for (i in 1:length(k_vector)) {
  for (j in 1:30) {
    tu_rf[(i - 1) * 30 + j, 1] <- k_vector[i]
    tu_rf[(i - 1) * 30 + j, 2] <- my_rf_cv(k_vector[i])
  }
}

We display the results with boxplots: one boxplot for each k-value (2, 5, 10), each summarizing 30 simulations.

tu <- data.frame("K_value" = tu_rf[, 1], "MSE" = tu_rf[, 2])
# Boxplot of the 30 CV-estimated MSEs for each k
tu_graph <- ggplot(data = tu,
                   aes(x = K_value, y = MSE, group = K_value)) +
  geom_boxplot(fill = "lightblue") +
  theme_bw(base_size = 16) +
  labs(title = "CV estimated MSE for different k-values",
       x = "k-values",
       y = "Estimated MSE") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  scale_x_continuous(breaks = c(2, 5, 10))
tu_graph
# Use a table to display the average CV estimate and the standard deviation
# of the CV estimates across k
rf_vector <- c(mean(tu_rf[1:30, 2]), mean(tu_rf[31:60, 2]), mean(tu_rf[61:90, 2]),
               sd(tu_rf[1:30, 2]), sd(tu_rf[31:60, 2]), sd(tu_rf[61:90, 2]))
rf_matrix <- matrix(rf_vector, nrow = 3, ncol = 2)
rf_table <- data.frame("k_values" = c(2, 5, 10),
                       "mean" = rf_matrix[, 1], "sd" = rf_matrix[, 2])
rf_table

Observations: as k increases, the mean of the CV estimates increases, while the standard deviation and the range of the CV estimates decrease.

According to the bias-variance tradeoff, a larger k means less bias toward overestimating the true expected error (because each training fold is closer to the full data set), but higher variance and longer running time (because we approach the limiting case of leave-one-out CV). Although rf_table shows a decreasing standard deviation, this is because the variance of cross-validation is more U-shaped than linearly increasing in k: it decreases at first because we average over more folds, and increases later because, as k approaches n, the folds become highly correlated. If we added a k-value of 100, the standard deviation would be larger.
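
That last claim can be probed directly (a hedged sketch: it simply reuses my_rf_cv with k = 100 and may take a while to run).

# Repeat the experiment with k = 100 folds to see whether the spread grows
mse_k100 <- replicate(30, my_rf_cv(100))
c(mean = mean(mse_k100), sd = sd(mse_k100))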


