knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(STAT302package)
library(ggplot2)
library(magrittr)
library(tidyverse)

1. Introduction

This package was created as a class project for STAT 302 at the University of Washington (UW). It contains four functions: `my_t_test`, `my_lm`, `my_knn_cv`, and `my_rf_cv`. The package is a collaboration between Celeste Zeng and Simona Liao.

Install the package using:

devtools::install_github("celestezeng33/STAT302package", build_vignettes = TRUE, build_opts = c())
library(STAT302package)
# Use this to view the vignette in the Demo HTML help
help(package = "STAT302package", help_type = "html")
# Use this to view the vignette as an isolated HTML file
utils::browseVignettes(package = "STAT302package")

2. Tutorial for my_t_test

All the tests demonstrated below use lifeExp data from my_gapminder.

Case I: alternative = "two.sided"

\begin{align} H_0: \mu &= 60,\\ H_a: \mu &\neq 60. \end{align}

# To begin, we load our example data set `my_gapminder`
data("my_gapminder")
my_t_test(my_gapminder$lifeExp, mu = 60, alternative = "two.sided")

The p-value is greater than 0.05, so the result is not statistically significant at the 0.05 level and we do not have sufficient evidence to reject the null hypothesis H_0.

Case II: alternative = "less"

\begin{align} H_0: \mu &= 60,\\ H_a: \mu &< 60. \end{align}

my_t_test(my_gapminder$lifeExp, mu = 60, alternative = "less")

The p-value is less than 0.05, so the result is statistically significant at the 0.05 level and we have sufficient evidence to reject the null hypothesis H_0 in favor of the alternative that the true mean is less than 60.

Case III: alternative = "greater"

\begin{align} H_0: \mu &= 60,\\ H_a: \mu &> 60. \end{align}

my_t_test(my_gapminder$lifeExp, mu = 60, alternative = "greater")

The p-value is greater than 0.05, so the result is not statistically significant at the 0.05 level and we do not have sufficient evidence to reject the null hypothesis H_0.
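As a sanity check, the same hypotheses can be run with base R's t.test. Assuming my_t_test follows the same conventions as stats::t.test (an assumption, not something the package guarantees), the p-values should agree closely:

# Compare against base R's t.test (assumes my_t_test mirrors its conventions)
t.test(my_gapminder$lifeExp, mu = 60, alternative = "two.sided")$p.value
t.test(my_gapminder$lifeExp, mu = 60, alternative = "less")$p.value
t.test(my_gapminder$lifeExp, mu = 60, alternative = "greater")$p.value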

3. Tutorial for my_lm

# Fit a linear model of lifeExp on gdpPercap and continent
test <- my_lm(my_fml = lifeExp ~ gdpPercap + continent, my_data = my_gapminder)
# Extract the gdpPercap coefficient estimate and its p-value
estimate <- test[2, 1]
p_val <- test[2, 4]
test

From the table above, the expected difference in the response lifeExp between two observations that differ by one unit in gdpPercap, with all other covariates identical, is `r estimate`. Compared to the coefficients of the continent indicators, differences in gdpPercap have less influence on lifeExp than differences in continent.
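As a quick sanity check, the same model can be fit with base R's lm(); this sketch assumes my_lm returns a coefficient table comparable to summary(lm())$coefficients:

# Fit the same model with base R and compare the coefficient table
base_fit <- lm(lifeExp ~ gdpPercap + continent, data = my_gapminder)
summary(base_fit)$coefficients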

Now we will conduct a two-sided hypothesis test on the gdpPercap coefficient. The null hypothesis is that gdpPercap has no linear association with lifeExp, i.e. its coefficient is zero.

\begin{align} H_0: \beta &= 0,\\ H_a: \beta &\neq 0. \end{align}

The p-value for this hypothesis test is `r p_val`, which is smaller than 0.05. The result is statistically significant and we have sufficient evidence to reject the null hypothesis: gdpPercap is a significant predictor of lifeExp.
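For reference, this p-value can be recomputed from the coefficient's t statistic. The sketch below assumes test[2, 2] holds the standard error of the gdpPercap coefficient, as in a summary(lm()) coefficient table:

# Hypothetical reconstruction of the two-sided p-value from the t statistic
se <- test[2, 2]                       # assumed: standard error column
t_stat <- estimate / se
df <- nrow(my_gapminder) - nrow(test)  # residual degrees of freedom
2 * pt(abs(t_stat), df = df, lower.tail = FALSE)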

# Compute fitted values from the estimated coefficients
my_coef <- test[, 1]
my_matrix <- model.matrix(lifeExp ~ gdpPercap + continent, data = my_gapminder)
y_hat <- my_matrix %*% as.matrix(my_coef)
my_df <- data.frame("actual" = my_gapminder$lifeExp, "fitted" = y_hat, "continent" = my_gapminder$continent)
# Plot actual vs. fitted values, colored by continent
ggplot(my_df, aes(x = actual, y = fitted, color = continent)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, col = "black") +
  theme_bw(base_size = 15) +
  labs(x = "Actual values", y = "Fitted values", title = "Actual vs. Fitted") +
  theme(plot.title = element_text(hjust = 0.5))

The plot of actual versus fitted values shows that the model predicts life expectancy poorly for African countries, but it does a good job for European countries.
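To quantify this impression, we can compute the mean squared prediction error by continent (a quick sketch using dplyr from the tidyverse loaded above and the my_df built in the previous chunk):

# Mean squared error of fitted vs. actual life expectancy, by continent
my_df %>%
  dplyr::group_by(continent) %>%
  dplyr::summarize(mse = mean((actual - fitted)^2))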

4. Tutorial for my_knn_cv

Using the my_gapminder dataset, we predict the output class continent from the covariates gdpPercap and lifeExp with 5-fold cross-validation (k_cv = 5), varying the number of neighbors k_nn from 1 to 10.
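A single call looks like the following; the returned list contains cv_err (the CV misclassification rate) and class (the predicted classes), which the loop below uses:

# One 5-fold CV run with 5 nearest neighbors
single_run <- my_knn_cv(train = my_gapminder[, 3 : 4],
                        cl = my_gapminder$continent,
                        k_nn = 5, k_cv = 5)
single_run$cv_err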

# For each value of k_nn, record the training misclassification rate
# and the CV misclassification rate
knn_vector <- c(1 : 10)
tu_knn <- matrix(NA, nrow = 10, ncol = 3)
for (i in 1 : 10) {
  tu_knn[i, 1] <- knn_vector[i]
  # run 5-fold CV once for this k_nn and reuse the output
  knn_out <- my_knn_cv(train = my_gapminder[, 3 : 4],
                       cl = my_gapminder$continent,
                       k_nn = knn_vector[i], k_cv = 5)
  # CV misclassification rate
  tu_knn[i, 3] <- knn_out$cv_err
  # training misclassification rate
  tu_knn[i, 2] <- sum(knn_out$class != my_gapminder$continent) / length(my_gapminder$continent)
}
tu_knn <- data.frame("k_nn" = tu_knn[, 1], "training_misclassification_rate" = tu_knn[, 2],
                     "CV_misclassification_rate" = tu_knn[, 3])
tu_knn

Based on the training misclassification rates alone, I would choose k_nn = 1; based on the CV misclassification rates, I would choose k_nn = 10. In practice I would use cross-validation to pick the optimal k_nn (here, 10), because CV splits the data into k folds so that as much data as possible is used to train the model while still keeping out-of-sample test data. CV is a great tool for choosing tuning parameters across many algorithms, not just k-nearest neighbors.
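To make the mechanics concrete, here is a minimal sketch of k-fold CV for a misclassification rate, written with class::knn as the learner (an illustration under those assumptions, not the package's implementation):

# Minimal 5-fold CV sketch with class::knn (illustration; not my_knn_cv itself)
library(class)
set.seed(302)
k_cv <- 5
fold <- sample(rep(1:k_cv, length.out = nrow(my_gapminder)))
errs <- numeric(k_cv)
for (f in 1:k_cv) {
  pred <- knn(train = my_gapminder[fold != f, 3:4],
              test = my_gapminder[fold == f, 3:4],
              cl = my_gapminder$continent[fold != f], k = 5)
  errs[f] <- mean(pred != my_gapminder$continent[fold == f])
}
mean(errs)  # CV misclassification rate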

5. Tutorial for my_rf_cv

Using the my_gapminder dataset, we predict lifeExp from the covariate gdpPercap and record the CV-estimated MSE for k = 2, 5, and 10 folds, with 30 simulations each.

# Run 30 simulations of my_rf_cv for each number of folds k in (2, 5, 10)
k_vector <- c(2, 5, 10)
tu_rf <- matrix(NA, nrow = 90, ncol = 2)
for (i in 1 : length(k_vector)) {
  for (j in 1 : 30) {
    tu_rf[(i - 1) * 30 + j, 1] <- k_vector[i]
    tu_rf[(i - 1) * 30 + j, 2] <- my_rf_cv(k_vector[i])
  }
}
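For reference, here is a rough sketch of what one such CV-MSE computation might look like, assuming the randomForest package (my_rf_cv's actual implementation may differ):

# Sketch of a k-fold CV MSE for a random forest (illustration; not my_rf_cv itself)
library(randomForest)
rf_cv_sketch <- function(k) {
  fold <- sample(rep(1:k, length.out = nrow(my_gapminder)))
  mse <- numeric(k)
  for (f in 1:k) {
    fit <- randomForest(lifeExp ~ gdpPercap,
                        data = my_gapminder[fold != f, ], ntree = 100)
    pred <- predict(fit, my_gapminder[fold == f, ])
    mse[f] <- mean((pred - my_gapminder$lifeExp[fold == f])^2)
  }
  mean(mse)
}
rf_cv_sketch(5)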

We use boxplots to display the results: each boxplot corresponds to one value of k (2, 5, or 10) and summarizes its 30 simulations.

tu <- data.frame("K_value" = tu_rf[, 1], "MSE" = tu_rf[, 2])
# Boxplot of the CV-estimated MSE for each k
tu_graph <- ggplot(data = tu,
                   aes(x = K_value, y = MSE, group = K_value)) +
  geom_boxplot(fill = "lightblue") +
  theme_bw(base_size = 16) +
  labs(title = "CV estimated MSE for different k-values",
       x = "k-values",
       y = "Estimated MSE") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  scale_x_continuous(breaks = c(2, 5, 10))
tu_graph
# Use a table to display the average CV estimate and the standard deviation
# of the CV estimates across k
rf_vector <- c(mean(tu_rf[1 : 30, 2]), mean(tu_rf[31 : 60, 2]), mean(tu_rf[61 : 90, 2]),
               sd(tu_rf[1 : 30, 2]), sd(tu_rf[31 : 60, 2]), sd(tu_rf[61 : 90, 2]))
rf_matrix <- matrix(rf_vector, nrow = 3, ncol = 2)
rf_table <- data.frame("k_values" = c(2, 5, 10), "mean" = rf_matrix[, 1], "sd" = rf_matrix[, 2])
rf_table

Results: as k gets larger, the mean of the CV estimates increases while their standard deviation and range decrease. According to the bias-variance tradeoff, a larger k means less bias toward overestimating the true expected error (the training folds are closer to the full dataset) but usually higher variance and a longer running time, since we approach the limiting case of leave-one-out CV. Although rf_table shows a decreasing standard deviation, this is because the variance of cross-validation is U-shaped in k rather than monotonically increasing: it decreases at first because we average over more folds, and increases later because, as k approaches n, the folds become highly correlated. If we added a k-value of 100, the standard deviation would likely be larger.
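In symbols, the k-fold CV estimate is simply the average of the per-fold errors,

\begin{align} CV_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \text{MSE}_i, \end{align}

so averaging over more folds initially reduces its variance, until the growing overlap between training folds makes the per-fold errors highly correlated.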
